# Datasets Overview

## LBCC

### Directory Structure
Within a dataset, at the final stable/post-processing stage, the structure should be:

```
LBCC
-- dataset
  -- README.md
  -- data_dump
  -- code
    -- README
  -- BIDS
  -- rawdata
  -- derivatives
    -- README
```
### Subdirectories

#### code

Any code used to download or organize this data set belongs in the code folder.

#### data_dump

All files downloaded from the dataset source.

#### rawdata

Not always present; used when file conversion is necessary, i.e. when the contents of a data dump are compressed file archives, dicoms, or some other file type that requires conversion to nifti.

#### BIDS

BIDSified data and participants file + dictionary.

#### derivatives

Output of pre-processing (Synthseg) and any post-processing pipelines.
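A layout like this can be sanity-checked programmatically during peer review. The sketch below is a hypothetical helper, not part of the official tooling, and the dataset path is a made-up example:

```python
from pathlib import Path

# Expected top-level contents of a stable LBCC dataset directory.
REQUIRED = ["README.md", "data_dump", "code", "BIDS", "derivatives"]
# rawdata is optional: it only appears when the data dump needed
# conversion (archives, dicoms, etc.) before BIDSification.
OPTIONAL = ["rawdata"]

def missing_entries(dataset_dir: str) -> list[str]:
    """Return the required entries absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in REQUIRED if not (root / name).exists()]

# Hypothetical dataset path for illustration:
print(missing_entries("/mnt/isilon/bgdlab_processing/LBCC/miniature_spork"))
```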
### Documentation

#### Respublica

In the `README.md` file in the dataset directory on Respublica, the downloader must include:

- Information on the download (source, download date, who downloaded the data)
- Data inspection (tier, inspection date)
- Notes describing the contents of the data and any other important information
- Summary demographics, including diagnoses (if applicable)
- Organization process (who organized it, organization date)

The template README can be found here: `/mnt/isilon/bgdlab_processing/LBCC/template_not_git/README_dataset.md`
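For a new dataset, the template can simply be copied into place and then filled in. A minimal sketch, assuming the destination dataset path (the path argument below is hypothetical):

```python
import shutil
from pathlib import Path

TEMPLATE = "/mnt/isilon/bgdlab_processing/LBCC/template_not_git/README_dataset.md"

def start_readme(dataset_dir: str) -> Path:
    """Copy the template README into a dataset directory for editing."""
    dest = Path(dataset_dir) / "README.md"
    if dest.exists():
        raise FileExistsError(f"{dest} already exists; edit it in place.")
    shutil.copy(TEMPLATE, dest)
    return dest

# Hypothetical example:
# start_readme("/mnt/isilon/bgdlab_processing/LBCC/miniature_spork")
```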
##### Example

For example, the `README.md` for the hypothetical Miniature Spork data set would read as follows:

```
Dataset: Miniature Spork

**Data Source**
Data Source: theminiaturesporkproject.org
Downloaded: Dabriel Zimmerman
Download Date: 08/12/2024

**Inspection**
Data Tier: Tier A
Inspector: Dabriel Zimmerman
Inspection Date: 08/12/2024
Notes: Dataset downloaded for test purposes. Contains dummy imaging data.

**Summary Demographics**
N Total = 5  | Mean Age (adjusted age in days) = 9131
N Male = 3   | Min Age (adjusted age in days) = 8766
N Female = 2 | Max Age (adjusted age in days) = 9496

**Organization Process**
Organizer: Dabriel Zimmerman
Organization Date: 08/12/2024
```
A table summarizing the data present is kept in the `BIDS` directory along with a dictionary:

- participants.tsv
- participants.json
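The summary demographics in the README can be derived directly from these two files. A hedged sketch, assuming pandas is available and that the columns are named `sex` and `age_days` (check participants.json for the actual column names in a given dataset):

```python
import json
import pandas as pd

# Both files live in the dataset's BIDS directory.
participants = pd.read_csv("BIDS/participants.tsv", sep="\t")
with open("BIDS/participants.json") as f:
    dictionary = json.load(f)  # column descriptions, units, levels

# Column names below ('sex', 'age_days') are assumptions; consult
# the dictionary for what a given dataset actually calls them.
print(f"N Total = {len(participants)}")
print(participants["sex"].value_counts().to_string())
print(f"Mean Age (days) = {participants['age_days'].mean():.0f}")
print(f"Min Age (days) = {participants['age_days'].min()}")
print(f"Max Age (days) = {participants['age_days'].max()}")
```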
#### Google Docs

In the Lifespan spreadsheet, each dataset is given its own entry with summary demographics and other relevant information after the dataset has been peer reviewed. The individual who downloaded the data set is responsible for adding this information.
#### ClickUp

This list tracks the status of LBCC datasets:

- Requested: Dataset requested and added to the queue
- Download: Requested data downloaded from source onto Respublica
- Organize: Downloaded dataset is organized and BIDS compliant
- Peer Review: Organized dataset is peer reviewed to ensure the dataset is up to code and documentation is complete
- BABS Synthseg: Reviewed dataset is processed using BABS Synthseg and phenotypes are extracted
- Supplemental Processing: Reviewed dataset has been processed with Synthseg and is now undergoing additional processing with custom pipelines (e.g., AFNI CT, eaCSF)
- Dropped: Dataset is dropped at any point of the organizing/pre-processing process

Additionally, data tier, assignee, and requestor are tracked in the list.

##### Example entry

A Tier A dataset that is completely curated and fully processed.
## CHOP Imaging Datasets

There are currently 3 main sources of data from CHOP:

- SLIP - scans requested from radiology with limited imaging pathology
- Clinical Imaging Genetics - ??
- Nf1 - ??

SLIP data is contained in the `SLIP` directory. Clinical imaging genetics data is contained in `misc_clinical` as `YYYY_MM_genetics_patient_imaging`. Nf1 data is contained ??? (seems to be both in `misc_clinical` and `nf1`). Non-SLIP CHOP data follows the same conventions/structure except where specified.
### Directory Structure

SLIP requests are delivered into the `dicoms` subdirectory with a manifest `YYYY_MM_requested_sessions_with_metadata.csv` listing all relevant metadata needed for organization. The dataorg-arcus GitHub repo should be cloned here.

### Subdirectories

```
SLIP
-- dataset
  -- BIDS
  -- dataorg-arcus
  -- derivatives
  -- dicoms
  -- sourcedata
  -- rawdata-tmp
  -- YYYY_MM_requested_sessions_with_metadata.csv
```
#### dataorg-arcus

Cloned dataorg-arcus repo used for organization of the data.

#### dicoms

Delivered dicoms from Arcus.
##### With GCP garbage string

```
SLIP
-- dataset
  -- dicoms
    -- HM10RYBAA_394518280517_6061
      -- GCP*
        -- *
          -- *
            -- *dcm
  -- YYYY_MM_requested_sessions_with_metadata.csv
```

##### Without GCP garbage string

```
SLIP
-- dataset
  -- dicoms
    -- HM10RYBAA_394518280517_6061
      -- *
        -- *
          -- *dcm
  -- YYYY_MM_requested_sessions_with_metadata.csv
```

In the example, the `HM10RYBAA_394518280517_6061` string consists of the anonymized subject identifier, the procedure order identifier, and the patient's age in days post birth. The first level down contains a dump of anonymized files associated with that set of information.
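When scripting over the `dicoms` tree, the three fields can be recovered by splitting the directory name on underscores. A small sketch using the example name from above (the `SlipSession` helper is hypothetical, not part of dataorg-arcus):

```python
from typing import NamedTuple

class SlipSession(NamedTuple):
    subject_id: str  # anonymized subject identifier
    order_id: str    # procedure order identifier
    age_days: int    # patient age in days post birth

def parse_session_dirname(name: str) -> SlipSession:
    """Split a dicoms session directory name into its three fields."""
    subject_id, order_id, age_days = name.split("_")
    return SlipSession(subject_id, order_id, int(age_days))

print(parse_session_dirname("HM10RYBAA_394518280517_6061"))
# SlipSession(subject_id='HM10RYBAA', order_id='394518280517', age_days=6061)
```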
#### sourcedata

Sorted dicoms.

#### rawdata-tmp

Intermediary directory for heudiconv processing.
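For orientation, a heudiconv run into `rawdata-tmp` looks roughly like the sketch below. The heuristic path and session directory are hypothetical placeholders; the real invocation is handled by the dataorg-arcus scripts.

```python
import subprocess

# Rough shape of a heudiconv call writing BIDS output into rawdata-tmp.
# The heuristic file and dicom location are placeholders; the actual
# organization logic lives in the cloned dataorg-arcus repo.
subprocess.run(
    [
        "heudiconv",
        "--files", "sourcedata/HM10RYBAA_394518280517_6061",  # sorted dicoms
        "-s", "HM10RYBAA",                    # subject label
        "-f", "dataorg-arcus/heuristic.py",   # hypothetical heuristic path
        "-c", "dcm2niix",                     # converter
        "-b",                                 # BIDS-compliant output
        "-o", "rawdata-tmp",                  # intermediary output directory
    ],
    check=True,
)
```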
#### BIDS

BIDSified data, participants file + dictionary, CuBIDS output.

#### derivatives

Output of Synthseg and any other post-processing pipelines.

#### heudiconv-fails

Not always present; appears if there are issues with heudiconv and it fails for specific sessions.
### Documentation

#### Respublica

A table summarizing the data present is kept in the `BIDS` directory along with a dictionary:

- participants.tsv
- participants.json
#### ClickUp

This list tracks the status of CHOP imaging datasets (SLIP, CIG, NF1):

- Requested: Request initiated for data to be delivered from radiology
- Download: Requested data downloaded onto Respublica
- Organize: Downloaded data organized and BIDS compliant
- Peer Review: Organized data is peer reviewed to ensure the dataset is up to code and documentation is complete
- Stable: Reviewed dataset is ready for post-processing and will not undergo changes at the `BIDS` level

Additionally, data tier, assignee, and requestor are tracked in the list.
Last updated 2024-09-04 by dabrielz

## Contributors

- @jmschabdach
- @dabrielz