Introduction
To maximize the utility of the data HTAN generates, we have defined a structured schema covering all data, the associated metadata, and their relationships. HTAN has the advantage of starting at a time when ample data modeling examples are available from other projects, including The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC), and the NCI Genomic Data Commons (GDC). HTAN follows in those footsteps and strives for compatibility with other relevant platforms, such as the Human Cell Atlas (HCA) Data Coordination Platform and the Human Biomolecular Atlas Program (HuBMAP).
HTAN Atlases
The HTAN data is generated by 12 different atlases. An atlas is a group of people from one or more institutes who study a specific cancer type. All data is associated with an atlas.
Tiered Data Organization for Clinical Data and BioSpecimens
The clinical metadata on Research Participants and BioSpecimens follows a tiered approach: the first tier contains the most common metadata, and higher tiers are more specific and depend on particular use cases. See, e.g., the organization of clinical data:
Levels for other types of data (Imaging, Sequencing)
The other types of data follow a leveled approach, similar to TCGA, where level 1 is the raw data and each higher level is further processed data. For example, for single-cell RNA-seq:
Level 1 – Raw primary data, e.g. FASTQs and unaligned BAMs
Level 2 – Aligned primary data, e.g. aligned BAMs
Level 3 – Derived biomolecular data, e.g. gene expression matrix file
Level 4 – Sample-level summary, e.g. t-SNE plot coordinates
Level 5 – Cohort-level summary, e.g. significantly mutated genes
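The level scheme listed above can be sketched as an ordered enumeration. This is only an illustration, not part of the HTAN data model itself: the names DataLevel, LEVEL_EXAMPLES and is_more_processed are hypothetical, and the example artifacts are copied from the list above.

```python
from enum import IntEnum


class DataLevel(IntEnum):
    """Processing levels for leveled data types (hypothetical naming)."""
    RAW = 1             # raw primary data
    ALIGNED = 2         # aligned primary data
    DERIVED = 3         # derived biomolecular data
    SAMPLE_SUMMARY = 4  # sample-level summary
    COHORT_SUMMARY = 5  # cohort-level summary


# Example artifacts per level, taken from the single-cell RNA-seq list above.
LEVEL_EXAMPLES = {
    DataLevel.RAW: ["FASTQ", "unaligned BAM"],
    DataLevel.ALIGNED: ["aligned BAM"],
    DataLevel.DERIVED: ["gene expression matrix"],
    DataLevel.SAMPLE_SUMMARY: ["t-SNE plot coordinates"],
    DataLevel.COHORT_SUMMARY: ["significantly mutated genes"],
}


def is_more_processed(a: DataLevel, b: DataLevel) -> bool:
    """Return True if level `a` is further processed than level `b`."""
    return a > b
```

Using an integer-backed enumeration keeps the ordering explicit, so a comparison such as is_more_processed(DataLevel.DERIVED, DataLevel.RAW) follows directly from the level numbers.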
Data Model Implementation
HTAN uses Bioschemas to define the data model. Bioschemas extends schema.org, a community effort used by many search engines that provides a way to describe information with properties. Bioschemas defines profiles over types that state which properties must be used (minimum), should be used (recommended), and could be used (optional). HTAN and other consortia, including the Human Cell Atlas and HuBMAP, are working together to provide common shared schemas. More information is available at bioschemas.org.
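The three property categories of a profile (minimum, recommended, optional) can be illustrated with a small check. This is a minimal sketch under stated assumptions, not the Bioschemas tooling or the HTAN schema: the Profile class, the check_record function, the "Biospecimen" profile and all of its property names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """A simplified stand-in for a profile over a type (hypothetical)."""
    name: str
    minimum: set = field(default_factory=set)      # must be used
    recommended: set = field(default_factory=set)  # should be used
    optional: set = field(default_factory=set)     # could be used


def check_record(profile: Profile, record: dict) -> dict:
    """Report which profile properties a metadata record lacks or adds."""
    present = set(record)
    known = profile.minimum | profile.recommended | profile.optional
    return {
        "missing_minimum": sorted(profile.minimum - present),
        "missing_recommended": sorted(profile.recommended - present),
        "unknown": sorted(present - known),
    }


# Hypothetical profile and record; the property names are illustrative only.
biospecimen = Profile(
    name="Biospecimen",
    minimum={"identifier", "atlas"},
    recommended={"collectionDate"},
    optional={"notes"},
)
report = check_record(biospecimen, {"identifier": "S1", "notes": "n/a", "color": "red"})
```

Here a record would be rejected only when missing_minimum is non-empty, while missing recommended properties could be surfaced as warnings.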