Input file: Omics data

Three types of input files (Omics data, Metadata, and Config file) are required by this system. Example files are in the ./test directory of the repository.



Bulk omics data

This tool can integrate the following types of bulk omics data: Transcriptome, Microbiome (bacterial composition), and Metabolome. The files must follow the format described below.

Basic structures

  • The first column must contain entry names (e.g., gene names)

  • The first row must contain sample names

Example

Metabolite    HSM5FZBJ        MSM5FZ9X        CSM5FZ3N
1-3-7-trimethylurate  0.0366911709150494      5.43991362683316        124.707446592178
1-methylguanine       36.70439623255  5.71735703889826        3.21072296327096
1-methylguanosine     1.11185382656899        0.738463088128547       0.264874440752025
1-methylhistamine     0.987786208017349       1.79220474268428        1.20991463790381
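
The layout above can be checked before running the tool. The following is a minimal sketch, not part of the tool itself: it assumes the file is tab-separated (the format description does not state the delimiter), and the function name read_bulk_table is hypothetical.

```python
import csv
import io


def read_bulk_table(handle):
    """Parse a bulk omics table into (sample_names, {entry_name: values})."""
    reader = csv.reader(handle, delimiter="\t")  # assumption: tab-separated
    header = next(reader)
    sample_names = header[1:]  # first row holds the sample names
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        entry_name, values = row[0], row[1:]  # first column holds the entry name
        if len(values) != len(sample_names):
            raise ValueError(
                f"{entry_name}: expected {len(sample_names)} values, got {len(values)}"
            )
        table[entry_name] = [float(v) for v in values]
    return sample_names, table


# Two rows taken from the example above, joined with tabs.
example = (
    "Metabolite\tHSM5FZBJ\tMSM5FZ9X\tCSM5FZ3N\n"
    "1-methylguanine\t36.70439623255\t5.71735703889826\t3.21072296327096\n"
    "1-methylhistamine\t0.987786208017349\t1.79220474268428\t1.20991463790381\n"
)
sample_names, table = read_bulk_table(io.StringIO(example))
print(sample_names)
print(sorted(table))
```

A row whose number of values differs from the number of sample names raises ValueError, which catches the most common layout mistakes early.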

Option: rownames

The rownames option in the config file defines which identifiers are used as row names of the bulk omics datasets. The following identifiers are allowed. These identifiers are used as entity names in the knowledge graph.

  • Transcriptome: Gene symbol, Ensembl Gene ID (e.g., ENSG00000001)

  • Microbiome: Organism name, QIIME2-style (e.g., g__Escherichia)

  • Metabolome: Metabolite name, HMDB ID

Option: unit

The unit option in the config file defines the unit of the values. Values must be normalized into one of the following units.

  • Transcriptome: TPM, FPKM

  • Microbiome: Composition (sum of values per sample should be 1.0)

  • Metabolome: PPM
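
The composition requirement for microbiome data can be verified per sample. This is a minimal sketch under stated assumptions: the tolerance of 1e-6 is an arbitrary choice (the tool's own tolerance, if any, is not documented here), and the sample names S1 and S2, the entry g__Bacteroides, and the function name check_composition are hypothetical.

```python
import math


def check_composition(sample_names, table, tol=1e-6):
    """Return {sample: column_sum} for samples whose values do not sum to 1.0."""
    off_target = {}
    for j, sample in enumerate(sample_names):
        total = math.fsum(values[j] for values in table.values())
        if not math.isclose(total, 1.0, abs_tol=tol):  # tol is an assumption
            off_target[sample] = total
    return off_target


# Hypothetical microbiome table: one list of values per organism,
# ordered like sample_names.
sample_names = ["S1", "S2"]
table = {
    "g__Escherichia": [0.25, 0.5],
    "g__Bacteroides": [0.75, 0.25],
}
off_target = check_composition(sample_names, table)
print(off_target)  # S1 sums to 1.0; S2 sums to 0.75 and is reported
```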



scRNA-seq data

The following files are required to integrate single-cell RNA-seq datasets.

  • scRNA-seq expression data: Manifest file or h5ad file

  • Cell-level metadata

Expression data

Manifest file

The manifest file lists the outs directories generated by Cellranger, as explained here. An example is in ./test/scrna-manifest.txt. The manifest file must be a tab-separated file with two columns.

  • Column 1: DataID (ID of the data)

  • Column 2: Path to Cellranger’s outs directories

Example

DataID        File
HC1_Cecum_CD45-_Baseline      /path/to/HC1_Cecum_CD45-_Baseline/outs
HC1_Cecum_CD45+_Baseline      /path/to/HC1_Cecum_CD45+_Baseline/outs
HC1_Cecum_Epi+_Baseline       /path/to/HC1_Cecum_Epi+_Baseline/outs
HC1_Sigma_CD45-_Baseline      /path/to/HC1_Sigma_CD45-_Baseline/outs
HC1_Sigma_CD45+_Baseline      /path/to/HC1_Sigma_CD45+_Baseline/outs
HC1_Sigma_Epi+_Baseline       /path/to/HC1_Sigma_Epi+_Baseline/outs
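
A manifest in this format can be read with the standard library. The following is a minimal sketch: the header names DataID and File are taken from the example above, while the uniqueness check on DataID is an assumption (the description does not state it), and read_manifest is a hypothetical helper name.

```python
import csv
import io


def read_manifest(handle):
    """Return {DataID: path to the outs directory} from a two-column manifest."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != ["DataID", "File"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    manifest = {}
    for row in reader:
        data_id = row["DataID"]
        if data_id in manifest:  # assumption: each DataID appears once
            raise ValueError(f"duplicated DataID: {data_id}")
        manifest[data_id] = row["File"]
    return manifest


# First two rows of the example above, joined with tabs.
example = (
    "DataID\tFile\n"
    "HC1_Cecum_CD45-_Baseline\t/path/to/HC1_Cecum_CD45-_Baseline/outs\n"
    "HC1_Cecum_CD45+_Baseline\t/path/to/HC1_Cecum_CD45+_Baseline/outs\n"
)
manifest = read_manifest(io.StringIO(example))
for data_id, path in manifest.items():
    print(data_id, path)
```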

h5ad file

Expression data can also be provided in h5ad format. The observation names (adata.obs_names) must match the Barcode column of the cell-level metadata, as explained in the following sections.
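
The agreement between adata.obs_names and the Barcode column can be checked with plain set operations. This is a minimal sketch: compare_barcodes is a hypothetical helper, the third barcode is made up for illustration, and whether the tool requires an exact match or only a subset is an assumption to verify against the tool's behavior. The commented lines show how obs_names could be obtained with the anndata package; the file path there is hypothetical.

```python
def compare_barcodes(obs_names, metadata_barcodes):
    """Return (names only in the h5ad file, barcodes only in the metadata)."""
    obs = set(obs_names)
    meta = set(metadata_barcodes)
    return sorted(obs - meta), sorted(meta - obs)


# With anndata installed, obs_names could be loaded like this:
#   import anndata
#   adata = anndata.read_h5ad("expression.h5ad")  # hypothetical path
#   obs_names = list(adata.obs_names)
obs_names = [
    "AACCATGCACGTCTCT-1",
    "ACACCAAGTGCCTGTG-1",
    "GGGGGGGGGGGGGGGG-1",  # made-up barcode absent from the metadata
]
metadata_barcodes = ["AACCATGCACGTCTCT-1", "ACACCAAGTGCCTGTG-1"]

only_in_h5ad, only_in_metadata = compare_barcodes(obs_names, metadata_barcodes)
print("only in h5ad:", only_in_h5ad)
print("only in metadata:", only_in_metadata)
```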


Cell-level metadata

This is explained in the Cell-level metadata section below.



Metadata

Multi-omics studies are diverse: experimental designs and data types differ among studies.

Multi-omics data have several characteristics that complicate their integration. For example, (i) sample numbers often differ among data types, (ii) sampling sites sometimes differ among data types (e.g., microbiome and metabolome data are sometimes derived from stool samples), and (iii) single-cell data have an additional dimension, “cell”, which does not exist in bulk omics data.

To deal flexibly with this diversity, this tool requires metadata at four levels (Patient-level, Sample-level, Cell-level, and Samplemap). Create the metadata files as follows.

Patient-level metadata

Required columns

  • PatientID: ID of patients. Duplication is not allowed in this file.

Optional columns

  • Column names must follow the pattern NAME[category] or NAME[numeric].
    For example, to include age in this file, name the column Age[numeric]; its values will then be treated as numeric.

Example

PatientID       Age[numeric]    Sex[category]   Disease[category]
C3001   43      Female  CD
C3002   76      Female  CD
C3003   43      Female  UC
C3004   47      Female  UC
C3005   76      Female  UC
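
The column naming rule can be illustrated with a small parser. This is a minimal sketch, not the tool's own implementation: the regular expression, the helper names parse_column and convert, and the choice to store numeric values as float are all assumptions.

```python
import re

# NAME[category] or NAME[numeric]
COLUMN_PATTERN = re.compile(r"^(?P<name>.+)\[(?P<kind>category|numeric)\]$")


def parse_column(header):
    """Split 'Age[numeric]' into ('Age', 'numeric'); reject other forms."""
    match = COLUMN_PATTERN.match(header)
    if match is None:
        raise ValueError(f"column name must look like NAME[category|numeric]: {header}")
    return match.group("name"), match.group("kind")


def convert(value, kind):
    """Convert numeric columns to float; keep category columns as strings."""
    return float(value) if kind == "numeric" else value


# Header and first row of the example above.
header = ["PatientID", "Age[numeric]", "Sex[category]", "Disease[category]"]
row = ["C3001", "43", "Female", "CD"]

record = {"PatientID": row[0]}
for column, value in zip(header[1:], row[1:]):
    name, kind = parse_column(column)
    record[name] = convert(value, kind)
print(record)
```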

Sample-level metadata

Required columns

  • SampleID: ID of samples. Duplication is not allowed in this file.

  • PatientID: ID of patients. Each ID must appear in the patient-level metadata.

Optional columns

  • Column names must follow the pattern NAME[category] or NAME[numeric].

Example

SampleID      PatientID       Tissue[category]        Time[category]
C3001CSC1_CD_Rectum_2 C3001   Rectum  2 weeks
C3001CSC2_CD_Ileum_2  C3001   Ileum   2 weeks
C3002CSC1_CD_Sigmoid  Colon_0 C3002   Colon   0 weeks
C3002CSC2_CD_Rectum_0 C3002   Rectum  0 weeks
C3002CSC3_CD_Ileum_0  C3002   Ileum   0 weeks
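
The two constraints on the required columns (unique SampleID, PatientID present in the patient-level metadata) can be checked as follows. This is a minimal sketch: check_sample_metadata is a hypothetical helper, rows are represented as dictionaries for brevity, and the third row is deliberately invalid for illustration.

```python
def check_sample_metadata(sample_rows, patient_ids):
    """Return a list of problems found in sample-level metadata rows."""
    problems = []
    seen = set()
    for row in sample_rows:
        sample_id, patient_id = row["SampleID"], row["PatientID"]
        if sample_id in seen:
            problems.append(f"duplicated SampleID: {sample_id}")
        seen.add(sample_id)
        if patient_id not in patient_ids:
            problems.append(
                f"{sample_id}: PatientID {patient_id} is not in the patient-level metadata"
            )
    return problems


patient_ids = {"C3001", "C3002"}
sample_rows = [
    {"SampleID": "C3001CSC1_CD_Rectum_2", "PatientID": "C3001"},
    {"SampleID": "C3001CSC2_CD_Ileum_2", "PatientID": "C3001"},
    # Hypothetical invalid row: repeated SampleID and unknown PatientID.
    {"SampleID": "C3001CSC2_CD_Ileum_2", "PatientID": "C3009"},
]
problems = check_sample_metadata(sample_rows, patient_ids)
for problem in problems:
    print(problem)
```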

Cell-level metadata

  • Required columns

    • Barcode: Cell barcodes (correspond to barcodes.tsv.gz in Cellranger’s output)

    • DataID: ID of data. Each ID must appear in the manifest file.

    • CellType: Manually labeled cell types (e.g., Th17, Inflammatory monocyte)

    • CellTypeGroup: Group of cell types (e.g., CD4T, B, Plasma)

  • Optional columns

    • Column names must follow the pattern NAME[category] or NAME[numeric]

Example

Barcode       DataID  CellType        CellTypeGroup
AACCATGCACGTCTCT-1    HC1_Cecum_CD45-_Baseline        CD4 Eff T
ACACCAAGTGCCTGTG-1    HC1_Cecum_CD45-_Baseline        Th1     T
ATAAGAGTCGCGATCG-1    HC1_Cecum_CD45-_Baseline        CD8 terminal effector   T
CATCCACAGGGCACTA-1    HC1_Cecum_CD45-_Baseline        CD8 Trm T
CCACCTACACCCATTC-1    HC1_Cecum_CD45-_Baseline        Trm17/IL26      T
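
The required columns and the link between DataID and the manifest file can be checked together. This is a minimal sketch: check_cell_metadata is a hypothetical helper, and the last row and its DataID HC9_Unknown are made up to show a failing case.

```python
REQUIRED_COLUMNS = ("Barcode", "DataID", "CellType", "CellTypeGroup")


def check_cell_metadata(header, rows, manifest_ids):
    """Return a list of problems found in cell-level metadata."""
    problems = [
        f"missing required column: {name}"
        for name in REQUIRED_COLUMNS
        if name not in header
    ]
    if problems:
        return problems  # cannot check DataID values without the columns
    data_id_index = header.index("DataID")
    used_ids = {row[data_id_index] for row in rows}
    for data_id in sorted(used_ids - set(manifest_ids)):
        problems.append(f"DataID not in the manifest file: {data_id}")
    return problems


header = ["Barcode", "DataID", "CellType", "CellTypeGroup"]
rows = [
    ["ACACCAAGTGCCTGTG-1", "HC1_Cecum_CD45-_Baseline", "Th1", "T"],
    ["CATCCACAGGGCACTA-1", "HC1_Cecum_CD45-_Baseline", "CD8 Trm", "T"],
    # Hypothetical row whose DataID is absent from the manifest.
    ["TTTTTTTTTTTTTTTT-1", "HC9_Unknown", "Th1", "T"],
]
manifest_ids = {"HC1_Cecum_CD45-_Baseline", "HC1_Cecum_CD45+_Baseline"}
problems = check_cell_metadata(header, rows, manifest_ids)
print(problems)
```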

Samplemap

The idea of Samplemap comes from the R package MultiAssayExperiment. This file maps column names in the bulk omics data to SampleID values in the sample-level metadata. It allows biological replicates (multiple data columns from the same origin) in the bulk omics dataset.

  • Required columns

    • Column 1 (DataID): Column names in the bulk omics data

    • Column 2 (SampleID): SampleID in the sample-level metadata

    • Column 3 (datatype): Omics data type (transcriptome, microbiome, or metabolome)

Example

DataID        SampleID        datatype
GSM3043377    C3002CSC1_CD_Sigmoid    Colon_0 transcriptome
GSM3043378    C3002CSC2_CD_Ileum_0    transcriptome
GSM3043379    C3002CSC3_CD_Rectum_0   transcriptome
GSM3043380    C3002CSC4_CD_Sigmoid    Colon_0 transcriptome
GSM3043381    C3003CSC1_UC_Ileum_1    transcriptome
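
The mapping can be pictured as a lookup from (SampleID, datatype) to one or more data columns, which is how replicates fit in. This is a minimal sketch, not the tool's implementation: build_samplemap is a hypothetical helper, and the column name GSM0000001 is made up to illustrate a replicate.

```python
from collections import defaultdict

ALLOWED_DATATYPES = {"transcriptome", "microbiome", "metabolome"}


def build_samplemap(rows, sample_ids):
    """Group data columns by (SampleID, datatype).

    rows: iterable of (DataID, SampleID, datatype) tuples.
    sample_ids: SampleID values from the sample-level metadata.
    """
    mapping = defaultdict(list)
    for data_id, sample_id, datatype in rows:
        if datatype not in ALLOWED_DATATYPES:
            raise ValueError(f"{data_id}: unknown datatype {datatype}")
        if sample_id not in sample_ids:
            raise ValueError(f"{data_id}: SampleID {sample_id} is not in the sample-level metadata")
        mapping[(sample_id, datatype)].append(data_id)
    return dict(mapping)


sample_ids = {"C3002CSC2_CD_Ileum_0", "C3003CSC1_UC_Ileum_1"}
rows = [
    ("GSM3043378", "C3002CSC2_CD_Ileum_0", "transcriptome"),
    ("GSM3043381", "C3003CSC1_UC_Ileum_1", "transcriptome"),
    # Hypothetical replicate: a second column from the same sample.
    ("GSM0000001", "C3002CSC2_CD_Ileum_0", "transcriptome"),
]
samplemap = build_samplemap(rows, sample_ids)
for key, data_ids in samplemap.items():
    print(key, data_ids)
```

Keeping a list of columns per sample, rather than a single value, is what lets several data columns point to the same SampleID.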