Input file: Omics data
This system requires three types of input files: omics data, metadata, and a config file. Example files are in the ./test directory of the repository.
Bulk omics data
This tool can integrate the following types of bulk omics data: transcriptome, microbiome (bacterial composition), and metabolome. The files must follow the format described below.
Basic structures
The first column must contain entry names (e.g., gene names).
The first row must contain sample names.
Example
Metabolite HSM5FZBJ MSM5FZ9X CSM5FZ3N
1-3-7-trimethylurate 0.0366911709150494 5.43991362683316 124.707446592178
1-methylguanine 36.70439623255 5.71735703889826 3.21072296327096
1-methylguanosine 1.11185382656899 0.738463088128547 0.264874440752025
1-methylhistamine 0.987786208017349 1.79220474268428 1.20991463790381
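The layout above can be checked with a short script. The following is a minimal sketch, not the tool's own loader: the function name `read_bulk_omics` is hypothetical, and the tab delimiter is an assumption, since this section does not state which delimiter bulk omics files use.

```python
import csv
import io


def read_bulk_omics(handle, delimiter="\t"):
    """Parse a bulk omics table into (sample_names, {entry_name: [values]}).

    Assumes a delimited text file (tab by default, which is an assumption)
    whose first row holds sample names and whose first column holds entry names.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    sample_names = header[1:]  # header[0] is the label of the entry-name column
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        entry, values = row[0], row[1:]
        if len(values) != len(sample_names):
            raise ValueError(
                f"row {entry!r} has {len(values)} values, "
                f"expected {len(sample_names)}"
            )
        table[entry] = [float(v) for v in values]
    return sample_names, table


# Two lines taken from the example above, joined with tabs.
example = (
    "Metabolite\tHSM5FZBJ\tMSM5FZ9X\tCSM5FZ3N\n"
    "1-methylguanine\t36.70439623255\t5.71735703889826\t3.21072296327096\n"
)
samples, table = read_bulk_omics(io.StringIO(example))
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the `io.StringIO` object.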
Option: rownames
The rownames option in the config file defines which identifiers are used as row names of the bulk omics datasets. The following identifiers are allowed; they are used as entity names in the knowledge graph.
Transcriptome: Gene symbol, Ensembl Gene ID (e.g., ENSG00000001)
Microbiome: Organism name, QIIME2-style (e.g., g__Escherichia)
Metabolome: Metabolite name, HMDB ID
Option: unit
The unit option in the config file defines the unit of the values. Values must be normalized to one of the following units.
Transcriptome: TPM, FPKM
Microbiome: Composition (sum of values per sample should be 1.0)
Metabolome: PPM
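The microbiome requirement (values in each sample sum to 1.0) is easy to verify before running the tool. This is a minimal sketch under stated assumptions: `check_composition` is a hypothetical helper, the tolerance of 1e-6 is an arbitrary choice to absorb floating-point rounding, and the sample names `S1` and `S2` are made up for illustration.

```python
import math


def check_composition(columns, tolerance=1e-6):
    """Return the names of samples whose values do not sum to 1.0.

    `columns` maps a sample name to the list of abundance values for that
    sample. The tolerance is an assumption, not a value taken from the tool.
    """
    return [
        name
        for name, values in columns.items()
        if not math.isclose(sum(values), 1.0, abs_tol=tolerance)
    ]


# Hypothetical data: S1 is a valid composition, S2 sums to 0.9.
columns = {"S1": [0.7, 0.2, 0.1], "S2": [0.5, 0.3, 0.1]}
not_normalized = check_composition(columns)
```

Samples reported by such a check would need to be renormalized (each value divided by the sample total) before use.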
scRNA-seq data
The following files are required to integrate single-cell RNA-seq datasets.
scRNA-seq expression data: Manifest file or h5ad file
Cell-level metadata
Expression data
Manifest file
Cell Ranger writes its results to outs directories, as described in the Cell Ranger documentation. The manifest file lists these directories and must be a tab-separated file with two columns. An example is in ./test/scrna-manifest.txt.
Column 1: DataID (ID of data)
Column 2: Path to Cell Ranger's outs directories
Example
DataID File
HC1_Cecum_CD45-_Baseline /path/to/HC1_Cecum_CD45-_Baseline/outs
HC1_Cecum_CD45+_Baseline /path/to/HC1_Cecum_CD45+_Baseline/outs
HC1_Cecum_Epi+_Baseline /path/to/HC1_Cecum_Epi+_Baseline/outs
HC1_Sigma_CD45-_Baseline /path/to/HC1_Sigma_CD45-_Baseline/outs
HC1_Sigma_CD45+_Baseline /path/to/HC1_Sigma_CD45+_Baseline/outs
HC1_Sigma_Epi+_Baseline /path/to/HC1_Sigma_Epi+_Baseline/outs
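A manifest in this format can be read with the standard library. The sketch below is illustrative only: `read_manifest` is a hypothetical name, and it assumes that the file starts with a header row as in the example and that each DataID appears only once (this section does not state the latter).

```python
import csv
import io


def read_manifest(handle):
    """Read a two-column, tab-separated manifest into {DataID: path}.

    Assumes a header row and unique DataIDs (both are assumptions).
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row ("DataID", "File")
    manifest = {}
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(row)}")
        data_id, path = row
        if data_id in manifest:
            raise ValueError(f"line {line_no}: duplicate DataID {data_id!r}")
        manifest[data_id] = path
    return manifest


# Two rows taken from the example above, joined with tabs.
example = (
    "DataID\tFile\n"
    "HC1_Cecum_CD45-_Baseline\t/path/to/HC1_Cecum_CD45-_Baseline/outs\n"
    "HC1_Cecum_CD45+_Baseline\t/path/to/HC1_Cecum_CD45+_Baseline/outs\n"
)
manifest = read_manifest(io.StringIO(example))
```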
h5ad file
Expression data can also be provided in h5ad format. The observation names (adata.obs_names) must match the Barcode column in the cell-level metadata, as explained in the following sections.
Cell-level metadata
This file is explained in the Cell-level metadata section under Metadata.
Metadata
Multi-omics studies are diverse: experimental designs and data types differ among studies.
Several characteristics of multi-omics data complicate integration. For example, (i) the number of samples often differs among data types, (ii) sampling sites sometimes differ among data types (e.g., microbiome and metabolome data are sometimes derived from stool samples), and (iii) single-cell data have an additional "cell" dimension that does not exist in bulk omics data.
To handle this diversity flexibly, this tool requires metadata at four levels (patient-level, sample-level, cell-level, and samplemap). Create the metadata files as follows.
Patient-level metadata
Required columns
PatientID: ID of patients. Duplicates are not allowed in this file.
Optional columns
Column names must follow the pattern NAME[category|numeric]. For example, to include age in this file, name the column age[numeric]; its values will then be treated as numeric.
Example
PatientID Age[numeric] Sex[category] Disease[category]
C3001 43 Female CD
C3002 76 Female CD
C3003 43 Female UC
C3004 47 Female UC
C3005 76 Female UC
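The NAME[category|numeric] convention for optional columns can be parsed with a regular expression. This is a minimal sketch of one way to do it, not the tool's actual parser; `COLUMN_PATTERN`, `parse_column`, and `convert` are hypothetical names introduced for illustration.

```python
import re

# NAME followed by a type suffix in square brackets: [category] or [numeric].
COLUMN_PATTERN = re.compile(r"^(?P<name>.+)\[(?P<kind>category|numeric)\]$")


def parse_column(column):
    """Split an optional column header into (name, kind)."""
    match = COLUMN_PATTERN.match(column)
    if match is None:
        raise ValueError(f"column {column!r} is not NAME[category|numeric]")
    return match.group("name"), match.group("kind")


def convert(value, kind):
    """Convert a cell to float for numeric columns; keep text for categories."""
    return float(value) if kind == "numeric" else value


# Headers and values taken from the example above.
age_name, age_kind = parse_column("Age[numeric]")
sex_name, sex_kind = parse_column("Sex[category]")
age_value = convert("43", age_kind)
sex_value = convert("Female", sex_kind)
```

Required columns such as PatientID carry no suffix, so a check like this would apply only to the optional columns.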
Sample-level metadata
Required columns
SampleID: ID of samples. Duplicates are not allowed in this file.
PatientID: ID of patients. The ID must be in the patient-level metadata.
Optional columns
Column names must follow the pattern NAME[category|numeric].
Example
SampleID PatientID Tissue[category] Time[category]
C3001CSC1_CD_Rectum_2 C3001 Rectum 2 weeks
C3001CSC2_CD_Ileum_2 C3001 Ileum 2 weeks
C3002CSC1_CD_Sigmoid Colon_0 C3002 Colon 0 weeks
C3002CSC2_CD_Rectum_0 C3002 Rectum 0 weeks
C3002CSC3_CD_Ileum_0 C3002 Ileum 0 weeks
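The two rules for the required columns (unique SampleID, PatientID present in the patient-level metadata) can be verified as follows. This is a sketch under stated assumptions: `check_sample_metadata` is a hypothetical helper, and the row with sample `X1` and patient `C9999` is invented to show a failing case.

```python
def check_sample_metadata(sample_rows, patient_ids):
    """Return a list of problems found in (SampleID, PatientID) pairs.

    `patient_ids` is the set of PatientID values from the patient-level file.
    """
    problems = []
    seen = set()
    for sample_id, patient_id in sample_rows:
        if sample_id in seen:
            problems.append(f"duplicate SampleID: {sample_id}")
        seen.add(sample_id)
        if patient_id not in patient_ids:
            problems.append(
                f"PatientID not in patient-level metadata: {patient_id} "
                f"(sample {sample_id})"
            )
    return problems


patient_ids = {"C3001", "C3002"}
sample_rows = [
    ("C3001CSC1_CD_Rectum_2", "C3001"),
    ("C3001CSC2_CD_Ileum_2", "C3001"),
    ("X1", "C9999"),  # hypothetical row that references an unknown patient
]
problems = check_sample_metadata(sample_rows, patient_ids)
```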
Cell-level metadata
Required columns
Barcode: Cell barcodes (corresponding to barcodes.tsv.gz in Cell Ranger's output)
DataID: ID of data. The ID must be in the manifest file.
CellType: Manually labeled cell types (e.g., Th17, Inflammatory monocyte)
CellTypeGroup: Group of cell types (e.g., CD4T, B, Plasma)
Optional columns
Column names must follow the pattern NAME[category|numeric].
Example
Barcode DataID CellType CellTypeGroup
AACCATGCACGTCTCT-1 HC1_Cecum_CD45-_Baseline CD4 Eff T
ACACCAAGTGCCTGTG-1 HC1_Cecum_CD45-_Baseline Th1 T
ATAAGAGTCGCGATCG-1 HC1_Cecum_CD45-_Baseline CD8 terminal effector T
CATCCACAGGGCACTA-1 HC1_Cecum_CD45-_Baseline CD8 Trm T
CCACCTACACCCATTC-1 HC1_Cecum_CD45-_Baseline Trm17/IL26 T
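Before loading single-cell data, it can help to confirm that the cell-level metadata has all four required columns and that every DataID appears in the manifest. The sketch below assumes the file has already been split into a header list and row lists; `check_cell_metadata` is a hypothetical helper, and the DataID `Unknown_Data` is invented for the failing case.

```python
REQUIRED_COLUMNS = ("Barcode", "DataID", "CellType", "CellTypeGroup")


def check_cell_metadata(header, rows, manifest_ids):
    """Return a list of problems found in cell-level metadata."""
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing required column: {c}" for c in missing]
    data_id_index = header.index("DataID")
    # Every DataID used by a cell must be listed in the manifest file.
    unknown = sorted({row[data_id_index] for row in rows} - set(manifest_ids))
    return [f"DataID not in manifest: {d}" for d in unknown]


header = ["Barcode", "DataID", "CellType", "CellTypeGroup"]
rows = [
    ["ACACCAAGTGCCTGTG-1", "HC1_Cecum_CD45-_Baseline", "Th1", "T"],
    ["CATCCACAGGGCACTA-1", "Unknown_Data", "CD8 Trm", "T"],  # hypothetical DataID
]
cell_problems = check_cell_metadata(header, rows, {"HC1_Cecum_CD45-_Baseline"})
```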
Samplemap
The idea of the samplemap comes from the R package MultiAssayExperiment. This file maps column names in the bulk omics data to SampleID in the sample-level metadata. It also allows biological duplicates (multiple data columns from the same origin) in the bulk omics dataset.
Required columns
Column 1: Column names in the bulk omics data
Column 2: SampleID in the sample-level metadata
Column 3: Omics data type (transcriptome, microbiome, metabolome)
Example
DataID SampleID datatype
GSM3043377 C3002CSC1_CD_Sigmoid Colon_0 transcriptome
GSM3043378 C3002CSC2_CD_Ileum_0 transcriptome
GSM3043379 C3002CSC3_CD_Rectum_0 transcriptome
GSM3043380 C3002CSC4_CD_Sigmoid Colon_0 transcriptome
GSM3043381 C3003CSC1_UC_Ileum_1 transcriptome
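Because several data columns may map to the same SampleID, a samplemap is naturally a many-to-one mapping. The following sketch groups column names by sample and data type to make such duplicates visible. `group_columns` is a hypothetical helper, not part of the tool, and the column name `GSM_DUPLICATE` is invented to illustrate a second column from the same sample.

```python
from collections import defaultdict


def group_columns(samplemap_rows):
    """Group bulk omics column names by (SampleID, datatype).

    Each row is (column name, SampleID, datatype), matching the three
    samplemap columns.
    """
    grouped = defaultdict(list)
    for column_name, sample_id, datatype in samplemap_rows:
        grouped[(sample_id, datatype)].append(column_name)
    return dict(grouped)


samplemap_rows = [
    ("GSM3043377", "C3002CSC1_CD_Sigmoid Colon_0", "transcriptome"),
    ("GSM3043378", "C3002CSC2_CD_Ileum_0", "transcriptome"),
    # Hypothetical second column derived from the same sample.
    ("GSM_DUPLICATE", "C3002CSC1_CD_Sigmoid Colon_0", "transcriptome"),
]
grouped = group_columns(samplemap_rows)
```

Keys whose list holds more than one column name correspond to duplicated measurements of the same sample.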