=======================
Input file: Omics data
=======================

Three types of input files (*Omics data*, *Metadata*, *Config file*) are required by this system. Example files are in the ``./test`` directory of the repository.

---------------
Bulk omics data
---------------

This tool can integrate the following types of bulk omics data: *Transcriptome*, *Microbiome (bacterial composition)*, and *Metabolome*. The files must follow the formats described below.

Basic structures
================

- The first column must contain entry names (e.g., gene names).
- The first row must contain sample names.

**Example**

.. code-block::

   Metabolite            HSM5FZBJ            MSM5FZ9X           CSM5FZ3N
   1-3-7-trimethylurate  0.0366911709150494  5.43991362683316   124.707446592178
   1-methylguanine       36.70439623255      5.71735703889826   3.21072296327096
   1-methylguanosine     1.11185382656899    0.738463088128547  0.264874440752025
   1-methylhistamine     0.987786208017349   1.79220474268428   1.20991463790381

++++

Option: ``rownames``
====================

The ``rownames`` option in the config file defines which identifiers are used as row names of the bulk omics datasets. The following identifiers are allowed. They are used as entity names in the knowledge graph.

- Transcriptome: Gene symbol, Ensembl Gene ID (e.g., ENSG00000001)
- Microbiome: Organism name, QIIME2-style (e.g., g__Escherichia)
- Metabolome: Metabolite name, HMDB ID

Option: ``unit``
==================

The ``unit`` option in the config file defines the unit of the values. Values must be normalized to one of the following units.

- Transcriptome: TPM, FPKM
- Microbiome: Composition (the values for each sample should sum to 1.0)
- Metabolome: PPM
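
The format and unit rules above can be checked before a file is passed to the tool. The following is a minimal sketch and is not part of this tool: the function names ``read_bulk_table`` and ``check_composition``, the tab delimiter, the uniqueness check on entry names, and the tolerance are all assumptions made for illustration.

.. code-block:: python

   import csv
   import io
   import math


   def read_bulk_table(text, delimiter="\t"):
       """Parse a bulk omics table into (sample_names, {entry_name: [values]}).

       The first row holds sample names and the first column holds entry names.
       The tab delimiter is an assumption, not a documented requirement.
       """
       rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
       if len(rows) < 2:
           raise ValueError("expected a header row and at least one entry row")
       samples = rows[0][1:]
       table = {}
       for row in rows[1:]:
           name, values = row[0], row[1:]
           # Assumption: entry names are unique, since they become entity names.
           if name in table:
               raise ValueError(f"duplicated entry name: {name}")
           if len(values) != len(samples):
               raise ValueError(f"{name}: expected {len(samples)} values, got {len(values)}")
           table[name] = [float(v) for v in values]
       return samples, table


   def check_composition(samples, table, tol=1e-6):
       """Return the samples whose values do not sum to 1.0 (microbiome unit)."""
       bad = []
       for i, sample in enumerate(samples):
           total = sum(values[i] for values in table.values())
           if not math.isclose(total, 1.0, abs_tol=tol):
               bad.append(sample)
       return bad


   # Hypothetical two-sample microbiome table: S1 sums to 1.0, S2 sums to 0.9.
   text = "Taxon\tS1\tS2\ng__Escherichia\t0.25\t0.5\ng__Bacteroides\t0.75\t0.4\n"
   samples, table = read_bulk_table(text)
   print(check_composition(samples, table))  # ['S2']

The composition check applies only to microbiome data. For transcriptome and metabolome data, the unit is declared through the ``unit`` option and cannot be verified from the values alone.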

---------------
scRNA-seq data
---------------

The following files are required to integrate single-cell RNA-seq datasets.

- scRNA-seq expression data: ``Manifest file`` or ``h5ad file``
- Cell-level metadata

Expression data
===============

Manifest file
-------------

Cell Ranger writes the results of each run to an ``outs`` directory, as described in the Cell Ranger documentation. An example manifest file is in ``./test/scrna-manifest.txt``.

The manifest file must be a tab-separated file with two columns.

- Column 1: ``DataID``, the ID of the data
- Column 2: Path to Cell Ranger's ``outs`` directory

**Example**

.. code-block::

   DataID                    File
   HC1_Cecum_CD45-_Baseline  /path/to/HC1_Cecum_CD45-_Baseline/outs
   HC1_Cecum_CD45+_Baseline  /path/to/HC1_Cecum_CD45+_Baseline/outs
   HC1_Cecum_Epi+_Baseline   /path/to/HC1_Cecum_Epi+_Baseline/outs
   HC1_Sigma_CD45-_Baseline  /path/to/HC1_Sigma_CD45-_Baseline/outs
   HC1_Sigma_CD45+_Baseline  /path/to/HC1_Sigma_CD45+_Baseline/outs
   HC1_Sigma_Epi+_Baseline   /path/to/HC1_Sigma_Epi+_Baseline/outs

h5ad file
---------

Expression data can also be provided in h5ad format. The observation names (``adata.obs_names``) must match ``Barcode`` in the cell-level metadata, as explained in the following sections.

++++

Cell-level metadata
===================

This is explained in the :ref:`Cell-level metadata` section.
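
The manifest layout described above can be read with the standard library. This is a minimal sketch, not the tool's own loader: the function name ``read_manifest`` is hypothetical, and treating the first row as a header and rejecting repeated ``DataID`` values are assumptions.

.. code-block:: python

   import csv
   import io


   def read_manifest(text):
       """Parse manifest text into {DataID: path to an outs directory}."""
       rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
       # Assumption: the first row is a header whose first column is "DataID".
       if not rows or rows[0][0] != "DataID":
           raise ValueError("the first column of the header row must be 'DataID'")
       manifest = {}
       for number, row in enumerate(rows[1:], start=1):
           if len(row) != 2:
               raise ValueError(f"data row {number}: expected 2 tab-separated columns, got {len(row)}")
           data_id, path = row
           # Assumption: each DataID identifies exactly one outs directory.
           if data_id in manifest:
               raise ValueError(f"data row {number}: duplicated DataID {data_id}")
           manifest[data_id] = path
       return manifest


   example = (
       "DataID\tFile\n"
       "HC1_Cecum_CD45-_Baseline\t/path/to/HC1_Cecum_CD45-_Baseline/outs\n"
       "HC1_Cecum_CD45+_Baseline\t/path/to/HC1_Cecum_CD45+_Baseline/outs\n"
   )
   manifest = read_manifest(example)
   print(manifest["HC1_Cecum_CD45-_Baseline"])

The sketch does not check that the listed directories exist. A real check would also confirm that every ``DataID`` used in the cell-level metadata appears in the manifest.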

----------
Metadata
----------

Multi-omics studies are diverse: experimental designs and data types differ among studies. Multi-omics data also have several characteristics that complicate their integration. For example, (i) the number of samples is often not equivalent among data types, (ii) sampling sites sometimes differ among data types (microbiome and metabolome data are sometimes derived from stool samples), and (iii) single-cell data have an additional dimension, "cell", which does not exist in bulk omics data.

To deal flexibly with this diversity, this tool requires metadata at four levels (Patient-level, Sample-level, Cell-level, Samplemap). Create the metadata files as follows.

Patient-level metadata
======================

Required columns
----------------

- ``PatientID``: ID of patients. Duplicates are not allowed in this file.

Optional columns
----------------

- | Column names must follow the pattern ``NAME[category|numeric]``.
  | For example, to include age in this file, name the column ``age[numeric]`` so that its values are recognized as numeric.

**Example**

.. code-block::

   PatientID  Age[numeric]  Sex[category]  Disease[category]
   C3001      43            Female         CD
   C3002      76            Female         CD
   C3003      43            Female         UC
   C3004      47            Female         UC
   C3005      76            Female         UC

++++

Sample-level metadata
======================

Required columns
----------------

- ``SampleID``: ID of samples. Duplicates are not allowed in this file.
- ``PatientID``: ID of patients. The ID must be present in the patient-level metadata.

Optional columns
----------------

- Column names must follow the pattern ``NAME[category|numeric]``.

**Example**

.. code-block::

   SampleID                      PatientID  Tissue[category]  Time[category]
   C3001CSC1_CD_Rectum_2         C3001      Rectum            2 weeks
   C3001CSC2_CD_Ileum_2          C3001      Ileum             2 weeks
   C3002CSC1_CD_Sigmoid Colon_0  C3002      Colon             0 weeks
   C3002CSC2_CD_Rectum_0         C3002      Rectum            0 weeks
   C3002CSC3_CD_Ileum_0          C3002      Ileum             0 weeks

++++

Cell-level metadata
===================

- Required columns

  - ``Barcode``: Cell barcodes (correspond to ``barcodes.tsv.gz`` in Cell Ranger's output)
  - ``DataID``: ID of data. The ID must be present in the manifest file.
  - ``CellType``: Manually labeled cell types (e.g., Th17, Inflammatory monocyte)
  - ``CellTypeGroup``: Group of cell types (e.g., CD4T, B, Plasma)

- Optional columns

  - Column names must follow the pattern ``NAME[category|numeric]``.

**Example**

.. code-block::

   Barcode             DataID                    CellType               CellTypeGroup
   AACCATGCACGTCTCT-1  HC1_Cecum_CD45-_Baseline  CD4 Eff                T
   ACACCAAGTGCCTGTG-1  HC1_Cecum_CD45-_Baseline  Th1                    T
   ATAAGAGTCGCGATCG-1  HC1_Cecum_CD45-_Baseline  CD8 terminal effector  T
   CATCCACAGGGCACTA-1  HC1_Cecum_CD45-_Baseline  CD8 Trm                T
   CCACCTACACCCATTC-1  HC1_Cecum_CD45-_Baseline  Trm17/IL26             T

++++

Samplemap
===================

The idea of the Samplemap comes from the R package ``MultiAssayExperiment``. This file maps column names in the bulk omics data to ``SampleID`` in the sample-level metadata. It allows biological duplicates (multiple data from the same origin) in the bulk omics dataset.

- Required columns

  - Column 1: Column names in the bulk omics data
  - Column 2: ``SampleID`` in the sample-level metadata
  - Column 3: Omics data type (transcriptome, microbiome, metabolome)

**Example**

.. code-block::

   DataID      SampleID                      datatype
   GSM3043377  C3002CSC1_CD_Sigmoid Colon_0  transcriptome
   GSM3043378  C3002CSC2_CD_Ileum_0          transcriptome
   GSM3043379  C3002CSC3_CD_Rectum_0         transcriptome
   GSM3043380  C3002CSC4_CD_Sigmoid Colon_0  transcriptome
   GSM3043381  C3003CSC1_UC_Ileum_1          transcriptome
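
The metadata rules above come down to two kinds of checks: optional column names must follow ``NAME[category|numeric]``, and IDs must resolve across levels. The following is a minimal sketch of both, not the tool's own validation: the names ``TYPED_COLUMN``, ``parse_column``, and ``check_links`` are hypothetical, and the assumption that ``NAME`` contains no square brackets is made only for illustration.

.. code-block:: python

   import re

   # NAME[category] or NAME[numeric]. Assumption: NAME contains no square brackets.
   TYPED_COLUMN = re.compile(r"^(?P<name>[^\[\]]+)\[(?P<kind>category|numeric)\]$")


   def parse_column(header):
       """Split an optional column header into (name, kind), or raise ValueError."""
       match = TYPED_COLUMN.match(header)
       if match is None:
           raise ValueError(f"column name must look like NAME[category|numeric]: {header!r}")
       return match.group("name"), match.group("kind")


   def check_links(patient_ids, sample_to_patient, samplemap_sample_ids):
       """Return messages describing broken references between metadata levels.

       patient_ids: PatientID values from the patient-level metadata.
       sample_to_patient: {SampleID: PatientID} from the sample-level metadata.
       samplemap_sample_ids: SampleID values from column 2 of the Samplemap.
       """
       problems = []
       known_patients = set(patient_ids)
       if len(known_patients) != len(patient_ids):
           problems.append("PatientID is duplicated in the patient-level metadata")
       for sample_id, patient_id in sample_to_patient.items():
           if patient_id not in known_patients:
               problems.append(f"{sample_id}: PatientID {patient_id} is not in the patient-level metadata")
       for sample_id in samplemap_sample_ids:
           if sample_id not in sample_to_patient:
               problems.append(f"Samplemap: SampleID {sample_id} is not in the sample-level metadata")
       return problems


   print(parse_column("Age[numeric]"))  # ('Age', 'numeric')
   for message in check_links(
       ["C3001", "C3002"],
       {"C3001CSC1_CD_Rectum_2": "C3001", "C3003CSC1_UC_Ileum_1": "C3003"},
       ["C3001CSC1_CD_Rectum_2"],
   ):
       print(message)

Repeated ``SampleID`` values in the Samplemap are accepted on purpose, because the Samplemap allows several bulk omics columns to map to the same sample.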