=======================
Input file: Omics data
=======================

Three types of input files (*Omics data*, *Metadata*, and *Config file*) are required by this system. Example files are in the ``./test`` directory of the repository.


.. raw:: html

   <br>
   <hr style="border:1px solid gray">


---------------
Bulk omics data
---------------

This tool can integrate the following types of bulk omics data: *Transcriptome*, *Microbiome (Bacterial composition)*, and *Metabolome*. The files must follow the formats below.

Basic structures
================
- The first column must contain entry names (e.g., gene names).
- The first row must contain sample names.

**Example**

.. code-block:: 
  
  Metabolite	HSM5FZBJ	MSM5FZ9X	CSM5FZ3N
  1-3-7-trimethylurate	0.0366911709150494	5.43991362683316	124.707446592178
  1-methylguanine	36.70439623255	5.71735703889826	3.21072296327096
  1-methylguanosine	1.11185382656899	0.738463088128547	0.264874440752025
  1-methylhistamine	0.987786208017349	1.79220474268428	1.20991463790381

++++

Option: ``rownames``
====================
The ``rownames`` option in the config file defines which identifiers are used as row names of the bulk omics datasets. The following identifiers are allowed; they are used as entity names in the knowledge graph.

- Transcriptome: Gene symbol, Ensembl Gene ID (e.g., ENSG00000001)
- Microbiome: Organism name, QIIME2-style (e.g., g__Escherichia)
- Metabolome: Metabolite name, HMDB ID
  
Option: ``unit``
==================
The ``unit`` option in the config file defines the unit of the values. Values must be normalized to one of the following units.

- Transcriptome: TPM, FPKM
- Microbiome: Composition (sum of values per sample should be 1.0)
- Metabolome: PPM
  
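The format rules above (entry names in the first column, sample names in the first row, and per-sample values summing to 1.0 for microbiome data) can be checked before running the tool. The following is a minimal sketch using only the Python standard library. The function names ``read_bulk_table`` and ``is_composition``, the tolerance of ``1e-6``, and the example table are assumptions made for illustration and are not part of this tool.

.. code-block:: python

  import csv
  import io
  import math


  def read_bulk_table(handle):
      """Read a tab-separated bulk omics table into {sample: {entry: value}}."""
      reader = csv.reader(handle, delimiter="\t")
      header = next(reader)
      samples = header[1:]  # first row: sample names (first cell labels the entry column)
      table = {sample: {} for sample in samples}
      for row in reader:
          if not row:
              continue  # skip blank lines
          entry, values = row[0], row[1:]  # first column: entry name
          if len(values) != len(samples):
              raise ValueError(f"{entry}: expected {len(samples)} values, got {len(values)}")
          for sample, value in zip(samples, values):
              table[sample][entry] = float(value)
      return table


  def is_composition(table, tol=1e-6):
      """Return True if the values of every sample sum to 1.0 (microbiome unit)."""
      return all(math.isclose(sum(col.values()), 1.0, abs_tol=tol) for col in table.values())


  # Hypothetical microbiome table, used only to illustrate the checks.
  example = "Taxon\tS1\tS2\ng__Escherichia\t0.25\t0.5\ng__Bacteroides\t0.75\t0.5\n"
  microbiome = read_bulk_table(io.StringIO(example))
  print(is_composition(microbiome))

For a real file, an open file handle (for example ``open(path, newline="")``) can be passed in place of the ``io.StringIO`` object.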

.. raw:: html

   <br>
   <hr style="border:1px solid gray">


---------------
scRNA-seq data
---------------

The following files are required to integrate single-cell RNA-seq datasets. 

- scRNA-seq expression data: ``Manifest file`` or ``h5ad file``
- Cell-level metadata

Expression data
===============

Manifest file
-------------

The manifest file lists the ``outs`` directories produced by Cell Ranger, as explained `here <https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/overview>`__.
An example is in ``./test/scrna-manifest.txt``. The manifest file must be a tab-separated file with two columns.

- Column 1: ``DataID``, the ID of the data
- Column 2: Path to the Cell Ranger ``outs`` directory

**Example**

.. code-block:: 
  
  DataID	File
  HC1_Cecum_CD45-_Baseline	/path/to/HC1_Cecum_CD45-_Baseline/outs
  HC1_Cecum_CD45+_Baseline	/path/to/HC1_Cecum_CD45+_Baseline/outs
  HC1_Cecum_Epi+_Baseline	/path/to/HC1_Cecum_Epi+_Baseline/outs
  HC1_Sigma_CD45-_Baseline	/path/to/HC1_Sigma_CD45-_Baseline/outs
  HC1_Sigma_CD45+_Baseline	/path/to/HC1_Sigma_CD45+_Baseline/outs
  HC1_Sigma_Epi+_Baseline	/path/to/HC1_Sigma_Epi+_Baseline/outs

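A manifest in this format can be parsed and sanity-checked with a few lines of Python. This is a sketch under stated assumptions, not the tool's own loader: the function name ``read_manifest`` is hypothetical, and the rule that each ``DataID`` appears only once is an assumption that the format description does not state explicitly.

.. code-block:: python

  import csv
  import io


  def read_manifest(handle):
      """Parse a two-column, tab-separated manifest into {DataID: path}."""
      reader = csv.reader(handle, delimiter="\t")
      header = next(reader)
      if len(header) != 2 or header[0] != "DataID":
          raise ValueError(f"unexpected header: {header}")
      manifest = {}
      for line_no, row in enumerate(reader, start=2):
          if not row:
              continue  # skip blank lines
          if len(row) != 2:
              raise ValueError(f"line {line_no}: expected 2 columns, got {len(row)}")
          data_id, outs_path = row
          # Assumption: a DataID identifies exactly one outs directory.
          if data_id in manifest:
              raise ValueError(f"line {line_no}: duplicate DataID {data_id!r}")
          manifest[data_id] = outs_path
      return manifest


  text = (
      "DataID\tFile\n"
      "HC1_Cecum_CD45-_Baseline\t/path/to/HC1_Cecum_CD45-_Baseline/outs\n"
      "HC1_Cecum_CD45+_Baseline\t/path/to/HC1_Cecum_CD45+_Baseline/outs\n"
  )
  manifest = read_manifest(io.StringIO(text))
  print(sorted(manifest))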

h5ad file
---------

Expression data can also be provided in h5ad format. Object names (``adata.obs_names``) must match the ``Barcode`` column of the cell-level metadata, as explained in the following sections.

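Whether the object names and the ``Barcode`` values agree can be checked by comparing the two sets of identifiers. The sketch below is pure Python; the function name ``missing_barcodes`` is hypothetical, and the third barcode in the example is invented to show a mismatch.

.. code-block:: python

  def missing_barcodes(obs_names, metadata_barcodes):
      """Return object names that have no matching Barcode in the metadata."""
      known = set(metadata_barcodes)
      return sorted(name for name in obs_names if name not in known)


  # With the anndata package, obs_names could be obtained as
  # anndata.read_h5ad("/path/to/data.h5ad").obs_names (the path is a placeholder).
  obs_names = ["AACCATGCACGTCTCT-1", "ACACCAAGTGCCTGTG-1", "TTTGGGCCCAAATTTG-1"]
  metadata_barcodes = ["AACCATGCACGTCTCT-1", "ACACCAAGTGCCTGTG-1"]
  print(missing_barcodes(obs_names, metadata_barcodes))

An empty result means every cell in the h5ad file has a row in the cell-level metadata.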

++++


Cell-level metadata
===================

This file is described in the :ref:`Cell-level metadata` section under *Metadata*.


.. raw:: html

   <br>
   <hr style="border:1px solid gray">


----------
Metadata
----------

Multi-omics studies are diverse: experimental designs and data types differ among studies.

Multi-omics data have several characteristics that complicate integration. For example, (i) sample numbers are often not equivalent among data types, (ii) sampling sites sometimes differ among data types (microbiome and metabolome data are sometimes derived from stool samples), and (iii) single-cell data have an additional dimension, "cell", which does not exist in bulk omics data.

To deal flexibly with this diversity, the tool requires metadata at four levels (Patient-level, Sample-level, Cell-level, and Samplemap). Create the metadata files as follows.


Patient-level metadata
======================

Required columns
----------------
- ``PatientID``: ID of patients. Duplication is not allowed in this file.
  
Optional columns
----------------
- | Column names must follow the pattern ``NAME[category|numeric]``.
  | For example, to include age in this file, name the column ``Age[numeric]`` so that its values are recognized as numeric.

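The ``NAME[category|numeric]`` convention can be interpreted with a small regular expression. This is only a sketch of one way to read such headers; the names ``COLUMN_PATTERN``, ``parse_column``, and ``convert`` are hypothetical, and the tool itself may parse the columns differently.

.. code-block:: python

  import re

  COLUMN_PATTERN = re.compile(r"^(?P<name>.+)\[(?P<kind>category|numeric)\]$")


  def parse_column(header):
      """Split 'NAME[category|numeric]' into (name, kind); raise if malformed."""
      match = COLUMN_PATTERN.match(header)
      if match is None:
          raise ValueError(f"column {header!r} is not of the form NAME[category|numeric]")
      return match.group("name"), match.group("kind")


  def convert(value, kind):
      """Convert a cell to float for numeric columns; keep text for categories."""
      return float(value) if kind == "numeric" else value


  for header, cell in [("Age[numeric]", "43"), ("Sex[category]", "Female")]:
      name, kind = parse_column(header)
      print(name, kind, convert(cell, kind))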

**Example**

.. code-block:: 
  
  PatientID       Age[numeric]    Sex[category]   Disease[category]
  C3001   43      Female  CD
  C3002   76      Female  CD
  C3003   43      Female  UC
  C3004   47      Female  UC
  C3005   76      Female  UC
  
++++

Sample-level metadata
======================

Required columns
----------------
- ``SampleID``: ID of samples. Duplication is not allowed in this file.
- ``PatientID``: ID of patients. Each ID must be present in the patient-level metadata.
  
Optional columns
----------------
- Column names must follow the pattern ``NAME[category|numeric]``.

**Example**

.. code-block:: 

  SampleID	PatientID	Tissue[category]	Time[category]
  C3001CSC1_CD_Rectum_2	C3001	Rectum	2 weeks
  C3001CSC2_CD_Ileum_2	C3001	Ileum	2 weeks
  C3002CSC1_CD_Sigmoid	Colon_0	C3002	Colon	0 weeks
  C3002CSC2_CD_Rectum_0	C3002	Rectum	0 weeks
  C3002CSC3_CD_Ileum_0	C3002	Ileum	0 weeks

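The two rules for the required columns (unique ``SampleID`` values, and ``PatientID`` values that exist in the patient-level metadata) can be verified as follows. This is a minimal sketch; the function name ``check_sample_metadata`` is hypothetical, and the patient ID ``C3009`` is deliberately invalid to show what a reported problem looks like.

.. code-block:: python

  def check_sample_metadata(sample_rows, patient_ids):
      """Return a list of problems found in sample-level metadata rows."""
      problems = []
      seen = set()
      known_patients = set(patient_ids)
      for row in sample_rows:
          sample_id, patient_id = row["SampleID"], row["PatientID"]
          if sample_id in seen:
              problems.append(f"duplicate SampleID: {sample_id}")
          seen.add(sample_id)
          if patient_id not in known_patients:
              problems.append(f"unknown PatientID for {sample_id}: {patient_id}")
      return problems


  patients = ["C3001", "C3002"]
  samples = [
      {"SampleID": "C3001CSC1_CD_Rectum_2", "PatientID": "C3001"},
      {"SampleID": "C3001CSC2_CD_Ileum_2", "PatientID": "C3001"},
      {"SampleID": "C3002CSC2_CD_Rectum_0", "PatientID": "C3009"},  # deliberately wrong
  ]
  print(check_sample_metadata(samples, patients))

Rows in this dictionary form can be produced from a tab-separated file with ``csv.DictReader(handle, delimiter="\t")``.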

++++


Cell-level metadata
===================

- Required columns

  - ``Barcode``: Cell barcodes (corresponding to ``barcodes.tsv.gz`` in the Cell Ranger output)
  - ``DataID``: ID of data. Each ID must be present in the manifest file.
  - ``CellType``: Manually labeled cell types (e.g., Th17, Inflammatory monocyte)
  - ``CellTypeGroup``: Group of cell types (e.g., CD4T, B, Plasma)
  
- Optional columns

  - Column names must follow the pattern ``NAME[category|numeric]``

**Example**

.. code-block:: 
  
  Barcode	DataID	CellType	CellTypeGroup
  AACCATGCACGTCTCT-1	HC1_Cecum_CD45-_Baseline	CD4 Eff	T
  ACACCAAGTGCCTGTG-1	HC1_Cecum_CD45-_Baseline	Th1	T
  ATAAGAGTCGCGATCG-1	HC1_Cecum_CD45-_Baseline	CD8 terminal effector	T
  CATCCACAGGGCACTA-1	HC1_Cecum_CD45-_Baseline	CD8 Trm	T
  CCACCTACACCCATTC-1	HC1_Cecum_CD45-_Baseline	Trm17/IL26	T


++++


Samplemap
===================

The idea of the Samplemap comes from the R package ``MultiAssayExperiment``. This file maps column names in the bulk omics data to ``SampleID`` values in the sample-level metadata. It allows biological replicates (multiple data columns from the same origin) in the bulk omics dataset.

- Required columns

  - Column 1: Column names in the bulk omics data
  - Column 2: ``SampleID`` in the sample-level metadata
  - Column 3: Omics data type (transcriptome, microbiome, or metabolome)
  
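Because several bulk-data columns may point to the same ``SampleID``, the three columns are naturally read as a one-to-many lookup. The sketch below groups column names by data type and sample. The function name ``group_by_sample`` and the row values are hypothetical and serve only to illustrate the mapping.

.. code-block:: python

  from collections import defaultdict


  def group_by_sample(samplemap_rows):
      """Map (datatype, SampleID) to the bulk-data column names assigned to it."""
      groups = defaultdict(list)
      for column_name, sample_id, datatype in samplemap_rows:
          groups[(datatype, sample_id)].append(column_name)
      return dict(groups)


  # Hypothetical rows: two transcriptome columns share one SampleID.
  rows = [
      ("colA", "Sample1", "transcriptome"),
      ("colB", "Sample1", "transcriptome"),
      ("colC", "Sample2", "microbiome"),
  ]
  for key, columns in group_by_sample(rows).items():
      print(key, columns)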

**Example**

.. code-block:: 
  
  DataID	SampleID	datatype
  GSM3043377	C3002CSC1_CD_Sigmoid	Colon_0	transcriptome
  GSM3043378	C3002CSC2_CD_Ileum_0	transcriptome
  GSM3043379	C3002CSC3_CD_Rectum_0	transcriptome
  GSM3043380	C3002CSC4_CD_Sigmoid	Colon_0	transcriptome
  GSM3043381	C3003CSC1_UC_Ileum_1	transcriptome