=================================
Input file: Config file
=================================

The config file specifies three types of information: (i) which data to use (omics data and metadata), (ii) which analyses to run, and (iii) which public resources to use. Graph modeling is performed automatically based on the analysis types and the public resources defined in the configuration file.

.. .. figure:: _static/config.png

The configuration file is a YAML file with four sections: (i) ``input``, (ii) ``metadata``, (iii) ``pipeline`` and (iv) ``public``.
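
As a minimal sketch of this layout, the check below reports which of the four sections are missing from a parsed configuration. It assumes the file has already been loaded into a plain Python ``dict`` by a YAML parser; ``missing_sections`` is a hypothetical helper for illustration and is not part of the tool.

.. code-block:: python

  # Section names are taken from the text above.
  REQUIRED_SECTIONS = ("input", "metadata", "pipeline", "public")


  def missing_sections(config):
      """Return the required top-level sections absent from a parsed config."""
      return [name for name in REQUIRED_SECTIONS if name not in config]


  # Assumed shape: the dict a YAML loader would return for the config file.
  config = {"input": {}, "metadata": {}, "pipeline": {}}
  print(missing_sections(config))  # ['public']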

.. raw:: html

   <hr style="border:1px solid gray">
   <br>    

-----------------------
Section: ``input``
-----------------------

Example
================

.. code-block:: 

  input:
    transcriptome:
      path: transcriptome.TPM.txt
      format: tsv
      rownames: Symbol
      unit: TPM
    microbiome:
      path: microbiome.Ratio.txt
      format: tsv
      rownames: Genus
      unit: Ratio
    metabolome:
      path: metabolome.PPM.txt
      format: tsv
      rownames: Name
      unit: PPM
    single-cell: 
      path: single-cell.manifest.txt
      format: tsv
      genes: symbol
      deconvolution: nusvr

++++

Options
================

This section describes the input multi-omics data. Four data types (transcriptome, microbiome, metabolome, single-cell) are accepted. The options for each data type are listed below.

single-cell
-------------------------------------

- ``path``: Relative path to the h5ad file or the manifest file (described in :ref:`Input files: Omics data`)
- ``format``: *h5ad*, *tsv* (tab-separated), or *csv* (comma-separated)
- ``genes``: Gene identifiers used: *Symbol* or *ENSG*
- ``deconvolution``: *no* (deconvolution is not performed), *nusvr* (nu-support vector regression), or *nnls* (non-negative least squares regression)

transcriptome, microbiome, metabolome
-------------------------------------

- ``path``: Relative path to the table-format omics data
- ``format``: *tsv* (tab-separated) or *csv* (comma-separated)
- ``rownames``: Identifiers used as rownames of the table

  - transcriptome: *Symbol*, *ENSG*
  - microbiome: *Species*, *Genus*
  - metabolome: *Name*, *HMDB*
  
- ``unit``: Unit of the values

  - transcriptome: *TPM*, *FPKM*
  - microbiome: *Ratio*
  - metabolome: *PPM*
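
The option lists above can be turned into a simple consistency check. The following is a sketch only: the allowed ``format`` and ``rownames`` values are copied from this page, while ``check_table_entry`` and its messages are hypothetical and do not reflect how the tool itself validates input.

.. code-block:: python

  # Allowed values copied from the option lists above.
  ALLOWED_FORMATS = {"tsv", "csv"}
  ALLOWED_ROWNAMES = {
      "transcriptome": {"Symbol", "ENSG"},
      "microbiome": {"Species", "Genus"},
      "metabolome": {"Name", "HMDB"},
  }


  def check_table_entry(data_type, entry):
      """Return a list of problems found in one table-format input entry."""
      if data_type not in ALLOWED_ROWNAMES:
          return [f"unknown data type: {data_type}"]
      problems = []
      if "path" not in entry:
          problems.append("missing 'path'")
      if entry.get("format") not in ALLOWED_FORMATS:
          problems.append(f"format must be one of {sorted(ALLOWED_FORMATS)}")
      if entry.get("rownames") not in ALLOWED_ROWNAMES[data_type]:
          allowed = sorted(ALLOWED_ROWNAMES[data_type])
          problems.append(f"rownames must be one of {allowed}")
      return problems


  entry = {"path": "microbiome.Ratio.txt", "format": "tsv", "rownames": "Genus"}
  print(check_table_entry("microbiome", entry))  # []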

.. raw:: html

   <hr style="border:1px solid gray">
   <br>    

-----------------------
Section: ``metadata``
-----------------------

Example
================

.. code-block:: 

  metadata: 
    patient: 
      path: metadata.Patient.txt
      format: tsv
    sample: 
      path: metadata.Sample.txt
      format: tsv
    cell: 
      path: metadata.Cell.txt
      format: tsv
    samplemap:
      path: metadata.SampleMap.txt
      format: tsv
      duplicated_samples: mean

++++

Options
================

This section describes the metadata. As explained in :ref:`Input files: Omics data`, four metadata files (patient-, sample-, and cell-level metadata, plus the samplemap) are required. The following options are required for each metadata file.

For all metadata
-------------------------------------

- ``path``: Relative path to the metadata file
- ``format``: *tsv* (tab-separated) or *csv* (comma-separated)

samplemap
----------------------

- ``duplicated_samples``: How to merge multiple values from the same sample: *mean* or *max*
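
To illustrate the two merge modes, the sketch below collapses repeated measurements per sample with either the mean or the maximum. The data layout (a ``dict`` mapping each sample to a list of values) and the helper ``merge_duplicates`` are assumptions made for this example, not the tool's internal representation.

.. code-block:: python

  from statistics import mean

  # The two modes accepted by ``duplicated_samples``.
  MERGERS = {"mean": mean, "max": max}


  def merge_duplicates(values_by_sample, how="mean"):
      """Collapse the values recorded for each sample into a single value."""
      merge = MERGERS[how]
      return {sample: merge(values) for sample, values in values_by_sample.items()}


  values = {"S1": [2.0, 4.0], "S2": [5.0]}
  print(merge_duplicates(values, "mean"))  # {'S1': 3.0, 'S2': 5.0}
  print(merge_duplicates(values, "max"))   # {'S1': 4.0, 'S2': 5.0}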


.. raw:: html

   <hr style="border:1px solid gray">
   <br>    


-----------------------
Section: ``pipeline``
-----------------------

Example
================

.. code-block:: 

  pipeline:
    Cell-Cell:
      CORRELATE_WITH:
        methods: [Pearson, Spearman]
        level: Sample
        min_required_data: 20
        min_detected_ratio: 0.2
        min_correlation: 0.2
      LIGAND_RECEPTOR_COUNT:
        methods: [NATMI, LogFC]
        subsampling: 0
        top_perc: 0.01
      PHYSICALLY_INTERACT:
        methods: [Neighborseq]
        threshold: 0
    Cell-Microbe:
      CORRELATE_WITH:
        methods: [Pearson, Spearman]
        level: Sample
        min_required_data: 20
        min_detected_ratio: 0.2
        min_correlation: 0.2
      INTRACELLULAR_MICROBE:
        methods: [SAHMI]
        threshold: 0
    Cell-Gene:
      SPECIFICALLY_EXPRESS:
        methods: [wilcoxon]
        fdr_threshold: 0.01
        fc_threshold: 2
        rank_threshold: 3
    Cell-Metabolite:
      CORRELATE_WITH:
        methods: [Pearson, Spearman]
        level: Sample
        min_required_data: 20
        min_detected_ratio: 0.2
        min_correlation: 0.2
    Microbe-Microbe:
      CORRELATE_WITH:
        methods: [Pearson, Spearman]
        level: Sample
        min_required_data: 20
        min_detected_ratio: 0.2
        min_correlation: 0.2
    Microbe-Metabolite:
      CORRELATE_WITH:
        methods: [Pearson, Spearman]
        level: Sample
        min_required_data: 20
        min_detected_ratio: 0.2
        min_correlation: 0.2
        
++++

Available pipelines
===================

=========================  ==========================  ==========================  =====================
Pipelines (RELATION TYPE)  Entity1 (FROM)              Entity2 (TO)                Directed
=========================  ==========================  ==========================  =====================
CORRELATE_WITH             Cell, Metabolite, Microbe   Cell, Metabolite, Microbe   No         
LIGAND_RECEPTOR_COUNT      Cell                        Cell                        Yes
SPECIFICALLY_EXPRESS       Cell                        Gene                        No        
DIFFERENTIAL_ABUNDANCE     Cell, Metabolite, Microbe   State*                      No
DIFFERENTIAL_EXPRESSION    Cell                        State*                      No
PHYSICALLY_INTERACT        Cell                        Cell                        No
INTRACELLULAR_MICROBE      Cell                        Microbe                     No       
=========================  ==========================  ==========================  =====================
 
++++

Options
================

This section defines which analyses are performed to extract relationships from the multi-omics data. The following options are available for each pipeline. Each pipeline returns its results as edges (relationships) in the knowledge graph.

CORRELATE_WITH
-------------------------------------

*CORRELATE_WITH* is a relationship that indicates quantities of *entity X* and *entity Y* are correlated.  

- ``methods``: List of methods to calculate correlation. *Pearson*, *Spearman*
- ``level``: *Sample*-level or *Patient*-level correlation. *Patient* is recommended if the two entities are derived from different sample types of the same patients (e.g., microbiome from stool and cells from tissue).
- ``min_required_data``: Minimum number of data points required for the correlation calculation. The calculation is skipped if the number of data points is below this value.
- ``min_detected_ratio``: The calculation is skipped if there are too many NAs (zero values). A value of 0.2 sets the threshold at 20%.
- ``min_correlation``: Threshold for correlation coefficients. Correlations weaker than this value are excluded from the result.
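
One possible reading of how these thresholds combine is sketched below for Pearson correlation. Several points are assumptions rather than documented behavior: ``min_detected_ratio`` is treated as the minimum fraction of non-zero values in each vector, ``min_correlation`` is compared against the absolute value of the coefficient, and both function names are hypothetical.

.. code-block:: python

  import math


  def pearson(x, y):
      """Pearson correlation of two equal-length sequences, or None if undefined."""
      n = len(x)
      mean_x, mean_y = sum(x) / n, sum(y) / n
      cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
      var_x = sum((a - mean_x) ** 2 for a in x)
      var_y = sum((b - mean_y) ** 2 for b in y)
      if var_x == 0 or var_y == 0:
          return None  # a constant vector has no defined correlation
      return cov / math.sqrt(var_x * var_y)


  def correlate_or_skip(x, y, min_required_data=20,
                        min_detected_ratio=0.2, min_correlation=0.2):
      """Return Pearson's r, or None when the pair is skipped or filtered out."""
      if len(x) != len(y) or len(x) < max(2, min_required_data):
          return None  # too few data points
      for values in (x, y):
          # Assumption: "detected" means a non-zero value.
          detected = sum(1 for v in values if v != 0) / len(values)
          if detected < min_detected_ratio:
              return None  # too many zeros
      r = pearson(x, y)
      if r is None or abs(r) < min_correlation:
          return None  # undefined or too weak
      return r

With the default values, a pair of vectors with fewer than 20 entries, or a vector in which fewer than 20% of the values are non-zero, yields ``None`` in this sketch.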


LIGAND_RECEPTOR_COUNT
-------------------------------------

*LIGAND_RECEPTOR_COUNT* is a relationship that indicates that many ligand-receptor pairs are significantly expressed between *celltype X* and *celltype Y*.

- ``methods``: List of ligand-receptor analysis methods. *NATMI*, *LogFC*, *CellPhoneDB*
- ``subsampling``: Subsample N cells from each celltype to make the analysis more efficient. Subsampling is not performed if this is 0.
- ``top_perc``: Return top N % of significant pairs of celltypes. Top 10 % will be returned if this is 0.1.
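
The effect of ``top_perc`` can be sketched as follows. This example assumes that each directed celltype pair carries a single score (for instance, a count of significant ligand-receptor pairs) and that the number of pairs kept is rounded up; ``top_pairs`` is a hypothetical helper, and the actual ranking used by each method may differ.

.. code-block:: python

  import math


  def top_pairs(scores, top_perc):
      """Return the highest-scoring celltype pairs, keeping a top_perc fraction."""
      # round() guards against floating-point noise before rounding up.
      keep = math.ceil(round(len(scores) * top_perc, 9))
      ranked = sorted(scores, key=scores.get, reverse=True)
      return ranked[:keep]


  # Toy scores for directed pairs (sender, receiver); the values are made up.
  scores = {("A", "B"): 5, ("A", "C"): 9, ("B", "C"): 1, ("C", "A"): 7}
  print(top_pairs(scores, 0.5))  # [('A', 'C'), ('C', 'A')]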


SPECIFICALLY_EXPRESS
-------------------------------------

*SPECIFICALLY_EXPRESS* is a relationship that indicates that *gene Y* is more highly expressed in *celltype X* than in other cells.

- ``methods``: List of statistical tests. *wilcoxon*, *t*
- ``fdr_threshold``: Threshold for false discovery rate (FDR)
- ``fc_threshold``: Threshold for fold change between average in *celltype X* and average in all other cells
- ``rank_threshold``: 
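
A minimal sketch of how the FDR and fold-change thresholds might be applied to per-gene test results is shown below. It covers only ``fdr_threshold`` and ``fc_threshold``. The record layout, the direction of each comparison, and the function name are assumptions for illustration, and the numeric values are toy data.

.. code-block:: python

  def select_specific_genes(results, fdr_threshold=0.01, fc_threshold=2):
      """Keep genes whose FDR and fold change pass both thresholds."""
      return [
          row["gene"]
          for row in results
          if row["fdr"] <= fdr_threshold and row["fold_change"] >= fc_threshold
      ]


  # Toy test results for one celltype; the values are made up.
  results = [
      {"gene": "CD3E", "fdr": 0.001, "fold_change": 4.2},
      {"gene": "ACTB", "fdr": 0.0005, "fold_change": 1.1},
      {"gene": "CD8A", "fdr": 0.2, "fold_change": 3.0},
  ]
  print(select_specific_genes(results))  # ['CD3E']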


DIFFERENTIAL_ABUNDANCE
-------------------------------------

*DIFFERENTIAL_ABUNDANCE* is a relationship that indicates that *entity X* is significantly abundant in *state Y*. This relationship is represented as *(X:)-[:DIFFERENTIAL_ABUNDANCE]-(:DifferentialTest)-[:COMPARATOR]-(Y:State)* in the knowledge graph.

- ``methods``: List of statistical tests. *wilcoxon*, *t*
- ``fdr_threshold``: Threshold for false discovery rate (FDR)
- ``fc_threshold``: Threshold for fold change between average in *celltype X* and average in all other cells


DIFFERENTIAL_EXPRESSION
-------------------------------------

*DIFFERENTIAL_EXPRESSION* is a relationship that indicates that *gene Z* is differentially expressed in *cell X* at *state Y*. This relationship is represented as *(X:Cell)-[:DIFFERENTIAL_EXPRESSION]-(d:DifferentialTest)-[:COMPARATOR]-(Y:State) AND (d)-[:TESTED]-(Z:Gene)* in the knowledge graph.

- ``methods``: List of statistical tests. *wilcoxon*, *t*
- ``fdr_threshold``: Threshold for false discovery rate (FDR)
- ``fc_threshold``: Threshold for fold change between average in *celltype X* and average in all other cells


PHYSICALLY_INTERACT
-------------------------------------

*PHYSICALLY_INTERACT* is a relationship that indicates that *celltype X* and *celltype Y* physically interact.

- ``methods``: List of methods. *Neighborseq*


INTRACELLULAR_MICROBE
-------------------------------------

*INTRACELLULAR_MICROBE* is a relationship that indicates that *microbe Y* is more frequently detected in *celltype X* than in other cells.

- ``methods``: List of methods. *SAHMI*
 
.. raw:: html

   <hr style="border:1px solid gray">
   <br>    

-----------------------
Section: ``public``
-----------------------

Example
===================

.. code-block:: 
  
  public:
    Microbe-Metabolite:
      PRODUCE:
        sources: [gutMGene, NJC19, AGORA2]
    Metabolite-Microbe:
      CONSUME:
        sources: [gutMGene, NJC19, AGORA2]
    Gene-Metabolite:
      RECEPTOR:
        sources: [HMDB, GPCRdb]
    Microbe-Gene:
      MOLECULAR_MIMICRY:
        sources: [HMI-PRED, HPIDB]
    Gene-Gene:
      LIGAND_RECEPTOR:
        sources: [LIANA]

++++

Available datasets
===================

=========================  ================  ================  ================  =====================================
RELATION TYPE              Entity1 (FROM)    Entity2 (TO)      Directed          Source
=========================  ================  ================  ================  =====================================
*PRODUCE*                  Microbe           Metabolite        Yes               gutMGene, NJC19, AGORA2, Text_mining
*UPTAKE*                   Metabolite        Microbe           Yes               NJC19, AGORA2, Text_mining
*RECEPTOR*                 Gene              Metabolite        Yes               HMDB, GPCRdb
*ENZYME*                   Gene              Metabolite        Yes               HMDB, GPCRdb
*MOLECULAR_MIMICRY*        Microbe           Gene              Yes               HMI-PRED, HPIDB
*LIGAND_RECEPTOR*          Gene              Gene              Yes               LIANA
=========================  ================  ================  ================  =====================================

PRODUCE/UPTAKE
---------------

*PRODUCE* and *UPTAKE* are relationships between *Microbe* and *Metabolite*. The information is collected in two ways: (i) metabolic modeling and (ii) literature-based evidence.

**Metabolic modeling**

We predicted bacterial production and consumption of metabolites by flux variability analysis (FVA), as explained in [Magnusdottir2017]_.
We used AGORA2 ([Heinken2023]_), a collection of genome-scale metabolic models, to predict the metabolic potential of >7500 human gut microbes.

**Literature-based evidence**

We collected literature-based information on bacterial metabolic potential from two public databases, gutMGene ([Cheng2022]_) and NJC19 ([Lim2020]_).


RECEPTOR/ENZYME
---------------

*RECEPTOR* and *ENZYME* are relationships between *Gene* and *Metabolite*.
A relationship ``(:Gene)<-[:RECEPTOR]-(:Metabolite)`` denotes that the gene encodes a receptor for the metabolite.
We collected information on genes associated with metabolic reactions from the public databases HMDB ([Wishart2022]_) and GPCRdb ([Gaspar2023]_).


.. raw:: html

   <hr style="border:1px solid gray">
   <br>