Input file: Config file

The config file contains three types of information: (i) which data (omics and metadata) to use, (ii) which analyses to run, and (iii) which public resources to use. Graph modeling is performed automatically based on the analysis types and the public resources defined in the configuration file.

The configuration file is a YAML file with four sections: (i) input, (ii) metadata, (iii) pipeline and (iv) public.
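As a quick sanity check before running anything, the four sections listed above can be verified on an already-parsed config. This is a minimal sketch, not part of the tool: `missing_sections` is a hypothetical helper, and the config is assumed to have been loaded from YAML into a plain Python dict.

```python
# Hypothetical helper, not part of the tool: report which of the four
# documented top-level sections are absent from an already-parsed config.
EXPECTED_SECTIONS = ("input", "metadata", "pipeline", "public")


def missing_sections(config):
    """Return the documented section names that are not keys of `config`."""
    return [name for name in EXPECTED_SECTIONS if name not in config]


# A config parsed from YAML is assumed to be a plain dict.
config = {
    "input": {"transcriptome": {"path": "transcriptome.TPM.txt", "format": "tsv"}},
    "metadata": {"patient": {"path": "metadata.Patient.txt", "format": "tsv"}},
    "pipeline": {},
}
print(missing_sections(config))
```

Whether every section is mandatory is not stated here, so the helper only reports what is absent and leaves the decision to the caller.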

Section: `input`

Example

input:
  transcriptome:
    path: transcriptome.TPM.txt
    format: tsv
    rownames: Symbol
    unit: TPM
  microbiome:
    path: microbiome.Ratio.txt
    format: tsv
    rownames: Genus
    unit: Ratio
  metabolome:
    path: metabolome.PPM.txt
    format: tsv
    rownames: Name
    unit: PPM
  single-cell:
    path: single-cell.manifest.txt
    format: tsv
    genes: symbol
    deconvolution: nusvr

Options

This section describes the input multi-omics data. Four data types (transcriptome, microbiome, metabolome, single-cell) are accepted. The options for each data type are listed below.

single-cell

path: Relative path to h5ad file or the manifest file (described at Input files: Omics data)
format: h5ad or tsv (tab-separated) or csv (comma-separated)
genes: Gene identifiers used, Symbol or ENSG
deconvolution: no (deconvolution is not performed), nusvr (nu-Support vector regression), nnls (Non-negative least squares regression)

transcriptome, microbiome, metabolome

path: Relative path to the table-format omics data
format: tsv (tab-separated) or csv (comma-separated)
rownames: Identifiers used as rownames of the table
- transcriptome: Symbol, ENSG
- microbiome: Species, Genus
- metabolome: Name, HMDB
unit: Unit of the values
- transcriptome: TPM, FPKM
- microbiome: Ratio
- metabolome: PPM
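The option lists for the table-format data types can be turned into a small check. The sketch below is an illustration only: `check_table_input` is a hypothetical helper, the allowed values are transcribed from the lists above, and the metabolome unit `PPM` is taken from the example config, which is an assumption.

```python
# Allowed values transcribed from the option lists; the metabolome unit (PPM)
# follows the example config and is an assumption.
ALLOWED = {
    "transcriptome": {"rownames": {"Symbol", "ENSG"}, "unit": {"TPM", "FPKM"}},
    "microbiome": {"rownames": {"Species", "Genus"}, "unit": {"Ratio"}},
    "metabolome": {"rownames": {"Name", "HMDB"}, "unit": {"PPM"}},
}
TABLE_FORMATS = {"tsv", "csv"}


def check_table_input(data_type, entry):
    """Return a list of human-readable problems for one table-format entry."""
    if data_type not in ALLOWED:
        return [f"unknown data type: {data_type}"]
    problems = []
    if "path" not in entry:
        problems.append("missing path")
    if entry.get("format") not in TABLE_FORMATS:
        problems.append(f"format must be one of {sorted(TABLE_FORMATS)}")
    for key, allowed in ALLOWED[data_type].items():
        if entry.get(key) not in allowed:
            problems.append(f"{key} must be one of {sorted(allowed)}")
    return problems


entry = {"path": "microbiome.Ratio.txt", "format": "tsv",
         "rownames": "Genus", "unit": "Ratio"}
print(check_table_input("microbiome", entry))
```

An empty list means the entry matches the documented options; the single-cell entry has different keys and is deliberately left out of this sketch.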

Section: `metadata`

Example

metadata:
  patient:
    path: metadata.Patient.txt
    format: tsv
  sample:
    path: metadata.Sample.txt
    format: tsv
  cell:
    path: metadata.Cell.txt
    format: tsv
  samplemap:
    path: metadata.SampleMap.txt
    format: tsv
    duplicated_samples: mean

Options

This section describes the metadata. As explained in Input files: Omics data, four metadata files (patient-, sample-, and cell-level metadata, plus the samplemap) are required. The following options are required for each metadata file.

For all metadata

path: Relative path to the file of metadata
format: tsv (tab-separated) or csv (comma-separated)

samplemap

duplicated_samples: How to merge multiple values from identical samples. mean or max
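To make the two `duplicated_samples` choices concrete, here is a minimal sketch of collapsing repeated sample IDs into one value each. `merge_duplicated` and the `(sample_id, value)` layout are assumptions made for illustration; the tool's own merging code is not shown in this document.

```python
from collections import defaultdict
from statistics import mean


def merge_duplicated(values, how="mean"):
    """Collapse repeated sample IDs into one value each, by mean or max."""
    reducers = {"mean": mean, "max": max}
    if how not in reducers:
        raise ValueError(f"duplicated_samples must be 'mean' or 'max', got {how!r}")
    grouped = defaultdict(list)
    for sample_id, value in values:
        grouped[sample_id].append(value)
    return {sample_id: reducers[how](vals) for sample_id, vals in grouped.items()}


# Made-up measurements: sample S1 appears twice.
measurements = [("S1", 2.0), ("S1", 4.0), ("S2", 5.0)]
print(merge_duplicated(measurements, "mean"))
print(merge_duplicated(measurements, "max"))
```

With `mean` the two S1 values are averaged, while `max` keeps the larger one; samples that appear once are unchanged in both cases.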

Section: `pipeline`

Example

pipeline:
  Cell-Cell:
    CORRELATE_WITH:
      methods: [Pearson, Spearman]
      level: Sample
      min_required_data: 20
      min_detected_ratio: 0.2
      min_correlation: 0.2
    LIGAND_RECEPTOR_COUNT:
      methods: [NATMI, LogFC]
      subsampling: 0
      top_perc: 0.01
    PHYSICALLY_INTERACT:
      methods: [Neighborseq]
      threshold: 0
  Cell-Microbe:
    CORRELATE_WITH:
      methods: [Pearson, Spearman]
      level: Sample
      min_required_data: 20
      min_detected_ratio: 0.2
      min_correlation: 0.2
    INTRACELLULAR_MICROBE:
      methods: [SAHMI]
      threshold: 0
  Cell-Gene:
    SPECIFICALLY_EXPRESS:
      methods: [wilcoxon]
      fdr_threshold: 0.01
      fc_threshold: 2
      rank_threshold: 3
  Cell-Metabolite:
    CORRELATE_WITH:
      methods: [Pearson, Spearman]
      level: Sample
      min_required_data: 20
      min_detected_ratio: 0.2
      min_correlation: 0.2
  Microbe-Microbe:
    CORRELATE_WITH:
      methods: [Pearson, Spearman]
      level: Sample
      min_required_data: 20
      min_detected_ratio: 0.2
      min_correlation: 0.2
  Microbe-Metabolite:
    CORRELATE_WITH:
      methods: [Pearson, Spearman]
      level: Sample
      min_required_data: 20
      min_detected_ratio: 0.2
      min_correlation: 0.2

Available pipelines

Pipelines (RELATION TYPE)	Entity1 (FROM)	Entity2 (TO)	Directed
CORRELATE_WITH	Cell, Metabolite, Microbe	Cell, Metabolite, Microbe	No
LIGAND_RECEPTOR_COUNT	Cell	Cell	Yes
SPECIFICALLY_EXPRESS	Cell	Gene	No
DIFFERENTIAL_ABUNDANCE	Cell, Metabolite, Microbe	State*	No
DIFFERENTIAL_EXPRESSION	Cell	State*	No
PHYSICALLY_INTERACT	Cell	Cell	No
INTRACELLULAR_MICROBE	Cell	Microbe	No

Options

This section defines which analyses are performed to extract relationships from the multi-omics data. The following options are available for each pipeline. Each pipeline returns its results as edges (relationships) in the knowledge graph.

CORRELATE_WITH

CORRELATE_WITH is a relationship that indicates quantities of entity X and entity Y are correlated.

methods: List of methods to calculate correlation. Pearson, Spearman
level: Sample-level correlation or Patient-level correlation. Patient is recommended if the two entities are derived from different sample types of the same patients (e.g., Microbiome from stool & Cell from tissue)
min_required_data: Minimum number of data points required to calculate a correlation. The calculation is skipped if the number of data points is below this value.
min_detected_ratio: Minimum ratio of detected values. The calculation is skipped if there are too many NAs or zeros; a value of 0.2 sets the threshold at 20%.
min_correlation: Threshold for correlation coefficients. Correlations weaker than this value are not included in the result.
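The three thresholds above act as successive gates around the correlation itself. The following sketch shows one way they could be combined for a Pearson correlation; it is not the tool's implementation. The function names are hypothetical, `None` stands in for NA, the detected ratio is checked for each entity separately, and comparing `min_correlation` against the absolute value of the coefficient is an assumption.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation; returns None if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def correlate_if_eligible(xs, ys, min_required_data=20,
                          min_detected_ratio=0.2, min_correlation=0.2):
    """Return the coefficient, or None if any threshold rejects the pair."""
    # Keep only positions where both values are present (None marks NA here).
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < max(min_required_data, 2):
        return None  # too few data points
    col_x, col_y = zip(*pairs)
    for column in (col_x, col_y):
        detected = sum(1 for v in column if v != 0) / len(column)
        if detected < min_detected_ratio:
            return None  # too many zeros
    r = pearson(col_x, col_y)
    if r is None or abs(r) < min_correlation:
        return None  # correlation too weak (absolute value is an assumption)
    return r
```

Returning `None` for every skipped or filtered case keeps the caller simple: only non-`None` results would become CORRELATE_WITH edges.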

LIGAND_RECEPTOR_COUNT

LIGAND_RECEPTOR_COUNT is a relationship that indicates that many ligand-receptor pairs are significantly expressed in celltype X and celltype Y.

methods: List of ligand-receptor analysis methods. NATMI, LogFC, CellPhoneDB
subsampling: Subsample N cells from each celltype to analyze more efficiently. Subsampling is not performed if this is 0.
top_perc: Return top N % of significant pairs of celltypes. Top 10 % will be returned if this is 0.1.
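As an illustration of how a fractional `top_perc` translates into a number of returned pairs, consider the sketch below. The scoring of pairs is hypothetical (any "higher is better" number), `top_fraction` is not a function of the tool, and rounding the count up is an assumption.

```python
import math


def top_fraction(scored_pairs, top_perc):
    """Keep the highest-scoring fraction `top_perc` of celltype pairs."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    # round() guards against float noise such as 3.0000000000000004.
    keep = math.ceil(round(len(ranked) * top_perc, 9))
    return ranked[:keep]


# Ten made-up celltype pairs with scores 1..10.
pairs = [((f"cell{i}", f"cell{i + 1}"), i) for i in range(1, 11)]
print(top_fraction(pairs, 0.1))
```

With ten pairs, `top_perc: 0.1` keeps one pair, and the example value `0.01` would still keep one pair because the count is rounded up.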

SPECIFICALLY_EXPRESS

SPECIFICALLY_EXPRESS is a relationship that indicates that gene Y is more highly expressed in celltype X than in other cells.

methods: List of statistical tests. wilcoxon, t
fdr_threshold: Threshold for false discovery rate (FDR)
fc_threshold: Threshold for fold change between average in celltype X and average in all other cells
rank_threshold:
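The FDR and fold-change thresholds can be read as a joint filter on per-gene test results. The sketch below assumes the statistical test has already been run and that each result carries hypothetical `fdr` and `fold_change` fields; the direction and inclusiveness of the comparisons are assumptions, and the gene values are made up.

```python
def filter_specific_genes(results, fdr_threshold=0.01, fc_threshold=2):
    """Keep genes that pass both the FDR and the fold-change threshold."""
    kept = []
    for row in results:
        # Assumption: lower FDR and higher fold change both mean "more specific".
        if row["fdr"] <= fdr_threshold and row["fold_change"] >= fc_threshold:
            kept.append(row["gene"])
    return kept


# Made-up test results for one celltype.
results = [
    {"gene": "CD3E", "fdr": 0.001, "fold_change": 4.0},
    {"gene": "ACTB", "fdr": 0.001, "fold_change": 1.1},
    {"gene": "MS4A1", "fdr": 0.2, "fold_change": 5.0},
]
print(filter_specific_genes(results))
```

Only genes passing both thresholds would yield a SPECIFICALLY_EXPRESS edge under these assumptions; a gene with a large fold change but a poor FDR is dropped.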

DIFFERENTIAL_ABUNDANCE

DIFFERENTIAL_ABUNDANCE is a relationship that indicates that entity X is significantly abundant in state Y. This relationship is represented as (X:)-[:DIFFERENTIAL_ABUNDANCE]-(:DifferentialTest)-[:COMPARATOR]-(Y:State) in the knowledge graph.

methods: List of statistical tests. wilcoxon, t
fdr_threshold: Threshold for false discovery rate (FDR)
fc_threshold: Threshold for fold change between average in celltype X and average in all other cells

DIFFERENTIAL_EXPRESSION

DIFFERENTIAL_EXPRESSION is a relationship that indicates that gene Z is differentially expressed in cell X at state Y. This relationship is represented as (X:Cell)-[:DIFFERENTIAL_EXPRESSION]-(d:DifferentialTest)-[:COMPARATOR]-(Y:State) AND (d)-[:TESTED]-(Z:Gene) in the knowledge graph.

methods: List of statistical tests. wilcoxon, t
fdr_threshold: Threshold for false discovery rate (FDR)
fc_threshold: Threshold for fold change between average in celltype X and average in all other cells

PHYSICALLY_INTERACT

PHYSICALLY_INTERACT is a relationship that indicates that celltype X and celltype Y physically interact.

methods: List of methods. Neighbor-seq

INTRACELLULAR_MICROBE

INTRACELLULAR_MICROBE is a relationship that indicates that microbe Y is more frequently detected in celltype X than in other cells.

methods: List of methods. SAHMI

Section: `public`

Example

public:
  Microbe-Metabolite:
    PRODUCE:
      sources: [gutMGene, NJC19, AGORA2]
  Metabolite-Microbe:
    CONSUME:
      sources: [gutMGene, NJC19, AGORA2]
  Gene-Metabolite:
    RECEPTOR:
      sources: [HMDB, GPCRdb]
  Microbe-Gene:
    MOLECULAR_MIMICRY:
      sources: [HMI-PRED, HPIDB]
  Gene-Gene:
    LIGAND_RECEPTOR:
      sources: [LIANA]

Available datasets

RELATION TYPE	Entity1 (FROM)	Entity2 (TO)	Directed	Source
PRODUCE	Microbe	Metabolite	Yes	gutMGene, NJC19, AGORA2, Text_mining
UPTAKE	Metabolite	Microbe	Yes	NJC19, AGORA2, Text_mining
RECEPTOR	Gene	Metabolite	Yes	HMDB, GPCRdb
ENZYME	Gene	Metabolite	Yes	HMDB, GPCRdb
MOLECULAR_MIMICRY	Microbe	Gene	Yes	HMI-PRED, HPIDB
LIGAND_RECEPTOR	Gene	Gene	Yes	LIANA

PRODUCE/UPTAKE

PRODUCE and UPTAKE are relationships between Microbe and Metabolite. The information is collected in two ways: (i) Metabolic modeling and (ii) Literature-based evidence.

Metabolic modeling

We predicted bacterial production and consumption of metabolites by flux variability analysis (FVA), as explained in [Magnusdottir2017]. We used AGORA2 ([Heinken2023]), a collection of genome-scale metabolic models, to predict the metabolic potential of >7500 human gut microbes.

Literature-based evidence

We collected literature-based information on bacterial metabolic potential from two public databases, gutMGene ([Cheng2022]) and NJC19 ([Lim2020]).

RECEPTOR/ENZYME

RECEPTOR and ENZYME are relationships between Gene and Metabolite. A relationship (:Gene)<-[RECEPTOR]-(Metabolite) denotes that the gene encodes a receptor for the metabolite. We collected information on genes associated with metabolic reactions from the public databases HMDB ([Wishart2022]) and GPCRdb ([Gaspar2023]).
