# Integrative Network Fusion (INF)

![INF pipeline](figs/INF_pipeline.jpeg)

Repository attached to the article "Integrative Network Fusion: a multi-omics approach in molecular profiling".

**Authors**: Marco Chierici*, Nicole Bussola*, Alessia Marcolini*, Margherita Francescatto, Alessandro Zandonà, Lucia Trastulla, Claudio Agostinelli, Giuseppe Jurman, Cesare Furlanello.

## Setup

```bash
git clone https://gitlab.fbk.eu/MPBA/INF
cd INF
conda env create -f env.yml -n inf
conda activate inf
```

### Additional dependencies

#### R dependencies

To install the R dependencies that are not available in conda channels, run the following command at the R prompt:

```R
install.packages("TunePareto")
```

#### MLPY

The `mlpy` package is required for some operations included in the DAP procedure. The `mlpy` package available on PyPI is outdated and not working on OSX platforms, so follow these steps instead.

Let `<ANACONDA_PATH>` be your anaconda path (e.g., `/home/user/anaconda3`) and `<ENV_NAME>` the name of the conda environment (e.g., `inf`). Adjust these environment variables:

```bash
export LD_LIBRARY_PATH=<ANACONDA_PATH>/envs/<ENV_NAME>/lib:${LD_LIBRARY_PATH}
export CPATH=<ANACONDA_PATH>/envs/<ENV_NAME>/include:${CPATH}
```

and then install `mlpy` from GitLab:

```bash
pip install git+https://gitlab.fbk.eu/MPBA/mlpy.git
```

## Data availability

All the data used to perform the experiments are published on [Figshare](https://doi.org/10.6084/m9.figshare.12052995.v1), both raw data (`original.tar.gz`) and preprocessed data (`tcga_aml.tar.gz`, `tcga_brca.tar.gz`, `tcga_kirc.tar.gz`).

## Usage

#### Quick start

To quickly reproduce our results, first download the preprocessed data from [Figshare](https://doi.org/10.6084/m9.figshare.12052995.v1). Follow the instructions in the Setup section and then, in a shell:

```bash
for filename in *.tar.gz; do tar xzvf "$filename"; done
mkdir data
mv tcga* data
./runner.sh
```

#### Data splits generation

To recreate the 10 data splits, first run the following commands in a shell:

```bash
Rscript scripts/prepare_ACGT.R --tumor aml --suffix 03 --datadir data/original/Shamir_lab --outdir data/tcga_aml
Rscript scripts/prepare_ACGT.R --tumor kidney --suffix 01 --datadir data/original/Shamir_lab --outdir data/tcga_kirc
Rscript scripts/prepare_BRCA.R --task ER --datadir data/original --outdir data/tcga_brca
Rscript scripts/prepare_BRCA.R --task subtypes --datadir data/original --outdir data/tcga_brca
```

This creates 10 TR/TS partitions, with IDs 0 to 9. To further partition them into the 10 TR/TS/TS2 splits described in the paper, with IDs 50 to 59 (any other IDs can be used), run in a shell:

```bash
for dataset in tcga_aml tcga_kirc; do
    python resplitter.py --datafolder data/$dataset --target OS --n_splits_start 0 --n_splits_end 10 --split_offset 50
done
for target in ER subtypes; do
    python resplitter.py --datafolder data/tcga_brca --target $target --n_splits_start 0 --n_splits_end 10 --split_offset 50
done
```

#### Input files

* Omics layer files: samples x features, tab-separated, with row & column names
* Labels file: one column, just the labels, no header (**same order as the data files** — see the check sketched below)
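Because label/sample alignment is purely positional, a quick shell check can catch mismatches early. A minimal sketch, with hypothetical example paths that follow the directory layout described in the next section:

```bash
# Minimal sanity check (paths are hypothetical examples following the
# layout described below): the number of samples in a layer file
# (rows minus the header) must equal the number of labels.
layer=data/tcga_brca/ER/0/gene_tr.txt
labels=data/tcga_brca/0/labels_ER_tr.txt
n_samples=$(( $(wc -l < "$layer") - 1 ))
n_labels=$(wc -l < "$labels")
echo "samples: ${n_samples}, labels: ${n_labels}"
```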
#### Example run

The INF pipeline is implemented with a [Snakefile](https://snakemake.readthedocs.io/en/stable/index.html). The following directory tree is required:

* `{datafolder}/{dataset}/{target}/{split_id}/{layer}_{tr,ts,ts2}.txt`
* `{datafolder}/{dataset}/{split_id}/labels_{target}_{tr,ts,ts2}.txt`
* `{outfolder}/{dataset}/{target}/{model}/{split_id}/{juxt,rSNF,rSNFi,single}` _(these will be created if not present)_

All the {variables} can be specified either in a config.yaml file (sketched at the end of this section) or on the command line. Example:

```bash
snakemake --config datafolder=data outfolder=results dataset=tcga_brca target=ER \
    layer1=gene layer2=cnv layer3=prot model=randomForest random=false split_id=0 -p
```

This example runs the pipeline on three omics layers of the BRCA-ER dataset. An arbitrary number of omics layers can be used by adding or removing `layer` arguments accordingly. A maximum number of cores can also be set (default is 1):

```bash
snakemake [--config etc.] --cores 12
```

The pipeline can be dry-run with the `-n` flag:

```bash
snakemake --cores 12 -n
```

*A bash script (`runner.sh`) is provided for convenience: it runs the pipeline for each split, computes the Borda of Bordas, and averages the metrics over all the splits.*
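For reference, the command-line configuration above can equivalently be kept in a `config.yaml`. This is a sketch that simply mirrors the config keys of that example (not independently checked against the Snakefile):

```yaml
# Hypothetical config.yaml mirroring the command-line example above
datafolder: data
outfolder: results
dataset: tcga_brca
target: ER
layer1: gene
layer2: cnv
layer3: prot
model: randomForest
random: false
split_id: 0
```

It would then be passed with Snakemake's standard `--configfile` option, e.g. `snakemake --configfile config.yaml --cores 12 -p`.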
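The per-split driving logic of such a convenience script amounts to a loop like the following sketch, assuming split IDs 50 to 59 from the resplitting step above; the actual `runner.sh` in the repository is the reference implementation:

```bash
# Hypothetical per-split driver; runner.sh in the repository is the
# reference implementation (it also computes the Borda of Bordas and
# averages the metrics over the splits).
for split_id in $(seq 50 59); do
    snakemake --config datafolder=data outfolder=results dataset=tcga_brca \
        target=ER layer1=gene layer2=cnv layer3=prot model=randomForest \
        random=false split_id=${split_id} --cores 12 -p
done
```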