# Integrative Network Fusion (INF)
![INF pipeline](figs/INF_pipeline.jpeg)

Repository attached to the article "Integrative Network Fusion: a multi-omics approach in molecular profiling".

**Authors**: Marco Chierici*, Nicole Bussola*, Alessia Marcolini*, Margherita Francescatto, Alessandro Zandonà, Lucia Trastulla, Claudio Agostinelli, Giuseppe Jurman, Cesare Furlanello.

## Setup
```bash
git clone https://gitlab.fbk.eu/MPBA/INF
cd INF
conda env create -f env.yml -n inf
conda activate inf
```

### Additional dependencies

#### R dependencies
To install the R dependencies that are not available through conda channels, run the following command in an R prompt:
```r
install.packages("TunePareto")
```

#### MLPY

The `mlpy` package is required for some operations in the DAP procedure. The version available on PyPI is outdated and does not work on macOS.

To install `mlpy`, follow these steps.

Let `<ANACONDA>` be your Anaconda path (e.g., `/home/user/anaconda3`) and `<ENV>` the name of your conda environment (e.g., `inf`).

Adjust these environment variables:
```bash
export LD_LIBRARY_PATH=<ANACONDA>/envs/<ENV>/lib:${LD_LIBRARY_PATH}
export CPATH=<ANACONDA>/envs/<ENV>/include:${CPATH}
```

and then install `mlpy` from GitLab:
```bash
pip install git+https://gitlab.fbk.eu/MPBA/mlpy.git
```

## Data availability
All the data used in the experiments has been published on [Figshare](http://dx.doi.org/10.6084/m9.figshare.12052995.v1): both the raw data (`original.tar.gz`) and the preprocessed data (`tcga_aml.tar.gz`, `tcga_brca.tar.gz`, `tcga_kirc.tar.gz`).
## Usage

#### Quick start

To quickly reproduce our results, first download the preprocessed data from [Figshare](http://dx.doi.org/10.6084/m9.figshare.12052995.v1). After following the instructions in the Setup section, run in a shell:

```bash
for filename in *.tar.gz; do tar zxvf "$filename"; done
mkdir data
mv tcga* data
./runner.sh full
```

#### Data splits generation

To recreate the 10 data splits, first run the following commands in a shell:

```bash
Rscript scripts/prepare_ACGT.R --tumor aml --suffix 03 --datadir data/original/Shamir_lab --outdir data/tcga_aml
Rscript scripts/prepare_ACGT.R --tumor kidney --suffix 01 --datadir data/original/Shamir_lab --outdir data/tcga_kirc
Rscript scripts/prepare_BRCA.R --task ER --datadir data/original --outdir data/tcga_brca
Rscript scripts/prepare_BRCA.R --task subtypes --datadir data/original --outdir data/tcga_brca
```

This creates 10 TR/TS partitions, with IDs 0 to 9. To further partition them into the 10 TR/TS/TS2 splits described in the paper, with IDs 50 to 59 (you can use any other IDs), run in a shell:

```bash
for dataset in tcga_aml tcga_kirc; do
    python resplitter.py --datafolder data --dataset $dataset --target OS --n_splits_start 0 --n_splits_end 10 --split_offset 50
done

for target in ER subtypes; do
    python resplitter.py --datafolder data --dataset tcga_brca --target $target --n_splits_start 0 --n_splits_end 10 --split_offset 50
done
```

#### Input files

* Omics layer files: samples x features, tab-separated, with row and column names
* Labels file: a single column with the labels only, no header (**rows in the same order as the data files**)
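The expected formats can be sketched as follows (file and feature names here are purely illustrative):

```bash
# Omics layer: tab-separated, first row = feature names, first column = sample IDs
printf 'sample\tgeneA\tgeneB\nS1\t0.12\t1.30\nS2\t0.98\t0.45\n' > gene_tr.txt

# Labels: one label per row, no header, same sample order as the data file
printf '0\n1\n' > labels_ER_tr.txt

# Sanity check: number of samples must match the number of labels
n_data=$(( $(wc -l < gene_tr.txt) - 1 ))  # subtract the header row
n_labels=$(wc -l < labels_ER_tr.txt)
[ "$n_data" -eq "$n_labels" ] && echo "sample/label counts match"
```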

#### Example run

The INF pipeline is implemented with a [Snakefile](https://snakemake.readthedocs.io/en/stable/index.html).

The following directory tree is required:

* `{datafolder}/{dataset}/{target}/{split_id}/{layer}_{tr,ts,ts2}.txt`
* `{datafolder}/{dataset}/{split_id}/labels_{target}_{tr,ts,ts2}.txt`
* `{outfolder}/{dataset}/{target}/{model}/{split_id}/{juxt,rSNF,rSNFi,single}` _(these will be created if not present)_
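For a single split, this layout can be sketched as follows (the dataset, target, and layer names are placeholders, chosen to match the BRCA-ER example used in this README):

```bash
# Illustrative layout for datafolder=data, dataset=tcga_brca, target=ER,
# split_id=0 and one "gene" layer (all names are placeholders)
datafolder=data; dataset=tcga_brca; target=ER; split_id=0; layer=gene

mkdir -p "$datafolder/$dataset/$target/$split_id" "$datafolder/$dataset/$split_id"
for part in tr ts ts2; do
    # omics layer files live under the target subfolder...
    touch "$datafolder/$dataset/$target/$split_id/${layer}_${part}.txt"
    # ...while label files sit one level up, keyed by target in the file name
    touch "$datafolder/$dataset/$split_id/labels_${target}_${part}.txt"
done
```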

All the `{variables}` can be specified either in a `config.yaml` file or on the command line.
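For instance, the settings used in the command-line example below could instead be kept in a `config.yaml` (a sketch; the key names are assumed to mirror the `--config` arguments, so check the Snakefile for the authoritative list):

```bash
# Write a hypothetical config.yaml mirroring the --config options
cat > config.yaml <<'EOF'
datafolder: data
outfolder: results
dataset: tcga_brca
target: ER
layer1: gene
layer2: cnv
layer3: prot
model: randomForest
random: false
split_id: 0
EOF
```

With the file in place, the run reduces to `snakemake -s Snakefile_full --configfile config.yaml -p`, using Snakemake's standard `--configfile` option.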

Example:

```bash
snakemake -s Snakefile_full --config datafolder=data outfolder=results dataset=tcga_brca target=ER \
    layer1=gene layer2=cnv layer3=prot model=randomForest random=false split_id=0 -p 
```

This example runs the pipeline on three omics layers of the BRCA-ER dataset. You can use an arbitrary number of omics layers by adding or removing `layer` arguments accordingly.

A maximum number of cores can also be set (default is 1):

```bash
snakemake [--config etc.] --cores 12
```

The pipeline can be "dry-run" using the `-n` flag:

```bash
snakemake --cores 12 -n
```

*For convenience, a bash script (`runner.sh`) is provided to run the pipeline for each split, compute the Borda of Bordas, and average the metrics over all splits.*

Depending on the kind of SNF-DAP you want (either a "full DAP" or an "accelerated DAP"; see the article for details), run:
```bash
./runner.sh full
```
to execute a "full DAP" and:
```bash
./runner.sh accelerated
```
to execute an "accelerated DAP".

Note that if you don't provide any argument, a "full DAP" is executed by default.