# Integrative Network Fusion (INF)
![INF pipeline ](figs/INF_pipeline.jpeg)

Repository attached to the article "Integrative Network Fusion: a multi-omics approach in molecular profiling".

**Authors**: Marco Chierici*, Nicole Bussola*, Alessia Marcolini*, Margherita Francescatto, Alessandro Zandonà, Lucia Trastulla, Claudio Agostinelli, Giuseppe Jurman, Cesare Furlanello.

## Setup
```bash
git clone https://gitlab.fbk.eu/MPBA/INF
cd INF
conda env create -f env.yml -n inf
conda activate inf
```

### Additional dependencies

#### R dependencies
To install the R dependencies (not in conda channels), run the following command via the R prompt:
```r
install.packages("TunePareto")
```

#### MLPY
To install `mlpy`, follow these instructions.

The `mlpy` package is required for some operations included in the DAP procedure.

The `mlpy` package available on PyPI is outdated and does not work on OSX platforms.

These are the steps to follow:

Let `<ANACONDA>` be your Anaconda path (e.g., `/home/user/anaconda3`) and `<ENV>` the name of your conda environment (e.g., `inf`, as created in the Setup section).

Set these environment variables:
```bash
export LD_LIBRARY_PATH=<ANACONDA>/envs/<ENV>/lib:${LD_LIBRARY_PATH}
export CPATH=<ANACONDA>/envs/<ENV>/include:${CPATH}
```

and then install `mlpy` from GitLab:
```bash
pip install git+https://gitlab.fbk.eu/MPBA/mlpy.git
```

## Data availability
All the data used to perform the experiments have been published on [Figshare](https://dx.doi.org/10.6084/m9.figshare.12052995.v1), both raw data (`original.tar.gz`) and preprocessed data (`tcga_aml.tar.gz`, `tcga_brca.tar.gz`, `tcga_kirc.tar.gz`).

## Usage

#### Quick start

To quickly reproduce our results, first download the preprocessed data from [Figshare](https://dx.doi.org/10.6084/m9.figshare.12052995.v1). Follow the instructions in the Setup section, then run in a shell:

```bash
for filename in *.tar.gz; do tar zxfv "$filename"; done
mkdir data
mv tcga* data
./runner.sh
``` 

#### Data splits generation

To recreate the 10 data splits, first run the following commands in a shell:

```bash
Rscript scripts/prepare_ACGT.R --tumor aml --suffix 03 --datadir data/original/Shamir_lab --outdir data/tcga_aml
Rscript scripts/prepare_ACGT.R --tumor kidney --suffix 01 --datadir data/original/Shamir_lab --outdir data/tcga_kirc
Rscript scripts/prepare_BRCA.R --task ER --datadir data/original --outdir data/tcga_brca
Rscript scripts/prepare_BRCA.R --task subtypes --datadir data/original --outdir data/tcga_brca
```

This creates 10 TR/TS partitions, with IDs 0 to 9. To further partition them into the 10 TR/TS/TS2 splits described in the paper, with IDs 50 to 59 (you can use any other IDs), run in a shell:

```bash
for dataset in tcga_aml tcga_kirc; do
    python resplitter.py --datafolder data/$dataset --target OS --n_splits_start 0 --n_splits_end 10 --split_offset 50
done

for target in ER subtypes; do
    python resplitter.py --datafolder data/tcga_brca --target $target --n_splits_start 0 --n_splits_end 10 --split_offset 50
done
```
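
The renumbering implied by the options above can be illustrated with a short sketch. It assumes that `--n_splits_end` is exclusive and that `--split_offset` is simply added to each source split ID; both are inferred from the "0 to 9" and "50 to 59" ranges in the text, not from the `resplitter.py` source. The function name `offset_split_ids` is hypothetical.

```python
def offset_split_ids(n_splits_start, n_splits_end, split_offset):
    """Map each source split ID in [start, end) to its new ID (source + offset).

    Assumption: the end bound is exclusive and the offset is additive.
    """
    return {i: i + split_offset for i in range(n_splits_start, n_splits_end)}


mapping = offset_split_ids(0, 10, 50)
# Show which new ID each of the first and last source splits receives.
print(mapping[0], mapping[9])
```

Under these assumptions, choosing a different `--split_offset` only shifts the IDs of the new splits and leaves the original TR/TS partitions untouched.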

#### Input files

* Omics layer files: samples x features, tab-separated, with row & column names
* Labels file: one column, just the labels, no header (**same order as the data files**)
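
A quick sanity check of these two formats could look like the sketch below. It uses only the Python standard library and is not part of the INF code base; the function names (`count_samples`, `read_labels`, `check_inputs`) are hypothetical. It assumes the first row of a layer file holds the column names and every following non-empty row is one sample.

```python
import csv
from pathlib import Path


def count_samples(layer_path):
    """Count data rows in a tab-separated layer file (first row = column names)."""
    with open(layer_path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    return len(rows) - 1  # skip the header row


def read_labels(labels_path):
    """Read a headerless, one-column labels file into a list of strings."""
    text = Path(labels_path).read_text()
    return [line.strip() for line in text.splitlines() if line.strip()]


def check_inputs(layer_path, labels_path):
    """Raise ValueError if the layer and labels files disagree on sample count."""
    n_samples = count_samples(layer_path)
    n_labels = len(read_labels(labels_path))
    if n_samples != n_labels:
        raise ValueError(
            f"{layer_path}: {n_samples} samples, but {labels_path} has {n_labels} labels"
        )
    return n_samples
```

This only compares counts. Because the labels file carries no sample names, the ordering requirement cannot be verified from the files alone and has to be guaranteed when the files are produced.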

#### Example run

The INF pipeline is implemented with a [Snakefile](https://snakemake.readthedocs.io/en/stable/index.html).

The following directory tree is required:

* `{datafolder}/{dataset}/{target}/{split_id}/{layer}_{tr,ts,ts2}.txt`
* `{datafolder}/{dataset}/{split_id}/labels_{target}_{tr,ts,ts2}.txt`
* `{outfolder}/{dataset}/{target}/{model}/{split_id}/{juxt,rSNF,rSNFi,single}` _(these will be created if not present)_
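
To see which input files these patterns expand to for a given configuration, a small helper along the following lines may be useful. It is a sketch that follows the two input patterns exactly as written above; the function names are hypothetical and it does not reflect how the Snakefile itself resolves paths.

```python
from pathlib import Path

PARTS = ("tr", "ts", "ts2")


def layer_file(datafolder, dataset, target, split_id, layer, part):
    """Path of one omics layer file; `part` is one of 'tr', 'ts', 'ts2'."""
    return Path(datafolder) / dataset / target / str(split_id) / f"{layer}_{part}.txt"


def labels_file(datafolder, dataset, target, split_id, part):
    """Path of one labels file, following the second pattern as written above."""
    return Path(datafolder) / dataset / str(split_id) / f"labels_{target}_{part}.txt"


def expected_inputs(datafolder, dataset, target, split_id, layers):
    """List every input file the patterns imply for one split."""
    paths = []
    for part in PARTS:
        for layer in layers:
            paths.append(layer_file(datafolder, dataset, target, split_id, layer, part))
        paths.append(labels_file(datafolder, dataset, target, split_id, part))
    return paths


if __name__ == "__main__":
    # Report which of the expected files are present on disk.
    for path in expected_inputs("data", "tcga_brca", "ER", 0, ["gene", "cnv", "prot"]):
        status = "ok" if path.exists() else "MISSING"
        print(f"{status:8}{path}")
```

Running it before launching the pipeline gives a list of missing files for the chosen dataset, target, split and layers.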

All the `{variables}` can be specified either in a `config.yaml` file or on the command line.

Example:

```bash
snakemake --config datafolder=data outfolder=results dataset=tcga_brca target=ER \
    layer1=gene layer2=cnv layer3=prot model=randomForest random=false split_id=0 -p 
```

This example runs the pipeline on three omics layers from the BRCA-ER dataset. You can use an arbitrary number of omics layers by adding or removing `layer` arguments accordingly.

A maximum number of cores can also be set (default is 1):

```bash
snakemake [--config etc.] --cores 12
```

The pipeline can be "dry-run" using the `-n` flag:

```bash
snakemake --cores 12 -n
```

*A bash script (`runner.sh`) is provided for convenience, in order to run the pipeline for each split, to compute Borda of Bordas and to average metrics for all the splits.*