Simulate TCGA ovarian cancer data


In this tutorial, we will show how to simulate multi-omics data using OmicsSIMLA based on the ovarian cancer (OV) data from the TCGA project. The same process was used to simulate the OV dataset in the OmicsSIMLA manuscript. Assume you have downloaded the OV profiles from here and installed the profiles in /home/user/profile. There should be four folders cnv, gene_expression, methylation, and protein as well as a pd_sites.txt file under the profile folder.


Command: ./OmicsSIMLA -eqtm_site_block 2 -profile_path /home/user/profile -mregion 1:10000-250000000 -mtype OV -batch 50 -cc 500,500 -deeqtm_fc 0.6,0.6,0.6 -geneexpfile /home/user/profile/gene_expression/gene_exp_profile.txt -num_ee_genes 12000 -eQTM_DE_p 1 -max_eqtm_per_block 1 -beta_var 0 -eQTM_par 50,20 -num_de_genes 1 -fc 2 -cnvfile /home/user/profile/cnv/cnv_profile.txt -proteinfile /home/user/profile/protein/protein_profile.txt -p_diff_phase_meth 0 -p_diff_phase_unmeth 0 -p_diff_phase_fuzzy 0 -pdsitefile /home/user/profile/pd_sites.txt --no-shuffle -folder sim -thread 20

This will simulate CNVs, CpGs with proportions of methylated reads, RNA-seq read counts, and normalized protein expression levels in 50 batches of samples, where each batch contains 500 cases (i.e., short-term survival) and 500 controls (i.e., long-term survival). Our simulated data using the above command which we analyzed in the OmicsSIMLA paper can be download here.

Details about the command:
General options:
-folder sim
The simulation results will be generated under a folder named "sim."

-batch 50
50 batches of simulated files will be generated.

-cc 500,500
500 unrelated cases and 500 unrelated controls will be simulated in each batch.

-thread 20
OmicsSIMLA will perform the simulations in parallel using 20 threads.

CNV related options:
-cnvfile /home/user/profile/cnv/cnv_profile.txt
The option specifies the path to the cnv profile, which contains the frequencies of deletions and duplications of 2,884 focal CNVs.

Methylation related options:
-profile_path /home/user/profile
The path to the methylation folder where the methylation profiles are placed.

-mregion 1:10000-250000000
CpGs on chromosome 1 will be simulated.

-mtype OV
The tissue type for the methylation data is specified as OV (ovarian cancer).

-beta_var 0
No biological and technical variations will be simulated for the methylation rates. The rates will be simulated based on the rates specified in the profile.

-eqtm_site_block 2
CpG(s) in the second methylation block in the methylation profile will be specified as eQTM(s).

-max_eqtm_per_block 1
The second methylation block will have a maximum of 1 CpG as an eQTM.

-eQTM_par 50,20
The β parameter to model the relationship between the eQTM and gene expression levels is assumed to have a mean of 50 and standard deviation of 20.

-eQTM_DE_p 1
The probability of the eQTM which will cause differential gene expression is 1.

-deeqtm_fc 0.6,0.6,0.6
The eQTM will cause differential gene expression at three genes. The fold changes of the expression levels for the three genes are all 0.6 in cases relative to those in controls.

-pdsitefile /home/user/profile/pd_sites.txt
The eQTM (the third CpG in the profile) will also have differential methylation rates between cases and controls. The difference is specified as 0.003 in pd_sites.txt.

-p_diff_phase_meth 0
-p_diff_phase_unmeth 0
-p_diff_phase_fuzzy 0
Since we just want the eQTM to be the only CpG with differential methylation rates, the three probabilities are specified as 0 so that no other CpGs will have differential methylation rates.

Gene expression related options:
-geneexpfile /home/user/profile/gene_expression/gene_exp_profile.txt
The option specifies the path to the gene expression profile.

-num_de_genes 1
We will simulate a gene with differential expression between cases and controls. This gene will exhibit an independent effect (not regulated by an eQTM).

-fc 2
The fold change for the expression of the gene specified in -num_de_genes.

-num_ee_genes 12000
We will simulate 12,000 genes with no difference in expression between cases and controls (i.e., the null genes).

--no-shuffle
The gene expression profiles will not be shuffled. That means the gene expression levels will be simulated based on the profiles of genes in the same order as specified in the profile file.

Protein expression related options:
-proteinfile /home/user/profile/protein/protein_profile.txt
The option specifies the path to the protein expression profile.

Note that the same number of genes (i.e., 12,004 genes) with protein expression levels as the number of genes with gene expression levels will be simulated. In our analysis with ATHENA, we extracted the first 200 genes in the simulated protein expression files.

Output files:
*.cnv file:
The cnv file should contain the CNV information for 2,884 CNVs.

Methylation_summary.txt and *.methy files:
2,753 CpG sites on chromosome 1 are simulated. The 3rd site is the eQTM, which can be seen in the fourth column in Methylation_summary.txt. The difference in the methylation rate between case and controls can be seen in the fifth column in Methylation_summary.txt. *.methy contain the methylated and unmethylated read counts for each of the 2,753 CpGs.

*.exp file:
The header of a *.exp file should look like this:
FAM IID FID MID SEX AFF DE eQTM_DE eQTM_DE eQTM_DE EE EE EE EE ….

The DE column corresponds to the expression of the gene generated by the -num_de_genes option. The three dQTM_DE columns correspond to the expression of the three genes influenced by the eQTM.

*.nor_exp file:
This file has the normalized gene expression levels and has the same format as the *.exp file. We used the normalized gene expression levels in the ATHENA analysis.

*.protein file:
The file has the normalized protein expression levels for the 12,004 genes, corresponding to the genes simulated from gene expression. We extracted the first 200 genes in the simulated protein expression files in the ATHENA analysis.