Nucleic Acids Research, 2015
doi:10.1093/nar/gku1393
Patient-specific driver gene prediction and risk assessment through integrated network analysis of cancer omics profiles
Denis Bertrand 1,†, Kern Rei Chng 1,†, Faranak Ghazi Sherbaf 2,†, Anja Kiel 1, Burton K.H. Chia 1, Yee Yen Sia 2, Sharon K. Huang 3, Dave S.B. Hoon 3, Edison T. Liu 2,4, Axel Hillmer 2 and Niranjan Nagarajan 1,*
1 Computational and Systems Biology, Genome Institute of Singapore, Singapore 138672, Singapore, 2 Cancer Therapeutics and Stratified Oncology, Genome Institute of Singapore, Singapore 138672, Singapore, 3 Department of Molecular Oncology, John Wayne Cancer Institute, Santa Monica, CA 90404, USA and 4 The Jackson Laboratory for Genomic Medicine, Farmington, CT 06030, USA
Received September 18, 2014; Revised December 24, 2014; Accepted December 24, 2014
ABSTRACT
Extensive and multi-dimensional data sets generated from recent cancer omics profiling projects have presented new challenges and opportunities for unraveling the complexity of cancer genome landscapes. In particular, distinguishing the unique complement of genes that drive tumorigenesis in each patient from a sea of passenger mutations is necessary for translating the full benefit of cancer genome sequencing into the clinic. We address this need by presenting a data integration framework (OncoIMPACT) to nominate patient-specific driver genes based on their phenotypic impact. Extensive in silico and in vitro validation helped establish OncoIMPACT’s robustness, improved precision over competing approaches and verifiable patient and cell line specific predictions (2/2 and 6/7 true positives and negatives, respectively). In particular, we computationally predicted and experimentally validated the gene TRIM24 as a putative novel amplified driver in a melanoma patient. Applying OncoIMPACT to more than 1000 tumor samples, we generated patient-specific driver gene lists in five different cancer types to identify modes of synergistic action. We also provide the first demonstration that computationally derived driver mutation signatures can be overall superior to single gene and gene expression based signatures in enabling patient stratification and prognostication. Source code and executables for OncoIMPACT are freely available from sourceforge.net/projects/oncoimpact.
INTRODUCTION
In recent years, advances in genomic technologies have enabled the systematic generation of clinical cancer omics data at an unprecedented scale and rate, interrogating tumor biology at multiple levels: genomic, transcriptomic as well as epigenomic (1,2). Integrative mining of these clinically characterized, information-rich data sets is expected to provide deep insights into tumor biology and guide new efforts to develop cancer diagnostics and therapeutics (3,4). Recent studies have, however, highlighted the complexity of cancer genome landscapes in terms of somatic mutations, transcriptomic changes and epigenetic alterations, potentially confounding modeling, mining and integrative analysis of cancer omics data (5,6). While the complexity of cellular processes that link the different levels of changes in cancer cells may suggest the use of sophisticated systems biology (mechanistic or probabilistic) models for data integration, their utility can be hampered by the need to learn a large number of parameters from a limited number of patient samples (7). On the other hand, it is unclear if simpler models can adequately capture key features of the data and be used to obtain biologically relevant insights. Correspondingly, despite its importance, relatively few methods have been proposed that can model and integrate cancer omics data (8–11) and limitations in mining and interpretation continue to be a major barrier for their exploitation in clinical applications (3,12).
One of the fundamental challenges in the analysis and interpretation of cancer genomic data is to identify and distinguish functional (driver) mutations from the numerous non-functional (passenger) mutations that are found to populate cancer genomes (13,14). This problem has relevance not only for an understanding of tumor biology (in terms of characterizing oncogenes and tumor suppressors) but also from a clinical perspective where patient-specific driver genes hold significant value for defining therapeutic targets. While recent studies that have cataloged the frequency of mutations in genes based on a large number of patient samples have been quite successful in identifying the major oncogenes and tumor suppressors in a cancer subtype (15,16), these approaches are not well-suited for identifying rare drivers or patient-specific driver genes (14,17), even with the use of more sophisticated statistical approaches (18,19). An orthogonal approach that has been used with some success relies on the direct evaluation of evolutionary conservation and physicochemical properties to infer functional mutations (20,21), but these methods are restricted to point mutations and were found to lack accuracy due to a dependence on high-quality training data (22). Integration of mutation data with gene interaction networks has also been proposed as an approach to identify rare drivers, relying on the assumption that they will cluster on the network, but it has been limited to the analysis of point mutations (23–25).
* To whom correspondence should be addressed. Tel: +65 68088071; Fax: +65 68088292; Email: nagarajann@gis.a-star.edu.sg
† These authors contributed equally to the paper as first authors.
© The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Nucleic Acids Research Advance Access published January 8, 2015
A natural and powerful approach to assess the functional impact of mutations is to measure changes in gene expression patterns that can be attributed to them. When done without prior information about which genes interact, this association analysis requires a large number of samples and can potentially lead to many false positives (9,17). Alternatively, reconstructed interaction networks based on gene co-expression (26) or known molecular networks (8,10) have been exploited to better define informative associations. These methods come closer to integrative modeling of cancer omics data and have the potential advantage of providing biologically plausible hypotheses for candidate driver genes. In addition, these methods can be applied to a range of mutation classes, unlike several popular mutation-type restricted methods (e.g. CHASM (20), OncodriveFM (18) and PARADIGM-SHIFT (11)), thus allowing for a joint assessment of driver events and genes. They are, however, currently still limited to making aggregate predictions for a data set and are not designed to support the sample-specific analysis that would be key for defining personalized cancer management and therapy. An additional limitation in the field is that existing methods have not been shown to robustly analyze data from cancer cell lines, which are frequently used as in vitro models for pharmacological investigations (3) and can form the basis of a framework for personalized cancer therapy.
Tumor stratification and prognostication is another important end-goal for cancer genomic profiling and analysis (27) that is often considered independent of driver gene prediction, despite the two being potentially related objectives. A commonly used approach for tumor stratification is based on the clustering of gene expression profiles, even though its prognostic value has appeared limited at times and depends greatly on the adopted gene signature (28). Improved driver gene prediction should, in principle, be informative for tumor stratification as the identified mutated genes are likely causative events for carcinogenesis and metastasis. However, to our knowledge, this application has yet to be demonstrated by driver gene prediction algorithms, despite a report on whole-exome mutation profiles being useful for tumor stratification (29).
Advances in the capability to identify oncogenic drivers and to stratify tumors can potentially revolutionize personalized cancer therapy (3,27). To address existing methodological limitations, we developed a first-in-class algorithmic framework (OncoIMPACT) that nominates patient-specific driver genes by integratively modeling genomic mutations (point, structural and copy-number) and the resulting perturbations in transcriptional programs via defined molecular networks. Our benchmark analysis on large publicly available data sets from The Cancer Genome Atlas (TCGA) for several cancer subtypes revealed notable improvements over existing approaches in terms of precision and robustness for identifying driver genes. Furthermore, OncoIMPACT’s robustness on cell line data sets was confirmed using data from the Cancer Cell Line Encyclopedia (CCLE) (30) and we additionally provide direct experimental evidence using a patient-derived cancer cell line to showcase its potential in personalized medicine. Finally, we present the first demonstration for the use of a set of computationally identified driver genes as a mutational-status-based signature for tumor stratification and prognostication. Taken together, our results highlight the potential of computational methods in integrative modeling of cancer omics data for uncovering new insights into tumor biology, and their application in a clinical setting for stratification and personalized therapy.
MATERIALS AND METHODS
Design of a robust framework for patient-specific data integration
A natural framework to assess the impact of candidate driver mutations (genomic and epigenomic) is to use gene interaction networks to associate mutations with changes in cell state (e.g. transcriptome (8), proteome, epigenome or metabolome) and this is the approach adopted in the design of OncoIMPACT. For the sake of simplicity and due to its wide availability, we consider only transcriptomic changes in this study, though similar ideas as proposed here apply to other omics information as well. A key consideration in the design of OncoIMPACT is the ability to characterize the impact of mutations (non-synonymous Single Nucleotide Variations (SNVs), indels and Copy Number Variations (CNVs)) at a patient-specific level and for that purpose we propose an approach that associates mutations with modules of patient-specific deregulated genes on the network. Specifically, given a mutation in a patient we consider a deregulated gene in the patient as being explained if there is a short path (length less than a parameter L) of deregulated genes in the patient that connects it to the mutated gene in the interaction network. To account for promiscuous associations, we disallow paths that go through hub genes (with degree greater than a parameter D) in the network and identify deregulated genes as those that are significantly differentially expressed in cancer versus normal cells (false discovery rate corrected P-value < 0.05) and with a strong fold change (greater than a parameter F). The parameters in this framework (L, D and F) are directly determined using a statistical approach based on the interaction network and data sets used, as discussed in the next section. In order to cluster mutations and deregulated genes into relevant modules, we then define the notion of a phenotype gene as frequently explained (default ≥5% of patients) deregulated genes for a cancer subtype, where the phenotype genes serve to represent and nucleate modules as described in a later section. Finally, OncoIMPACT distinguishes passenger mutations from potential driver mutations by identifying those that explain phenotype genes and thus have a significant impact on the associated modules.
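As an illustration of the notion of an explained gene defined above, the following minimal Python sketch runs a breadth-first search from a mutated gene that is only allowed to step onto deregulated genes. It is not the OncoIMPACT implementation: the function name, the dictionary-based network representation and the handling of boundary cases (path length counted in edges, hub genes allowed as end points but not as intermediate nodes, the mutated gene itself exempt from the hub filter) are assumptions made for this example.

```python
from collections import deque


def explained_genes(network, mutated_gene, deregulated, max_len, max_degree):
    """Return the deregulated genes 'explained' by a mutated gene in one patient.

    network      : dict mapping gene -> set of neighbouring genes (undirected)
    deregulated  : set of genes deregulated in this patient
    max_len      : parameter L; a path must have fewer than max_len edges
    max_degree   : parameter D; genes with a higher degree are treated as hubs
    (All names and conventions here are illustrative assumptions.)
    """
    explained = set()
    visited = {mutated_gene}
    queue = deque([(mutated_gene, 0)])
    while queue:
        gene, dist = queue.popleft()
        # Assumption: a hub can be reached as an end point, but paths may
        # not continue through it.
        if gene != mutated_gene and len(network.get(gene, ())) > max_degree:
            continue
        # The next step would give a path of dist + 1 edges; it must stay < L.
        if dist + 1 >= max_len:
            continue
        for neighbour in network.get(gene, ()):
            if neighbour in visited or neighbour not in deregulated:
                continue
            visited.add(neighbour)
            explained.add(neighbour)
            queue.append((neighbour, dist + 1))
    return explained
```

Because the search proceeds in order of distance, each gene is first reached along a shortest admissible path, so the length constraint is checked against the best available path.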
A systematic approach to determine model parameters
While the parameters L and D are largely determined by the properties of the network, the fold-change parameter F could potentially interact with them to increase the number of spurious associations to a mutation. Under the assumption that with a suitable set of parameters, real data sets should have many more associations than random data sets, we use the following permutation-based approach to set parameters: (i) We generate random data sets by permuting gene labels for mutation and transcriptome data sets independently. Note that this procedure maintains the frequency distributions of mutated and deregulated genes across patients and within a patient, while destroying the association between mutated and deregulated genes. (ii) For each random data set (which contains the same number of patients as the real data set), we identify explained genes on the network and compute the distribution of the frequency with which a gene is explained. (iii) Aggregating this information across data sets, we compare it to the distribution for the real data set using the Jensen–Shannon divergence metric. (iv) A grid search over suitable ranges of parameter values is then used to set the parameters based on the choice that maximizes the Jensen–Shannon divergence from random data sets (default settings: L ∈ {2, 4, ..., 20}, D ∈ {10, 15, ..., 100} and F ∈ {1, 1.5, ..., 3}). To avoid extreme parameter choices, we ignore choices for which the median number of deregulated genes (across samples) is more than half the genes in the network or less than 300 genes. Our experiments with subsets of patients confirmed the robustness of the parameter inference procedure and the feasibility of running it on small data sets to reduce overall running time (Supplementary Figure S1).
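Steps (iii) and (iv) above can be sketched as follows. This is a simplified illustration rather than the published code: it assumes that, for every candidate (L, D, F) triple, the per-gene explained frequencies have already been computed and binned for the real data set and for the pooled random data sets, and it omits the filter on the median number of deregulated genes. The names js_divergence, to_distribution and select_parameters, and the layout of the candidates dictionary, are hypothetical.

```python
import math
from collections import Counter


def to_distribution(values):
    """Turn a list of (binned) explained frequencies into a probability dict."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mix(dist):
        # Kullback-Leibler divergence from dist to the mixture distribution.
        return sum(v * math.log2(v / mix[k]) for k, v in dist.items() if v > 0)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)


def select_parameters(candidates):
    """Pick the parameter triple whose real data differ most from random data.

    candidates: dict mapping (L, D, F) -> (real_values, random_values), where
    each entry is a list of binned per-gene explained frequencies
    (hypothetical layout, assumed for this sketch).
    """
    best_params, best_score = None, -1.0
    for params, (real_values, random_values) in candidates.items():
        score = js_divergence(to_distribution(real_values),
                              to_distribution(random_values))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (disjoint support), which makes scores comparable across grid points.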
Assessing the significance of phenotype genes
In order to identify statistically significant phenotype genes, we adopted a permutation-based testing framework to test each candidate. Specifically, we permuted gene labels for the mutations for each sample independently. The random data sets were then used to obtain an empirical null distribution (default = 500 data sets) for the frequency with which a gene is explained and compute P-values for observed frequencies (= probability of observing frequencies that exceed the observed frequency by chance). Corrections for multiple hypothesis testing were done using the method of Benjamini and Hochberg and a significance threshold of 0.1 was used in addition to the frequency threshold (default = 5%) to identify significant and meaningful phenotype genes for nucleating modules.
Distinguishing driver mutations from back-seat driver mutations
While the approach to individually assess the impact of mutations and to use their association with phenotype genes for distinguishing potential drivers from passengers works reasonably well, in situations where a strong driver deregulates many genes in the network, extraneous mutated genes in the neighborhood can get associated with a module. In order to distinguish such back-seat driver mutations, we applied a parsimony principle to identify a minimal set of drivers associated with phenotype genes. Encoding the patient-specific association of mutations with phenotype genes as sets (also implicitly defining a bipartite graph), this problem can be formulated as the classical Minimum Set Cover problem, a well-known NP-complete problem with a greedy O(log n) approximation algorithm. In OncoIMPACT, we implemented a version of this algorithm that iteratively selects the gene covering the largest number of uncovered phenotype genes, breaking ties by choosing the gene predicted as a driver in the largest number of patients. In the patient-specific mode, a mutated gene is considered as a driver in a patient only if it aided in covering a patient-specific phenotype gene (stringent mode), while in a more relaxed setting (sensitive mode; default) OncoIMPACT marks a potential driver gene as a back-seat driver only if it is so in all patients. Note that the stringent mode is particularly well suited for analyzing data sets where there is a high rate of false-positive mutations.
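The greedy set-cover step described above can be illustrated with a short Python sketch for a single patient. It is a minimal example under stated assumptions, not the OncoIMPACT code: the function name, the input layout and the final alphabetical tie-break (used only to make the result deterministic) are not taken from the paper.

```python
def greedy_driver_cover(associations, driver_counts=None):
    """Greedy approximation to Minimum Set Cover for one patient.

    associations  : dict mapping mutated gene -> set of phenotype genes it explains
    driver_counts : optional dict mapping gene -> number of patients in which the
                    gene is predicted as a driver (used to break ties)
    Returns the selected genes in order of selection; mutated genes that are
    never selected correspond to back-seat candidates in this patient.
    (Names and data layout are illustrative assumptions.)
    """
    driver_counts = driver_counts or {}
    uncovered = set().union(*associations.values())
    chosen = []
    while uncovered:
        # max() returns the first maximal element, so iterating over sorted
        # gene names gives an alphabetical tie-break of last resort.
        best = max(
            sorted(associations),
            key=lambda g: (len(associations[g] & uncovered),
                           driver_counts.get(g, 0)),
        )
        gained = associations[best] & uncovered
        if not gained:
            break  # defensive: nothing left that any gene can cover
        chosen.append(best)
        uncovered -= gained
    return chosen
```

Each iteration removes at least one phenotype gene from the uncovered set, so the loop terminates after at most as many rounds as there are phenotype genes.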
Construction of patient-specific gene modules for assessing mutational impact
The construction of patient-specific gene modules in OncoIMPACT allows us to obtain a more comprehensive measure of the impact/importance of a putative driver gene. To coalesce mutated genes and phenotype genes into modules we employ the following steps: (i) For each patient a driver gene defines a module composed of the set of explained genes associated with it. (ii) Modules of the same patient that share a phenotype gene are merged together. (iii) Deregulated genes that do not belong to paths between driver and phenotype genes are trimmed from modules. (iv) The patient-specific impact of a driver gene is computed as the sum of fold changes of genes that belong to its module and the overall impact is defined as the average patient-specific impact. Finally, OncoIMPACT orders predicted driver genes based on their impact value.
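Step (iv) and the final ranking can be sketched as below. The text does not specify whether fold changes are summed with or without sign, nor which patients enter the average; this example assumes absolute fold changes and averages over the patients in which the gene is a predicted driver. The function name and the nested-dictionary inputs are hypothetical, and module construction (steps i to iii) is taken as already done.

```python
def rank_drivers(modules, fold_change):
    """Rank driver genes by their average module impact across patients.

    modules     : dict patient -> dict driver gene -> set of genes in its module
    fold_change : dict patient -> dict gene -> fold change in that patient
    Assumptions (not from the paper): absolute fold changes are summed, and the
    average runs over patients in which the gene is a predicted driver.
    """
    impacts = {}
    for patient, drivers in modules.items():
        for driver, genes in drivers.items():
            # Patient-specific impact: sum of fold changes over the module.
            impact = sum(abs(fold_change[patient][g]) for g in genes)
            impacts.setdefault(driver, []).append(impact)
    overall = {d: sum(v) / len(v) for d, v in impacts.items()}
    # Highest impact first; gene name breaks ties deterministically.
    return sorted(overall.items(), key=lambda item: (-item[1], item[0]))
```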
Use of pre-computed information from large public data sets
OncoIMPACT is configured to run in two modes: (i) a database mode that allows it to determine parameter settings (L, D and F) and significant phenotype genes from the data sets provided and (ii) a discovery mode where information in the provided database is used to predict driver genes for each sample in an additional data set (which can be the same as the one used to create the database). In the discovery mode, identification of back-seat drivers is done by combining the database data sets with discovery data sets. Note that OncoIMPACT can be run by default in a combined database-plus-discovery setting on an input data set, while the discovery mode is useful to avoid computations when a pre-computed database is available. As part of the OncoIMPACT package, we provide databases constructed from all the TCGA data sets analyzed in this study, to enable easy integration with custom, in-house data sets for these cancer subtypes. New releases of OncoIMPACT will include additional subtype databases as well.
Patient stratification and survival analysis
Clusterings of driver gene profiles (binary 1–0 vectors) were computed using non-negative matrix factorization (NMF) based on consensus clustering using the R package ‘nmf’ (31). In order to produce robust clustering, the consensus clustering was obtained using 200 random runs of the NMF optimization algorithm. Kaplan–Meier curves were drawn for the clusters and log rank P-values computed using the R package ‘survival’ (29).
Data sets and networks
All TCGA data sets were downloaded from the TCGA data portal (...v/tcga/). OncoIMPACT analysis was restricted to samples for which information on point mutations, copy-number alterations and gene expression was available. Cell line data sets (47 ovarian and 41 glioma lines) were downloaded from the CCLE data portal (www.broadinstitute.org/ccle/home) and shRNA data from the Achilles data portal (www.broadinstitute.org/achilles; 24 ovarian cell lines with genomic and shRNA data). A detailed description of data parsing and pre-processing steps can be found in the Supplementary Text.
By default, OncoIMPACT uses the gene interaction network constructed by Wu et al. (32) (covering nearly 50% of the human proteome) for its analysis. This interaction network integrates information from known pathway databases (e.g. KEGG, NCI-Nature) as well as interactions derived from computational predictions (based on co-expression, protein domain interactions and shared gene ontology (GO) biological processes). However, OncoIMPACT can use other networks as input as well and our experiments with a manually curated network (33) suggest that while a less complete network can reduce its predictive power, its predictions are still typically better than a frequency-based approach (Supplementary Figure S2).
Genomic analysis of melanoma sample and functional validation using patient-derived cell line
Distant metastasis melanoma samples and the corresponding patient-derived cancer cell line were provided and established by the John Wayne Cancer Institute as previously described (34). Details of genome and transcriptome sequencing and analysis of the melanoma samples can be found in the Supplementary Text. Driver genes in the cell line derived from distant metastasis were validated using siRNA-mediated knockdown. Briefly, the patient-derived cell line was cultured in complete RPMI culture medium containing 10% fetal bovine serum and was kept at 37°C with 5% CO2. For the knockdown experiment, cells were incubated with 25 nM siRNA and Lipofectamine RNAiMAX (Life Technologies) at 37°C for 72 h. Active cell proliferation was detected using the Click-iT EdU Alexa Fluor 488 HTS Assay (Life Technologies). Fixation and staining of cells were performed according to the manufacturer’s instructions. The siRNAs used are tabulated in Supplementary Table S1. Dharmacon ON-TARGETplus Non-Targeting Control Pool was used as a negative control. The TaqMan primers for quantitative polymerase chain reaction were designed by and ordered commercially from Life Technologies.
RESULTS
An overview of OncoIMPACT’s algorithmic framework
OncoIMPACT is designed to integrate information regarding mutations (genomic and epigenomic), changes in cell state (e.g. transcriptome, proteome, epigenome or metabolome) and gene interaction networks to nominate and rank driver cancer mutations in a patient-specific manner (i.e. driver predictions are made for each patient; Figure 1a and Materials and Methods). Briefly, it does so by evaluating the impact of a mutation by associating it with modules of patient-specific deregulated genes through the gene interaction network (step 3 in Figure 1a). A key step in this process is the identification of sentinel phenotype genes that are frequently deregulated in a cancer subtype (but not typically mutated) and serve to distinguish relevant driver mutations from passengers (step 2 in Figure 1a). The association of mutations to phenotype genes is controlled by three parameters (maximum path length L, maximum gene connectivity D and a perturbation threshold F) that are determined in a data-driven fashion using a statistical maximization approach (step 1 in Figure 1a, b and Materials and Methods). To further differentiate true drivers from back-seat drivers, OncoIMPACT employs the parsimony principle to identify a minimal set of driver mutations for each patient (Figure 1c). Finally, the nominated patient-specific drivers are ranked based on their impact on associated modules. A detailed description for each of the steps in OncoIMPACT can be found in the Materials and Methods section.
Figure 1. A schematic representation of OncoIMPACT’s algorithmic framework. (a) Overview of OncoIMPACT’s workflow involving three main stages of data-processing. (b) Depiction of OncoIMPACT’s search through a multi-dimensional space to set network and expression parameters (F, fold change of genes; L, length of path; D, degree of nodes). (c) Parsimony-based matching of potential driver and phenotype genes in a bipartite graph to eliminate back-seat drivers. Solid and dashed lines indicate the association of potential driver genes to phenotype genes that were accepted and rejected, respectively.
OncoIMPACT nominates cancer drivers accurately and consistently
As existing methods for identifying driver genes are based on aggregate analysis over a large number of patients, we begin by comparing OncoIMPACT’s performance for this task against an aggregate network approach (DriverNet (8)) as well as a commonly used mutation frequency-based approach for ordering candidate drivers (35–39) (Frequency). Our experiments using large TCGA data sets (328 samples for Glioblastoma multiforme or GBM (1) and 316 for Ovarian Cancer (40)) indicate that OncoIMPACT can successfully integrate information regarding copy-number as well as point mutations and indels to highlight key driver genes across categories (Supplementary Tables S2 and S3). In contrast, a naive frequency-based approach seems to enrich for less known cancer driver genes (Supplementary Tables S2 and S3 and Supplementary File S1), e.g. the top gene on the Glioblastoma list is JARID1D instead of EGFR and both lists omit PIK3CA from the top 10. While results for DriverNet were more comparable to those from OncoIMPACT, DriverNet failed to identify several known oncogenes, such as NF1 and RB1, in ovarian cancer (40) and MDM4 in Glioblastoma (41) among others (Supplementary Table S4 and Supplementary File S1). To perform a more systematic comparison across methods, we used genes in the cancer gene census (CGC) (42) and a previously compiled pan-cancer driver list (43) as a proxy for potential drivers to assess the concordance/precision of the top driver genes reported for five different cancer types (GBM, Melanoma, Ovarian, Prostate and Bladder) (Figure 2a, Supplementary Figure S3). The results indicate a strong enrichment for potential true positive driver genes in OncoIMPACT’s predictions (Supplementary Figure S3b). For example, among the top 20 predictions in Glioblastoma, OncoIMPACT’s concordance is above 60% while the frequency-based approach and DriverNet are below 40%, suggesting that it is generally more accurate and less likely to be influenced by frequently mutated passengers. This trend was seen in all cancer types, except for the Melanoma data set where the lack of sufficient normal controls likely affected OncoIMPACT’s results in relation to DriverNet (Supplementary Figure S3).
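The concordance measure used in this comparison can be made concrete with a small Python sketch: the fraction of the top k predicted drivers that appear in a reference gene list such as the CGC. The function name and signature are assumptions for illustration, and the sketch is not tied to how the published figures were computed.

```python
def concordance_at_k(ranked_genes, reference_genes, k):
    """Fraction of the top-k ranked genes found in a reference driver list.

    ranked_genes    : list of predicted driver genes, best first
    reference_genes : set of genes used as a proxy for true drivers
    k               : number of top predictions to evaluate
    (Hypothetical helper written for this example.)
    """
    top = ranked_genes[:k]
    if not top:
        return 0.0
    hits = sum(1 for gene in top if gene in reference_genes)
    return hits / len(top)
```

Evaluating this quantity for a range of k values for each method gives a curve of concordance against the number of top predictions considered.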
We further tested the robustness of OncoIMPACT using a subsampling-based approach to compare predictions to those on the full data set of patients. Our results suggest that OncoIMPACT’s predictions are extremely stable even with very small sample sizes (∼20 patients), with more than 90% of reported drivers being found on the full data set (Figure 2b). In addition, OncoIMPACT can recover a sizable proportion of drivers using a relatively small subset of the data set (>70% with 50 patients; Figure 2b). Although both common drivers (>5% mutational frequency) and rare drivers (<5% mutational frequency) have high recovery, a higher fraction of common drivers is generally recovered, possibly due to the bias in the passenger filtering step in OncoIMPACT (Figure 2b; Materials and Methods). However, the recovery rates for rare and common drivers converge as the number of samples increases. The stability attribute of OncoIMPACT is likely to be a useful feature in the analysis of rare forms of cancers (e.g. cardiac tumors (44)), where the availability of samples is limited. In partic-