A comparison of survival analysis methods for cancer gene expression RNA-Sequencing data

Pichai Raman; Samuel Zimmerman; Komal S Rathi; Laurence de Torrenté; Mahdi Sarmady; Chao Wu; Jeremy Leipzig; Deanne M Taylor; Aydin Tozeren; Jessica C Mar

doi:10.1016/j.cancergen.2019.04.004

Cancer Genet. 2019 Jun:235-236:1-12. doi: 10.1016/j.cancergen.2019.04.004. Epub 2019 Apr 12.

Affiliations

¹ School of Biomedical Engineering, Sciences and Health Systems, Drexel University, Philadelphia, PA, United States; Department of Biomedical and Health Informatics, The Children's Hospital of Philadelphia, Philadelphia, PA, United States; Center for Data-Driven Discovery in Biomedicine, Children's Hospital of Philadelphia, Philadelphia, PA, United States. Electronic address: ramanp@email.chop.edu.
² Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY, United States. Electronic address: sezimmer@einstein.yu.edu.
³ Department of Biomedical and Health Informatics, The Children's Hospital of Philadelphia, Philadelphia, PA, United States; Center for Data-Driven Discovery in Biomedicine, Children's Hospital of Philadelphia, Philadelphia, PA, United States. Electronic address: rathik@email.chop.edu.
⁴ Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY, United States. Electronic address: ldetorrente@nygenome.org.
⁵ Division of Genomic Diagnostics, Department of Pathology and Laboratory Medicine, The Children's Hospital of Philadelphia, Philadelphia, PA, United States; Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States. Electronic address: sarmadym@email.chop.edu.
⁶ Division of Genomic Diagnostics, Department of Pathology and Laboratory Medicine, The Children's Hospital of Philadelphia, Philadelphia, PA, United States. Electronic address: wuc8@email.chop.edu.
⁷ Division of Genomic Diagnostics, Department of Pathology and Laboratory Medicine, The Children's Hospital of Philadelphia, Philadelphia, PA, United States; College of Computing and Informatics, Drexel University, Philadelphia, PA, United States. Electronic address: leipzig@panoramamedicine.com.
⁸ Department of Biomedical and Health Informatics, The Children's Hospital of Philadelphia, Philadelphia, PA, United States; The Department of Pediatrics, The University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, United States. Electronic address: taylordm@email.chop.edu.
⁹ School of Biomedical Engineering, Sciences and Health Systems, Drexel University, Philadelphia, PA, United States. Electronic address: at62@drexel.edu.
¹⁰ Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY, United States; Department of Epidemiology and Population Health, Albert Einstein College of Medicine, Bronx, NY, United States; Australian Institute for Bioengineering and Nanotechnology, The University of Queensland, Brisbane, Australia. Electronic address: j.mar@uq.edu.au.

PMID: 31296308
DOI: 10.1016/j.cancergen.2019.04.004

Abstract

Identifying genetic biomarkers of patient survival remains a major goal of large-scale cancer profiling studies. Using gene expression data to predict the outcome of a patient's tumor makes biomarker discovery a compelling tool for improving patient care. As genomic technologies expand, multiple data types may serve as informative biomarkers, and bioinformatic strategies have evolved around these different applications. For categorical variables such as a gene's mutation status, identifying biomarkers that predict survival time is straightforward. For continuous variables such as gene expression, however, the available methods generate highly variable results, and studies on best practices are lacking. We investigated the performance of eight methods that deal specifically with continuous data: k-means, Cox regression, concordance index (C-index), D-index, 25th-75th percentile split, median split, distribution-based splitting, and KaplanScan. The methods were applied to four RNA-sequencing (RNA-seq) datasets from The Cancer Genome Atlas (TCGA). The reliability of the eight methods was assessed by splitting each dataset into two groups and comparing the overlap of the results. Gene sets that had been identified from the literature for a specific tumor type served as positive controls to assess the accuracy of each biomarker using receiver operating characteristic (ROC) curves. Artificial RNA-seq data were generated to test the robustness of these methods under fixed levels of gene expression noise. Our results show that methods based on dichotomizing expression values tend to perform poorly, while C-index, D-index, and k-means perform well in most settings. Overall, the Cox regression method had the strongest performance based on tests of accuracy, reliability, and robustness.
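
To make the contrast between continuous and dichotomized approaches concrete, the following is a minimal illustrative sketch in plain Python. It is not the authors' implementation. The function names (`concordance_index`, `median_split`) are hypothetical, the six-patient dataset is invented toy data rather than TCGA values, and pairs with tied survival times are simply skipped as a simplifying assumption. It computes Harrell's concordance index once on a continuous expression value and once on the same value after a median split.

```python
from itertools import combinations
from statistics import median


def concordance_index(times, events, scores):
    """Harrell's C-index: among comparable patient pairs, the fraction in which
    the patient with the higher score has the shorter survival time.
    Score ties count as 0.5. Pairs with tied times are skipped (a simplification)."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            # The earlier time is censored, so the ordering of the pair is unknown.
            continue
        comparable += 1
        if scores[first] > scores[second]:
            concordant += 1.0
        elif scores[first] == scores[second]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def median_split(expression):
    """Dichotomize expression: True for patients strictly above the median."""
    cutoff = median(expression)
    return [value > cutoff for value in expression]


# Invented toy data for illustration only (not from TCGA).
times = [2, 4, 6, 8, 10, 12]                   # follow-up time, e.g. months
events = [1, 1, 0, 1, 1, 0]                    # 1 = death observed, 0 = censored
expression = [9.1, 7.5, 8.0, 5.2, 4.4, 3.0]    # expression of a hypothetical risk gene

# C-index using the continuous expression value as the risk score.
c_index_continuous = concordance_index(times, events, expression)

# C-index after collapsing expression to a high/low label.
high_group = median_split(expression)
c_index_split = concordance_index(times, events, [1.0 if h else 0.0 for h in high_group])

print(f"continuous: {c_index_continuous:.3f}, median split: {c_index_split:.3f}")
```

On this toy example the dichotomized score yields a lower C-index than the continuous one, because every patient within the high or low group becomes tied. This only illustrates the kind of information loss a split can cause; it does not reproduce the study's results.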

Keywords: Cancer; Gene expression; Kaplan–Meier; Survival analysis; TCGA.

Publication types

Comparative Study
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Base Sequence
Biomarkers, Tumor / genetics
Data Interpretation, Statistical
Gene Expression Profiling / methods*
Gene Expression Regulation, Neoplastic / genetics*
Humans
Kaplan-Meier Estimate
Neoplasms / genetics*
Neoplasms / mortality*
Prognosis
Proportional Hazards Models
ROC Curve
Sequence Analysis, RNA / methods
Survival Analysis

Substances

Biomarkers, Tumor