Stable solution to l2,1-based robust inductive matrix completion and its application in linking long noncoding RNAs to human diseases

BMC Med Genomics. 2017 Dec 28;10(Suppl 5):77. doi: 10.1186/s12920-017-0310-1.

Abstract

Background: A large number of long intergenic non-coding RNAs (lincRNAs) are linked to a broad spectrum of human diseases, but the disease associations of many other lincRNAs remain a puzzle. Validating such links through biological experiments is expensive. However, a plethora of lincRNA data is now available, thanks to High Throughput Sequencing (HTS) platforms, Genome Wide Association Studies (GWAS), etc., which opens the opportunity for cutting-edge machine learning and data mining approaches to extract meaningful relationships between lincRNAs and diseases. Nevertheless, only a few in silico lincRNA-disease association inference tools are available to date, and none of them utilizes side information about both entities simultaneously in a single framework.

Methods: The recently developed Inductive Matrix Completion (IMC) technique provides a recommendation platform between two entities that takes the side information about each of them into account. However, the standard IMC formulation cannot handle noise and outliers that may be present in the datasets, and data sparsity is a further issue. Thus, a robust version of IMC is needed that addresses both issues. As a remedy, in this paper we propose Stable Robust Inductive Matrix Completion (SRIMC), which uses l2,1-norm-based regularization to optimize the objective function with a unique 2-step stable solution approach.
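
As a rough illustration of the two building blocks named above, the sketch below computes an l2,1 norm and an IMC-style bilinear score matrix. It is not the SRIMC objective or solver, which the abstract does not spell out. It assumes the common row-wise convention for the l2,1 norm (sum of the Euclidean norms of the rows) and the general IMC form X W H^T Y^T; the names `l21_norm`, `imc_scores`, `X`, `Y`, `W` and `H` are hypothetical and chosen only for this example.

```python
import numpy as np


def l21_norm(M):
    """Sum of the Euclidean norms of the rows of M (row-wise convention assumed)."""
    return float(np.sum(np.sqrt(np.sum(M ** 2, axis=1))))


def imc_scores(X, W, H, Y):
    """IMC-style bilinear scores S = X W H^T Y^T.

    X: (n_lincRNAs, f_x) side-information features of the lincRNAs (assumed layout)
    Y: (n_diseases, f_y) side-information features of the diseases (assumed layout)
    W: (f_x, r) and H: (f_y, r) low-rank factors to be learned
    """
    return X @ W @ H.T @ Y.T


# Rows have Euclidean norms 5, 0 and 10, so the l2,1 norm is 15.
M = np.array([[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]])
print(l21_norm(M))

# Toy shapes: 4 lincRNAs with 3 features, 6 diseases with 5 features, rank 2.
S = imc_scores(np.ones((4, 3)), np.ones((3, 2)), np.ones((5, 2)), np.ones((6, 5)))
print(S.shape)
```

Because each row contributes its unsquared Euclidean norm, a single grossly corrupted row adds to such a term only linearly, which is the usual motivation for l2,1-based formulations when outliers are a concern.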

Results: We applied SRIMC to the available association data between human lincRNAs and OMIM disease phenotypes as well as a diverse set of side information about the lincRNAs and the diseases. The method performs better than the state-of-the-art methods in terms of precision@k and recall@k for top-k disease prioritization for the subject lincRNAs. We also demonstrate that SRIMC is equally effective for querying novel lincRNAs, as well as for predicting the rank of a newly known disease for a set of well-characterized lincRNAs.
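
For readers unfamiliar with the two metrics, the following minimal sketch shows the standard definitions of precision@k and recall@k for a single lincRNA. The abstract does not describe the exact evaluation protocol (for example, how scores are averaged across lincRNAs), so this is only an assumption-level illustration; the function name and the disease identifiers are hypothetical.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Return (precision@k, recall@k) for one query.

    ranked:   disease identifiers ordered from highest to lowest predicted score
    relevant: set of diseases known to be associated with the lincRNA
    """
    top_k = ranked[:k]
    hits = sum(1 for disease in top_k if disease in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranking for one lincRNA; two of its three known diseases are ranked.
ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}

p3, r3 = precision_recall_at_k(ranked, relevant, 3)  # one hit (d1) in the top 3
p5, r5 = precision_recall_at_k(ranked, relevant, 5)  # two hits (d1, d2) in the top 5
print(p3, r3, p5, r5)
```

In this toy case precision@3 and recall@3 are both 1/3, while at k = 5 precision is 2/5 and recall is 2/3.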

Conclusions: With the experimental results and computational evaluation, we show that SRIMC is robust in handling datasets with noise and outliers as well as dealing with novel lincRNAs and disease phenotypes.

Keywords: Association inference; Human disease phenotypes; Inductive learning; Long noncoding RNA; Matrix completion.

MeSH terms

  • Algorithms
  • Computational Biology / methods*
  • Disease / genetics*
  • Genome-Wide Association Study
  • Humans
  • RNA, Long Noncoding / genetics*

Substances

  • RNA, Long Noncoding