Big data in medical science--a biostatistical view

Dtsch Arztebl Int. 2015 Feb 27;112(9):137-42. doi: 10.3238/arztebl.2015.0137.

Abstract

Background: Inexpensive techniques for measurement and data storage now enable medical researchers to acquire far more data than can conveniently be analyzed by traditional methods. The expression "big data" refers to quantities on the order of magnitude of a terabyte (1012 bytes); special techniques must be used to evaluate such huge quantities of data in a scientifically meaningful way. Whether data sets of this size are useful and important is an open question that currently confronts medical science.

Methods: In this article, we give illustrative examples of the use of analytical techniques for big data and discuss them in the light of a selective literature review. We point out some critical aspects that should be considered to avoid errors when large amounts of data are analyzed.

Results: Machine learning techniques enable the recognition of potentially relevant patterns. When such techniques are used, certain additional steps should be taken that are unnecessary in more traditional analyses; for example, patient characteristics should be differentially weighted. If this is not done as a preliminary step before similarity detection, which is a component of many data analysis operations, characteristics such as age or sex will be weighted no higher than any one out of 10 000 gene expression values. Experience from the analysis of conventional observational data sets can be called upon to draw conclusions about potential causal effects from big data sets.

Conclusion: Big data techniques can be used, for example, to evaluate observational data derived from the routine care of entire populations, with clustering methods used to analyze therapeutically relevant patient subgroups. Such analyses can provide complementary information to clinical trials of the classic type. As big data analyses become more popular, various statistical techniques for causality analysis in observational data are becoming more widely available. This is likely to be of benefit to medical science, but specific adaptations will have to be made according to the requirements of the applications.

Publication types

  • Review

MeSH terms

  • Biomedical Research / methods*
  • Data Interpretation, Statistical*
  • Database Management Systems*
  • Databases, Factual*
  • Datasets as Topic*
  • Information Storage and Retrieval / methods*
  • Machine Learning