Resampling nucleotide sequences with closest-neighbor trimming and its comparison to other methods

Kouki Yonezawa; Manabu Igarashi; Keisuke Ueno; Ayato Takada; Kimihito Ito

doi:10.1371/journal.pone.0057684

PLoS One. 2013;8(2):e57684. doi: 10.1371/journal.pone.0057684. Epub 2013 Feb 27.

Authors

Kouki Yonezawa¹, Manabu Igarashi, Keisuke Ueno, Ayato Takada, Kimihito Ito

Affiliation

¹ Department of Computer Bioscience, Nagahama Institute of Bio-science and Technology, Nagahama, Shiga-pref, Japan.

Abstract

A large number of nucleotide sequences of various pathogens are available in public databases. The growth of the datasets has resulted in an enormous increase in computational costs. Moreover, due to differences in surveillance activities, the number of sequences found in databases varies from one country to another and from year to year. Therefore, it is important to study resampling methods to reduce the sampling bias. A novel algorithm, called the closest-neighbor trimming method, that resamples a given number of sequences from a large nucleotide sequence dataset was proposed. The performance of the proposed algorithm was compared with other algorithms by using the nucleotide sequences of human H3N2 influenza viruses. We compared the closest-neighbor trimming method with the naive hierarchical clustering algorithm and the k-medoids clustering algorithm. Genetic information accumulated in public databases contains sampling bias. The closest-neighbor trimming method can thin out densely sampled sequences from a given dataset. Since nucleotide sequences are among the most widely used materials for life sciences, we anticipate that applying our algorithm to various datasets will reduce sampling bias.
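The abstract does not spell out the trimming rule, so the following is only a minimal sketch of one plausible reading of "thinning out densely sampled sequences", not the authors' published procedure. It assumes pre-aligned sequences of equal length, uses Hamming distance as the dissimilarity, and repeatedly drops one member of the closest remaining pair until the requested number of sequences is left. The function names `hamming` and `trim_closest_neighbors`, the tie-breaking rule, and the toy sequences are all hypothetical.

```python
from __future__ import annotations

from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count differing positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def trim_closest_neighbors(seqs: list[str], n_keep: int) -> list[str]:
    """Greedily thin `seqs` down to at most `n_keep` sequences.

    Assumed rule (not taken from the paper): find the closest pair, then
    drop the member of that pair whose nearest *other* neighbor is closer,
    i.e. the one sitting in the denser region. Ties drop the first member.
    """
    if n_keep < 1:
        raise ValueError("n_keep must be at least 1")

    kept = list(range(len(seqs)))

    def nearest_other(idx: int, partner: int) -> int:
        # Distance from seqs[idx] to its nearest neighbor, ignoring its partner.
        others = [k for k in kept if k not in (idx, partner)]
        return min((hamming(seqs[idx], seqs[k]) for k in others), default=0)

    while len(kept) > n_keep:
        # Closest remaining pair; min() returns the first pair on ties.
        i, j = min(
            combinations(kept, 2),
            key=lambda p: hamming(seqs[p[0]], seqs[p[1]]),
        )
        drop = i if nearest_other(i, j) <= nearest_other(j, i) else j
        kept.remove(drop)

    return [seqs[k] for k in kept]


if __name__ == "__main__":
    # Toy data: three near-identical sequences plus two distant ones.
    toy = ["AAAA", "AAAT", "AATT", "CCCC", "GGGG"]
    print(trim_closest_neighbors(toy, 3))
```

On the toy input the dense cluster of three similar sequences is reduced to one representative while the two distant sequences survive. This naive version recomputes all pairwise distances on every pass, so it is meant for illustration on small inputs only; a practical implementation would cache a distance matrix.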

Publication types

Comparative Study
Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Base Sequence
Databases, Nucleic Acid
Genetic Variation
Hemagglutinin Glycoproteins, Influenza Virus / genetics
Humans
Influenza A Virus, H3N2 Subtype / genetics
Sequence Analysis, DNA / methods*
Time Factors

Substances

Hemagglutinin Glycoproteins, Influenza Virus

Grants and funding

This work was supported by the Global COE Program "Establishment of International Collaboration Centers for Zoonosis Control", the Japan Initiative for Global Research Network on Infectious Diseases (J-GRID), KAKENHI 24700289, all from the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan, and PRESTO and SORST from Japan Science and Technology Agency (JST) Basic Research Programs. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.