Petabyte-scale innovations at the European Nucleotide Archive

Guy Cochrane; Ruth Akhtar; James Bonfield; Lawrence Bower; Fehmi Demiralp; Nadeem Faruque; Richard Gibson; Gemma Hoad; Tim Hubbard; Christopher Hunter; Mikyung Jang; Szilveszter Juhos; Rasko Leinonen; Steven Leonard; Quan Lin; Rodrigo Lopez; Dariusz Lorenc; Hamish McWilliam; Gaurab Mukherjee; Sheila Plaister; Rajesh Radhakrishnan; Stephen Robinson; Siamak Sobhany; Petra Ten Hoopen; Robert Vaughan; Vadim Zalunin; Ewan Birney

doi:10.1093/nar/gkn765

Nucleic Acids Res. 2009 Jan;37(Database issue):D19-25. doi: 10.1093/nar/gkn765. Epub 2008 Oct 31.

Affiliation

¹ EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. cochrane@ebi.ac.uk

Abstract

Dramatic increases in the throughput of nucleotide sequencing machines, and the promise of ever greater performance, have thrust bioinformatics into the era of petabyte-scale data sets. Sequence repositories, which feed these data sets into the worldwide computational infrastructure, are challenged by the impact of these data volumes. The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/embl), comprising the EMBL Nucleotide Sequence Database and the Ensembl Trace Archive, has identified challenges in the storage, movement, analysis, interpretation and visualization of petabyte-scale data sets. Here we present our new repository for next-generation sequence data, briefly summarize the contents of the ENA, and provide details of major developments to submission pipelines, high-throughput rule-based validation infrastructure and data integration approaches.

MeSH terms

Databases, Nucleic Acid*
Internet
Sequence Analysis / trends*
Systems Integration
