Over the last decades, the field of metagenomics, aided by NGS technologies, has grown exponentially and is now a cornerstone tool in medicine. However, even with current technologies, obtaining a conclusive identification of an organism can be challenging because identification relies on reference-based methods. Consequently, when a new repository of genomic data containing de novo sequences is released, characterizing its content is problematic. In this paper, we propose a novel method for organism identification and for the creation and characterization of genomic databases. For identification, we propose a three-step pipeline consisting of reference-free reconstruction, reference-based classification, and feature-based classification. For content exposition and extraction, the sequences and their identifications are aggregated into a web database catalogue.
Keywords: Compression; Feature-based classification; Genomic population characterization; Metagenomics catalogue; Taxonomic identification.