Investigating the impact of database choice on the accuracy of metagenomic read classification for the rumen microbiome

Rebecca Louise Smith* (Corresponding Author), Laura Glendinning, Alan Walker, Mick Watson

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review



Microbiome analysis is quickly moving towards high-throughput methods such as metagenomic sequencing. Accurate taxonomic classification of metagenomic data relies on reference sequence databases and their associated taxonomy. However, for understudied environments such as the rumen microbiome, many sequences will be derived from novel or uncultured microbes that are not present in reference databases. As a result, taxonomic classification of metagenomic data from understudied environments may be inaccurate. To assess the accuracy of taxonomic read classification, this study classified metagenomic data that had been simulated from cultured rumen microbial genomes from the Hungate collection. To assess the impact of reference databases on the accuracy of taxonomic classification, the data was classified with Kraken 2 using several reference databases. We found that the choice and composition of the reference database significantly affected both the taxonomic classification results and their accuracy. In particular, NCBI RefSeq proved to be a poor choice of database. Our results indicate that inaccurate read classification is likely to be a significant problem, affecting all studies that use insufficient reference databases. We observed that adding cultured reference genomes from the rumen to the reference database greatly improved classification rate and accuracy. We also demonstrated that metagenome-assembled genomes (MAGs) have the potential to further enhance classification accuracy by representing uncultivated microbes, sequences of which would otherwise be unclassified or incorrectly classified. However, classification accuracy was strongly dependent on the taxonomic labels assigned to these MAGs. We therefore highlight the importance of accurate reference taxonomic information and suggest that, with formal taxonomic lineages, MAGs have the potential to improve classification rate and accuracy, particularly in environments such as the rumen that are understudied or contain many novel genomes.
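The abstract distinguishes classification rate from classification accuracy, which is possible because simulated reads come with a known source genome. The sketch below is one way such a comparison could be scored; it is not the authors' pipeline. The function name `score_classifications`, the three metric names, and the toy read labels are all assumptions made for illustration, and the exact metric definitions used in the study are not given in the abstract.

```python
from typing import Dict, Optional


def score_classifications(
    truth: Dict[str, str],
    predicted: Dict[str, Optional[str]],
) -> Dict[str, float]:
    """Summarise read classifications against known source taxa.

    truth:     read ID -> taxon the read was simulated from
    predicted: read ID -> taxon assigned by a classifier, or None
               (or absent) when the read was left unclassified
    Hypothetical helper; the metric names are illustrative only.
    """
    total = len(truth)
    if total == 0:
        raise ValueError("truth must contain at least one read")

    classified = 0
    correct = 0
    for read_id, true_taxon in truth.items():
        assigned = predicted.get(read_id)  # missing or None = unclassified
        if assigned is None:
            continue
        classified += 1
        if assigned == true_taxon:
            correct += 1

    return {
        # Fraction of reads that received any label.
        "classification_rate": classified / total,
        # Fraction of labelled reads whose label matches the source taxon.
        "accuracy_of_classified": correct / classified if classified else 0.0,
        # Fraction of all reads that were labelled correctly.
        "correct_of_all_reads": correct / total,
    }


# Made-up toy data: four simulated reads, one unclassified, one mislabelled.
truth = {"r1": "Prevotella", "r2": "Prevotella",
         "r3": "Butyrivibrio", "r4": "Ruminococcus"}
predicted = {"r1": "Prevotella", "r2": "Bacteroides",
             "r3": None, "r4": "Ruminococcus"}
summary = score_classifications(truth, predicted)
```

Keeping the rate and the accuracy separate matters here because a database can raise one without the other: a reference set with poorly labelled genomes may classify more reads while assigning more of them to the wrong taxon.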
Original language: English
Article number: 57
Journal: Animal Microbiome
Early online date: 18 Nov 2022
Publication status: Published - 18 Nov 2022

Bibliographical note

The Roslin Institute forms part of the Royal (Dick) School of Veterinary Studies, University of Edinburgh. This project was supported by the Biotechnology and Biological Sciences Research Council (BBSRC; BB/S006680/1, BB/R015023/1), including institute strategic program grant BBS/E/D/30002276. R.H.S. is supported by an EASTBIO studentship funded by BBSRC (BB/M010996/1). A.W.W. and the Rowett Institute receive core financial support from the Scottish Government Rural and Environmental Sciences and Analytical Services (SG-RESAS).

We would like to thank all of those who were involved in creating and publicly sharing both the Hungate Collection data and the RUG data.

Data Availability Statement

The data used in this study was simulated using genomes from the Hungate Collection. The simulated metagenomic data is available online. The metagenomic assemblies (MAGs) used to create the RUG and RefRUG databases can be found in ENA under accession PRJEB31266. Information about the MAGs used to create the RUG database, such as genome metrics, can be found in the Stewart et al. publication [17].


  • Metagenome-assembled genomes
  • Metagenome
  • Rumen
  • Microbiome
  • Reference databases
  • Read classification
  • Taxonomy

