Title: Analyzing the performance of short-read classification tools on metagenomic samples toward proper diagnosis of diseases
Authors: Leili Irankhah, Babak Khorsand, Mahmoud Naghibzadeh, Abdorreza Savadi
Abstract
Accurate knowledge of the genome, virus and bacteria that have invaded our bodies is crucial for diagnosing many human diseases. The field of bioinformatics encompasses the complex computational methods required for this purpose. Metagenomics employs next-generation sequencing (NGS) technology to study and identify microbial communities in environmental samples. This technique allows for the measurement of the relative abundance of different microbes. Various tools are available for detecting bacterial species in sequenced metagenomic samples. In this study, we focus on well-known taxonomic classification tools such as MetaPhlAn4, Centrifuge, Kraken2, and Bracken, and evaluate their performance at the species level using synthetic and real datasets. The results indicate that MetaPhlAn4 exhibited high precision in identifying species in the simulated dataset, while Kraken2 had the best area under the precision-recall curve (AUPR) performance. Centrifuge, Kraken2, and Bracken showed accurate estimation of species abundances, unlike MetaPhlAn4, which had a higher L2 distance. In the real dataset analysis, with samples from an inflammatory bowel disease (IBD) study, MetaPhlAn4 and Kraken2 had faster execution times, and the tools differed in performance at the family and species levels. Enterobacteriaceae and Pasteurellaceae were highlighted as the most abundant families by Centrifuge, Kraken2, and MetaPhlAn4, with variations in abundance among the ulcerative colitis (UC), Crohn's disease (CD), and control non-IBD (CN) groups. Escherichia coli (E. coli) had the highest abundance among Enterobacteriaceae species in the CD and UC groups in comparison with the CN group. Bracken overestimated E. coli abundance, underscoring the need for caution when interpreting results.
The findings of this research can assist in selecting the appropriate short-read classifier, thereby aiding in the diagnosis of target diseases.
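The abstract compares tools by species-level precision and by the L2 distance between estimated and true abundance profiles. The following is a minimal Python sketch of how such metrics are commonly computed. It is an illustration under stated assumptions, not the authors' evaluation code: the function names `l2_distance` and `precision_recall` and the toy abundance values are hypothetical.

```python
import math


def l2_distance(true_abundance, estimated_abundance):
    """Euclidean (L2) distance between two relative-abundance profiles.

    Each profile is a dict mapping species name -> relative abundance.
    A species missing from one profile is treated as abundance 0.
    """
    species = set(true_abundance) | set(estimated_abundance)
    return math.sqrt(sum(
        (true_abundance.get(s, 0.0) - estimated_abundance.get(s, 0.0)) ** 2
        for s in species
    ))


def precision_recall(true_species, predicted_species):
    """Species-level precision and recall from presence/absence calls."""
    true_set, pred_set = set(true_species), set(predicted_species)
    true_positives = len(true_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    return precision, recall


# Hypothetical toy profiles (not data from the study).
truth = {
    "Escherichia coli": 0.5,
    "Bacteroides fragilis": 0.3,
    "Haemophilus influenzae": 0.2,
}
estimate = {
    "Escherichia coli": 0.6,
    "Bacteroides fragilis": 0.3,
    "Klebsiella pneumoniae": 0.1,
}

distance = l2_distance(truth, estimate)
precision, recall = precision_recall(truth, estimate)
print(f"L2 distance: {distance:.4f}")
print(f"precision: {precision:.3f}, recall: {recall:.3f}")
```

A lower L2 distance means the estimated abundances are closer to the ground truth. Precision and recall depend only on which species are reported as present, so a tool can score well on one kind of metric and poorly on the other.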
Keywords
Metagenomics; bioinformatics; microbial detection; next-generation sequencing; species level classification; taxonomic classification
@article{paperid:1100395,
author = {Irankhah, Leili and Khorsand, Babak and Naghibzadeh, Mahmoud and Savadi, Abdorreza},
title = {Analyzing the performance of short-read classification tools on metagenomic samples toward proper diagnosis of diseases},
journal = {Journal of Bioinformatics and Computational Biology},
year = {2024},
month = {September},
issn = {0219-7200},
keywords = {Metagenomics; bioinformatics; microbial detection; next-generation sequencing; species level classification; taxonomic classification},
}
%0 Journal Article
%T Analyzing the performance of short-read classification tools on metagenomic samples toward proper diagnosis of diseases
%A Irankhah, Leili
%A Khorsand, Babak
%A Naghibzadeh, Mahmoud
%A Savadi, Abdorreza
%J Journal of Bioinformatics and Computational Biology
%@ 0219-7200
%D 2024