

Command-line annotation software tools have continuously gained popularity compared to centralized online services due to the worldwide increase of sequenced bacterial genomes. However, results of existing command-line software pipelines heavily depend on taxon-specific databases or sufficiently well annotated reference genomes. Here, we introduce Bakta, a new command-line software tool for the robust, taxon-independent, thorough and, nonetheless, fast annotation of bacterial genomes. Bakta conducts a comprehensive annotation workflow, including the detection of small proteins, that takes replicon metadata into account. The annotation of coding sequences is accelerated via an alignment-free sequence identification approach that, in addition, facilitates the precise assignment of public database cross-references. Annotation results are exported in GFF3 and International Nucleotide Sequence Database Collaboration (INSDC)-compliant flat files, as well as comprehensive JSON files, facilitating automated downstream analysis. We compared Bakta to other rapid contemporary command-line annotation software tools in both targeted and taxonomically broad benchmarks including isolates and metagenome-assembled genomes. We demonstrated that Bakta outperforms other tools in terms of functional annotations, the assignment of functional categories and database cross-references, whilst providing comparable wall-clock runtimes. Bakta is implemented in Python 3 and runs on macOS and Linux systems. It is freely available under a GPLv3 license, and an accompanying web version is also available.
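The abstract does not say how the alignment-free identification works. One common way to identify a sequence without aligning it is an exact-match lookup keyed on a digest of the sequence. The following is a minimal sketch of that general idea, not Bakta's implementation: the function names, the choice of MD5, and the toy entries are all illustrative assumptions.

```python
import hashlib


def sequence_digest(protein: str) -> str:
    """Return a hex digest of a normalized protein sequence.

    Normalization (assumed here): trim whitespace, drop a trailing
    stop symbol '*', and upper-case the residues.
    """
    normalized = protein.strip().rstrip("*").upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def build_index(known):
    """Map digest -> (identifier, product) for exact-match lookups."""
    return {
        sequence_digest(seq): (ident, product)
        for ident, seq, product in known
    }


def identify(protein, index):
    """Return (identifier, product) for an exact match, else None."""
    return index.get(sequence_digest(protein))


# Hypothetical toy entries; identifiers and sequences are made up.
KNOWN = [
    ("EX_0001", "MKTAYIAKQR", "hypothetical example protein A"),
    ("EX_0002", "MSLNEQW", "hypothetical example protein B"),
]
index = build_index(KNOWN)

hit = identify("mktayiakqr*", index)   # same sequence, different case, stop symbol
miss = identify("MKTAYIAKQA", index)   # one residue differs, so no exact match
```

A lookup like this costs one dictionary access per query, and a hit identifies one specific database record, which is what makes precise cross-reference assignment straightforward. Sequences without an exact match still need a conventional alignment or profile search.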

This study was supported by the:
  • BMBF (Award 031A533)
    • Principal Award Recipient: Alexander Goesmann
  • BMBF (Award 031L0209A)
    • Principal Award Recipient: Alexander Goesmann

This is an open-access article distributed under the terms of the Creative Commons Attribution License.




