Methods and Software

The Microbial Genomics Methods and Software collection will bring together articles describing novel experimental, bioinformatics, modelling, and statistical approaches to the analysis of microbial genomics data, including databases or the integration of genomics with other data streams; as well as systematic comparisons or benchmarking of existing methodologies used in the field of microbial genomics. Guest-edited by Dr Zamin Iqbal (European Bioinformatics Institute) and Dr Caroline Colijn (Simon Fraser University), the collection aims to provide the microbial genomics community with new and systematically validated tools to advance their research.
The cover image for this collection brings together figures from two of retrospective articles in the collection: a phylogeny richly annotated with insertion sequence sites from the article on ISseeker by Adams et al. 2016 (bottom left); and a genome assembly graph from the article on completing bacterial genomes by Wick et al. 2017 (top right).
This collection is now open for submissions. Submit your article here, stating that your manuscript is part of the Methods and Software collection.
Collection Contents
1 - 20 of 49 results
-
-
DiSCo: a sequence-based type-specific predictor of Dsr-dependent dissimilatory sulphur metabolism in microbial data
More LessCurrent methods in comparative genomic analyses for metabolic potential prediction of proteins involved in, or associated with the Dsr (dissimilatory sulphite reductase)-dependent dissimilatory sulphur metabolism are both time-intensive and computationally challenging, especially when considering metagenomic data. We developed DiSCo, a Dsr-dependent dissimilatory sulphur metabolism classification tool, which automatically identifies and classifies the protein type from sequence data. It takes user-supplied protein sequences and lists the identified proteins and their classification in terms of protein family and predicted type. It can also extract the sequence data from user-input to serve as basis for additional downstream analyses. DiSCo provides the metabolic functional prediction of proteins involved in Dsr-dependent dissimilatory sulphur metabolism with high levels of accuracy in a fast manner. We ran DiSCo against a dataset composed of over 190 thousand (meta)genomic records and efficiently mapped Dsr-dependent dissimilatory sulphur proteins in 1798 lineages across both prokaryotic domains. This allowed the identification of new micro-organisms belonging to Thaumarchaeota and Spirochaetes lineages with the metabolic potential to use the Dsr-pathway for energy conservation. DiSCo is implemented in Perl 5 and freely available under the GNU GPLv3 at https://github.com/Genome-Evolution-and-Ecology-Group-GEEG/DiSCo.
-
-
-
Metagenomic approaches in microbial ecology: an update on whole-genome and marker gene sequencing analyses
More LessMetagenomics and marker gene approaches, coupled with high-throughput sequencing technologies, have revolutionized the field of microbial ecology. Metagenomics is a culture-independent method that allows the identification and characterization of organisms from all kinds of samples. Whole-genome shotgun sequencing analyses the total DNA of a chosen sample to determine the presence of micro-organisms from all domains of life and their genomic content. Importantly, the whole-genome shotgun sequencing approach reveals the genomic diversity present, but can also give insights into the functional potential of the micro-organisms identified. The marker gene approach is based on the sequencing of a specific gene region. It allows one to describe the microbial composition based on the taxonomic groups present in the sample. It is frequently used to analyse the biodiversity of microbial ecosystems. Despite its importance, the analysis of metagenomic sequencing and marker gene data is quite a challenge. Here we review the primary workflows and software used for both approaches and discuss the current challenges in the field.
-
-
-
AB_SA: Accessory genes-Based Source Attribution – tracing the source of Salmonella enterica Typhimurium environmental strains
The partitioning of pathogenic strains isolated in environmental or human cases to their sources is challenging. The pathogens usually colonize multiple animal hosts, including livestock, which contaminate the food-production chain and the environment (e.g. soil and water), posing an additional public-health burden and major challenges in the identification of the source. Genomic data opens up new opportunities for the development of statistical models aiming to indicate the likely source of pathogen contamination. Here, we propose a computationally fast and efficient multinomial logistic regression source-attribution classifier to predict the animal source of bacterial isolates based on ‘source-enriched’ loci extracted from the accessory-genome profiles of a pangenomic dataset. Depending on the accuracy of the model’s self-attribution step, the modeller selects the number of candidate accessory genes that best fit the model for calculating the likelihood of (source) category membership. The Accessory genes-Based Source Attribution (AB_SA) method was applied to a dataset of strains of Salmonella enterica Typhimurium and its monophasic variant ( S . enterica 1,4,[5],12:i:-). The model was trained on 69 strains with known animal-source categories (i.e. poultry, ruminant and pig). The AB_SA method helped to identify 8 genes as predictors among the 2802 accessory genes. The self-attribution accuracy was 80 %. The AB_SA model was then able to classify 25 of the 29 S . enterica Typhimurium and S . enterica 1,4,[5],12:i:- isolates collected from the environment (considered to be of unknown source) into a specific category (i.e. animal source), with more than 85 % of probability. The AB_SA method herein described provides a user-friendly and valuable tool for performing source-attribution studies in only a few steps. AB_SA is written in R and freely available at https://github.com/lguillier/AB_SA.
-
-
-
Next-generation sequencing of dsRNA is greatly improved by treatment with the inexpensive denaturing reagent DMSO
More LessdsRNA is the genetic material of important viruses and a key component of RNA interference-based immunity in eukaryotes. Previous studies have noted difficulties in determining the sequence of dsRNA molecules that have affected studies of immune function and estimates of viral diversity in nature. DMSO has been used to denature dsRNA prior to the reverse-transcription stage to improve reverse transcriptase PCR and Sanger sequencing. We systematically tested the utility of DMSO to improve the sequencing yield of a dsRNA virus (Φ6) in a short-read next-generation sequencing platform. DMSO treatment improved sequencing read recovery by over two orders of magnitude, even when RNA and cDNA concentrations were below the limit of detection. We also tested the effects of DMSO on a mock eukaryotic viral community and found that dsRNA virus reads increased with DMSO treatment. Furthermore, we provide evidence that DMSO treatment does not adversely affect recovery of reads from a ssRNA viral genome (influenza A/California/07/2009). We suggest that up to 50 % DMSO treatment be used prior to cDNA synthesis when samples of interest are composed of or may contain dsRNA.
-
-
-
PANINI: Pangenome Neighbour Identification for Bacterial Populations
The standard workhorse for genomic analysis of the evolution of bacterial populations is phylogenetic modelling of mutations in the core genome. However, a notable amount of information about evolutionary and transmission processes in diverse populations can be lost unless the accessory genome is also taken into consideration. Here, we introduce panini (Pangenome Neighbour Identification for Bacterial Populations), a computationally scalable method for identifying the neighbours for each isolate in a data set using unsupervised machine learning with stochastic neighbour embedding based on the t-SNE (t-distributed stochastic neighbour embedding) algorithm. panini is browser-based and integrates with the Microreact platform for rapid online visualization and exploration of both core and accessory genome evolutionary signals, together with relevant epidemiological, geographical, temporal and other metadata. Several case studies with single- and multi-clone pneumococcal populations are presented to demonstrate the ability to identify biologically important signals from gene content data. panini is available at http://panini.pathogen.watch and code at http://gitlab.com/cgps/panini.
-
-
-
rPinecone: Define sub-lineages of a clonal expansion via a phylogenetic tree
The ability to distinguish different circulating pathogen clones from each other is a fundamental requirement to understand the epidemiology of infectious diseases. Phylogenetic analysis of genomic data can provide a powerful platform to identify lineages within bacterial populations, and thus inform outbreak investigation and transmission dynamics. However, resolving differences between pathogens associated with low-variant (LV) populations carrying low median pairwise single nucleotide variant (SNV) distances remains a major challenge. Here we present rPinecone, an R package designed to define sub-lineages within closely related LV populations. rPinecone uses a root-to-tip directional approach to define sub-lineages within a phylogenetic tree according to SNV distance from the ancestral node. The utility of this software was demonstrated using both simulated outbreaks and real genomic data of two LV populations: a hospital outbreak of methicillin-resistant Staphylococcus aureus and endemic Salmonella Typhi from rural Cambodia. rPinecone identified the transmission branches of the hospital outbreak and geographically confined lineages in Cambodia. Sub-lineages identified by rPinecone in both analyses were phylogenetically robust. It is anticipated that rPinecone can be used to discriminate between lineages of bacteria from LV populations where other methods fail, enabling a deeper understanding of infectious disease epidemiology for public health purposes.
-
-
-
PhasomeIt: an ‘omics’ approach to cataloguing the potential breadth of phase variation in the genus Campylobacter
More LessHypermutable simple sequence repeats (SSRs) are drivers of phase variation (PV) whose stochastic, high-frequency, reversible switches in gene expression are a common feature of several pathogenic bacterial species, including the human pathogen Campylobacter jejuni. Here we examine the distribution and conservation of known and putative SSR-driven phase variable genes – the phasome – in the genus Campylobacter. PhasomeIt, a new program, was specifically designed for rapid identification of SSR-mediated PV. This program detects the location, type and repeat number of every SSR. Each SSR is linked to a specific gene and its putative expression state. Other outputs include conservation of SSR-driven phase-variable genes and the ‘core phasome’ – the minimal set of PV genes in a phylogenetic grouping. Analysis of 77 complete Campylobacter genome sequences detected a ‘core phasome’ of conserved PV genes in each species and a large number of rare PV genes with few, or no, homologues in other genome sequences. Analysis of a set of partial genome sequences, with food-chain-associated metadata, detected evidence of a weak link between phasome and source host for disease-causing isolates of sequence type (ST)-828 but not the ST-21 or ST-45 complexes. Investigation of the phasomes in the genus Campylobacter provided evidence of overlapping but distinctive mechanisms of PV-mediated adaptation to specific niches. This suggests that the phasome could be involved in host adaptation and spread of campylobacters. Finally, this tool is malleable and will have utility for studying the distribution and genic effects of other repetitive elements in diverse bacterial species.
-
-
-
SynerClust: a highly scalable, synteny-aware orthologue clustering tool
Accurate orthologue identification is a vital component of bacterial comparative genomic studies, but many popular sequence-similarity-based approaches do not scale well to the large numbers of genomes that are now generated routinely. Furthermore, most approaches do not take gene synteny into account, which is useful information for disentangling paralogues. Here, we present SynerClust, a user-friendly synteny-aware tool based on synergy that can process thousands of genomes. SynerClust was designed to analyse genomes with high levels of local synteny, particularly prokaryotes, which have operon structure. SynerClust’s run-time is optimized by selecting cluster representatives at each node in the phylogeny; thus, avoiding the need for exhaustive pairwise similarity searches. In benchmarking against Roary, Hieranoid2, PanX and Reciprocal Best Hit, SynerClust was able to more completely identify sets of core genes for datasets that included diverse strains, while using substantially less memory, and with scalability comparable to the fastest tools. Due to its scalability, ease of installation and use, and suitability for a variety of computing environments, orthogroup clustering using SynerClust will enable many large-scale prokaryotic comparative genomics efforts.
-
-
-
mlplasmids: a user-friendly tool to predict plasmid- and chromosome-derived sequences for single species
Assembly of bacterial short-read whole-genome sequencing data frequently results in hundreds of contigs for which the origin, plasmid or chromosome, is unclear. Complete genomes resolved by long-read sequencing can be used to generate and label short-read contigs. These were used to train several popular machine learning methods to classify the origin of contigs from Enterococcus faecium, Klebsiella pneumoniae and Escherichia coli using pentamer frequencies. We selected support-vector machine (SVM) models as the best classifier for all three bacterial species (F1-score E. faecium=0.92, F1-score K. pneumoniae=0.90, F1-score E. coli=0.76), which outperformed other existing plasmid prediction tools using a benchmarking set of isolates. We demonstrated the scalability of our models by accurately predicting the plasmidome of a large collection of 1644 E. faecium isolates and illustrate its applicability by predicting the location of antibiotic-resistance genes in all three species. The SVM classifiers are publicly available as an R package and graphical-user interface called ‘mlplasmids’. We anticipate that this tool may significantly facilitate research on the dissemination of plasmids encoding antibiotic resistance and/or contributing to host adaptation.
-
-
-
Genetic diversity, mobilisation and spread of the yersiniabactin-encoding mobile element ICEKp in Klebsiella pneumoniae populations
Mobile genetic elements (MGEs) that frequently transfer within and between bacterial species play a critical role in bacterial evolution, and often carry key accessory genes that associate with a bacteria’s ability to cause disease. MGEs carrying antimicrobial resistance (AMR) and/or virulence determinants are common in the opportunistic pathogen Klebsiella pneumoniae, which is a leading cause of highly drug-resistant infections in hospitals. Well-characterised virulence determinants in K. pneumoniae include the polyketide synthesis loci ybt and clb (also known as pks), encoding the iron-scavenging siderophore yersiniabactin and genotoxin colibactin, respectively. These loci are located within an MGE called ICEKp, which is the most common virulence-associated MGE of K. pneumoniae, providing a mechanism for these virulence factors to spread within the population. Here we apply population genomics to investigate the prevalence, evolution and mobility of ybt and clb in K. pneumoniae populations through comparative analysis of 2498 whole-genome sequences. The ybt locus was detected in 40 % of K. pneumoniae genomes, particularly amongst those associated with invasive infections. We identified 17 distinct ybt lineages and 3 clb lineages, each associated with one of 14 different structural variants of ICEKp. Comparison with the wider population of the family Enterobacteriaceae revealed occasional ICEKp acquisition by other members. The clb locus was present in 14 % of all K. pneumoniae and 38.4 % of ybt+ genomes. Hundreds of independent ICEKp integration events were detected affecting hundreds of phylogenetically distinct K. pneumoniae lineages, including at least 19 in the globally-disseminated carbapenem-resistant clone CG258. A novel plasmid-encoded form of ybt was also identified, representing a new mechanism for ybt dispersal in K. pneumoniae populations. These data indicate that MGEs carrying ybt and clb circulate freely in the K. pneumoniae population, including among multidrug-resistant strains, and should be considered a target for genomic surveillance along with AMR determinants.
-
-
-
PlaScope: a targeted approach to assess the plasmidome from genome assemblies at the species level
More LessPlasmid prediction may be of great interest when studying bacteria of medical importance such as Enterobacteriaceae as well as Staphylococcus aureus or Enterococcus. Indeed, many resistance and virulence genes are located on such replicons with major impact in terms of pathogenicity and spreading capacities. Beyond strain outbreak, plasmid outbreaks have been reported in particular for some extended-spectrum beta-lactamase- or carbapenemase-producing Enterobacteriaceae. Several tools are now available to explore the ‘plasmidome’ from whole-genome sequences with various approaches, but none of them are able to combine high sensitivity and specificity. With this in mind, we developed PlaScope, a targeted approach to recover plasmidic sequences in genome assemblies at the species or genus level. Based on Centrifuge, a metagenomic classifier, and a custom database containing complete sequences of chromosomes and plasmids from various curated databases, PlaScope classifies contigs from an assembly according to their predicted location. Compared to other plasmid classifiers, PlasFlow and cBar, it achieves better recall (0.87), specificity (0.99), precision (0.96) and accuracy (0.98) on a dataset of 70 genomes of Escherichia coli containing plasmids. In a second part, we identified 20 of the 21 chromosomal integrations of the extended-spectrum beta-lactamase coding gene in a clinical dataset of E. coli strains. In addition, we predicted virulence gene and operon locations in agreement with the literature. We also built a database for Klebsiella and correctly assigned the location for the majority of resistance genes from a collection of 12 Klebsiella pneumoniae strains. Similar approaches could also be developed for other well-characterized bacteria.
-
-
-
MOB-suite: software tools for clustering, reconstruction and typing of plasmids from draft assemblies
More LessLarge-scale bacterial population genetics studies are now routine due to cost-effective Illumina short-read sequencing. However, analysing plasmid content remains difficult due to incomplete assembly of plasmids. Bacterial isolates can contain any number of plasmids and assembly remains complicated due to the presence of repetitive elements. Numerous tools have been developed to analyse plasmids but the performance and functionality of the tools are variable. The MOB-suite was developed as a set of modular tools for reconstruction and typing of plasmids from draft assembly data to facilitate characterization of plasmids. Using a set of closed genomes with publicly available Illumina data, the MOB-suite identified contigs of plasmid origin with both high sensitivity and specificity (95 and 88 %, respectively). In comparison, plasmidfinder demonstrated high specificity (99 %) but limited sensitivity (50 %). Using the same dataset of 377 known plasmids, MOB-recon accurately reconstructed 207 plasmids so that they were assigned to a single grouping without other plasmid or chromosomal sequences, whereas plasmidSPAdes was only able to accurately reconstruct 102 plasmids. In general, plasmidSPAdes has a tendency to merge different plasmids together, with 208 plasmids undergoing merge events. The MOB-suite reduces the number of errors but produces more hybrid plasmids, with 84 plasmids undergoing both splits and merges. The MOB-suite also provides replicon typing similar to plasmidfinder but with the inclusion of relaxase typing and prediction of conjugation potential. The MOB-suite is written in Python 3 and is available from https://github.com/phac-nml/mob-suite.
-
-
-
ClermonTyping: an easy-to-use and accurate in silico method for Escherichia genus strain phylotyping
More LessThe genus Escherichia is composed of Escherichia albertii, E. fergusonii, five cryptic Escherichia clades and E. coli sensu stricto. Furthermore, the E. coli species can be divided into seven main phylogroups termed A, B1, B2, C, D, E and F. As specific lifestyles and/or hosts can be attributed to these species/phylogroups, their identification is meaningful for epidemiological studies. Classical phenotypic tests fail to identify non-sensu stricto E. coli as well as phylogroups. Clermont and colleagues have developed PCR assays that allow the identification of most of these species/phylogroups, the triplex/quadruplex PCR for E. coli phylogroup determination being the most popular. With the growing availability of whole genome sequences, we have developed the ClermonTyping method and its associated web-interface, the ClermonTyper, that allows a given strain sequence to be assigned to E. albertii, E. fergusonii, Escherichia clades I–V, E. coli sensu stricto as well as to the seven main E. coli phylogroups. The ClermonTyping is based on the concept of in vitro PCR assays and maintains the principles of ease of use and speed that prevailed during the development of the in vitro assays. This in silico approach shows 99.4 % concordance with the in vitro PCR assays and 98.8 % with the Mash genome-clustering tool. The very few discrepancies result from various errors occurring mainly from horizontal gene transfers or SNPs in the primers. We propose the ClermonTyper as a freely available resource to the scientific community at: http://clermontyping.iame-research.center/.
-
-
-
SeroBA: rapid high-throughput serotyping of Streptococcus pneumoniae from whole genome sequence data
Streptococcus pneumoniae is responsible for 240 000–460 000 deaths in children under 5 years of age each year. Accurate identification of pneumococcal serotypes is important for tracking the distribution and evolution of serotypes following the introduction of effective vaccines. Recent efforts have been made to infer serotypes directly from genomic data but current software approaches are limited and do not scale well. Here, we introduce a novel method, SeroBA, which uses a k-mer approach. We compare SeroBA against real and simulated data and present results on the concordance and computational performance against a validation dataset, the robustness and scalability when analysing a large dataset, and the impact of varying the depth of coverage on sequence-based serotyping. SeroBA can predict serotypes, by identifying the cps locus, directly from raw whole genome sequencing read data with 98 % concordance using a k-mer-based method, can process 10 000 samples in just over 1 day using a standard server and can call serotypes at a coverage as low as 15–21×. SeroBA is implemented in Python3 and is freely available under an open source GPLv3 licence from: https://github.com/sanger-pathogens/seroba
-
-
-
SuperDCA for genome-wide epistasis analysis
The potential for genome-wide modelling of epistasis has recently surfaced given the possibility of sequencing densely sampled populations and the emerging families of statistical interaction models. Direct coupling analysis (DCA) has previously been shown to yield valuable predictions for single protein structures, and has recently been extended to genome-wide analysis of bacteria, identifying novel interactions in the co-evolution between resistance, virulence and core genome elements. However, earlier computational DCA methods have not been scalable to enable model fitting simultaneously to 104–105 polymorphisms, representing the amount of core genomic variation observed in analyses of many bacterial species. Here, we introduce a novel inference method (SuperDCA) that employs a new scoring principle, efficient parallelization, optimization and filtering on phylogenetic information to achieve scalability for up to 105 polymorphisms. Using two large population samples of Streptococcus pneumoniae, we demonstrate the ability of SuperDCA to make additional significant biological findings about this major human pathogen. We also show that our method can uncover signals of selection that are not detectable by genome-wide association analysis, even though our analysis does not require phenotypic measurements. SuperDCA, thus, holds considerable potential in building understanding about numerous organisms at a systems biological level.
-
-
-
Assembly of highly repetitive genomes using short reads: the genome of discrete typing unit III Trypanosoma cruzi strain 231
Next-generation sequencing (NGS) methods are low-cost high-throughput technologies that produce thousands to millions of sequence reads. Despite the high number of raw sequence reads, their short length, relative to Sanger, PacBio or Nanopore reads, complicates the assembly of genomic repeats. Many genome tools are available, but the assembly of highly repetitive genome sequences using only NGS short reads remains challenging. Genome assembly of organisms responsible for important neglected diseases such as Trypanosoma cruzi, the aetiological agent of Chagas disease, is known to be challenging because of their repetitive nature. Only three of six recognized discrete typing units (DTUs) of the parasite have their draft genomes published and therefore genome evolution analyses in the taxon are limited. In this study, we developed a computational workflow to assemble highly repetitive genomes via a combination of de novo and reference-based assembly strategies to better overcome the intrinsic limitations of each, based on Illumina reads. The highly repetitive genome of the human-infecting parasite T. cruzi 231 strain was used as a test subject. The combined-assembly approach shown in this study benefits from the reference-based assembly ability to resolve highly repetitive sequences and from the de novo capacity to assemble genome-specific regions, improving the quality of the assembly. The acceptable confidence obtained by analyzing our results showed that our combined approach is an attractive option to assemble highly repetitive genomes with NGS short reads. Phylogenomic analysis including the 231 strain, the first representative of DTU III whose genome was sequenced, was also performed and provides new insights into T. cruzi genome evolution.
-
-
-
PlasmidTron: assembling the cause of phenotypes and genotypes from NGS data
Increasingly rich metadata are now being linked to samples that have been whole-genome sequenced. However, much of this information is ignored. This is because linking this metadata to genes, or regions of the genome, usually relies on knowing the gene sequence(s) responsible for the particular trait being measured and looking for its presence or absence in that genome. Examples of this would be the spread of antimicrobial resistance genes carried on mobile genetic elements (MGEs). However, although it is possible to routinely identify the resistance gene, identifying the unknown MGE upon which it is carried can be much more difficult if the starting point is short-read whole-genome sequence data. The reason for this is that MGEs are often full of repeats and so assemble poorly, leading to fragmented consensus sequences. Since mobile DNA, which can carry many clinically and ecologically important genes, has a different evolutionary history from the host, its distribution across the host population will, by definition, be independent of the host phylogeny. It is possible to use this phenomenon in a genome-wide association study to identify both the genes associated with the specific trait and also the DNA linked to that gene, for example the flanking sequence of the plasmid vector on which it is encoded, which follows the same patterns of distribution as the marker gene/sequence itself. We present PlasmidTron, which utilizes the phenotypic data normally available in bacterial population studies, such as antibiograms, virulence factors, or geographical information, to identify traits that are likely to be present on DNA that can randomly reassort across defined bacterial populations. It is also possible to use this methodology to associate unknown genes/sequences (e.g. plasmid backbones) with a specific molecular signature or marker (e.g. resistance gene presence or absence) using PlasmidTron. PlasmidTron uses a k-mer-based approach to identify reads associated with a phylogenetically unlinked phenotype. These reads are then assembled de novo to produce contigs in a fast and scalable-to-large manner. PlasmidTron is written in Python 3 and is available under the open source licence GNU GPL3 from https://github.com/sanger-pathogens/plasmidtron.
-
-
-
chewBBACA: A complete suite for gene-by-gene schema creation and strain identification
Gene-by-gene approaches are becoming increasingly popular in bacterial genomic epidemiology and outbreak detection. However, there is a lack of open-source scalable software for schema definition and allele calling for these methodologies. The chewBBACA suite was designed to assist users in the creation and evaluation of novel whole-genome or core-genome gene-by-gene typing schemas and subsequent allele calling in bacterial strains of interest. chewBBACA performs the schema creation and allele calls on complete or draft genomes resulting from de novo assemblers. The chewBBACA software uses Python 3.4 or higher and can run on a laptop or in high performance clusters making it useful for both small laboratories and large reference centers. ChewBBACA is available at https://github.com/B-UMMI/chewBBACA.
-
-
-
Comprehensive assessment of the quality of Salmonella whole genome sequence data available in public sequence databases using the Salmonella in silico Typing Resource (SISTR)
Public health and food safety institutions around the world are adopting whole genome sequencing (WGS) to replace conventional methods for characterizing Salmonella for use in surveillance and outbreak response. Falling costs and increased throughput of WGS have resulted in an explosion of data, but questions remain as to the reliability and robustness of the data. Due to the critical importance of serovar information to public health, it is essential to have reliable serovar assignments available for all of the Salmonella records. The current study used a systematic assessment and curation of all Salmonella in the sequence read archive (SRA) to assess the state of the data and their utility. A total of 67 758 genomes were assembled de novo and quality-assessed for their assembly metrics as well as species and serovar assignments. A total of 42 400 genomes passed all of the quality criteria but 30.16 % of genomes were deposited without serotype information. These data were used to compare the concordance of reported and predicted serovars for two in silico prediction tools, multi-locus sequence typing (MLST) and the Salmonella in silico Typing Resource (SISTR), which produced predictions that were fully concordant with 87.51 and 91.91 % of the tested isolates, respectively. Concordance of in silico predictions increased when serovar variants were grouped together, 89.25 % for MLST and 94.98 % for SISTR. This study represents the first large-scale validation of serovar information in public genomes and provides a large validated set of genomes, which can be used to benchmark new bioinformatics tools.
-
-
-
Coupling next-generation sequencing to dominant positive screens for finding antibiotic cellular targets and resistance mechanisms in Escherichia coli
More LessIn order to expedite the discovery of genes coding for either drug targets or antibiotic resistance, we have developed a functional genomic strategy termed Plas-Seq. This technique involves coupling a multicopy suppressor library to next-generation sequencing. We generated an Escherichia coli plasmid genomic library that was transformed into E. coli. These transformants were selected step by step using 0.25× to 2× minimum inhibitory concentrations for ceftriaxone, gentamicin, levofloxacin, tetracycline or trimethoprim. Plasmids were isolated at each selection step and subjected to Illumina sequencing. By searching for genomic loci whose sequencing coverage increased with antibiotic pressure we were able to detect 48 different genomic loci that were enriched by at least one antibiotic. Fifteen of these loci were studied functionally, and we showed that 13 can decrease the susceptibility of E. coli to antibiotics when overexpressed. These genes coded for drug targets, transcription factors, membrane proteins and resistance factors. The technique of Plas-Seq is expediting the discovery of genes associated with the mode of action or resistance to antibiotics and led to the isolation of a novel gene influencing drug susceptibility. It has the potential for being applied to novel molecules and to other microbial species.
-