Is there a rule for the annotation of the basepair or gene locus origin in bacterial genomes?

Since all bacteria have one circular chromosome, there is no designated origin to start the systematic gene locus numbering. Therefore, what is the rule to start the numbering at a particular point (i.e. why is in the E.coli K12 strain the thrL gene given the ordered locus number b0001? If the answer is: It is the first Gene downstream of basepair-1, then why is basepair-1 at this particular position)

Why the question? Since bacteria show frequently functional synteny between genes, should I be surprised to find a gene frequently associated with the (arbitrary?) annotation origin? If the annotation origin is i.e. the origin of replication that would be an interesting case of synteny, but if no such rule exists, then such a case would most likely just reflect the historic sequence of Genome Sequencing

[I'm sure we have a skilled microbiologist somewhere here, so he would correct my answer.]

The necessity to designate gene locations in classical bacterial model species predated the genomic era, so traditional maps were normally neither bound to origin of replication, nor were they necessarily expressed in base-pairs. In E. coli, for example the traditional map was created based on the F-pili-mediated DNA transfer experiments.

When whole genome sequences became available it was only natural to continue the traditionally defined numeration in terms of placing the "base-pair one". Thus, in E. coli "the traditional and physical maps… linkage map are closely correlated".

My understanding of the current situation with bacteria without genetic prehistory is that their genomes are numerated either respective to classic closely-related species (if any) or according to the presumed location of ori. Just to give an example of how the dilemma was solved for the Buchnera genome: it was "assumed the DnaA box upstream of the gidA gene is the replication origin" and the authors decided to designate "the start of the DnaA box as base pair one".

As you see, still, even if the authors try to define "base-pair one" based on the position of ori, the problem is that this point depends on the ori-searching tool one uses and on the contemporary understanding of how ori-sites look like (direct determination of origins of replications are in many cases not easy). See e.g. this table with a revision of ori locations in a set of genomes: while most of the genomes received their numeration irrespective of ori, some of them were indeed bound to it, but the new analysis showed that these locations were inaccurate.

Dynamics of transcription–translation coordination tune bacterial indole signaling

Indole signaling is an important cross-species communication pathway in the mammalian gut. In bacteria, upon induction by tryptophan, the molecular sensor (tnaC) controls indole biosynthesis by precisely coordinating dynamics of the corresponding macromolecular machineries during its transcription and translation. Our understanding of this regulatory program is still limited owing to its rapid dynamic nature. To address this shortcoming, we adopted a massively parallel profiling method to quantify the responses of 1,450 synthetic tnaC variants in the presence of three concentrations of tryptophan in living bacterial cells. The resultant dataset enabled us to comprehensively probe the key intermediate states of macromolecular machineries during the transcription and translation of tnaC. We also used modeling to provide a systems-level understanding of how these critical states collectively shape the output of this regulatory program quantitatively. A similar methodology will likely apply to other poorly understood dynamics-dependent cis-regulatory elements.

Expert annotation of gene function¶

How to fill the Gene Annotation form?¶

As shown in the figure below, not all fields can be modified by the annotator. Furthermore, some of them are required and other are optional. These fields have to be filled after the careful analysis of the different methods results. If your are working on other object than CDS, you may have a different form, if a required field for CDS appear in your form, it’s still required.

If one of the required field is missing or wrongly filled a warning will appear in the window.

What are the different annotation “Status”?¶

  • inProgress : the annotator has not finished the expert annotation
  • finished : the annotator has finished the expert annotation
  • Curated : the expert annotation has been reviewed by a specialist of the functional process in which the CDS product is involved
  • Artefact : An artefactual CDS corresponds to a false prediction by the gene detection program. An artefactual CDS should never be similar to any proteins from the databanks (except if the same erroneous annotation has been made in another genomes)
  • chkSeq : this status is used by the annotator to flag potential sequencing errors in the sequence. When the sequencing is performed at Genoscope, these chkSeq sequences will be sent to the people working in the finishing team. They will then check the assembly to see if the sequence quality is good or not. If needed they can perform some additional PCRs to enhance the data.
  • chkStart : the annotator suspects that a start position readjustment might be needed for the CDS, but hasn’t done it yet.

How to identify artefacts?¶

What are the different “Type” categories?¶

  • CDS
  • fCDS
  • tRNA
  • rRNA
  • misc_RNA
  • tmRNA
  • ncRNA
  • IS
  • misc_feature
  • promoter

How to fill the “Mutation” field?¶

  • no => Normal CDS
  • frameshift => CDS for which a true frame-shift has been biologically demonstrated
  • pseudo => the CDS is part of a pseudogene
  • partial => the CDS is a gene fragment
  • gene remnant => the CDS is a highly degraded gene fragment
  • selenocysteine => the CDS contains a Selenocysteine in its sequence
  • pyrrolysine => the CDS contains a pyrrolysine in its sequence

What are the different “Product type” categories?¶

  • u : unknown
  • n : RNA
  • e : enzyme
  • f : factor
  • r : regulator
  • c : carrier
  • t : transporter
  • rc : receptor
  • s : structure
  • l : leader peptide
  • m : membrane component
  • lp : lipoprotein
  • cp : cell process
  • ph : phenotype
  • h : extrachromosomal origin

How to use the “MetaCyc reaction” field?¶

This field allows user to link one ore more metabolic reactions from MetaCyc (BioCyc) to the current edited gene.

  • a: Reactions presented at the top of the field have been manually curated by an annotator.
  • b: A multiple selection list gives quick access to all predicted (unselected) or curated (selected) reactions linked to this gene.
  • c: A search box allows one to quickly access MetaCyc reactions corresponding to either EC numbers from previous EC number field or a given keyword.

Search box :

Clicking on the “EC” button will search all MetaCyc reactions corresponding to the EC number from the “EC number” field.

The keyword search will look for all MetaCyc reactions having an identifier, a name or involving a compound similar to the given keyword.

Search result :

The search returns a list of MetaCyc reactions, with :

Genes of the organism already linked to this reaction (eg. first row of the example). Genes are flagged with :

  • “validated” : reaction has been manually linked to this gene by users.
  • “annotated” : reaction has been linked to homologous gene and transferred here from a close genome.
  • “predicted” : reaction has been linked to this gene by the pathway-tools algorithm.

If the reaction has no known coding genes but belongs to a pathway predicted to exist in the current organism, a clickable link to the MetaCyc pathway description is given (eg. fourth row of the example).

The “Reset” button deletes all results.

How to use the “Rhea reaction” field?¶

This field allows user to link one ore more metabolic reactions from Rhea to the current edited gene.

  • a: Reactions presented at the top of the field have been manually curated by an annotator.
  • b: A multiple selection list gives quick access to all curated reactions linked to this gene.
  • c: A search box allows one to quickly access Rhea reactions corresponding to either EC numbers from previous EC number field or a given keyword.

Search box :

Clicking on the “EC” button will search all Rhea reactions corresponding to the EC number from the “EC number” field.

The keyword search will look for all Rhea reactions having an identifier, a name, involving a compound name or Chebi identifier similar to the given keyword.

Search result :

Rhea reactions are present in 4 exemplary according to the direction :

  • bidirectional : <=>
  • left to right : =>
  • right to left : <=
  • *unknown (master reaction) : <?>

The search returns a list of Rhea reactions, with :

  • the reaction identifier and name. Identifier is clickable and open the Rhea reaction card. By default, the master reaction is presented. Select the direction wanted in the “direction-select”.

Genes of the organism already linked to this reaction (eg. first row of the example). Genes are flagged with :

The “Reset” button deletes all results

How to link a new reaction :

For each reaction in the result set, check-box allows to add a reaction from the result set to the selected element. All reactions selected in the multiple selection list will be saved as validated and linked to this gene. Unselecting a reaction in this list will remove this link from the curated data.

What are the different “Localization” categories?¶

  • 1 : Unknown
  • 2 : Cytoplasmic
  • 3 : Fimbrial
  • 4 : Flagellar
  • 5 : Inner membrane protein
  • 6 : Inner membrane-associated
  • 7 : Outer membrane protein
  • 8 : Outer membrane-associated
  • 9 : Periplasmic
  • 10 : Secreted
  • 11 : Membrane

What is the “BioProcess” classification?¶

This functional classification is based on the CMR JCVI Role IDs.

This field is optionally filled in during the expert annotation process.

What is the “Roles” classification?¶

This functional classification corresponds to the MultiFun classification which has been developed by Monica Riley for E. coli.

This field is optionally filled in during the expert annotation process.

How to use the “PubMedID” field?¶

The PubMedID or PMID correspond to the index of a publication on the PubMed section of the NCBI website. You can fill this field when you want to link a publication to your annotation. If you want to enter several publications, you simply have to write the PMIDs separated by commas.

You will find the PMID of a publication directly on Pubmed as shown on the figure below. You can also find PMIDs in the “References” section of the UniProt entries.

If this field is filled you will have a direct access to the publications on PubMed by clicking on the PubMed button on top of the Gene annotation editor window.

How to use the “Additional data” field?¶

The Comments field is dedicated to the annotators who want to leave some notes for themselves or for others annotators from the project.

How to use the “Class” field?¶

The Class annotation categories are useful for assigning a “confidence level” to each gene annotation. It has been inspired by the “protein name confidence” defined in PseudoCAP (Pseudomonas aeruginosa community annotation project).

This information is not given by the automatic functional annotation procedure, except in case of functional annotation transfer from a genome being annotated with MaGe.

The different classes are:

  • 1a : Function from experimental evidences in the studied strain
  • 1b : Function from experimental evidences in the studied species
  • 1c : Function from experimental evidences in the studied genus
  • 2a : Function from experimental evidences in other organisms
  • 2b : Function from indirect experimental evidences (e.g. phenotypes)
  • 3 : Putative function from multiple computational evidences
  • 4 : Unknown function but conserved in other organisms
  • 5 : Unknown function

How to choose the “Class” annotation category?¶

Materials and methods

Sampling, genome sequencing, and metagenomic assembly

The sampling scheme, local soil characteristics, and study design that were utilized in this analysis have been previously described [29]. Previously, 60 soil samples were collected at depths of 10–20, 20–30, and 30–40 cm near the respective centers of six 10 m diameter plots, spaced 5 m apart. Sampling plots were spatially arranged in “blocks” of two plots across the meadow, and one of the two plots in each block received extended spring rainfall [31]. Samples were collected over a period of 2 months before and following autumn rainfall, resulting in 10 samples per plot (Fig. 1a). Briefly, DNA was extracted from 10 g of soil for each sample using the PowerMax Soil DNA isolation kit (MoBio Laboratories) from individual soil cores. Metagenomic libraries were prepared and sequenced with 2 × 250 bp paired read sequencing on the Illumina HiSeq2500 platform at the Joint Genome Institute. Reads were quality filtered to a maximimum 200 bp in length using BBduk [34]. Metagenomes were assembled using IDBA_UD [35] and individual genomes were binned using differential coverage binning and a suite of metagenomic binners as previously described [29].

a A bird’s eye view of the grassland meadow located at the Angelo Coast Range Reserve in Mendocino County, California (39° 44′ 17.7″ N, 123° 37′ 48.4″ W). b Histograms of average nucleotide identity (ANI) and alignment coverage (an approximation of % shared gene content) values for all-vs-all comparisons between genomes assembled from the meadow soils. c Relative DNA abundances of the 19 bacterial populations analyzed in this study in each sample, organized by experimental plot where sample was collected, ordered by sample depth from left to right.

Genome dereplication, filtering, and comparison

The 10,538 genome bins previously described obtained from the study site were dereplicated using dRep [36] with the secondary clustering threshold -sa 0.97, and were filtered with CheckM [37] to generate a dereplicated species-level (97% ANI) genome set to be used for read mapping with >70% completeness and <10% contamination. Representative genomes for each species cluster were chosen based on the highest CheckM completeness and lowest contamination using the scoring algorithm described by [36]. For this analysis, we then used the 19 species-level genome clusters with at least 12 replicate genomes that were estimated to be at least >80% complete with <10% contamination, independently assembled and binned out of different samples. Each of the 19 representative genomes was assembled from >100,000 short reads, and millions of reads could be assigned to each population from all 60 samples. Because microbial populations within the 10 g soil samples used for DNA extraction are comprised of orders of magnitude more cells than were sequenced, most read pairs are likely from unique cells (or DNA molecules) in the population. Each genome assembled from each sample was sequenced at around 10× coverage, but meadow-wide coverages for each population ranged from 224× to 908×(Supplementary Table S2).

Open reading frames were predicted using Prodigal [38] and annotated using [1] USEARCH against UniProt [39], Uniref90 [40], and KEGG [41], and [2] HMM-based annotation of proteins using PFAM [42], and antiSMASH 4.0 [43] for biosynthetic gene prediction. PERMANOVA tests for association of ANI matrices with environmental data were run using the adonis2 function with the parameter ‘by’ set to ‘margin’ in the vegan package in R [44]. Multidimensional scaling (MDS) plots were made with the mds function from the smacof package in R, with the ndim parameter set to 4. Hypergeometric tests for statistical enrichment of protein families was performed using HMMER annotated PFAM features in R.

Possible contamination was further removed from representative genomes using the assembled replicate genomes for each species population. The pan-genomic analysis pipeline Roary [45] was run with default settings on the set of genomes for each species to identify protein clusters across genomes. Contigs with at least 50% of their protein clusters found in less than 25% of each genome set were then discarded as potential contamination (generally fewer than 20 contigs, often small, were removed per genome). Therefore, the final set of contigs used in this analysis contained only contigs that reliably assembled and binned independently for a species in multiple samples. Organism DNA relative abundances were calculated using the total number of reads that mapped to each genome in each sample and dividing by the total number of reads per sample.

Read mapping, SNP calling, and nucleotide diversity

All metagenomic reads were mapped using Bowtie 2 [46] with default parameters except for the insert size parameter -X 1000 to an indexed database of all of the 664 dereplicated genomes obtained from the environment, a database which also contains representative genomes from each of our 19 species for study. Reads that mapped uniquely to the representative species of interest were then used for analysis. Read filtering of the resulting BAM files was performed using a custom script,, available at the link under code availability. Mapping files were then filtered for reads that meet the following criteria: (1) both reads in a pair map within 1500 bp of each other to the same scaffold (as a maximum possible end to end insert size), (2) the combined read pair maps with a percent identity of at least 96% to the reference [3], at least one of the read pairs has a mapq score >1, indicating that this is a uniquely best mapping for this read pair in the index. We further compared nucleotide diversity of genes calculated with a 96% ANIr2 cutoff to those at a 98% ANIr2 cutoff and found a strong correlation (Fig. S1). SNPs were then called at frequencies >5% in each population using reads from all samples, using a simple null model that assumes a false discovery rate <

For all population diversity metrics, we used reproducible custom python scripts (available under Code Availability) that calculated metrics, each explained below, from all filtered cross-sample read mappings. For each representative genome in our set of 19, we analyzed its data in samples that passed a cutoff of at least 50% of the genome being covered with at least 5× coverage. Five hundred eighty-six out of 1140 sample genome comparisons (19 genomes × 60 samples) passed this minimum requirement. Base pairs of reads with Phred scores less than 30 were not used in SNP or linkage analyses. Nucleotide diversity was calculated as the expected frequency of a difference between two sequencing reads at a position, equation pi=1 − (A 2 + C 2 + G 2 + T 2 ), where A, C, G, T and the observed proportions of each respective nucleotide. This is equivalent to the definition from [47] calculated separately on each genomic position with at least 5× coverage within each sample, and then averaged across genes. Sample read mappings were pooled by replicates, by plot of origin, by block and origin, and finally by all samples in the meadow, and nucleotide diversity was recalculated on each pooled set of samples. For downstream analyses, nucleotide diversity for all samples (meadow-wide, Fig. 3) and nucleotide diversity within each of the three sampling blocks (block-wide, Fig. 5) was analyzed. To quantify the impact of changing sequencing coverage on nucleotide diversity, we recalculated nucleotide diversity for each genome, subsampling coverage at each genomic position (Fig. S2). We found that the bias in nucleotide diversity due to low sequencing coverage was minimal above 50×, and our block-wide and meadow-wide coverages are often well above this coverage. We see a larger bias in nucleotide diversity when subsampling to only 5x coverage, but this bias is small compared with the biological variation observed between samples, and many of our individual soil samples are over 10×.

SNPs were called on the combined meadow-wide population set of filtered reads. We constructed a simple null model based on an error rate of 0.01% (the Phred 30 cutoff) and simulated simple sampling with replacement to construct estimated rates of erroneous SNP counts at genomic positions of varying coverages. Positions with an alternative allele that occurred with counts that had a false positive rate of

10 −6 with the given coverage of that site in the null model and a minimum allele frequency (MAF) of 5% were called as SNPs. Because meadow-wide coverages were at least 224× (and ranged up to 908×) the MAF cutoff alone would likely be a stringent cutoff for Phred 30 error rates. SNPs were assigned as synonymous or nonsynonymous using a custom BioPython script and the gene calls annotated by Prodigal.

Linkage disequilibrium, FST, and tests for selection

Linkage was calculated within mapped reads for all pairs of segregating sites that were spanned by at least 30 read pairs with high quality base pair data. R 2 and D′ linkages were calculated using formulas described by [48]. The relative rate of recombination to mutation (gamma/mu) for each population was calculated using the mcorr package (cite) on synonymous third position codon sites across filtered mapped reads from all samples. We analyzed 10 out of 19 genomes, as these had normally distributed residuals for the model fit and the bootstrapping mean was within 2× of the final estimate for gamma/mu. FST, (a measure of differences in allele frequencies between two populations), was calculated on sites segregating across both blocks being compared (for all three block comparisons) using the Hudson method [49] as recommended by [50], as implemented in the scikit-allel package [51]. A site had to have a coverage of at least 20× in each block in order to calculate FST, and genes which had coverages in a block outside of the range of two standard deviations were excluded from the analysis. A ratio of averages was then used to determine mean FST for each gene. A two-sample Wilcoxon test was used to determine if average linkage of highly differentiated loci differed from the genomic average for each species, and two-sample t-tests in R [44] were used to determine if average nucleotide diversity of highly differentiated loci differed from the genomic average. Both sets of tests were corrected for multiple hypotheses using the Benjamini–Hochberg [52] method.


Central to RefSeq's mission for prokaryotic genomes is the non-redundant protein data model ( 3, 4). An accession number that begins with ‘WP_’ signifies one RefSeq non-redundant prokaryotic protein sequence, corresponding to identical translations from at least one, but sometimes thousands of coding sequence regions (CDS) annotated on RefSeq genomes. Each RefSeq non-redundant protein represents an Identical Protein Group (IPG) that includes INSDC proteins as well as translations from complete or whole genome shotgun sequencing (WGS) RefSeq assemblies. When a single protein is found on multiple genomes, the DNA of these coding regions may be identical, or may differ by synonymous codon changes. RefSeq accession number WP_000059093.1, ‘chromosomal replication initiator protein DnaA,’ represents the translation of the dnaA gene of Salmonella enterica subsp. enterica serovar Typhimurium str. LT2, for example, whose sequence was reported in 2001 (PMID: 11677609). But the IPG ( also points to CDS feature translations from over 6000 additional annotated genome assemblies (counting both RefSeq and INSDC versions).

NCBI assesses the taxonomic placements of member proteins of an identical protein group, in order to assign a taxon for its non-redundant protein representative. A single WP protein may represent proteins from numerous species because of horizontal gene transfer. Plasmids encoding WP_004201164.1, the ‘subclass B1 metallo-beta-lactamase NDM-1΄, for example, have spread so broadly (reaching Acinetobacter baumannii, Enterobacter cloacae, Escherichia coli, Klebsiella pneumoniae etc.) that RefSeq assigns the suitably broad taxonomic assignment ‘Bacteria’ to this protein and to the Identical Protein Group it represents.

If a new characterization is ready to be published for a previously known protein with an existing Identical Protein Group, it is appropriate to cite the ‘WP_’ accession. Using an accession number to represent a protein and all genes that encode it exactly provides greater clarity than merely describing the protein or naming the gene. Purely descriptive information is much harder to link to actual sequences and often turns out to be ambiguous.

A critical advantage of the non-redundant protein data model is that it allows on-going biocuration and automated updates to functional annotation after the initial run of the PGAP annotation pipeline. As more information about the function of a protein is discovered, RefSeq biocurators can add or improve on protein names that are associated with supporting homology evidence. Updated names then propagate onto matching records. For example, the protein from the locus named Rv0693 of the genome of Mycobacterium tuberculosis H37Rv, sequenced in 1998 ( 5), was annotated as ‘probable coenzyme PQQ synthesis protein E’, a reasonable guess at the time. More recent work now suggests the name ‘mycofactocin radical SAM maturase’ for it ( 6), and for identical gene products from over 9000 annotated genomes. The identical protein group report at is over 200 pages long, with RefSeq annotations first, all identical and correct. In contrast, INSDC records seen in the last 100 pages show the heterogeneities, outdated information, and occasional errors expected in archival data.

RefSeq annotations of protein names, while designed to be terse, are also intended to inform as much as possible, and not merely to identify a protein. In our view, WP_000180747.1 from Helicobacter pylori should not be named ‘cytotoxicity-associated immunodominant antigen,’ which is ambiguous and outdated, nor ‘exotoxin CagA,’ which is specific but only mildly informative. RefSeq chose the name ‘type IV secretion system oncogenic effector CagA.’ This construction of the protein name alerts users to its biomedical importance, its involvement in a complex set of host-pathogen interactions, and its dependence on a whole system of additional proteins for its delivery and function ( 7). We designed the name for this protein as if we were building a hierarchy of protein names in which similarities in names reflect similarities in important characteristics such as molecular function, biological process, and/or evolutionary origin. The annotation that resulted should help those searching for all type IV secretion system effectors, or for all known oncogenic bacterial toxins (CagA is considered the first), and not only those users who already know what CagA is and how it functions.

Outside a very small number of highly studied model organisms, most proteins from prokaryotic genomes lack direct experimental characterization. Their molecular function and/or biological role must be inferred by homology. Frequently, only a very general type of functional name can be applied. RefSeq applies a name to every predicted protein in a genome using hierarchical rules (see below), even if the name is relatively low in information content. A growing class of researchers encounters genes and their protein translations as genomic features first, and only later will see the annotations they carry. Their methods may include basic genomics: inspecting a gene neighborhood, discovering a genomic island by pan-genome comparisons, finding sequence relatedness through BLAST searches, or simply browsing the predicted proteins from a newly sequenced species. Or the methods may involve high-throughput experimentation: RNA-Seq to follow gene expression levels, mass spectroscopy to find sites of post-translational modification, TnSeq to generate lists of genes essential for growth under various conditions, etc. For such users, low-information names are better than none. When viewed together in the context of gene neighborhoods or other groupings, names that convey relatively little information individually may still provide important clues that assist researchers aiming to characterize new systems of genes and their corresponding biological processes ( 8).

RefSeq has been working to make its functional annotation style as consistent as possible with that used by SwissProt/UniProt and preferred by GenBank, agreeing on actual names where possible, and on the guidance for how protein names should be structured in our respective databases. A jointly produced guide to recommended usages in protein naming remains in draft form at this writing, but should be made publicly available by both groups within a few months. This document will explain elements of protein naming as conducted at the respective databases and suggest a style that submitters of annotation to INSDC may use. The actual names that RefSeq applies are tightly linked to the forms of evidence we use to make the name assignments. RefSeq's use of a hierarchy of homology evidence to annotation rules for naming proteins is described in later sections.


Markowitz VM, Chen IM, Palaniappan K, Chu K, Szeto E, Pillay M, et al. IMG 4 version of the integrated microbial genomes comparative analysis system. Nucleic Acids Res. 201442:D560–7.

Reddy TB, Thomas AD, Stamatis D, Bertsch J, Isbandi M, Jansson J, et al. The Genomes OnLine Database (GOLD) v. 5: a metadata management system based on a four level (meta)genome project classification. Nucleic Acids Res. 201543:D1099–106.

Morgulis A, Gertz EM, Schäffer AA, Agarwala R. A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J Comput Biol. 20065:1028–40.

Bland C, Ramsey TL, Sabree F, Lowe M, Brown K, Kyrpides NC, et al. CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics. 20078:209.

Edgar RC. PILER-CR: fast and accurate identification of CRISPR repeats. BMC Bioinformatics. 20078:18.

Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 199725:955–64.

Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol. 20117, e1002195.

Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 200533:D121–4.

Nawrocki EP, Kolbe DL, Eddy SR. Infernal 1.0: inference of RNA alignments. Bioinformatics. 200925:1335–7.

Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 201011:119.

Mukherjee S, Huntemann M, Ivanova N, Kyrpides NC, Pati A. Large-scale contamination of microbial isolate genomes by Illumina PhiX control. Stand Genomic Sci. 201510:18.

Marchler-Bauer A, Anderson JB, Derbyshire MK, DeWeese-Scott C, Gonzales NR, Gwadz M, et al. CDD: a conserved domain database for inter-active domain family analysis. Nucleic Acids Res. 200735:D237–40.

Kanehisa M, Goto S, Sato Y, Kawashima M, Furumichi M, Tanabe M. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res. 201442:D199–205.

Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 201026:2460–1.

Caspi R, Altman T, Billington R, Dreher K, Foerster H, Fulcher CA, et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res. 201442:D459–71.

Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, et al. The Pfam protein families database. Nucleic Acids Res. 201240:D290–301.

Selengut JD, Haft DH, Davidsen T, Ganapathy A, Gwinn-Giglio M, Nelson WC, et al. TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Res. 200735:D260–4.

Jones P, Binns D, Chang HY, Fraser M, Li W, McAnulla C, et al. InterProScan 5: genome-scale protein function classification. Bioinformatics. 201430:1236–40.

Chen IM, Markowitz VM, Chu K, Anderson I, Mavromatis K, Kyrpides NC, et al. Improving microbial genome annotations in an integrated database context. PLoS One. 20138, e54859.

Petersen TN, Brunak S, von Heijne G, Nielsen H. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods. 201010:785–6.

Krogh A, Larsson B, von Heijne G, Sonnhammer EL. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 20013:567–80.

Materials and Methods

Data Sets

The protein and genomic sequences used in this study were downloaded from the National Center for Biotechnology Information (NCBI) database ( Tatusova et al. 2014). For this study, we focused on bacterial species for which at least ten fully sequenced and annotated genomes were available. Such data were available for four pathogenic species that we found were previously reported to be clonal: The MTBC ( Hershberg et al. 2008 Comas and Gagneux 2011 Namouchi et al. 2012), Yersinia pestis ( Achtman et al. 1999 Chain et al. 2004), Corynebacterium pseudotuberculosis ( Connor et al. 2007 Soares et al. 2013), and Francisella tularensis ( Johansson et al. 2004 Nubel et al. 2006 Vogler et al. 2009). In addition, we randomly selected ten pathogenic bacterial species that were not reported to be clonal. Supplementary table S1 , Supplementary Material online, lists all bacterial strains used in our research.

After downloading genetic data of the investigated strains, we used information available on the NCBI to manually verify that these strains were indeed fully sequenced and closed. Strains whose genome sequencing process description did not include finishing steps, such as gap closure, were removed from further analysis to prevent computational bias during gene loss analysis due to incomplete sequencing. In addition, we removed strains that are the result of specific manipulations of other strains in the laboratory and genomes that represent multiple clones taken from a single site rather than different strains. The removed strains are listed in supplementary table S2 , Supplementary Material online.

We combined Escherichia coli and Shigella spp. data sets into one group, as studies indicate that Shigella is a subspecies of E. coli ( Rolland et al. 1998 Pupo et al. 2000). From the 13 Clostridium botulinum strains available on the NCBI database, we analyzed only ten strains that belong to lineage 1 ( supplementary table S1 , Supplementary Material online). This was done because the three remaining strains showed remarkably low amino acid identity (AAI) values when compared with other strains from the same species and even when compared with each other (56–60% AAI). Such low AAI values raise issue with the validity of the claim that these strains belong to the same species as the remaining ten strains.

Classification of the Investigated Bacterial Species into Clonal and Nonclonal

The classification of bacterial species into clonal and nonclonal (or recombining) was initially based on the literature suggesting that the four species we consider clonal are indeed clonal. To further examine this, we wanted to examine whether for each species we see some indication of recombination occurring within their genes. Methods used for detection of the recombinant genes are expected to be sensitive to the length of the investigated sequences. They could also be affected by inclusion of paralogs in the analysis and by differences in the number of sequences used. Therefore, for each bacterial species we analyzed only “core,” single-copy genes (as identified by the pangenomes of these species, see below). We extracted sequences of the core genes from.fna files (files that include annotated nucleotide sequences of all genes in the given organism) of the investigated strains and aligned them using MUSCLE ( Edgar 2004) with default parameters. The 900 central positions of each alignment were chosen for the recombination analysis to prevent computational biases due to the differences in gene length. Genes shorter than 900 bp and genes whose alignment had more than 10% gaps within the 900-bp frame were discarded.

To test whether an alignment of orthologous core genes contained statistically significant evidence of recombination, we applied the Pairwise Homoplasy Index (PHI) test ( Bruen et al. 2006) using the Phipack software package ( Bruen 2005). The following parameters were used: Window size of 100 bp, P value computed from 1,000 permutations. We classified an alignment as “recombinant” when the P value of the permutation test was lower than 0.05 for PHI test and “nonrecombinant” when it was higher than 0.05. Alignments lacking enough informative sites to perform the recombination tests were considered “noninformative.”

It is important to note that we were not attempting to use PHI to examine rates of recombination. Rather, our sole propose was to examine whether some signal of recombination can be found within the sequences of the different genes. A species for which such a signal was very rarely found is likely to be evolving clonally. At the same time, a species for which many genes show a signal of undergoing homologous recombination is likely not evolving entirely clonally.

Determining the Relationship between AAI and Genomic Fluidity for the Investigated Bacterial Species

For each bacterial species/group, we used FASTA ( Pearson and Lipman 1988) to perform pairwise comparisons of all the strains found within the given group. Orthologous sequences were identified by requiring reciprocal best hits. Following the thresholds set by POGO-DB, for a putatively orthologous pair to be included in amino acid identity calculations it had to have an identity threshold of at least 30% identity across at least 70% of the protein length ( Lan et al. 2013). The average amino acid identity (AAI) of a pair of genomes was then calculated as the average percent of AAI across all orthologous protein pairs.

Paralog Removal

There are no clear rules as to how paralogs should be treated when generating a pangenome therefore, their inclusion in the pangenome may generate computational biases. We thus removed all genes that had a paralog within some or all of the genomes analyzed from consideration, prior to generating the pangenomes. To remove paralogs, the protein sequences from each genome were compared against all other proteins within that genome, using FASTA ( Pearson and Lipman 1988). Query sequences that identify within the genome sequences other than themselves were dubbed “paralog sequences” and removed from the file, together with the sequences they recognize. As a threshold for considering sequences as matching we required at least 80% NI between the sequences.

After identifying paralogs we used FASTA ( Pearson and Lipman 1988) to remove all paralog-related sequences from the files as well. Paralog-related sequences are coding gene sequences that have no paralogs in a given genome but identify with the paralog sequences in another genome above the threshold of 80% NI. Because of paralog sequence removal those sequences will be underrepresented in the pangenome and their inclusion in the calculations will generate artifacts. To maximize the sensitivity of the paralog and paralog-related sequences removal step, we performed sequence alignment and comparison at both the protein and nucleotide levels. Only sequences that were found to be nonparalog and nonparalog-related at both the protein and DNA levels were used for pangenome construction.

Pangenome Construction

After removal of all paralog and paralog-related genes, we constructed a library that contains the annotated protein sequence files of each of the considered strains. One of the protein files was chosen randomly as the initial database file and a second file was compared with the initial database using FASTA ( Pearson and Lipman 1988). A query protein sequence that identified a protein sequence within the database was considered to be present in both organisms. A query that did not match any sequence in the database was considered to be absent from the database and was added to the initial database. As a threshold for considering sequences as matching we required 80% NI. After all proteins within the query genome were compared with the database, a third file was compared with the expanded database and the process was repeated over all protein sequence files within the library.

A shorter protein query compared with a longer database protein will be more likely to pass the 80% NI threshold than a longer query protein compared with a short database protein. This may insert biases into the pangenome construction, by which the generated pangenome will change depending on the order in which files were compared with each other. To prevent such artifacts, the pangenome that was generated as described above was compared with itself using FASTA ( Pearson and Lipman 1988). Pangenes that matched pangenes other than themselves above the threshold described above were combined into one pangene.

The results of the FASTA alignments were used to calculate the number of strains in which each pangene was found within a species. The distribution of these numbers was used to generate the species’ pangenome plot (see figs. 2 and 3, left column).

Pangenome plots of ten recombining bacterial species. To generate these plots, all strains within a species were compared and orthologous genes were clustered together into groups. This allows for the calculation of the frequency with which each cluster of orthologous genes (referred to as a “pangene”) is found among members of its species. Depicted are the distributions of frequencies with which pangenes are found within each species. Two plots are provided for each species: Left: Protein pangenome plot—generated based on annotated protein sequences Right: HGT-artifact corrected pangenome plot—the protein pangenome was compared with the full DNA sequences of the strains from which it was generated. Pangenes appearing in less than 50% of strains within a species at the annotated protein level, but in more than 75% of the strains at the whole DNA level, were removed from the plot.

Pangenome plots of ten recombining bacterial species. To generate these plots, all strains within a species were compared and orthologous genes were clustered together into groups. This allows for the calculation of the frequency with which each cluster of orthologous genes (referred to as a “pangene”) is found among members of its species. Depicted are the distributions of frequencies with which pangenes are found within each species. Two plots are provided for each species: Left: Protein pangenome plot—generated based on annotated protein sequences Right: HGT-artifact corrected pangenome plot—the protein pangenome was compared with the full DNA sequences of the strains from which it was generated. Pangenes appearing in less than 50% of strains within a species at the annotated protein level, but in more than 75% of the strains at the whole DNA level, were removed from the plot.

Pangenome plots of four clonal bacterial species. To generate these plots, all strains within a species were compared and orthologous genes were clustered together into groups. This allows for the calculation of the frequency with which each cluster of orthologous genes (referred to as a “pangene”) is found among members of its species. Depicted are the distributions of frequencies with which pangenes are found within each species. Two plots are provided for each species: Left: Protein pangenome plot—generated based on annotated protein sequences Right: HGT-artifact corrected pangenome plot—the protein pangenome was compared with the full DNA sequences of the strains from which it was generated. Pangenes appearing in less than 50% of strains within a species at the annotated protein level, but in more than 75% of the strains at the whole DNA level, were removed from the plot.

Pangenome plots of four clonal bacterial species. To generate these plots, all strains within a species were compared and orthologous genes were clustered together into groups. This allows for the calculation of the frequency with which each cluster of orthologous genes (referred to as a “pangene”) is found among members of its species. Depicted are the distributions of frequencies with which pangenes are found within each species. Two plots are provided for each species: Left: Protein pangenome plot—generated based on annotated protein sequences Right: HGT-artifact corrected pangenome plot—the protein pangenome was compared with the full DNA sequences of the strains from which it was generated. Pangenes appearing in less than 50% of strains within a species at the annotated protein level, but in more than 75% of the strains at the whole DNA level, were removed from the plot.

Gene-Gain (HGT) Analysis

This analysis was carried out to determine whether “rare” pangenes (found in 25% or less genomes) were likely to be the results of gene gain through HGT. To this end, the initial pangenome was compared with the whole DNA sequence files of the considered bacterial strains using the FASTA program TFASTX (which allows for the comparison of a protein query against a DNA sequence) ( Pearson and Lipman 1988). Pangenes that are “rare” due to their being introduced into the species through HGT will be absent from the genomes in which they are not found at the DNA level as well. If, however, a pangene was identified as “rare” based on protein annotations but is found within many additional genomes when compared against whole DNA, it is not likely to be the result of HGT. Rather, such pangenes are very likely to represent misannotations occurring only in one or a few genomes. If according to the whole-genome level TFASTX alignments a pangene was found in over 75% of the strains of a species, but according to the protein-annotation level FASTA alignments it was found in less than 50% of the strains, it was dubbed as an artifact and removed from the initial pangenome. The pangenomes after removal of the possible artifacts are shown in the plots presented in figures 2 and 3, right column.

Estimating the Proportion of Lost “Near Core” Pangenes that Are Maintained as Pseudogenes

For the “near core” pangenes, we asked what proportion of genomes from which these genes were absent at the protein level still contained them at the whole genome DNA level. To this end, the protein sequences of these pangenes were compared with the DNA sequences of the genomes from which they were found to be absent at the annotated protein level using TFASTX.

Generation of RAST-Annotation-Based Pangenomes

To obtain consistent annotations of the investigated bacterial pangenomes, we used the RAST annotation webserver ( Aziz et al. 2008 Overbeek et al. 2014), with default parameters. We then constructed the protein-sequence-based pangenomes, based on these automatically generated annotations, and corrected them for the presence of possible HGT artifacts (as described for the pangenomes generated based on NCBI annotations, see supplementary figs. S1 and Supplementary Data , Supplementary Material online, left columns).


1 (GenBank BI386673 dated 20 May 2003). This sequence was posted to the Cytochrome P450 Homepage in 2005, though it was not recognized as a CYP74 Clan member at that time.

2 Unikonts, bikonts and excavates are three monophyletic divisions of eukaryotes determined by multi-gene tree building methods, using large concatenated sequence alignments [15]. Morphologically unikonts (animals, fungi, Amoebozoa) and bikonts (plants, alveolates, stramenopiles, Rhizaria) have one or two flagella at a specific point in their life cycle. Excavates have a characteristic ventral feeding groove on their cell surface (examples are Jakobids, Euglena, Giardia, trypanosomes). Cavalier–Smith argues that excavates are bikonts [16].

3 Xenambulacraria = hemichordates+echinoderms+xenoturbellarids.

4 The 236 P450 gene count for Branchiostoma (lancelet) seems high compared to other chordates. The paper of Putnam et al. [23] argues for a fairly strict paralogon relationship to vertebrates, so the vertebrate P450s should have an ohnolog precursor in lancelet. A large expansion is not expected. Part of the increase may be due to alleles being counted as different genes. The P450s of lancelet have not been systematically named yet so this possibility cannot be confirmed now. Another possibility is dramatic gene blooms in some clans. Note in figure 2 the very large sector in the CYP2 clan between 08.00 and 09.00 o'clock that looks like a large gene bloom. Blooms of P450s have been discussed by Sezutsu et al. [10].

5 RPL28 occurs in two different genomic locations in lancelet shown in figure 3.

6 Fungi have about 15–20 clans that are not completely defined yet. CYP51 and CYP7 Clan members are seen in fungi. The distribution of 7 Clan members is only found in some filamentous fungi strongly suggesting a lateral transfer from animals to fungi. Therefore, it is shown as an asterisk in figure 4. The other fungal clans are distinct from the animal clans.

8 XM_001627677, XM_001627678, XM_001627679.

10 Fugu Ensembl:ENSTRUG00000004549, zebrafish Ensembl:ENSDARG00000060094.

11 Dates for early animal divergence nodes: choanoflagellates versus all other animals 863 Ma (711–1052) sponges versus ctenophores, Trichoplax, cnidarians and bilaterians 812 Ma (671–985) ctenophores versus Trichoplax, cnidarians and bilaterians 733 Ma (603–893) Trichoplax versus cnidarians and bilaterians 592 Ma (551–696).


Wisedpanichkij R, Grams R, Chaijaroenkul W& Na-Bangchang K

. 2011 Confutation of the existence of sequence-conserved cytochrome P450 enzymes in Plasmodium falciparum . Acta Trop. 119, 19–22.doi:

2011 The Medicago genome provides insight into the evolution of rhizobial symbioses . Nature 480, 520–524.doi:

1996 P450 superfamily: update on new sequences, gene mapping, accession numbers and nomenclature . Pharmacogenetics 6, 1–42.doi:

. 2006 A revised evolutionary history of the CYP1A subfamily: gene duplication, gene conversion, and positive selection . J. Mol. Evol. 62, 708–717.doi:

. 1995 A cluster of cytochrome P450 genes of the CYP6 family in the house fly . DNA Cell Biol. 14, 73–82.doi:

. 1992 Evolution of a highly polymorphic human cytochrome P450 gene cluster: CYP2D6 . Genomics 14, 49–58.doi:

Kubota A, Stegeman JJ, Goldstone JV, Nelson DR, Kim EY, Tanabe S& Iwata H

. 2011 Cytochrome P450 CYP2 genes in the common cormorant: evolutionary relationships with 130 diapsid CYP2 clan sequences and chemical effects on their expression . Comp. Biochem. Physiol. Toxicol. Pharmacol. 153, 280–289.doi:

Nelson DR, Zeldin DC, Hoffman SM, Maltais LJ, Wain HM& Nebert DW

. 2004 Comparison of cytochrome P450 (CYP) genes from the mouse and human genomes, including nomenclature recommendations for genes, pseudogenes and alternative-splice variants . Pharmacogenetics 14, 1–18.doi:

Sezutsu H, Le Goff G& Feyereisen R

. 2013 Origins of P450 diversity . Phil. Trans. R. Soc. B 368, 20120428.doi:

. 1999 Cytochrome P450 and the individuality of species . Arch. Biochem. Biophys. 369, 1–10.doi:

. 1998 Metazoan cytochrome P450 evolution . Comp. Biochem. Physiol. 121, 15–22. Google Scholar

. 2003 Comparison of P450s from human and fugu: 420 million years of vertebrate P450 evolution . Arch. Biochem. Biophys. 409, 18–24.doi:

Lee DS, Nioche P, Hamberg M& Raman CS

. 2008 Structural insights into the evolutionary paths of oxylipin biosynthetic enzymes . Nature 455, 363–368.doi:

Burki F, Shalchian-Tabrizi K& Pawlowski J

. 2008 Phylogenomics reveals a new ‘megagroup’ including most photosynthetic eukaryotes . Biol. Lett. 4, 366–369.doi:

. 2006 Cell evolution and Earth history: stasis and revolution . Phil. Trans. R. Soc. B 361, 969–1006.doi:

. 2012 Rooting the eukaryotic tree with mitochondrial and bacterial proteins . Mol. Biol. Evol. 29, 1277–1289.doi:

Yoshida Y, Aoyama Y, Noshiro M& Gotoh O

. 2000 Sterol 14-demethylase P450 (CYP51) provides a breakthrough for the discussion on the evolution of cytochrome P450 gene superfamily . Biochem. Biophys. Res. Commun. 273, 799–804.doi:

Rezen T, Debeljak N, Kordis D& Rozman D

. 2004 New aspects on lanosterol 14alpha-demethylase and cytochrome P450 evolution: lanosterol/cycloartenol diversification and lateral transfer . J. Mol. Evol. 59, 51–58.doi:

. 2002 The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification . Int. J. Syst. Evol. Microbiol. 52, 7–76. Crossref, PubMed, ISI, Google Scholar

Catchen JM, Conery JS& Postlethwait JH

. 2009 Automated identification of conserved synteny after whole-genome duplication . Genome Res. 19, 1497–1505.doi:

Housworth EA& Postlethwait J

. 2002 Measures of synteny conservation between species pairs . Genetics 162, 441–448. PubMed, Google Scholar

2008 The amphioxus genome and the evolution of the chordate karyotype . Nature 453, 1064–1071.doi:

Verslycke T, Goldstone JV& Stegeman JJ

. 2006 Isolation and phylogeny of novel cytochrome P450 genes from tunicates (Ciona spp.): a CYP3 line in early deuterostomes? Mol. Phylogenet. Evol. 40, 760–771.doi:

. 1970 Evolution by gene duplication . Berlin, Germany : Springer . Crossref, Google Scholar

. 2000 Robustness: it's not where you think it is . Nat. Genet. 25, 3–4.doi:

. 2007 The zebrafish genome in context: ohnologs gone missing . J. Exp. Zool. B Mol. Dev. Evol. 308, 563–577.doi:

Postlethwait J, Amores A, Cresko W, Singer A& Yan YL

. 2004 Subfunction partitioning, the teleost radiation and the annotation of the human genome . Trends Genet. 20, 481–490.doi:

. 2005 Two rounds of whole genome duplication in the ancestral vertebrate . PLoS Biol. 3, e314.doi:

. 2005 From 2R to 3R: evidence for a fish-specific genome duplication (FSGD) . BioEssays 27, 937–945.doi:

Siegel N, Hoegg S, Salzburger W, Braasch I& Meyer A

. 2007 Comparative genomics of ParaHox clusters of teleost fishes: gene cluster breakup and the retention of gene sets following whole genome duplications . BMC Genomics 8, 312.doi:

. 2008 Insights into cyclostome phylogenomics: pre-2R or post-2R . Zool. Sci. 25, 960–968.doi:

Kuraku S, Meyer A& Kuratani S

. 2009 Timing of genome duplications relative to the origin of the vertebrates: did cyclostomes diverge before or after? Mol. Biol. Evol. 26, 47–59.doi:

2009 Assessing the root of bilaterian animals with scalable phylogenomic methods . Proc. R. Soc. B 276, 4261–4270.doi:

. 2012 Phylogenetic and functional analyses of the cytochrome P450 family 4 . Mol. Phylogenet. Evol. 62, 458–471.doi:

Edgecombe GD, Giribet G, Dunn CW, Hejnol A, Kristensen RM, Neves RC, Rouse GW, Worsaae K& Sorensen MV

. 2011 Higher-level metazoan relationships: recent progress and remaining questions . Org. Divers. Evol. 11, 151–172.doi:

Xian-Guang H, Aldridge RJ, Siveter DJ, Siveter DJ& Xiang-hong F

. 2002 New evidence on the anatomy and phylogeny of the earliest vertebrates . Proc. R. Soc. Lond. B 269, 1865–1869.doi:

Xian-Guang H, Aldridge RJ, Bergström J, Siveter DJ, Siveter DJ& Xiang-hong F

. 2003 The Cambrian fossils of Chengjiang, China: the flowering of early animal life . London, UK : Blackwell Publishing Ltd . Crossref, Google Scholar

Zhu M, Zhao W, Jia L, Lu J, Qiao T& Qu Q

. 2009 The oldest articulated osteichthyan reveals mosaic gnathostome characters . Nature 458, 469–474.doi:

Sansom IJ, Smith MM& Smith MP

. 1996 Scales of thelodont and shark-like fishes from the Ordovician of Colorado . Nature 379, 628–630.doi:

. 2009 Animals (Metazoa) . In The timetree of life (eds

), pp. 223–230. Oxford, UK : Oxford University Press . Google Scholar

Hedges SB, Blair JE, Venturi ML& Shoe JL

. 2004 A molecular timescale of eukaryote evolution and the rise of complex multicellular life . BMC Evol. Biol. 4, 2.doi:

Hoffman PF, Kaufman AJ, Halverson GP& Schrag DP

. 1998 A neoproterozoic snowball earth . Science 281, 1342–1346.doi:

. 2008 Snowball Earth: status and new developments . GEO (IGC Special Climate Issue) 11, 44–46. Google Scholar

Coulier F, Popovici C, Villet R& Birnbaum D

. 2000 MetaHox gene clusters . J. Exp. Zool. 288, 345–351.doi:

Ryan JF, Pang K, Mullikin JC, Martindale MQ& Baxevanis AD

. 2010 The homeodomain complement of the ctenophore Mnemiopsis leidyi suggests that Ctenophora and Porifera diverged prior to the ParaHoxozoa . Evodevo 1, 9.doi:

Pett W, Ryan JF, Pang K, Mullikin JC, Martindale MQ, Baxevanis AD& Lavrov DV

. 2011 Extreme mitochondrial evolution in the ctenophore Mnemiopsis leidyi: insight from mtDNA and the nuclear genome . Mitochondrial DNA 22, 130–142.doi:

2008 Broad phylogenomic sampling improves resolution of the animal tree of life . Nature 452, 745–749.doi:

2005 The zebrafish gene map defines ancestral vertebrate chromosomes . Genome Res. 15, 1307–1314.doi:

Kevei Z, Seres A, Kereszt A, Kalo P, Kiss P, Toth G, Endre G& Kiss GB

. 2005 Significant microsynteny with new evolutionary highlights is detected between Arabidopsis and legume model plants despite the lack of macrosynteny . Mol. Genet. Genomics 274, 644–657.doi:

2012 Extensive conservation of ancient microsynteny across metazoans due to cis-regulatory constraints . Genome Res. (doi:10.1101/gr.139725.112). Crossref, Google Scholar

. 2003 2R or not 2R: testing hypotheses of genome duplication in early vertebrates . J. Struct. Funct. Genom. 3, 85–93.doi:

. 2003 The temporal distribution of gene duplication events in a set of highly conserved human gene families . Mol. Biol. Evol. 20, 154–161. Crossref, PubMed, Google Scholar

Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM& Haussler D

. 2002 The human genome browser at UCSC . Genome Res. 12, 996–1006.doi:

2011 The UCSC genome browser database: update 2011 . Nucleic Acids Res. 39, D876–D882.doi:

2009 The UCSC genome browser database: update 2009 . Nucleic Acids Res. 37, D755–D761.doi:

. 2002 BLAT: the BLAST-like alignment tool . Genome Res. 12, 656–664.doi:

2008 The Trichoplax genome and the nature of placozoans . Nature 454, 955–960.doi:

2007 Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization . Science 317, 86–94.doi:

2011 Using the Acropora digitifera genome to understand coral responses to environmental change . Nature 476, 320–323.doi:

2006 The chemical defensome: environmental sensing and response genes in the Strongylocentrotus purpuratus genome . Dev. Biol. 300, 366–384.doi:

. 2008 Environmental sensing and response genes in Cnidaria: the chemical defensome in the sea anemone Nematostella vectensis . Cell Biol. Toxicol. 24, 483–502.doi:

Goldstone JV, McArthur AG, Kubota A, Zanette J, Parente T, Jonsson ME, Nelson DR& Stegeman JJ

. 2010 Identification and developmental expression of the full complement of Cytochrome P450 genes in Zebrafish . BMC Genom. 11, 643.doi:

Baldwin WS, Marko PB& Nelson DR

. 2009 The cytochrome P450 (CYP) gene superfamily in Daphnia pulex . BMC Genom. 10, 169.doi:

. 2006 Evolution of insect P450 . Biochem. Soc. Trans. 34, 1252–1255.doi:

. 2011 Accelerated profile HMM searches . PLoS Comput. Biol. 7, e1002195.doi:

Johnson LS, Eddy SR& Portugaly E

. 2010 Hidden Markov model speed heuristic and iterative HMM search procedure . BMC Bioinformatics 11, 431.doi:

. 1998 Profile hidden Markov models . Bioinformatics 14, 755–763.doi:

. 2004 MUSCLE: a multiple sequence alignment method with reduced time and space complexity . BMC Bioinformatics 5, 113.doi:

. 2006 RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models . Bioinformatics 22, 2688–2690.doi:

Ott M, Zola J, Aluru S& Stamatakis A

. 2007 Large-scale maximum likelihood-based phylogenetic analysis on the IBM BlueGene/L . ACM/IEEE Supercomputing conference 2007 (ed.

), New York, NY : Association for Computing Machinery . Google Scholar

. 2006 Phylogenetic models of rate heterogeneity: a high performance computing perspective. In Proceedings of 20th IEEE/ACM Intl Parallel and Distributed Processing Symposium, Rhodos, Greece. Piscataway, NJ: IEEE. Google Scholar

Stamatakis A, Hoover P& Rougemont J

. 2008 A rapid bootstrap algorithm for the RAxML web servers . Syst. Biol. 57, 758–771.doi:

. 2003 Applying the bootstrap in phylogeny reconstruction . Stat. Sci. 18, 256–267. Crossref, ISI, Google Scholar

. 2011 Arthropod CYPomes illustrate the tempo and mode in P450 evolution . Biochim. Biophys. Acta 1814, 19–28.doi:

. 2007 Rapid birth-death evolution specific to xenobiotic cytochrome P450 genes in vertebrates . PLoS Genet. 3, e67.doi:

2008 CYP3 phylogenomics: evidence for positive selection of CYP3A4 and CYP3A7 . Pharmacogenet. Genom. 18, 53–66.doi:

Brenner S, Elgar G, Sandford R, Macrae A, Venkatesh B& Aparicio S

. 1993 Characterization of the pufferfish (Fugu) genome as a compact model vertebrate genome . Nature 366, 265–268.doi:

Avadhani NG, Sangar MC, Bansal S& Bajpai P

. 2011 Bimodal targeting of cytochrome P450s to endoplasmic reticulum and mitochondria: the concept of chimeric signals . FEBS J. 278, 4218–4229.doi:

Bhagwat SV, Biswas G, Anandatheerthavarada HK, Addya S, Pandak W& Avadhani NG

. 1999 Dual targeting property of the N-terminal signal sequence of P4501A1. Targeting of heterologous proteins to endoplasmic reticulum and mitochondria . J. Biol. Chem. 274, 24 014–24 022.doi:

Neve EP& Ingelman-Sundberg M

. 2001 Identification and characterization of a mitochondrial targeting signal in rat cytochrome P450 2E1 (CYP2E1) . J. Biol. Chem. 276, 11 317–11 322.doi:

. 2011 Origin and evolution of the ligand-binding ability of nuclear receptors . Mol. Cell. Endocrinol. 334, 21–30.doi:

. 2005 Xenobiotics and the evolution of multicellular animals: emergence and diversification of ligand-activated transcription factors . Integr. Comp. Biol. 45, 172–178.doi:

Schierwater B, Kamm K, Srivastava M, Rokhsar D, Rosengarten RD& Dellaporta SL

. 2008 The early ANTP gene repertoire: insights from the placozoan genome . PloS ONE 3, e2457.doi:

Johnson NS, Yun SS, Thompson HT, Brant CO& Li W

. 2009 A synthesized pheromone induces upstream movement in female sea lamprey and summons them into traps . Proc. Natl Acad. Sci. USA 106, 1021–1026.doi:

Hernandez RE, Putzke AP, Myers JP, Margaretha L& Moens CB

. 2007 Cyp26 enzymes generate the retinoic acid response pattern necessary for hindbrain development . Development 134, 177–187.doi:

Emoto Y, Wada H, Okamoto H, Kudo A& Imai Y

. 2005 Retinoic acid-metabolizing enzyme Cyp26a1 is essential for determining territories of hindbrain and spinal cord in zebrafish . Dev. Biol. 278, 415–427.doi:

McCaffery P, Wagner E, O'Neil J, Petkovich M& Drager UC

. 1999 Dorsal and ventral retinal territories defined by retinoic acid synthesis, break-down and nuclear receptor expression . Mech. Dev. 82, 119–130.doi:

Kahn RA, Volpicelli-Daley L, Bowzard B, Shrivastava-Ranjan P, Li Y, Zhou C& Cunningham L

. 2005 Arf family GTPases: roles in membrane traffic and microtubule dynamics . Biochem. Soc. Trans. 33, 1269–1272.doi:

. 2005 The Arf-like GTPase Arl1 and its role in membrane traffic . Biochem. Soc. Trans. 33, 601–605.doi:

Kahn RA, Cherfils J, Elias M, Lovering RC, Munro S& Schurmann A

. 2006 Nomenclature for the human Arf family of GTP-binding proteins: ARF, ARL, and SAR proteins . J. Cell Biol. 172, 645–650.doi:

Nebert D, Wikvall K& Miller WL

. 2013 Human cytochromes P450 in health and disease . Phil. Trans. R. Soc. B 368, 20120431.doi:

Markov GV, Tavares R, Dauphin-Villemant C, Demeneix BA, Baker ME& Laudet V

. 2009 Independent elaboration of steroid hormone signaling pathways in metazoans . Proc. Natl Acad. Sci. USA 106, 11 913–11 918.doi:

. 2007 Presence of sex steroids and cytochrome P450 genes in amphioxus . Endocrinology 148, 3554–3565.doi:

Castro LF, Santos MM& Reis-Henriques MA

. 2005 The genomic environment around the Aromatase gene: evolutionary insights . BMC Evol. Biol. 5, 43.doi:

Nebert DW, Nelson DR& Feyereisen R

. 1989 Evolution of the cytochrome P450 genes . Xenobiotica 19, 1149–1160.doi:

. 2005 The genesis and evolution of homeobox gene clusters . Nat. Rev. Genet. 6, 881–892.doi:

Larroux C, Fahey B, Degnan SM, Adamski M, Rokhsar DS& Degnan BM

. 2007 The NK homeobox gene cluster predates the origin of Hox genes . Curr. Biol. 17, 706–710.doi:

Mulley JF, Chiu CH& Holland PW

. 2006 Breakup of a homeobox cluster after genome duplication in teleosts . Proc. Natl Acad. Sci. USA 103, 10 369–10 372.doi:

. 1995 Dorsoventral axis inversion . Nature 373, 110–111.doi:

. 1996 Dorsoventral axis inversion: a phylogenetic perspective . BioEssays 18, 251–254.doi:

. 1822 Considérations générales sur la vertébré . Mém. Mus. Hist. Nat. 9, 89–119. Google Scholar

. 2006 A molecular time-scale for eukaryote evolution recalibrated with the continuous microfossil record . Proc. R. Soc. B 273, 1867–1872.doi:

. 2010 The invasion of the land by plants: when and where? New Phytol. 188, 306–309. Crossref, PubMed, ISI, Google Scholar

Wellman CH, Osterloff PL& Mohiuddin U

. 2003 Fragments of the earliest land plants . Nature 425, 282–285.doi:

Vinogradov SN, Hoogewijs D& Arredondo-Peter R

. 2011 What are the origins and phylogeny of plant hemoglobins .? Commun. Integr. Biol. 4, 443–445.doi:

Ghoshroy S, Binder M, Tartar A& Robertson DL

. 2010 Molecular evolution of glutamine synthetase II: phylogenetic evidence of a non-endosymbiotic gene transfer event early in plant evolution . BMC Evol. Biol. 10, 198.doi:

Emiliani G, Fondi M, Fani R& Gribaldo S

. 2009 A horizontal gene transfer at the origin of phenylpropanoid metabolism: a key adaptation of plants to land . Biol. Direct. 4, 7.doi:

. 2010 Phenylpropanoid biosynthesis . Mol. Plant 3, 2–20.doi:

Batard Y, Schalk M, Pierrel MA, Zimmerlin A, Durst F& Werck-Reichhart D

. 1997 Regulation of the cinnamate 4-hydroxylase (CYP73A1) in Jerusalem artichoke tubers in response to wounding and chemical treatments . Plant Physiol. 113, 951–959. Crossref, PubMed, Google Scholar

. 2006 Intermediary metabolism in sea urchin: the first inferences from the genome sequence . Dev. Biol. 300, 282–292.doi:

Kaneko M, Ohnishi Y& Horinouchi S

. 2003 Cinnamate:coenzyme A ligase from the filamentous bacterium Streptomyces coelicolor A3(2) . J. Bacteriol. 185, 20–27.doi:

Yin L, Zhu M, Knoll AH, Yuan X, Zhang J& Hu J

. 2007 Doushantuo embryos preserved inside diapause egg cysts . Nature 446, 661–663.doi:

Chernikova D, Motamedi S, Csuros M, Koonin EV& Rogozin IB

. 2011 A late origin of the extant eukaryotic diversity: divergence time estimates using rare genomic changes . Biol. Direct. 6, 26.doi:

Rozman D, Stromstedt M, Tsui LC, Scherer SW& Waterman MR

. 1996 Structure and mapping of the human lanosterol 14alpha-demethylase gene (CYP51) encoding the cytochrome P450 involved in cholesterol biosynthesis comparison of exon/intron organization with other mammalian and fungal CYP genes . Genomics 38, 371–381.doi:

Lamb DC, Kelly DE& Kelly SL

. 1998 Molecular diversity of sterol 14alpha-demethylase substrates in plants, fungi and humans . FEBS Lett. 425, 263–265.doi:


Broad-spectrum, diffusible antifungal activity produced by CaCal35

In no-contact confrontation assays on PDA plates, CaCal35 inhibited colony expansion of A. niger, Fusarium oxysporum f.sp. lycopersici (Fol), F. oxysporum f.sp. fragariae (Fof), Rhizoctonia solani, Sclerotium rolfsii, Sclerotinia sclerotiorum, Verticillium dahliae and Magnaporthe grisea (Fig. 1, panels A2-H2), compared to control plates (fungus only, Fig. 1, panels A1-H1). For A. niger, this antifungal activity of CaCal35 (Fig. 2B, compared to 2A, fungus only) was shown to be diffusible and expressed in the absence of fungus, as follows. After culturing CaCal35 in the presence (Fig. 2B) or absence (Fig. 2D) of A. niger on PDA agar plates for 3 days, the sections of agar containing bacteria and fungi were removed using a sterile surgical blade. The remaining sections of agar were inoculated with A. niger spores at different distances relative to where CaCal35 had been growing on the plate. Compared to the control (Fig. 2A), A. niger was unable to grow on agar on which CaCal35 had been previously and proximally grown in the presence of A. niger (Fig. 2C), suggesting that the antifungal activity of CaCal35 was water-soluble and diffusible through agar. We also observed that growth of A. niger was inhibited on agar on which CaCal35 had been previously grown in the absence of A. niger and that the degree of inhibition decreased as a function of the distance between A. niger and CaCal35 (Fig. 2E–G). This observation is consistent with diffusion and dilution of the antifungal activity away from its source, CaCal35. We postulate that this antifungal activity represents a secondary metabolite that will be referred to hereafter as carenaemin.

Biocontrol of tomato Fusarium wilt by CaCal35 in combination with BvFZB42

Notwithstanding the observed antifungal activity of CaCal35 against Fol on agar plates (Fig. 1, panel B2), greenhouse-grown tomato plants were not protected from Fol-induced loss of biomass (Fig. 3A) or vascular discoloration (Fig. 3B) by prior root inoculation with CaCal35. Under the same conditions, BvFZB42 did not protect tomato plants from Fol either (Fig. 3A and B). However, when mixed together, CaCal35 and BvFZB42 (referred to as ‘Colli42’) significantly reduced (P < 0.05) Fol-induced loss of biomass (Fig. 3A) as well as vascular discoloration (Fig. 3B). Fol-challenged plants that were Colli42-treated looked like unchallenged (i.e. no-Fol) plants (Fig. S1D and A, respectively), whereas Fol-challenged plants that were treated with CaCal35 or BvFZB42 looked like untreated Fol-challenged plants (Fig. S1C, E, and B respectively).

Protection by Colli42 was also observed in an experimental Fol-infested field of tomato plants: only the Colli42 treatment significantly (P < 0.05) reduced Fol-induced vascular discoloration, whereas CaCal35 by itself or BvFZB42 by itself did not (Fig. 4A). Colli42 also returned the highest marketable yield of all treatments (i.e. 37.46 US tons of red tomato fruit per acre), but the differences between treatments were not significant (Fig. 4B).

Isolation and characterization of CaCal35 transposon mutants with reduced antifungal activity

To identify the gene(s) underlying the observed antifungal and biocombicontrol activity of CaCal35, approximately 6,500 random EZ-Tn5 transposon insertion mutants of CaCal35 were generated and screened for reduced antifungal activity against A. niger in a bottomless 96-well microtitre plate set-up (see Experimental Procedures). Two such mutants, designated 46A06 and 60C09, were identified (Fig. S2). We confirmed in a standard confrontation assay that both mutants had lost the ability to inhibit hyphal growth of A. niger (Fig. 1, panels A3 and A4). Mixing the two mutants together did not result in restoration to wild-type levels of inhibition (Fig. 1, panel A5). Also, A. niger appeared unaffected after inoculation onto a PDA plate on which either 46A06 or 60C09 was previously grown (Fig. 2H–I and J–K respectively).

Mutants 46A06 and 60C09 had also lost, completely or partially, the ability to inhibit Fol, Fof, R. solani, S. rolfsii, S. sclerotiorum and V. dahliae (Fig. 1, B3-G3 and B4-G4 respectively). For none of these fungi did the two mutants achieve wild-type levels of fungal inhibition when mixed together (Fig. 1, B5-F5). M. grisea was the only fungus that was still inhibited by mutant 46A06 as well as 60C09 (Fig. 1, H3 and H4, respectively), suggesting that the transposon insertions in these mutants did not impact CaCal35’s suppression of M. grisea.

In greenhouse experiments, a mixture of BvFZB42 and mutant 46A06 did not significantly protect tomato plants from Fol-induced loss of biomass, and neither did a mixture of wild-type CaCal35 and the BvFZB42 mutant AK3, or a mixture of the two mutants (46A06 and AK3) (Fig. 5A). The reduction in vascular discoloration that was observed with these mixtures was less than that achieved with Colli42, but not significantly (Fig. 5B).

Genetic characterization of mutant 46A06

Genomic flank sequencing of the transposon in mutant 46A06 revealed the insertion site as 5’-GTGTACGGA-3’ inside locus LT85_RS01670 (Fig. S3A–C). This gene is part of a 76.95-kb region of the CaCal35 genome (NCBI coordinates 365,248-442,201 of NZ_CP009962.1). This region harbours at least 35 predicted genes (Table 1) and was tagged by antiSMASH as containing a hybrid PKS-NRPS gene cluster. No hybrid PKS-NRPS gene clusters like it were found on the published genomes of other Collimonas strains, including C. fungivorans Ter6 and Ter331, C. pratensis Ter91 and Ter291, C. arenae Ter10 and Ter282, and Collimonas sp. ES_Al-A1, PA-H2, OK607, OK242, OK412, OK307, and UBA1456. However, we were able to identify gene clusters with similar synteny and predicted function on the genomes of Chromobacterium violaceum ATCC 53434 (Parker et al., 1988 Singh et al., 1988 ), Chitinimonas koreensis DSM 17726 (Kim et al., 2006 ), Paraburkholderia megapolitana LMG 23650 (Vandamme et al., 2007 ) and Paraburkholderia acidophila ATCC 31363 (Horsman et al., 2017 ). The similarity appeared to be limited to a stretch of genes corresponding to LT85_RS01720 → LT85_RS01665 in CaCal35, which includes the gene carrying the transposon insertion in 46A06, i.e. LT85_RS01670 (Fig. S4).

NCBI locus tag Predicted protein NCBI protein ID
LT85_RS01600 Acetyl-CoA C-acyltransferase WP_038484473.1
LT85_RS01615 SDR family oxidoreductase WP_038484482.1
LT85_RS01625 Hypothetical protein WP_038484490.1
LT85_RS01630 DGQHR domain-containing protein WP_156117397.1
LT85_RS01635 LysE family translocator WP_038484493.1
LT85_RS01640 Lrp/AsnC family transcriptional regulator WP_038484496.1
LT85_RS01645 VOC family protein WP_038484499.1
LT85_RS01650 MarR family transcriptional regulator WP_038484502.1
LT85_RS01655 TCR/Tet family MFS transporter WP_052135471.1
LT85_RS01660 Deaminated glutathione amidase WP_172656931.1
LT85_RS01665 Hypothetical protein WP_156117398.1
LT85_RS01670 Polysaccharide pyruvyl transferase family protein WP_038484508.1
LT85_RS01675 4'-phosphopantetheinyl transferase superfamily protein WP_052134537.1
LT85_RS01680 Thioesterase WP_052134540.1
LT85_RS01685 Polyunsaturated fatty acid/polyketide biosynthesis protein WP_052134542.1
LT85_RS01690 Hypothetical protein WP_038484511.1
LT85_RS01695 Type I polyketide synthase WP_038484514.1
LT85_RS01700 Non-ribosomal peptide synthetase WP_052134544.1
LT85_RS01715 Hybrid non-ribosomal peptide synthetase/type I polyketide synthase WP_081991927.1
LT85_RS01720 Hybrid non-ribosomal peptide synthetase/type I polyketide synthase WP_038484528.1
LT85_RS01725 Helix-turn-helix transcriptional regulator WP_038484531.1
LT85_RS01730 SDR family oxidoreductase WP_081991930.1
LT85_RS01735 LysR family transcriptional regulator WP_052134547.1
LT85_RS01740 alpha/beta hydrolase WP_038484534.1
LT85_RS25150 Pirin family protein WP_009049067.1
LT85_RS01750 NmrA/HSCARG family protein WP_017604182.1
LT85_RS01755 Phosphoesterase WP_052134552.1
LT85_RS01760 c-type cytochrome WP_052135472.1
LT85_RS01765 Branched-chain amino acid transferase WP_038484543.1
LT85_RS01775 NAD(P)-dependent alcohol dehydrogenase WP_038484549.1
LT85_RS01780 AraC family transcriptional regulator WP_038494762.1
LT85_RS01785 LysR family transcriptional regulator WP_038484552.1
LT85_RS01790 SDR family oxidoreductase WP_038484555.1
LT85_RS01795 SnoaL-like domain-containing protein WP_038484558.1
LT85_RS01800 DEAD/DEAH box helicase WP_081991932.1
  • a Highlighted in bold is the gene (LT85_RS01670) that carried the transposon insertion in 46A06. This gene is part of the cplABCDEFGHIJK supraoperon (where cpl stands for carenaemin production locus, LT85_RS01720 → LT85_RS01660) which is highlighted in grey.

LT85_RS01670 is the ninth in a predicted supraoperon of 11 genes (LT85_RS01720 → LT85_RS01660 Fig. 6A, Table S1), which we refer to as cplABCDEFGHIJK (where cpl stands for carenaemin production locus). According to the ‘co-linearity rule’ (Fischbach and Walsh, 2006 ), the first five genes of this supraoperon (cplABCDE) were predicted to code for the production of a secondary metabolite featuring a core scaffold of three amino acid residues, with the first and second amino acids linked through a peptide bond and the second and third by a -CH2-CHOH- moiety (Fig. 6B and C). The middle amino acid is predicted to be a serine, based both on NRPSPredictor2 and on the Stachelhaus sequence (i.e. DVWHFSLVDK) of the corresponding adenylation domain in the cplB gene product. The Stachelhaus sequences of the other two adenylation domains (DIWQFGLILK in cplA and DALFMGGVIK in cplC) were most similar (but not identical) to those reported for glutamine (DAWQFGLIDK) and leucine (DALVMGAVMK) respectively (Rausch et al., 2005 ). The presence of an epimerization domain in the predicted cplC product suggested that the third amino acid exists in D-configuration in the structure. The presence of a ketosynthase/malonyltransferase domain and an aminotransferase domain at the N-terminal end of the cplA gene product puts a predicted R1-CNH2-CH2- group at one end of the secondary metabolite (which generates a glycine amino acid), with a -CO-R2 group predicted at the other end, based on the ketosynthase/malonyltransferase domain in cplD. The chemical formula of the predicted core structure shown in Fig. 6C, not including R1 and R2, is C21H35O8N5, with a predicted mass of 485.53 Da.

The LT85_RS01670 gene (i.e. cplI) which carried the transposon insertion in mutant 46A06 is annotated to code for a polysaccharide pyruvyl transferase family protein (accession number WP_081991923). Pyruvyl transferases catalyse the transfer of the pyruvate moiety to a saccharide target (Hager et al., 2019 ). Proteins belonging to this functional family include CsaB (Fujinami and Ito, 2018 Hager et al., 2018 ), WcaK (Stevenson et al., 1996 ), AmsJ (Wang et al., 2012 ) and PssM (Ivashina et al., 2010 ), which are all involved in the production of exopolysaccharides (EPS), including secondary cell wall polymers in Bacillus species, colanic acid in Escherichia coli, amylovoran in Erwinia and EPS in Rhizobium leguminosarum respectively. Immediately downstream of cplI and part of the same predicted operon as cplI are cplJ (LT85_RS01665) and cplK (LT85_RS01660), annotated as a hypothetical protein and a deaminated glutathione amidase respectively.

Genetic characterization of 60C09

In mutant 60C09, the EZ-Tn5 transposon was found inserted into the sequence 5’-GTATAAGCA-3’, where TAA is the stop codon for gene LT85_RS04520 (Fig. S3D–F). Multiple genes in the immediate vicinity of LT85_RS04520 (Fig. 6B Table 2) are predicted to contribute to the so-called Wzy-dependent pathway for synthesis and incorporation of oligosaccharide repeat units into cell surface lipopolysaccharides (Kalynych et al., 2014 ). This pathway involves the initial transfer of a sugar moiety to a lipid carrier (probably catalysed by the product of LT85_RS04530), the addition of sugar groups by glycosyltransferases (LT85_RS25805, LT85_RS04525) to generate an oligosaccharide repeat unit. Individual units are transported into the periplasm by a Wzx flippase (LT85_RS04500), where they undergo polymerization by a Wzy polymerase (LT85_RS25800) to yield a polysaccharide that then may be transferred to the inner-core oligosaccharide of lipid A by WaaL ligase (LT85_RS04650) and transported to the cell surface.

NCBI locus tag Predicted protein NCBI protein ID
LT85_RS04480 dTDP-glucose 4,6-dehydratase (RmlB) WP_038485799.1
LT85_RS04485 Glucose-1-phosphate thymidylyltransferase (RmlA) WP_038485802.1
LT85_RS04490 dTDP-4-dehydrorhamnose 3,5-epimerase (RmlC) WP_038485806.1
LT85_RS04495 dTDP-4-dehydrorhamnose reductase (RmlD) WP_038485809.1
LT85_RS04500 flippase WP_081992030.1
LT85_RS25800 Oligosaccharide repeat unit polymerase WP_081992033.1
LT85_RS25805 dTDP-L-rhamnosyltransferase WP_081992035.1
LT85_RS04515 dtdp-L-rhamnose 4 epimerase (Tle) WP_038495031.1
LT85_RS04520 Serine acetyltransferase WP_038485818.1
LT85_RS04525 dTDP-6-deoxy-L-talosyltransferase WP_038485820.1
LT85_RS04530 Lipid carrier: UDP-N-acetylgalactosaminyltransferase WP_038485823.1
LT85_RS04535 UDP-N-acetylglucosamine 4,6-dehydratase WP_052134643.1
LT85_RS04540 H-NS histone family protein WP_038485826.1
LT85_RS04545 Phosphomannomutase/phosphoglucomutase WP_038485829.1
LT85_RS04550 Lipopolysaccharide heptosyltransferase I WP_038485833.1
LT85_RS25195 Class I SAM-dependent methyltransferase WP_081992037.1
LT85_RS04560 Hypothetical protein WP_038485836.1
LT85_RS04565 Class I SAM-dependent methyltransferase WP_172656939.1
LT85_RS04570 Glycosyltransferase WP_052134650.1
LT85_RS04575 Glycosyltransferase WP_038485842.1
LT85_RS04580 Hypothetical protein WP_038485845.1
LT85_RS04585 DUF2142 domain-containing protein WP_172656940.1
LT85_RS04595 Selenocysteine-specific translation elongation factor WP_038485851.1
LT85_RS04600 L-seryl-tRNA(Sec) selenium transferase WP_052134652.1
LT85_RS04605 Formate dehydrogenase accessory protein FdhE WP_038485854.1
LT85_RS04610 Formate dehydrogenase subunit gamma WP_038485857.1
LT85_RS04615 Formate dehydrogenase subunit beta WP_038485861.1
LT85_RS04620 Formate dehydrogenase-N subunit alpha WP_156117426.1
LT85_RS04625 Sulfate ABC transporter substrate-binding protein WP_038485868.1
LT85_RS04630 3-deoxy-D-manno-oct-2-ulosonic acid (Kdo) hydroxylase WP_038485871.1
LT85_RS04635 3-deoxy-D-manno-octulosonic acid transferase WP_038495045.1
LT85_RS04640 YdcF family protein WP_038485874.1
LT85_RS04645 Glycosyl transferase WP_052134654.1
LT85_RS04650 O-antigen ligase domain-containing protein WP_038485877.1
LT85_RS04655 UDP-glucose 4-epimerase GalE subunit WP_038485880.1
  • a Highlighted in bold is the gene (LT85_RS04520) that carried the transposon insertion in 60C09. This gene is part of a predicted operon (highlighted in grey).

One of the sugars in the oligosaccharide repeat unit probably is rhamnose, seeing that (i) LT85_04485, _04480, _04490 and _04495 resemble the rmlABCD genes for the conversion of D-glucose-1-P into deoxythymidine diphosphate (dTDP)-L-rhamnose (King et al., 2009 ), and (ii) the LT85_RS25805 gene product is highly similar to RfbF, which is a dTDP-rhamnosyl transferase that in Shigella flexneri generates the N-acetylglucosamine-Rha-Rha-Rha tetrasaccharide repeat (Morona et al., 1995 ). Another sugar in the oligosaccharide repeat unit may be 6-deoxytalose, given that the two genes flanking LT85_RS04520 are predicted to code for the conversion of dTDP-L-rhamnose into dTDP-L-6-deoxytalose (LT85_04515) and for adding dTDP-L-6-deoxytalose to the growing oligosaccharide repeat unit (LT85_04525), based on similarity to the tle and gtr29 gene products, respectively, from Acinetobacter baumannii (Kenyon et al., 2017 ). The LT85_RS04520 product itself is annotated as a serine acetyltransferase. It showed substantial homology to WcaB from E. coli which also carries the signature of a serine acetyltransferase, but has been shown to catalyse the acetylation of a galactosyl residue after its incorporation as the fourth sugar into the heptasaccharide repeat unit of colanic acid (Scott et al., 2019 ). This acetylation enables the next glycosyltransferase to add the fifth sugar onto the colanic acid repeating unit (Scott et al., 2019 ).

Characterization of the antifungal activity produced by CaCal35

We were unsuccessful in our efforts to recover antifungal activity from culture supernatants of CaCal35 grown in broth. We therefore devised an agar diffusion assay to extract antifungal activity from CaCal35 grown on agar instead. This assay (Fig. S5) exploits the fact that such activity is produced by CaCal35 on agar in the absence of fungus and is diffusable through agar. The diffusates that were collected in this way from wild-type CaCal35 and from mutants 46A06 and 60C09 were tested in a modified confrontation assay with A. niger to reveal that hyphal growth of A. niger was reduced in the presence of diffusates extracted from CaCal35, but not with those extracted from mutants 46A06 and 60C09 (Fig. 7A). The diffusates were analysed by liquid chromatography–mass spectrometry, revealing two prominent peaks (labelled as a and b in Fig. 7B) that were present in the Cal35 diffusate but absent in the diffusates collected from the mutants 46A06 (Fig. 7C) and 60C09 (Fig. 7D). These two peaks eluted with retention times of 13.3 and 13.7 min, and m/z values of 835.493 and 849.509 Da, respectively, in both negative and positive mode. The difference in m/z between peaks a and b equals 14.016 Da, which approximates (Patiny and Borel, 2013 ) the mass of a CH2 moiety. Furthermore, we observed two peaks that were unique to the diffusate from mutant 46A06 and absent from those of wild-type CaCal35 and mutant 60C09 (labelled as c and d in Fig. 7C). These peaks had retention times of 12.3 and 12.8 min, and m/z values were 765.488 and 779.503 Da in both negative and positive mode respectively. The difference in m/z between peaks c and d was 14.015, i.e. the same difference in m/z as between peaks a and b, and consistent with a CH2 moiety. In positive mode, the differences in m/z between peaks a and c and between peaks b and d were 70.006 and 70.005, respectively, which corresponds with a C3H2O2 moiety.

<p>This section provides any useful information about the protein, mostly biological knowledge.<p><a href='/help/function_section' target='_top'>More. </a></p> Function i

Catalyzes the phosphorylation of L-fuculose. Can also phosphorylate, with lower efficiency, D-ribulose, D-xylulose and D-fructose.

<p>Manual validated information which has been generated by the UniProtKB automatic annotation system.</p> <p><a href="/manual/evidences#ECO:0000255">More. </a></p> Manual assertion according to rules i

<p>Manually curated information for which there is published experimental evidence.</p> <p><a href="/manual/evidences#ECO:0000269">More. </a></p> Manual assertion based on experiment in i