Nicolas Terrapon, Vincent Lombard, Élodie Drula, Pascal Lapébie, Saad Al-Masaudi, Harry J Gilbert, Bernard Henrissat, PULDB: the expanded database of Polysaccharide Utilization Loci, Nucleic Acids Research, Volume 46, Issue D1, 4 January 2018, Pages D677–D683, https://doi.org/10.1093/nar/gkx1022
Abstract
The Polysaccharide Utilization Loci (PUL) database was launched in 2015 to present PUL predictions in ∼70 Bacteroidetes species isolated from the human gastrointestinal tract, as well as PULs derived from the experimental data reported in the literature. In 2018, PULDB offers access to 820 genomes, sampled from various environments and covering a much wider taxonomical range. A Krona dynamic chart was set up to facilitate browsing through taxonomy. Literature surveys now allow the presentation of the most recent (i) PUL repertoires deduced from large-scale RNAseq experiments, (ii) PULs that have been subjected to in-depth biochemical analysis and (iii) new Carbohydrate-Active enzyme (CAZyme) families that contributed to the refinement of PUL predictions. To improve PUL visualization and genome browsing, the previous annotation of genes encoding CAZymes, regulators, integrases and SusCD has now been expanded to include functionally relevant protein families whose genes are significantly over-represented in the vicinity of PULs: sulfatases, proteases, ROK repressors, epimerases and ATP-Binding Cassette and Major Facilitator Superfamily transporters. To cope with cases where susCD may be absent due to incomplete assemblies/split PULs, we present ‘CAZyme cluster’ predictions. Finally, a PUL alignment tool, operating on the tagged families instead of amino-acid sequences, was integrated to retrieve PULs similar to a query of interest. The updated PULDB website is accessible at www.cazy.org/PULDB_new/
INTRODUCTION
Polysaccharides constitute the main source of carbon for most organisms on Earth. Because of their enormous structural diversity, polysaccharide deconstruction requires the concerted action of large numbers of specific enzymes. While most bacteria break down polysaccharides by exporting their carbohydrate-active enzymes (CAZymes) into the extracellular milieu and importing the simple sugars produced, an inventive solution operates in Gram-negative bacteria of the Bacteroidetes phylum. The genomes of these bacteria feature Polysaccharide Utilization Loci, or PULs. A PUL comprises a single genomic locus that encodes the proteins necessary to bind a given polysaccharide at the cell surface, to perform an initial cleavage into large oligosaccharides, to import these oligosaccharides into the periplasmic space, to complete the degradation into monosaccharides and to regulate PUL gene expression. Some Bacteroidetes species contain up to 100 PULs, with almost 20% of their genome dedicated to these systems (1), explaining their evolutionary success as primary glycan degraders in the human gut microbiota (2). Bacteroidetes are found in almost all environments, and the last decade has seen a continuous acceleration of published PUL analyses, notably by RNAseq experiments and in-depth biochemistry. To facilitate individual PUL analysis, in 2015 we launched PULDB to present PULs predicted solely from genome sequences along with those reported in the literature (3). The principle of the PUL prediction is to start from every susCD-like gene pair, and then to extend PUL boundaries to operonic genes (based on intergenic distances between genes on the same strand (4)) and to more distant regulators and CAZyme-coding genes, whose products catalyze polysaccharide breakdown. While we previously focused mainly on the algorithm and presented a limited number of genomes with a recognized bias towards human gut species/strains, we present here a major update of PULDB. 
This release includes a 10-fold increase in analyzed genomes, offering much deeper coverage of the Bacteroidetes phylum and of different environments. A tool has been integrated into the web interface to facilitate taxonomy browsing in PULDB. This release also incorporates the most recent literature-derived PULs and CAZyme families. Additional protein families relevant in a PUL context are now displayed and used in a PUL aligner that allows the user to retrieve the most conserved modular PUL organizations.
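The prediction principle summarized above (seed on a susCD-like pair, then extend to same-strand neighbors separated by short intergenic distances) can be illustrated with a minimal Python sketch. This is not the PULDB implementation: the `Gene` record, the function names and the `MAX_GAP` threshold are hypothetical, since the text gives no numerical cutoff, and the further extension to more distant regulators and CAZyme genes is left out.

```python
from dataclasses import dataclass

@dataclass
class Gene:
    name: str
    start: int   # genomic coordinates, genes sorted by start
    end: int
    strand: str  # "+" or "-"
    tag: str     # e.g. "susC", "susD", "GH43", "unknown"

MAX_GAP = 100  # hypothetical intergenic distance cutoff (bp); not stated in the text

def find_suscd_pairs(genes):
    """Return indices i where genes[i] and genes[i+1] form a same-strand susC/susD pair."""
    pairs = []
    for i in range(len(genes) - 1):
        a, b = genes[i], genes[i + 1]
        if a.strand == b.strand and {a.tag, b.tag} == {"susC", "susD"}:
            pairs.append(i)
    return pairs

def extend_operon(genes, i):
    """Extend the pair (i, i+1) in both directions over close same-strand neighbors."""
    lo, hi = i, i + 1
    strand = genes[i].strand
    while (lo > 0 and genes[lo - 1].strand == strand
           and genes[lo].start - genes[lo - 1].end <= MAX_GAP):
        lo -= 1
    while (hi < len(genes) - 1 and genes[hi + 1].strand == strand
           and genes[hi + 1].start - genes[hi].end <= MAX_GAP):
        hi += 1
    return lo, hi  # inclusive index range of the operon-like core
```

For each index returned by `find_suscd_pairs`, `extend_operon` yields the inclusive range of genes that would form the operonic core of a candidate PUL under these assumptions.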
10-FOLD INCREASE IN CAZy-ANALYZED SPECIES
To achieve a >10-fold increase in the size of PULDB, we analyzed 820 complete genome sequences (∼3 million genes), mostly of the Bacteroidetes phylum, downloaded from the JGI (http://genome.jgi.doe.gov/) and NCBI (https://www.ncbi.nlm.nih.gov/nuccore) servers. Our PUL prediction procedure relies on genomic data but also requires the semi-manual expert annotation of CAZymes (5). We identified 153 202 CAZyme modules in the 820 genomes, mostly glycoside hydrolases (53%) and glycosyltransferases (31%), classified according to the sequence-based families that are described in the CAZy database. The 820 genomes were then subjected to PUL prediction as described earlier. Compared to the 2015 PULDB dataset (3), the new genome sampling expands far beyond the human gastrointestinal tract (now represented by ∼80 species), and notably includes 64 rumen gut species, as well as many bacterial species from soil or marine environments. The coverage of Bacteroidetes taxonomical diversity has also increased drastically. The 2015 dataset almost exclusively consisted of species from the Bacteroidales order (70% belonging to the Bacteroides genus). In the current dataset, Bacteroidales represents only 40% (only half being from the Bacteroides genus), a proportion comparable to that of the Flavobacteriales order, while three additional orders (Cytophagales, Sphingobacteriales and Chitinophagales) are now also represented. Moreover, the presence of the fundamental PUL susCD gene tandem now allows the prediction of PULs beyond the Bacteroidetes phylum, namely in the Gemmatimonadetes and Ignavibacteriae phyla (which group with Bacteroidetes in the FCB group), and also in the Balneolaeota phylum.
To facilitate navigation across the various taxonomical levels, and to identify species of interest, we implemented a new browsing tool in PULDB. We adapted the Krona multilayered pie-chart, introduced for metagenomics analysis (6), to represent the hierarchical aspects of the taxonomy (Figure 1). Implemented using HTML5 and JavaScript interactive technology, Krona allows efficient zooming in and out and can be easily customized by the user for the desired taxonomic depth or font size, allowing the production of high-quality publication-ready pictures. It also offers text searches and improved navigation. We also added a color scale indicative of the number of PULs per genome (estimated for ancestral taxa by a simple arithmetic mean), which immediately offers an overview of the PUL diversity at the different taxonomical levels. Finally, in the upper-right part, where Krona provides genome statistics for each taxon, we added several hyperlinks to the species list, to the NCBI taxonomy and to the PULs predicted by PULDB in this group/species.
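As an illustration of the color-scale value described above, the following Python sketch computes a mean number of PULs per genome for every taxon in a lineage. It assumes that the mean for an ancestral taxon is taken over all of its descendant genomes; the text does not specify whether this or a mean of child means is used. The function name and input layout are hypothetical.

```python
from collections import defaultdict

def mean_puls_per_taxon(genomes):
    """Map each lineage prefix to the arithmetic mean PUL count of its genomes.

    genomes: iterable of (lineage, n_puls), where lineage is a tuple of taxon
    names ordered from the root down to the genome's own taxon.
    """
    totals = defaultdict(lambda: [0, 0])  # prefix -> [sum of PULs, number of genomes]
    for lineage, n_puls in genomes:
        for depth in range(1, len(lineage) + 1):
            node = lineage[:depth]
            totals[node][0] += n_puls
            totals[node][1] += 1
    return {node: total / count for node, (total, count) in totals.items()}
```

Each genome contributes once to every ancestor on its lineage, so deep and shallow taxa are handled by the same loop.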
LITERATURE-DERIVED PULs, COGNATE SUBSTRATES AND NEW CAZyme FAMILIES
The study of polysaccharide degradation by PUL-encoded systems is a highly active research field. A continuous literature survey enabled us to complete the PULDB data with literature-derived PUL data (previously called experimentally-validated PULs). Notably, recent high-throughput experiments led to the delineation of PULs in Bacteroides cellulosilyticus WH2 (7), Bacteroides thetaiotaomicron 7330 (8) and Zobellia galactanivorans (9). Attempts to define PUL boundaries in the absence of expression data were also reported in the genome publication of Capnocytophaga canimorsus Cc5 (10). Moreover, several specific analyses have focused on the degradation of defined polysaccharides by their corresponding PULs, including plant (fructan (11), pectin (12), xylan (13,14), xyloglucan (15) and type II rhamnogalacturonan (RGII) (16)) and non-plant (α-mannan (17), galactomannan (18), 1,6-β-glucan (19), mucin (20), sialoglycoconjugates (21), N-glycan (17,22–24), heparin and heparan sulfate (25), chitin (26), alginate and laminarin (27)) polysaccharides. To facilitate the retrieval of characterized PULs by their cognate substrate, a new field on the PULDB homepage allows searching for a given character substring within the PUL substrate labels. Finally, the recent RGII publication notably reported the biochemical characterization of seven new glycoside hydrolase families that were immediately added to the CAZy database, designated GH137 to GH143. Similarly, other publications led to the creation of new CAZyme families: GH136, GH144, GH145, PL24 to PL27 (28–34). All new CAZy families have also been added to PULDB. As a consequence, the PUL predictions are improved by these new families, which allow refinement of both PUL boundaries and prediction confidence, as illustrated with the JBrowse view (35) of the homologous RGII-PUL in Terrimonas ferruginea DSM 30193 (Figure 2).
ADDITIONAL DISPLAY OF SULFATASES, PROTEASES, EPIMERASES, ROKs AND TRANSPORTERS
In PULDB, simplified representations of PULs are proposed as trains whose wagons, the constitutive proteins, are colored/tagged if their protein function is relevant in the PUL context. We initially focused on SusC outer-membrane transporters (purple), SusD outer-membrane binding proteins (orange), several regulators (light blue), integrases which sometimes join adjacent PULs (dark gray) and CAZyme families (mainly glycoside hydrolases in light pink, polysaccharide lyases in dark pink, carbohydrate-binding modules in green, carbohydrate esterases in brown). All other proteins remained tagged as ‘unknown’ (light gray). To increase the readability of these PUL representations, we searched for additional protein families with relevant function in PULs, based on (i) the literature, (ii) over-representation in PUL contexts and (iii) reliability of Pfam domain annotation (36). These new families are now tagged/colored in the new PULDB release. The most important accessory enzymes that directly assist polysaccharide degradation are the sulfatases, which remove sulfate groups from algal and mammalian-host glycans (25,37,38). Sulfatases now appear colored in yellow in PULDB and are labeled according to their SulfAtlas family classification (39). Proteins in the Major Facilitator Superfamily (MFS) are inner membrane transporters that participate in carbohydrate metabolism after polysaccharide depolymerization (40). Their presence in the vicinity of PULs and their participation in species growth have been demonstrated (41). MFS transporters, as well as ATP-Binding Cassette transporters, are thus colored in purple in PULDB, like SusC transporters. Even though PULDB has not been designed to annotate carbohydrate (monosaccharide) metabolism, in which a large variety of protein functions are involved, we intend to provide users with some indicators that several ‘unknown’ genes in a given PUL may not contribute to polysaccharide deconstruction. 
Thus, we colored in light blue and tagged domains of the ROK family (Repressors, ORFs and Kinases), as well as some epimerases (42,43) that are frequently found in PULs. Finally, proteases have been shown to appear in some operons with susCD genes and to participate in the degradation of non-glycan substrates (20), raising the question of the extension of the PUL paradigm beyond glycans. Their high frequency in some PULs without CAZyme genes motivated the integration of proteases in PULDB (gold-colored), labeled with the clan information of the MEROPS classification (44). All tags that can be searched and displayed in PULDB are shown in Figure 3, and are available at www.cazy.org/PULDB/tags.html.
CAZyme CLUSTERS
While most PULs resemble simple operonic systems, some substrates have been shown to activate the concerted action of several PULs, e.g. RGII (16), and sometimes of a PUL and an additional gene cluster devoid of susCD genes, which thus fails to fulfill the standard PUL paradigm. This was exemplified by the xylan degradation system of Bacteroides xylanisolvens (26). Indeed, when the complexity of the substrate increases, more enzymes are required for its breakdown and thus a ‘longer’ PUL needs to be maintained. It is a challenge for bacteria to keep all necessary enzymes within a single locus/regulatory system. Comparative genomics analysis of homologous PULs for the breakdown of RGII (16), the most complex known polysaccharide, revealed many species with several scattered loci, one containing susCD genes and several others made of three or more clustered CAZyme genes. To cope with such detached gene clusters, the present PULDB update introduces the display of so-called ‘CAZyme clusters’. To predict CAZyme clusters, we apply exactly the same algorithm as in PUL prediction, but instead of initiating the prediction around susCD genes, we start from a core of at least three adjacent CAZyme genes, not necessarily on the same strand, separated by a maximum of one single inserted gene. The display of CAZyme clusters in the PULDB web interface is accessible via a checkbox. CAZyme clusters will also help in PUL annotation of fragmented genomes. For example, despite an incomplete genome assembly, Bacteroides ovatus ATCC 8483 became a model Bacteroidetes species thanks to RNA analysis conducted by Martens and coworkers (45). The complete genome sequence obtained later, however, reveals that the incomplete initial assembly prevented the delineation of a large PUL (Bovatus_02505 to Bovatus_02540). This was because the locus was scattered across four different short scaffolds; CAZyme cluster definition would have reported at least two of the three split clusters.
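The core-detection rule for CAZyme clusters can be sketched in a few lines of Python. This is an illustrative reading, not the PULDB code: it assumes that 'a maximum of one single inserted gene' applies between each pair of consecutive CAZyme genes in the core (rather than to the core as a whole), ignores strand as the text indicates, and uses a hypothetical function name and a simple boolean-per-gene input.

```python
def cazyme_cluster_cores(is_cazyme, min_cazymes=3, max_inserted=1):
    """Return inclusive (first, last) gene-index ranges of candidate cluster cores.

    is_cazyme: list of booleans, one per gene in genomic order (strand ignored).
    A core needs at least `min_cazymes` CAZyme genes, with at most `max_inserted`
    non-CAZyme genes between two consecutive CAZyme genes.
    """
    cores = []
    run = []  # indices of the CAZyme genes in the current candidate core
    for i, flag in enumerate(is_cazyme):
        if not flag:
            continue
        # Close the current run if too many genes separate it from this CAZyme gene.
        if run and i - run[-1] - 1 > max_inserted:
            if len(run) >= min_cazymes:
                cores.append((run[0], run[-1]))
            run = []
        run.append(i)
    if len(run) >= min_cazymes:
        cores.append((run[0], run[-1]))
    return cores
```

Each returned core would then serve as the seed from which boundaries are extended, in the same way a susCD pair seeds a regular PUL prediction.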
THE PUL ALIGNER
A new tool is presented in this PULDB release to allow a user to search for and identify PULs that are similar to a PUL of interest; it is accessible from the web pages dedicated to each PUL. This tool is a PUL aligner which allows retrieval of conserved modular organizations. Inspired by the RADS modular alignment method for proteins (46), this tool produces local alignments of a query PUL (or CAZyme cluster) against all PULs (and CAZyme clusters) in PULDB. However, instead of aligning concatenated amino-acid sequences of proteins, it treats each protein relevant to PUL function as one character. Implementing the classical Needleman–Wunsch algorithm (47), it requires a substitution-scoring matrix between modules, as well as gap costs. A simple scheme based on the most relevant features of PULs was empirically designed. Matches of identical glycoside hydrolase and polysaccharide lyase families are given a score of +200 because they are the main actors of the polysaccharide breakdown specificity, matches of all other protein families a score of +100 and a match of the susCD pair a value of +50 only, due to its presence in all predicted PULs. Proteins tagged as unknown are ignored. Given that a mutation of a protein domain into another is a less likely evolutionary event than an amino-acid substitution, our scoring scheme also favors gaps over substitutions by giving the following penalties: internal gap opening/extension: −20/−10 and terminal gap opening/extension: −10/−5; substitution: −50. As a result, the alignment scores allow the ranking of similar PULs from the most conserved (syntenic) to the most rearranged. Figure 4 shows the results of a search starting from the xyloglucan PUL of B. ovatus ATCC 8483 (15) as the query and three aligned PULs with various conservation levels. 
The PUL aligner can also help in comparative genomics studies of a PUL, (i) by estimating its spread among strains of the same species, among its genus, and beyond, and (ii) by identifying the rearrangement (deletion/insertion) events that occurred during the evolution of a particular PUL.
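To make the scoring scheme concrete, here is a simplified Python sketch of a module-level global alignment with affine gap costs (a Gotoh-style variant of Needleman–Wunsch). It is not the PULDB aligner. Several points are assumptions: each PUL is given as a list of tag strings, the susCD pair is represented by a single hypothetical `"susCD"` token, a gap of length k is assumed to cost opening + (k − 1) × extension, and only the internal gap penalties (−20/−10) are applied, so the cheaper terminal gaps (−10/−5) are not modeled.

```python
NEG = float("-inf")
GAP_OPEN, GAP_EXTEND, MISMATCH = -20, -10, -50  # internal gap and substitution penalties

def module_score(a, b):
    """Score one aligned pair of module tags using the values given in the text."""
    if a != b:
        return MISMATCH
    if a == "susCD":                  # present in all predicted PULs
        return 50
    if a.startswith(("GH", "PL")):    # glycoside hydrolases, polysaccharide lyases
        return 200
    return 100                        # any other tagged family

def align_score(query, target):
    """Global alignment score of two tag lists; 'unknown' proteins are ignored."""
    a = [t for t in query if t != "unknown"]
    b = [t for t in target if t != "unknown"]
    n, m = len(a), len(b)
    # M: last column is a match/substitution; X: gap in b; Y: gap in a.
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = GAP_OPEN + (i - 1) * GAP_EXTEND
    for j in range(1, m + 1):
        Y[0][j] = GAP_OPEN + (j - 1) * GAP_EXTEND
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            M[i][j] = module_score(a[i - 1], b[j - 1]) + prev
            X[i][j] = max(M[i - 1][j] + GAP_OPEN, Y[i - 1][j] + GAP_OPEN,
                          X[i - 1][j] + GAP_EXTEND)
            Y[i][j] = max(M[i][j - 1] + GAP_OPEN, X[i][j - 1] + GAP_OPEN,
                          Y[i][j - 1] + GAP_EXTEND)
    return max(M[n][m], X[n][m], Y[n][m])
```

Under these assumptions, aligning the query `["GH5", "susCD", "GH43", "unknown", "PL9"]` against `["GH5", "susCD", "PL9"]` opens one gap opposite GH43 rather than substituting it, which reflects the stated preference for gaps over substitutions. Ranking all database PULs by `align_score` against a query would then order them from the most syntenic to the most rearranged.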
FUNDING
Deanship of Scientific Research (DSR) at King Abdulaziz University, Jeddah [67/130/35-HiCi to S.A.-M.]; European Union’s Seventh Framework Program [FP/2007/2013]; European Research Council (ERC) [322820 to H.G., B.H.]. Funding for open access charge: ERC [322820].
Conflict of interest statement. None declared.