Pablo Minguez, Ivica Letunic, Luca Parca, Luz Garcia-Alonso, Joaquin Dopazo, Jaime Huerta-Cepas, Peer Bork, PTMcode v2: a resource for functional associations of post-translational modifications within and between proteins, Nucleic Acids Research, Volume 43, Issue D1, 28 January 2015, Pages D494–D502, https://doi.org/10.1093/nar/gku1081
Abstract
The post-translational regulation of proteins is mainly driven by two molecular events: their modification by several types of moieties and their interaction with other proteins. These two processes are interdependent and together are responsible for the function of the protein in a particular cell state. Several databases focus on the prediction and compilation of protein–protein interactions (PPIs), and no fewer on the collection and analysis of protein post-translational modifications (PTMs); however, there are no resources that concentrate on describing the regulatory role of PTMs in PPIs. We developed several methods based on residue co-evolution and proximity to predict the functional associations of pairs of PTMs, which we apply to modifications in the same protein and between two interacting proteins. In order to make data available for understudied organisms, PTMcode v2 (http://ptmcode.embl.de) includes a new strategy to propagate PTMs from validated modified sites through orthologous proteins. The second release of PTMcode covers 19 eukaryotic species, from which we collected more than 300 000 experimentally verified PTMs (>1 300 000 propagated) of 69 types, extracting the post-translational regulation of >100 000 proteins and >100 000 interactions. In total, we report 8 million associations of PTMs regulating single proteins and over 9.4 million interplays tuning PPIs.
INTRODUCTION
The complexity of the eukaryotic cell cannot be explained by the number of genes and proteins alone, but also by the complex degree of their regulation, which includes several levels and mechanisms. Thus, one of the biggest challenges of the new molecular biology is to understand these systems globally in order to extract the active part of the regulation landscape from one snapshot, or from correlated snapshots, of a cell state. One important step of this regulation is performed after translation, where protein function is defined by the interplay between protein–protein interactions (PPIs) and post-translational modifications (PTMs). These two mechanisms are interdependent, since PPIs are described to be regulated by PTMs (1,2) and intermediate enzymes are also subject to modification.
PTMs are indeed an abundant (3) and widespread (4) source of protein regulation. They are involved in a vast number of functions, from protein stabilizing factors (5) to regulators of molecular switches (6). Their effect on proteins depends on the type of modification (several hundred are described) and on their possible combinations: PTMs may not act alone but coupled to others through cooperation or competition (7). A few well-studied examples of functions defined by these types of combinatorial patterns are described in particular protein families, such as the regulation of histones (8) or the regulation of the FoxO family of transcription factors (9). Some universal PTM combinations also have a known role, such as phosphorylation coupled to ubiquitination in the protein degradation pathway (10), and there are several cases of PTM types modifying the same residue in a competitive manner (11). This and other evidence suggests the existence of a universal molecular barcode, dubbed the PTM code, that would encrypt the signal for the regulation of protein location and function, including interactions with other proteins.
Here we present the second release of the PTMcode database, a resource for known and predicted PTM functional associations. For this update we have tripled the number of compiled experimentally verified PTMs (up to ∼300 000) and spread their signal through conserved sites in orthologous proteins of 19 eukaryotic species augmenting this number to 1.3 million. In total, we provide PTM functional annotation for more than 100 000 proteins. For each of them, PTMcode outlines its known and predicted post-translational regulatory landscape which includes the functional links between its modified residues and from this second release also the functional associations between PTMs that may regulate its interaction with other proteins. PTMcode v2 is accessible through the url http://ptmcode.embl.de.
METHODS SUMMARY
Relative Residue Conservation Score (rRCS) calculation
The rRCS measures the conservation of an amino acid within a multiple sequence alignment (MSA) from an orthologous group of proteins. It takes into account the occurrence of the amino acid in the exact position and the maximum branch distance between the species of the proteins in which the amino acid is conserved; for more details see (12). For this release we used orthologous groups from eggNOG 4.0 (13) and a species tree generated from marker genes. The ETE (14) and treeKO (15) Python libraries were used to manipulate trees. The DisEMBL algorithm was used to calculate protein disordered regions (16).
Co-evolution algorithm
We use mutual information (MI; 17–19) to measure the co-evolution of two modified residues. MI is calculated for every pair of PTMs in a protein using the MSA of its most ancient orthologous group. A background distribution of MI values is calculated from non-modified residues of the same protein that have the same type of amino acid and are located in similar protein regions (ordered or disordered). Pairs of PTMs with an MI value higher than 95% of the background distribution are selected as co-evolving. For residues where MI cannot be calculated (for the restrictions on MI calculation see 12), we calculate the ratio of the conserved site and compare it to the distribution of non-modified sites with the same limitation, taken as background; pairs with a ratio above 95% of the background distribution are selected as co-evolving.
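The selection step described above can be illustrated with a minimal Python sketch. It is not the PTMcode implementation: the function names are hypothetical, alignment columns are assumed to be plain sequences of one-letter residues, and gaps are treated as an ordinary symbol, whereas the restrictions applied in the actual method are described in (12).

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns.

    Both columns must list the residues of the same sequences in the
    same order. Gap handling is an assumption: '-' is counted like any
    other symbol.
    """
    if len(col_a) != len(col_b) or len(col_a) == 0:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) )
        mi += (n_ab / n) * math.log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi


def exceeds_background(value, background, fraction=0.95):
    """True if `value` is higher than at least `fraction` of the background."""
    if not background:
        raise ValueError("background must not be empty")
    lower = sum(1 for b in background if b < value)
    return lower / len(background) >= fraction


# Toy example: two perfectly covarying columns carry 1 bit of information.
score = mutual_information("SSTTSSTT", "KKRRKKRR")
is_coevolving = exceeds_background(score, [0.0] * 99 + [2.0])
```

In this sketch the background list would hold MI values of non-modified residue pairs of the same amino acid types and region types, as the text describes.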
Co-evolution algorithm for PPIs
The two proteins of the PPI are mapped to their most ancient orthologous groups. The MSAs from these two orthologous groups are pruned to keep only proteins from their common species. This guarantees a fair comparison between the conservation of the modified residues over the organisms present in both alignments. After this, the algorithm proceeds as described in the previous section.
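The pruning step can be sketched as follows. This is only an illustration under simplifying assumptions: each MSA is represented as a dictionary from a species identifier to one aligned sequence (so paralogs are ignored), and the function names and species labels are hypothetical.

```python
def prune_to_common_species(msa_a, msa_b):
    """Keep only the sequences of species present in both alignments.

    Each MSA is a dict mapping a species identifier to an aligned
    sequence (assumption: one sequence per species).
    """
    common = sorted(set(msa_a) & set(msa_b))
    pruned_a = {species: msa_a[species] for species in common}
    pruned_b = {species: msa_b[species] for species in common}
    return pruned_a, pruned_b


def paired_columns(pruned_a, col_a, pruned_b, col_b):
    """Extract one column from each pruned MSA, in matching species order."""
    species = sorted(pruned_a)
    return (
        [pruned_a[sp][col_a] for sp in species],
        [pruned_b[sp][col_b] for sp in species],
    )


# Toy alignments for the two partners of a PPI (made-up sequences).
msa_x = {"human": "MKS", "mouse": "MKT", "yeast": "MRS"}
msa_y = {"human": "AYK", "mouse": "AYR", "fly": "GYK"}
pruned_x, pruned_y = prune_to_common_species(msa_x, msa_y)
column_x, column_y = paired_columns(pruned_x, 2, pruned_y, 2)
```

The two returned columns have the same length and species order, so they can be passed directly to a pairwise co-evolution measure.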
RESULTS
Mapping PTMs from various sources into a single framework
PTMcode v2 provides a highly curated data set of PTMs collected from six public databases, UniProt (20), PHOSIDA (21), PhosphoSite (22), PhosphoELM (23), dbPTM (24) and HPRD (25), and from nine high-throughput experiments reported in papers (11,26–33). Our input data set has tripled in size since the previous release and now consists of 316 546 experimentally verified PTMs of 69 different types spread over 45 361 proteins from 19 eukaryotes (Figure 1A). Modified residues were mapped onto reference protein sequences from the eggNOG 4.0 database, and each source was validated for every protein by requiring all of its PTMs to match the correct amino acid. Besides the extensive exercise of PTM compilation, the value of our data set lies in the framework built for its collection, mapping, annotation and, finally, its visualization and data retrieval. PTMcode uses the same protein repertoire and synonyms dictionary as the databases eggNOG (13) and STRING (34), which allows us to easily include orthologs and network neighborhood information. We use the protein orthologous groups provided by eggNOG 4.0 to calculate our rRCS (12, Methods Summary), which evaluates the conservation of every PTM. As a tip for users, an rRCS >95 means that the modified residue is more conserved than 95% of the amino acids of the same type in the same type of region of the protein. Still, some caution must be taken when filtering for the most conserved PTMs in a protein, since fast-evolving PTMs may also have functional roles, as shown for phosphorylation sites (35).
PTMcode also offers functional annotation of PTMs. First, they are mapped onto highly curated protein domains and unstructured regions provided by the SMART database (36), another reference resource that allows us to show the modifications in the functional context of the protein. When available, we also provide the enzymes responsible for the modifications and a simple description of their potential role as regulators or structural stabilizers, depending on the nature of the PTM. As a novelty in this release, we display the number of PubMed articles in which a PTM has been originally described (extracted directly from the annotation in the sources). This information is an additional clue for assessing the functionality of a modified site. Indeed, PTMs described in more PubMed articles are generally more conserved and have a higher co-evolution rate than those described in few articles (Supplementary Figure S1), which suggests a higher probability of being functional. Finally, PTMcode's main contribution to the field is to provide known and predicted functional associations between PTMs. All these features together make PTMcode an integrative framework for the study of protein post-translational regulation.
Propagating PTM sites to orthologs in other organisms
Recent technical advances in the identification of protein-modified sites (37) have drastically increased the availability of PTMs. However, we are still far from having complete PTM repertoires (3), partly due to the condition-specificity of most PTMs (28). The cascade of data production has only just started in this field and there are still many organisms with no available high-throughput studies. In addition, although hundreds of different PTM types are described, only a few have been the subject of proteome-wide screenings (e.g. phosphorylation, acetylation or glycosylation). Based on these two needs, to cover more species and to amplify the number of modifications of understudied PTM types, PTMcode v2 spreads the signal from experimentally validated PTMs to the conserved sites in orthologs from other species. We tag them as ‘propagated PTMs’ (Figure 1B and C).
Indeed, conservation has been widely used as a proxy for functionality (38–40), so it is fair to consider those conserved residues as potentially modified residues, with the same constraints that apply to the use of conservation for the functional assessment of experimentally verified PTMs (35). For example, of a total of 69 875 phosphoserines that could be propagated to human proteins, 15 914 matched already known human phosphorylations. This represents an overlap of 22.7% (still a large underestimate, especially since the phosphoproteome of the other species is far from being complete), while the random expectation of hitting the correct residue is 15%, taking into account the different distribution of phosphorylations in ordered and disordered regions (8038 known phosphoserines in 172 678 serines in ordered regions and 60 095 known phosphoserines in 577 025 serines in disordered regions).
Thus, although propagated PTMs should be treated with great caution and used only for exploratory analyses, they exhibit a reasonable overlap with known PTMs, which is a good indicator of their reliability. Through this simple exercise we obtained 1 347 165 sites with a modification signal propagated from verified PTMs, allowing us to produce post-translational information for species that otherwise have almost no experimental data (Figure 2).
The aim is to bring the number of PTMs of organisms with no large-scale PTM surveys closer to reality. Thus, for Pan troglodytes, for which we could map only 10 experimentally verified phosphorylations, we are able to report almost 100 000 propagated phosphorylations, a number much closer to the 122 000 validated phosphosites that we have in human. The percentage increase in the number of PTMs differs among PTM types: while types with no high-throughput screenings, such as malonylation, nitration, glycation or neddylation, have a high increment (12.1%, 9.8%, 9% and 9.1%, respectively), the most studied PTM types, such as phosphorylation, ubiquitination and acetylation, have an increase of 3.8%, 6% and 5.4%, respectively.
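The propagation idea described in this section can be sketched in Python as follows. The sketch assumes that a site counts as conserved when the ortholog carries the identical amino acid in the same alignment column; the actual criteria used by PTMcode v2 may be stricter. All function names and the toy alignment are hypothetical.

```python
def residue_to_column(aligned_seq, position):
    """Map a 1-based residue position to a 0-based alignment column."""
    seen = 0
    for column, char in enumerate(aligned_seq):
        if char != "-":
            seen += 1
            if seen == position:
                return column
    raise ValueError("position lies outside the sequence")


def column_to_residue(aligned_seq, column):
    """Map a 0-based alignment column to a 1-based residue position."""
    if aligned_seq[column] == "-":
        return None  # a gap has no residue position
    return sum(1 for char in aligned_seq[: column + 1] if char != "-")


def propagate_site(msa, source, position):
    """Propagate a verified modified site to orthologs in the same MSA.

    Assumption: the site is propagated only where the aligned residue
    is identical to the modified residue of the source protein.
    Returns {ortholog name: 1-based position of the propagated site}.
    """
    column = residue_to_column(msa[source], position)
    residue = msa[source][column]
    propagated = {}
    for name, seq in msa.items():
        if name != source and seq[column] == residue:
            propagated[name] = column_to_residue(seq, column)
    return propagated


# Toy alignment: the verified site is serine 3 of the human protein.
toy_msa = {
    "human": "MK-SPT",
    "chimp": "MKASPT",
    "mouse": "M--APT",
    "yeast": "-KQSP-",
}
sites = propagate_site(toy_msa, "human", 3)
```

Here the serine is propagated to the chimp and yeast sequences, which keep a serine in the aligned column, but not to the mouse sequence, which has an alanine there.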
Predicting functional links between PTMs within proteins
The main focus of the PTMcode database is the study of PTMs that are functionally linked. Several types of associations are possible (41): mutually exclusive PTMs that compete for the same residue (11), PTMs that are required for others to take place in a signaling cascade (42), PTMs that are close enough to influence each other (1) or even modifications involved in concerted allosteric conformational changes (43). In order to cover this broad spectrum of possible scenarios, PTMcode relies on five evidence channels that explore different properties of the coupled PTMs: (i) the ‘co-evolution’ channel (Figure 3A), which extracts pairs of PTMs with a similar evolutionary history (see Methods Summary); (ii) the ‘structural distance’ channel (Figure 3B), which reports close modified residues in 3D structures; (iii) the ‘competition’ channel, which highlights residues modified by several PTM types; (iv) the ‘manual annotation’ channel, which offers associated PTM pairs described in the literature; and (v) the ‘hotspots’ channel, which calculates high-density modified protein regions.
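Of the five channels, the ‘competition’ channel is the simplest to illustrate. A minimal sketch, assuming PTMs are available as (protein, position, PTM type) records, groups the records by site and keeps the sites annotated with more than one PTM type. The record layout, the function name and the identifiers in the toy data are hypothetical.

```python
from collections import defaultdict


def competing_sites(ptms):
    """Return residues annotated with more than one PTM type.

    `ptms` is an iterable of (protein, position, ptm_type) tuples.
    The result maps (protein, position) to a sorted list of PTM types.
    """
    types_by_site = defaultdict(set)
    for protein, position, ptm_type in ptms:
        types_by_site[(protein, position)].add(ptm_type)
    return {
        site: sorted(types)
        for site, types in types_by_site.items()
        if len(types) > 1
    }


# Made-up records: one lysine carries two alternative modifications.
records = [
    ("P1", 120, "acetylation"),
    ("P1", 120, "ubiquitination"),
    ("P1", 15, "phosphorylation"),
    ("P2", 120, "acetylation"),
]
competition = competing_sites(records)
```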
We applied our five previously described methodologies (44) to the new data set, which includes the propagated PTMs. Thus, PTMcode v2 now has more than 1.2 million functional links between PTMs within the same protein, involving 205 571 PTMs in 21 713 proteins (∼8 million pairs of 1.2 million PTMs in 100K proteins including the propagated PTMs). This represents a ∼3.2-fold increase in reported PTM associations (20-fold including propagated PTMs) compared to the previous release.
The ‘co-evolution’ channel is the one that provides the majority of our predictions. In a previous work (12) we showed that co-evolution of PTM pairs can be used as a proxy for their functional association. Indeed, co-evolving PTM pairs were shown to be associated with protein short linear motifs and globular domains, as well as to be closer in sequence and space. Moreover, sets of proteins with particular types of co-evolving PTMs were enriched in certain functions, locations and PPI clusters compared to proteins with non-co-evolving PTMs of the same type. On the other hand, co-evolution is also the evidence channel that is hardest to interpret in terms of mechanism, as it may point to very broad functional associations. In this release we have fine-tuned our co-evolution algorithm by adding extra controls: we now extract pairs of PTMs with a higher co-evolution rate than pairs of non-modified residues in the same protein, with the same type of amino acid and located in similar protein regions (Methods Summary), whereas previously we used the random expectation as background. For instance, a protein ubiquitinated in a disordered region and with a phosphoserine in an ordered region would have as its background distribution the co-evolution values of the pairs formed by all non-modified lysines in disordered regions and non-modified serines in ordered regions.
Thus, in PTMcode v2 we report over 1.2 million co-evolving PTM pairs within more than 20 000 proteins (almost 8 million interplays in ∼98 000 proteins if we include the propagated PTMs). Co-evolving PTM pairs with a co-evolution score higher than 95% of the background distribution are reported within the ‘co-evolution’ channel and may be further explored using the Jalview plugin (45), which visualizes the protein sequence alignment of the orthologous group used for the calculation. The common species in which the residues are conserved are also reported in the co-evolution pop-up window.
Predicting functional links between PTMs in interacting proteins
Probably the most ambitious aim of this second PTMcode release is to extract associations between PTMs located on interacting proteins. These functional links would be candidates to regulate the interaction, although several other indirect associations are possible. Several computational approaches have shown enrichment of PTM clusters in protein complexes (46) and a higher number of interaction partners in modified proteins compared to non-modified ones (47). PPIs are indeed subject to this type of regulation, mostly due to the activity of PTMs located in protein interfaces (1), although allosteric regulation has also been reported (48). PTMcode v2 provides two channels for the extraction of associated PTMs in the regulation of two interacting proteins: a ‘co-evolution’ channel and a ‘structural distance’ channel.
We adapted our co-evolution algorithm to measure the interplay between two modified residues from different proteins (see Methods Summary) and applied it to PTMs in physically interacting proteins taken from the STRING 9.1 database (34) with a score over 700. In total, PTMcode v2 provides over 3.6 million predicted associated PTM pairs (∼9.2 million considering propagated PTMs) through the ‘co-evolution’ channel, covering ∼11 000 proteins (∼31 000 including propagated PTMs) in ∼44 000 PPIs (∼102 000 adding the propagated PTMs).
In order to assess the accuracy of our co-evolution algorithm in detecting pairs of associated PTMs within interacting proteins, we compared the results in the PPI data set with the co-evolution scores calculated from pairs of PTMs located on experimentally proven non-interacting proteins extracted from the Negatome database (49). Although we used the most stringent data set available, Negatome cannot guarantee that the protein pairs do not have a functional association other than binding. Still, we found that PPIs show a higher rate of PTM pairs selected as co-evolving (with our usual threshold of 95%) than the non-interacting proteins, as compared using a Fisher test (P-value < 0.001).
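A comparison of this kind can be sketched with a one-sided Fisher exact test on a 2x2 table of co-evolving versus non-co-evolving PTM pairs in interacting versus non-interacting proteins. The text does not state whether the test was one- or two-sided, so the one-sided form below is an assumption, and the counts in the example are made up purely for illustration; they are not the counts behind the reported P-value.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Rows: interacting / non-interacting protein pairs.
    Columns: PTM pairs selected as co-evolving / not selected.
    Returns the probability of observing `a` or more co-evolving pairs
    in the first row under the hypergeometric null distribution.
    """
    row1 = a + b
    col1 = a + c
    total = a + b + c + d
    favourable = 0
    for x in range(a, min(row1, col1) + 1):
        # comb() returns 0 when the lower argument exceeds the upper one
        favourable += comb(col1, x) * comb(total - col1, row1 - x)
    return favourable / comb(total, row1)


# Hypothetical counts, for illustration only.
p_value = fisher_exact_greater(30, 70, 10, 90)
```

A small P-value here would indicate that co-evolving pairs are over-represented among the interacting proteins of the toy table.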
The nature of this ‘co-evolution’ channel allows us to detect not only PTMs that show a direct regulation but also those with more broadly defined interplays, such as PTMs contributing to the same functionality, not necessarily at the same stage. Thus, as in the case of co-evolving pairs of PTMs within the same protein, users should interpret them from a broad perspective.
The ‘structural distance’ channel aims to collect the PTM pairs that may be in physical contact (if both are present at the same time) by virtue of their proximity. This does not exclude the possibility that distant PTMs in protein interfaces co-regulate the interaction. Here, we measure the distance between every two modified residues in protein interfaces that are available in Protein Data Bank (PDB) complex structures (50). We classify as ‘close enough’ the PTM pairs whose residues are separated by less than 4.69 Å, a threshold extracted from manually curated cross-talking modifications described in our first release (44). A total of 65 000 possible cross-talking PTM pairs (∼79 000 if we add propagated PTMs) are reported. The details of these predicted physical associations can be visualized and further analyzed with the Jmol plugin integrated in PTMcode.
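The distance criterion can be sketched as follows. The text gives the 4.69 Å threshold but not which atoms are compared, so taking the minimum distance over all atom pairs of the two residues is an assumption of this sketch; the function names and coordinates are hypothetical.

```python
import math

CLOSE_ENOUGH_ANGSTROM = 4.69  # threshold stated in the text


def min_atom_distance(atoms_a, atoms_b):
    """Smallest Euclidean distance between any atom pair of two residues.

    Each residue is a non-empty list of (x, y, z) coordinates in Å.
    """
    return min(math.dist(p, q) for p in atoms_a for q in atoms_b)


def is_close_enough(atoms_a, atoms_b, threshold=CLOSE_ENOUGH_ANGSTROM):
    """True if the two residues are separated by less than `threshold`."""
    return min_atom_distance(atoms_a, atoms_b) < threshold


# Made-up coordinates for two modified residues on different chains.
residue_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
residue_2 = [(5.0, 0.0, 0.0), (9.0, 0.0, 0.0)]
close = is_close_enough(residue_1, residue_2)
```

In practice the coordinates would come from a parsed PDB complex structure, with the two residues taken from different chains of the interface.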
Visualizing PTMs associated with PPIs
PTMcode is a protein-oriented database in which users can search for their favorite protein and get information about its PTMs including their conservation assessment, an overview of the whole protein functional context and the functional associations between them (known and predicted). From this second release it also includes the analysis of the regulation of the PPIs in which the protein is involved. The entry point to the database is either a browser facility to obtain all associations of two particular PTM types or a powerful search engine that now includes the option to search for interacting proteins and jump directly to our ‘PPI view’ mode. Still from a single protein search, users are guided to the ‘single protein view’ mode that now also includes, if PPIs are available for that particular protein, basic information about how many interaction partners it has and the predictions for the PTM associations that may regulate them.
As the main novelty in visualization, we have implemented a Flash-based network view integrated into our interactive graphic interface, where users can explore the network neighborhood centered on their favorite protein and jump to the ‘PPI view’ (Figure 4). There, the two proteins are shown facing each other, and every pair of known or predicted associated PTMs may be further analyzed by clicking on the channel pop-ups, where extra information about every source of evidence is displayed.
CONCLUSIONS
The PTMcode database is imbued with a systems biology philosophy, in the sense of being a resource that aims to provide a global picture of the post-translational regulatory landscape of eukaryotic proteins. We offer a unique environment where PTMs are shown as active players in the whole functional context of the protein, from its domain architecture to its interaction network neighborhood. Within this framework, in addition to the extra value given to the data taken from other resources, the main contribution of PTMcode to the field is to supply known and predicted functional associations between PTMs. Those associations are extracted using in-house methodologies that measure (i) the co-evolution between two protein residues and (ii) different types of PTM proximity (close in space or part of highly modified protein regions).
In this second release we have tripled the size of our PTM collection and implemented a simple algorithm for the propagation of the modifications through orthologs making available for the first time a predicted post-translational regulatory scheme for thousands of proteins that had no information available.
Another major update is the analysis of associated PTM pairs in interacting proteins which may contain tips for the regulation of PPIs. This places PTMcode as a bridge of information between databases of PPIs (34,51) and those dedicated to collect and analyze PTMs (21,22).
As the name chosen for the database suggests, our final aim is to contribute to the understanding of the so-called PTM code, a molecular barcode composed of function-specific combinations of PTMs. This challenge entails significant limitations. First, we are far from having a complete PTM repertoire even for a single organism, and many PTM types are still vastly understudied, so their functional impact may be underestimated. On top of that, both PTMs and PPIs depend on particular cell states, while they are normally provided as a collage with no condition-specificity information. These and others are the challenges for future studies. The new release of the PTMcode database reported here is a step towards this objective, since it produces new knowledge out of the collected data, feeding the scientific community with more information and more hypotheses.
We thank Yan Yuan for all his help and support on all technical and infrastructure issues we encountered during this project.
FUNDING
EMBL. Funding for open access charge: EMBL, Meyerhofstrasse 1, 69117 Heidelberg, Germany.
Conflict of interest statement. None declared.