Samart Wanchana, Supat Thongjuea, Victor Jun Ulat, Mylah Anacleto, Ramil Mauleon, Matthieu Conte, Mathieu Rouard, Manuel Ruiz, Nandini Krishnamurthy, Kimmen Sjolander, Theo van Hintum, Richard M. Bruskiewich, The Generation Challenge Programme comparative plant stress-responsive gene catalogue, Nucleic Acids Research, Volume 36, Issue suppl_1, 1 January 2008, Pages D943–D946, https://doi.org/10.1093/nar/gkm798
Abstract
The Generation Challenge Programme (GCP; www.generationcp.org) has developed an online resource documenting stress-responsive genes comparatively across plant species. This public resource is a compendium of protein families, phylogenetic trees, multiple sequence alignments (MSA) and associated experimental evidence. The central objective of this resource is to elucidate orthologous and paralogous relationships between plant genes that may be involved in response to environmental stress, mainly abiotic stresses such as water deficit (‘drought’). The web-based graphical user interface (GUI) of the resource includes query and visualization tools that allow diverse searches and browsing of the underlying project database. The web interface can be accessed at http://dayhoff.generationcp.org.
INTRODUCTION
Comparative biology provides valuable insights into organismal function and evolution, highlighting the divergence and conservation of gene families and biological processes. In order to cross-reference genes from one species to other related species, accurate predictions of orthologous and paralogous relationships are necessary. Such cross-referencing potentially permits researchers to infer the molecular functions of genes lacking such annotation from experiments in other, better-characterized organisms. Paralogous genes arising from ancient duplication events are likely to have diverged in function, whereas orthologous genes with common ancestry separated only by speciation are more likely to retain identical or highly similar function over evolutionary time (1, 2). Such orthologous and paralogous gene loci almost invariably share some common molecular characteristics; thus, important inferences of function may be possible once these relationships are clearly defined.
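The distinction drawn above can be illustrated with a minimal sketch, assuming a gene tree whose internal nodes are already labelled as speciation or duplication events: two genes are treated as orthologues when their last common ancestor is a speciation node, and as paralogues when it is a duplication node. The `Node` class, the gene names and the toy tree are hypothetical and are not taken from the Dayhoff database.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """A gene-tree node; `event` is set only on internal nodes."""
    name: str
    event: Optional[str] = None  # "speciation" or "duplication"
    children: List["Node"] = field(default_factory=list)


def find_path(root: Node, leaf_name: str) -> Optional[List[Node]]:
    """Return the nodes from the root down to the named leaf, or None."""
    if not root.children:
        return [root] if root.name == leaf_name else None
    for child in root.children:
        sub_path = find_path(child, leaf_name)
        if sub_path is not None:
            return [root] + sub_path
    return None


def relationship(root: Node, gene_a: str, gene_b: str) -> str:
    """Classify two distinct genes by the event at their last common ancestor."""
    if gene_a == gene_b:
        raise ValueError("need two different genes")
    path_a = find_path(root, gene_a)
    path_b = find_path(root, gene_b)
    if path_a is None or path_b is None:
        raise ValueError("gene not found in tree")
    last_common = None
    for node_a, node_b in zip(path_a, path_b):
        if node_a is not node_b:
            break
        last_common = node_a
    return "orthologues" if last_common.event == "speciation" else "paralogues"


# Toy tree (hypothetical): one ancient duplication followed by a
# rice/Arabidopsis speciation in each of the two resulting copies.
tree = Node("root", "duplication", [
    Node("cladeA", "speciation", [Node("rice_A"), Node("arabidopsis_A")]),
    Node("cladeB", "speciation", [Node("rice_B"), Node("arabidopsis_B")]),
])

print(relationship(tree, "rice_A", "arabidopsis_A"))
print(relationship(tree, "rice_A", "rice_B"))
```

In this toy tree, genes from the two copies are classed as paralogues even when they come from different species, because the duplication predates the speciation.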
The Generation Challenge Programme (GCP; www.generationcp.org) is a global crop research consortium striving to apply comparative genomics and molecular analysis to plant genetic resources to enhance efforts in plant breeding for plant stress tolerance. Clustering of orthologous genes across multiple crop species is a powerful strategy for the identification of stress-responsive gene loci and their corresponding alleles of high agronomic value, for application in breeding for stress tolerance.
To facilitate cross-species gene functional analysis, the GCP commissioned a project to assemble tools for the compilation and visualization of comparative information about stress-responsive genes. The result is an online resource, code-named Dayhoff, after Margaret Dayhoff, the famous early pioneer in comparative analysis of sequences.
Orthologues and paralogues of stress-responsive genes are presented by means of phylogenetic trees constructed using a phylogenomic inference method (3, 4). The Dayhoff catalogue is expected to guide the bioinformatics analysis and interpretation of research results generated by comparative genomics experiments. For example, microarray data about drought stress obtained across diverse crop species will be analysed in a comparative manner to identify conserved gene expression profiles exhibited under similar stresses, in a similar fashion to experiments in other model species (5–7).
DATABASE CONSTRUCTION AND IMPLEMENTATION
Dayhoff is a MySQL database based mainly on the Chado schemata of the Generic Model Organism Database project (8) (www.gmod.org), with local enhancements where necessary, to store protein family information such as protein multiple sequence alignments (MSA), phylogenetic trees and supporting stress evidence from experiments and the literature. The web interface uses GCP Java-based software technology (http://pantheon.generationcp.org) connected to third-party software such as ATV (9), Jalview (10) and BLAST (11) for analysing and viewing query results. The Dayhoff site is also cross-linked to a complementary GCP-funded comparative gene analysis resource called GreenPhyl. GreenPhyl provides comparative genomic analyses of Arabidopsis thaliana and Oryza sativa whole-genome assemblies and can be accessed directly at http://greenphyl.cirad.fr/cgi-bin/greenphyl.cgi.
DATA ANALYSIS AND CURATION
The core data set in Dayhoff consists of stress-related protein families characterized by a phylogenomic inference approach (4, 12). The method has been shown to enable the highest accuracy in predicting protein molecular function (12), to avoid most false homology inference problems, and to distinguish between orthologous and paralogous genes (4). Phylogenetic trees representing protein families were constructed by the following steps. First, homologous sequences for each stress protein compiled from the literature were gathered using the FlowerPower tool on the Berkeley Phylogenomics Group (BPG) web server (13), with Uniprot proteins (14) used as the database. FlowerPower uses iterative subfamily hidden Markov model (HMM) searches against PSI-BLAST-identified homologues and alignment analysis to discriminate between partial and global homologies (12). Then, MSAs of homologous proteins were constructed with the high-accuracy MSA program MUSCLE v. 3.52 (15). After masking the alignments to remove columns with many gap characters, functional subfamilies were identified for each group using the SCI-PHY web server (12). SCI-PHY uses Bayesian and information-theoretic approaches to construct a hierarchical tree and cut the tree into subtrees to identify functional subfamilies (12). The analysed trees were saved in the extended New Hampshire format (NHX) for display by the ATV program (9).
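The alignment-masking step can be sketched as follows. This is a minimal illustration rather than the procedure actually used for Dayhoff: the text only says that columns with "many gap characters" were removed, so the 50% gap threshold, the function name `mask_gappy_columns` and the toy alignment are all assumptions.

```python
from typing import Dict


def mask_gappy_columns(alignment: Dict[str, str],
                       max_gap_fraction: float = 0.5) -> Dict[str, str]:
    """Drop alignment columns whose fraction of gap characters exceeds
    max_gap_fraction. The 0.5 default is an assumed threshold."""
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")

    # Keep the indices of columns that are not too gappy.
    keep = [
        i for i in range(length)
        if sum(seq[i] in "-." for seq in seqs) / len(seqs) <= max_gap_fraction
    ]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in zip(names, seqs)}


# Toy alignment (hypothetical sequences): the third column is a gap in two
# of the three rows and is therefore removed.
toy = {"seq1": "MK-LV", "seq2": "M--LV", "seq3": "MKALV"}
masked = mask_gappy_columns(toy)
for name, seq in masked.items():
    print(name, seq)
```

Masking by column keeps the rows aligned with one another, which matters because the masked alignment is what subfamily identification operates on.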
Stress-responsive genes to be analysed were compiled from available literature documenting genes analysed from diverse experimental sources (Supplementary Table 1). In the current version of Dayhoff, stress genes include those analysed from drought, salt, cold, ABA and GA stress experiments. Both up- and down-regulated genes under those stress types are available for O. sativa and A. thaliana. To overlay this experimental evidence on the gene family trees, BLASTP searches of candidate stress genes were performed against the database of Uniprot proteins used in phylogenetic tree construction. The BLAST results were binned by ranges of cutoff values as follows: ≥80% to >95% similarity, E-value <1e−20 to <1e−50 and bit scores >50 to >1000.
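Filtering of this kind can be sketched as below, assuming the BLASTP hits are available in the standard 12-column tabular output (query, subject, % identity, ..., E-value, bit score). The default thresholds mirror the loosest cutoffs quoted above; the use of the % identity column for the "similarity" cutoff, the `Hit` type, the function names and the example rows are assumptions made for illustration.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    pident: float    # percent identity (column 3 of tabular output)
    evalue: float    # column 11
    bitscore: float  # column 12


def parse_tabular(lines: Iterable[str]) -> List[Hit]:
    """Parse 12-column tab-separated BLAST output, skipping comment lines."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), float(f[10]), float(f[11])))
    return hits


def filter_hits(hits: Iterable[Hit], min_pident: float = 80.0,
                max_evalue: float = 1e-20,
                min_bitscore: float = 50.0) -> List[Hit]:
    """Keep hits with identity >= min_pident, E-value < max_evalue and
    bit score > min_bitscore (defaults: the loosest cutoffs in the text)."""
    return [h for h in hits
            if h.pident >= min_pident
            and h.evalue < max_evalue
            and h.bitscore > min_bitscore]


# Made-up example rows; the identifiers are hypothetical.
rows = [
    "geneA\tP00001\t97.5\t200\t5\t0\t1\t200\t1\t200\t1e-60\t1200",
    "geneA\tP00002\t82.0\t180\t32\t1\t1\t180\t3\t182\t3e-25\t95.0",
    "geneB\tP00003\t60.0\t150\t60\t2\t1\t150\t1\t150\t1e-5\t40.0",
]
hits = parse_tabular(rows)
loose = filter_hits(hits)
strict = filter_hits(hits, min_pident=95.0, max_evalue=1e-50, min_bitscore=1000.0)
print(len(loose), len(strict))
```

Exposing the three thresholds as parameters lets the same function cover any stringency level between the loosest and strictest cutoffs.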
USER INTERFACE
There are three main options for using the database: browsing protein families, querying the database by gene or protein names, and running a BLAST search against protein families (Figure 1).
Browsing protein families
The database can be explored by browsing the entire set of stress protein families that have been constructed (Figure 1A). Users can choose the browsing option from the main drop-down menu. A list of protein families, together with links to the phylogenetic trees and MSAs, is shown on the front page. Details about each protein family, for example the list of Uniprot IDs, protein names, Gene Ontology (GO) terms and key publications for each protein obtained from the Uniprot database (14), can be accessed through the family ID links (Figure 1B). Additional information can be displayed by selecting from the drop-down list. MSAs and phylogenetic trees can be viewed with Jalview and ATV, respectively (Figure 1C and D). MSAs can be presented in two ways: for the whole family, or for a subset of proteins of interest that users select by ticking the check boxes (Figure 1B). Hyperlinks to the Uniprot database and other online resources are also provided. Users can find stress evidence mapped to the matched protein(s) in the family on the basis of the BLASTP search results (Figure 1E). BLASTP cutoff values for % identity, E-value and score are provided for filtering the BLASTP results. Users may need to change the default parameters to obtain optimal results.
Query database
In the current version of Dayhoff, users can search the database by keyword in two fields: Family name and Protein name (Figure 1G). A search on Family name retrieves the matching family; users can then view more information through the family ID link as well as the MSA and tree links. A search on Protein name lists the matching protein(s) together with the Family ID link and some other information.
BLAST protein families
Users can submit a protein or DNA sequence in FASTA or raw format to run a BLAST search against the Dayhoff database as well as the GreenPhyl database (Figure 1H). Dayhoff is interconnected to the GreenPhyl database via a GCP-compliant BioMOBY (16) client web service. Users will receive the best-hit protein families from both Dayhoff and GreenPhyl. The results are provided with links to Dayhoff protein families and hyperlinks to classified families at the GreenPhyl web site.
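Accepting input in either FASTA or raw format could be handled along the following lines. This is a hypothetical sketch and not the Dayhoff implementation: the function names, the default identifier `query` and the 90% nucleotide-letter heuristic for telling DNA from protein are all assumptions.

```python
from typing import Tuple


def read_sequence(text: str, default_id: str = "query") -> Tuple[str, str]:
    """Return (identifier, sequence) from FASTA or raw sequence text."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty input")
    if lines[0].startswith(">"):
        # FASTA: the first line is the header, the rest is sequence.
        header = lines[0][1:].strip() or default_id
        body = lines[1:]
    else:
        # Raw: every line is sequence.
        header = default_id
        body = lines
    # Remove internal whitespace and normalise case.
    sequence = "".join("".join(ln.split()) for ln in body).upper()
    if not sequence:
        raise ValueError("no sequence found")
    return header, sequence


def guess_type(sequence: str) -> str:
    """Guess 'dna' or 'protein' from letter composition (assumed heuristic:
    at least 90% of characters are nucleotide letters)."""
    nucleotide_letters = set("ACGTUN")
    fraction = sum(c in nucleotide_letters for c in sequence) / len(sequence)
    return "dna" if fraction >= 0.9 else "protein"


fasta_id, fasta_seq = read_sequence(">q1 test protein\nMKT\nLLV\n")
raw_id, raw_seq = read_sequence("acgt acgt\nacgt")
print(fasta_id, fasta_seq, guess_type(fasta_seq))
print(raw_id, raw_seq, guess_type(raw_seq))
```

A composition-based guess can misclassify short or unusual sequences, so a real submission form would more likely ask the user to state the sequence type explicitly.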
FUTURE DIRECTIONS
Further integration of the comparative stress-responsive gene catalogue with the GCP platform software will enhance access to comparative gene data in a variety of bioinformatics analysis contexts. In particular, Dayhoff will be connected using GCP technology to a MAXD gene expression database, for direct integration into comparative microarray data analyses.
ACKNOWLEDGEMENTS
Funding to pay the Open Access publication charges for this article was provided by Generation Challenge Programme.
Conflict of interest statement. None declared.