Abstract

The Generation Challenge Programme (GCP; www.generationcp.org ) has developed an online resource documenting stress-responsive genes comparatively across plant species. This public resource is a compendium of protein families, phylogenetic trees, multiple sequence alignments (MSA) and associated experimental evidence. The central objective of this resource is to elucidate orthologous and paralogous relationships between plant genes that may be involved in response to environmental stress, mainly abiotic stresses such as water deficit (‘drought’). The web-based graphical user interface (GUI) of the resource includes query and visualization tools that allow diverse searches and browsing of the underlying project database. The web interface can be accessed at http://dayhoff.generationcp.org .

INTRODUCTION

Comparative biology provides valuable insights into organismal function and evolution, highlighting the divergence and conservation of gene families and biological processes. In order to cross-reference genes from one species to other related species, accurate predictions of orthologous and paralogous relationships are necessary. Such cross-referencing potentially permits researchers to infer the molecular functions of genes lacking such annotation from experiments in other, better-characterized organisms. Paralogous genes arising from ancient duplication events are likely to have diverged in function, whereas orthologous genes with common ancestry separated only by speciation are more likely to retain identical or highly similar function over evolutionary time ( 1 , 2 ). Such orthologous and paralogous gene loci almost invariably share some common molecular characteristics; thus, important inferences of function may be possible once these relationships are clearly defined.
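The distinction can be made concrete with a small sketch. It assumes a gene tree whose internal nodes are already labelled as speciation or duplication events. The tree, the gene names and the tuple encoding are hypothetical and are not taken from Dayhoff. Two genes are treated as orthologues if their last common ancestor is a speciation node, and as paralogues if it is a duplication node.

```python
# A leaf is a gene name (str); an internal node is (event, left, right),
# where event is "speciation" or "duplication". All names are illustrative.
tree = (
    "speciation",
    ("duplication", "rice_A1", "rice_A2"),
    ("duplication", "arabidopsis_A1", "arabidopsis_A2"),
)


def leaves(node):
    """Return all gene names found under a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, out=None):
    """Label every gene pair by the event at its last common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    label = "orthologue" if event == "speciation" else "paralogue"
    # Genes split across the two children first meet at this node.
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = label
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out


relations = classify_pairs(tree)
```

In this toy tree the two rice genes are paralogues of each other, while each rice gene is an orthologue of both Arabidopsis genes.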

The Generation Challenge Programme (GCP; www.generationcp.org ) is a global crop research consortium striving to apply comparative genomics and molecular analysis to plant genetic resources to enhance efforts in plant breeding for plant stress tolerance. Clustering of orthologous genes across multiple crop species is a powerful strategy for the identification of stress-responsive gene loci and their corresponding alleles of high agronomic value, for application in breeding for stress tolerance.

To facilitate cross-species gene functional analysis, the GCP commissioned a project to assemble tools for the compilation and visualization of comparative information about stress-responsive genes. The result is an online resource, code-named Dayhoff, after Margaret Dayhoff, the famous early pioneer in comparative analysis of sequences.

Orthologues and paralogues of stress-responsive genes are presented by means of phylogenetic trees constructed using a phylogenomic inference method ( 3 , 4 ). The Dayhoff catalogue is expected to guide the bioinformatics analysis and interpretation of research results generated by comparative genomics experiments. For example, microarray data about drought stress obtained across diverse crop species will be analysed in a comparative manner to identify conserved gene expression profiles exhibited under similar stresses, in a similar fashion to experiments in other model species ( 5–7 ).

DATABASE CONSTRUCTION AND IMPLEMENTATION

Dayhoff is a MySQL database based mainly on the Chado schema of the Generic Model Organism Database project ( 8 ) ( www.gmod.org ), with local enhancements where necessary, to store protein family information such as protein multiple sequence alignments (MSA), phylogenetic trees and supporting stress evidence from experiments and the literature. The web interface uses GCP Java-based software technology ( http://pantheon.generationcp.org ) connected to third-party software such as ATV ( 9 ), Jalview ( 10 ) and BLAST ( 11 ) for analysing and viewing query results. The Dayhoff site is also cross-linked to a complementary GCP-funded comparative gene analysis resource called GreenPhyl. GreenPhyl provides comparative genomic analyses of Arabidopsis thaliana and Oryza sativa whole-genome assemblies and can be accessed directly at http://greenphyl.cirad.fr/cgi-bin/greenphyl.cgi .

DATA ANALYSIS AND CURATION

The core data set in Dayhoff consists of stress-related protein families characterized by a phylogenomic inference approach ( 4 , 12 ). The method has been shown to achieve the highest accuracy in predicting protein molecular function ( 12 ), to avoid most false homology inference problems, and to distinguish between orthologous and paralogous genes ( 4 ). Phylogenetic trees representing protein families were constructed in the following steps. First, homologous sequences for each stress protein compiled from the literature were gathered using the FlowerPower tool on the Berkeley Phylogenomics Group (BPG) web server ( 13 ), with UniProt proteins ( 14 ) as the search database. FlowerPower uses iterative subfamily hidden Markov model (HMM) searches against PSI-BLAST-identified homologues and alignment analysis to discriminate between partial and global homologies ( 12 ). Then, MSAs of the homologous proteins were constructed with the high-accuracy MSA program MUSCLE v. 3.52 ( 15 ). After masking the alignments to remove columns with many gap characters, functional subfamilies were identified for each group using the SCI-PHY web server ( 12 ). SCI-PHY uses Bayesian and information-theoretic approaches to construct a hierarchical tree and to cut the tree into subtrees that identify functional subfamilies ( 12 ). The analysed trees were saved in the extended New Hampshire format (NHX) for display by the ATV program ( 9 ).
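The masking step can be illustrated with a minimal sketch. The text does not state which gap threshold was applied, so the 0.5 default below is an assumption, as are the function name and the toy alignment. The sketch simply drops every column whose fraction of gap characters exceeds the threshold.

```python
def mask_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction.

    alignment maps sequence IDs to equal-length aligned strings; '-' is a gap.
    The 0.5 default is an assumed value, not one reported for Dayhoff.
    """
    rows = list(alignment.values())
    if not rows:
        return {}
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")
    # Indices of the columns that are kept after masking.
    keep = [
        i for i in range(length)
        if sum(row[i] == "-" for row in rows) / len(rows) <= max_gap_fraction
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Toy alignment with one all-gap column and one column that is two-thirds gaps.
msa = {
    "seq1": "MK-LV-A",
    "seq2": "MK-LVTA",
    "seq3": "M--LV-A",
}
masked = mask_gappy_columns(msa)
```

Here the third and sixth columns are removed, while the second column (one gap in three sequences) is retained.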

Stress-responsive genes to be analysed were compiled from the available literature documenting genes from diverse experimental sources (Supplementary Table 1). In the current version of Dayhoff, stress genes include those from drought, salt, cold, ABA and GA stress experiments. Genes that are up- or down-regulated under those stress types are available for O. sativa and A. thaliana . To overlay this experimental evidence on the gene family trees, BLASTP searches of candidate stress genes were performed against the database of UniProt proteins used in phylogenetic tree construction. The BLAST results were ranked by cutoff values ranging as follows: similarity from ≥80% to >95%, E-value from <1e−20 to <1e−50 and bit score from >50 to >1000.
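A ranking of this kind can be sketched as follows. Only the two ends of each cutoff range are given in the text, so the sketch defines just two ranks. The rank names, the requirement that all three criteria hold together, and the example hits are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    query: str         # candidate stress gene (hypothetical name)
    subject: str       # matched protein in the family database (hypothetical)
    similarity: float  # percent
    evalue: float
    bitscore: float


def rank_hit(hit):
    """Return 'strict', 'loose' or None for a single BLASTP hit.

    'strict' uses the upper end of each range quoted in the text and
    'loose' the lower end; combining the criteria with AND is an assumption.
    """
    if hit.similarity > 95 and hit.evalue < 1e-50 and hit.bitscore > 1000:
        return "strict"
    if hit.similarity >= 80 and hit.evalue < 1e-20 and hit.bitscore > 50:
        return "loose"
    return None


hits = [
    Hit("stress_gene_1", "subject_1", 98.2, 1e-120, 1500.0),
    Hit("stress_gene_2", "subject_2", 85.0, 1e-30, 220.0),
    Hit("stress_gene_3", "subject_3", 60.0, 1e-5, 40.0),
]
ranks = {hit.query: rank_hit(hit) for hit in hits}
```

Hits that fall below the loosest cutoffs receive no rank and would not be overlaid on a family tree.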

USER INTERFACE

There are three main options for using the database: browsing protein families, querying the database by gene or protein names, and running BLAST searches against protein families ( Figure 1 ).

Figure 1.

An example of browsing ( A ), querying ( G ) and BLAST searching ( H ) the Dayhoff database. The phylogenetic tree and multiple sequence alignment of each protein family are displayed by ATV ( C ) and Jalview ( D ), respectively. Information on each protein family is shown in a new window ( B ) opened from the protein family ID links. Candidate stress proteins can be viewed in a new window by using the Get Candidate Stress Protein button ( B and E ). The location of stress genes in the rice genome is drawn on the chromosome graphic ( F ). Dayhoff can be queried by protein names or family names ( G ). A protein or nucleotide sequence can be submitted for a BLAST search against the Dayhoff and GreenPhyl databases. The BLAST results are reported as both Dayhoff protein families and GreenPhyl classified families ( H ).

Browsing protein families

The entire set of constructed stress protein families can be browsed ( Figure 1 A). Users select the browsing option from the main drop-down menu. A list of protein families, with links to their phylogenetic trees and MSAs, is shown on the front page. Details about each protein family, for example the list of UniProt IDs, protein names, Gene Ontology (GO) terms and key publications for each protein obtained from the UniProt database ( 14 ), can be accessed through the family ID links ( Figure 1 B). Additional information can be displayed by selecting from the drop-down list. Phylogenetic trees and MSAs can be viewed in ATV and Jalview, respectively ( Figure 1 C and D). MSAs can be presented in two ways: for the whole family, or for a subset of proteins of interest selected with the check boxes ( Figure 1 B). Hyperlinks to the UniProt database and other online resources are also provided. Users can find stress evidence mapped to the matching protein(s) in the family on the basis of the BLASTP search results ( Figure 1 E). BLASTP cutoff values for % identity, E-value and score are provided for filtering the BLASTP results. Users may need to change the default parameters to obtain optimal results.

Query database

In the current version of Dayhoff, users can search the database by keyword in two data fields: Family name and Protein name ( Figure 1 G). A search by Family name retrieves the matching family. Users can view more information through the family ID link as well as the MSA and tree links. A search by Protein name lists the matching protein(s) together with the Family ID link and some other information.

BLAST protein families

Users can submit a protein or DNA sequence in FASTA or raw format for a BLAST search against the Dayhoff database as well as the GreenPhyl database ( Figure 1 H). Dayhoff is connected to the GreenPhyl database via a GCP-compliant BioMOBY ( 16 ) client web service. Users receive the best-hit protein families from both Dayhoff and GreenPhyl. The results include links to Dayhoff protein families and hyperlinks to classified families at the GreenPhyl web site.
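Accepting both FASTA and raw input means a submission has to be normalized before it is searched. The sketch below shows one way this could be done. It is not the Dayhoff implementation: the function names, the default identifier "query" and the 0.9 nucleotide-fraction heuristic are all assumptions.

```python
def read_query(text):
    """Return (identifier, sequence) from FASTA or raw sequence text.

    Only the first record of a multi-record FASTA input is read.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty query")
    identifier = "query"  # assumed default name for raw input
    body = lines
    if lines[0].startswith(">"):
        header = lines[0][1:].split()
        if header:
            identifier = header[0]
        body = []
        for ln in lines[1:]:
            if ln.startswith(">"):  # next record starts here
                break
            body.append(ln)
    # Remove internal whitespace and normalize case.
    sequence = "".join("".join(ln.split()) for ln in body).upper()
    if not sequence:
        raise ValueError("no sequence found in query")
    return identifier, sequence


def looks_like_dna(sequence, threshold=0.9):
    """Guess whether a sequence is nucleotide from its ACGTUN fraction.

    This is a heuristic with an assumed threshold; short or unusual
    sequences can be misclassified.
    """
    nucleotide = sum(ch in "ACGTUN" for ch in sequence)
    return nucleotide / len(sequence) >= threshold


fasta_id, fasta_seq = read_query(">q1 example protein\nMKV\nLLA\n")
raw_id, raw_seq = read_query("acgt acgt\nacgt")
```

A front end could use a check like `looks_like_dna` to decide how a submitted sequence should be searched against a protein database, although the text does not say how Dayhoff handles this.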

FUTURE DIRECTIONS

Further integration of the comparative stress-responsive gene catalogue with the GCP platform software will enhance access to comparative gene data in a variety of bioinformatics analysis contexts. In particular, Dayhoff will be connected using GCP technology to a MAXD gene expression database, for direct integration into comparative microarray data analyses.

ACKNOWLEDGEMENTS

Funding to pay the Open Access publication charges for this article was provided by the Generation Challenge Programme.

Conflict of interest statement . None declared.

REFERENCES

1. Koonin EV. Orthologs, paralogs, and evolutionary genomics. Annu. Rev. Genet. 2005, vol. 39 (pg. 309-338)
2. Thornton JW, DeSalle R. Gene family evolution and homology: genomics meets phylogenetics. Annu. Rev. Genomics Hum. Genet. 2000, vol. 1 (pg. 41-73)
3. Brown D, Sjolander K. Functional classification using phylogenomic inference. PLoS Comput. Biol. 2006, vol. 2 pg. e77
4. Sjolander K. Phylogenomic inference of protein molecular function: advances and challenges. Bioinformatics 2004, vol. 20 (pg. 170-179)
5. Bergmann S, Ihmels J, Barkai N. Similarities and differences in genome-wide expression data of six organisms. PLoS Biol. 2004, vol. 2 pg. e9
6. McCarroll SA, Murphy CT, Zou S, Pletcher SD, Chin C.-S, Jan YN, Kenyon C, Bargmann CI, Li H. Comparing genomic expression patterns across species identifies shared transcriptional profile in aging. Nat. Genet. 2004, vol. 36 (pg. 197-204)
7. Zhou X, Gibson G. Cross-species comparison of genome-wide expression patterns. Genome Biol. 2004, vol. 5 pg. 232
8. Mungall CJ, Emmert DB. The FlyBase C: a Chado case study: an ontology-based modular schema for representing genome-associated biological information. Bioinformatics 2007, vol. 23 (pg. i337-i346)
9. Zmasek CM, Eddy SR. ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics 2001, vol. 17 (pg. 383-384)
10. Clamp M, Cuff J, Searle SM, Barton GJ. The Jalview Java alignment editor. Bioinformatics 2004, vol. 20 (pg. 426-427)
11. McGinnis S, Madden TL. BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res. 2004, vol. 32 (pg. W20-W25)
12. Glanville JG, Kirshner D, Krishnamurthy N, Sjolander K. Berkeley Phylogenomics group web servers: resources for structural phylogenomic analysis. Nucleic Acids Res. 2007, vol. 35 (pg. W27-W32)
13. Krishnamurthy N, Brown D, Sjolander K. FlowerPower: clustering proteins into domain architecture classes for phylogenomic inference of protein function. BMC Evol. Biol. 2007, vol. 7 pg. S12
14. The UniProt C. The universal protein resource (UniProt). Nucleic Acids Res. 2007, vol. 35 (pg. D193-D197)
15. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, vol. 32 (pg. 1792-1797)
16. Wilkinson M, Schoof H, Ernst R, Haase D. BioMOBY successfully integrates distributed heterogeneous bioinformatics web services. The planet exemplar case. Plant Physiol. 2005, vol. 138 (pg. 5-17)

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
