Link to original content: http://www.ncbi.nlm.nih.gov/pubmed/23598997
Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions
Nucleic Acids Res. 2013 Jul;41(12):e121.
doi: 10.1093/nar/gkt263. Epub 2013 Apr 17.


Jaina Mistry et al. Nucleic Acids Res. 2013 Jul.

Abstract

Detection of protein homology via sequence similarity has important applications in biology, from protein structure and function prediction to reconstruction of phylogenies. Although current methods for aligning protein sequences are powerful, challenges remain, including problems with homologous overextension of alignments and with regions under convergent evolution. Here, we test the ability of the profile hidden Markov model method HMMER3 to correctly assign homologous sequences to >13,000 manually curated families from the Pfam database. We identify problem families using protein regions that match two or more Pfam families not currently annotated as related in Pfam. We find that HMMER3 E-value estimates seem to be less accurate for families that feature periodic patterns of compositional bias, such as the ones typically observed in coiled-coils. These results support the continued use of manually curated inclusion thresholds in the Pfam database, especially on the subset of families that have been identified as problematic in experiments such as these. They also highlight the need for developing new methods that can correct for this particular type of compositional bias.
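
The overlap criterion described in the abstract (a protein region matched by two or more families that are not annotated as related) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `Hit` record, the 1-based inclusive coordinates, treating "related" as "same family or same clan", counting any shared residue as an overlap, and the family and clan names in the toy data.

```python
from collections import defaultdict
from itertools import combinations
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    """One domain match on a sequence (hypothetical record layout)."""
    seq_id: str
    start: int            # 1-based, inclusive
    end: int              # 1-based, inclusive
    family: str
    clan: Optional[str]   # None if the family belongs to no clan


def related(a: Hit, b: Hit) -> bool:
    """Assumed definition: same family, or both families in the same clan."""
    if a.family == b.family:
        return True
    return a.clan is not None and a.clan == b.clan


def find_overlaps(hits):
    """Return (hit_a, hit_b, shared_residues) for unrelated overlapping hits."""
    by_seq = defaultdict(list)
    for h in hits:
        by_seq[h.seq_id].append(h)

    overlaps = []
    for seq_hits in by_seq.values():
        for a, b in combinations(seq_hits, 2):
            # Length of the intersection of two inclusive intervals.
            shared = min(a.end, b.end) - max(a.start, b.start) + 1
            if shared > 0 and not related(a, b):
                overlaps.append((a, b, shared))
    return overlaps


# Toy data with made-up family and clan names.
hits = [
    Hit("seqA", 10, 120, "FamX", "ClanOne"),
    Hit("seqA", 100, 200, "FamY", None),
    Hit("seqA", 90, 130, "FamZ", "ClanOne"),
    Hit("seqB", 5, 50, "FamX", "ClanOne"),
]
overlaps = find_overlaps(hits)
for a, b, shared in overlaps:
    print(a.seq_id, a.family, b.family, shared)
```

In this toy input, FamX and FamZ overlap on seqA but share a clan, so the pair is not reported; only pairs involving the unrelated FamY are flagged.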

Figures

Figure 1.
Observed number of overlapping domains (dark grey) and expected number of false positives (light grey) at three different E-value significance thresholds for the 13 356 Pfam families considered here.
Figure 2.
(A) Cumulative proportion of overlapping domains in Pfam families. Families are ranked according to the number of their domains that overlap (in descending order) after applying a winner-takes-all greedy algorithm that assigns overlapping domains to families (see ‘Materials and Methods’ section). Data shown for three E-value significance thresholds. (B) Same as 2A red line, with additional plots for families that overlap with two or more, and three or more clans only (E-value = 0.01).
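
The cumulative curve described in this caption can be sketched as follows. This covers only the ranking and accumulation step; the winner-takes-all assignment itself is defined in the paper's Materials and Methods and is not reproduced here. The per-family counts and family names are invented for illustration.

```python
from itertools import accumulate


def cumulative_proportion(overlap_counts):
    """Rank families by overlap count (descending) and return, for each rank,
    the fraction of all overlapping domains covered up to that rank."""
    ranked = sorted(overlap_counts.values(), reverse=True)
    total = sum(ranked)
    if total == 0:
        return [0.0] * len(ranked)
    return [running / total for running in accumulate(ranked)]


# Hypothetical counts of overlapping domains already assigned to each family.
counts = {"FamA": 6, "FamB": 3, "FamC": 1, "FamD": 0}
curve = cumulative_proportion(counts)
print(curve)
```

A curve that rises steeply over the first few ranks indicates that a small number of families account for most of the overlaps.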
Figure 3.
(A) Venn diagram with overlap between families predicted to be coiled-coil, disordered and transmembrane (see ‘Materials and Methods’ section) as observed in 13 356 total families. Coiled-coil: consecutive coiled-coil regions of 20 residues predicted in ≥50% of seed member regions. Disordered: consecutive intrinsically disordered regions of 20 residues predicted in ≥50% of seed member regions. Transmembrane helices: ≥2 transmembrane helices predicted in ≥50% of seed member regions. (B) Overrepresentation of predicted coiled-coil, transmembrane helices and intrinsic disorder when considering Pfam families with overlapping domains versus all Pfam families. Overlaps are calculated with respect to an E-value significance threshold of 0.01. Families are sorted by the number of clans they overlap with (descending) after a winner-takes-all greedy algorithm for assigning overlapping domains to families is applied. Note: two/three or more means two/three or more clans other than the one the family belongs to. Overrepresentation at each point x on the x-axis is obtained by calculating the proportion of families with a given label (e.g. coiled-coil) among the first x families and dividing by the proportion of all families (n = 13 356) with that label. Note that for the sake of simplicity, we truncated the x-axis at 400 families. Labels assigned to families as described in 3A.
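
The labelling rule in 3A and the overrepresentation ratio in 3B can be sketched as below. This is an illustrative reading of the caption, not the authors' code: it assumes per-residue boolean predictions are already available for each seed member region, and it interprets "consecutive regions of 20 residues" as a run of at least 20 consecutive predicted residues. The function names and toy inputs are hypothetical.

```python
def has_run(mask, min_len=20):
    """True if the per-residue boolean mask contains >= min_len consecutive True values."""
    run = 0
    for flag in mask:
        run = run + 1 if flag else 0
        if run >= min_len:
            return True
    return False


def family_has_label(seed_masks, min_len=20, min_fraction=0.5):
    """Assumed rule: label the family if at least min_fraction of its seed
    member regions contain a qualifying run."""
    if not seed_masks:
        return False
    positive = sum(has_run(m, min_len) for m in seed_masks)
    return positive / len(seed_masks) >= min_fraction


def overrepresentation(ranked_labels, x):
    """Proportion of labelled families among the first x ranked families,
    divided by the proportion of labelled families among all families."""
    n = len(ranked_labels)
    if not 1 <= x <= n:
        raise ValueError("x must be between 1 and the number of families")
    background = sum(ranked_labels) / n
    if background == 0:
        raise ValueError("no family carries this label")
    return (sum(ranked_labels[:x]) / x) / background


# Toy seed regions: two of four contain a run of >= 20 predicted residues.
seed_masks = [
    [True] * 25,
    [True] * 10 + [False] + [True] * 10,
    [False] * 30,
    [True] * 20,
]
labelled = family_has_label(seed_masks)

# Toy ranking of 10 families (True = labelled, e.g. coiled-coil).
ranked = [True, True, False, True, False, False, False, False, False, False]
ratios = [overrepresentation(ranked, x) for x in (2, 4, 10)]
print(labelled, ratios)
```

By construction the ratio equals 1 when x covers all families, so values above 1 at small x indicate enrichment of the label among the top-ranked families.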
Figure 4.
Comparison between the proportion of residues predicted to be in coiled-coil and in disordered regions (dark and light grey, respectively) in overlaps versus the proportion in UniProtKB (version 2011_06).
Figure 5.
Proportion of residues in predicted transmembrane helices, coiled-coil regions and intrinsically disordered regions in different sets of sequences: UniProtKB (version 2011_06), all domains in the 13 356 Pfam families that we consider in this study, all overlapping regions, overlapping regions of families that overlap with two or more and three or more clans, overlapping regions of the top 20 families in Table 1. Both Pfam domains and overlapping regions are calculated based on an E-value threshold of 0.01.
