PaperBLAST: Text Mining Papers for Information about Homologs

doi:10.1128/mSystems.00039-17

mSystems. 2017 Aug 15;2(4):e00039-17.

doi: 10.1128/mSystems.00039-17. eCollection 2017 Jul-Aug.

Morgan N Price¹, Adam P Arkin¹

PMID: 28845458
PMCID: PMC5557654
DOI: 10.1128/mSystems.00039-17

Morgan N Price et al. mSystems. 2017.

Affiliation

¹ Environmental Genomics & System Biology, Lawrence Berkeley National Lab, Berkeley, California, USA.

Abstract

Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources (Swiss-Prot, GeneRIF, and EcoCyc) that link protein sequences to scientific articles. PaperBLAST's database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. PaperBLAST is available at http://papers.genomics.lbl.gov/.

IMPORTANCE

With the recent explosion of genome sequencing data, there are now millions of uncharacterized proteins. If a scientist becomes interested in one of these proteins, it can be very difficult to find information as to its likely function. Often a protein whose sequence is similar, and which is likely to have a similar function, has been studied already, but this information is not available in any database. To help find articles about similar proteins, PaperBLAST searches the full text of scientific articles for protein identifiers or gene identifiers, and it links these articles to protein sequences. Then, given a protein of interest, it can quickly find similar proteins in its database by using standard software (BLAST), and it can show snippets of text from relevant papers. We hope that PaperBLAST will make it easier for biologists to predict proteins' functions.
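The text-mining step described in the abstract (searching article text for gene or protein identifiers and keeping short snippets around each mention) can be illustrated with a minimal Python sketch. This is not PaperBLAST's implementation: the function names, the identifier `XYZ_0001`, the article IDs, and the snippet width are all hypothetical, and the real system queries EuropePMC rather than an in-memory dictionary. The cap of two snippets per article follows the Fig. 1 caption.

```python
import re
from collections import defaultdict


def find_snippets(text, identifier, width=40):
    """Return text windows around each whole-word mention of identifier."""
    # \b keeps XYZ_0001 from matching inside a longer token such as XYZ_00012.
    pattern = re.compile(r"\b" + re.escape(identifier) + r"\b")
    snippets = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        snippets.append(text[start:end].strip())
    return snippets


def link_articles(articles, identifiers, max_snippets=2):
    """Map each identifier to {article_id: [snippets]} for articles mentioning it."""
    links = defaultdict(dict)
    for article_id, text in articles.items():
        for identifier in identifiers:
            found = find_snippets(text, identifier)
            if found:
                # Keep at most max_snippets per article (assumed from Fig. 1).
                links[identifier][article_id] = found[:max_snippets]
    return dict(links)


# Hypothetical article texts and identifiers, for illustration only.
articles = {
    "article_A": (
        "Deletion of XYZ_0001 abolished growth on the test substrate. "
        "Expression of XYZ_0001 rises under stress. "
        "We conclude that XYZ_0001 encodes a transporter."
    ),
    "article_B": "This study concerns XYZ_00012 only.",
}
links = link_articles(articles, ["XYZ_0001"])
for article_id, snippets in links["XYZ_0001"].items():
    print(article_id, snippets)
```

In a full pipeline, each linked identifier would then be resolved to a protein sequence so that a similarity search with BLAST can run against the linked set.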

Keywords: annotation; text mining.

Figures

**FIG 1**
Example of PaperBLAST results. For each protein that is linked to the literature and is similar to the query protein, PaperBLAST shows a list of articles. For each article, PaperBLAST shows up to two snippets that mention the protein. a.a., amino acids.

**FIG 2**
Coverage of PaperBLAST. (A) How often hypothetical proteins or other vaguely annotated proteins from different types of organisms have homologs in the PaperBLAST database with a BLAST score ratio above the given threshold. (B) How often vaguely annotated bacterial proteins have homologs in PaperBLAST, in the characterized subset of Swiss-Prot, or in any of the three curated databases that are included in PaperBLAST (the characterized subset of Swiss-Prot, GeneRIF, or EcoCyc). In both panels, only homologs with high-coverage alignments (at least 80%) were included.
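The filtering described in the Fig. 2 caption can be sketched as follows. The caption does not define the BLAST score ratio, so this sketch assumes a common definition: the bit score of the alignment divided by the bit score of the query aligned to itself. It also assumes that the 80% coverage requirement applies to both the query and the subject, and that coordinates are 1-based and inclusive. The `Hit` class, the function names, and the 0.3 ratio threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject_id: str
    bit_score: float
    query_start: int      # 1-based, inclusive (assumed)
    query_end: int
    subject_start: int
    subject_end: int
    subject_length: int


def score_ratio(hit, self_score):
    """Bit score relative to the query's self-alignment score (assumed definition)."""
    return hit.bit_score / self_score


def coverage(start, end, length):
    """Fraction of a sequence spanned by the aligned region."""
    return (end - start + 1) / length


def keep_hit(hit, query_length, self_score, min_ratio=0.3, min_coverage=0.8):
    """Keep hits that pass the score-ratio threshold and cover both sequences."""
    query_cov = coverage(hit.query_start, hit.query_end, query_length)
    subject_cov = coverage(hit.subject_start, hit.subject_end, hit.subject_length)
    return (
        score_ratio(hit, self_score) >= min_ratio
        and query_cov >= min_coverage
        and subject_cov >= min_coverage
    )


# Hypothetical hits for a 200-residue query whose self-alignment scores 400 bits.
hits = [
    Hit("good_homolog", 160.0, 1, 190, 5, 195, 210),
    Hit("partial_match", 300.0, 1, 100, 1, 100, 110),
    Hit("weak_match", 100.0, 1, 195, 1, 195, 200),
]
kept = [h.subject_id for h in hits if keep_hit(h, query_length=200, self_score=400.0)]
print(kept)
```

Here `partial_match` fails because it covers only half of the query, and `weak_match` fails because its score ratio of 0.25 is below the assumed threshold.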

Cited by

In vivo manipulation of human gut Bacteroides fitness by abiotic oligosaccharides.
Wesener DA, Beller ZW, Hill MF, Yuan H, Belanger DB, Frankfater C, Terrapon N, Henrissat B, Rodionov DA, Leyn SA, Osterman A, van Hylckama Vlieg JET, Gordon JI. Nat Chem Biol. 2024 Oct 23. doi: 10.1038/s41589-024-01763-6. Online ahead of print. PMID: 39443715
High-throughput protein characterization by complementation using DNA barcoded fragment libraries.
Biggs BW, Price MN, Lai D, Escobedo J, Fortanel Y, Huang YY, Kim K, Trotter VV, Kuehl JV, Lui LM, Chakraborty R, Deutschbauer AM, Arkin AP. Mol Syst Biol. 2024 Nov;20(11):1207-1229. doi: 10.1038/s44320-024-00068-z. Epub 2024 Oct 7. PMID: 39375541
Interactive tools for functional annotation of bacterial genomes.
Price MN, Arkin AP. Database (Oxford). 2024 Sep 6;2024:baae089. doi: 10.1093/database/baae089. PMID: 39241109
Comparative transcriptomics reveals a highly polymorphic Xanthomonas HrpG virulence regulon.
Monnens TQ, Roux B, Cunnac S, Charbit E, Carrère S, Lauber E, Jardinaud MF, Darrasse A, Arlat M, Szurek B, Pruvost O, Jacques MA, Gagnevin L, Koebnik R, Noël LD, Boulanger A. BMC Genomics. 2024 Aug 9;25(1):777. doi: 10.1186/s12864-024-10684-6. PMID: 39123115
Barcoded overexpression screens in gut Bacteroidales identify genes with roles in carbon utilization and stress resistance.
Huang YY, Price MN, Hung A, Gal-Oz O, Tripathi S, Smith CW, Ho D, Carion H, Deutschbauer AM, Arkin AP. Nat Commun. 2024 Aug 5;15(1):6618. doi: 10.1038/s41467-024-50124-3. PMID: 39103350

Associated data

figshare/10.6084/m9.figshare.4836407
