LOCATE: a mouse protein subcellular localization database

Abstract

We present here LOCATE, a curated, web-accessible database that houses data describing the membrane organization and subcellular localization of proteins from the FANTOM3 Isoform Protein Sequence set. Membrane organization is predicted by the high-throughput, computational pipeline MemO. The subcellular locations of selected proteins from this set were determined by a high-throughput, immunofluorescence-based assay and by manually reviewing >1700 peer-reviewed publications. LOCATE represents the first effort to catalogue the experimentally verified subcellular location and membrane organization of mammalian proteins using a high-throughput approach and provides localization data for ∼40% of the mouse proteome. It is available at http://locate.imb.uq.edu.au.

INTRODUCTION

Determining the membrane organization and subcellular location of a protein is essential to understanding its biochemical function. A cell is divided into different cellular compartments and each compartment is associated with a different range of biochemical processes; by localizing a protein to a specific compartment, or set of compartments, the cellular role of the protein can be inferred. This information can provide insight into the functions of hypothetical or novel proteins and can provide a more specific organellar context in which to investigate a particular protein. Historically, these data have been difficult to produce on a large scale for higher eukaryotic organisms. However, recent advances in membrane organization prediction methods and high-throughput subcellular localization assays have made it possible to generate these datasets. We used high-throughput methods to predict the membrane organization for the entire mouse proteome and to determine the subcellular localization of a subset of the proteome. We then developed a database, LOCATE, to organize and warehouse these data.

DATABASE CONTENT

Dataset

The mouse proteome dataset we used was the FANTOM3 Isoform Protein Sequence set (IPS7) generated by the RIKEN FANTOM Consortium (1). This dataset comprises protein sequences derived from transcript sequences that were generated by direct sequencing of full-length transcripts. The sequenced transcripts were clustered into transcriptional units (TUs), where a TU is a grouping of transcripts that arise from a single genomic locus and share at least one nucleotide having the same genomic location and orientation. The IPS7 dataset contains 33 451 protein sequences encoded by 19 853 TUs.
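The TU definition above can be read as a single-linkage grouping of transcripts that overlap on the same strand. The following is a minimal sketch under that reading, not the FANTOM3 clustering procedure itself: the `Transcript` record, the inclusive exon coordinates and the example transcripts are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Transcript(NamedTuple):
    name: str
    chrom: str
    strand: str   # "+" or "-"
    exons: tuple  # ((start, end), ...) with inclusive coordinates (assumed)


def share_nucleotide(a: Transcript, b: Transcript) -> bool:
    """True if a and b have at least one exonic base at the same position and strand."""
    if (a.chrom, a.strand) != (b.chrom, b.strand):
        return False
    return any(s1 <= e2 and s2 <= e1
               for s1, e1 in a.exons for s2, e2 in b.exons)


def cluster_into_tus(transcripts):
    """Group transcripts transitively (single linkage) using union-find."""
    parent = list(range(len(transcripts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Pairwise comparison is quadratic; adequate for a sketch.
    for i in range(len(transcripts)):
        for j in range(i + 1, len(transcripts)):
            if share_nucleotide(transcripts[i], transcripts[j]):
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, t in enumerate(transcripts):
        groups[find(i)].append(t.name)
    return sorted(sorted(g) for g in groups.values())


example = [
    Transcript("t1", "chr1", "+", ((100, 200), (300, 400))),
    Transcript("t2", "chr1", "+", ((350, 500),)),
    Transcript("t3", "chr1", "-", ((350, 500),)),  # same position, opposite strand
    Transcript("t4", "chr1", "+", ((210, 290),)),  # lies inside t1's intron
]
tus = cluster_into_tus(example)
```

In this toy input only `t1` and `t2` end up in the same TU: `t3` is on the opposite strand and `t4` shares no exonic base with the others.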

Membrane organization

Protein orientation with respect to the membrane was predicted by MemO, a high-throughput, automated pipeline that combines publicly available feature predictors with empirically determined annotation rules (1,2) (M. J. Davis, F. Clark, J. L. Fink, Z. Yuan, F. Zhang, T. Kasukawa, Y. Hayashizaki, P. Carninci and R. D. Teasdale, manuscript in preparation). The pipeline is described briefly here.

Prediction of signal peptides was performed by a local implementation of SignalP v2.0 (3) and by the Australian National Genomic Information Service (ANGIS, http://biomanager.angis.org.au) version of SPScan. A protein was predicted to contain a signal peptide if the averaged and normalized raw output scores from both methods exceeded a threshold identified to maximize the proportions of true positives and true negatives on a training set.
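The combination rule described above can be illustrated as follows. This is only a sketch: the score ranges used for normalization, the default threshold and the choice of "sensitivity plus specificity" as the quantity to maximize are assumptions made for illustration, since the text does not give the actual MemO values or objective.

```python
def normalize(score, low, high):
    """Min-max scale a raw score onto [0, 1], clipping values outside the range."""
    scaled = (score - low) / (high - low)
    return min(1.0, max(0.0, scaled))


def has_signal_peptide(signalp_raw, spscan_raw, *,
                       signalp_range=(0.0, 1.0),  # assumed score range
                       spscan_range=(0.0, 20.0),  # assumed score range
                       threshold=0.5):            # placeholder, not the trained value
    """Average the two normalized scores and compare with the threshold."""
    combined = (normalize(signalp_raw, *signalp_range) +
                normalize(spscan_raw, *spscan_range)) / 2
    return combined > threshold


def choose_threshold(scores, labels):
    """Pick the cut-off that maximizes sensitivity + specificity on labelled scores.

    Requires at least one positive and one negative label.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_cut, best_value = None, -1.0
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y and s > cut)
        tn = sum(1 for s, y in zip(scores, labels) if not y and s <= cut)
        value = tp / positives + tn / negatives
        if value > best_value:
            best_cut, best_value = cut, value
    return best_cut
```

A training set of combined scores with known signal-peptide status would be passed to `choose_threshold`, and the returned cut-off then used as `threshold` in `has_signal_peptide`.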

α-Helical transmembrane domain prediction was performed by a consensus method consisting of five currently available predictors: HMMTOP (4), TMHMM v2.0 (5), SVMTM v3.0 (6), MEMSAT (7) and DAS (8). A protein was said to contain a transmembrane domain if at least 7, but no more than 42, consecutive residues in the protein (ignoring a gap of <4 residues) were predicted to participate in a transmembrane domain by at least three of the five predictors.
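The consensus rule above (at least three of five predictors, 7 to 42 consecutive residues, gaps of fewer than 4 residues ignored) can be sketched as below. The input encoding (one string per predictor, with `M` marking a predicted transmembrane residue) is hypothetical, and the sketch assumes that bridged gap residues count towards the segment length, which the text does not specify.

```python
def consensus_tm_segments(predictions, min_votes=3, min_len=7, max_len=42, max_gap=3):
    """Return consensus transmembrane segments as 0-based inclusive (start, end) pairs.

    predictions: equal-length strings, one per predictor; 'M' marks a TM residue.
    """
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same sequence")

    # A residue is consensus-positive when enough predictors agree on it.
    votes = [sum(p[i] == "M" for p in predictions) for i in range(length)]
    positive = [v >= min_votes for v in votes]

    # Collect maximal runs of consensus-positive residues.
    runs, start = [], None
    for i, flag in enumerate(positive + [False]):  # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            runs.append((start, i - 1))
            start = None

    # Bridge neighbouring runs separated by at most max_gap residues.
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)

    # Keep only segments within the allowed length range.
    return [(s, e) for s, e in merged if min_len <= e - s + 1 <= max_len]


agreeing = "-" * 5 + "M" * 10 + "-" * 2 + "M" * 8 + "-" * 15
silent = "-" * 40
segments = consensus_tm_segments([agreeing] * 3 + [silent] * 2)
```

Here three predictors agree on two stretches separated by a two-residue gap, so the stretches are bridged into a single 20-residue segment.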

The predicted presence or absence of a signal peptide and of transmembrane domains places each protein into one of five categories of membrane organization:

soluble intracellular proteins (no transmembrane domains or signal peptide);
soluble secreted proteins (signal peptide, no transmembrane domains);
type I membrane proteins (one transmembrane domain, signal peptide) (9);
type II membrane proteins (one transmembrane domain, no signal peptide) (9);
multi-pass membrane proteins (multiple transmembrane domains) (9).

We applied this pipeline to the 33 451 protein sequences in the IPS7 dataset and identified 5116 (∼15%) proteins containing signal peptides and 8238 (∼25%) proteins containing transmembrane domains. These proteins were then allocated to the five membrane organization categories based on combinations of those features. The class breakdown of proteins is shown in Table 1.
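The five-way classification can be written as a small decision function. This is a sketch of the rules exactly as listed in the text; the function name and the class labels it returns are chosen here for illustration.

```python
def membrane_class(has_signal_peptide: bool, n_tm_domains: int) -> str:
    """Map the two predicted features onto one of the five membrane organization classes."""
    if n_tm_domains < 0:
        raise ValueError("number of transmembrane domains cannot be negative")
    if n_tm_domains == 0:
        return "soluble secreted" if has_signal_peptide else "soluble intracellular"
    if n_tm_domains == 1:
        return "type I membrane" if has_signal_peptide else "type II membrane"
    # Two or more transmembrane domains: the signal peptide does not change the class.
    return "multi-pass membrane"
```

Note that the signal peptide prediction only distinguishes classes for proteins with zero or one transmembrane domain.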

Subcellular localization

Proteins were selected for experimentation based on clone availability and the extent of previous characterization of their subcellular localization. When selecting multi-pass membrane proteins, only those without a predicted ER signal peptide were chosen. Expression constructs encoding the protein of interest with an N-terminal myc tag were generated using a modified overlapping PCR methodology originally reported by Suzuki et al. (10). The expressed protein was detected in fixed, transfected HeLa cells by indirect immunofluorescence, and representative images were collected and analyzed to determine the protein's subcellular localization. To date, experimental subcellular localization data have been generated for 417 of these selected proteins and localization data based on primary literature review have been gathered for 1752 TUs.

Both the experimental and literature-mined localization data were manually examined and evaluated for sufficient quality prior to addition to the database. When evaluating literature-mined localization data, we included only papers that describe the localization of full-length proteins in individual mammalian cells and in which the protein is detected directly. These peer-reviewed observations were not reinterpreted. However, observations judged to be of insufficient quality were excluded.

Because it was not always possible to determine to which protein isoform the literature data referred, we assigned the literature-mined location to all protein isoforms encoded by the corresponding TU. Table 1 summarizes the subcellular localization statistics by membrane organization class.
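The assignment of a literature-mined location to every isoform of a TU can be sketched as a simple mapping step. All identifiers and locations below are hypothetical placeholders, not entries from LOCATE.

```python
def propagate_to_isoforms(tu_locations, tu_isoforms):
    """Copy each TU-level location set onto every isoform encoded by that TU."""
    return {
        isoform: set(locations)
        for tu, locations in tu_locations.items()
        for isoform in tu_isoforms.get(tu, [])
    }


# Hypothetical TU-level literature annotations and TU-to-isoform membership.
tu_locations = {
    "TU1": {"nucleus"},
    "TU2": {"Golgi apparatus", "plasma membrane"},
}
tu_isoforms = {
    "TU1": ["TU1.1", "TU1.2"],
    "TU2": ["TU2.1"],
    "TU3": ["TU3.1"],  # no literature data, so no annotation is produced
}
isoform_locations = propagate_to_isoforms(tu_locations, tu_isoforms)
```

Because the location is copied to all isoforms, an isoform-level entry derived this way may not reflect the isoform that was actually studied in the cited paper.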

To provide as complete a location description as possible for any given protein, we also include localization data mined from other online databases including LIFEdb (11), Mouse Genome Informatics (12), UniProt (13), RefSeq (14) and others. A total of 7410 TUs and 11 353 protein isoforms are annotated with these data. In total, we have localization data for 8017 TUs and 12 598 protein isoforms representing 41 and 37% of the IPS7 set, respectively.

Data presentation

General information

Information in LOCATE is displayed as a web page which describes a particular protein entry in detail. The page is divided into sections which summarize several types of data. The top of the page contains a summary of the MemO classification and the subcellular localization of the protein as well as associated metadata provided by FANTOM3 annotations such as the protein identifier, a functional description, protein name synonyms, the source organism and links to other databases which also contain this protein.

Transmembrane topology and predicted domains

Knowing what functional domains and motifs exist in a protein is extremely useful when attempting to decipher the cellular role of the protein. We have generated predictions of Pfam and SCOP domains for all proteins in the database and have displayed the predicted domains on a graphical protein schematic diagram alongside the membrane organization data (Figure 1). The presence and position of certain domains in relation to predicted transmembrane domains can provide insights into the validity of the functional annotation of the protein (if one exists) as well as the validity or range of the transmembrane domain prediction.

Subcellular location data

If a protein entry has high-throughput subcellular localization data, we display the subcellular location(s) in which that particular protein isoform was observed and a high-resolution fluorescence image that best illustrates the observed localization. Information about the experimental conditions, such as the cell type and epitope used in the localization assays, is also displayed. If a protein entry has subcellular localization data mined from the literature, we display the determined subcellular location(s), the PubMed ID and a full citation of the data source.

Controlled vocabulary

Consistent naming of subcellular locations is critical to the integrity and extensibility of the LOCATE data. Therefore, we have constructed a controlled vocabulary which describes both experimentally determined and literature-mined subcellular locations. In the case of high-throughput experimental subcellular localization assays, it is not always possible to determine the exact cellular compartment to which the protein is observed to localize. To address this problem, our controlled vocabulary contains a hierarchical set of terms that allows the call to be only as specific as the data allow. This system also reflects the confidence of the localization call; use of a very specific term implies higher confidence. Some proteins have been observed to localize to more than one subcellular compartment; in these cases, we allow the use of multiple terms to describe the observed locations. When mining subcellular localization data from the literature, we use terms that allow for different levels of location resolution and for cellular components that are specific to cells with a lineage or morphology that differs from the model cells used in our experiments. In both vocabularies, we use Gene Ontology (15) terms to describe subcellular locations whenever possible (see the LOCATE website for more details).
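A hierarchical vocabulary of this kind can be modelled as a child-to-parent map, where the depth of a term reflects how specific the call is. The sketch below uses a small illustrative hierarchy; it is not the LOCATE vocabulary, and the helper names are hypothetical.

```python
# Illustrative child -> parent relationships (not the actual LOCATE term set).
PARENT = {
    "cytoplasm": "cell",
    "nucleus": "cell",
    "endoplasmic reticulum": "cytoplasm",
    "Golgi apparatus": "cytoplasm",
    "endosome": "cytoplasm",
    "early endosome": "endosome",
}


def lineage(term):
    """Return the term followed by its ancestors up to the root."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(term):
    """Number of steps from the term to the root; deeper means more specific."""
    return len(lineage(term)) - 1


def compatible(call_a, call_b):
    """Two calls agree if they are the same term or one is an ancestor of the other."""
    return call_a in lineage(call_b) or call_b in lineage(call_a)
```

With this structure, a low-resolution call such as `endosome` remains consistent with a more specific call such as `early endosome`, while `nucleus` does not.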

Observed spliced isoforms

For each protein in the database, we display a list of all proteins that belong to the same TU to allow comparisons between each of the observed protein isoforms. Specifically, we display the membrane organization and length of each isoform on a splicing graph which illustrates the observed exons and the various alternate splice forms for that particular TU (Figure 2). These graphs enable analysis of the pattern of membrane organization variation within the observed protein isoforms and examination of the possible effects of alternative splicing on membrane organization. The graphs were generated by a customized version of the Splicing Graph Module (16).

Data accessibility

This database does not seek to duplicate information contained in other databases unless it is particularly useful when viewed in juxtaposition with the subcellular localization or membrane organization data. However, we understand the value of convenient data accessibility and provide links to offsite resources such as SymAtlas (17), GenBank (18), RIKEN (1), MGI (19), READ (20), Pfam (21), SCOP (22), UniProt (13), OMIM (23), Entrez Gene (24), BIND (25), the GeneNetwork (26) and the Mouse Retrovirus Tagged Cancer Gene Database (RTCGD) (20) where applicable.

Because the major aim of this database effort is to present protein subcellular location data and the predicted membrane organization of the protein, these two features are the primary search mechanisms; proteins can be retrieved by protein class, subcellular localization or both. Alternatively, individual protein entries can be retrieved by searching the database with a protein ID (RIKEN clone/IPS ID, GenBank accession number, Entrez Gene ID), by protein name, by Pfam or SCOP accession number, or by functional description. BLAST searches against the database, and subsets of the database, are also available. The BLAST results are enhanced to display the membrane organization of the hits. We also offer a number of batch data retrieval options. The proteins in any given search can be retrieved as FASTA-formatted protein or transcript sequences, subcellular localization data, membrane organization data or protein schematics. XML-marked-up documents containing these data can also be obtained.

CONCLUSIONS

LOCATE represents a significant contribution to the biological research community by organizing and presenting membrane organization and subcellular localization data for the mouse proteome. The LOCATE search interface allows users to retrieve data and sets of data using several different approaches. The interface to individual proteins was designed to maximize ease of interpretation by providing summaries or visualizations that contain the most relevant points of data; links are provided to the raw data or other details that are necessary for a careful evaluation of the experimental results. LOCATE data can be retrieved as individual entries or downloaded as HTML, plain text or XML files.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

Table 1

Distribution of membrane organization classes and high-quality localization data in LOCATE

Membrane organization class	MemO data: IPS proteins in class (TUs/isoforms)	Subcellular localization data: isoforms with experimental data	Subcellular localization data: TUs with literature-mined data	Subcellular localization data: total represented (TUs/isoforms)
Soluble, intracellular protein	13 105/22 265	0	302	302/353
Soluble, secreted protein	2190/2948	0	340	340/469
Type I membrane protein	1038/1548	0	377	377/653
Type II membrane protein	2149/2869	207	408	549/766
Multi-pass membrane protein	2538/3821	210	325	460/652
Total proteins analyzed	19 538/33 451	417	1752	2028/2893

The ‘MemO data’ column shows the absolute numbers of proteins (TUs/isoforms) classified by MemO into each membrane organization class. The ‘Subcellular localization data’ columns show the number of protein isoforms that have an experimentally determined subcellular location, the number of transcriptional units (TUs) that have a literature-mined subcellular location, and the total numbers of TUs and isoforms that have subcellular localization data. Localization data mined from other databases are not included here.

Figure 1

Visualization of MemO-, Pfam- and SCOP-predicted motif data. (a) A plot of the number of computational methods (from 0 to 5) that predict a given residue in the protein sequence to participate in a helical transmembrane domain (TMD). Five independent methods are used in the TMD prediction; we assign a residue to a TMD if at least three of the five methods have a positive prediction at that position in the sequence and the range of the predicted TMD fulfils a set of rules defined in the MemO pipeline (M. J. Davis, F. Clark, J. L. Fink, Z. Yuan, F. Zhang, T. Kasukawa, Y. Hayashizaki, P. Carninci and R. D. Teasdale, manuscript in preparation). (b) A schematic diagram of a protein sequence with predicted domains mapped onto it. In this particular diagram, the transmembrane domains predicted by MemO are shown at the top of the figure and the domains predicted by Pfam or SCOP are shown at the bottom of the figure. The schematics are vertically aligned to show the positional relationships of the predicted TMDs and other domains.


Figure 2

Splicing graph. This graph shows the observed exons and splice junctions for the transcriptional unit 101566 and the splice isoforms of the transcripts that arise from this transcriptional unit. The light gray color represents soluble, cytoplasmic proteins (PA101566.2 and PA101566.4); light orange represents a Type II membrane protein (PA101566.1); black represents all observed exons. The green and red bars represent the observed start and stop codons, respectively. The teal rectangle represents the position and range of the MemO-predicted transmembrane domain; note that the transmembrane domain occurs in the exon that only appears in the Type II membrane protein and not in the soluble, cytoplasmic proteins. This is a clear example of how alternate splicing of these transcripts may change the proteins' membrane organization.


The authors would like to acknowledge Nicholas Hamilton for implementing DomainDraw, the domain drawing program; Robert Luetterforst for assistance with the literature mining; and Emma Redhead for designing the LOCATE XML schema and XML document generator. The work was supported by funds from the Australian Research Council (ARC) and by the Research Grant for the RIKEN Genome Exploration Research Project from the Ministry of Education, Culture, Sports, Science and Technology of the Japanese Government to Y.H., and the Research Grant for the Genome Network Project from the Ministry of Education, Culture, Sports, Science and Technology of the Japanese Government. R.D.T. is supported by a National Health and Medical Research Council of Australia R. Douglas Wright Career Development Award. R.N.A. is supported by a Postgraduate Research Scholarship from the IMB, University of Queensland. M.J.D. is supported by the National Institute for Diabetes, Digestion and Kidney Disease, National Institutes of Health (DK63400) as part of the Stem Cell Genome Anatomy Project (http://www.scgap.org/). Funding to pay the Open Access publication charges for this article was provided by University of Queensland and Australian Research Council.

Conflict of interest statement. None declared.

REFERENCES

1. Carninci, P., Kasukawa, T., Katayama, S., Gough, J., Frith, M.C., Maeda, N., Oyama, R., Ravasi, T., Lenhard, B., Wells, C., et al. (2005) The transcriptional landscape of the mammalian genome. Science, 309, 1559–1563.
2. Kanapin, A., Batalov, S., Davis, M.J., Gough, J., Grimmond, S.M., Kawaji, H., Magrane, M., Matsuda, H., Schonbach, C., Teasdale, R.D., et al. (2003) Mouse proteome analysis. Genome Res, 1335–1344.
3. Nielsen, H. and Krogh, A. (1998) In Glasgow, J. (Ed.), Sixth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, pp. 122–130.
4. Tusnady, G.E. and Simon, I. (2001) The HMMTOP transmembrane topology prediction server. Bioinformatics, 849–850.
5. Krogh, A., Larsson, B., von Heijne, G., Sonnhammer, E.L.L. (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol, 305, 567–580.
6. Yuan, Z., Mattick, J.S., Teasdale, R.D. (2004) SVMtm: support vector machines to predict transmembrane segments. J. Computat. Chem, 632–636.
7. Jones, D.T., Taylor, W.R., Thornton, J.M. (1994) A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry, 3038–3049.
8. Cserzo, M., Wallin, E., Simon, I., von Heijne, G., Elofsson, A. (1997) Prediction of transmembrane alpha-helices in prokaryotic membrane proteins: the dense alignment surface method. Protein Eng, 673–676.
9. Goder, V. and Spiess, M. (2001) Topogenesis of membrane proteins: determinants and dynamics. FEBS Lett, 504, –93.
10. Suzuki, H., Fukunishi, Y., Kagawa, I., Saito, R., Oda, H., Endo, T., Kondo, S., Bono, H., Okazaki, Y., Hayashizaki, Y. (2001) Protein–protein interaction panel using mouse full-length cDNAs. Genome Res, 1758–1765.
11. Bannasch, D., Mehrle, A., Glatting, K.H., Pepperkok, R., Poustka, A., Wiemann, S. (2004) LIFEdb: a database for functional genomics experiments integrating information from external sources, and serving as a sample tracking system. Nucleic Acids Res, D505–D508.
12. Eppig, J.T., Bult, C.J., Kadin, J.A., Richardson, J.E., Blake, J.A., Anagnostopoulos, A., Baldarelli, R.M., Baya, M., Beal, J.S., Bello, S.M., et al. (2005) The Mouse Genome Database (MGD): from genes to mice—a community resource for mouse biology. Nucleic Acids Res, D471–D475.
13. Bairoch, A., Apweiler, R., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., et al. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res, D154–D159.
14. Pruitt, K.D., Tatusova, T., Maglott, D.R. (2005) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res, D501–D504.
15. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet, –29.
16. Lee, B.T., Tan, T.W., Ranganathan, S. (2004) DEDB: a database of Drosophila melanogaster exons in splicing graph form. BMC Bioinformatics, 189.
17. Su, A.I., Cooke, M.P., Ching, K.A., Hakak, Y., Walker, J.R., Wiltshire, T., Orth, A.P., Vega, R.G., Sapinoso, L.M., Moqrich, A., et al. (2002) Large-scale analysis of the human and mouse transcriptomes. Proc. Natl Acad. Sci. USA, 4465–4470.
18. Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D.L. (2005) GenBank. Nucleic Acids Res, D34–D38.
19. Blake, J.A., Richardson, J.E., Bult, C.J., Kadin, J.A., Eppig, J.T. (2003) MGD: the Mouse Genome Database. Nucleic Acids Res, 193–195.
20. Akagi, K., Suzuki, T., Stephens, R.M., Jenkins, N.A., Copeland, N.G. (2004) RTCGD: retroviral tagged cancer gene database. Nucleic Acids Res, D523–D527.
21. Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., et al. (2004) The Pfam protein families database. Nucleic Acids Res, D138–D141.
22. Andreeva, A., Howorth, D., Brenner, S.E., Hubbard, T.J., Chothia, C., Murzin, A.G. (2004) SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res, D226–D229.
23. Hamosh, A., Scott, A.F., Amberger, J.S., Bocchini, C.A., McKusick, V.A. (2005) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res, D514–D517.
24. Maglott, D., Ostell, J., Pruitt, K.D., Tatusova, T. (2005) Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res, D54–D58.
25. Alfarano, C., Andrade, C.E., Anthony, K., Bahroos, N., Bajec, M., Bantoft, K., Betel, D., Bobechko, B., Boutilier, K., Burgess, E., et al. (2005) The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res, D418–D424.
26. Wu, C.C., Huang, H.C., Juan, H.F., Chen, S.T. (2004) GeneNetwork: an interactive tool for reconstruction of genetic networks using microarray data. Bioinformatics, 3691–3693.

© The Author 2006. Published by Oxford University Press. All rights reserved. The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org.
