The Status, Quality, and Expansion of the NIH Full-Length cDNA Project: The Mammalian Gene Collection (MGC)
Abstract
The National Institutes of Health's Mammalian Gene Collection (MGC) project was designed to generate and sequence a publicly accessible cDNA resource containing a complete open reading frame (ORF) for every human and mouse gene. The project initially used a random strategy to select clones from a large number of cDNA libraries from diverse tissues. Candidate clones were chosen based on 5′-EST sequences, and then fully sequenced to high accuracy and analyzed by algorithms developed for this project. Currently, more than 11,000 human and 10,000 mouse genes are represented in MGC by at least one clone with a full ORF. The random selection approach is now reaching a saturation point, and a transition to protocols targeted at the missing transcripts is required to complete the mouse and human collections. Comparison of the sequence of the MGC clones to reference genome sequences reveals that most cDNA clones are of very high sequence quality, although some cDNAs likely carry missense variants as a consequence of experimental artifacts, such as PCR, cloning, or reverse transcriptase errors. Recently, a rat cDNA component was added to the project, and ongoing frog (Xenopus) and zebrafish (Danio) cDNA projects were expanded to take advantage of the high-throughput MGC pipeline.
Footnotes
- [Supplemental material is available online at www.genome.org. The sequence data for the full-length clones from this study have been submitted to GenBank under accession nos. BC000001-BC077073.]
- Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2596504.
- 4 National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA.
- 5 National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland 20894, USA.
- 6 National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA.
- 7 National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, Maryland 20892, USA.
- 8 National Institute of Heart Lung and Blood, National Institutes of Health, Bethesda, Maryland 20892, USA.
- 9 National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA.
- 10 SAIC-Frederick, Inc., National Cancer Institute at Frederick, Frederick, Maryland 21702, USA.
- 11 National Cancer Institute, Center for Bioinformatics, Rockville, Maryland 20852, USA.
- 17 Laboratory of Cell Biology, National Institute of Mental Health, National Institutes of Health, Bethesda, Maryland 20892, USA.
- 12 Center for Biomolecular Science & Engineering, University of California, Santa Cruz, Santa Cruz, California 95064, USA.
- 13 Laboratory for Computational Genomics, Washington University, St. Louis, Missouri 63130, USA.
- 14 The I.M.A.G.E. Consortium, Biology and Biotechnology Research Program, Lawrence Livermore National Laboratory, Livermore, California 94550, USA.
- 15 BD Biosciences Clontech, Palo Alto, California 94303, USA.
- 16 Department of Pediatrics, University of Iowa Health Care, Iowa City, Iowa 52242, USA.
- 18 Genome Science Laboratory, RIKEN Genomic Science Laboratory, Saitama 351-0198, Japan.
- 19 National Institute on Aging, NIH, Baltimore, Maryland 21224, USA.
- 32 National Institute of Genetics, Mishima 411-8540, Japan.
- 20 Department of Medical Genome Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo 108-8639, Japan.
- 21 Express Genomics, Frederick, Maryland 21701, USA.
- 22 Open Biosystems, Huntsville, Alabama 35806, USA.
- 23 Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63130, USA.
- 24 Genome Institute of Singapore, Singapore 138672.
- 25 Baylor College of Medicine Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA.
- 3 Present address: University of Iowa Hospitals and Clinics, Iowa City, IA 52242, USA.
- 26 The Institute for Systems Biology, Seattle, Washington 98103, USA.
- 27 NIH Intramural Sequencing Center, Gaithersburg, Maryland 20877, USA.
- 28 Stanford Human Genome Center, Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA.
- 29 University of British Columbia Genome Sciences Centre, BC Cancer Agency, Vancouver BC, V5Z 4S6 Canada.
- 30 Department of Genetics and Genome Sequencing Center, Washington University Medical School, St. Louis, Missouri 63130, USA.
- 31 Agencourt Bioscience Corporation, Beverly, Massachusetts 01915, USA.
- 1 A complete list of authors appears at the end of this manuscript.
- 2 Corresponding author: Daniela S. Gerhard. E-MAIL gerhardd@mail.nih.gov; FAX (301) 480-4368.
- Accepted April 26, 2004.
- Received March 19, 2004.
- Cold Spring Harbor Laboratory Press