iBet uBet web content aggregator. Adding the entire web to your favor.

Author Notes

Abstract

In 1997 the primary focus of the Genome Sequence DataBase (GSDB; www.ncgr.org/gsdb ) located at the National Center for Genome Resources was to improve data quality and accessibility. Efforts to increase the quality of data within the database included two major projects; one to identify and remove all vector contamination from sequences in the database and one to create premier sequence sets (including both alignments and discontiguous sequences). Data accessibility was improved during the course of the last year in several ways. First, a graphical database sequence viewer was made available to researchers. Second, an update process was implemented for the web-based query tool, Maestro. Third, a web-based tool, Excerpt, was developed to retrieve selected regions of any sequence in the database. And lastly, a GSDB flatfile that contains annotation unique to GSDB (e.g., sequence analysis and alignment data) was developed. Additionally, the GSDB web site provides a tool for the detection of matrix attachment regions (MARs), which can be used to identify regions of high coding potential. The ultimate goal of this work is to make GSDB a more useful resource for genomic comparison studies and gene level studies by improving data quality and by providing data access capabilities that are consistent with the needs of both types of studies.

Introduction

A new era in biological research involving the comparison and functional analysis of complete genomes has begun. This revolution is the direct result of several technological advances. First, the advent of efficient, inexpensive, high throughput DNA sequencing strategies ( 1–5 ) was necessary in order to produce large volumes of accurate sequences. The rate at which progress has been made in the Microbial Genome Initiative is evidence of the effect that high throughput DNA sequencing has had on biological research. In the last year the number of complete microbial genomes, that are available in the public nucleotide sequence databases, has more than doubled and another 100 are expected in the next few years. Singly, each of these sequences and its associated biological annotation contribute to the advancement of the understanding of gene function and microbial biochemistry/physiology ( 1 ). Collectively the availability of these complete genomes provides researchers with raw data to perform comparative genomic studies. For example, last year a minimal gene complement for cellular life was determined based upon the comparative analysis of the Mycoplasma genitalium genome with that of Haemophilus influenzae ( 6 ). In addition, a comparative analysis (at the protein level) of eubacterial, archaeal, and unicellular eukaryotic genomes was conducted to investigate the origins of archaea and novel protein functions ( 7 ). These examples mark the beginning of the era of comparative and functional genomics.

The availability of complete genome sequences has not only had a significant impact on the study of microbes and the Human Genome Project, in terms of sequencing and mapping efforts, but also upon other areas of biology such as agriculture ( 8 ) and bioremediation ( 9 ).

The second technological advancement that was necessary for the realization of comparative and functional genomics was the development and improvement of sequence analysis algorithms ( 10–12 ). These advancements are evident by the increased numbers of biological feature prediction software and homology identification software that are available at web-sites like Pedro's BioMolecular Research Tools ( www.public.iastate.edu/~pedro/research_tools.html ) and BCM Search Launcher ( 13 , http://kiwi.imgen.bcm.tmc.edu:8088/search-launcher/launcher.html ). Improvements to tools like Grail ( 12 ) and GeneFinder ( http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html ) to predict introns, exons and protein coding regions in a variety of organisms and homology searching tools like Beauty ( 10 ) and BLAST ( 11 ) have also been important to the revolution. The GSDB web site also contains a tool to detect matrix attachment regions, which are regions that often mark concentrated areas of transcription factor binding ( 14 , www.ncgr.org/MarFinder ).

Lastly, the development of effective computer and software systems to manipulate and manage large volumes of data was necessary for the realization of comparative and functional genomics ( 15 , 16 ). The ability to query the public nucleotide sequence databases in an efficient manner, when they contain more than one billion bases ( 15–17 ) and associated biological annotation is the direct result of improvements in computer, database and network systems.

The ability to produce, analyze, and store large volumes of nucleotide sequence and associated data are the technological cornerstones of comparative and functional genomics.

The Genome Sequence DataBase (GSDB) located at the National Center for Genome Resources ( www.ncgr.org ) is poised to support many types of data that are essential to comparative and functional genomics studies. In August 1996 GSDB completed a database conversion ( 15 , 18 ) that included upgrading of the Sybase system and improvement of the database schema. As a result of this conversion, GSDB can now store a complete genome (of any number of base pairs) as a single sequence, sequencing confidence data, sequence analysis results/scores, sequence alignments, discontiguous sequences (discontigs), data ownership and researcher defined features. While all of these changes are useful to the researcher community as a whole, these changes are crucial to the support of genomic level sequence comparisons and functional analyses. In most public nucleotide sequence databases it is difficult to adequately evaluate the significance of a biological feature. For example, comparing a protein coding region that has been hypothesized in one sequence based upon significant homology to another sequence which contains a protein coding region is not possible given the data stored in most nucleotide databases. Therefore comparison of the two sequences can only be accomplished by repeating the homology search on the same data set using identical search parameters. Storage of search parameters and scores with hypothetic features would greaty diminish the need to recreate homology searches. GSDB is the only public nucleotide sequence database designed to support the representation of such data.

Data Quality Improvements

One of the primary driving forces behind both GSDB data related and programming projects are the needs of the research community. In order for a public database to be useful to researchers, the quality of data within it must be relatively high. In a relational database like GSDB improvements in data quality can occur from either the removal of erroneous data, or from the addition and creation of more meaningful data annotation and data relationships. Discussed below are two ongoing projects; one of which involves the identification and removal of erroneous sequence data (vector contamination) and one of which involves organizing multiple related sequences into meaningful groupings through the use of the database's ability to represent sequence alignments and discontiguous sequences.

Vector contamination

In this age of high volume sequence production, many researchers are relying more heavily on homology with annotated sequences in the public nucleotide sequence databases to determine the gene content of their sequence(s). Consequently, it is critical that the sequences in the public nucleotide sequence databases not contain vector contamination. The incorporation of vector sequences into the public nucleotide sequence databases has been a problem for a long time. Various researchers have examined this problem since 1992 ( 19–25 ). Recently, the GSDB staff reassessed this data quality issue and analyzed the sequences in the database to determine the extent and progression of vector contamination.

The initial phase of the analysis involved the creation of a database that contained ∼200 unique multiple cloning sites that are commonly used in cloning vectors. A homology search was then performed between each sequence in GSDB and the multiple cloning site sequence database. This initial screen was used to identify potentially contaminated sequences. The initial screening was limited to only the multiple cloning sites because these sequences are rather rare in natural sequences and they are frequently adjacent to the cloned sequence of interest. Other regions of cloning vectors are frequently derived from naturally occurring sequences and they are infrequently adjacent to the cloned sequence of interest. By only searching against the multiple cloning sites the number of sequences that were incorrectly identified as containing vector contamination was minimized. This method of vector identification was tested on the data set that was available to Lamperti et al. in 1992 ( 25 ). The percentage of identified contamination that we obtained was slightly higher (0.3% versus 0.23%) but in concurrence with their published results ( 25 ).

Following initial identification, a potentially contaminated sequence was compared to a database of complete vector sequences in order to identify the entire span of vector contamination. In addition, these sequences were examined for the existence of a restriction enzyme site at the junction between the potential vector sequence and the cloned sequence of interest. The finding of a restriction site at the proposed vector-cloned sequence junction further supported the argument that all of the vector contamination in the sequence had been identified correctly.

In the nearly 1 000 000 sequences that were examined, approximately 0.36% were found to contain vector sequences. Most of these sequences only contained vector contamination at either the 5′ or 3′ end, while a few contained vector contamination at both ends. The vector contamination has been denoted with annotation in all of these sequences and they are in the process of being removed from the sequence and stored in a comment attached to the sequence. Over 75% of the sequences that were identified as containing vector contamination were also found to contain a restriction enzyme site at the junction ( 26 ).

A much smaller number of sequences appeared to contain vector contamination not at the ends of the sequence, but at internal positions of the sequence. This potential vector contamination has been annotated as such, but will not be removed from the sequence until the data submitter has been consulted.

In the data submitted during the 5 year period that was examined, the overall level of vector contamination in the database appeared to remain relatively constant at a value of <1%. However, the study also revealed that >50% of the contamination that was incorporated into the database in the last 2 years was contained in EST and STS sequences. Since these sequences, especially the ESTs are routinely used in homology searches, it is imperative that the vector contamination be identified and removed from the database, or else research may be influenced by erroneous homology matches.

The immediate goal of identifying sequences within the database that contain vector contamination has been successfully completed and the procedure/program to remove this contamination is in development. It is anticipated that the subsequent removal of these vector sequences will be completed by May 1998. In addition, the automation procedure to identify vector contamination prior to its incorporation into the database is being designed.

Unique sequence and annotation representation

One of the primary ways in which GSDB differs from the IC (International Collaboration—DDBJ, EMBL and GenBank) databases is in the types of data which can be stored in the database and in data representation. First, GSDB stores a contiguous string of nucleotides as a single sequence in the database regardless of size. Consequently a complete bacterial genome or fungal chromosome can be retrieved as a single file. In addition GSDB assigns ownership of data within the database to the contributing researcher. Finally, GSDB supports the incorporation of many other types of data that are not supported in the IC databases. Included in this set of unique data types are: sequence alignments, sequence analysis data, sequencing confidence values, and discontiguous sequences. During the last year the GSDB staff has been utilizing these unique capabilities to represent sequence alignments and discontiguous sequences ( Table 1 ) to augment sequence annotation and representation in the database.

Sequence alignments

In the past 2 years several complete microbial genomes, viral genomes, fungal chromosomes, and naturally occurring plasmid sequences have been incorporated into GSDB. During 1997 the GSDB staff made significant progress in aligning small pieces of genomic sequences to the complete genomic sequence of the same organism. Base pair differences, as well as the base pair spans over which two sequences align are retained in the database. Approximately 2100 HIV 1 sequences were compared to the complete genomic sequence of HIV 1 strain HXB 2 (GSDB:S:135829) the HIV research community reference standard sequence. Alignments were created between this sequence and any other HIV 1 sequence that showed >80% homology over 95% of the length of the smaller sequence as determined by BLAST 2.0 and FASTA ( 11 , 28 ).

A similar comparison was performed with the complete genome of Escherichia coli ( 27 , GSDB:S:1649882) and other smaller E.coli sequences in the database. BLAST 1.4.9 and BLAST 2.0 were used to identify sequences that had >95% identity over the total length of the smaller sequence. In total, >2000 of the E.coli sequences within the database were incorporated into the complete genome record as sequence alignments. In this case, the base pair differences that occur between the complete genome and other aligned sequences are more likely the result of sequence variations or sequencing errors.

The GSDB staff has not defined the meaning of these differences, as it is not our role. It is our role to present the data to the research community in an unbiased format, so that individual researcher can decide the value of the differences for him/herself.

The HIV 1 and E.coli complete genomes are only two examples of sequences for which the GSDB staff has performed intraspecies homology searches. A listing that includes all genomes that were analyzed prior to October 1997 is provided in Table 1 . A listing of sequences that have been analyzed since October 1997 is provided in the ‘What's New’ section of the GSDB Web site ( Fig. 1 , www.ncgr.org/gsdb ). The GSDB capability of storing sequence alignment information is flexible in the kinds of sequence alignments that can be represented. In addition to representing intraspecies homology alignments, this capability can also be used to represent the relationship between a completed clone sequence and the subclone sequences were assembled into the complete clone. It can also be used to represent the relationship between a transcribed molecule (mRNA, rRNA, tRNA, etc.) and the corresponding genomic sequence. Lastly, it can be used to represent interspecies homology comparisons, especially those that are utilized to predict that a specific gene is in a sequence because of its similarity to a sequence with that gene from another organism.

Table 1

Sequence alignments

Open in new tab Download slide

Discontiguous sequences

A discontiguous sequence is a set of sequence fragments that are part of a larger contiguous sequence. (This set of sequences is provided a single GSDB sequence accession number, so that the set is easy to reference and retrieve.) Consequently there is a known physical relationship between these fragments, but they do not abut to one another, nor do they overlap. The known physical relationship between these fragments may be as simple as the knowledge that the fragments are all from the same cosmid (or any sequence of known size) and therefore cannot be more than X Kb apart. Alternatively the relationship may be as specific as knowing both the order in which the fragments occur within the larger contiguous sequence and the physical distance between them.

Figure 1

The GSDB web site.

Open in new tab Download slide

By constructing discontiguous sequences for sets of sequences in the database where physical relationships between the sequences are known, the GSDB staff has been organizing the data into more meaningful sets. One of the goals of the staff is to continuously review these sets and to add new fragments to the sets when they become available in the database. This effort will continue until all the gaps between the fragments are filled in and a single contiguous sequence has been ‘built’.

In the past year the GSDB staff constructed two sets of discontiguous sequences, corresponding to the genomes of two agriculturally important crops, corn and rice. A single discontiguous sequence was constructed for each chromosome in each of these genomes.In addition, the discontiguous sequences that correspond to the human chromosomes 1–22 and chromosome X ( 15 ) were updated by the addition of a significant number of sequence fragments. Accession numbers and descriptions of all of these discontiguous sequences can be found in the ‘What's New’ section of the GSDB web site ( www.ncgr.org/gsdb ). As more discontiguous sequences become available, accession numbers will be posted in ‘What's New’.

The physical position of each sequence fragment is stored in the database in terms of kilobases from the left end of the discontiguous sequence and an uncertainty value. By convention the left end of the discontiguous sequence is always the terminal region of the short arm (p arm) of the chromosome. In the case of the discontiguous sequences corresponding to the human chromosomes, most of the newly added fragments and all of the fragments that were used initially to construct these discontiguous sequences are STSs. The staff has also been working to place other, larger genomic sequences into the discontiguous sequences. This is being accomplished by performing BLAST homology searches between human genomic sequences and the sequence fragments that comprise a discontiguous sequence. When a genomic sequence is found to overlap/contain a specific sequence fragment, the genomic sequence is incorporated into the appropriate position of the discontiguous sequence.

In the upcoming year, efforts to improve data quality will continue both through the identification and correction of inaccurate data and data relationships, and through the incorporation of appropriate sequence relationships and other meaningful annotation.

Data Accessibility Improvements

During 1997 the GSDB staff also focused on improving the ease with which researchers can access sequences and annotation. Four major advances in data access were accomplished in this year. These include the release of the Sun and Macintosh versions of the graphical database interface tool, Annotator, the implementation of a timely update process for the web-based query tool, Maestro, the development of a web-based tool to retrieve selected portions of sequences, Excerpt, and the development and distribution of GSDB flatfiles that include the annotations which are unique to GSDB.

Annotator

The graphical database interface tool, Annotator, was released for both Sun and Macintosh platforms during 1997. This tool is designed to support the submission of individual sequences to GSDB and to allow researchers to review, edit and update the information contained within a single sequence. Annotator displays sequences, features and alignments from the database in a graphic format that is biologically relevant. Annotator is free software that is available from the ‘Software’ section of the GSDB Web site ( www.ncgr.org/gsdb ).

The scope of Annotator is somewhat limited because it is platform dependent. However, the utility of Annotator for viewing and editing data has reinforced the need for a graphical sequence viewer that can interact with the database. GSDB staff has begun development of a web based, graphical sequence viewer that will be both platform independent and more flexible than Annotator.

As part of GSDB's effort to provide the community with a useful data set, GSDB obtains data nightly from the DDBJ, EMBL and GenBank databases. Data from these databases are incorporated into GSDB and is available for the addition of annotation via our community annotation mechanism.

Update process for Maestro

Maestro allows users to retrieve sequences from the database by querying on 18 different fields, including accession number, author, gene symbol and product name. The data can be represented in three different formats: GSDB flatfile, fasta or GIO (GSDB Input/Output format). To allow quick retrieval of sequences that match a given criterion, Maestro is implemented using query tables that are snapshots of the database from a single timepoint. We have recently implemented a procedure that updates these query tables once each week, to allow querying on all data available in GSDB.

During the next year we will be adding additional fields to the querying capability of Maestro. Suggestions regarding other fields that should be queryable are welcome.

Excerpt

The number of large (>100 000 bp) sequences in the database is growing rapidly. In part because of the number of complete bacterial and archaeal genomes that are being sequenced, but also because researchers are building larger contigs from human and model organism sequences. Many researchers are interested only in a relatively small segment of these large sequences, and are limited in the size of sequences they can utilize because of software or hardware restriction. Excerpt allows users to select a subset of any sequence within GSDB based on base pair span, gene or product name, all genes, all intergenic regions, or the sequence broken into equal sized spans. This provides researchers with easy access to all sequences, regardless of their computing power. Excerpt is available from the ‘Data Retrieval’ section of the GSDB web site ( www.ncgr.org/gsdb ).

GSDB flatfile definition

The GSDB staff has defined a flatfile format that represents the unique data types from GSDB. While this representation is not graphical, it allows all users to access discontiguous sequence data and alignments from GSDB, regardless of the computer platform they use. The new flatfile will show both the sequences that are aligned to the sequence that is displayed and the sequences to which it is aligned. In addition, the sequence accession number of any discontiguous sequence of which it is a piece of will be included. Flatfiles of discontiguous sequences will include the sequence accession numbers and descriptions of all of the sequences that are part of the discontig, and any information in the database on the order or distance from left of the sequence pieces.

Additional data types that are supported by GSDB, but not the IC databases that are included in the GSDB flatfile representation include sequence confidence data, user defined features, analysis information, and owner information. When GSDB flatfiles are viewed using the GSDB web pages, the flatfile will contain a hyperlink to the contact information for the owner of the data.

New Analysis Tools

Our web site now includes a new tool for locating matrix attachment regions (MARs). The MAR sequences are 100–1000 bp long and anchor chromatin loops to the nuclear matrix. Adjacent MARs often delineate areas where transcription factor binding sites are concentrated, and can thus be used to identify areas within a sequence to search for coding potential.

This tool was developed by Singh et al. ( 14 ) and is available through the ‘Software’ section of the GSDB web site ( www.ncgr.org/gsdb ), and relies on the combination of a ‘database’ of known MAR sequences and a set of decision rules to determine if a MAR sequence is present in a sequence. MARs are identified based on the probability that the MAR motifs occur at random in a given window of the sequence being analyzed. Where known, predictions of MARs using this tool closely corresponded to their experimentally determined locations ( 14 ).

Future Directions

During the next year the GSDB staff will continue to make changes that will improve data quality and data access. We will continue to improve the unique data sets that we have created, and will create new unique data sets that will be useful to researchers throughout the biological community. To further improve data quality, we will implement biologically based data checks that will prevent bad data from being entered into GSDB.

With the tools and resources that we currently have available, GSDB is positioned to be a useful resource for computational and functional genomics. We will continue to create tools and data sets that will be useful for both computational questions and for functional questions.

Contact Information

GSDB can be contacted at: National Center for Genome Resources, 1800 Old Pecos Trail, Suite A, Santa Fe, NM 87505, USA. Tel: +1 505 982 7840 or +1 800 450 4854; Email: ncgr@ncgr.org or gsdb@ncgr.org ; URL: http://www.ncgr.org

Acknowledgements

We would like to acknowledge S.A.Krawetz and J.A.Kramer of Wayne State University Medical School for allowing us to host the MAR-Finder software tool. We would also like to acknowledge the assistance of J.H.Horton in building the maize discontiguous sequences.

References

Koonin

E.V.

et al. ,

Curr. Biol.

1997

, vol.

(pg.

404

416

)

Crossref

Burland

et al. ,

Nucleic Acids Res.

1995

, vol.

(pg.

2105

2119

)

Crossref

PubMed

Kunst

et al. ,

Microbiology

1995

, vol.

141

(pg.

249

255

)

Crossref

PubMed

Ogasawara

et al. ,

Microbiology

1995

, vol.

141

(pg.

257

259

)

Crossref

PubMed

Devine

K.M.

et al. ,

Trends Biochem. Sci.

1995

, vol.

(pg.

429

431

)

Fraser

C.M.

et al. ,

Science

1995

, vol.

270

(pg.

397

403

)

Crossref

PubMed

Koonin

E.V.

et al. ,

Mol. Microbiol.

1997

, vol.

(pg.

619

637

)

Crossref

PubMed

Tanksley

S.D.

McCouch

S.R.

Science

1997

, vol.

277

(pg.

1063

1066

)

Crossref

PubMed

Maymo-Gatell

et al. ,

Science

1997

, vol.

276

(pg.

1568

1571

)

Crossref

PubMed

Worley

K.C.

et al. ,

Genome Res.

1995

, vol.

(pg.

184

)

Crossref

PubMed

Altschul

S.F.

et al. ,

Nucleic Acids Res.

1997

, vol.

(pg.

3389

3402

)

Crossref

PubMed

Uberbacher

E.C.

et al.

Adams

Fields

Venter

Automated DNA Sequencing and Analysis

1994

London

Academic Press

(pg.

307

312

)

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Smith

R.F.

et al. ,

Genome Res.

1996

, vol.

(pg.

454

462

)

Crossref

PubMed

Singh

G.B.

et al. ,

Nucleic Acids Res.

1997

, vol.

(pg.

1419

1425

)

Crossref

PubMed

Harger

C.A.

et al. ,

Nucleic Acids Res.

1997

, vol.

(pg.

)

Crossref

PubMed

Bairoch

Wilkins

M.R.

Williams

K.L.

Appel

R.D.

Hochstrasser

D.F.

Proteome Research: New Frontiers in Functional Genomics

1997

New York

Springer-Verlag

(pg.

147

)

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Benson

D.A.

et al. ,

Nucleic Acids Res.

1996

, vol.

(pg.

)

[See also this issue Nucleic Acids Res. (1998) 26, 1–7]

Crossref

PubMed

Keen

et al. ,

Nucleic Acids Res.

1996

, vol.

(pg.

)

Crossref

PubMed

Kristensen

et al. ,

J. DNA Sequencing Mapping

1992

, vol.

(pg.

343

346

)

Crossref

Bork

Bairoch

Trends Genet.

1996

, vol.

(pg.

425

427

)

Crossref

PubMed

Lopez

et al. ,

Nature

1992

, vol.

355

pg.

211

Crossref

PubMed

Reynolds

T.L.

BioTechniques

1994

, vol.

(pg.

1124

1125

)

PubMed

Rothberg

J.M.

Nature

1992

, vol.

356

pg.

738

Crossref

PubMed

Savakis

Doelz

Science

1993

, vol.

259

(pg.

1677

1678

)

Crossref

PubMed

Lamperti

E.D.

et al. ,

Nucleic Acids Res.

1992

, vol.

(pg.

2741

2747

)

Crossref

PubMed

Seluja

G.A.

et al.

in prep

Blattner

F.R.

et al. ,

Science

1997

, vol.

277

(pg.

1453

1462

)

Crossref

PubMed

Pearson

W.R.

Griffin

A.M.

Griffin

H.G.

Methods in Molecular Biology, vol. 24: Computer analysis of sequence data, part 1

1994

Totowa, NJ

Humana Press

(pg.

307

331

)

Google Scholar

Google Preview

OpenURL Placeholder Text

WorldCat

Author notes

Present address: SmithKline Beecham Pharmaceuticals, Bioinformatics, UW 2230, 709 Swedeland Road, King of Prussia, PA 19460, USA

Download all slides

Month:	Total Views:
December 2016	1
February 2017	6
March 2017	3
April 2017	1
August 2017	1
October 2017	10
November 2017	5
December 2017	13
January 2018	15
February 2018	12
March 2018	31
April 2018	19
May 2018	48
June 2018	21
July 2018	15
August 2018	26
September 2018	33
October 2018	70
November 2018	30
December 2018	32
January 2019	33
February 2019	18
March 2019	47
April 2019	51
May 2019	70
June 2019	49
July 2019	52
August 2019	22
September 2019	41
October 2019	36
November 2019	42
December 2019	49
January 2020	54
February 2020	77
March 2020	31
April 2020	54
May 2020	41
June 2020	76
July 2020	35
August 2020	44
September 2020	65
October 2020	88
November 2020	66
December 2020	75
January 2021	35
February 2021	63
March 2021	52
April 2021	73
May 2021	54
June 2021	60
July 2021	79
August 2021	69
September 2021	104
October 2021	104
November 2021	111
December 2021	88
January 2022	72
February 2022	121
March 2022	145
April 2022	101
May 2022	90
June 2022	88
July 2022	73
August 2022	53
September 2022	87
October 2022	60
November 2022	52
December 2022	45
January 2023	63
February 2023	42
March 2023	47
April 2023	66
May 2023	52
June 2023	35
July 2023	39
August 2023	59
September 2023	23
October 2023	30
November 2023	30
December 2023	26
January 2024	13
February 2024	34
March 2024	26
April 2024	34
May 2024	34
June 2024	24
July 2024	20
August 2024	25
September 2024	32
October 2024	14
November 2024	13

Article Contents

The Genome Sequence DataBase (GSDB): Improving data quality and data access

Abstract

Introduction

Data Quality Improvements

Vector contamination

Unique sequence and annotation representation

Sequence alignments

Discontiguous sequences

Data Accessibility Improvements

Annotator

Update process for Maestro

Excerpt

GSDB flatfile definition

New Analysis Tools

Future Directions

Contact Information

Acknowledgements

References

Author notes

Comments

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

The Genome Sequence DataBase (GSDB): Improving data quality and data access

Abstract

Introduction

Data Quality Improvements

Vector contamination

Unique sequence and annotation representation

Sequence alignments

Discontiguous sequences

Data Accessibility Improvements

Annotator

Update process for Maestro

Excerpt

GSDB flatfile definition

New Analysis Tools

Future Directions

Contact Information

Acknowledgements

References

Author notes

Comments

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only