Ensembl creates, integrates and distributes reference datasets and analysis tools that enable genomics.
The ACeDB project was a source of many of our original ideas for modeling genome information, but we did not think that its binary-file-based method of persistent storage would scale to accommodate the human genome.
The NCBI toolkit requires predominantly C-based programmatic access, and would have resulted in a longer development time and a steeper learning curve for biologists unfamiliar with the C language. In addition, the primary mechanism of persistent storage offered by the toolkit ASN. It was decided to use a relational database management system (RDBMS) because of its numerous benefits over a file-based approach.
A relational database scales well, is accessible to users via a well-known query language (SQL), provides a means to index data for rapid queries, and allows many concurrent users to access the data at once. MySQL was chosen because of its faster performance and better long-string support, and a Perl application programming interface (API) was developed as the primary method of programmatic access.
Since the inception of the project, several other data storage and API solutions for genome information have become available. In particular, the GadFly project Mungall et al. The Grand Unified Schema Bahl et al. Although these projects provide excellent Web access to their resources, their schemas and code bases are either unavailable or their use is limited.
Finally, there have been several commercial genome management products based on proprietary technology from Softberry, Celera, and Doubletwist. Ensembl enjoyed interacting with many of these other developers, and freely shares all of its code and ideas. The rest of this article describes Ensembl's database and API, which have been the result of four years of development. Some decisions were well thought out and stood the test of time; others were due to the rapid pace of development, in particular at the start of the project.
The Ensembl system has proven flexible enough to be adopted for many genome projects. In house, Ensembl is currently used to annotate or display nine species, and externally the Ensembl system has been extended for use with the genomes of many organisms including Arabidopsis, rice, and numerous pathogens. The Ensembl database is used in two distinct phases, each with its own pattern of usage. The first phase is the production of the data and involves a high volume of both reads and writes to the database.
The second phase, the presentation of the data by a Web interface, requires rapid read-only access to the database. It was decided to serve both phases with the same schema and programming interface despite their divergent patterns of usage. A single code base has the advantages that there is less code to maintain, it removes the need for a denormalization step after data production, and it leads to more robust and flexible code. It does, however, prevent the use of certain database speed optimization methods and leads to a compromise between normalization of data, query optimization, and development time.
We avoided autogenerating the code or schema from a higher-level language e. The tables defined in the Ensembl schema can be divided into three functional categories: tables for the storage of DNA and assemblies, tables for the storage of computed features and genes, and tables containing miscellaneous information.
Figure 1 provides a general outline of the database structure.
Figure 1. Entity relationship model of the Ensembl schema. Tables are represented as divided rectangles consisting of a boldface table name at the top and a list of table attributes and attribute types below. Internal identifiers and join tables are omitted.
The basic unit of sequence is stored in the contig table. The string representation of the DNA sequence for a contig is stored in the dna table. Each contig row references a row in the clone table that provides additional detail about the BAC clone. An unfinished clone consists of multiple contig rows; a finished clone consists of a single contig.
The information needed to assemble chromosomal sequence from the set of contig sequences is stored in the assembly table. Various features are positioned on the genome sequence and stored in database tables. All features define a genomic position through a reference to a contig and start and end coordinates on the contig.
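To make the relationship between the contig, dna, and assembly tables concrete, the following Python sketch builds a toy in-memory database and assembles a chromosomal sequence from contig sequences. The column names (chr_start, contig_start, contig_ori, and so on), the example rows, and the use of SQLite in place of MySQL are all assumptions made for illustration; this is not the actual Ensembl schema, and gaps between contigs are ignored.

```python
import sqlite3

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def build_toy_db():
    """Create an in-memory database with hypothetical dna, contig, and assembly tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dna (dna_id INTEGER PRIMARY KEY, sequence TEXT);
        CREATE TABLE contig (contig_id INTEGER PRIMARY KEY, name TEXT, dna_id INTEGER);
        CREATE TABLE assembly (
            chr_name TEXT, chr_start INTEGER, chr_end INTEGER,
            contig_id INTEGER, contig_start INTEGER, contig_end INTEGER,
            contig_ori INTEGER
        );
        """
    )
    conn.executemany("INSERT INTO dna VALUES (?, ?)", [(1, "ACGTAC"), (2, "TTTGGC")])
    conn.executemany(
        "INSERT INTO contig VALUES (?, ?, ?)", [(1, "contig_a", 1), (2, "contig_b", 2)]
    )
    # Each assembly row places a stretch of one contig on the chromosome.
    conn.executemany(
        "INSERT INTO assembly VALUES (?, ?, ?, ?, ?, ?, ?)",
        [("1", 1, 4, 1, 2, 5, 1), ("1", 5, 8, 2, 1, 4, -1)],
    )
    return conn


def assemble(conn, chr_name):
    """Concatenate contig pieces in chromosomal order (gaps are not modeled)."""
    rows = conn.execute(
        """
        SELECT a.contig_start, a.contig_end, a.contig_ori, d.sequence
        FROM assembly a
        JOIN contig c ON c.contig_id = a.contig_id
        JOIN dna d ON d.dna_id = c.dna_id
        WHERE a.chr_name = ?
        ORDER BY a.chr_start
        """,
        (chr_name,),
    )
    pieces = []
    for contig_start, contig_end, ori, sequence in rows:
        piece = sequence[contig_start - 1:contig_end]  # 1-based, inclusive coordinates
        pieces.append(reverse_complement(piece) if ori == -1 else piece)
    return "".join(pieces)
```

In this toy data set, `assemble(build_toy_db(), "1")` joins bases 2 to 5 of the first contig with the reverse complement of bases 1 to 4 of the second.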
Some features contain additional, nonpositional information in related tables. An innovation in the storage of similarity search results is the compression of gapped alignment information in the form of dense character strings.
This was originally developed as an output format from Exonerate (G. Slater, unpubl.). Alignment features store the full extent of the gapped alignment and a cigar line. Each cigar line consists of an alternating series of numbers and letters, for example, 40M2I12M4D, with the letters standing for Match, Insertion, or Deletion.
The number preceding each letter dictates the length of the match, insertion, or deletion; used together with the feature's start and end coordinates, the complete alignment can be reconstructed. Prior to the adoption of cigar lines, alignments in Ensembl were stored as multiple ungapped features, with a single row for each matching region of the alignment. The more complex structure of a gene is distributed over multiple tables.
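The reconstruction described above can be sketched in a few lines of Python. The example parses a cigar line and recovers the ungapped blocks that Ensembl previously stored as separate rows. It assumes that an insertion (I) consumes positions only on the aligned (query) sequence and a deletion (D) only on the genomic (reference) sequence; that convention, and the function names, are illustrative assumptions rather than a description of the Ensembl code.

```python
import re


def parse_cigar(cigar):
    """Split a cigar line such as '40M2I12M4D' into (length, operation) pairs."""
    ops = re.findall(r"(\d+)([MID])", cigar)
    # Reject strings containing anything other than number/letter pairs.
    if "".join(number + op for number, op in ops) != cigar:
        raise ValueError("malformed cigar line: %r" % cigar)
    return [(int(number), op) for number, op in ops]


def ungapped_blocks(cigar, ref_start, query_start):
    """Return (ref_start, ref_end, query_start, query_end) for every matched block."""
    blocks = []
    ref_pos, query_pos = ref_start, query_start
    for length, op in parse_cigar(cigar):
        if op == "M":
            blocks.append(
                (ref_pos, ref_pos + length - 1, query_pos, query_pos + length - 1)
            )
            ref_pos += length
            query_pos += length
        elif op == "I":
            query_pos += length  # assumed: extra bases in the query only
        else:
            ref_pos += length    # assumed: bases present in the reference only
    return blocks
```

A single row holding the start coordinates and the string 40M2I12M4D therefore replaces the two ungapped rows that `ungapped_blocks` returns for it.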
A gene from Ensembl's perspective is a set of transcripts that share at least one exon. This is a more limited definition than, for example, a genetic locus, but it describes a relationship that can be easily identified computationally. Exons and their associated genomic positions are stored in the exon table.
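Because the gene definition above is purely computational, it can be illustrated with a small clustering routine. The sketch below groups transcripts that share an exon using a union-find structure. It assumes the grouping is transitive (two transcripts end up in the same gene if they are linked through a chain of shared exons), which is one reading of the definition and not necessarily how the Ensembl pipeline implements it; the identifiers are made up.

```python
def cluster_transcripts(transcript_exons):
    """Group transcript ids into genes; transcript_exons maps id -> set of exon ids."""
    parent = {transcript: transcript for transcript in transcript_exons}

    def find(transcript):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[transcript] != transcript:
            parent[transcript] = parent[parent[transcript]]
            transcript = parent[transcript]
        return transcript

    first_owner = {}  # exon id -> first transcript seen that contains it
    for transcript, exons in transcript_exons.items():
        for exon in exons:
            if exon in first_owner:
                # A shared exon merges the two clusters.
                parent[find(transcript)] = find(first_owner[exon])
            else:
                first_owner[exon] = transcript

    genes = {}
    for transcript in transcript_exons:
        genes.setdefault(find(transcript), set()).add(transcript)
    return sorted(sorted(members) for members in genes.values())
```

For example, transcripts T1 (exons E1, E2), T2 (E2, E3), and T3 (E3, E4) form one gene under this reading, while a transcript T4 with the single exon E9 forms its own.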
Transcripts reference zero or one translation table rows that describe the composition of untranslated regions and coding sequences. Pseudogenes and ncRNAs are examples of transcripts without translations. Exons are predicted with chromosomal positions but stored with contig positions.
The chromosomal coordinate system changes with each new assembly of a genome, and is thus more volatile than the contig coordinate system. Storing exons in contig coordinates ensures that unchanged exons have unchanged coordinates. One drawback to this approach is that exons may span multiple contigs when converted from chromosomal coordinates. When these split exons are retrieved from the database they are reassembled into a single exon by the API software.
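The reassembly of a split exon can be pictured as follows. This minimal Python sketch takes the pieces of one exon, each stored in contig coordinates, converts them to chromosomal coordinates, and joins them into a single interval. It assumes forward-oriented contigs and a simple dictionary giving the chromosomal position of each contig's first base; both are simplifications introduced here, not the API's actual data structures.

```python
def reassemble_exon(pieces, contig_chr_start):
    """Join exon pieces given as (contig_name, start, end) into one (chr_start, chr_end).

    contig_chr_start maps a contig name to the chromosomal coordinate of the
    contig's first base (forward orientation is assumed throughout).
    """
    if not pieces:
        raise ValueError("an exon needs at least one piece")
    mapped = sorted(
        (contig_chr_start[name] + start - 1, contig_chr_start[name] + end - 1)
        for name, start, end in pieces
    )
    chr_start, chr_end = mapped[0]
    for piece_start, piece_end in mapped[1:]:
        # Pieces of a single exon must abut once placed on the chromosome.
        if piece_start != chr_end + 1:
            raise ValueError("exon pieces are not contiguous on the chromosome")
        chr_end = piece_end
    return chr_start, chr_end
```

With a 500-base contig_a placed at chromosomal position 1001 and contig_b following at 1501, the pieces (contig_a, 451, 500) and (contig_b, 1, 30) come back as the single exon 1451 to 1530.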
Across different releases of human genome assemblies and other sequence data, Ensembl provides changing gene predictions. To allow the user to track a particular gene prediction despite changing coordinates, all gene-related predictions are assigned stable identifiers.
Between two versions of a genome we determine the correspondence between the old and new predictions, taking into account changes in genomic position or sequence. New predictions with a sufficiently high similarity to a previously made prediction inherit the previous prediction's stable identifier. The database follows some simple naming conventions to facilitate easier understanding and maintenance.
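The transfer of stable identifiers between releases, described above, can be sketched as a one-to-one matching problem. The example below scores every old/new pair by sequence similarity alone and lets the best-scoring pairs above a threshold inherit the old identifier. The similarity measure (Python's difflib ratio), the 0.9 threshold, and the identifier format are assumptions chosen for illustration; the source states only that position and sequence changes are taken into account and that sufficiently similar predictions inherit the identifier.

```python
from difflib import SequenceMatcher


def transfer_stable_ids(old, new, threshold=0.9):
    """Map each new prediction key to an inherited stable id, or None.

    old maps stable id -> sequence; new maps a temporary key -> sequence.
    """
    scored = []
    for new_key, new_seq in new.items():
        for stable_id, old_seq in old.items():
            score = SequenceMatcher(None, old_seq, new_seq, autojunk=False).ratio()
            if score >= threshold:
                scored.append((score, stable_id, new_key))

    # Greedy assignment: best pairs first, each identifier used at most once.
    scored.sort(reverse=True)
    assigned, used_ids = {}, set()
    for _score, stable_id, new_key in scored:
        if new_key not in assigned and stable_id not in used_ids:
            assigned[new_key] = stable_id
            used_ids.add(stable_id)
    return {new_key: assigned.get(new_key) for new_key in new}
```

Predictions that come back as `None` would receive a freshly minted identifier in a real system; that step is left out here.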
Ensembl's database access layer is written in Perl because of its numerous advantages as an implementation language. Perl is widely used in the bioinformatics and biology community, and it is a language well suited for writing Web applications. Another important factor was that Ensembl was originally created out of a Perl-based human annotation project, and parts of the existing software could be reused.
Adoption of Perl also enabled Ensembl to leverage the existence of the BioPerl project Stajich et al. BioPerl provided a base for an initial object model and aided in the dumping and parsing of flat files. As Ensembl has become more complex over its lifetime, this dependence on BioPerl has slowly diminished.
Currently there is very little BioPerl dependence inside Ensembl, and we are considering replacing the hard dependencies and producing a separate Ensembl-to-BioPerl bridge. However, some aspects of Perl are not well suited for a software project of Ensembl's size.
Whereas weak typing allows for rapid program development, absence of compile time checking of function prototypes and variable types is a steady source of runtime errors. Another disadvantage of Perl is its reference-count-based garbage collector, which effectively limits the use of circular references. Variables that are part of a circular reference structure are never garbage-collected and can introduce potentially serious memory leaks.
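The effect of pure reference counting on circular structures can be demonstrated outside Perl. The Python sketch below is an analogy only: CPython also uses reference counting, but adds a separate cycle collector, so the example switches that collector off to mimic the behavior described for Perl. It then shows that replacing one link of the cycle with a weak reference lets reference counting alone free the objects. The class and function names are hypothetical, and the results assume the CPython interpreter.

```python
import gc
import weakref


class Node:
    """A tiny object that can point at another Node."""

    def __init__(self, name):
        self.name = name
        self.other = None


def cycle_survives_refcounting():
    """Return (alive after del, alive after cycle collection) for a two-node cycle."""
    was_enabled = gc.isenabled()
    gc.disable()  # leave only reference counting active
    try:
        a, b = Node("a"), Node("b")
        a.other, b.other = b, a       # a and b now reference each other
        probe = weakref.ref(a)        # observes a without keeping it alive
        del a, b                      # counts never reach zero because of the cycle
        alive_after_del = probe() is not None
        gc.collect()                  # explicit run of the cycle collector
        alive_after_collect = probe() is not None
        return alive_after_del, alive_after_collect
    finally:
        if was_enabled:
            gc.enable()


def weak_backref_is_freed():
    """Return True if a parent with only a weak back-reference is freed on del."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        parent, child = Node("parent"), Node("child")
        parent.other = child
        child.other = weakref.ref(parent)  # weak link, so no strong cycle exists
        probe = weakref.ref(parent)
        del parent                         # the count drops to zero immediately
        return probe() is None
    finally:
        if was_enabled:
            gc.enable()
```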
Avoidance of circular reference memory leaks has necessitated some compromises to the overall system design. As described below, a Java API was developed to test ideas and allow gradual progression to a more strongly typed language. Ensembl models real-world biological constructs as data objects. For example, Gene objects represent genes, Exon objects represent exons, and RepeatFeature objects represent repetitive regions.
Data objects provide a natural, intuitive way to access the wide variety of information that is available in a genome. All information relating to a data object can be obtained by querying the object's methods. As an example, a Transcript object can provide the user with its identifier, its exons and its translation, and the like.
Data representation and database access are cleanly separated in the Ensembl API. Database access code resides exclusively in adaptor classes that create data objects. Each data object x e. Adaptors provide multiple ways to retrieve and store data objects from the database via methods that follow strict naming conventions.
This separation of logic enables adaptor classes to share query generation code and insulates data objects from underlying schema changes. The modularity of this design also makes it easy to add new data objects to the system. The decoupling of database logic additionally allows the transparent substitution of one data source for another. The abstraction of the data sources allows them to be interchanged to address particular flexibility and performance needs.
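The separation between data objects and adaptors can be illustrated with a compact Python sketch. The data object knows nothing about SQL, while the adaptor holds all database access and exposes retrieval methods with a consistent naming pattern. The table layout, the method names (modeled loosely on a fetch_by/fetch_all_by convention), and the use of chromosomal coordinates for genes are assumptions for this example; it is not the Ensembl Perl API.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Gene:
    """Plain data object: no database code lives here."""
    stable_id: str
    chr_name: str
    start: int
    end: int


class GeneAdaptor:
    """All SQL for Gene objects is confined to this class."""

    def __init__(self, conn):
        self._conn = conn

    def _fetch(self, where, params):
        # Shared query generation used by every public fetch method.
        sql = (
            "SELECT stable_id, chr_name, chr_start, chr_end FROM gene WHERE "
            + where
            + " ORDER BY chr_start"
        )
        return [Gene(*row) for row in self._conn.execute(sql, params)]

    def fetch_by_stable_id(self, stable_id):
        genes = self._fetch("stable_id = ?", (stable_id,))
        return genes[0] if genes else None

    def fetch_all_by_region(self, chr_name, start, end):
        # Any gene overlapping the requested interval is returned.
        return self._fetch(
            "chr_name = ? AND chr_end >= ? AND chr_start <= ?", (chr_name, start, end)
        )

    def store(self, gene):
        self._conn.execute(
            "INSERT INTO gene VALUES (?, ?, ?, ?)",
            (gene.stable_id, gene.chr_name, gene.start, gene.end),
        )


def make_demo_adaptor():
    """Build a toy database with three made-up genes and return its adaptor."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene (stable_id TEXT PRIMARY KEY, chr_name TEXT,"
        " chr_start INTEGER, chr_end INTEGER)"
    )
    adaptor = GeneAdaptor(conn)
    adaptor.store(Gene("GENE0001", "1", 100, 500))
    adaptor.store(Gene("GENE0002", "1", 900, 1200))
    adaptor.store(Gene("GENE0003", "2", 100, 500))
    return adaptor
```

If the gene table were renamed or split, only `GeneAdaptor` would need to change; code that works with `Gene` objects would be unaffected.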
Retrieval from the optimized database is facilitated by ProxyAdaptor classes that dynamically decide to forward requests to either the default data source or to the optimized data source.
The DBAdaptor is a specialized adaptor that maintains a connection to the database and acts as a factory for object adaptors. The centralized object adaptor creation code can ensure that only a single object adaptor of each type is created per database.
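A factory of this kind might look like the following Python sketch, in which a database adaptor owns the connection and hands out exactly one object adaptor per type. The class names, the `get_adaptor` method, and the registry dictionary are hypothetical and stand in for whatever mechanism the Perl implementation uses.

```python
import sqlite3


class ObjectAdaptor:
    """Base class for adaptors; each instance keeps its own cache."""

    def __init__(self, db):
        self.db = db
        self.cache = {}


class GeneAdaptor(ObjectAdaptor):
    pass


class ExonAdaptor(ObjectAdaptor):
    pass


class DBAdaptor:
    """Holds the database connection and creates object adaptors on demand."""

    _adaptor_classes = {"gene": GeneAdaptor, "exon": ExonAdaptor}

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self._adaptors = {}

    def get_adaptor(self, kind):
        # Create the adaptor on first request, then always return the same instance.
        if kind not in self._adaptors:
            self._adaptors[kind] = self._adaptor_classes[kind](self)
        return self._adaptors[kind]
```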
This enables the object adaptors to cache instances of the features they retrieve and to improve overall performance. A Container class alleviates memory leaks created by the circular references between the DBAdaptor and its object adaptors. Even when Slices are based on completely finished genomes with no underlying contigs, their usage will remain the same.
This is one major advantage of an API-based approach to genomic data storage. The primary coordinate system in which all database features are stored has remained contig-based despite the common usage of a chromosomal-based coordinate system. Rather than distributing coordinate transformation code throughout the code base, we introduced a general Mapper class that encapsulates coordinate transformation between two sequences.
A more specific AssemblyMapper class utilizes the Mapper object and the contents of the assembly table to translate from contig-based coordinates to chromosomal coordinates and vice versa.
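The idea of encapsulating coordinate transformation in one class can be sketched as follows. This Python example keeps a list of aligned, equal-length intervals between two sequences and maps a range from one to the other, flipping the strand where the interval is reverse-oriented; an `inverse` method provides the opposite direction. The class layout and method names are assumptions for illustration, gaps between intervals are simply skipped, and the sketch handles a single pair of sequences rather than a whole assembly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """One aligned interval: from-coordinates, to-coordinates, and orientation."""
    from_start: int
    from_end: int
    to_start: int
    to_end: int
    ori: int


class Mapper:
    """Maps ranges between two sequences using a list of aligned intervals."""

    def __init__(self):
        self._pairs = []

    def add_pair(self, from_start, from_end, to_start, to_end, ori=1):
        if from_end - from_start != to_end - to_start:
            raise ValueError("paired intervals must have equal length")
        if ori not in (1, -1):
            raise ValueError("orientation must be 1 or -1")
        self._pairs.append(Pair(from_start, from_end, to_start, to_end, ori))

    def map(self, start, end, strand=1):
        """Return a list of (start, end, strand) on the target sequence."""
        mapped = []
        for pair in sorted(self._pairs, key=lambda p: p.from_start):
            lo, hi = max(start, pair.from_start), min(end, pair.from_end)
            if lo > hi:
                continue  # no overlap with this interval
            if pair.ori == 1:
                offset = pair.to_start - pair.from_start
                mapped.append((lo + offset, hi + offset, strand))
            else:
                # Reverse orientation: positions count back from to_end.
                mapped.append(
                    (
                        pair.to_end - (hi - pair.from_start),
                        pair.to_end - (lo - pair.from_start),
                        -strand,
                    )
                )
        return mapped

    def inverse(self):
        """Return a Mapper for the opposite direction."""
        reverse = Mapper()
        for p in self._pairs:
            reverse.add_pair(p.to_start, p.to_end, p.from_start, p.from_end, p.ori)
        return reverse
```

A contig whose bases 1 to 100 are placed in reverse orientation at chromosomal positions 1001 to 1100 maps its forward-strand range 10 to 20 onto 1081 to 1091 on the reverse strand, and the inverse mapper takes that range back to 10 to 20.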
Object creation and coordinate transformation have an impact on the speed of the API. For our dominant use cases we find this is an acceptable performance decrease. We aim to have the greatest possible ease of use for the Perl API. It should not just support our gene prediction process and Web display code, but it should also make it simple for researchers to perform their own genome analysis tasks. Perl suffers from certain disadvantages as an implementation language for a large-scale project.
Java overcomes many of these problems and has the benefits of compile time type checking, enforced interfaces, multithreading, better support for graphical user interfaces, and correct garbage collection of circularly referenced objects. Additionally, the Java API separates interface from implementation for all standard data objects and adaptors and thus allows for transparent alternative implementations.
It is used by the standalone genome browser Apollo, and it is used internally for the stable identifier mapping process and by various other projects such as Toucan Aerts et al.
Ensembl releases about 35 databases on a monthly basis. To ensure their relational integrity and to validate that they are populated with reasonable data, several SQL-based tests are performed using a Java quality assurance system named ensj-healthcheck. The system consists of groups of tests individually encapsulated into Java classes.
It is relatively easy, even for Java newcomers, to write integrity checks that are automatically detected and used. Since the introduction of the ensj-healthcheck system, better consistency over schemas has been achieved, and it has become easier to detect data errors at an early stage.
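The structure of such a test suite can be sketched briefly. The real system is written in Java; the Python analogue below is illustrative only and shares no code or class names with ensj-healthcheck. Each check is a small class wrapping one SQL query that counts offending rows, and new checks are picked up automatically because the runner enumerates the subclasses of a common base class. The table and column names are assumptions.

```python
import sqlite3


class SqlCountCheck:
    """Base class: a check passes when its SQL counts zero offending rows."""

    description = ""
    sql = ""

    def run(self, conn):
        (count,) = conn.execute(self.sql).fetchone()
        return count == 0


class TranscriptHasGene(SqlCountCheck):
    description = "every transcript references an existing gene"
    sql = (
        "SELECT COUNT(*) FROM transcript t "
        "LEFT JOIN gene g ON g.gene_id = t.gene_id "
        "WHERE g.gene_id IS NULL"
    )


class ExonCoordinatesSane(SqlCountCheck):
    description = "exon start does not exceed exon end"
    sql = "SELECT COUNT(*) FROM exon WHERE exon_start > exon_end"


def run_all(conn):
    """Run every direct subclass of SqlCountCheck; return check name -> passed."""
    return {cls.__name__: cls().run(conn) for cls in SqlCountCheck.__subclasses__()}


def build_toy_db(orphan_transcript=False):
    """Create a small database, optionally with a transcript pointing at no gene."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene (gene_id INTEGER PRIMARY KEY);
        CREATE TABLE transcript (transcript_id INTEGER PRIMARY KEY, gene_id INTEGER);
        CREATE TABLE exon (exon_id INTEGER PRIMARY KEY, exon_start INTEGER, exon_end INTEGER);
        INSERT INTO gene VALUES (1);
        INSERT INTO transcript VALUES (10, 1);
        INSERT INTO exon VALUES (100, 5, 50);
        """
    )
    if orphan_transcript:
        conn.execute("INSERT INTO transcript VALUES (11, 99)")
    return conn
```

Adding a new integrity rule then amounts to writing one more subclass with a description and a query.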
Newly detected errors are added to the existing suite of test cases and are prevented from reoccurring. Genomic data pose serious challenges to biologists: both the mundane challenges of how to manipulate the data and the scientific challenges of how to find and use the information contained within it. In the case of mammalian genomes, the mundane aspects can come to dominate researchers' time.
The system described by this article enables scientists to work effectively with the genome by providing an efficient means to access vast amounts of data. The other articles in this issue detail how this software system is already used to create and to display genomic information for biologists.
The system has been designed for maximum utility by researchers. Ensembl's relational database allows the flexible retrieval of nonredundant information. For example, researchers can retrieve genes using several different constraints such as a genomic location or a HUGO identifier. An increasing number of laboratory biologists have Perl, Java, or Python programming skills.
The software system developed as part of this project allows simple scripts to be developed by external researchers and eliminates the overhead required to construct a framework for access to this genomic information. High-throughput technologies used in smaller laboratory settings demand specialized data analysis software to be written to integrate experimental results with reference data sets such as the genome. For example, researchers investigating specific gene families can run primer design programs across the genome to provide a high-throughput system to survey mutations.
The current Ensembl system is biased toward a clone-based genome project. Every genome imported into Ensembl requires entries in the clone, contig, chromosome, and assembly tables. Whole-genome shotgun assemblies do not naturally fit into this mould, either because they do not have clones or because the assembly may be fragmented into thousands of scaffolds instead of chromosomes.
The Ensembl sequence storage system will be improved by replacing the contig, clone, and chromosome tables with a single general sequence region table. A modified assembly table will describe the composition of arbitrary sequence regions rather than the makeup of chromosomes from contigs.
Locatable features will be stored with coordinates relative to sequence regions and will not be limited to storage in the contig coordinate system. Effectively, bias toward particular coordinate systems will be removed; the system will become more flexible and will accommodate a wider variety of methods of sequencing and assembly.
The Ensembl system will also be extended to include support for alternative sequence regions. This will include the ability to represent structural haplotypes with highly divergent sequence e. The generalization of the sequence and assembly storage and the addition of new features are expected to require only minor changes to the existing programming interface.
Users will be shielded from the low-level database and code changes by the layers of abstraction in the existing system. The current API will continue to function in nearly all cases, but some functions and naming conventions will slowly replace the current ones.
Retrieval speed should improve as features can be calculated and stored in the coordinate system in which they are predominantly requested. There exists a myriad of ways to store and manipulate genome sequence; the Ensembl system is a robust and scalable solution. It is reassuring that the core concepts from the project inception e. A similar dynamic of evolutionary development is expected to occur over the next five years with the Ensembl database and API providing the central support for genome information.
We thank the other members of the Ensembl group for their support and patience during the development of the system. The publication costs of this article were defrayed in part by payment of page charges.
Stabenau A, McVicker G, Melsopp C, Proctor G, Clamp M, Birney E. Genome Res.
Abstract: Systems for managing genomic data must store a vast quantity of information.
Received Aug 8; Accepted Feb.