An Xiao, Yingdan Wu, Zhipeng Yang, Yingying Hu, Weiye Wang, Yutian Zhang, Lei Kong, Ge Gao, Zuoyan Zhu, Shuo Lin, Bo Zhang, EENdb: a database and knowledge base of ZFNs and TALENs for endonuclease engineering, Nucleic Acids Research, Volume 41, Issue D1, 1 January 2013, Pages D415–D422, https://doi.org/10.1093/nar/gks1144
Abstract
We report here the construction of the engineered endonuclease database (EENdb) (http://eendb.zfgenetics.org/), a searchable database and knowledge base for customizable engineered endonucleases (EENs), including zinc finger nucleases (ZFNs) and transcription activator-like effector nucleases (TALENs). EENs are artificial nucleases designed to target and cleave specific DNA sequences. They have proved to be a very useful genetic tool for targeted genome modification, show great potential for applications in basic research, clinical therapy and agriculture, and are especially important for reverse genetics research in species where no other gene targeting techniques are available. EENdb contains over 700 records of all the reported ZFNs and TALENs and related information, such as their target sequences, their peptide components [zinc finger protein-/transcription activator-like effector (TALE)-binding domains, FokI variants and linker peptide/framework], and the efficiency and specificity of their activities. The database also lists EEN engineering tools and resources as well as information about forms and types of EENs, EEN screening and construction methods, detection methods for targeting efficiency and many other utilities. The aim of EENdb is to serve as a central hub for EEN information and an integrated solution for EEN engineering. Comparison of the known ZFNs or TALENs collected in EENdb may help to extract in-depth properties and common rules regarding ZFN or TALEN efficiency.
INTRODUCTION
Engineered endonucleases (EENs) are designed to bind and cleave specific DNA sequences in vitro or in vivo. Zinc finger nucleases (ZFNs) and transcription activator-like effector nucleases (TALENs) are the two most widely used customizable types of EENs and have many features in common (1). They recognize target sequences via engineerable DNA-binding domains and cut the DNA with non-specific nuclease domains (2,3). With these artificial endonucleases, site-specific DNA double-stranded breaks (DSBs) can in theory be induced in any given genome, allowing targeted genome modification to be achieved easily. Various genetic manipulations, such as gene disruption, gene correction and gene addition, have been designed and achieved in many organisms as well as in cultured cells based on EEN techniques. EENs are especially important for gene targeting in species where no other reverse genetics approaches are available (4–8). Other types of EENs have also been reported, such as triplex-forming oligonucleotides (9–12), engineered homing endonucleases (HEs) (13–15) and fusion proteins of the DNA-binding domains of zinc fingers or HEs with full-length restriction endonucleases (16,17); however, they are much more difficult to customize than ZFNs and TALENs, which are based on a modular-repeat structure. A database for HE engineering, called LAHEDES, has been reported (18).
Most ZFNs and TALENs function as dimers. Each monomer of these EENs is an artificially constructed hybrid protein containing a specific DNA-binding domain, which is derived from zinc finger proteins (ZFPs) for ZFNs (19) or from transcription activator-like (TAL) effectors (TALEs) for TALENs (20); a non-specific cleavage domain, which usually comes from the FokI restriction endonuclease (2); and a linker peptide between these two domains, together with other framework sequences. The DNA-binding domain of ZFPs or TALEs consists of an array of tandem repeat units, each of which recognizes and binds to one or more nucleotides in one strand of the DNA. In ZFPs, each zinc finger repeat targets approximately 3 nt (called a triplet), but internal and external context-dependent effects of neighboring fingers can influence the efficiency and specificity of the corresponding target-protein relations (21). No accurate relationship has been discovered, so several strategies have been developed to screen for and construct efficient and specific ZFPs and ZFNs. In contrast, TALE domains show a more predictable one-to-one correspondence between repeat units and their single-nucleotide targets, and no screening is required for TALEN engineering (22,23). Although hundreds of ZFNs or TALENs have been constructed and tested, the targeting efficiency of these EENs varies significantly. In fact, the efficiencies of TALENs from the same work, or even targeting the same locus, may vary from low to high, and some show no targeting activity at all (24–27). Unfortunately, it is still difficult to predict the efficiency of a given ZFN or TALEN, largely owing to insufficient knowledge about the properties of these EENs. Analyzing the rapidly expanding information on these EENs from various resources, and comparing all the evaluated TALENs or ZFNs, may help to extract common rules regarding their efficiencies.
On the other hand, the specificity and potential toxicity of TALENs still need further and careful investigation. In certain applications, the technology of ZFN is more mature, e.g. ZFNs are being tested in gene therapy trials (21,28–30), and are reported recently to be able to be delivered into cultured cells in the form of purified proteins directly (31). The properties of these two types of EENs have been compared with each other in some of the recent works (26,27,32,33).
Other factors in the customizable EEN engineering for TALENs and ZFNs, such as variants of FokI cleavage domains (34–40), as well as the relationship between the length of linker peptides and the length of spacer between the two EEN monomer binding sites (41–46), can also affect the efficiency and specificity of these EENs. Methodologies for the customizable EEN construction and genome modification have been optimized for best performance. The efficiency of EENs and the detection methods as well as the specificity evaluated by off-target tests have been reported previously (47,48). Collection and comparison of the corresponding information may be useful for reviewing/reusing existing EENs and engineering new TALENs or ZFNs.
Some databases such as ZifBASE (49) and ZiFDB (50) have been developed to collect the information of some of the ZFP domains but not the nucleases themselves. So far, hundreds of efficient ZFNs and TALENs have been reported and many methods for their engineering and construction including high-throughput ones (24,51) have been developed. However, a database systematically collecting and classifying the information of all the currently known customized EENs and their targets is unfortunately lacking.
Some tools are available to predict/search for candidate EEN target sites in given DNA sequences. TALE-NT (52), ZiFiT (53) and idTALE (54) are the predominant tools for TALEN/TALE target design. ZiFiT also supports ZFN target design and selection covering various strategies, whereas ZFN Sequence Tag and other web-tools (55–57) only support particular types of ZFN design. In addition, several strategies have been developed to screen for and construct customized EENs. Polymerase chain reaction (PCR)-based Golden Gate cloning (58,59), vector-based Golden Gate cloning (60–64), Unit Assembly (65), REAL/REAL-Fast (33), dTALE-assembly (54), FLASH (24) and ICA (51) are designed to construct the long and highly repetitive TALE DNA-binding domains, which are difficult to synthesize by ordinary PCR or modular cloning. Modular Assembly (55,66,67), OPEN (68), CoDA (69) and other public (70,71) or commercial strategies (72) have been developed to screen for and obtain ZFP domains showing high target binding activity. Apart from these methods, several platforms such as Addgene (http://www.addgene.com) offer plasmids, kits and protocols for EEN engineering covering a selection of different methods or strategies. For researchers who do not work in, and are not familiar with, the field of EEN engineering, it can be difficult to find and compare all the existing resources and make a proper decision. A knowledge base providing simple descriptions of and links to these tools and resources will give users a starting point from which to find the most appropriate one for themselves.
Here, we report the construction of EENdb, the first database and knowledge base dedicated to a complete and detailed collection of all the currently available and evaluated ZFN and TALEN data extracted and curated from publications and other sources. It also integrates different EEN (ZFN and TALEN) engineering resources to provide a relatively complete solution for searching for known EENs as well as for the design and construction of new customized ZFNs and TALENs. In the current version of EENdb and in the text below, EENs refer to ZFNs and TALENs.
MATERIALS AND METHODS
Data sources
We manually extracted the information of all ZFNs and TALENs reported to be effective, either targeting endogenous loci or designed for artificial sequences, directly from published research articles, or reconstructed the necessary information if it was not given directly. Non-functional TALENs and ZFNs were also included for comparison and analysis, but numerous inactive ZFNs from screening attempts were omitted. The data of some TALENs showing no detectable activity were also collected via direct submission by the authors.
In total, more than 400 records of ZFNs, more than 300 records of published TALENs and 24 records of directly submitted TALENs have been collected and curated so far. Redundant EENs from the same or different sources were merged. The core information in the database consists of targeting site sequences, types of ZFN and TALEN forms, critical peptide sequence of the DNA-binding domains, linker peptides and frameworks, FokI variants and other alternate cleavage domains, ZFP or TALE construction strategies and genome modification methods. EEN efficiency with the detection methods and EEN specificity with off-target sites tested were also included if available. In the whole target site for a pair of EENs, the ZFP- or TALE-binding sequences (i.e. half-sites) and spacer sequences between the two half-sites were carefully differentiated.
An NIH-sponsored project aiming to target endogenous genes in zebrafish (Danio rerio) regularly constructs arrays of ZFNs and TALENs with only ZFP- or TALE-binding activities tested and releases them in Addgene (http://www.addgene.com/zfc/arrays/ and http://www.addgene.com/talengineering/TALENzebrafish/). EENdb has not collected the information of these EENs unless the cleavage activities were tested and reported. However, a message pointing to the project web pages appears when users list EENs of all species or of zebrafish in EENdb.
To complement the EEN tables, we also collected information on artificial TALEs and on natural and artificial ZFPs (only the C2H2 type, which is used as the framework for engineered ZFPs), including those for which only binding capability, but not cleavage activity, has been reported. The 7-aa variable region of each finger of ZFPs (including ZFNs) and the repeat-variable di-residue (RVD) of each repeat unit of TALEs, which are considered to be the most critical elements determining DNA-binding specificity, were extracted and displayed in association with their target nucleotides.
Design and implementation
EENdb is implemented with open source technologies. The data are stored in a MySQL relational database. The web site is written in PHP scripts and the service is provided by an Apache/PHP server.
RESULTS
EENdb consists of five interrelated sections (Figure 1). The main section, called ‘TALEN/ZFN’, is the collection and summary of the information about EENs (including TALENs and ZFNs). ‘ZFP Domain’ and ‘TAL Effector’ are two sections built to complement the main section: ‘ZFP Domain’ provides the collection of the zinc finger DNA-binding domains, especially the 7-aa regions from ZFNs and other ZFPs, whereas ‘TAL Effector’ collects the information of the TALE DNA-binding domains from TALE proteins other than TALENs. Details about the components and information of EENs, such as variants of the FokI cleavage domains, the scaffolds/frameworks and linker peptides between the DNA-binding domains and the cleavage domains, as well as the information about the construction and application of EENs, are summarized in the ‘Utilities’ section and linked to other EEN datasets in each corresponding field. The last section, named ‘Engineering Resource’, provides short descriptions of and links to all the available external resources for EEN engineering and is also linked to the EEN construction methods summarized in ‘Utilities’.
The dataset of EENs
Collection of the essential information about EENs is the major component of EENdb. From the ‘TALEN/ZFN’ page (Figure 2A), users can list all TALENs and/or ZFNs designed to target a particular species, or search by gene symbols, gene IDs, whole or partial target sequences (degenerate nucleotides are also supported), reference PMIDs (PubMed IDs) or surnames of the first authors. Results for ZFNs, TALENs or both types of EENs are returned, according to the search option set by the user. The search result table, or the table listed by species, shows summaries of EEN information (Figure 2B). More details can be accessed through the detail page by clicking on the EENdb IDs in the table.
For each EEN, the following information is displayed in the detail page; some of the information is omitted or shown in abbreviation in the summary table.
EENdb ID
EEN data are collected and classified according to their target sites. The ID of a ZFN record usually has the format ‘ZNxxxx’ or ‘ZNAxxx’ and the ID of a TALEN has the format ‘TNxxxx’ or ‘TNAxxx’, in which each ‘x’ refers to a digit. ‘ZNxxxx’ and ‘TNxxxx’ are assigned to EENs targeting natural sites, whereas ‘ZNAxxx’ and ‘TNAxxx’ are for EENs targeting artificially synthesized DNA sequences. Groups of different EENs targeting similar sequences from the same or different literature are given serial IDs with one-letter suffixes. For example, TN0002, TN0002B, TN0002C and TN0002D all target the same or overlapping sites of the human CCR5 gene. A ‘-Txxx’ suffix indicates that the record contains a previously reported EEN (i.e. the same EEN) targeting another DNA sequence. In most cases, this represents an off-target site. For example, TN0031-T002 targets CCR2, an off-target gene of TN0031, which was originally designed to target the CCR5 gene. In summary tables listing a particular species or search results, the EENs are sorted by default by publication PMID (approximately in the order of publication date) so that EENs from the same publication are arranged together. Alternatively, the table can be grouped by EENdb ID, the ordering also used in the EEN detail pages, so that related records with the same target sequence and/or off-targets of the same EEN are brought together for comparison.
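The ID scheme described above can be made concrete with a small parser. The following is only a minimal sketch in Python, not code from EENdb itself: the function name `parse_eendb_id`, the `EENdbID` tuple and the regular expression are assumptions built from the formats and examples given in the text, and because the text says IDs "usually" follow these formats, real records may include cases this pattern rejects.

```python
import re
from typing import NamedTuple, Optional


class EENdbID(NamedTuple):
    """Parsed pieces of an EENdb ID (hypothetical structure, for illustration)."""
    een_type: str              # "ZFN" or "TALEN", taken from the first letter
    artificial: bool           # True for ZNAxxx / TNAxxx (artificial target)
    number: str                # digits kept as text to preserve zero padding
    group_suffix: str          # one-letter serial suffix, or "" if absent
    alt_target: Optional[str]  # digits after "-T", or None if absent


# Assumed pattern: ZN/TN, then either "A" + 3 digits or 4 digits,
# an optional one-letter suffix, and an optional "-Txxx" suffix.
_ID_RE = re.compile(r"^(ZN|TN)(?:(A)(\d{3})|(\d{4}))([A-Z]?)(?:-T(\d{3}))?$")


def parse_eendb_id(text: str) -> EENdbID:
    """Split an ID such as 'TN0002B' or 'TN0031-T002' into its parts."""
    match = _ID_RE.match(text)
    if match is None:
        raise ValueError(f"not a recognised EENdb ID: {text!r}")
    prefix, art_flag, art_digits, nat_digits, suffix, alt = match.groups()
    return EENdbID(
        een_type="ZFN" if prefix == "ZN" else "TALEN",
        artificial=art_flag is not None,
        number=art_digits if art_flag else nat_digits,
        group_suffix=suffix,
        alt_target=alt,
    )


if __name__ == "__main__":
    for example in ("TN0002", "TN0002B", "TN0031-T002"):
        print(example, parse_eendb_id(example))
```

Sorting records by `(een_type, artificial, number, group_suffix, alt_target or "")` would then approximate the "group by EENdb ID" view, keeping a parent record next to its lettered relatives and its off-target entries.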
EEN type or form
Whether an EEN is a ZFN or a TALEN can be easily distinguished from the first letter of the ID. In addition, most of the EENs are functional in dimers and the FokI domains are usually fused in the C-terminus of the monomers, but exceptions do exist (73–78). A description of rarely used EEN form and a link to a page with detailed explanation in the ‘Utilities’ section is provided under the corresponding EENdb ID.
Target site sequence
The sequence of the whole target site recognized by an EEN pair is carefully extracted from or constructed based on publications. The half-sites (i.e. the binding site for a single EEN) of the whole target site are shown in uppercase; the strands bound by the ZFP- or TALE-binding domains are underlined. The spacers are shown in lowercase letters. One additional base pair flanking the half-sites is also provided, because it can be important for ZFNs and TALENs, e.g. to show whether the additional nucleotide is T, the nucleotide most commonly used for TALENs (3,24), or to take account of the context-dependent effect of ZFN fingers (79,80). The target sequences can be searched by either the forward or the reverse strand.
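Searching on either strand with degenerate nucleotides can be illustrated with a short Python sketch. This is not the EENdb search implementation (the site itself is written in PHP and its query logic is not described here); the function names are hypothetical, the degenerate letters follow the standard IUPAC nucleotide codes, and the stored target is assumed to contain only A, C, G and T.

```python
# IUPAC nucleotide codes: each query letter maps to the bases it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Complement table that also covers the degenerate letters (e.g. R <-> Y).
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a (possibly degenerate) sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def _matches(query: str, window: str) -> bool:
    """True if every base in the window is allowed by the query letter."""
    return all(base in IUPAC[code] for code, base in zip(query, window))


def find_sites(query: str, target: str) -> list:
    """Return (position, strand) for every hit of the query in the target.

    Positions are 0-based offsets on the forward strand of the target;
    strand is "+" for a direct hit and "-" for a reverse-complement hit.
    """
    query, target = query.upper(), target.upper()
    hits = []
    for strand, pattern in (("+", query), ("-", reverse_complement(query))):
        for start in range(len(target) - len(pattern) + 1):
            if _matches(pattern, target[start:start + len(pattern)]):
                hits.append((start, strand))
    return hits


if __name__ == "__main__":
    site = "GGATCCAAAT"  # made-up sequence, not an EENdb record
    print(find_sites("CCNAAT", site))   # degenerate query, forward strand
    print(find_sites("ATTTGG", site))   # matches only via the reverse strand
```

Upper-casing the target before matching means the uppercase half-sites and lowercase spacer of a stored site are searched as one continuous sequence.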
Other information of the target site
For natural targets, the English and Latin names of species and the gene or genomic locus represented by Ensembl IDs, Ensembl Genomes IDs or RefSeq Accessions are given. Length of the spacer, numbers of fingers in a ZFN and lengths of half-sites of a TALEN are calculated and displayed.
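Because half-sites are written in uppercase and the spacer in lowercase, the lengths mentioned above can be derived from the displayed sequence alone. The sketch below assumes, purely for illustration, that the extra flanking base pairs have already been stripped, so the input is exactly uppercase, lowercase, uppercase; `split_target_site` is a hypothetical helper and the example sequence is made up rather than taken from an EENdb record.

```python
import re

# Assumed layout: left half-site (uppercase), spacer (lowercase),
# right half-site (uppercase), with no flanking bases.
_SITE_RE = re.compile(r"^([ACGT]+)([acgt]+)([ACGT]+)$")


def split_target_site(site: str) -> dict:
    """Split a case-annotated target site and report the component lengths."""
    match = _SITE_RE.match(site)
    if match is None:
        raise ValueError(f"expected UPPER-lower-UPPER layout, got {site!r}")
    left, spacer, right = match.groups()
    return {
        "left_half_site": left,
        "spacer": spacer,
        "right_half_site": right,
        "left_length": len(left),
        "spacer_length": len(spacer),
        "right_length": len(right),
    }


if __name__ == "__main__":
    # Made-up example: two 8-nt half-sites separated by a 6-nt spacer.
    print(split_target_site("GATCCTAGgattcaACCTGGTC"))
```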
DNA-binding domains of EEN monomers
The key amino acids of the DNA-binding domains are shown here. For ZFNs, the 7-aa variable regions of each finger and a link to the ‘ZFP Domain’ section are provided. For TALENs, four most commonly used RVDs, each recognizing its own corresponding single-nucleotide target (i.e. NI for nucleotide A, HD for C, NG for T and NN for G), are considered as ‘standard code’ of RVDs (22,23); other non-standard ‘alternative code’ of RVDs (e.g. NK or NH for G, NG for mC or 5-methylcytosine) (20,81–83) and off-targeted RVDs are marked with colors different from the ‘standard’ ones for identification.
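The 'standard code' lends itself to a simple lookup table. The Python sketch below decodes a list of RVDs into the nucleotides they are expected to recognize and reports which positions use something other than the four standard RVDs, loosely mirroring the color marking described above. The function name and return shape are assumptions; only the RVD-to-nucleotide pairs named in the text are included, and NG is treated as standard (T) because its use for 5-methylcytosine cannot be told apart from the RVD alone.

```python
# 'Standard code' RVDs as listed in the text.
STANDARD_RVD = {"NI": "A", "HD": "C", "NG": "T", "NN": "G"}

# 'Alternative code' RVDs named in the text (both given for G).
ALTERNATIVE_RVD = {"NK": "G", "NH": "G"}


def decode_rvds(rvds):
    """Translate RVDs to a target sequence and flag non-standard positions.

    Returns (sequence, flagged), where flagged lists the 0-based indices of
    repeats whose RVD is not one of the four standard ones. RVDs that are in
    neither table are written as 'N' in the sequence.
    """
    bases = []
    flagged = []
    for index, rvd in enumerate(rvds):
        rvd = rvd.upper()
        if rvd in STANDARD_RVD:
            bases.append(STANDARD_RVD[rvd])
        else:
            bases.append(ALTERNATIVE_RVD.get(rvd, "N"))
            flagged.append(index)
    return "".join(bases), flagged


if __name__ == "__main__":
    print(decode_rvds(["NI", "HD", "NG", "NN"]))
    print(decode_rvds(["NK", "HD", "NH", "NG"]))
```

Comparing the decoded sequence with the recorded half-site would be one way to spot repeats bound to a mismatched nucleotide, although that comparison is not shown here.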
Other components of the EEN protein
The types of linker peptides between the DNA-binding domains and FokI domains, the variants of FokI cleavage domains as well as the screening and construction methods or strategies for the DNA-binding domains are included if they are known. These items are linked to the related pages in the section of ‘Utilities’.
Effectiveness of EEN
Each EEN record also contains the modification method of the target locus [e.g. non-homologous end joining (NHEJ) or homologous recombination (HR)], efficiency of the EENs and the detection methods and in vitro cultured cell lines tested if available, heritability of EEN-induced mutations at organism level and specificity of the EEN revealed by the result of off-target evaluation. The modification and efficiency detection methods are also linked to the corresponding ‘Utilities’ pages for detailed explanation. In some cases, comparisons or collaborations with other EENs are indicated as comments.
References
PMID with a link to the first publication reporting certain pair of EENs is provided, or other information is listed for directly submitted records.
User commenting system
On the detail page of each EEN record, a simple commenting system is available; users can contribute corrections, supplementary information and other comments either through this system or by contacting the administrators of EENdb by email.
The dataset of ZFP domain and TALE
Natural or artificial ZFP domains and artificial TALEs compose two other sections of EENdb in parallel to the section of TALEN/ZFN. The dataset of ZFP domains consists of more than 1000 records and provides much more information about ZFP-binding domains for data mining, e.g. discovering the relationship between the key peptide sequences and the target nucleotide sequences of the nucleases based on statistical analysis of known ZFPs. The dataset of TALEs provides a link to a list of natural TALEs and collects TALEs beyond TALENs, which are both rich in ‘alternative code’ or off-targeted RVDs and may be helpful for researchers who are interested in the analysis of TALE-binding activities and discovering more information of the DNA-binding properties of TALEs.
The utilities of EENdb
One of the aims of EENdb is to build a knowledge base about EENs. We invested a great deal of effort to organize and refine the information and utilities of EENdb, which offers a global view of the emerging and fast developing ZFN and TALEN technology and resources. In the ‘Utilities’ section, detailed information of ZFN and TALEN forms and structures, frameworks and linker peptides, FokI variants and other alternate cleavage domains, repeat units of DNA-binding domains, ZFP- or TALE-construction strategies, genome modification methods and efficiency detection methods are provided (Figure 1). These data can help users to retrieve details about EENs, compare and choose optimal parameters, etc. Most of the utilities also provide links to the ‘TALEN/ZFN’ section to filter all EENs matching defined conditions.
For example, the page of FokI variants (Figure 2C) lists all known FokI pairs that can only work in hetero-dimers to reduce non-specificity, or that can enhance catalytic activity, or that participate in building a nickase (only cleave one strand of the target DNA) rather than a nuclease (37–39). Users can list the records of all reported EENs containing specific pairs of FokI domains from the EEN dataset via the links under each FokI variants. The sequences of variants with the mutations highlighted can also be found in this page.
Other examples of the use of the ‘Utilities’ section are the pages for repeat units of DNA-binding domains, genome modification methods and efficiency detection methods. As with the page for FokI variants, users can list the records of all EENs reported to induce HR in different organisms, or all TALENs with a particular non-standard ‘alternative’ RVD, via the corresponding links. This information may help researchers to choose an appropriate method for their own experiments. As one example, the list of all the EENs tested in zebrafish shows that NHEJ is the only type of genome modification induced by EENs in this species, except for one very recent report (84), and that the most frequently used detection methods in this species are the restriction enzyme-resistance assay (after PCR) and direct sequencing.
EEN engineering tools and resources
As described above, many EEN target prediction tools and construction resources are available for the public. The ‘Engineering Resource’ section of EENdb generates links and provides short descriptions and comments of these tools and resources, including candidate TALEN/TALE target site prediction tools, candidate ZFN/ZFP target site prediction tools, target- or off-target-site finder of given EENs, resources of EEN construction protocols and materials and other links to news sites and newsgroups. Applicable construction methods supported by any available resources are provided with links to the corresponding pages in the section of ‘Utilities’.
Access
EENdb can be freely accessed via http://eendb.zfgenetics.org/. The data of EENdb can be downloaded in tab-separated values (TSV) plain text format from the ‘Help’ section of the web site.
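A TSV download can be read with standard tooling. The Python sketch below uses the `csv` module from the standard library to group record IDs by gene. The column names `eendb_id` and `gene` are hypothetical, since the actual header of the EENdb export is not described here, and the in-memory sample reuses only the IDs and genes mentioned earlier in the text.

```python
import csv
import io
from collections import defaultdict

# Tiny stand-in for a downloaded file; the column names are assumptions.
SAMPLE_TSV = (
    "eendb_id\tgene\n"
    "TN0002\tCCR5\n"
    "TN0002B\tCCR5\n"
    "TN0031-T002\tCCR2\n"
)


def group_ids_by_gene(handle) -> dict:
    """Read tab-separated records and collect EENdb IDs under each gene."""
    reader = csv.DictReader(handle, delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row["gene"]].append(row["eendb_id"])
    return dict(groups)


if __name__ == "__main__":
    # For a real file: open(path, newline="", encoding="utf-8") instead.
    print(group_ids_by_gene(io.StringIO(SAMPLE_TSV)))
```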
EENdb welcomes researchers to submit the information of their newly constructed EENs and related experimental data by email, especially the negative data not intended for publication. The feedback form and email address can be found in the ‘Help’ section of the web site.
DISCUSSION
Continuous updates of the EENdb database will be offered. As more and more EENs are expected to be reported, EENdb will expand to offer links to show custom tracking of known EEN sites in Ensembl and UCSC Genome Browser when necessary.
EENdb has collected the references of all publications first reporting a particular ZFN or TALEN. Other references related to EENs, such as new applications of a known EEN, will also be included in future releases of EENdb.
FUNDING
The 973 program [2012CB945101, 2011CBA01000 and 2011CBA01102]; National Natural Science Foundation of China (NSFC) [31110103904 and 30730056]; National Science and Technology Infrastructure Program [2009FY120100]; 111 Project [B06001]. Funding for open access charge: 111 Project [B06001].
Conflict of interest statement. None declared.
ACKNOWLEDGEMENTS
We thank I.C. Bruce for language editing of the manuscript; P. Huang, Z. Wang, X. Tong, Y. Zu and D. Liu for helpful discussions; P. Huang, Z. Wang, Y. Shen, W. Liang, Z. Luo, Q. Wu, W. Li and D. Liu for providing the information of some of the non-functional TALENs; J. Luo for the support and discussion on issues regarding bioinformatics and database management; Y. Shen, Y. Gao and J. Zhang for lab management and for the collection, organization and maintenance of the information and materials of the ZFNs and TALENs from our lab.