NATsDB: Natural Antisense Transcripts DataBase

Tutorial of NATs Database
A detailed NATsDB tutorial document (PDF)
  • Background
    • What are NATs, cis-NATs and trans-NATs?
    • Which biological functions do NATs have?
  • Introduction
    • What is NATsDB?
    • What is NATsDB's architecture?
      1. NATsDB's construction pipeline
      2. Recent pipeline optimization
      3. Data sources used to construct NATsDB
      4. Statistical data on natural antisense transcripts we identified
    • What can NATsDB be used for and what are the main features of NATsDB?
    • What are the differences between NATsDB and other databases?
  • Using NATsDB
    • Browse and search NATsDB
    • Fine-tune NATsDB display
    • Interpret NATsDB display
      1. In the genome loci page
      2. In the sequence information page
  • Other questions
    • What is a fasta sequence?
    • How to cite NATsDB?

  • Background
    • Q:What are NATs, cis-NATs and trans-NATs?

      A:Natural Antisense Transcripts (NATs) are simply RNAs containing sequences that are complementary to other endogenous RNAs. They can be transcribed in cis from opposing DNA strands at the same genomic locus (cis-NATs), or in trans from separate loci (trans-NATs).
        Two other classes of trans-acting noncoding RNA are related to trans-NATs because they recognize their target RNAs by imprecise base-pairing: microRNAs (miRNAs), which inhibit the translation of mRNAs, and small nucleolar RNAs (snoRNAs), which guide the modification of noncoding RNAs.
        Two other categories related to cis-NATs are NOB (Non-exon-Overlapping Bidirectional) and NBD (Non-BiDirectional). All of them are depicted in the following figure:

    • Q:Which biological functions do NATs have?

      A:Pairing of NATs to sense RNAs is known to regulate the expression of many different genes in cells and their accessory elements: viruses, plasmids and transposons. In recent years, NATs have been implicated in many aspects of eukaryotic gene expression, including genomic imprinting, RNA interference, translational regulation, alternative splicing, X-inactivation and RNA editing. Moreover, there is growing evidence to suggest that NATs might have a key role in a range of human diseases. In most cases, however, the role of NATs in eukaryotic organisms is poorly understood.
          NATs as regulators range in size from very small, such as microRNAs (miRNAs), which are 21-22 nt long, to extremely large, such as the Air transcript (108 kb) associated with the imprinted Igf2r locus.
          miRNAs are the best-studied examples of antisense regulators in eukaryotes. Since 2001, hundreds of microRNAs (miRNAs) have been found in human, mouse and other organisms. Although the physiological function of most microRNAs is unknown, it is likely that most, if not all, exert their effects through base pairing with complementary target sequences.
          The role of larger NATs in the regulation of gene expression is not well established. However, several interesting examples of antisense regulators have been described. One such example involves the early-to-late phase transition during polyomavirus replication. High levels of readthrough transcription of late viral mRNA suppress the expression of complementary early mRNA. These complementary strands form a long, perfectly matched duplex structure that is extensively modified by an enzyme known as ADAR (adenosine deaminase acting on double-stranded RNA) and retained in the cell nucleus.
          Although at present there are few common themes among possible eukaryotic antisense regulators, it is useful to distinguish between cis-NATs and trans-NATs. While cis-NATs are perfectly complementary to their targets and can potentially form extended regions of perfectly paired duplex, trans-NATs show more limited, imperfectly matched base-pairing interactions with their targets. In the relatively few instances where efforts have been made to demonstrate a physiological role for base-pairing interactions between NATs and their targets, most results have remained inconclusive.

  • Introduction
    • Q:What is NATsDB?

      A:NATs database: Natural Antisense Transcripts database.
          After developing a fast, integrative pipeline that identifies cis natural antisense transcripts (cis-NATs) at genome scale using transcriptome and genome sequences from UniGene and GoldenPath, we applied the pipeline to eleven eukaryotic species, screening eight of these species for the first time and bringing the number of candidate sense-antisense (SA) pairs in human to 7,246. We constructed this free and publicly accessible database to allow researchers to query the dataset.

    • Q:What is NATsDB's architecture?

      A: The construction pipeline, data sources and final results are summarized in the following figures. For more details, please refer to our Nucleic Acids Research paper (Zhang et al., 2006).

      1. NATsDB's construction pipeline
      2. Recent pipeline optimization

        We recently optimized the methodology in two respects.

        1. Excluding unreliable GoldenPath mRNA/EST mappings more stringently
              We only kept the mappings meeting all of the following criteria: the minimum mapping length is 150 bp; the minimum identity is 96%; the minimum alignment is 97%; and the minimum coverage is 75%. For each transcript, we only retained the best mapping. We discarded a transcript if it had multiple best mappings, i.e., mappings with identical parameters. Using Entrez Gene as the cross-reference system, we dropped any mRNA/EST mapping to the somatic DNA recombination hotspots of the immunoglobulin or T-cell receptor genes listed in the international ImMunoGeneTics information system (IMGT), because it is difficult to infer the exact genomic location of these genes.
        2. Inferring the orientation of unspliced ESTs
             We re-implemented Par G. Engstrom et al.'s strategy to infer the orientation of unspliced ESTs. In addition to the evidence used in the previous pipeline, we made use of the directional annotation of ESTs, i.e., 3' sequencing or 5' sequencing. We also identified direction-reliable libraries by comparing the orientation of spliced ESTs with their direction annotation. A library was considered reliable if the proportion of correctly oriented ESTs was estimated to be more than 99% at the 99% confidence level. Par G. Engstrom et al. showed that the combination of these lines of evidence performs well in inferring the orientation of unspliced ESTs. With these updates, we retained 1,139,001 (50%) of the unspliced ESTs in human. By comparison, our previous pipeline, which considered only polyA tails and signals, tended to be excessively stringent and retained only 317,846 (14%) of the unspliced ESTs.
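
The mapping filter described in step 1 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code of the actual pipeline: the `Mapping` record, its field names and the order in which the parameters are compared to pick the "best" mapping are hypothetical. Only the four thresholds and the rule of discarding transcripts with tied best mappings come from the text above, and the IMGT exclusion step is not covered.

```python
from collections import defaultdict
from dataclasses import dataclass

# Thresholds quoted in the text above.
MIN_LENGTH_BP = 150
MIN_IDENTITY = 96.0   # percent
MIN_ALIGNMENT = 97.0  # percent
MIN_COVERAGE = 75.0   # percent


@dataclass(frozen=True)
class Mapping:
    """Hypothetical record for one transcript-to-genome mapping."""
    transcript: str
    locus: str
    length_bp: int
    identity: float
    alignment: float
    coverage: float

    def score(self):
        # Assumed ranking key: the mapping parameters themselves.
        return (self.identity, self.alignment, self.coverage, self.length_bp)


def passes_thresholds(m):
    """True if the mapping meets all four minimum criteria."""
    return (m.length_bp >= MIN_LENGTH_BP and m.identity >= MIN_IDENTITY
            and m.alignment >= MIN_ALIGNMENT and m.coverage >= MIN_COVERAGE)


def select_best_mappings(mappings):
    """Keep one best mapping per transcript; drop transcripts whose best is tied."""
    by_transcript = defaultdict(list)
    for m in mappings:
        if passes_thresholds(m):
            by_transcript[m.transcript].append(m)

    best = {}
    for transcript, candidates in by_transcript.items():
        top = max(c.score() for c in candidates)
        winners = [c for c in candidates if c.score() == top]
        if len(winners) == 1:  # identical parameters at several loci: discard
            best[transcript] = winners[0]
    return best
```

Calling `select_best_mappings` on a list of `Mapping` records returns a dictionary from transcript accession to its single retained mapping; transcripts with no passing mapping or with tied best mappings are simply absent.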

      3. Data sources used to construct NATsDB
        a. "#Sequence" is the total number of mRNAs and ESTs that can be mapped to the genome sequence and have a reliable orientation.
        b. "%Spliced-ESTs or mRNAs" is the percentage of spliced ESTs and mRNAs, i.e., the number of spliced ESTs and mRNAs divided by the total number of sequences for that species.
        c. "SA%" is the percentage of SA clusters, calculated as 2*"number of SA clusters" / (2*"number of SA clusters" + 2*"number of NOB clusters" + "number of NBD clusters").
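
As a small worked example, the "SA%" formula in item c can be written as a Python function. The function and argument names are hypothetical, and the cluster counts in the usage note are made up for illustration; they are not NATsDB statistics.

```python
def sa_percent(n_sa, n_nob, n_nbd):
    """SA% as defined in item c: 2*SA / (2*SA + 2*NOB + NBD), in percent."""
    denominator = 2 * n_sa + 2 * n_nob + n_nbd
    if denominator == 0:
        raise ValueError("no clusters to compute SA% from")
    return 100.0 * (2 * n_sa) / denominator
```

For instance, with 10 SA clusters, 5 NOB clusters and 70 NBD clusters, `sa_percent(10, 5, 70)` evaluates 100 * 20 / (20 + 10 + 70), i.e. 20.0.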

      4. Statistical data on natural antisense transcripts we identified
           The following figures summarize the statistics on SA pairs, NOB pairs and NBD sequences.

        Statistical data on SA pairs


        Statistical data on NOB pairs


        Statistical data on NBD sequences

    • Q:What can NATsDB be used for and what are the main features of NATsDB?

      A:   The NATs database can serve as a repository for current knowledge and a starting point for future experimental design or in silico data mining. New technologies such as CAGE, SAGE and genomic tiling arrays will identify more cis-NATs, as has already been shown to be the case for mouse. The EST-based identification strategy will continue to be useful because, for many species, EST data are the only source of transcriptome data available.

          NATsDB offers the following features in one unified web-based user interface:

      1. cis-NATs identified in 11 genomes including human, mouse, fly, worm, sea squirt, chicken, rat, frog, zebrafish, cow and dog, the largest collection so far. In addition, non-exon-overlapping bi-directional clusters and non-bidirectional clusters are also included.
      2. a web-based graphical interface we developed for NATsDB that shows the alignment of all sense transcripts, antisense transcripts, and genomic sequences, with hyperlinks to related databases. It also contains many features of the sense and antisense transcripts such as phastCons conservation. Sense-antisense pairs were divided into six sub-groups according to their overlapping patterns.
      3. a web-based graphical interface for browsing by species or by chromosome. It also allows users to easily select subsets of the data, such as ESTs with polyA signals and tails, or only ESTs with splice sites.

    • Q:What are the differences between NATsDB and other databases?

      A:   Besides NATsDB, there are currently only a few databases on cis-NATs, such as SADB, the Sense/Antisense Database and LEADS-Antisensor. Compared with NATsDB, three limitations restrict their usage. 1) They do not show the orientation evidence of the sequences, which is important for the analysis of SA pairs, especially those derived from ESTs. 2) Because they have not been updated, their data are relatively out of date; their data came from GenBank or FANTOM2 releases before 2004. 3) They are limited to two model organisms, human and mouse. Our NATs data, which cover more species and are updated periodically, will be valuable not only for the study of antisense regulation in the corresponding organisms but also for the screening of conserved or species-specific cis-NATs.

  • Using NATsDB
    • Q:Browse and search NATsDB

      A: There are two main ways to search for the natural antisense transcripts you are interested in.

      1. In the Browse page, you can use the drop-down lists or mouse over the figure to get information. With the drop-down lists,
        1. you should specify a type of cluster:
          • SA, Sense-Antisense pair cluster
          • NOB, Non-exon-Overlapping Bidirectional cluster
          • NBD, Non-BiDirectional cluster
        2. you can specify the "species" (seven species with complete genomes, from human to zebrafish), the classification "type" of SA pairs and the figure configuration "Height".
                  The classification "type" of SA pairs:
          • 55: head-head (divergent), SA gene pairs with the first exon of both partners involved in the overlap;
          • 33: tail-tail (convergent), SA gene pairs with the last exon of both partners involved in the overlap;
          • complete: one gene sequence completely covered by an exon of the other;
          • contained: one gene sequence completely covered by the introns and exons of the other;
          • intronic: one gene starting within an intron of the other and transcribed within and across the exons;
          • others: all other SA pairs.

          • The corresponding schema is shown in the following figure:

        3. the selection box on coding potential filters the SA pairs by the coding potential of their representative sequences. For example, coding/noncoding indicates that one gene has a CDS (coding sequence) annotation, while the other does not.
        4. if you enter 'apoptosis' in the query box, only clusters that include at least one sequence whose description contains 'apoptosis' will be displayed. In the overlap box, you can specify the minimum overlapping length for SA pairs or NOB pairs. When mousing over the figure, you can click "+" to get information.
      2. In the search page, we provide text search, chromosome location search, OMIM search (disease search) and sequence search.
        1. Text search supports Boolean mode.
              You can fill the "Gene search" text box with an Entrez Gene name, synonym or description, such as THRA, or the "Transcript search" text box with an mRNA/EST accession number or description, such as X55005.
              For a more precise and faster search, you can tick the "Name only" checkbox and choose one of the seven species in the "Species" drop-down list. With "Name only" ticked, the query is matched against gene names only, not gene descriptions. You can also choose one of the seven species in the "Species" drop-down list for "Transcript search".
             The image below shows the results of a "Gene search" or "Transcript search". Click the "ClusterID" or "Name/Accession" to go to the detailed sequence information page.
        2. Chromosome location search.
              Users can specify a genomic location and retrieve the clusters derived from that region. The UCSC-style chromosome location format is supported, for example, chr17:35471999-35510504. "chr17", "35471999" and "35510504" indicate chromosome 17, the start coordinate and the end coordinate, respectively. For the genome assembly versions used in the current NATsDB, please see the data source section.
        3. OMIM search (disease search)
              Users can specify a disease name, such as Parkinson. NATsDB will show the gene clusters, gene names, gene descriptions and cluster types related to the disease.
        4. Sequence search via BLAST.
              Enter your sequence in FASTA format into the text box, and then choose the program according to your sequence. If it is a protein sequence, the program "tblastn" is suggested; otherwise "blastn" is suggested. You can also choose one of the seven species in the second drop-down list to speed up the alignment. See an example. Click a high-scoring alignment to go to the sequence information page of that hit.
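
To make the chromosome location format used in the chromosome location search concrete, here is a minimal Python sketch that splits a string such as chr17:35471999-35510504 into its three parts. The function name and the exact set of characters accepted in the chromosome name are assumptions made for this example; it is not the parser used by NATsDB.

```python
import re

# "chr" + name, a colon, then start and end coordinates separated by a hyphen.
_LOCATION_RE = re.compile(r"(chr[0-9A-Za-z_]+):([0-9]+)-([0-9]+)")


def parse_location(text):
    """Return (chromosome, start, end) for a 'chrN:start-end' string."""
    match = _LOCATION_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a chrN:start-end location: {text!r}")
    chrom = match.group(1)
    start = int(match.group(2))
    end = int(match.group(3))
    if start > end:
        raise ValueError("start coordinate is larger than end coordinate")
    return chrom, start, end
```

With the example from the text, `parse_location("chr17:35471999-35510504")` gives the chromosome name "chr17" and the integer coordinates 35471999 and 35510504.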
    • Q:Fine-tune NATsDB display

      A: NATsDB displays a sequence information page and a genome loci page. The genome loci page can be reached through a text search, a sequence search, or by clicking "+" on a chromosome, as described above.
         The genome loci page has two main features: the annotation tracks, and a set of controls including navigation controls, display configuration buttons and display controls.

         The first time you open NATsDB, it uses the application default values to configure the annotation tracks and shows only the genome location of the cluster, the mRNAs the cluster contains and the representative sequences.
         Manipulating the navigation, configuration and display controls
         The track display controls are grouped according to the type of data in the track, e.g. isoform prediction tracks, mRNA and EST tracks.
         The track display controls use a default set of display conventions: genome location, the mRNAs the cluster contains and the representative sequences.
          To change the display mode for a track, find the track's control at the top of the genome loci page, select the desired mode from the control's display menu, and then click the GO button. These options let the user restrict the data displayed within an annotation track.
      • Changing the font size in the annotation tracks
           The annotation tracks can be displayed in a range of font sizes from "small" to "large". To change the font size, select an option from the font size pull-down menu, then click Submit. The font size is set to "small" by default.
      • Changing the width of the annotation tracks
           By default, the width of the annotation track is set to 900 pixels. Note that 900 pixels is also the minimum width. To modify the width to suit your browser, enter a new value in the image width text box, then click the GO button. For example, setting the display to 1100 pixels on a 19" monitor will increase the visible portion of the cluster and reduce the need for redraws.
      • Annotation track descriptions controls:
            Each annotation track has an associated control to hide or show it.
        1. Conservation score track:
              The "Show phastCons" control toggles the "Conservation Score" track. The conservation score of the chromosome region in this cluster is calculated by PhastCons and represents the degree of conservation of the region.
        2. Repeat regions track
              This track, associated with the "Show repeat" control, displays whether repeats exist in the chromosome region. If the region you select does not contain any repeats, nothing will be shown even if the "Show repeat" control has been ticked.
        3. CpGisland track:
              The CpGisland track, associated with the "Show CpGisland" control and located below the conservation score track, indicates that there is a CpG island in this region. Click a block in this track to go to the UCSC genome browser. If the region has no CpG island, the track will not be displayed even if the "Show CpGisland" control has been ticked.
        4. FirstExon track:
              The FirstExon track, associated with the "Show FirstEF" control, shows the Pol II promoters predicted by FirstEF. A black line stands for a promoter on the plus strand, and a red line for a promoter on the minus strand. Click it to go to UCSC.
        5. Isoform prediction tracks:
              The alternative splicing isoforms are predicted by the SVAP prediction algorithm (svap.cbi.pku.edu.cn). If you tick the "show isoform" box, all the isoforms in this cluster will be listed. To filter out single-exon ESTs, tick the "Spliced" option box. The isoforms are assembled together with all their supporting transcripts. To display the relationship between an isoform and its supporting transcripts, click the randomly generated isoform ID (the infix '.p.' or '.m.' indicates the plus strand or the minus strand, respectively), and the browser will open a new page showing the relationship.
        6. Transcript box
              If you enter an accession number in the 'Transcript' box, for example X72304, the browser will show the corresponding genomic region and all the sequences derived from this region. Note that the retrieved sequence must also meet the criteria of the other controls, such as "Subset", "Show 'hidden' transcripts", etc.
        7. Show 'hidden' transcripts controls
             Here 'hidden' transcripts are those transcripts that do not overlap any transcript on the opposite strand. Because the "Show 'hidden' transcripts" box is not ticked by default, 'hidden' transcripts are hidden.
        8. Subset controls:
             We collect orientation-reliable sequences in UniGene, including RefSeq sequences, mRNAs with CDS annotation, mRNAs without CDS annotation, spliced ESTs, ESTs with polyA tails and ESTs with polyA signals. Each kind of transcript can be shown or hidden via the corresponding box in the "Subset" controls.
              The "polyA-EST-1" control stands for ESTs with polyA tails; the orientation of these ESTs was determined to be the original orientation. The "polyA-EST-2" control stands for ESTs whose standard polyA signals agree with their direction annotations. The "EST from reliable libraries" control stands for ESTs that satisfy the conditions defined below. For each EST library, we determined the orientation of the spliced ESTs and compared it with their direction annotation, i.e., 3' sequencing or 5' sequencing. If the proportion of spliced ESTs with correct direction annotation in a library was over 99% at the 99% confidence level, the library was considered "orientation reliable" and the direction annotation of the unspliced ESTs in the library was adopted.
        9. Changing the strand in the sequence track
              Cis-NATs are RNAs containing sequences that are complementary to other endogenous RNAs transcribed from the opposite DNA strand at the same genomic locus. In a genome cluster, the sequence track can be set to display information from the "minus" strand, the "plus" strand or "both".
        10. Other controls:
             A cluster may cover a long region of the chromosome. To display a different position in the cluster, enter the new coordinates in the "start/end" text box, and then click the GO button. In addition, only representative sequences are listed in a cluster by default. Untick the "show representative" box, and all the sequences in the cluster will be shown. If the "show 'hidden' transcripts" control is ticked, the transcripts that have no complementary transcripts will also be shown.
          NOTICE:
          Because the "show representative" control has the highest priority, it must be left unticked for the other controls to take effect.
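
The "orientation reliable" library criterion described under the Subset controls (more than 99% of spliced ESTs correctly annotated, at the 99% confidence level) could be checked along the following lines. The text does not say which statistical procedure was used, so the one-sided exact binomial test below is an assumption chosen for illustration, and the function names are hypothetical.

```python
import math


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space to avoid overflow."""
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)


def is_orientation_reliable(n_correct, n_total,
                            min_proportion=0.99, confidence=0.99):
    """Assumed test: reject 'true proportion <= min_proportion' at the given level.

    n_correct: spliced ESTs whose direction annotation matches their orientation.
    n_total:   all spliced ESTs of the library that were checked.
    """
    if n_total <= 0 or not 0 <= n_correct <= n_total:
        raise ValueError("need 0 <= n_correct <= n_total and n_total > 0")
    p_value = binomial_upper_tail(n_correct, n_total, min_proportion)
    return p_value < 1.0 - confidence
```

Under this assumed test, a library needs a fairly large number of spliced ESTs before it can qualify: 100 correct out of 100 is not enough evidence, whereas 1000 correct out of 1000 is.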
    • Q:Interpret NATsDB display
      A:The content of the genomic loci page differs from that of the sequence information page.
      1. In the genomic loci page
            The genomic loci page is based on a cluster, to which we assign a unique cluster ID. The cluster ID is a unique, random number that identifies a cluster within a species, so a cluster in human has nothing to do with the cluster that has the same ID in another species.
        • Genome location track:
              The genome location track, located just above the conservation score track, provides a graphical overview of the chromosome coordinates, including an indication of the region currently displayed in the annotation tracks. Click the red description above the chromosome coordinate base line to go to the corresponding region in the UCSC genome browser.
        • Isoform tracks:
              Isoform tracks have the same fields as the "Sequence tracks" described below. We also annotate the isoforms with exon-intron structures, poly(A) tails, poly(A) signals and tissue expression patterns. The expression patterns are obtained by summing the number of member ESTs of a variant from each specific tissue, based on data parsed from BodyMap-Xs (Gupta et al., 2004b; Ogasawara et al., 2006); this is useful for distinguishing housekeeping variants from tissue-specific ones.
        • Sequence tracks:
              Sequence tracks have three fields in order: display id, sequence structure, links to detailed information page.
          1. Display id
                If you tick the box before the "display id", the sequence is selected and displayed in the refreshed page. Click the display id in the first field or the links in the last field to go to the sequence information page, which is discussed below.
          2. Sequence structure
                In the sequence structure, three rows describe a sequence. The type of transcript (for instance, mRNA with CDS, polyA-EST, etc.), the sequence's functional description, the sequence length (for instance, ~1k), the number of exons (for instance, 3 blocks) and the number of standard splice sites (for instance, IO=2) are all shown in the first row. Coding exons are represented by blocks connected by horizontal lines representing introns. The number inserted into a horizontal line is the length of that intron. The 5' and 3' untranslated regions (UTRs) are displayed as green blocks at the leading and trailing ends of the aligned regions. Arrowheads on the connecting intron lines indicate the direction of transcription. In situations where no intron is visible (e.g. single-exon genes, extremely zoomed-in displays), the arrowhead is displayed on the exon block itself. Click a coding exon block to go to the corresponding region in the UCSC genome browser. For some sequences, polyA tail lengths and potential polyA signals are shown in green. Here, a polyA tail is defined as a stretch of at least 10 As at the 3' end of a sequence, and a polyA signal is defined as the hexanucleotide 'AATAAA', 'ATTAAA', 'AATTAA', 'AATAAT', 'CATAAA' or 'AGTAAA' within the last 50 bp of the 3' end of a sequence after the polyA tail has been trimmed. A possible polyA tail or polyA signal predicted on the reverse complement strand is shown in red.
          3. Representative sequence
                Red information in the first row shows that the sequence has been chosen as the representative sequence of this cluster.
          4. links to detailed information page
                The rightmost fields of the sequence tracks show the gene name, HomoloGene (H), OMIM id (O) and sequence (S). Click the gene name, H, O or S to go to the sequence information page.
                In particular, once you are in the sequence information page, HomoloGene in the cross reference provides homologous transcripts in 11 species. Click any homologous transcript you are interested in, and you can compare the SA pairs of the former transcript with those of the latter.
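
The polyA tail and polyA signal definitions given under "Sequence structure" can be illustrated with a short Python sketch. It follows the stated definitions literally (a run of at least 10 As at the 3' end; one of six hexanucleotides within the last 50 bp after trimming the tail). The function names are hypothetical, and treating the tail as an uninterrupted run of As with no mismatches is an assumption; the actual NATsDB implementation may differ.

```python
POLYA_SIGNALS = ("AATAAA", "ATTAAA", "AATTAA", "AATAAT", "CATAAA", "AGTAAA")
MIN_TAIL_LENGTH = 10  # minimum run of As at the 3' end
SIGNAL_WINDOW = 50    # bp searched upstream of the trimmed 3' end


def polya_tail_length(seq):
    """Length of the 3' run of As, or 0 if it is shorter than the minimum."""
    seq = seq.upper()
    tail = len(seq) - len(seq.rstrip("A"))
    return tail if tail >= MIN_TAIL_LENGTH else 0


def find_polya_signals(seq):
    """Return sorted (position, hexamer) hits in the last 50 bp after tail trimming."""
    seq = seq.upper()
    trimmed = seq[: len(seq) - polya_tail_length(seq)]
    window_start = max(0, len(trimmed) - SIGNAL_WINDOW)
    window = trimmed[window_start:]
    hits = []
    for signal in POLYA_SIGNALS:
        pos = window.find(signal)
        while pos != -1:
            hits.append((window_start + pos, signal))  # 0-based position in seq
            pos = window.find(signal, pos + 1)
    return sorted(hits)
```

To look for a tail or signal on the reverse complement strand (the case shown in red), the same two functions could be applied to the reverse-complemented sequence.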
        • Unigene expression track
              We used data in BodyMap-Xs to profile the expression of transcripts in NATsDB across 13 organs, 40 tissues, and normal vs. pathological conditions.
              This track is shown only if you select EST sequences in the Sequence tracks. It contains the transcripts' expression profiles in different organs and tissues, and shows the proportion of transcripts from the plus strand to transcripts from the minus strand under different conditions.
          1. "13 organs" shows the transcripts' expression profile in different organs.
          2. "40 tissues" shows the transcripts' expression profile in 40 tissues.
          3. "Normal/Tumor" shows the transcripts' expression profile under normal and tumor conditions.
          4. "Organ/Condition" shows the transcripts' expression profile in 13 organs under normal and tumor conditions, respectively. See an example.

                The histograms above the baseline denote the transcripts expressed in 13 organs under normal conditions. The green histograms stand for transcripts from the minus strand, and the brown ones for transcripts from the plus strand. The histograms below the baseline denote the transcripts expressed under tumor conditions.
          5. "Tissue/Condition" shows the transcripts' expression profile in 40 tissues under normal and tumor conditions, respectively. The histograms above and below the baseline have the same meaning as those in "Organ/Condition".
      2. In the sequence information page
            Click the display id or the links of the sequence tracks in the genomic loci page to go to the sequence information page. General information (UniGene cluster, description, etc.), local annotation and cross references (Gene, HomoloGene, OMIM id) are given to supplement the sequence's information. The items in the cross reference are linked to NCBI Gene, UniGene, HomoloGene and other homologous sequences in our dataset.
  • Other questions:
    • Q:What is a FASTA sequence?
      A:   A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length.
      See an example:
       >BU938537 AGENCOURT_10517608 NIH_MGC_169 Mus musculus cDNA clone IMAGE:6706694 5', mRNA sequence (806 bp)
      GGGACAGGCAGTTAAGTCCCCCCAGTCTTCCAACTGTGCCTGTTTCTGCTGCCGACGAGGGAGGGGCCTCTCGGGG
      GCCTCTTGCGGCCCCTTCCATCTCCTGCGCCACCAAATGTGGGCTCTGCAGGGCAGGCGAGTGCCGACGAGGCAAC
      CTCTGGTCTCGGCTTCCCACAATCCTCCTCCTCCCCGTTAAGAGAACTTGCGTTTCTTCATGGCTTCCGCCTTGACC
      GCCAGGCTCTTGGCACAGATGGTCAGGACCACGATGAGAACACCCACCACAAACGCCACCACCGGCCAGAGGAAGCA
      GTGCAGGAGAGCAATCTCGTCGTGTGTGCGCTGTAGGAGAACGTCCTCTGGTCTGCAAGAGAGTGGAGGACAGGAAA
      GGAAGACCAGTGAGTCTCTCAGCCCAGCACATTCTACACAACCTGCAGACGACAAGTGCTCTCCGGCGTAGGGAATA
      ATCACCCTTGAAAGTGATGACTCTGATTGCAGACACCGTCCCAGACTGCTCCACACCCATTTCCAAAAGCCGTGAGCT
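
As an illustration of the format described above, the following minimal Python sketch reads FASTA text into (description, sequence) pairs. The function name is hypothetical, and the sketch is meant only to show how description lines and sequence lines are told apart; it is not the parser used by NATsDB.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into a list of (description, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data before the first '>' description line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Applied to the example above, the function would return a single record whose description starts with "BU938537" and whose sequence is the concatenation of the seven sequence lines.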
      
    • Q:How to cite NATsDB?
      A:   Please cite the following article:
      Yong Zhang, XS Liu, Qing-Rong Liu and Liping Wei. Genome-wide in silico identification and analysis of cis-natural antisense transcripts (cis-NATs) in ten species. Nucleic Acids Res., 2006, 34: 3465-3475.
© Center for Bioinformatics, Peking University