- Background
- What are NATs, cis-NATs and trans-NATs?
- Which biological functions do NATs have?
- Introduction
- What is NATsDB?
- What is NATsDB's architecture?
- NATsDB's construction pipeline
- Recent pipeline optimization
- Data sources we used to construct NATsDB
- Statistical data on natural antisense transcripts we identified
- What can NATsDB be used for and what are the main features of NATsDB?
- What are the differences between NATsDB and other databases?
- Using NATsDB
- Browse and search NATsDB
- Fine-tune NATsDB display
- Interpret NATsDB display
- In the genomic loci page
- In the sequence information page
- Other questions
- What is a FASTA sequence?
- How to cite NATsDB?
- Background
- Q: What are NATs, cis-NATs and trans-NATs?
A: Natural Antisense Transcripts
(NATs) are simply RNAs containing sequences that are complementary
to other endogenous RNAs. They can be transcribed in cis
from opposing DNA strands at the same genomic locus (cis-NATs),
or in trans from separate loci (trans-NATs).
Two other classes of trans-acting noncoding RNA
are related to trans-NATs because they recognize their
target RNAs by imprecise base-pairing: microRNAs (miRNAs), which
inhibit the translation of mRNAs, and small nucleolar RNAs (snoRNAs),
which guide the modification of noncoding RNAs.
Two other categories related to cis-NATs are NOB
(Non-exon-Overlapping Bidirectional) and NBD (Non-BiDirectional).
All of them are depicted in the following figure:
- Q:Which biological functions do NATs have?
A:Pairing of NATs to sense RNAs is known to regulate expression
of many different genes in cells and their accessory elements:
viruses, plasmids and transposons. In recent years, NATs have
been implicated in many aspects of eukaryotic gene expression
including genomic imprinting, RNA interference, translational
regulation, alternative splicing, X-inactivation and RNA editing.
Moreover, there is growing evidence to suggest that NATs might
have a key role in a range of human diseases. In most cases, however,
the role of NATs in eukaryotic organisms is poorly understood.
NATs as regulators range in size from very small,
such as microRNAs (miRNAs), which are 21-22 nt, to extremely large,
such as the Air transcript (108 kb) associated with the imprinted Igf2r
locus.
miRNAs are the best-studied examples of antisense
regulators in eukaryotes. Since 2001, hundreds of microRNAs (miRNAs)
have been found in human, mouse and other organisms. Although
the physiological function of most microRNAs is unknown, it is
likely that most, if not all, exert their effects through base
pairing with complementary target sequences.
The role of larger NATs in the regulation of gene
expression is not well established. However, several interesting
examples of antisense regulators have been described. One such
example involves the early to late phase transition during polyomavirus
replication. High levels of readthrough transcription of late
viral mRNA suppress expression of complementary early mRNA. These
complementary strands form a long, perfectly matched duplex structure
that is extensively modified by an enzyme known as ADAR (adenosine
deaminase acting on RNA) and retained in the cell
nucleus.
Although at present there are few common themes
among possible eukaryotic antisense regulators, it is useful to
distinguish between cis-NATs and trans-NATs.
While cis-NATs are perfectly complementary to their targets,
and can potentially form extended regions of perfectly paired
duplex, trans-NATs manifest more limited, imperfectly
matched base-pairing interactions with their targets. In the relatively
few instances where efforts have been made to demonstrate a physiological
role for base-pairing interactions between NATs and targets, most
results have remained inconclusive.
- Introduction
- Q:What is NATsDB?
A: NATsDB is the Natural Antisense
Transcripts database.
We developed a fast, integrative pipeline to
identify cis natural antisense transcripts (cis-NATs)
at genome scale using transcriptome and genome sequences from
UniGene and GoldenPath. We applied the pipeline to identify cis-NATs
in eleven eukaryotic species, screening eight of these species
for the first time and bringing the number of candidate sense-antisense (SA) pairs
in human to 7,246. We constructed this free and publicly accessible
database to allow researchers to query the dataset.
- Q:What is NATsDB's architecture?
A: The construction pipeline, data source and final results
are summarized in the following figures. For more details, please
refer to our Nucleic Acids Research paper (Zhang et al.,
2006).
- NATsDB's construction pipeline
- Recent pipeline optimization
We recently optimized the methodology in two respects.
- Excluding unreliable GoldenPath mRNA/EST mappings
more stringently
We kept only the mappings meeting all of the
following criteria: the minimum mapping length is 150 bp;
the minimum identity is 96%; the minimum alignment is 97%;
and the minimum coverage is 75%. For each transcript, we
retained only the best mapping. We discarded the transcript
if it had multiple best mappings, i.e., mappings with identical
parameters. Using Entrez Gene as the cross-reference system,
we dropped any mRNA/EST mapping to somatic DNA recombination
hotspots of the immunoglobulin or T-cell receptor genes in the
international ImMunoGeneTics information system (IMGT), because
it is difficult to infer the exact genomic location of
these genes.
- Inferring the orientation of unspliced ESTs
We re-implemented the strategy of Par G. Engstrom et al.
to infer the orientation of unspliced ESTs. In addition to
the evidence used in the previous pipeline, we made use of
the directional annotation of ESTs, i.e., 3' sequencing or 5' sequencing.
We also identified direction-reliable libraries by comparing
the orientation of spliced ESTs with their direction
annotation. A library was considered reliable if the proportion
of correctly oriented ESTs was estimated to be more than
99% at the 99% confidence level. Par G. Engstrom et al.
showed that combining these lines of evidence performs well for
inferring the orientation of unspliced ESTs. With these updates,
we retained 1,139,001 (50%) of the unspliced ESTs in
human. By comparison, our previous pipeline, which considered only
the polyA tail and signal, tended to be excessively stringent
and retained only 317,846 (14%) of the unspliced ESTs.
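The mapping filter described in the first optimization above can be sketched as follows. This is only an illustration under stated assumptions: the `Mapping` record layout, the function names and the way mappings are ranked (`score`) are hypothetical, since the text gives the thresholds but not the ranking rule or the data format. The IMGT exclusion step is not covered.

```python
from collections import defaultdict
from dataclasses import dataclass

# Thresholds quoted in the text.
MIN_LENGTH_BP = 150
MIN_IDENTITY = 96.0   # percent
MIN_ALIGNMENT = 97.0  # percent
MIN_COVERAGE = 75.0   # percent


@dataclass(frozen=True)
class Mapping:
    """One genome mapping of a transcript (hypothetical record layout)."""
    transcript: str
    locus: str
    length_bp: int
    identity: float
    alignment: float
    coverage: float


def passes_thresholds(m):
    """True if the mapping meets all four minimum criteria."""
    return (m.length_bp >= MIN_LENGTH_BP and m.identity >= MIN_IDENTITY
            and m.alignment >= MIN_ALIGNMENT and m.coverage >= MIN_COVERAGE)


def score(m):
    # Assumption: mappings are ranked by this tuple of parameters.
    return (m.identity, m.alignment, m.coverage, m.length_bp)


def select_reliable_mappings(mappings):
    """Keep the single best mapping per transcript; drop transcripts with ties."""
    by_transcript = defaultdict(list)
    for m in mappings:
        if passes_thresholds(m):
            by_transcript[m.transcript].append(m)
    kept = {}
    for transcript, candidates in by_transcript.items():
        best = max(score(m) for m in candidates)
        winners = [m for m in candidates if score(m) == best]
        if len(winners) == 1:  # identical best mappings are ambiguous
            kept[transcript] = winners[0]
    return kept


# Illustrative records, not real data.
example = [
    Mapping("A", "chr1:100", 500, 99.0, 98.0, 90.0),
    Mapping("A", "chr2:200", 500, 97.0, 98.0, 90.0),
    Mapping("B", "chr3:1", 500, 99.0, 98.0, 90.0),
    Mapping("B", "chr4:1", 500, 99.0, 98.0, 90.0),  # tie with the previous one
    Mapping("C", "chr5:1", 100, 99.0, 98.0, 90.0),  # too short
]
reliable = select_reliable_mappings(example)
```

In this made-up example only transcript "A" survives: "B" has two identical best mappings and "C" is shorter than 150 bp.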
- Data sources we used to construct
NATsDB

a. "#Sequence" is the total number of mRNAs and ESTs that can
be mapped to genome sequences and have reliable orientation.
b. "%Spliced-ESTs or mRNAs" is the percentage of spliced ESTs and
mRNAs, i.e., the number of spliced ESTs and mRNAs divided
by the total number of sequences for that species.
c. "SA%" is the percentage of SA clusters: 2*"number of SA
clusters" / (2*"number of SA clusters" + 2*"number
of NOB clusters" + "number of NBD clusters").
- Statistical data on natural antisense transcripts we identified
The following figures summarize statistics on SA pairs,
NOB pairs and NBD sequences.
Statistical data on SA pairs

Statistical data on NOB pairs

Statistical data on NBD sequences

- Q:What can NATsDB be used for and what are the main features
of NATsDB?
A: NATsDB can serve as a repository
for current knowledge and a starting point for future experimental
design or in silico data mining. New technologies such as CAGE,
SAGE, and genomic tiling arrays will identify more cis-NATs,
as has already been shown to be the case for mouse. The EST-based
identification strategy will continue to be useful because, for
many species, EST data are the only source of transcriptome data
available.
NATsDB offers the following features in one unified
web-based user interface:
- cis-NATs identified in 11 genomes including human,
mouse, fly, worm, sea squirt, chicken, rat, frog, zebrafish,
cow and dog, the largest collection so far. In addition, non-exon-overlapping
bi-directional clusters and non-bidirectional clusters are also
included.
- a web-based graphical interface we developed for NATsDB that
shows the alignment of all sense transcripts, antisense transcripts,
and genomic sequences, with hyperlinks to related databases.
It also contains many features of the sense and antisense transcripts
such as phastCons conservation. Sense-antisense pairs were divided
into six sub-groups according to their overlapping patterns.
- a web-based graphical interface for browsing by species or
by chromosomes. It also allows users to easily select subsets
of the data such as ESTs with polyA signals and tails or only
ESTs with splicing sites.
- Q: What are the differences between NATsDB and other databases?
A: Currently,
there are only a few databases on cis-NATs besides NATsDB,
such as SADB, Sense/Antisense
Database and LEADS-Antisensor.
Compared with NATsDB, three limitations restrict their
usage. 1) They do not show the orientation evidence for the sequences,
which is important for the analysis of SA pairs, especially those from
ESTs. 2) For lack of updates, their data are relatively out
of date: they came from GenBank or FANTOM2 releases
before 2004. 3) They are limited to two model organisms, human
and mouse. NATsDB data, covering more species with periodic
updates, will be valuable not only for the study of antisense regulation
in the corresponding organisms but also for screening for conserved
or species-specific cis-NATs.
- Using NATsDB
- Q:Browse and search NATsDB
A: There are two main approaches to searching for the natural antisense
transcripts you are interested in.
- In the Browse page, you can use the drop-down lists or mouse over
the figure to get information.
With the drop-down lists,
- you should specify a type of cluster:
- SA, Sense-Antisense pair cluster
- NOB, Non-exon-Overlapping Bidirectional cluster
- NBD, Non-BiDirectional cluster
- you can specify the "species" (seven species with complete genomes,
from human to zebrafish), the classification "type"
of SA pairs and the figure configuration "Height".
The
classification "type" of SA pairs:
- 55: head-head (divergent), SA gene pairs with the first
exon of both partners involved in the overlap;
- 33: tail-tail (convergent), SA gene pairs with the last
exon of both partners involved in the overlap;
- complete: one gene sequence completely covered by
an exon of the other;
- contained: one gene sequence completely covered by
the intron and exon of the other;
- intronic: one gene starting within an intron of the
other and transcribing within and across the exons;
- others: all other SA pairs.
The corresponding schema is shown in the following figure:
- the selection box on coding potential
filters SA pairs by the coding potential
of their representative sequences. For example, coding/noncoding
indicates that one gene has a CDS (coding sequence) annotation,
while the other does not.
- if you enter 'apoptosis' in the
query box, only clusters including at least one sequence
whose description contains 'apoptosis' will be displayed.
In the overlap box, you can specify the minimum overlapping
length for SA pairs or NOB pairs. When mousing over the
figure, you can click "+" to get information.
- In the Search page, we provide text search, chromosome location
search, OMIM search (disease search) and sequence search.

- Text search supports Boolean mode.
You can fill the "Gene search" text box with
an Entrez Gene name, synonym or description, such as THRA,
or the "Transcript search" text box with an mRNA/EST accession
number or description, such as X55005.
For a more exact and faster search,
you can tick the "Name only" checkbox and choose
one of seven species in the "Species" drop-down list. With
"Name only" ticked, the query is matched against gene names
rather than gene descriptions. You can also choose
one of seven species in the "Species" drop-down list of the "Transcript
search" text box.
The image below shows the results of a "Gene search"
or "Transcript search". Click the "ClusterID" or "Name/Accession"
to go to the detailed sequence information page.
- Chromosome location search.
You can specify a genomic location
and retrieve the clusters derived from this region. The UCSC-like
chromosome location format is supported, for example, chr17:35471999-35510504.
"chr17", "35471999" and "35510504" indicate chromosome 17,
the start coordinate and the end coordinate,
respectively. For the chromosome versions used in the current
NATsDB, please see the data source section.
- OMIM search (disease search)
You can specify a disease name,
such as Parkinson. NATsDB will show the gene clusters, gene
names, gene descriptions and cluster types that are related
to the disease.
- Sequence search via BLAST.
Enter your sequence in FASTA format into the
text box, and then choose the program that suits your sequence.
For a protein sequence, "tblastn" is suggested;
otherwise "blastn" is suggested. You can also choose one
of seven species in the second drop-down list to speed up
the alignment. See an example. Click a high-scoring alignment
to go to the sequence information page of that hit.
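For readers who prepare queries programmatically, the UCSC-like location format used by the chromosome location search (for example chr17:35471999-35510504) can be parsed as sketched below. This is a sketch, not NATsDB's own code: the names `Region`, `parse_location` and `overlaps` are hypothetical, and the accepted chromosome-name pattern and the tolerance for thousands separators are assumptions.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


# Assumption: chromosome names look like "chr" followed by letters, digits or "_".
_LOCATION = re.compile(r"^(chr[0-9A-Za-z_]+):(\d+)-(\d+)$")


def parse_location(text):
    """Parse a 'chrN:start-end' string into a Region."""
    cleaned = text.strip().replace(",", "")  # assumption: tolerate 35,471,999
    match = _LOCATION.match(cleaned)
    if match is None:
        raise ValueError(f"not a chrN:start-end location: {text!r}")
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("start coordinate is greater than end coordinate")
    return Region(chrom, start, end)


def overlaps(a, b):
    """True if two regions lie on the same chromosome and share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


query = parse_location("chr17:35471999-35510504")
```

A cluster whose genomic span satisfies `overlaps(cluster_region, query)` would be a candidate hit under this reading of "clusters derived from this region".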
- Q: Fine-tune NATsDB display
A: NATsDB displays a sequence information page and a genomic
loci page. A genomic loci page can be reached through a text search,
a sequence search, or by clicking "+" on a chromosome, as described
above.
The genomic loci page has two main features: the annotation
tracks and a set of controls, including navigation controls, display
configuration buttons and display controls.
The first time you open NATsDB, it uses the
application default values to configure the annotation tracks and
shows only the genome location of the cluster, the mRNAs the cluster contains
and the representative sequences.
Manipulating the navigation,
configuration and display controls
The track display controls are grouped by
the type of data in the track, e.g. isoform prediction tracks,
mRNA and EST tracks.
By default, the track display controls show
the genome location, the mRNAs this cluster contains and the representative
sequences.
To change the display mode for a track, find the track's
control at the top of the genomic loci page, select the desired mode
from the control's display menu, and then click the GO button. These
options let you restrict the data displayed within an annotation
track.
- Changing the font size in the annotation tracks
The annotation tracks can be displayed
in a range of font sizes from "small" to "large". To change the font
size, select an option from the font size pull-down menu, then
click Submit. The font size is set to "small" by default.
- Changing the width of the annotation tracks
By default, the width of the annotation track is
set to 900 pixels, which is also the minimum
width. To modify the width to suit your browser, enter
a new value in the image width text box, then click the GO
button. For example, setting the display to 1100 pixels on a
19" monitor will increase the visible portion of the cluster
and reduce the need for redraws.
- Annotation track descriptions controls:
Each annotation track has an associated control
to make it hidden or shown.
- Conservation score track:
The "Show phastCons" control toggles the "Conservation
Score" track. Conservation of the chromosome region in this cluster is calculated
by phastCons, and the "Conservation Score" represents the
degree of conservation of the region.
- Repeat regions track
This track, associated with the "Show repeat"
control, displays the repeats that exist in the chromosome
region. If the region you select does not contain any
repeats, nothing will be shown, even if the "Show repeat"
control has been ticked.
- CpG island track:
The CpG island track, associated with the "Show CpGisland"
control and located below the conservation score track,
indicates the CpG islands in this region. Click a block in this track
to go to the UCSC Genome Browser. If the region has no CpG island,
the track will not be displayed, even if the "Show CpGisland" control has been ticked.
- FirstExon track:
The FirstExon track, associated with the "Show FirstEF" control,
shows Pol II promoters predicted by FirstEF.
A black line stands for a promoter on the plus strand, and a red line for a promoter on the
minus strand. Click a line to go to the UCSC Genome Browser.
- Isoform prediction tracks:
Alternative splicing isoforms are predicted
by the SVAP prediction algorithm (svap.cbi.pku.edu.cn). If you tick the "show isoform" box, all the isoforms
in this cluster will be listed. To filter out single-exon ESTs, tick the "Spliced" option box.
The isoforms are assembled together with all their supporting transcripts. To display the relationship
between the isoforms and their supporting transcripts, click a randomly generated isoform
ID (the mid-prefix '.p.' or '.m.' indicates the plus strand or the minus strand, respectively);
the browser will open a new page showing the relationship.
- Transcript box
If you enter an accession number in the
'Transcript' box, for example, X72304, the browser will
show the corresponding genomic region and all the sequences
derived from this region. Note that the
retrieved sequences must also meet the criteria of the other controls,
such as "Subset", "Show 'hidden' transcripts", etc.
- Show 'hidden' transcripts controls
Here 'hidden' transcripts are
transcripts that do not overlap any
transcript on the opposite strand. Because the "Show
'hidden' transcripts" box is not ticked by default, 'hidden'
transcripts are hidden.
- Subset controls:
We collect orientation-reliable sequences in
UniGene, including RefSeq sequences, mRNAs with CDS annotation,
mRNAs without CDS annotation, spliced ESTs, ESTs with polyA
tails and ESTs with polyA signals. Each kind of transcript
can be shown or hidden via the corresponding box in
the "Subset" controls.
The "polyA-EST-1" control stands for ESTs with
polyA tails; the orientation of these ESTs was taken to
be the original orientation. The "polyA-EST-2" control stands
for ESTs whose standard polyA signals agree with their
direction annotations. The "EST from reliable libraries" control
stands for ESTs that satisfy the conditions defined
below. For each EST library, we determined the orientation
of spliced ESTs and compared it with their direction annotation,
i.e., 3' sequencing or 5' sequencing. If the proportion
of spliced ESTs with correct direction annotation in a library
was over 99% at the 99% confidence level, the library was
considered "orientation reliable" and the direction annotation
of the unspliced ESTs in the library was adopted.
- Changing the strand in the sequence track
Cis-NATs are RNAs containing
sequences that are complementary to other endogenous RNAs
transcribed from the opposite DNA strand at the same genomic locus. In a
genome cluster, the sequence track can be set to display
the "minus" strand, the "plus" strand or "both".
- Other controls:
A cluster may cover a long region of the chromosome. To
display a different position in the cluster,
enter the new coordinates in the "start/end" text box, and then
click the GO button. By default, only representative
sequences are listed in a cluster. Untick the "show representative"
box, and all the sequences in the cluster will be shown.
If the "show 'hidden' transcripts" control is ticked, the
transcripts that have no complementary transcripts will also
be shown.
NOTICE:
Because the "show representative" control has
the highest priority, it must be unticked
for the other controls to take effect.
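The "orientation reliable" library criterion described under the "Subset" controls (a proportion of correctly oriented spliced ESTs above 99% at the 99% confidence level) can be illustrated as follows. The text does not say which statistical procedure was used; this sketch assumes a one-sided Wilson score lower bound, and the function names are hypothetical.

```python
import math

# Standard normal quantile for a one-sided 99% confidence level.
Z_ONE_SIDED_99 = 2.3263478740408408


def wilson_lower_bound(correct, total, z=Z_ONE_SIDED_99):
    """Lower Wilson score bound for the proportion correct/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    z2 = z * z
    centre = p + z2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total))
    return (centre - margin) / (1 + z2 / total)


def is_orientation_reliable(correct, total, threshold=0.99):
    """True if the lower confidence bound on the proportion exceeds the threshold."""
    return wilson_lower_bound(correct, total) > threshold


# Illustrative counts: (correctly annotated spliced ESTs, all spliced ESTs).
small_library = is_orientation_reliable(100, 100)
large_library = is_orientation_reliable(9990, 10000)
```

Under this assumption a library needs many spliced ESTs to qualify: 100 correct out of 100 is not enough evidence, whereas 9,990 out of 10,000 is.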
- Q: Interpret NATsDB display
A: The content of the genomic loci page differs from
that of the sequence information page.
- In the genomic loci page
The genomic loci page is built around a cluster,
to which we assigned a unique cluster ID. The cluster ID is a unique,
randomly assigned number that identifies a cluster within a species, so a cluster in human
has nothing to do with the cluster of the same ID in another species.
- Genome location track:
The genome location track, located just above
the conservation score track, provides a graphical overview
of the chromosome coordinates, including an indication of the
region currently displayed in the annotation tracks. Click
the red description above the chromosome coordinate base
line to go to the corresponding
region in the UCSC Genome Browser.
- Isoform tracks:
Isoform tracks have the same fields as the "Sequence tracks" described below.
We also annotate the isoforms with exon-intron structures, poly(A) tails, poly(A) signals and tissue expression patterns.
The expression patterns are obtained by summing the number of member ESTs of a variant from each specific tissue, based on data parsed from BodyMap-Xs (Gupta, et al., 2004b; Ogasawara, et al., 2006);
they are useful for distinguishing housekeeping variants from tissue-specific ones.
- Sequence tracks:
Sequence tracks have three fields, in order:
display id, sequence structure, and links to the detailed information
page.
- Display id
If you tick the box before the "display
id", the selected sequences are displayed in the refreshed
page. Click the display id in the first field, and
your browser will go to the sequence information page,
which is discussed below.
- Sequence structure
In the sequence structure, three rows
describe a sequence. The type of transcript (for instance,
mRNA with CDS, polyA-EST, etc.), the functional
description of the sequence, the sequence length (for instance, ~1k), the number
of exons (for instance, 3 blocks) and the number of standard splicing
sites (for instance, IO=2) are all shown in the first
row. Coding exons are represented by blocks connected
by horizontal lines representing introns. The number
inserted into a horizontal line gives the length of the
intron. The 5' and 3' untranslated regions (UTRs) are
displayed as green blocks on the leading and trailing
ends of the aligned regions. Arrowheads on the connecting
intron lines indicate the direction of transcription.
In situations where no intron is visible (e.g. single-exon
genes, extremely zoomed-in displays), the arrowhead
is displayed on the exon block itself. Click a coding
exon block to go to
the corresponding region in the UCSC Genome Browser. For
some sequences, polyA tail lengths and potential
polyA signals are shown in green. Here, a polyA tail is
defined as a stretch of at least 10 As at the 3' end of
a sequence, and a polyA signal is defined as the hexanucleotide
'AATAAA', 'ATTAAA', 'AATTAA', 'AATAAT', 'CATAAA' or
'AGTAAA' within the last 50 bp of the 3' end of a sequence
after the polyA tail has been trimmed. A possible polyA tail
or polyA signal predicted on the reverse complement
strand is shown in red.
- Representative sequence
Red text in the first row indicates that
the sequence has been chosen as the representative sequence
of this cluster.
- links to detailed information page
The rightmost fields of the sequence tracks show
the gene name, HomoloGene (H), OMIM id (O)
and sequence (S). Click the gene name, H, O or
S to go to the sequence information
page.
In the sequence
information page, HomoloGene in the cross reference lists
the homologous transcripts in 11 species. Click any homologous
transcript you are interested in to compare
the SA pairs of the original transcript with those of the
homologous transcript.
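The polyA tail and polyA signal definitions given under "Sequence structure" can be sketched as below. This is an illustration, not the NATsDB implementation: the function names are hypothetical, and the tail is assumed to be an uninterrupted run of As ending exactly at the 3' end.

```python
import re

# Hexanucleotides and limits quoted in the text.
POLYA_SIGNALS = ("AATAAA", "ATTAAA", "AATTAA", "AATAAT", "CATAAA", "AGTAAA")
MIN_TAIL_LENGTH = 10
SIGNAL_WINDOW = 50

_TAIL = re.compile(r"A+$")
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def polya_tail_length(seq):
    """Length of the 3' run of As, or 0 if it is shorter than 10."""
    match = _TAIL.search(seq.upper())
    length = len(match.group(0)) if match else 0
    return length if length >= MIN_TAIL_LENGTH else 0


def find_polya_signals(seq):
    """Return (position, hexamer) hits in the last 50 bp after tail trimming."""
    seq = seq.upper()
    trimmed = seq[:len(seq) - polya_tail_length(seq)]
    offset = max(0, len(trimmed) - SIGNAL_WINDOW)
    window = trimmed[offset:]
    hits = []
    for i in range(len(window) - 5):
        hexamer = window[i:i + 6]
        if hexamer in POLYA_SIGNALS:
            hits.append((offset + i, hexamer))
    return hits


def reverse_complement(seq):
    """Reverse complement, for checking the opposite strand."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


# Made-up sequence: filler, one signal, a short spacer and a 12-base tail.
example_seq = "GCGC" * 10 + "AATAAA" + "GCTGCTGCTGCTGCT" + "A" * 12
example_tail = polya_tail_length(example_seq)
example_hits = find_polya_signals(example_seq)
```

Running the same two functions on `reverse_complement(seq)` corresponds to the check on the reverse complement strand that the display shows in red.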
- Unigene expression track
We used data in BodyMap-Xs to profile the
expression of transcripts in NATsDB across 13 organs, 40
tissues, and normal vs. pathological conditions.
This track is shown only if you choose EST sequences from the Sequence
tracks. It contains the
expression profiles of transcripts in different organs and tissues, and
shows the proportion of transcripts from the plus strand to
transcripts from the minus strand under different conditions.
- "13 organs" shows the expression profile of transcripts
in 13 organs.
- "40 tissues" shows the expression profile of transcripts
in 40 tissues.
- "Normal/Tumor" shows the expression profile of transcripts
under normal and tumor conditions.
- "Organ/Condition" shows the expression
profile of transcripts in 13 organs under normal and
tumor conditions, respectively. See an example.
The histograms above the baseline denote
the transcripts expressed in 13 organs under the normal
condition. The green histograms stand for transcripts
from the minus strand, and the brown ones for transcripts from the plus
strand. The histograms below the baseline denote the
transcripts expressed under the tumor condition.
- "Tissue/Condition" shows the expression
profile of transcripts in 40 tissues under normal and
tumor conditions, respectively. The histograms
above and below the baseline have the same meaning
as in "Organ/Condition".
- In the sequence information page
Click the display id or the links of the sequence
tracks in the genomic loci page to go
to the sequence information page. General information (UniGene
cluster, description, etc.), local annotation and cross references (Gene,
HomoloGene, OMIM id) are given to supplement the sequence information.
The items in the cross reference are linked to NCBI Gene, UniGene,
HomoloGene and other homologous sequences in our dataset.
- Other questions:
- Q:What is a FASTA sequence?
A: A sequence in FASTA format begins with a single-line
description, followed by lines of sequence data. The description
line is distinguished from the sequence data by a greater-than (">")
symbol in the first column. It is recommended that all lines of
text be shorter than 80 characters in length.
See an example:
>BU938537 AGENCOURT_10517608 NIH_MGC_169 Mus musculus cDNA clone IMAGE:6706694 5', mRNA sequence (806 bp)
GGGACAGGCAGTTAAGTCCCCCCAGTCTTCCAACTGTGCCTGTTTCTGCTGCCGACGAGGGAGGGGCCTCTCGGGG
GCCTCTTGCGGCCCCTTCCATCTCCTGCGCCACCAAATGTGGGCTCTGCAGGGCAGGCGAGTGCCGACGAGGCAAC
CTCTGGTCTCGGCTTCCCACAATCCTCCTCCTCCCCGTTAAGAGAACTTGCGTTTCTTCATGGCTTCCGCCTTGACC
GCCAGGCTCTTGGCACAGATGGTCAGGACCACGATGAGAACACCCACCACAAACGCCACCACCGGCCAGAGGAAGCA
GTGCAGGAGAGCAATCTCGTCGTGTGTGCGCTGTAGGAGAACGTCCTCTGGTCTGCAAGAGAGTGGAGGACAGGAAA
GGAAGACCAGTGAGTCTCTCAGCCCAGCACATTCTACACAACCTGCAGACGACAAGTGCTCTCCGGCGTAGGGAATA
ATCACCCTTGAAAGTGATGACTCTGATTGCAGACACCGTCCCAGACTGCTCCACACCCATTTCCAAAAGCCGTGAGCT
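The FASTA layout described above (a ">" description line followed by lines of sequence data) can be read with a few lines of code. This is a minimal sketch with a hypothetical function name; it assumes plain text input and does no validation of the sequence alphabet.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    header = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before a '>' description line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Short made-up record, wrapped over two sequence lines.
example_records = parse_fasta(">seq1 example record\nGGGACAGG\nCAGTTAAG\n")
```

The sequence lines are simply concatenated, so wrapping at 80 characters or less does not change the parsed result.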
- Q:How to cite NATsDB?
A: Please cite the following article:
Yong Zhang, XS Liu, Qing-Rong Liu and Liping Wei. Genome-wide in
silico identification and analysis of cis-natural antisense
transcripts (cis-NATs) in ten species. Nucleic Acids Res., 2006,
34: 3465-3475.