Abstract

We provide a novel web service, called rQuant.web, allowing convenient access to tools for quantitative analysis of RNA sequencing data. The underlying quantitation technique rQuant is based on quadratic programming and estimates different biases induced by library preparation, sequencing and read mapping. It can tackle multiple transcripts per gene locus and is therefore particularly well suited to quantify alternative transcripts. rQuant.web is available as a tool in a Galaxy installation at http://galaxy.fml.mpg.de. Using rQuant.web is free of charge, it is open to all users, and there is no login requirement.

INTRODUCTION

The current revolution in sequencing technologies allows us to obtain a much more detailed picture of transcriptomes via RNA-Sequencing (RNA-Seq) (1–3). Studying them under different conditions or in mutants will lead to a considerably improved understanding of the underlying mechanisms of gene expression and processing. An important prerequisite is to be able to accurately determine the full complement of RNA transcripts and to infer their abundance in the cell. Due to the various limitations and biases in next-generation sequencing (NGS) technologies, such analyses are less straightforward than one might na�vely expect. In particular, for accurate quantitation of RNA transcripts it appears essential to take the contribution of such biases and other technical limitations into account. rQuant implements this idea by simultaneously estimating the effect of biases as well as the abundance of RNA transcripts (4,5).

For the study of transcriptomes, currently available techniques rely on converting the RNA molecules in the sample into cDNA fragments prior to sequencing. A large portion of the observed distortions arise during cDNA library preparation, dependending on the used protocol (crucial factors are, for instance, priming, fragmentation and size selection) (3 and references therein). As a result, the reads are non-uniformly distributed along the transcript, influenced by the length of the transcript and the distance to the transcript boundaries (cf. Figure 1a). Moreover, we find that the read coverage also heavily depends on the sequence context of the fragments (6,7). We have observed that mono-nucleotides as well as di-nucleotides do not appear at the same frequency along the read, in particular at the 5′-end of the read (5). This is consistent with similar observations for small RNA expression profiling (8).

Transcript profiles: (a) Normalized read coverage with respect to the relative transcript position is shown grouped by five different transcript length bins for the C. elegans SRX001872 data set (16); (b) The key component of rQuant is to infer the underlying read coverage of all transcripts at one gene locus (two transcripts in this illustration on the right: transcript 1 is shown in orange and transcript 2 in green), such that the differences between the observed (grey) and expected (blue) read coverage is minimized. The expected read coverage is inferred from the transcript abundances w1 and w2 and the transcript profiles (shown in the graphs on the right), which are inferred simultaneously for several loci.
Figure 1.

Transcript profiles: (a) Normalized read coverage with respect to the relative transcript position is shown grouped by five different transcript length bins for the C. elegans SRX001872 data set (16); (b) The key component of rQuant is to infer the underlying read coverage of all transcripts at one gene locus (two transcripts in this illustration on the right: transcript 1 is shown in orange and transcript 2 in green), such that the differences between the observed (grey) and expected (blue) read coverage is minimized. The expected read coverage is inferred from the transcript abundances w1 and w2 and the transcript profiles (shown in the graphs on the right), which are inferred simultaneously for several loci.

Biases are also induced by data processing, for instance, when mapping the sequence reads to a reference genome. Depending on how well the mapping method can align spliced reads, the read coverage in proximity of splice junctions typically drops compared to other exonic regions.

Most of these biases can be estimated relatively easily when exactly one transcript per locus is expressed. If, however, one considers highly complex transcriptomes, such as the one in humans, the inference is considerably more difficult (5) and methods like rQuant are needed to simultaneously estimate biases and transcript abundances.

There exist a few other quantitation techniques for RNA-Seq experiments (9–12). Most of these approaches do not take the biases into account and also differ in their approach estimating abundances. Moreover, to the best of our knowledge, rQuant.web is currently the only web service that allows the estimation of transcript abundances from RNA-Seq experiments. This service is embedded into a suite of tools for aligning reads using a combination of PALMapper (13, 14, G. Jean et al., submitted for publication), infering new transcripts via mTiM and mGene (15), and also for testing significance between samples (currently under development).

METHODS: rQuant

To infer the abundance of given transcripts from RNA-Seq data, we previously developed a technique, called rQuant, which is based on solving quadratic programming problems (4,5). Given an annotation of (alternative) transcripts and a set of reads that have been aligned to the reference genome, rQuant infers the abundance of each annotated transcript by minimizing the deviation of the observed from the expected read coverage at each covered nucleotide (cf. Figure 1b). Formally, rQuant solves the following optimization problem:
(1)
where T is the number of transcripts, w1, … , wT are the transcript abundance estimates, P is the set of genomic positions considered (corresponding to one genic locus), Cp is the observed read coverage (i.e. the number of reads covering a position) and Dt,p is the (estimated) read density of transcript t at position p. When using rQuant without bias estimation, Dt,p = 1, if transcript t is exonic at position p, and 0 otherwise. In an extended model described in Ref. (5), the coverage of introns with spliced reads appears as an additional term in the above optimization problem.
Due to the experimental biases mentioned above, the observed read coverage is typically non-uniform over the transcript (cf. Figure 1a). Then, Dt,p may depend on the relative position in the transcript, the distance to exon boundaries and the sequence context. While the first estimation step is solved for each genic locus independently, the optimization with respect to the biases considers many loci simultaneously and builds a predictive model for the expected read density Dt,p(θ) parameterized by θ. The basic idea is to find parameters θ such that the resulting read densities fit well to the observed read coverages. rQuant solves the following optimization problem:
where L is the number of loci, Pl is the set of positions considered for each locus, Dt,p(θ) is the θ-parametrized read density for transcript t at position p and R(θ) is an appropriately chosen regularization term to avoid model overfitting. The details of this estimation step and of the density model Dt,p(θ) are given in Ref. (5).

In Ref. (4), we have tested the accuracy of rQuant using artificially generated reads. In the absence of unbiased and precise quantitation techniques to compare against, the generation of artificial reads from transcripts of known abundance appears to be the only viable option. Only in this case we indeed know the ground truth such that we can evaluate how well the method performs. To obtain artificial reads, we used the ‘FluxSimulator’ (17), a software for transcriptome and read generation that simulates the biochemical processes underlying the library preparation. We generated reads from 563 transcripts of 244 alternatively spliced genes in Caenorhabditis elegans. Subsequently, we applied several variants of rQuant to quantify the transcripts from the simulated reads. In particular, we considered quantifying without profiles [as in Ref. (9) and/or by averaging read coverages over segments as in Ref. (12)]. The results of this comparison are shown in Table 1. We observe that rQuant achieves the highest accuracy when estimating density biases and considering the read coverages at each nucleotide independently. A more detailed comparison to other methods is given in Ref. (5).

Table 1.

Evaluation of rQuant

ApproachPearson's correlation
Across genesWithin genes
Position-based with profiles0.8820.622
Segment-based with profiles0.8180.451
Position-based without profiles0.8570.511
Segment-based without profiles0.8000.402
ApproachPearson's correlation
Across genesWithin genes
Position-based with profiles0.8820.622
Segment-based with profiles0.8180.451
Position-based without profiles0.8570.511
Segment-based without profiles0.8000.402

We compared the full version of rQuant to versions that use averages of read coverages in exon segments instead of considering each position separately and/or do not estimate density biases. We used a set of simulated reads from alternative transcripts with known expression level [for details cf. (4)]. The Pearson's correlation between true and inferred abundance was calculated across all transcripts, as well as the average of Pearson's correlation within alternative transcripts of each gene.

Table 1.

Evaluation of rQuant

ApproachPearson's correlation
Across genesWithin genes
Position-based with profiles0.8820.622
Segment-based with profiles0.8180.451
Position-based without profiles0.8570.511
Segment-based without profiles0.8000.402
ApproachPearson's correlation
Across genesWithin genes
Position-based with profiles0.8820.622
Segment-based with profiles0.8180.451
Position-based without profiles0.8570.511
Segment-based without profiles0.8000.402

We compared the full version of rQuant to versions that use averages of read coverages in exon segments instead of considering each position separately and/or do not estimate density biases. We used a set of simulated reads from alternative transcripts with known expression level [for details cf. (4)]. The Pearson's correlation between true and inferred abundance was calculated across all transcripts, as well as the average of Pearson's correlation within alternative transcripts of each gene.

WEB SERVER

We provide rQuant.web within the Galaxy framework (18,19), facilitating an easy interaction between integrated tools, storing the queries and results in user histories, and conveniently using workflows for re-occurring tasks. The concept of Galaxy is to split bigger analysis tasks into smaller steps that may be reused in combination with other tools. In Ref. (20), we described a web-based genome annotation system, called mGene.web, which provides a set of tools for genome-based analyses. For rQuant.web, we partially reuse the established data structures, such as genome and annotation objects [see Ref. (20) for a detailed description of these modules]. Moreover, rQuant.web makes use of an as yet unpublished Galaxy-based NGS analysis toolbox for NGS data (available at http://getgalaxy.org).

Modules

rQuant.web currently consists of three main components: data preparation, quantitation and bias estimation.

Data preparation

As a first step when using rQuant.web, one starts with uploading the reference genome in FASTA format, a set of transcripts in GFF3 format and the alignments of reads from an RNA-Seq experiment in BAM format (21). Data can be uploaded using Get DataUpload File. For the upload, either the Browse button can be used or the URL to a file stored on e.g. a FTP server can be pasted (particularly suitable for larger files).

The tools GenomeTool and GFF2Anno convert the reference genome sequence and the annotation, respectively, into internal data structures for efficient processing [cf. (20)]. Read alignments can also be uploaded in uncompressed alignment format (SAM) and then be converted to BAM format employing the tool SAM-to-BAM in section NGS: SAM Tools; this uses the SAM toolbox (21). Taking the aligned read data in the commonly used SAM/BAM format, our toolbox is applicable to read data from different NGS platforms, e.g. Illumina GA or SOLiD. The bias model estimation is motivated by the observations based on Illumina read data and cDNA library preparation protocols used for this platform. However, similar observations have been made for other platforms when using similar library preparation protocols. Alternatively, raw reads can be uploaded in FASTQ format and can then be aligned to the reference genome by applying tools also provided within the Galaxy framework [cf. sections NGS: Mapping, based on an as yet unpublished NGS Toolbox, and NGS: QPALMA Tools (13, G. Jean et al., submitted for publication)].

Before performing the actual quantitation, the user has the option to check for the consistency of the uploaded alignments and annotation with the tool ReadStats. It generates statistics about the input files, e.g. displays the number of reads identified in the given annotation, the median read coverage per gene, the number of spliced reads and the number of spliced reads overlapping annotated introns.

Quantitation

With the three prepared inputs, the core rQuant component determines the abundance of each transcript in the given annotation. When not using read density estimates, this tool does not have any parameters that need to be specified. The output is a GTF file which contains the annotation with abundance estimates given for each annotated transcript. rQuant computes two abundance estimates: one is based on the estimated average read coverage (ARC) for each transcript and one is the number of reads per thousand bases per million mapped reads (RPKM) (3). The ARC value is the result of the optimization problem (1), i.e. corresponding to variables w1, … , wT, and the RPKM value is computed based on the ARC value, the transcript length and the total number of aligned reads.

Read density estimation

To improve the accuracy of the abundance estimation, rQuant can also be used to infer a read density model to predict the read density for considered transcripts. This is done by selecting Learn Profiles and choosing a number of iterations for estimation. Then rQuant iteratively estimates the transcript abundances as well as the read density biases over several transcripts. Typically, not more than five iterations are needed to reach convergence. The output are the abundance estimates as before and a file that contains the parameters for the read density model. This parameter file can be used later for quantitations without the need to reoptimize these parameters (select Load profiles).

Examples and instructions

A description of rQuant.web with examples is available at http://fml.mpg.de/raetsch/suppl/rquant/web and at http://galaxy.fml.tuebingen.mpg.de/ (in section NGS: Quantitation ToolsExamples and Instructions).

rQuant.web histories and workflows

A history containing the result of the steps for the provided example can be imported here. In addition, we prepared a workflow covering all steps of the example that can be imported here (this requires an account that can be obtained in the User menu at no charge).

The quantifications, estimated profiles and other objects can be easily shared with other users via Galaxy's share history functionality. Moreover, Galaxy's Data Libraries contain items such as genome sequences and parsed annotations for several organisms for convenience.

Computing time

The median run time per gene locus is a few seconds. We tested rQuant using reads from the SRX001872 RNA-Seq experiment for C. elegans with 1893 annotated genes. The quantitation took about 45 min without coverage bias density. When enabling the coverage bias estimation, the whole process took roughly 1.5 h (one iteration).

DISCUSSION AND OUTLOOK

To the best of our knowledge, rQuant.web is the first and currently the only online tool to quantitatively analyse RNA-Seq data. Our web service is an important contribution to publicly available online NGS tools, enabling RNA-Seq quantitation for any user. It facilitates the analysis of NGS data by being embedded in our Galaxy installation and is very well suited to accurately infer the abundance of alternative transcripts along with a simultaneously learned bias model.

In a next step, we are planning to incorporate several additional features into the current rQuant.web version. A straightforward extension is to allow and exploit paired-end reads. Moreover, we will have an option to estimate the bias induced by the read content (5).

FUNDING

Funding for open access charge: Friedrich Miescher Laboratory of the Max Planck Society.

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

rQuant.web uses MOSEK, a commercial interior point algorithm for the solution of mathematical optimization problems (22). We thank the distributors of MOSEK for providing licences to offer this service to the public.

REFERENCES

1
Wang
Z
Gerstein
M
Snyder
M
RNA-Seq: a revolutionary tool for transcriptomics
Nat. Rev. Genet.
2009
, vol. 
10
 (pg. 
57
-
63
)
2
Mortazavi
A
Williams
BA
McCue
K
Schaeffer
L
Wold
B
Mapping and quantifying mammalian transcriptomes by RNA-Seq
Nat. Methods
2008
, vol. 
5
 (pg. 
621
-
628
)
3
Pepke
S
Wold
B
Mortazavi
A
Computation for ChIP-seq and RNA-seq studies
Nat. Methods
2009
, vol. 
6
 (pg. 
S22
-
S32
)
4
Bohnert
R
Behr
J
R�tsch
G
Transcript quantification with RNA-Seq data
BMC Bioinformatics
2009
, vol. 
10
 
Suppl. 13
pg. 
P5
 
5
Bohnert
R
R�tsch
G
Modelling biases for accurate RNA-Seq-based transcript quantitation
2010
 
Forthcoming, 2010. http://www.fml.mpg.de/raetsch/suppl/rquant (1 June 2010, date last accessed)
6
Dohm
JC
Lottaz
C
Borodina
T
Himmelbauer
H
Substantial biases in ultra-short read data sets from high-throughput DNA sequencing
Nucleic Acids Res.
2008
, vol. 
36
 pg. 
e105
 
7
Hansen
KD
Brenner
SE
Dudoit
S
Biases in Illumina transcriptome sequencing caused by random hexamer priming
Nucleic Acids Res.
2010
 
doi:10.1093/nar/gkq224 [14 April 2010, Epub ahead of print]
8
Linsen
SEV
de Wit
E
Janssens
G
Heater
S
Chapman
L
Parkin
RK
Fritz
B
Wyman
SK
de Bruijn
E
Voest
EE
et al. 
Limitations and possibilities of small RNA digital gene expression profiling
Nat. Methods
2009
, vol. 
6
 (pg. 
474
-
476
)
9
Jiang
H
Wong
WH
Statistical inferences for isoform expression in RNA-Seq
Bioinformatics
2009
, vol. 
25
 (pg. 
1026
-
1032
)
10
Richard
H
Schulz
MH
Sultan
M
Nurnberger
A
Schrinner
S
Balzereit
D
Dagand
E
Rasche
A
Lehrach
H
Vingron
M
et al. 
Prediction of alternative isoforms from exon expression levels in RNA-Seq experiments
Nucleic Acids Res.
2010
 
doi:10.1093/nar/gkq041 [11 February 2010, Epub ahead of print]
11
Li
B
Ruotti
V
Stewart
RM
Thomson
JA
Dewey
CN
RNA-Seq gene expression estimation with read mapping uncertainty
Bioinformatics
2010
, vol. 
26
 (pg. 
493
-
500
)
12
Sammeth
M
The Flux Capacitor
2010
 
http://flux.sammeth.net/capacitor.html (1 June 2010, date last accessed)
13
De Bona
F
Ossowski
S
Schneeberger
K
R�tsch
G
Optimal spliced alignments of short sequence reads
Bioinformatics
2008
, vol. 
24
 (pg. 
i174
-
i180
)
14
Schneeberger
K
Hagmann
J
Ossowski
S
Warthmann
N
Gesing
S
Kohlbacher
O
Weigel
D
Simultaneous alignment of short reads against multiple genomes
Genome Biol.
2009
, vol. 
10
 pg. 
R98
 
15
Schweikert
G
Zien
A
Zeller
G
Behr
J
Dieterich
C
Ong
CS
Philips
P
de Bona
F
Hartmann
L
Bohlen
A
et al. 
mGene: accurate SVM-based gene finding with an application to nematode genomes
Genome Res.
2009
, vol. 
19
 (pg. 
2133
-
2143
)
16
Hillier
LW
Reinke
V
Green
P
Hirst
M
Marra
MA
Waterston
RH
Massively parallel sequencing of the polyadenylated transcriptome of C. elegans
Genome Res.
2009
, vol. 
19
 (pg. 
657
-
666
)
17
Sammeth
M
The Flux Simulator
2010
 
http://flux.sammeth.net/simulator.html (1 June 2010, date last accessed)
18
Giardine
B
Riemer
C
Hardison
RC
Burhans
R
Elnitski
L
Shah
P
Zhang
Y
Blankenberg
D
Albert
I
Taylor
J
et al. 
Galaxy: a platform for interactive large-scale genome analysis
Genome Res.
2005
, vol. 
15
 (pg. 
1451
-
1455
)
19
Blankenberg
D
Von Kuster
G
Coraor
N
Ananda
G
Lazarus
R
Mangan
M
Nekrutenko
A
Taylor
J
Galaxy: a web-based genome analysis tool for experimentalists
Curr. Protoc. Mol. Biol.
2010
, vol. 
10
 (pg. 
1
-
21
Chapter 19:Unit 19.10.1–21
20
Schweikert
G
Behr
J
Zien
A
Zeller
G
Ong
CS
Sonnenburg
S
R�tsch
G
mGene.web: a web service for accurate computational gene finding
Nucleic Acids Res.
2009
, vol. 
37
 
Suppl. 2
(pg. 
W312
-
W316
)
21
Li
H
Handsaker
B
Wysoker
A
Fennell
T
Ruan
J
Homer
N
Marth
G
Abecasis
G
Durbin
R
1000 Genome Project Data Processing Sub group. The sequence Alignment/Map format and SAMtools
Bioinformatics
2009
, vol. 
25
 (pg. 
2078
-
2079
)
22
Andersen
ED
Jensen
B
Jensen
J
Sandvik
R
Worsoe
U
Mosek version 6
Technical Report TR-2009-3
2009
MOSEK ApS
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Comments

0 Comments
Submit a comment
You have entered an invalid code
Thank you for submitting a comment on this article. Your comment will be reviewed and published at the journal's discretion. Please check for further notifications by email.