Abstract

Motivation

Building gene co-expression network (GCN) from gene expression data is an important field of bioinformatic research. Nowadays, RNA-seq data provides high dimensional information to quantify gene expressions in term of read counts for individual exons of genes. Such an increase in the dimension of expression data during the transition from microarray to RNA-seq era made many previous co-expression analysis algorithms based on simple univariate correlation no longer applicable. Recently, two vector-based methods, SpliceNet and RNASeqNet, have been proposed to build GCN. However, they failed to work when sample size is less than the number of exons.

Results

We develop an algorithm called VCNet to construct GCN from RNA-seq data to overcome this dimensional problem. VCNet performs a new statistical hypothesis test based on the correlation matrix of a gene–gene pair using the Frobenius norm. The asymptotic distribution of the new test is obtained under the null model. Simulation studies demonstrate that VCNet outperforms SpliceNet and RNASeqNet for detecting edges of GCN. We also apply VCNet to two expression datasets from TCGA database: the normal breast tissue and kidney tumour tissue, and the results show that the GCNs constructed by VCNet contain more biologically meaningful interactions than existing methods.

Conclusion

VCNet is a useful tool to construct co-expression network.

Availability and Implementation

VCNet is open source and freely available from https://github.com/wangzengmiao/VCNet under GNU LGPL v3

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Large-scale quantification of gene expression has been made possible with advance in microarray and high throughput sequencing (HTS) technologies, and the importance of interactions among genes has been widely recognized. A network is a natural tool to represent functional interactions, and it also provides clues to the causes and mechanisms of biological phenomena (Wu et al., 2010; Yalamanchili et al., 2014a). At first, gene networks were built under a prior or given a framework of already recognized biological pathways. However, knowledge of biological pathway are limited and thus, such knowledge-driven network building approach limits the information gathered from the data obtained in an experiment. This drawback led to development of methods of network construction directly from experimental data without using prior biological knowledge. Before long, weighted network construction methods were also developed (Horvath, 2011; Zhang et al., 2005). Gene co-expression networks (GCN) constructed by these algorithms have provided new perspective to understand biological processes. In essence, GCN can detect the gene pairs who have similar and/or correlated expression patterns. These genes are often controlled by the same transcriptional regulatory mechanism, functionally related, or members of the same pathway or protein complex (Weirauch, 2011). Thus, GCN may reveal potentially new functional relationships of genes and biologic pathways, and help us to identify new targets for treatment (Cheng et al., 2012; Yang et al., 2014). Also, Hansen et al. (2014) extends the application to plants.

With the wide-spread application of RNA-seq to obtain expression data, building a weighted network based on GCN becomes more tricky. During the microarray era in the last decade, a genome-wide expression typically of 30 000 genes were obtained (Kerr et al., 2000). After filtering out genes with no/very low expression, experimental datasets would only have a reduced number of attributes (genes) in the range of 10 000–15 000. Nowadays, RNA-seq has become a routine method to analyse transcriptomes (Marguerat and Bähler, 2010). In RNA-seq, mRNA is converted to a library of cDNA fragments and sequenced in a high-throughput manner to produce short reads. Typically, 30–100 million reads are obtained in a sample. Then these reads are aligned to a reference genome to obtain counts of reads that are mapped onto transcribed regions of the genome. These regions could be genes, exons and transcripts (Wang et al., 2009). By this way, the expressions of transcriptional regions (typically over 100 000 exons or transcripts) are estimated, i.e. number of attributes increased by >10- to 20-folds from the previous microarray era. Several GCN algorithms that had been designed to handle microarray data were modified to accommodate RNA-seq data. However, limitations of these previous algorithms are apparent. There are two key hurdles which prevent a simple migration of old GCN algorithms to an application using RNA-seq data. First, the number of attributes increased by >10 times from 10 000 genes (in a microarray) to over 100 000 exons or transcripts (in RNA-seq). Second, expression of a gene quantified as a vector of exonic expression read counts for each gene in RNA-seq data instead of a scalar in microarray (one expression value per gene). Such exonic expression values may have correlation among themselves. For example, various exons of the same gene would have correlated expression levels. Although many microarray algorithms claimed that they could work for RNA-seq, they could only handle different entities of the same gene as separate entities and thus, could not produce a summative measure of these exons/transcripts. Other shortcomings include a requirement of large sample size, a lack of statistical basis and thus results are not robust, not applicable to a large dataset with such a large number of attributes. Therefore, new tailor made GCN methods are needed.

Only few tools of GCN reconstruction tailored for RNA-seq data exist (López-Kleine and González-Prieto, 2016). Two algorithms have been proposed to perform GCN analysis using a gene-vector basis to analyse RNA-seq data. Firstly, Hong et al. (2013) proposed a method called RNASeqNet to construct the GCN based on the read count vectors of genes. For a gene, each component of this vector is the expression value (read counts) of one exon of the gene, and the dimension of a vector is the number of exons. Then, exhaustive pairwise Canonical Correlation Analysis (CCA) are performed using vectors of all pairs of a given gene set. However, CCA is well known for its dimensional limitation, and it cannot work when the total number of exons of two genes is greater than the sample size. Otherwise, the results of canonical correlations are constrained to unity irrespective of the real population canonical correlations (Pezeshki et al., 2004). Yalamanchili et al. (2014b) tried to develop another algorithm termed SpliceNet to ease the requirement for sample size. They performed pairwise hypothesis tests on covariance matrices of all pairs of two genes using a large dimensional trace. SpliceNet procedure involves the exact computation of the inverse of two covariance matrices for gene pairs. However, this procedure is also constrained by dimensional restriction when the maximal number of exons of two genes is equal to or larger than the sample size. The sample covariance matrix is singular in this situation. In fact, according to TCGA annotation, ∼10% genes are with 25 or more exons, which is illustrated in Supplementary Figure S1 and Supplementary Table S7. However, few samples (e.g. <25 samples, N<25) are usually analysed in RNA-seq experiments by a typical researcher. TCGA represents the largest scale of RNA-seq analysis nowadays and even so, for some tumour no more than 25 samples are analysed. Therefore, this dimensional issue is a real problem, and a novel method which can handle vectors with a large number of exons in datasets with only a moderate sample size is needed.

In this paper, we propose a new algorithm called VCNet to construct GCN from RNA-seq data. VCNet performs a new hypothesis test based on transformed correlation matrix of a gene–gene pair using the Frobenius norm. The Frobenius norm is a commonly used metric of matrices, with the advantage of having an asymptotic distribution of test statistics. Similar to RNASeqNet, we use the read counts vector to represent the expression of a gene. VCNet alleviates the dimensional bottleneck of sample size because it does not require computation of inverse of any matrix. Moreover, VCNet used the asymptotic distribution of its statistics to determine the P-value. In the simulation study, the true positive rate of VCNet is higher than those of SpliceNet and RNASeqNet under the same false positive rate. VCNet is very efficient to detect gene interaction, even in situations where the other two methods failed. Besides, real RNA-seq datasets were also analysed. In the analysis of the normal breast tissue, we find the network inferred by VCNet contained more biologically meaningful edges which are in agreement with literature than those inferred by SpliceNet and RNASeqNet. In differential network analysis of kidney chromophobe cancer dataset, VCNet can identify the interactions associated with EP300 and CREBBP together with many components in the Hypoxia-Inducible Factor pathway, while SpliceNet and RNASeqNet fail to find them because the exon numbers of these two genes are larger than the sample size.

2 Materials and Methods

There are various approaches to represent gene expression data in RNA-seq datasets. Traditionally, the RPKM of gene is used to represent gene expression, which is a scalar. However, this approach does not take full advantage of the high resolution in RNA-seq. Therefore, vector-based expression is proposed. So far, there are three different ways to express RNA-seq data as vectors. First, one vector corresponds to one gene. Each entry represents counts of sequence reads mapped to corresponding exon of a gene. Therefore, the dimension of a vector is equal to the number of exons of that gene. Second, one vector corresponds to one isoform and it is similar to a exonic vector. However, each component is the read counts mapped to a corresponding exon that is specific to that isoform. Third, expression levels of various isoforms of a gene form the component of a gene-vector, in which the dimension is the number of isoforms belonging to the gene. Each component is the estimated expression of the corresponding isoform. PennSeq (Hu et al., 2014), RSEM (Li and Dewey, 2011) and WemIQ (Zhang et al., 2014) are commonly used software to estimate isoform expression. The illustration of input data is shown in Supplementary Figure S6. The detailed discussion about the four different formats of data input is in the Supplementary Material. After extensive evaluation, the performance of the exon expression of a gene is better than those of other inputs. We select the exonic expression of a gene as our input data for the main text.

2.1 Methods to determine pair-wise co-expression genes using gene vectors

Suppose that we have two genes X and Y with p and q number of exons, respectively. The exon expressions for two genes are represented by X=(X1,,Xp)T and Y=(Y1,,Yq)T, where each component is the expression of the corresponding exon in gene X and gene Y. Let R12=(rij)p×q represent the correlation coefficient matrix between X and Y, where rij is the correlation coefficient between the ith element of X and the jth element of Y. We can use R12 to define the co-expression between two genes. Actually, if (XT,YT)T follows a (p+q)-dimensional normal distribution, R12=0 equals to the independence of gene X and Y. Therefore, the detection of co-expression of two genes can be transformed to following hypothesis testing problem:
(1)
We define a statistic for the hypothesis (1) as
(2)
where f is f(r)=13ln1+3r/213r/2,r^ij is the sample correlation of the (i, j) element in R12. When R12=0,TVCNet=0. Actually, TVCNet is the squared Frobenius norm of a transformed correlation matrix using a function of f. Note that the transformation for R12 is element-wised. There are two key advantages of this transformation. First, correlation ranging from [−1, 1] is transformed to a whole real number, which facilitates subsequent calculation. Second, f belongs to the family of Bartlett type corrections (Christoph et al., 2013) and it improves the approximation for the asymptotic distribution of R12 up to the order O(n2).
We reject the null hypothesis when TVCNet is large. Under the null model, we prove that
(3)
where (λ1,,λp) and (μ1,,μq) are the eigenvalues of R11 and R22, respectively, R11 and R22 are the correlation matrices of X and Y, n2.5 is a corrected sample size (Christoph et al., 2013), and χij2 are independent χ2 random variables with the degree freedom of 1. We first prove that the sum square of elements of R12 follows the mixture χ2 distribution in equation (3) asymptotically using the conclusion of Neudecker and Wesselman (1990). The conclusion states that the vectorization of a correlation matrix follows a multivariable normal distribution under some regular conditions. Then we perform the transformation f to R12 and use Delta method to obtain the limiting distribution of the transformed R12. Similar to the derivation of the distribution for the squared Frobenius norm of R12, we get the asymptotic distribution of (n2.5)TVCNet. The detailed derivation is found in the Supplementary Material.

In practice, we use the eigenvalues of R^11 and R^22 to estimate the (λ1,,λp) and (μ1,,μq), where R^11 and R^22 are the sample correlation matrices of X and Y, respectively. A co-expression edge is kept between two genes if we reject the null hypothesis. We reject it when the observed value of (n2.5)TVCNet is larger than the critical value Tα. The critical value Tα is the αth upper quantile of i=1pj=1qλiμjχij2. The asymptotic distribution of (n2.5)TVCNet under null hypothesis is a mixture of χ2 distribution, which was approximated by Davies method (Davies, 1980). False Discovery Rate (FDR) is used to account for multiple testing.

Compared with SpliceNet and RNASeqNet, VCNet completely avoided the need to compute the inverse matrices of the covariance matrices. Therefore, it is tailored to work under large dimension and small sample size condition. In contrast, SpliceNet performs hypothesis testing on the covariance matrix Σ12 between X and Y, and is vulnerable to the heterogeneity of variances which is a common encounter in gene expression data. VCNet, as an general GCN algorithm, can be used with different inputs. The specific results are in the Supplementary Material.

3 Result

3.1 Simulation study

To make the simulation data more close to real data, we conduct the following simulation procedure. We first generated the isoform expression of genes. And then exonic expressions corresponding to this gene were obtained according to its isoform expressions. Specifically, U=(U1,U2) and V=(V1,V2) follow two-dimensional normal distributions N2(0,Σ1) and N2(0,Σ2), respectively. Σ1=(1c0c01) and Σ2=(1c0c01). Here, we assume that both of genes have two isoforms. F(1)=(U1,V1) is the isoform expressions for the first gene and F(2)=(U2,V2) is the isoform expressions for the second gene. The two genes are independent when c0=0 and are correlated when c00. We use equation X=Ap×2(1)F(1)+ϵ and Y=Aq×2(2)F(2)+ϵ to obtain the exonic expression corresponding to the two genes. F is the isoform expression of a gene. A is an indicator matrix, its row corresponding to exon and column corresponding to isoform. Aij takes one if the jth isoform contains the ith exon and zero otherwise. ϵ follows an independent normal distribution with zero mean and one variance. These two equations are an approximation to the relationship between exonic and isoform expression. Many methods to estimate isoform expression are based on them (Jiang and Wong, 2009). Note that we use X and Y as input. For gene pair X and Y, we set the exon numbers (p and q) of two genes to 5-5, 15-20, 25-25, and more extreme values of 10-100, 90-100 and the sample size to 20, 30, and 40. The specific A under different exon number setting are shown in Supplementary Table S15. For c0, we set it to 0.3, 0.4, and 0.5. In total, there are 45 combinations of settings in the simulation. For each combination, 2000 samples were generated. Half of them are from the null model (c0=0) and the remaining samples are from the alternative model (c00). We use the Receiver Operating Characteristic (ROC) curves and area under curve (AUC) to evaluate the performances of VCNet, SpliceNet and RNASeqNet. ROC curves are displayed for the three methods in Figure 1.

ROC curves comparing various algorithms using the simulation study. Six different sets of simulation experiments are shown with different number of exons in two genes under evaluation. p and q are the number of exons in the two genes, respectively. n represents the sample size and c0 represents the dependency of two vectors. In some simulation settings, SpliceNet and/or RNASeqNet failed to work and their ROC are not shown. A vertical line is drawn at false positive rate = 0.1 to compare the corresponding true positive rates among the algorithms
Fig. 1

ROC curves comparing various algorithms using the simulation study. Six different sets of simulation experiments are shown with different number of exons in two genes under evaluation. p and q are the number of exons in the two genes, respectively. n represents the sample size and c0 represents the dependency of two vectors. In some simulation settings, SpliceNet and/or RNASeqNet failed to work and their ROC are not shown. A vertical line is drawn at false positive rate = 0.1 to compare the corresponding true positive rates among the algorithms

On the top row of Figure 1, the simulation results of the setting of sample size = 40 and two genes with 5 exons. Under this condition, all three algorithms showed similar performance though the association strength c0 had some effect on the performances among them. It is observed that AUCs of all the methods were significantly increasing with the increasing association strength. When c0 is reduced to its lowest end, VCNet has higher true positive rate than the other two methods at the same false positive rate of 0.1. SpliceNet is better than RNASeqNet.

Next, we evaluate the influence of the sample size and the exon number on the performances of three methods. The second row of Figure 1, showed the simulation results of the setting of two genes with 15 exons and 20 exons, respectively. Effect of different sample size on the performance was shown when c0 was set to 0.5. When the sample size is 20, the total number (15 + 20) and the maximal number (20) of the exons for two genes are larger than or equal to the sample size. Therefore, RNASeqNet and SpliceNet failed to analyse such datasets, and there was no ROC curves for them. In contrast, VCNet shows pretty good ROC curve. An increasing trend of performance for VCNet was also observed with sample size from 20 to 40. When the sample size increased to 30 or 40, VCNet showed the largest AUC among the three algorithms. It shows the ability of VCNet to handle small samples, which is a bottleneck of many existing algorithms. We also display AUCs for all methods under different conditions for a comprehensive comparison. The detailed result is shown in Table 1.

Table 1

AUC comparison under different conditions for a simulation study

Exon expression of genec00.30.40.5
AUCN203040203040203040
p=5, q=5VCNet0.6530.7000.7780.7480.8390.8870.8320.9120.943
SpliceNet0.5810.6540.6930.6530.7540.7950.7190.8390.903
RNASeqNet0.5210.5430.5670.5440.6070.6530.5810.6940.782
p=15, q=20VCNet0.6930.7580.8190.7990.8850.9200.8820.9450.976
SpliceNet0.5190.5330.5310.5670.5560.602
RNASeqNet0.5060.5170.519
p=25, q=25VCNet0.6320.7000.7650.7210.8200.8690.8260.9050.938
SpliceNet0.5050.5060.5130.5380.5230.554
RNASeqNet
p=10, q=100VCNet0.6980.7770.8140.7900.8850.9290.8780.9500.966
SpliceNet
RNASeqNet
Exon expression of genec00.30.40.5
AUCN203040203040203040
p=5, q=5VCNet0.6530.7000.7780.7480.8390.8870.8320.9120.943
SpliceNet0.5810.6540.6930.6530.7540.7950.7190.8390.903
RNASeqNet0.5210.5430.5670.5440.6070.6530.5810.6940.782
p=15, q=20VCNet0.6930.7580.8190.7990.8850.9200.8820.9450.976
SpliceNet0.5190.5330.5310.5670.5560.602
RNASeqNet0.5060.5170.519
p=25, q=25VCNet0.6320.7000.7650.7210.8200.8690.8260.9050.938
SpliceNet0.5050.5060.5130.5380.5230.554
RNASeqNet
p=10, q=100VCNet0.6980.7770.8140.7900.8850.9290.8780.9500.966
SpliceNet
RNASeqNet

Note: p, number of exons in the first gene; q, number of exons in the second one; N, sample size; c0, dependency of two vectors.

Table 1

AUC comparison under different conditions for a simulation study

Exon expression of genec00.30.40.5
AUCN203040203040203040
p=5, q=5VCNet0.6530.7000.7780.7480.8390.8870.8320.9120.943
SpliceNet0.5810.6540.6930.6530.7540.7950.7190.8390.903
RNASeqNet0.5210.5430.5670.5440.6070.6530.5810.6940.782
p=15, q=20VCNet0.6930.7580.8190.7990.8850.9200.8820.9450.976
SpliceNet0.5190.5330.5310.5670.5560.602
RNASeqNet0.5060.5170.519
p=25, q=25VCNet0.6320.7000.7650.7210.8200.8690.8260.9050.938
SpliceNet0.5050.5060.5130.5380.5230.554
RNASeqNet
p=10, q=100VCNet0.6980.7770.8140.7900.8850.9290.8780.9500.966
SpliceNet
RNASeqNet
Exon expression of genec00.30.40.5
AUCN203040203040203040
p=5, q=5VCNet0.6530.7000.7780.7480.8390.8870.8320.9120.943
SpliceNet0.5810.6540.6930.6530.7540.7950.7190.8390.903
RNASeqNet0.5210.5430.5670.5440.6070.6530.5810.6940.782
p=15, q=20VCNet0.6930.7580.8190.7990.8850.9200.8820.9450.976
SpliceNet0.5190.5330.5310.5670.5560.602
RNASeqNet0.5060.5170.519
p=25, q=25VCNet0.6320.7000.7650.7210.8200.8690.8260.9050.938
SpliceNet0.5050.5060.5130.5380.5230.554
RNASeqNet
p=10, q=100VCNet0.6980.7770.8140.7900.8850.9290.8780.9500.966
SpliceNet
RNASeqNet

Note: p, number of exons in the first gene; q, number of exons in the second one; N, sample size; c0, dependency of two vectors.

In summary, we can see that VCNet is optimal among the three methods in handling simulation data. Overall, VCNet needs fewer samples than SpliceNet and RNASeqNet to achieve the same true positive rate.

3.2 Analysis of real data

3.2.1 Network construction

In this section, we evaluate whether VCNet can reconstruct networks that is closed to the ground truth (known biological pathways). The exon expressions of genes (level 3, exon-level RPKM) of RNA-seq data from healthy breast tissue are downloaded from TCGA data portal. In total, there are 113 normal samples. We analysed two pathways: ‘resolution of D loop structure’, ‘homologous DNA pairing and strand exchange’ from Reactome Database (Croft et al., 2014; Milacic et al., 2012) for illustration. We use Reactome and Physical interaction from GeneMANIA as gold standard to evaluate the predictions of three methods, seperately. To display the effects of sample size on the performance of network construction, we randomly re-sampled N (N=20,30,40,100) samples to produce validation datasets. VCNet, SpliceNet and RNASeqNet were then analysed using these re-sampled datasets, separately. This procedure was replicated 20 times. Twenty networks were constructed by each algorithm under each sample size setting. For VCNet and SpliceNet, we obtain the edges under the cut-off 0.05 of FDR. For RNASeqNet, we select the 5% of edges with top weights as suggested by original paper. We compare the networks to the two ground truths. We use precision, recall to evaluate the performances. F-score is also employed and Fscore=2×Precision×RecallPrecison+Recall. F-score is a balance between precision and recall. A larger F-score indicates a good behaviour. The results for the two pathways are shown in Table 2. The pathway ‘resolution of D loop structure’ using Reactome Pathway as a gold standard is shown in Table 2. For all algorithms, the precision, recall and F-score increased with increasing sample sizes. For low sample size, VCNet had the highest precision, recall and F-score as well.

Table 2

Precision, recall and F-score for the two pathways

Reactome pathway
Physical interaction
Precision
Recall
F-score
Precision
Recall
F-score
MeanVarMeanVarMeanVarMeanVarMeanVarMeanVar
Pathway 1
 N=20
  VCNet0.8030.0020.6270.0140.6970.0060.2460.0000.6390.0150.3520.001
  SpliceNet0.2330.1500.0020.0000.0120.0000.0580.0510.0010.0000.0080.000
  RNASeqNet0.6500.0160.0430.0000.0810.0000.1470.0060.0320.0000.0530.001
 N=30
  VCNet0.8110.0020.7680.0040.7870.0010.2450.0000.7750.0050.3720.000
  SpliceNet0.7620.0030.2360.0080.3520.0100.2080.0020.2120.0070.2020.002
  RNASeqNet0.8530.0050.0560.0000.1060.0000.2170.0080.0480.0000.0780.001
 N=40
  VCNet0.7970.0000.8310.0020.8130.0000.2390.0000.8330.0020.3720.000
  SpliceNet0.7570.0000.5020.0100.5980.0060.2130.0010.4680.0070.2900.001
  RNASeqNet0.8670.0050.0570.0000.1070.0000.4430.0090.0980.0000.1600.001
 N=100
  VCNet0.7700.0000.9390.0000.8460.0000.2310.0000.9410.0000.3710.000
  SpliceNet0.7590.0000.9640.0000.8490.0000.2340.0000.9940.0000.3790.000
  RNASeqNet0.8900.0020.0590.0000.1100.0000.5870.0010.1290.0000.2120.000
Pathway 2
 N=20
  VCNet0.8510.0000.6250.0090.7160.0040.2820.0000.6400.0110.3890.001
  SpliceNet0.5470.2240.0050.0000.0140.0000.2020.0760.0050.0000.0140.000
  RNASeqNet0.7040.0040.0430.0000.0810.0000.2180.0020.0410.0000.0690.000
 N=30
  VCNet0.8420.0000.7360.0030.7840.0010.2770.0000.7500.0030.4050.000
  SpliceNet0.8210.0010.2610.0060.3900.0080.2720.0010.2660.0060.2630.002
  RNASeqNet0.9220.0000.0570.0000.1070.0000.2690.0030.0510.0000.0860.000
 N=40
  VCNet0.8400.0000.8030.0020.8210.0000.2780.0000.8210.0020.4150.000
  SpliceNet0.8320.0000.5390.0040.6520.0030.2630.0000.5270.0040.3490.001
  RNASeqNet0.9160.0010.0560.0000.1060.0000.3850.0030.0730.0000.1230.000
 N=100
  VCNet0.8360.0000.9220.0000.8770.0000.2750.0000.9370.0000.4250.000
  SpliceNet0.8380.0000.9530.0000.8910.0000.2810.0000.9900.0000.4380.000
  RNASeqNet0.9960.0000.0610.0000.1150.0000.6750.0010.1280.0000.2150.000
Reactome pathway
Physical interaction
Precision
Recall
F-score
Precision
Recall
F-score
MeanVarMeanVarMeanVarMeanVarMeanVarMeanVar
Pathway 1
 N=20
  VCNet0.8030.0020.6270.0140.6970.0060.2460.0000.6390.0150.3520.001
  SpliceNet0.2330.1500.0020.0000.0120.0000.0580.0510.0010.0000.0080.000
  RNASeqNet0.6500.0160.0430.0000.0810.0000.1470.0060.0320.0000.0530.001
 N=30
  VCNet0.8110.0020.7680.0040.7870.0010.2450.0000.7750.0050.3720.000
  SpliceNet0.7620.0030.2360.0080.3520.0100.2080.0020.2120.0070.2020.002
  RNASeqNet0.8530.0050.0560.0000.1060.0000.2170.0080.0480.0000.0780.001
 N=40
  VCNet0.7970.0000.8310.0020.8130.0000.2390.0000.8330.0020.3720.000
  SpliceNet0.7570.0000.5020.0100.5980.0060.2130.0010.4680.0070.2900.001
  RNASeqNet0.8670.0050.0570.0000.1070.0000.4430.0090.0980.0000.1600.001
 N=100
  VCNet0.7700.0000.9390.0000.8460.0000.2310.0000.9410.0000.3710.000
  SpliceNet0.7590.0000.9640.0000.8490.0000.2340.0000.9940.0000.3790.000
  RNASeqNet0.8900.0020.0590.0000.1100.0000.5870.0010.1290.0000.2120.000
Pathway 2
 N=20
  VCNet0.8510.0000.6250.0090.7160.0040.2820.0000.6400.0110.3890.001
  SpliceNet0.5470.2240.0050.0000.0140.0000.2020.0760.0050.0000.0140.000
  RNASeqNet0.7040.0040.0430.0000.0810.0000.2180.0020.0410.0000.0690.000
 N=30
  VCNet0.8420.0000.7360.0030.7840.0010.2770.0000.7500.0030.4050.000
  SpliceNet0.8210.0010.2610.0060.3900.0080.2720.0010.2660.0060.2630.002
  RNASeqNet0.9220.0000.0570.0000.1070.0000.2690.0030.0510.0000.0860.000
 N=40
  VCNet0.8400.0000.8030.0020.8210.0000.2780.0000.8210.0020.4150.000
  SpliceNet0.8320.0000.5390.0040.6520.0030.2630.0000.5270.0040.3490.001
  RNASeqNet0.9160.0010.0560.0000.1060.0000.3850.0030.0730.0000.1230.000
 N=100
  VCNet0.8360.0000.9220.0000.8770.0000.2750.0000.9370.0000.4250.000
  SpliceNet0.8380.0000.9530.0000.8910.0000.2810.0000.9900.0000.4380.000
  RNASeqNet0.9960.0000.0610.0000.1150.0000.6750.0010.1280.0000.2150.000

Note: Pathway 1 is the resolution of D loop structure and Pathway 2 is the homologous DNA pairing and strand exchange.

Table 2

Precision, recall and F-score for the two pathways

Reactome pathway
Physical interaction
Precision
Recall
F-score
Precision
Recall
F-score
MeanVarMeanVarMeanVarMeanVarMeanVarMeanVar
Pathway 1
 N=20
  VCNet0.8030.0020.6270.0140.6970.0060.2460.0000.6390.0150.3520.001
  SpliceNet0.2330.1500.0020.0000.0120.0000.0580.0510.0010.0000.0080.000
  RNASeqNet0.6500.0160.0430.0000.0810.0000.1470.0060.0320.0000.0530.001
 N=30
  VCNet0.8110.0020.7680.0040.7870.0010.2450.0000.7750.0050.3720.000
  SpliceNet0.7620.0030.2360.0080.3520.0100.2080.0020.2120.0070.2020.002
  RNASeqNet0.8530.0050.0560.0000.1060.0000.2170.0080.0480.0000.0780.001
 N=40
  VCNet0.7970.0000.8310.0020.8130.0000.2390.0000.8330.0020.3720.000
  SpliceNet0.7570.0000.5020.0100.5980.0060.2130.0010.4680.0070.2900.001
  RNASeqNet0.8670.0050.0570.0000.1070.0000.4430.0090.0980.0000.1600.001
 N=100
  VCNet0.7700.0000.9390.0000.8460.0000.2310.0000.9410.0000.3710.000
  SpliceNet0.7590.0000.9640.0000.8490.0000.2340.0000.9940.0000.3790.000
  RNASeqNet0.8900.0020.0590.0000.1100.0000.5870.0010.1290.0000.2120.000
Pathway 2
 N=20
  VCNet0.8510.0000.6250.0090.7160.0040.2820.0000.6400.0110.3890.001
  SpliceNet0.5470.2240.0050.0000.0140.0000.2020.0760.0050.0000.0140.000
  RNASeqNet0.7040.0040.0430.0000.0810.0000.2180.0020.0410.0000.0690.000
 N=30
  VCNet0.8420.0000.7360.0030.7840.0010.2770.0000.7500.0030.4050.000
  SpliceNet0.8210.0010.2610.0060.3900.0080.2720.0010.2660.0060.2630.002
  RNASeqNet0.9220.0000.0570.0000.1070.0000.2690.0030.0510.0000.0860.000
 N=40
  VCNet0.8400.0000.8030.0020.8210.0000.2780.0000.8210.0020.4150.000
  SpliceNet0.8320.0000.5390.0040.6520.0030.2630.0000.5270.0040.3490.001
  RNASeqNet0.9160.0010.0560.0000.1060.0000.3850.0030.0730.0000.1230.000
 N=100
  VCNet0.8360.0000.9220.0000.8770.0000.2750.0000.9370.0000.4250.000
  SpliceNet0.8380.0000.9530.0000.8910.0000.2810.0000.9900.0000.4380.000
  RNASeqNet0.9960.0000.0610.0000.1150.0000.6750.0010.1280.0000.2150.000
Reactome pathway
Physical interaction
Precision
Recall
F-score
Precision
Recall
F-score
MeanVarMeanVarMeanVarMeanVarMeanVarMeanVar
Pathway 1
 N=20
  VCNet0.8030.0020.6270.0140.6970.0060.2460.0000.6390.0150.3520.001
  SpliceNet0.2330.1500.0020.0000.0120.0000.0580.0510.0010.0000.0080.000
  RNASeqNet0.6500.0160.0430.0000.0810.0000.1470.0060.0320.0000.0530.001
 N=30
  VCNet0.8110.0020.7680.0040.7870.0010.2450.0000.7750.0050.3720.000
  SpliceNet0.7620.0030.2360.0080.3520.0100.2080.0020.2120.0070.2020.002
  RNASeqNet0.8530.0050.0560.0000.1060.0000.2170.0080.0480.0000.0780.001
 N=40
  VCNet0.7970.0000.8310.0020.8130.0000.2390.0000.8330.0020.3720.000
  SpliceNet0.7570.0000.5020.0100.5980.0060.2130.0010.4680.0070.2900.001
  RNASeqNet0.8670.0050.0570.0000.1070.0000.4430.0090.0980.0000.1600.001
 N=100
  VCNet0.7700.0000.9390.0000.8460.0000.2310.0000.9410.0000.3710.000
  SpliceNet0.7590.0000.9640.0000.8490.0000.2340.0000.9940.0000.3790.000
  RNASeqNet0.8900.0020.0590.0000.1100.0000.5870.0010.1290.0000.2120.000
Pathway 2
 N=20
  VCNet0.8510.0000.6250.0090.7160.0040.2820.0000.6400.0110.3890.001
  SpliceNet0.5470.2240.0050.0000.0140.0000.2020.0760.0050.0000.0140.000
  RNASeqNet0.7040.0040.0430.0000.0810.0000.2180.0020.0410.0000.0690.000
 N=30
  VCNet0.8420.0000.7360.0030.7840.0010.2770.0000.7500.0030.4050.000
  SpliceNet0.8210.0010.2610.0060.3900.0080.2720.0010.2660.0060.2630.002
  RNASeqNet0.9220.0000.0570.0000.1070.0000.2690.0030.0510.0000.0860.000
 N=40
  VCNet0.8400.0000.8030.0020.8210.0000.2780.0000.8210.0020.4150.000
  SpliceNet0.8320.0000.5390.0040.6520.0030.2630.0000.5270.0040.3490.001
  RNASeqNet0.9160.0010.0560.0000.1060.0000.3850.0030.0730.0000.1230.000
 N=100
  VCNet0.8360.0000.9220.0000.8770.0000.2750.0000.9370.0000.4250.000
  SpliceNet0.8380.0000.9530.0000.8910.0000.2810.0000.9900.0000.4380.000
  RNASeqNet0.9960.0000.0610.0000.1150.0000.6750.0010.1280.0000.2150.000

Note: Pathway 1 is the resolution of D loop structure and Pathway 2 is the homologous DNA pairing and strand exchange.

Table 4

Time comparison

Time (s)Normal
Tumor
Linux
Windows
Linux
Windows
VCNetSpliceNetVCNetRNASeqNetVCNetSpliceNetVCNetRNASeqNet
Exon expression Of gene0.9202.9672.40028.7601.1660.4892.5001.490
Time (s)Normal
Tumor
Linux
Windows
Linux
Windows
VCNetSpliceNetVCNetRNASeqNetVCNetSpliceNetVCNetRNASeqNet
Exon expression Of gene0.9202.9672.40028.7601.1660.4892.5001.490
Table 4

Time comparison

Time (s)Normal
Tumor
Linux
Windows
Linux
Windows
VCNetSpliceNetVCNetRNASeqNetVCNetSpliceNetVCNetRNASeqNet
Exon expression Of gene0.9202.9672.40028.7601.1660.4892.5001.490
Time (s)Normal
Tumor
Linux
Windows
Linux
Windows
VCNetSpliceNetVCNetRNASeqNetVCNetSpliceNetVCNetRNASeqNet
Exon expression Of gene0.9202.9672.40028.7601.1660.4892.5001.490

One reason for low F-score of SpliceNet is that the networks constructed by SpliceNet are almost empty when sample size is under 20. The maximum exon number of genes is 63 and there are 5 genes whose exon numbers are >20 in the ‘resolution of D loop structure’ pathway. Therefore, SpliceNet and RNASeqNet fail to infer some edges at sample size of this range. SpliceNet is getting better when the sample size is increasing. Next, when we take physical interaction as ground truth, the performances for all methods are worse than those of Reactome Pathway since co-expression network reflects the transcriptional regulation. Similar results are obtained for the ‘homologous DNA pairing and strand exchange’ pathway. There are 8 genes whose exon numbers are >20 in this pathway. Therefore, SpliceNet and RNASeqNet performed poorly when sample size was 20.

As some genes in above pathways have large numbers of exons, so SpliceNet and RNASeqNet failed to work. Next, we evaluated three methods on pathways whose genes contain small number of exons. Therefore, all methods are valid on these pathways. The sample size is 65. The pathway information is displayed in Supplementary Table S9. The precision, recall and F-score are shown in Supplementary Table S11. We can see that VCNet is better than the other two methods. In Table 2, when the sample size is 100, SpliceNet is slightly better than VCNet under pathway1. Covariance matrix contains more information than correlation matrix. Therefore, SpliceNet may perform over VCNet under some cases. Above analysis indicated that VCNet is comparable to SpliceNet in the pathways on which all methods are valid.

3.2.2 Differential network analysis between tumour and normal tissues

Comparison between a normal network and a tumour networks is an important means to identify key genes related to pathogenesis. We applied VCNet to Chromophobe Renal Cell Carcinoma (KICH) which was downloaded from TCGA data portal. There are 25 normal samples and 65 KICH tumour samples. We analyse the ‘regulation of hypoxia inducible HIF by oxygen’ pathway and ‘Tp53 Regulates Metabolic Genes’ pathway, which are related to pathogenesis of KICH (Haase, 2012). For each method, two co-expression networks are constructed using the normal and case samples, respectively. The FDR cut-off was set at 0.05.

Like the analysis in normal breast tissue, we compare the constructed networks from VCNet, SpliceNet and RNASeqNet to the two gold standards. Precision, recall and F-score are shown in Table 3 for different sample size and conditions. We first inspect the networks constructed from the 25 normal samples. For ‘regulation of hypoxia inducible HIF by oxygen’ pathway, under the Reactome Pathway as a gold standard, SpliceNet has the largest precision and VCNet has the biggest recall. VCNet is the highest when we turn to F-score. When the physical interactions are as golden standard, the same trend is observed. As we do not know the truth network under tumour condition, we still compare the networks from tumour samples to the two golden standards. To investigate the impact of sample size on the performances of all methods, we select 25 samples from the tumour data and apply three methods to them. We also apply three methods to 65 tumour samples. The results are shown in Table 3. Under 25 tumour samples, RNASeqNet has the highest precision under two criteria. VCNet has the largest recall under two criteria. In terms of F-score, VCNet is the best. Under the 65 tumour samples, SpliceNet is better than the other two methods in term of F-score. For ‘Tp53 Regulates Metabolic Genes’ pathway, we can see that VCNet is the best in terms of F-score under both normal and tumour samples.

We compare the networks of ‘regulation of hypoxia inducible HIF by oxygen’ pathway inferred by VCNet using normal data and tumour data, which is displayed in Figure 2. For SpliceNet and RNASeqNet, the differential networks are displayed in Supplementary Figures S2 and S3. In Figure 2, the dotted line represents the edges which were only detected in tumour tissue, and the solid line represents the edges which were only detected in normal tissue. The edges presenting in both networks are removed. We find that two genes EP300 and CREBBP have interactions with HIF1A in the tumour networks of three methods, which is in agreement with literature. And String Database also supports these interactions, which is shown in Supplementary Figure S4. It has been reported that EP300 protein can bind to CREBBP protein and activate the expression of HIF1A (Ema et al., 1999). HIF1A encodes the alpha subunit of transcription factor hypoxia-inducible factor (HIF-1) (Iyer et al., 1998). HIF-1 functions as a master regulator of cellular and systemic homeostatic response to hypoxia by activating transcription of many genes (Ietta et al., 2006). And the over-expression of HIF-1 is heavily implicated in promoting tumour growth and metastasis of kidney cancer (Bos et al., 2003). Above analysis indicates that VCNet, SpliceNet and RNASeqNet can detect the biologically meaningful correlations using the large case samples. We also showed the inferred networks of VCNet, SpliceNet and RNASeqNet under normal and cancer tissues. The six networks are displayed in Supplementary Figures S7–S9. The differential networks of ‘Tp53 Regulates Metabolic Genes’ pathway are shown in Supplementary Tables S12–S14.

The differential network between normal and KICH cancer as constructed from results of VCNet. The co-expression network of normal and cancer were inferred separately. Solid lines represent those differentially co-expressed edges which were only detected in normal tissue and the dotted lines were those detected only cancer. Number of exons in two genes (EP300 and CREBBP) were larger than sample size and VCNet was able to handle their RNA-seq data. (c.f. Figures S2 and S3 in Supplementary document show differential networks of results produced by other algorithms)
Fig. 2

The differential network between normal and KICH cancer as constructed from results of VCNet. The co-expression network of normal and cancer were inferred separately. Solid lines represent those differentially co-expressed edges which were only detected in normal tissue and the dotted lines were those detected only cancer. Number of exons in two genes (EP300 and CREBBP) were larger than sample size and VCNet was able to handle their RNA-seq data. (c.f. Figures S2 and S3 in Supplementary document show differential networks of results produced by other algorithms)

EP300 and CREBBP have interaction with HIF3A instead of HIF1A in a normal network constructed by VCNet. These two edges were not present in the normal networks inferred by SpliceNet and RNASeqNet. As both EP300 and CREBBP have over 30 exons which exceed the sample size in normal tissue datasets. Both SpliceNet and RNASeqNet fail to detect them. HIF3A has a high similarity with HIF1A in the bHLH and PAS domains but lacks structures for transactivation found in the C-terminus of HIF1A (Gu et al., 1998). Since the C-terminal structure of HIF3A is different from that of HIF1A, the transfected HIF3A protein shows different characteristics from those of the transfected HIF1A. HIF3A suppresses hypoxia-inducible HIF-mediated gene expression and might be a negative regulator of hypoxia-inducible gene expression in the human kidney (Hara et al., 2001). Network constructed by VCNet suggested a differential co-expression relationship between HIF3A and other genes in cancers versus controls tissues, which underpins the ability of VCNet to identify meaningful interactions even in small samples.

As for the prediction for interplays between genes, CUL2, HIF1A have interactions in the tumour network constructed by VCNet. This interaction is also supported by String Database (Supplementary Figure S4). CUL2 is also required for HIF1A to mediate the expression of other genes in tumour samples. Some study shows that CUL2 is required for normal vasculogenesis in the zebrafish embryos, at least in part mediated by its regulation of HIF-mediated transcription (Maeda et al., 2008). Besides, both EP300 and CREBBP have interactions with CUL2. It indicates that EP300 and CREBBP may participate the expression of CUL2. Both EP300 and CREBBP are transcriptional co-activators (Grunstein, 1997; Gusterson et al., 2003) and have functions in cell differentiation (Giles et al., 1998), cell proliferation (Goodman and Smolik, 2000; Yao et al., 1998). Above analysis demonstrates that VCNet can provide rich molecular information to biological process, especially for genes whose exon numbers are large.

3.2.3 Comparison of computation time

Since SpliceNet is designed to run on Linux system and RNASeqNet is designed to run on Windows system, the running time of them cannot be compared directly. We run VCNet on Linux system and Windows system separately and then compared the running times to SpliceNet and RNASeqNet. CPU used in the Windows was Intel(R) Core(TM) i7-2600 3.40 GHz and the R version is 2.15.2. The CPU used in Linux system was Intel Xeon Processor E5-2670 2.60 GHz and the R version is 3.0.1. The dataset were same as that used in differential network analysis. And we analyse the 18 genes related to ‘regulation of hypoxia inducible HIF by oxygen’ pathway. The results are shown in Table 4. All the methods completed in a short time. Constructing network using all expressed genes (whole genome analysis) will be more demanding and take more time. Theoretically we may analyse the whole well annotated sets of 20 497 human genes, such complete pairwise network will be consists of 210 053 256 pairs. We attempted this challenge by modifying the code of VCNet for parallel computing with help of Rcpp package. The runtime of VCNet is ∼20–24 h using 43 CPUs on a Linux system. SpliceNet and RNASeqNet are not designed for parallel computing. Therefore, we did not apply them to the whole genome dataset. There are no technological constraints for VCNet. Its algorithm is simple. More importantly, VCNet has a test statistics of asymptotic distribution, therefore, we can obtain P-values from the distribution directly rather than the permutation strategy.

Table 3

Precision, Recall and F-score for two pathways

Reactome pathway
Physical interaction
PrecisionRecallF-scorePrecisionRecallF-score
Pathway 3
 Normal: 25 samples
  VCNet0.5000.4440.4710.1940.3330.246
  SpliceNet0.6000.1850.2830.2400.1430.179
  RNASeqNet0.5000.0490.0900.1250.0240.040
 Tumor: 25 samples
  VCNet0.4000.2720.3240.2360.3100.268
  SpliceNet0.5600.1730.2640.2000.1190.149
  RNASeqNet0.6250.0620.1120.2500.0480.080
 Tumor: 65 samples
  VCNet0.4660.5060.4850.2050.4290.277
  SpliceNet0.4920.7650.5990.2060.6190.310
  RNASeqNet0.7500.0740.1350.2500.0480.080
Pathway 4
 Normal: 25 samples
  VCNet0.2570.4920.3380.1240.3450.183
  SpliceNet0.2330.4680.3110.1260.3680.188
  RNASeqNet0.3160.0950.1460.2630.1150.160
 Tumor: 65 samples
  VCNet0.2360.5790.3360.1810.6440.283
  SpliceNet0.1810.8810.3000.1220.8620.214
  RNASeqNet0.2110.0630.0980.2370.1030.144
Reactome pathway
Physical interaction
PrecisionRecallF-scorePrecisionRecallF-score
Pathway 3
 Normal: 25 samples
  VCNet0.5000.4440.4710.1940.3330.246
  SpliceNet0.6000.1850.2830.2400.1430.179
  RNASeqNet0.5000.0490.0900.1250.0240.040
 Tumor: 25 samples
  VCNet0.4000.2720.3240.2360.3100.268
  SpliceNet0.5600.1730.2640.2000.1190.149
  RNASeqNet0.6250.0620.1120.2500.0480.080
 Tumor: 65 samples
  VCNet0.4660.5060.4850.2050.4290.277
  SpliceNet0.4920.7650.5990.2060.6190.310
  RNASeqNet0.7500.0740.1350.2500.0480.080
Pathway 4
 Normal: 25 samples
  VCNet0.2570.4920.3380.1240.3450.183
  SpliceNet0.2330.4680.3110.1260.3680.188
  RNASeqNet0.3160.0950.1460.2630.1150.160
 Tumor: 65 samples
  VCNet0.2360.5790.3360.1810.6440.283
  SpliceNet0.1810.8810.3000.1220.8620.214
  RNASeqNet0.2110.0630.0980.2370.1030.144

Note: Pathway 3 is the regulation of hypoxia inducible HIF by oxygen pathway and Pathway 4 is the TP53 regulates metabolic genes.

Table 3

Precision, Recall and F-score for two pathways

Reactome pathway
Physical interaction
PrecisionRecallF-scorePrecisionRecallF-score
Pathway 3
 Normal: 25 samples
  VCNet0.5000.4440.4710.1940.3330.246
  SpliceNet0.6000.1850.2830.2400.1430.179
  RNASeqNet0.5000.0490.0900.1250.0240.040
 Tumor: 25 samples
  VCNet0.4000.2720.3240.2360.3100.268
  SpliceNet0.5600.1730.2640.2000.1190.149
  RNASeqNet0.6250.0620.1120.2500.0480.080
 Tumor: 65 samples
  VCNet0.4660.5060.4850.2050.4290.277
  SpliceNet0.4920.7650.5990.2060.6190.310
  RNASeqNet0.7500.0740.1350.2500.0480.080
Pathway 4
 Normal: 25 samples
  VCNet0.2570.4920.3380.1240.3450.183
  SpliceNet0.2330.4680.3110.1260.3680.188
  RNASeqNet0.3160.0950.1460.2630.1150.160
 Tumor: 65 samples
  VCNet0.2360.5790.3360.1810.6440.283
  SpliceNet0.1810.8810.3000.1220.8620.214
  RNASeqNet0.2110.0630.0980.2370.1030.144
Reactome pathway
Physical interaction
PrecisionRecallF-scorePrecisionRecallF-score
Pathway 3
 Normal: 25 samples
  VCNet0.5000.4440.4710.1940.3330.246
  SpliceNet0.6000.1850.2830.2400.1430.179
  RNASeqNet0.5000.0490.0900.1250.0240.040
 Tumor: 25 samples
  VCNet0.4000.2720.3240.2360.3100.268
  SpliceNet0.5600.1730.2640.2000.1190.149
  RNASeqNet0.6250.0620.1120.2500.0480.080
 Tumor: 65 samples
  VCNet0.4660.5060.4850.2050.4290.277
  SpliceNet0.4920.7650.5990.2060.6190.310
  RNASeqNet0.7500.0740.1350.2500.0480.080
Pathway 4
 Normal: 25 samples
  VCNet0.2570.4920.3380.1240.3450.183
  SpliceNet0.2330.4680.3110.1260.3680.188
  RNASeqNet0.3160.0950.1460.2630.1150.160
 Tumor: 65 samples
  VCNet0.2360.5790.3360.1810.6440.283
  SpliceNet0.1810.8810.3000.1220.8620.214
  RNASeqNet0.2110.0630.0980.2370.1030.144

Note: Pathway 3 is the regulation of hypoxia inducible HIF by oxygen pathway and Pathway 4 is the TP53 regulates metabolic genes.

4 Discussion

With the advance of the next generation sequencing technology, we can obtain the most precise and accurate measurement of gene expressions possible so far, and it will enable more insights into biological processes in health and disease conditions. However, the increased data amount also created a new problem in term of computer storage, processing algorithm and not the least, challenges in the mathematical and statistical analysis of such massive data. In order to gain new insight into biological pathways, analysis of gene network represents a fundamental analysis of RNA-seq data, however, development of accurate co-expression analysis applicable to the high dimensional RNA-seq data is still in its infancy. Most methods developed previously during the microarray era suffer from the ‘curse of dimension’. As a consequence, determination of covariance matrix is singular, and its inverse matrix is undefined when elements of vectors exceed the number of samples. Considering this problem, we proposed a new approach using Frobenius norm to construct a new test statistics and provided its asymptotic distribution. This method can provide an adequate analysis of gene vectors co-expression network of RNA-seq data. This new test statistic computes squared Frobenius norm of the transformed correlation coefficient matrix between two random vectors. Therefore, it does not need the inverse of any matrix and provide the unique advantage for VCNet in the analysis of vectors of gene expression. We use Frobenius norm to evaluate whether the transformed R12 is equal to zero matrix. There are several advantage in the application of Frobenius norm. One is that the test statistics based on the norm are relatively easier to determine than other norms, in particular, in term of the asymptotic distribution of the test statistics. Furthermore, Frobenius norm is robust against the noise. VCNet shows high AUC in the simulation study and high F-scores in real data analysis especially when the sample size is moderate. Only VCNet can detect the interactions between genes with large vector attributes (e.g. large exon numbers) while SpliceNet and RNASeqNet miss these edges. Moreover, we provide support for an asymptotic distribution of the new test statistic under null model, which reduces the computation complexity of P-values a lot.

When we prepare for this paper, another group proposed another method to test the independence of two random vectors (Li et al., 2015). Their method is similar to ours. However, they performed the hypothesis test using the covariance matrix. As we state in Section 2, the hypothesis test on the covariance matrix Σ12 is easily influenced by the heterogeneity of variances. The assumption of equal variances of exons vectors in the RNA-seq experiments would be frequently violated. We compared VCNet to their method and the results are shown in the Supplementary Material. Among most simulation conditions, VCNet has larger AUC than theirs. Specht et al. (2015) proposed a generalization of the Pearson’s correlation coefficient to construct GCN. However, their method is confined to the total read counts mapped onto a gene, instead of using vector of read counts mapped to individual exons. We think that the read count vectors of genes contain more information, in particular it would encompass the features of alternative splicing. We also investigate the effects of imbalance between p and q on the performances of three methods. The results are shown in Supplementary Table S10, which does not give a clear trends with the ratio of p and q.

VCNet also has another advantage of robustness even when the data do not follow a multivariate normal distribution. The simulation procedure can be found in the Supplementary Material, and the results are shown in Supplementary Figure S5. It showed that the ROC curves are similar to those in Figure 1, which demonstrate the test statistic is robust even under non-Gaussian distribution.

Like the case in all co-expression analysis, VCNet cannot distinguish if the interaction between two genes is direct interaction or indirect interaction. The inference of direct interaction is another field of causative analysis. Traditionally, direct interaction network may be inferred by the inverse of the covariance matrix in graphic model theory. In the case of microarray data, the (i, j) element in the inverse matrix corresponds to the gene pair of (i, j). If some element of the inverse matrix is non-zero, we say that the corresponding gene pair is directly associated. Feizi et al. (2013) proposed a network deconvolution method to obtain the direct interactions, while their method is still based on the gene expression of a scalar. On the contrary, in the case of RNA-seq data, the expression of a gene is denoted by a vector. How to define the direct interaction between two vectors is still unsettled and is a topic of further research.

In conclusion, VCNet based on an asymptotic distribution of Frobenius norm provides a good algorithm to compare two vectors of gene expressions. In this report, the elements of the vectors are read counts of each exons. It can be extended to analyse the isoform–isoform co-expression network, where various exons of an isoform form the vector. Moreover, 95% of human genes undergo alternative splicing (Pan et al., 2008), it will be a suitable data for analysis. Further application of VCNet may include lncRNA–mRNA interaction or microRNA–mRNA interaction. Prediction of interactions between lnRNA and its target mRNAs will help us uncover the mechanisms of biological processes.

Acknowledgement

Part of the analysis was performed on the Computing Platform of the Center for Life Science.

Funding

This work was supported by the National Key Basic Research Project of China [grant numbers 2015CB910303, 2016YFA0502303] and the National Natural Science Foundation of China [grant numbers 31171262, 31428012, 31471246].

Conflict of Interest: none declared.

References

Bos
 
R.
 et al.  (
2003
)
Levels of hypoxia-inducible factor-1α independently predict prognosis in patients with lymph node negative breast carcinoma
.
Cancer
,
97
,
1573
1581
.

Cheng
 
F.
 et al.  (
2012
)
Prediction of drug-target interactions and drug repositioning via network-based inference
.
PLoS Comput. Biol
.,
8
,
e1002503.

Christoph
 
G.
 et al.  (
2013
). Accurate approximation of correlation coefficients by short edgeworth-chebyshev expansion and its statistical applications. In:
Prokhorov and Contemporary Probability Theory. Proceedings in Mathematics & Statistics
, 33, 239–260.
Springer, Berlin, Heidelberg
.

Croft
 
D.
 et al.  (
2014
)
The reactome pathway knowledgebase
.
Nucleic Acids Res
.,
42
,
D472
D477
.

Davies
 
R.B.
(
1980
)
Algorithm as 155: the distribution of a linear combination of χ 2 random variables
.
J. R. Stat. Soc. C
,
29
,
323
333
.

Ema
 
M.
 et al.  (
1999
)
Molecular mechanisms of transcription activation by hlf and hif1α in response to hypoxia: their stabilization and redox signal-induced interaction with cbp/p300
.
EMBO J
.,
18
,
1905
1914
.

Feizi
 
S.
 et al.  (
2013
)
Network deconvolution as a general method to distinguish direct dependencies in networks
.
Nat. Biotechnol
.,
31
,
726
733
.

Giles
 
R.H.
 et al.  (
1998
)
Conjunction dysfunction: Cbp/p300 in human disease
.
Trends Genet
.,
14
,
178
183
.

Goodman
 
R.H.
,
Smolik
S.
(
2000
)
Cbp/p300 in cell growth, transformation, and development
.
Genes Dev
.,
14
,
1553
1577
.

Grunstein
 
M.
(
1997
)
Histone acetylation in chromatin structure and transcription
.
Nature
,
389
,
349
352
.

Gu
 
Y.-Z.
 et al.  (
1998
)
Molecular characterization and chromosomal localization of a third alpha-class hypoxia inducible factor subunit, hif3alpha
.
Gene Expr
.,
7
,
205
213
.

Gusterson
 
R.J.
 et al.  (
2003
)
The transcriptional co-activators creb-binding protein (cbp) and p300 play a critical role in cardiac hypertrophy that is dependent on their histone acetyltransferase activity
.
J. Biol. Chem
.,
278
,
6838
6847
.

Haase
 
V.H.
(
2012
)
Renal cancer: oxygen meets metabolism
.
Exp. Cell Res
.,
318
,
1057
1067
.

Hansen
 
B.O.
 et al.  (
2014
)
Elucidating gene function and function evolution through comparison of co-expression networks of plants
.
Front. Plant Sci
.,
5
,
394.

Hara
 
S.
 et al.  (
2001
)
Expression and characterization of hypoxia-inducible factor (hif)-3α in human kidney: suppression of hif-mediated gene expression by hif-3α
.
Biochem. Biophys. Res. Commun
.,
287
,
808
813
.

Hong
 
S.
 et al.  (
2013
)
Canonical correlation analysis for rna-seq co-expression networks
.
Nucleic Acids Res
.,
41
,
e95–e95.

Horvath
 
S.
(
2011
).
Weighted Network Analysis: Applications in Genomics and Systems Biology
.
Springer Science & Business Media, Springer New York Dordrecht Heidelberg London
.

Hu
 
Y.
 et al.  (
2014
)
Pennseq: accurate isoform-specific gene expression quantification in rna-seq by modeling non-uniform read distribution
.
Nucleic Acids Res
.,
42
,
e20
e20
.

Ietta
 
F.
 et al.  (
2006
)
Dynamic hif1a regulation during human placental development
.
Biol. Reprod
.,
75
,
112
121
.

Iyer
 
N.V.
 et al.  (
1998
)
The human hypoxia-inducible factor 1α gene: Hif1astructure and evolutionary conservation
.
Genomics
,
52
,
159
165
.

Jiang
 
H.
,
Wong
W.H.
(
2009
)
Statistical inferences for isoform expression in rna-seq
.
Bioinformatics
,
25
,
1026
1032
.

Kerr
 
M.K.
 et al.  (
2000
)
Analysis of variance for gene expression microarray data
.
J. Comput. Biol
.,
7
,
819
837
.

Li
 
B.
,
Dewey
C.N.
(
2011
)
Rsem: accurate transcript quantification from rna-seq data with or without a reference genome
.
BMC Bioinformatics
,
12
,
1.

Li
 
W.
 et al.  (
2015
). Testing the independence of two random vectors where only one dimension is large. Statistics,
51
, 141–153.

López-Kleine
 
L.
,
González-Prieto
C.
(
2016
)
Challenges analyzing rna-seq gene expression data
.
Open J. Stat
.,
6
,
628
636
.

Maeda
 
Y.
 et al.  (
2008
)
Cul2 is required for the activity of hypoxia-inducible factor and vasculogenesis
.
J. Biol. Chem
.,
283
,
16084
16092
.

Marguerat
 
S.
,
Bähler
J.
(
2010
)
Rna-seq: from technology to biology
.
Cell. Mol. Life Sci
.,
67
,
569
579
.

Milacic
 
M.
 et al.  (
2012
)
Annotating cancer variants and anti-cancer therapeutics in reactome
.
Cancers
,
4
,
1180
1211
.

Neudecker
 
H.
,
Wesselman
A.M.
(
1990
)
The asymptotic variance matrix of the sample correlation matrix
.
Linear Algebra Appl
.,
127
,
589
599
.

Pan
 
Q.
 et al.  (
2008
)
Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing
.
Nat. Genet
.,
40
,
1413
1415
.

Pezeshki
 
A.
 et al.  (
2004
). Empirical canonical correlation analysis in subspaces. In Signals, Systems and Computers, 2004. Conference Record of the Thirty-Eighth Asilomar Conference on, Vol. 1, pp. 994–997. IEEE, Piscataway, NJ.

Specht
 
A.T.
 et al.  (
2015
)
Estimation of gene co-expression from rna-seq count data
.
Stat. Interface
,
8
,
507
515
.

Wang
 
Z.
 et al.  (
2009
)
Rna-seq: a revolutionary tool for transcriptomics
.
Nat. Rev. Genet
.,
10
,
57
63
.

Weirauch
 
M.T.
(
2011
)
Gene coexpression networks for the analysis of dna microarray data. In: Dehmer,M. et al (eds)
.
Appl. Stat. Netw. Biol. Methods Syst. Biol
. Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, Germany, pp.
215
250
.

Wu
 
G.
 et al.  (
2010
)
Research a human functional protein interaction network and its application to cancer data analysis
.
Genome Biol
.,
11
,
R53.

Yalamanchili
 
H.K.
 et al.  (
2014a
)
Ddgni: dynamic delay gene-network inference from high-temporal data using gapped local alignment
.
Bioinformatics
,
30
,
377
383
.

Yalamanchili
 
H.K.
 et al.  (
2014b
)
Splicenet: recovering splicing isoform-specific differential gene networks from rna-seq data of normal and diseased samples
.
Nucleic Acids Res
.,
42
,
e121.

Yang
 
Y.
 et al.  (
2014
)
Gene co-expression network analysis reveals common system-level properties of prognostic genes across cancer types
.
Nat. Commun
.,
5
.

Yao
 
T.-P.
 et al.  (
1998
)
Gene dosage–dependent embryonic development and proliferation defects in mice lacking the transcriptional integrator p300
.
Cell
,
93
,
361
372
.

Zhang
 
B.
 et al.  (
2005
)
A general framework for weighted gene co-expression network analysis
.
Stat. Appl. Genet. Mol. Biol
.,
4
,
1128.

Zhang
 
J.
 et al.  (
2014
)
Wemiq: an accurate and robust isoform quantification method for rna-seq data
.
Bioinformatics
, 31,
878–885
.

Author notes

The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)
Associate Editor: Jonathan Wren
Jonathan Wren
Associate Editor
Search for other works by this author on:

Supplementary data