A persistent challenge is how to compare gene expression information across platforms, from microarrays to RNA-Seq with different sequencing depths (Rao et al. 2018; Zhang et al. 2015; Zhao et al. 2014). Although gene expression values such as fluorescence intensity, FPKM, and TPM have been used for differential gene expression (DGE) analysis, these values are not comparable across studies, especially for RNA-Seq data, which may involve different normalization methods (Abbas-Aghababazadeh et al. 2018; Eilertsen et al. 2020). Using gene ranks rather than gene expression values makes comparisons between transcriptome datasets and diseases possible (Huo et al. 2021; Ye et al. 2021). The current gene expression analysis framework still relies on pairwise or multiple-sample comparisons, and the inherent gene ranking within an individual sample, which carries important information, is usually ignored. Tissue-specific gene expression is an obvious example. The top-ranked genes in liver, lung, and placenta are Haptoglobin, Sftpc, and Csh1, which correspond to hepatic free hemoglobin clearance, pulmonary surfactant protein, and placental lactogen, respectively (Uhlen et al. 2015). Tissue-specific information can therefore be inferred from the gene ranking of an individual sample, without any DGE analysis. Here, we used singular value decomposition (SVD) to construct a reduced space for tissue RNA-Seq data, and we provide the user-friendly TissueSpace tool for such gene ranking-based analyses. Once the feature representation matrix has been obtained, features from different datasets or platforms can also be compared.
Figure 1 Workflow of the analysis. The GTEx dataset was used to construct a tissue space, where the dimension of the original transcriptome was reduced to 100 features. New sample transcriptome data can be projected to the tissue space to produce a 100-dimensional representation. The new feature-sample matrix can be correlated with external data. For example, feature matrices can be used in feature-trait correlation analysis, survival analysis, and differential feature analysis. Finally, the tissue space was deployed in a webserver to help users without bioinformatics skills to get the feature matrix for downstream analysis.
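The construction and projection steps in this workflow can be written compactly. The following is a sketch under stated assumptions, not a description of the exact TissueSpace implementation: it assumes a genes-by-samples matrix of within-sample ranks and the standard latent semantic analysis "fold-in" projection. The symbols R, d, D, and F are introduced here for illustration, and whether the tool rescales by the inverse singular values is not stated in the text.

```latex
% R : m x n matrix of within-sample gene ranks (m genes, n GTEx samples)
% Truncated SVD keeps the k largest singular values (k = 100 in the text)
R \approx R_k = U_k \Sigma_k V_k^{\top}, \qquad
U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}

% d : m x 1 rank vector of a new sample, genes ordered as the rows of R
\hat{d} = \Sigma_k^{-1} U_k^{\top} d \in \mathbb{R}^{k}

% D : m x p matrix holding p new samples; F is the feature-sample matrix
F = \Sigma_k^{-1} U_k^{\top} D \in \mathbb{R}^{k \times p}
```

Under these assumptions, each column of F is the 100-dimensional representation of one submitted sample, and each row is one feature that can be correlated with external data.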
The TissueSpace tool includes the following functions:
- convert different gene ID types to Ensembl gene IDs;
- project any human transcriptome profile to get vector representation for downstream analysis;
- functional enrichment for each of the 100-dimensional vector features.
First, if your gene expression matrix does not use Ensembl gene IDs, please use the "Convert ID to Ensembl" tab to convert the IDs. TissueSpace ACCEPTS ONLY Ensembl IDs!
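Before uploading, it can help to check which row identifiers already look like Ensembl gene IDs. The sketch below is a local pre-check and is not part of TissueSpace. It relies only on the general format of human Ensembl gene IDs ("ENSG" followed by 11 digits, optionally with a version suffix). The function names are hypothetical, and whether the tool accepts versioned IDs is not stated here, so a helper for stripping the version is included as an option.

```python
import re

# Human Ensembl gene IDs: "ENSG" + 11 digits, optionally followed by a
# version suffix such as ".16".
ENSEMBL_GENE_RE = re.compile(r"ENSG\d{11}(\.\d+)?")


def split_gene_ids(gene_ids):
    """Return (valid, invalid) lists according to the pattern above."""
    valid, invalid = [], []
    for gene_id in gene_ids:
        if ENSEMBL_GENE_RE.fullmatch(gene_id.strip()):
            valid.append(gene_id)
        else:
            invalid.append(gene_id)
    return valid, invalid


def strip_version(gene_id):
    """Drop an optional version suffix, keeping only the stable ID."""
    return gene_id.split(".", 1)[0]


# Toy identifiers: two Ensembl-style IDs, one gene symbol, one malformed ID.
ids = ["ENSG00000141510", "ENSG00000163631.12", "TP53", "ENSG123"]
valid, invalid = split_gene_ids(ids)
print("Ensembl-style IDs:", [strip_version(g) for g in valid])
print("Need conversion:", invalid)
```

Any identifier reported as needing conversion would go through the "Convert ID to Ensembl" tab before submission.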
Second, because the rank-based method is independent of normalization, users can submit a gene expression matrix directly to the tool. TissueSpace will convert the expression values into ranks for downstream analysis. Users who want to integrate different datasets should combine the gene expression matrices into one before submission.
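To illustrate why ranks are independent of normalization, the sketch below ranks one toy sample three ways: as raw counts, after a log2 transform, and after per-million scaling. Any strictly increasing transformation leaves the within-sample ranks unchanged. The ranking direction (1 = lowest value) and the averaging of ties are assumptions made for this example; the conventions used inside TissueSpace are not specified here.

```python
import math


def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


# One toy sample with five genes (hypothetical values).
raw_counts = [0, 15, 15, 230, 4800]
log_values = [math.log2(x + 1) for x in raw_counts]
per_million = [x * 1e6 / sum(raw_counts) for x in raw_counts]

print(average_ranks(raw_counts))   # [1.0, 2.5, 2.5, 4.0, 5.0]
print(average_ranks(log_values) == average_ranks(raw_counts))
print(average_ranks(per_million) == average_ranks(raw_counts))
```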
Save your gene expression matrix as a CSV file, then upload it.
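Combining datasets and saving the result as CSV might look like the following minimal sketch. It assumes genes as rows, samples as columns, and a header row with a "gene_id" column; this layout, the function names, and all values are illustrative assumptions, so the expected layout should be checked against the upload page (Figure 2). Only genes present in both datasets are kept.

```python
import csv
import io


def combine_matrices(first, second):
    """Merge two {gene_id: {sample: value}} matrices on their shared genes.

    If a sample name occurs in both inputs, the value from `second` wins,
    so sample names should be made unique beforehand.
    """
    shared_genes = sorted(set(first) & set(second))
    return {gene: {**first[gene], **second[gene]} for gene in shared_genes}


def write_matrix_csv(matrix, handle):
    """Write genes as rows and samples as columns, with a header row."""
    samples = sorted({sample for row in matrix.values() for sample in row})
    writer = csv.writer(handle)
    writer.writerow(["gene_id"] + samples)
    for gene_id, row in matrix.items():
        writer.writerow([gene_id] + [row[sample] for sample in samples])


# Toy matrices on different scales (hypothetical values).
dataset_a = {
    "ENSG00000141510": {"a1": 12.0, "a2": 9.5},
    "ENSG00000163631": {"a1": 880.0, "a2": 910.2},
    "ENSG00000257017": {"a1": 300.1, "a2": 295.0},
}
dataset_b = {
    "ENSG00000141510": {"b1": 7, "b2": 11},
    "ENSG00000163631": {"b1": 1500, "b2": 1320},
}

combined = combine_matrices(dataset_a, dataset_b)
buffer = io.StringIO()  # use open("combined.csv", "w", newline="") for a real file
write_matrix_csv(combined, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Because values are converted to ranks within each sample, the two datasets do not need to be on the same scale before they are merged.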
Figure 2 Upload data
Figure 3 Feature annotation view. The top 30 genes within each feature are submitted to the g:Profiler server for functional enrichment analysis. These genes are also listed with outlinks to the HPA and GeneCards databases.
Case studies overview
The following datasets were used in the TissueSpace tool. Normal human tissue RNA-Seq data were downloaded from the GTEx Portal (GTEx Analysis V8) (Mele et al. 2015); they contain 17384 samples from 30 tissue types. GSE45878, which contains Affymetrix expression data from GTEx, was downloaded from NCBI GEO. Two additional RNA-Seq datasets containing 122 and 43 normal tissue samples were downloaded from a publication and the Human Protein Atlas (HPA) database (Uhlen et al. 2015).
Case studies use the following datasets: Nonalcoholic fatty liver disease (NAFLD) datasets GSE130970, GSE49541, GSE89632, liver tumor dataset GSE36376, and critically ill patient dataset GSE65682 were downloaded from NCBI GEO. The hepatocellular carcinoma (HCC) dataset was downloaded from The Cancer Genome Atlas (TCGA) through GDC Data Portal.
Figure 4 Parameter k selection and a screenshot of the TissueSpace web tool. (A) The curve shows the relationship between the choice of k and the Kendall rank correlation coefficient, which measures the similarity between the reconstructed and the original gene-sample matrix. (B) The TissueSpace tool assigns correct tissue labels to both microarray and RNA-Seq datasets. GSE45878 is a dataset containing normal tissue transcriptomes from GTEx. The other two datasets were downloaded from a publication (Uhlen et al. 2015) and the Human Protein Atlas database. (C) The linear relationship between data size and computation time. (D) The tool has three main tabs. In the first tab, users can upload expression matrix data. In the second tab, functional annotation information is provided for the feature vector. In the third tab, users can convert IDs to Ensembl IDs, which are required by TissueSpace.
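The similarity measure in panel (A) can be illustrated with a small, dependency-free computation of Kendall's tau-b between an original rank vector and a reconstructed one. This is a sketch only: the toy "reconstructed" values are hypothetical, the function name is introduced here, and how the coefficient was aggregated over the full gene-sample matrix is not described in the caption.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both sequences: excluded from every count
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denominator = math.sqrt((untied + ties_x_only) * (untied + ties_y_only))
    if denominator == 0:
        return float("nan")  # undefined when one sequence is constant
    return (concordant - discordant) / denominator


# Original ranks of five genes and two hypothetical reconstructions.
original = [1, 2, 3, 4, 5]
good_reconstruction = [1.2, 1.9, 3.5, 3.1, 5.0]  # one swapped pair
poor_reconstruction = [3, 1, 5, 2, 4]

print(kendall_tau_b(original, good_reconstruction))
print(kendall_tau_b(original, poor_reconstruction))
```

A larger k would be expected to give a reconstruction whose ordering is closer to the original, and therefore a coefficient closer to 1.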
TissueSpace case study 1: Analysis of transcriptomic profiles of NAFLD datasets GSE130970, GSE49541, and GSE89632
Figure 5 The application of TissueSpace to NAFLD datasets. (A) The feature-trait relationship shows that features are correlated with clinical parameters in the NAFLD dataset GSE130970. The numbers in the colored cells are the correlation value and its statistical significance value. (B) The ROC curve shows that feature 71 can robustly discriminate between patients with mild and advanced fibrosis, with an AUC of 0.94, in the NAFLD dataset GSE49541.
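For readers who want to evaluate a single feature as a classifier in the same way as panel (B), the AUC can be computed directly from its pairwise definition: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The sketch below uses hypothetical feature values and labels; it is not the code used to produce the figure.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from the pairwise (Mann-Whitney) definition.

    `labels` must contain 1 for the positive class and 0 for the negative class.
    """
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical values of one feature; 1 = advanced fibrosis, 0 = mild fibrosis.
feature_values = [0.10, 0.40, 0.35, 0.80, 0.70, 0.20]
fibrosis_labels = [0, 0, 1, 1, 1, 0]

auc = roc_auc(feature_values, fibrosis_labels)
print(round(auc, 3))
```

If a feature is lower in the positive group, the value falls below 0.5; negating the feature (or reporting 1 minus the AUC) gives the equivalent discriminative performance.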
Figure 6 Feature-trait relationship for NAFLD dataset GSE89632. The numbers in the colored cells are the Spearman correlation value and, in parentheses, its statistical significance value.
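A feature-trait correlation of the kind shown in this figure can be reproduced for one feature and one clinical trait by computing Spearman's coefficient, that is, the Pearson correlation of the two rank vectors. The sketch below returns only the coefficient; the significance value would need a statistics package. All values and function names are hypothetical.

```python
import math


def mean_ranks(values):
    """1-based ascending ranks; tied values receive the mean of their positions."""
    ordered = sorted(values)
    rank_of = {}
    for value in set(values):
        first = ordered.index(value)  # 0-based position of the first occurrence
        count = ordered.count(value)
        rank_of[value] = first + (count + 1) / 2
    return [rank_of[value] for value in values]


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = mean_ranks(x), mean_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return covariance / (spread_x * spread_y)


# Hypothetical values of one feature and one clinical trait in five patients.
feature_values = [0.2, 0.5, 0.1, 0.9, 0.7]
trait_values = [1, 2, 0, 3, 4]

rho = spearman(feature_values, trait_values)
print(round(rho, 3))
```

Repeating this for every feature-trait pair yields a matrix like the one displayed in the figure.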
TissueSpace case study 2: Analysis of gene expression profiles of both tumor and adjacent non-tumor liver tissue dataset GSE36376
Figure 7 The application of TissueSpace to liver tumor dataset GSE36376. The ROC curves show the performance of features 12 and 13, with AUCs of 0.98 and 0.96, respectively, in separating tumor and adjacent non-tumor tissues.
TissueSpace case study 3: Analysis of TCGA HCC RNA-Seq datasets
Figure 8 The application of TissueSpace to the TCGA HCC dataset. Survival curves for features 54, 66, and 98 show their performance in discriminating between patients with different overall survival times. The survival differences between groups were calculated by the log-rank test. The red line indicates low expression and the green line indicates high expression.
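A survival comparison of this kind can be sketched as follows: patients are split into low and high groups by a feature, and the two groups are compared with a two-group log-rank test. The median cutoff used here is an assumption (the caption does not state how the groups were defined), the function names are hypothetical, and the data are toy values. The p-value uses the chi-square distribution with one degree of freedom.

```python
import math
import statistics


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi_square, p_value), 1 degree of freedom.

    `events` holds 1 for an observed death and 0 for a censored patient.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        at_risk_a = sum(1 for u, _, g in data if u >= t and g == 0)
        at_risk_b = sum(1 for u, _, g in data if u >= t and g == 1)
        deaths_a = sum(1 for u, e, g in data if u == t and e == 1 and g == 0)
        deaths_b = sum(1 for u, e, g in data if u == t and e == 1 and g == 1)
        n = at_risk_a + at_risk_b
        d = deaths_a + deaths_b
        observed_minus_expected += deaths_a - d * at_risk_a / n
        if n > 1:  # hypergeometric variance is zero when only one patient is at risk
            variance += d * (at_risk_a / n) * (at_risk_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return float("nan"), float("nan")
    chi_square = observed_minus_expected ** 2 / variance
    p_value = math.erfc(math.sqrt(chi_square / 2))  # chi-square tail, 1 d.f.
    return chi_square, p_value


def split_by_median(feature, times, events):
    """Split patients into (low, high) groups at the feature median (assumed cutoff)."""
    cutoff = statistics.median(feature)
    low = [(t, e) for f, t, e in zip(feature, times, events) if f <= cutoff]
    high = [(t, e) for f, t, e in zip(feature, times, events) if f > cutoff]
    return low, high


# Toy cohort: feature value, survival time, and event indicator per patient.
feature = [0.1, 0.2, 0.8, 0.9]
times = [1, 2, 3, 4]
events = [1, 1, 1, 1]

low, high = split_by_median(feature, times, events)
chi_square, p_value = logrank_test(
    [t for t, _ in low], [e for _, e in low],
    [t for t, _ in high], [e for _, e in high],
)
print(chi_square, p_value)
```

For real cohorts, an established survival analysis package is preferable; this sketch is only meant to show what the test compares at each event time.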
TissueSpace case study 4: Analysis of the blood transcriptome of critically ill patients
Figure 9 The application of TissueSpace to the blood transcriptome of critically ill patients (GSE65682). (A) Two significant features, 60 and 98, are associated with sepsis patient mortality. The survival differences between groups were calculated by the log-rank test. (B) Features 71 and 79 show statistically different expression among patient groups: community-acquired pneumonia (cap), hospital-acquired pneumonia (hap), and non-infectious cap control patients (no-cap). The mean difference between groups was calculated using one-way ANOVA. An asterisk indicates a statistical significance value smaller than 1E-6.
The user interface for TissueSpace is simple and is deployed with the R Shiny web framework. The tool requires the R packages “lsa”, “parallel”, and “datatables”: “lsa” is used to construct the tissue space, “parallel” speeds up data processing when multiple CPU cores are available, and “datatables” presents data in the front end (Chang et al. 2015; Liu 2020).