Tissue, age, sex, and disease patterns of matrisome expression in GTEx transcriptome data

Retrieval and correlation of label free quantification/genomic datasets

We acquired protein and protein-coding gene data⁸ found in their Supplementary Tables S1 and S2 respectively. The proteins were analyzed by mass spectrometry using identification and intensity based absolute quantification (iBAQ)⁴⁸. Methods of RNA extraction of the tissues, including library preparation and sequencing are described in Uhlén et al.⁴⁹. Our analysis used their fragments per kilobase of transcript per million mapped reads (FPKM) data.

All analyses were completed in R v.4.0.2 unless stated otherwise. For each tissue, we linked their protein iBAQ and gene FPKM data together. The data was filtered down to matrisome core and regulatory protein-gene pairs that had non-0 values in their gene and protein columns. Using base R, Spearman’s rank correlations were generated for the following comparisons: correlations between gene transcripts and proteins within each tissue; and a distribution of correlations for each tissue, randomly sampling N non-0 protein-gene pairs (without replacement) 10,000 times using dplyr 1.0.2. The N of samples is the number of viable matrisome gene transcripts for each tissue. This process was repeated for each matrisome category.

Retrieval and correlation of GTEx proteomic-transcriptomic data

Protein data and gene data were retrieved from Supplementary Tables S2D and S3A from Jiang et al.⁹. The protein data used was “Cleaned relative protein abundances in log2 scale” and the gene data was “Cleaned and normalized RNA log2 (TPM) expression for all protein-coding genes across matched samples”. We also acquired their gene across tissue correlation data from Supplemental Table S4A.

From the proteomic and transcriptomic data, samples from individuals with proteomic and transcriptomic data were selected for the analysis. The protein and gene samples were joined and filtered, requiring the protein-gene pairs to have non-0 and non-NA values in each column. We then generated a Spearman’s rank correlation for each sample pair on all gene transcripts, and the core and regulatory matrisome.

Using the body wide gene tissue correlations from Jiang et al. (Supplemental Table S4A; 1-s2.0-S0092867420310783-mmc5.xlsx, sheet 2) each protein-gene pair correlation was labeled with its matrisome category or as “Non-Matrisome”. Each matrisome category’s correlations were compared against the correlation of all other genes using a Mann–Whitney U test. The resulting p-values were corrected for multiple tests using Benjamini–Hochberg method, and the data was presented using ggplot2 (v.3.3.2).

Retrieval and processing of GTEx bulk sequencing and phenotype data

The gene read counts of the RNA-Seq GTEx version 8 data set (GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct.gz) were downloaded from the GTEx Portal (https://gtexportal.org/home/datasets), along with the de-identified sample and subject annotations (GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt, GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt). GTEx v8 median TPM for each tissue was downloaded through the GTEx portal (GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz). Numeric ages of the GTEx subjects were acquired through dbGap with approval. Pertaining to human data, all methods were carried out in accordance with relevant guidelines and regulations.

For the median tissue expression, raw read counts were normalized together using the VST (variance stabilizing transformation) feature in DESeq2 (v.1.22.1) in R version (v.3.6.117). This method incorporates estimated size factors based on the median-ratio method, and transformed by the dispersion-mean relationship.

The median VST normalized read counts for core matrisome and ECM regulator genes were used to develop a tissue median expression profile. The median expression of all separate brain tissues were used to form the median BRAIN tissue. A clustered heatmap of tissues were generated using a Kendall’s τ correlation matrix. The heatmap was plotted using Pheatmap (v.1.0.12) and the tissues were clustered using a 1 − Kendall’s τ correlation matrix based on their Euclidian distance. Matrisome genes were labelled with their respective categories, and each category was separately summed for each tissue, and then divided by the total TPMs within each tissue. These values were plotted using ggplot2.

For the age and sex analysis raw read counts, by tissue type, were filtered using the filterByExpr function from edgeR (v.3.32.1)⁵⁰ before being analyzed using the voom method⁵¹ in limma (v.3.46.0)⁵² in R version (v.4.0.2). GTEx tissues were filtered to only core matrisome and ECM regulator genes. Independently, each of these genes were modeled as a linear function of scaled age (using R’s base scale function) and sex corrected for ischemic time (SMTSISCH), Hardy Scale (DTHHRDY) and experimental batch (SMGEBTCH) using limma’s lmFit function. The results for all tissues were collated together and had a FDR correction applied to their p-values. Only estimates with a > 0.05 corrected p-value were used for further analysis. The data was plotted using ggplot2, limited to genes with smallest p-values for the given covariate in each tissue.

Sex based differences of adipose tissue and transverse colon

For both ADPSQB and ADPVSC, previously VST normalized genes were required to have a mean normalized read count > 5 and an across sample variance > 1.5 for inclusion in the cluster analysis. The filtered genes were clustered using hierarchal clustering on a distance generated by 1 − Kendall’s rank-correlation coefficient. A τ critical value was calculated based on the number of samples and genes expressed. The correlation-based dendrogram was cut to produce gene clusters with an average cluster correlations of at least the τ critical value.

Normalized expression scores were calculated by subtracting the mean expression and dividing by the median absolute deviation of the expression values for each gene across all samples within a given tissue. The equation is as follows, where x is the VST normalized expression of gene j in sample i, t(i) is the tissue type for sample i, and J is the number of genes in a given cluster.

$${S}_{i}= frac{1}{J}sum_{j=1}^{J}frac{{x}_{ij} – {overline{x} }_{j}}{begin{array}{c}Median\ m:tleft(mright)=tleft(iright)|{x}_{mj}-{tilde{x }}_{j}|end{array}}.$$

Using the previously generated ADP-Fib.1 cluster score, 5 males and 5 females were selected from the highest and the lowest ADP-Fib.1 scores for both ADPSBQ and ADPVSC. These samples were then de-identified and randomized to prevent biased ImageJ scoring.

Previously VST normalized transverse colon, sigmoid colon, ADPSBQ, and ADPVSC samples were selected on the requirement that all four tissues were available for a given individual. This RNA-seq data was limited to only the ADIPOQ gene, and normalized reads were averaged across the tissues generating the multi-tissue ADIPOQ score. The individuals were sorted on the multi-tissue ADIPOQ, and the top 6 individuals and the bottom 8 individuals were selected to have their transverse colon analyzed using ImageJ.

The percent adipose content of each sample was determined using ImageJ 2.0.0 (FIJI)⁵³. For the transverse colon, only samples with submucosal space present were evaluated. The ImageJ freehand selection tool, captured the area of the submucosal space, excluding veins, white space, and muscle. For ADPSBQ and ADPVSC, the fat area was captured, avoiding white space and vasculature. The images were transformed to greyscale, a threshold was applied (range 199 to 255 light) and the percent of captured area being either adipose (white space) or fibrosis (colored space) was calculated.

Two linear models were applied on both ADPSQB and ADPVSC. The first model used BMI and sex as covariates for the ADP-Fib.1 score. The second model used ADP-Fib.1 score, sex, and BMI as covariates for percent adipose of the tissue samples. Individuals with more than one sample per tissue had the average of the percent adipose tissue used as their value.

For the transverse colon model, multi-tissue normalized ADIPOQ score, age, and BMI were used as covariates on percent adipose tissue in transverse colon samples. As there were multiple tissues slices per person, subject ID was used as a random effect in the model. The R package lme4 (v.1.1-26) was used to generate the linear mixed model. In the model equation (below), β₁ is the ADIPOQ Score for sample i, β₂ is the age of individual j, β₃ is the sex of individual j, β₄ is the BMI of individual j, u_n(i) is the random effect of the individual for sample i, and ϵ_i is the error term for said sample.

$${Y}_{ij}= {beta }_{0}+ {beta }_{1}{ADIPOQ Score}_{i}+ {beta }_{2}{Age}_{j}+ {beta }_{3}{Sex}_{j}+ {beta }_{4}{BMI}_{j}+{u}_{n(i)}+ {epsilon }_{i}.$$

Generating tissue-agnostic rank changes between normal and cancer samples

GTEx data and TCGA data was acquired through the R package TCGAbiolinks v.2.18.0. The normal GTEx and normal TCGA tissues downloaded were colon (n = 376 GTEx [transverse colon], n = 51 TCGA), prostate (n = 119 GTEx, n = 52 TCGA), breast (n = 92 GTEx [female], n = 111 TCGA), thyroid (n = 361 GTEx, n = 59 TCGA), lung (n = 374 GTEx, n = 110 TCGA), ovary (n = 108 GTEx), stomach (n = 204 GTEx), pancreas (n = 197 GTEx), esophagus (n = 331 GTEx [ESPMCS]), and liver (n = 136 GTEx), with their respective TCGA cancers being COAD, PRAD, BRCA, THCA, LUAD, ESCA, OV, STAD, PAAD, LIHC. Using TCGA histological types, COAD was limited to colon adenocarcinoma (n = 426), prostate to prostate adenocarcinoma acinar type (n = 489), breast to infiltrating ductal carcinoma (n = 788), thyroid to thyroid papillary carcinoma (classical) (n = 357), ovary to serous cystadenocarcinoma (n = 430), stomach to both stomach adenocarcinoma and stomach intestinal adenocarcinoma (n = 306), pancreas to pancreas-adenocarcinoma ductal type (n = 147), and liver to hepatocellular carcinoma (n = 356). Lung adenocarcinoma (n = 542) and esophageal carcinoma (n = 184) did not require further filtration. Normal TCGA samples for esophagus, ovary, stomach, liver, and pancreas were excluded.

Separately, all tissues and their datasets, GTEx normal (GTEx-N), TCGA normal (TCGA-N), and TCGA cancer (TCGA-C) were filtered down to genes with an across sample sum > 5 raw counts (N_min/N_max = 49,463–56,114 genes, liver and stomach respectively) and normalized using the DESeq2 VST function. After normalization, genes were further filtered to include only those with a sample mean > 5 normalized counts (N_min/N_max = 20,881–25,075 genes, liver and thyroid respectively) generating a total gene whitelist for the analysis. This allowed inclusion of genes that were lowly expressed in normal tissue, but had higher expression in cancer tissue.

The above analysis was replicated after normalization steps and whitelist formation, filtering down to core matrisome and ECM regulatory genes (N_min/N_max = 353–410, liver and lung respectively). GTEx-N, TCGA-N, and TCGA-C datasets were combined together and quantile normalized using the package preprocessCore (v.1.52.0) (https://github.com/bmbolstad/preprocessCore), before being separated again. The quantile normalized mean counts for genes in each dataset were calculated. An average was taken between the GTEx-N’s and TCGA-N’s mean quantile normalized counts to form a joined normal mean (Joined-N; Supplementary Fig. S7). Each gene list was ordered on their mean expression, generating for each tissue a list of ranked matrisome genes from GTEx-N, TCGA-N, TCGA-C, and Joined-N used to calculate a numerical rank change for each gene, between normal and cancer tissues (Joined-N compared to TCGA-C, Supplementary Fig. S8). For esophagus, ovary, stomach, liver, and pancreas, GTEx-N was used in place of Joined-N.

Relative rank changes of genes between TCGA-C and Joined-N were determined for lung, breast, colon, thyroid, and prostate tissue. This was replicated between TCGA-C and GTEx-N for esophagus, ovary, stomach, liver, and pancreas. The TCGA-C to Joined-N rank changes per gene were summed across the five tissues to establish a multi-tissue normal-to-cancer rank change. This highlighted genes that consistently increased or decreased across cancer samples, while lowering the effect of tissue specific changes or changes inadvertently caused by the rank change of other genes. The rank changes were segmented into deciles and the top and bottom 10 genes from the multi-tissue rank change were subsetted out and plotted using Pheatmap.

Six separate microarray studies were used to recapitulate the matrisome cancer findings. Two lung datasets (GSE31210^54,55, GSE19188⁵⁶), two breast datasets (GSE15852⁵⁷, GSE109169⁵⁸), an esophagus dataset (GSE161533 [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE161533]), and a colon dataset (GSE44076⁵⁹) were obtained through GEOquery 2.58.0. DE genes were discovered using the limma 3.46.0 R package’s eBayes function⁵² with Holm’s adjustment for p-value correction. For each gene, if ≥ 50% of probes for a given gene were called as significantly DE, the gene was considered DE.

Single-cell phenotype and normalized expression fibroblast data was downloaded from http://scope.lambrechtslab.org/ (visited on 9/22/2021), from their “Pan-cancer blueprint of tumor microenvironment” study²⁶. The downloaded data was limited to the 20 analyzed rank change genes (COL10A1, MMP11, COL11A1, COMP, SPP1, MMP9, MMPO1, CST1, CTHRC1, MMP7, MASP1, PCOLCE2, SRPX, WISP2, LGI4, MMRN1, OGN, TNXB, DPT, and ADAM33). In R, all cells marked as “Low quality” were removed from the dataset before the cells were split into CAF cells and non-CAF cells (FIB) based on their “Phenotype” column. The gene correlation heatmap was created using R’s base correlation function and pheatmap function, clustering the data on the Euclidian distance of 1 − Pearson correlations. The gene’s normalized values were compared between groups using a Wilcoxon Rank Sum Test, and multiple testing was accounted for by using the Holm method.

Acquisition, processing, and gene variance clustering of IPF data

From GEO, the design, gene annotations, and raw counts for GSE134692 were downloaded, including their IPF, ALI, and normal samples (n = 46, 8, and 26 respectively)²⁹. The data was processed as the adipose tissue above, however the variance filter was a > 2 variance in this analysis. This returned two sub-clusters of high variance matrisome genes for the IPF dataset.

Histological images of GTEx lung samples were previously categorized as normal or ventilator injured by a lung pathologist¹². From this group, 10 samples were randomly selected from both the ventilator injury samples and the normal samples.

GTEx lung samples and samples from Sivakumar et al. were combined into one DESeqDataSet the design including both disease status (IPF, ALI, Ventilator Injury, and normal) and Batch (Sivakumar-1, Sivakumar-2, and GTEx). Sivakumar et al. states they correct for these batches, but do not give the process further description. The data was normalized using DESeq2’s VST transformation. After transformation, the data was filtered to only contain the high variance IPF gene clusters (n = 25) before being centered on the average gene expression across samples. The samples and genes were clustered using Euclidian distancing and graphed using the Pheatmap package.

Source link

Vasiprak Blog

Retrieval and correlation of label free quantification/genomic datasets

Retrieval and correlation of GTEx proteomic-transcriptomic data

Retrieval and processing of GTEx bulk sequencing and phenotype data

Sex based differences of adipose tissue and transverse colon

Generating tissue-agnostic rank changes between normal and cancer samples

Acquisition, processing, and gene variance clustering of IPF data

You might also like

Novel assay finds new mechanism underlying red blood cell aging — ScienceDaily

Impaired-driver sensor could pave the way for safer vehicles — ScienceDaily

Recent trends also highlight global disparities in cancer burden — ScienceDaily

Stay tuned!