Spatial Transcriptomics to define transcriptional patterns of zonation and structural components in the mouse liver

Ethical statement

The Regional Animal Research Ethical Board, Stockholm, Sweden, approved experimental procedures and protocols involving extraction of organs from mice (N135/15, N78/16 and 9707-2018), following proceedings described in EU legislation (Council Directive 2010/63/EU). The mice used in this study were kept at a 12 h night/day cycle and 22 °C ambient temperature, with free access to food and water under specific pathogen‐free conditions at the Experimental Core Facility, Stockholm University and were euthanized between 8 and 12 weeks of age.

Total RNA extraction

To test for the RNA quality of the tissue for further downstream analysis, the tissue was sectioned and up to eight 10 µm sections were placed in Lysing Matrix D tubes (MPBiomedicals, cat.no.: 116913050-CF) containing Buffer RLT Plus (Qiagen, cat.no.: 1053393) and ß-Mercaptoethanol (Thermofisher, cat.no.: 31350010) and homogenized in a Fastprep instrument (ThermoSavant). The flowthrough was collected through a gDNA Eliminator column and 250 ul of pure ethanol was added. Total RNA was further extracted using the RNAeasy mini kit (Qiagen, cat.no.: 74104) according to manufacturer’s instructions. The RNA integrity number (RIN) for each sample was assessed performing Bioanalyzer High Sensitivity RNA Analysis (Agilent cat.no.: 5067-1535).

Collection and preparation of liver samples

Female C57BL/6 mice (Charles River), were euthanized between weeks 8 and 12 and livers were collected, and four lobes were separated. Each lobe was segmented so cryosections would fit on the 6,200 × 6,400 µm areas of the Codelink-activated microscope slides and frozen in −30 °C 2-Methylbutane (Merck, cat.no.: M32631-1L). The frozen liver samples were embedded in cryomolds (10×10 mm, TissueTek) filled with pre-chilled (4 °C) OCT embedding matrix and frozen (CellPath, cat.no.: 00411243). For downstream experiments, the frozen samples were sectioned at 10 µm thickness with a cryostat (Cryostar NX70, ThermoFisher). Each subarray on the slide is covered with 1934 spots with a 100 µm diameter, containing approximately 200 million uniquely barcoded oligonucleotides with poly-T₂₀ VN capture regions per spot (Barcoded slides were manufactured by 10X Genomics Inc, probes were manufactured by IDT). The full protocol, including sequencing and computational analysis was performed for 8 sections across 3 samples. Sample 1 and sample 3 each include three sections of the caudate lobe. Sample 2 includes two sections of a liver piece of the right liver lobe. All sections of all samples have undergone the same treatment.

Histological staining and annotations

We performed the ST workflow according to Ståhl et al. and Vickovic et al., respectively^60,61. After 10 minutes of formalin fixation of the tissue on the slides they were dried with isopropanol and stained with Mayer’s hematoxylin (Dako, cat.no.: S330930-2) and bluing buffer (Dako, cat.no.: CS70230-2) followed by Eosin (Sigma-Aldrich, cat.no.: HT110216-500ML) (H&E), diluted in Tris/acetic acid (pH 6.0). The stained sections were mounted with 85% glycerol (Merck Millipore, cat.no.: 8187091000) and covered with a coverslip. Bright-field images were acquired at 20x magnification, using Zeiss AxioImager.Z2 microscope and the Metafer Slide Scanning System (Metasystems). The liver images were assessed by a mouse liver expert (NVH) who annotated the portal (blue) and central (red) veins, based on the presence of bile ducts and portal vein mesenchyme (PV) or lack thereof (CV). When the quality of the sample did not allow for annotation, “ambiguous vein” (green) was reported.

Immunofluorescence assay (IFA)

For immunofluorescence labeling, liver sections were (co-)stained with antibodies directed against GS (1:1000; ab73593, Abcam), SOX9 (1/100; AB5535, Millipore), F4/80 (1:100; MCA497RT, BioRad), HNF4ɑ (1/100; sc6556, Santa Cruz) or CD31 (1:200; MEC13.3, BioLegend). Secondary antibodies were either anti-rabbit AlexaFluor 488; or anti-rabbit, anti-goat or anti-rat AlexaFluor 647 (1:1000, Invitrogen). Hoechst (1:10 000, 62249, Thermo Scientific) was used to counterstain the nuclei. Liver sections were imaged using the Zeiss AxioImager.Z2 microscope.

Permeabilization, cDNA synthesis, tissue removal, and probe release

Next, the slides were put in mask holders (ArrayIT) to enable separated on-array reactions in each chamber as described previously⁶¹. Each tissue section was pre-permeabilized using Collagenase I for 20 min at 37 °C. Permeabilization was performed using 0.1% pepsin in 0.1 M HCl for 10 min at 37 °C. cDNA synthesis was performed overnight at 42 °C. Tissue removal from the arrays prior to probe release was performed using Proteinase K in PKD buffer at a 1:7 ratio at 56 °C for 1 h. Last, the surface probes were released and cDNA library preparation followed by sequencing was performed.

cDNA library preparation and sequencing

Released mRNA-DNA hybrids were further processed to generate cDNA libraries for sequencing. The sequencing libraries were prepared as described in Jemt et al.⁶². In short, the 2nd strand synthesis, cDNA purification, in vitro transcription, amplified RNA purification, adapter ligation, postligation purification, a second 2nd strand synthesis and purification were done using an automated MBS 8000+ system. To determine the number of PCR cycles needed for optimal indexing conditions a qPCR was performed, as described previously⁶¹. After the determination of the optimal cycle number for each sample, the remaining cDNA was indexed and amplified. The indexed libraries were then purified using an automated system as previously described⁶³. The average length of the indexed cDNA libraries was determined with a 2100 Bioanalyzer using the Bioanalyzer High Sensitivity DNA kit (Agilent, cat.no.:5067-4626), concentrations were measured using a Qubit dsDNA HS Assay Kit (Thermofisher, cat.no:Q32851) and libraries were diluted to 4 nM. Paired-end sequencing was performed on the Illumina NextSeq500 platform, with 31 bases from read 1 and 46 bases from read 2 resulting in the generation between 15 and 32.1 million raw reads per sample. To assess the quality of the reads fastqc (v 0.11.8) reports were generated for all samples.

Spot visualization and image alignment

The staining, visualization, and imaging acquisition of spots printed on the ST slides were performed as previously described⁶⁰. Briefly, spots were hybridized with fluorescently labeled probes for staining and subsequently imaged on the Metafer Slide Scanning system, similar to the previous acquisition of the H&E images. The previously obtained brightfield of the tissue slides and the fluorescent spot image were then loaded in the web-based ST Spot Detector tool⁶⁴. Using the tool, the images were aligned and the spots under the tissue were recognized by the built-in recognition tool. Spots under the tissue were slightly adjusted and extracted.

Computational analysis

Mapping, gene counting and demultiplexing

Processing of raw reads was performed using the open-source ST Pipeline (v 1.7.6)⁶⁵. In short, quality trimming was performed and homopolymer stretches longer than 15 bp were removed. The reads were subsequently mapped to the annotated reference genome (GRCm38 v86 and corresponding GENCODE annotation file) using STAR (v 2.6.1e)⁶⁶. After filtering, PCR duplicates were removed and gene count matrices were generated.

Dimensionality reduction and clustering

Main computational analysis of spatial read-count matrices was performed using the STUtility package (v 0.1.0)⁶⁷ in R (v 4.0.2). The complete R workflow can be assessed and reproduced in R markdown (see code availability section). First, count matrices and metadata were loaded, translating Ensembl IDs to gene symbols simultaneously. Reads of individual samples were filtered to keep only protein-coding genes and subsequently normalized using the SCTransform function in Seurat^68,69. The created objects were then integrated using the canonical correlation analysis (CCA) with the MultiCCA function provided in https://github.com/almaan/ST-mLiver⁷⁰. Normalization of integrated data was performed, regressing out sample identities using the SCTransform function in Seurat. Thereafter, the CCA vectors were subjected to shared-nearest-neighbor (SNN) inspired graph-based clustering via the FindNeighbors and FindClusters functions. For modularity optimization, the louvain algorithm was used and clustering was performed at a resolution of 0.3 for clustering granularity.

Visualization and spatial annotation of clusters

To visualize the clusters in low-dimensional space and on the spot coordinates under the tissue, non-linear dimensionality reduction was performed using UMAP with the CCA vectors as input. Visualization and annotation of identified clusters in UMAP space, on spot coordinates as well as superimposed on the H&E images was performed using the Seurat and STUtility package.

Differential gene expression analysis (DGEA) and expression programs

Differential gene expression analysis of genes in identified clusters was performed using the function FindAllMarkers from the Seurat package. Following the default option of the method, differentially expressed genes for each cluster were identified using the non-parametric Wilcoxon rank-sum test. Initial thresholds were set to a logarithmic fold-change of 0.25 to be considered differentially expressed in a cluster and to be present in at least 10% of the spots belonging to the same cluster. Representative markers for each cluster were further selected, by choosing genes with a positive logarithmic threshold above 0.5 and an adjusted p value below 0.05. P value adjustments are based on Bonferroni correction using all genes in the dataset.

After the identification of marker genes of the individual clusters, we identified expression programs of genes (module scores) for clusters we identified to have spatial distribution in our data. These were cluster 1 (periportal cluster), cluster 2 (pericentral cluster) and cluster 5. The creation of expression programs was performed using the AddModuleScore function in Seurat. In brief, we stored the marker genes of each cluster in a list to serve as input for the function. From this input, the average expression of each program (list of markers) was calculated for each spot under the tissue and subtracted by the aggregated expression of a control gene set. Here, the control gene set included all genes present in our data. All analyzed genes were then binned based on averaged expression and with the default number of 24 bins for the function, and 100 control genes of the control feature set were randomly selected from each bin. Higher scores indicate more marker genes of the program to be highly expressed in a spot, while lower scores indicate that no or only a small number of genes is expressed at low levels in the spot.

Zonation based DGEA of immune system process and metabolic pathway markers

To infer on DEG of immune system processes and metabolic pathways (ha-ras, chronic hypoxia and pituitary hormone metabolic pathways) between cluster 1 and cluster 2, DGEA between genes associated with immune system processes (GO:0002376) and metabolic pathways (KEGG) was performed. Each resulting list was cross-referenced with the normalized spatial data and the expression matrix was subset according to the respective cell type followed by DGEA between cluster 1 (portal) and cluster 2 (central) with a logFC threshold of 0.01 and significance (p_val_adj) below 0.05. The same sets of genes were used to perform bivariate expression by distance analysis on (see “Bivariate expression by distance analysis” and “Model comparison”).

Spatial autocorrelation

To explore the correlation between spatial distribution and expression of all genes in our data spatial autocorrelations using the CorSpatialGenes function of the STUtility package was performed. The method is based on building a connection network from the spot-coordinates for each spot and the four surrounding neighbors at a maximum distance of 150 µm. Thereafter, individual connection networks are combined to a tissue-wide connection network to compute autocorrelations for the whole dataset. Based on the neighbor groups of each spot, lag vectors for all input features are calculated, essentially being the sum expression of the respective feature in the neighbor spots. This considered, neighboring spots with high spatial autocorrelation of features demonstrate similar expression levels. This allowed us to compute the correlation score between the lag vector and the actual expression vector to estimate spatial autocorrelations.

scRNA-seq data

Publicly available scRNA-seq data were analyzed to compare and complement the spatial data in our studies. For this purpose, we used the scRNA-seq dataset of cells originating from liver tissue from the Mouse Cell Atlas (accessed 2020-10-06)⁴¹.

For comparative analysis and visualization, scRNA-seq data of the Mouse Cell Atlas was analyzed using the Seurat package (v 3.2.2). The count-data was first filtered for mitochondrial genes and normalized using the SCTransform function. Dimensionality reduction was performed using PCA and graph-based clustering was performed using the FindNeighbors and FindClusters function with a resolution of 0.8 for clustering granularity. Visualization of the clusters in low-dimensional space was performed using non-linear dimensionality reduction (UMAP). Clusters were grouped by the cell type annotations provided by the metadata of the single-cell dataset. The second dataset used for comparative analysis was extracted from single-cell spatial reconstruction data¹⁵. Differential gene expression data between layers of zonation was compared to markers for pericentral or periportal zonation in our dataset using R (v 4.0.2).

Correlation analysis

Correlation analyses between genes of clusters were performed using Pearson correlation, establishing linear correlations between differentially expressed genes of the clusters in base R. Visualization of correlation values was carried out using the corrplot package (v 0.84). The correlation coefficients of the matrix were ordered using the method “FPC”, describing the first principal component order of the correlation coefficients.

To explore the correlation relationship between single cells (assigned to the classes “pericentral hepatocytes” and “periportal hepatocytes”) and the spatial transcriptomics “pericentral (cluster2)” and “periportal (cluster1)” clusters, Spearman rank correlation coefficients were calculated. First module scores of genes assigned to each cluster were calculated for each dataset: ST and scRNA-seq of the Mouse Cell Atlas. Notably, not all genes present in one dataset were present in the other, therefore only genes present in the respective dataset were considered. Thereafter, Spearman rank correlation between the scores for all groups (“pericentral hepatocytes”, “periportal hepatocytes”, “periportal (cluster1)”, “pericentral (cluster2)”) were performed. The relationships were visualized using the corrplot package, with values ordered in the original input order.

To identify genes that exhibited spatial zonation with respect to cluster 5, we employed the concept of modelling feature values (e.g., expression) as a function of the distance to a given structure (e.g., veins), see methods section “Features as a function of distance” for a more elaborate account of these ideas. In this analysis, we let cluster 5 represent our reference structure, and “measured” the minimal distance for every spot to the reference. As cluster 5 represents a small cluster with distinct localization in a specific region of the tissue and to include more data points for the reliable investigation of gradual expression from this cluster a higher threshold of 800 µm (TN = 284 pixels), compared to 400 µm for vein-distance analyses was chosen. Next, for every gene, we calculated the Spearman correlation between the spots’ expression values and distances to cluster 5. Genes with a positive Spearman correlation value have an elevated expression as the distance to cluster 5 increases, while the opposite is true for genes with a negative correlation value. Notably, the relationship between distance and expression is not necessarily linear as the Spearman correlation — in contrast to Pearson correlation — assesses monotonicity rather than linearity. The Spearman correlation does not assume normality, and accurate p-values can be analytically derived, but must (in our case) be corrected for multiple hypothesis testing, for this purpose we use the Holm–Šidák method (implemented in the statsmodels submodule stats.multitest function multipletests). Having corrected the p-values, genes with a significant (adjusted p value < 0.05) correlation value could be extracted and further examined.

Pathway analysis

Functional enrichment analysis of marker genes of clusters was performed using g:Profiler2 (v 0.1.0). For the analysis, we extracted the gene symbols of each cluster and stored them in a list. The function “gost” of the g:Profiler2 package was then used to perform gene set enrichment analysis on input marker gene lists. In short, the function maps genes to known functional information sources and detects statistically significantly enriched terms. Since our data consists of murine liver sections, the organism was set to mus musculus and the source was set to Gene Ontology (GO) biological processes. Visualization of the 5 most significantly enriched processes for cluster 5 was performed using ggplot2 (v 3.3.2). Significance was adjusted using g:SCS (Set Counts and Sizes), as originally described by the authors of the g:Profiler package⁷¹. Enrichment scores are represented as the negative log10 algorithm of the corrected p-value. For visualization of functional enrichment on the tissue coordinates, marker genes of cluster 5 were referenced against all genes belonging to the gene ontology (GO) terms for “collagen fibril organization (GO:0030199)” and “response to cytokine (GO:0034097)”, extracted from the GO browser of the Mouse Genome Informatics database. Gene expression programs were generated for genes belonging to each GO term as described before and visualized on the spots.

Single-cell data integration (stereoscope)

The spatial data were integrated with the MCA dataset using stereoscope, a probabilistic method designed for spatial mapping of cell types⁴⁰. In short, stereoscope assumes that both single cell and spatial data follows a negative binomial distribution, learns cell type-specific parameters from the (annotated) scRNA-seq data, and then uses these parameters to deconvolve the gene expression profile associated with each spot into proportion values of each cell type. stereoscope uses a stochastic gradient descent approach, leveraging the PyTorch framework, to obtain the maximum likelihood/maximum a priori estimates of both the parameter estimates and proportion values. In both steps (parameter estimation and proportion inference) a batch size of 2048 and 50000 epochs were used, a custom list of highly variable genes—see next section for details—was used rather than the full expression profiles; default values were used for all other parameters. stereoscope can be accessed at https://github.com/almaan/stereoscope, where more detailed documentation regarding the parameter values is provided. The stereoscope version used in the study was v.0.3 (commit: aacd5f775b73b138e504c35ff0cb3ffafbfc78ff).

The cell type proportion values were overlaid on the tissue section images by using the FeatureOverlay function in the STUtility package. To make our visualization more robust to outliers, we scaled all the proportion values using what we refer to as quantile scaling. Here, this procedure was performed in two steps: First, all values larger than the 0.95 quantile are changed to this quantile value (i.e., the data is clipped); then, within every cell type and section we divide the clipped values by their maximum, effectively mapping them to the unit interval [0,1]. Thereafter, the proportion values for all 20 cell types in the single-cell dataset were plotted on the spot coordinated and overlaid on the H&E stained tissue sections.

Pearson Correlation of cell-type proportions

The estimated cell-type proportion values do not comply with most of the assumptions to analytically compute confidence intervals for (e.g., normality and heteroskedasticity). Therefore, we used a bootstrap approach to compute confidence intervals (CI), and thus be able to call signals as significant (zero not being included in the CI) or not (zero being included in the CI). For each pair of cell types, we generated 10,000 bootstrap samples and let the mean of these samples constitute a representative correlation value, while a 95% confidence interval was constructed around this by using the 2.5th and 97.5th percentiles as lower and upper limits. Pairs where the confidence interval overlaps with zero, i.e., being nonsignificant, are indicated with a magenta border.

Selection of highly variable genes for stereoscope

Seurat (v 3.2.2) was used to extract a set of highly variable genes from the MCA single-cell data, following the procedure recommended in the online Seurat Clustering Tutorial [https://satijalab.org/seurat/v3.2/pbmc3k_tutorial.html]. To elaborate, the following two steps were applied to the MCA single-cell dataset in sequential order: data normalization (NormalizeData, default parameters), and identification of highly variable genes (FindVariableFeatures, selection.method = “vst”, features = 5000). The complete set of extracted genes used in the stereoscope analysis are listed in Supplementary dataset 7.

Cluster interaction analysis

To gauge the extent to which the expression-based clusters interacted in the spatial data, we constructed a simple interaction analysis based on a nearest neighbor approach. First, for every spot within each cluster, the cluster identity of the four nearest neighbors within a distance threshold are identified. To avoid confusion, note how these distances refer to the separation of spots in the ST array and not in gene expression-space. The distance threshold is used to ensure that only spots in the actual physical neighborhood are included in the count, as might not be the case for spots near the edge otherwise. Second, once neighbor identities have been registered for all members of a cluster, we convert these integer values to a fraction by dividing them with the total number of neighbors associated with the cluster. Thus, for any given cluster we have a set of n_cluster values representing the total fraction of neighbors that belong to each one of the clusters. Since spots need to be positioned somewhere in space, clusters with a large member count will by default neighbor more spots than a cluster with low member count. Hence, to assess whether an interaction seems to be present or not, one must account for cluster size and spatial organization of the spots; here done by random permutation of the cluster labels (100 times) followed by re-calculation of the same neighborhood fraction values. This allows us to put the observed neighborhood fractions into context, to what might be expected by random chance given the cluster cardinalities and spot organization. It is by this approach that Supplementary Fig. 4 was generated, where each bar represents the observed values, the dashed black line the empirical mean value from the permutation analysis, and the magenta envelopes filling the area of two standard deviations from the mean.

Features as a function of distance

To examine how certain features of interest (e.g., gene expression or proportion values) were influenced by the physical proximity to morphological structures (e.g., central and portal veins) in the tissue samples, an approach to model these values as a function of the distance to said the structure was devised. This procedure is described in detail below:

Using the brightfield H&E-images, a mask was created for each morphological structure. These masks covered all pixels considered to belong to the structure. Each structure was assigned an individual (numerical) id, and one or more class attributes related to it (e.g., “vein type”). As the spots’ (capture locations) positions relate to pixel coordinates in the H&E-image, it was possible to—computationally—measure the distance from a spot to each of these structures.

The distance (d(s,t))) from a spot s to a structure t was here defined as the minimal (euclidean) distance from the center of spot s to any pixel p belonging to the mask of t”. In other words, if M_t is the set of all pixels in the mask belonging to structure t then:

$$d(s,t)={argmi}{n}_{pin {M}_{t}}d(s,p)$$

(1)

The same procedure was used when determining the distance to a specific class attribute (e.g., vein type), except that the union of all masks associated with a structure of said class was used instead of only a single mask. That is, if M_C is the set of all pixels belonging to any structure of class C, then the distance (d(s,C)) between spot s and class C is:

$$d(s,C)={argmi}{n}_{pin {M}_{C}}d(s,p)$$

(2)

Univariate expression by distance analysis

Once distances were determined, for a feature x of interest (e.g., expression value) and a structure t, a tuple (d(s,t),x_s) was formed for each spot s; i.e. the distance for every spot was associated with the value of the feature. This set of distance-feature tuples could then be visualized in graphs, in order to depict the feature values’ dependence on their distance to the structure.

To better capture general trends in the data, scatterplot smoothing (using the loess function from the scikit-misc package, v 0.1.3, default values for all parameters), was applied to generate smoothed estimates. The smoothed values would then serve as an approximation of a function f such that x_s = f(d(s,t)), to be interpreted as if the feature value was a function of the distance to the structure. Plotting the smoothed values against their associated distances results in a visualization of the function approximation over the distance domain; these plots are referred to as “feature by distance” plots; where the feature for example could be expression or proportion.

Unless otherwise stated, feature-distance tuples across all sections were aggregated when generating features by distance/distance-ratio plots. The envelopes encapsulating the smoothed approximation represent one standard error (SE) as given by the loess algorithm.

Bivariate expression by distance analysis

To better account for synergies between structures of various classes we also introduce a bivariate model, where we model the gene expression as a function of the distance to multiple reference structures. Rather than using a nonparametric function approximation—as in the univariate case—we favored a bivariate linear model, which would allow us to compare the influence of each independent variable (vein distances) on the explanatory power of the model. We define what will be referred to as the full model as:

$${y}_{{sg}}sim {beta }_{0g}+{beta }_{1g}{d}_{{sc}}+{beta }_{2g}{d}_{{sp}}$$

(3)

Where y_sg is the gene expression of gene g in observation s, while d_sc is the distance to the nearest central vein for observation s, and d_sp is the distance to the nearest portal vein. β_ig are coefficients specific to each gene, which are to be estimated from the data. For parameter estimation we use the OLS class from the statsmodels package submodule regression.linear_models, no regularization was used.

To visualize the results, we create a grid over the domain of distances to the central and portal veins that we seek to survey. Then for every node s in the grid, we apply Eq. (3) to get a prediction of the gene expression. Finally, we plot the grid and let the intensity of the nodes be proportional to the predicted value, using linear interpolation between the grid points to color the whole domain.

Model comparison

The bivariate model also allows one to test whether the inclusion of covariates such as distance to either vein-type significantly improves the model’s performance, or if a reduced model is sufficient to use. We thus introduce the three following reduced models:

$${y}_{{sg}}sim {beta }_{0g}+{beta }_{1g}{d}_{{sc}}$$

(4)

$${y}_{{sg}}sim {beta }_{0g}+{beta }_{2g}{d}_{{sp}}$$

(5)

$${y}_{{sg}}sim {beta }_{0g}$$

(6)

We refer to the models described in Eqs. (4)–(6) as reduced since they are all nested with the full model, meaning that their performance can be compared with a likelihood ratio test; using one degree of freedom for the two reduced models with one distance covariate, and two degrees of freedom for the intercept-only model. The outcome of the likelihood ratio test (presented as a p value), states whether the full model significantly (p value < 0.05) outperforms the reduced model, accounting for the additional model parameters.

We selected genes to be subjected to bivariate expression by distance analysis in two different ways. First, in the case of metabolic pathway gene markers¹² for glucagon and Wnt targets, we extracted 12 known Wnt pathway markers genes with most elevated expression levels in the central cluster (cluster 2) in the spatial data. For the glucagon targets we chose 10 known marker genes with the most elevated levels in the portal cluster (cluster 1) and 2 genes with highest up- or downregulation (Mup20, Mdm2) in glucagon deficient mice⁴⁷. Secondly, for the remaining bivariate expression by distance analyses of gene markers (immune system process, ha-ras, chronic hypoxia, pituitary hormones), we selected 2–3 genes exhibiting most elevated expression levels for each, the central (cluster 2) and the portal cluster (cluster 1). These markers were identified as described in the Methods section: “Zonation based DGEA of cell type and immune system process and metabolic pathway markers”.

Expression-based classification

To assess whether the gene expression of a structure’s (e.g., central or portal vein) neighborhood held sufficient information to infer its class, we constructed a classifier designed to predict structure-class based on gene expression data. The steps of data processing and explicit details for the classification procedure are described below:

First, a neighborhood expression profile (NEPs) was created for each structure, representing weighted (by distance) average expression of a set of features (here genes) in the neighborhood of a structure. The neighborhood (N(t)) of a structure t was defined as the set of spots with a distance less than a threshold T_N to t. That is:

$$N(t)={sin S{{{{{rm{|}}}}}}d(s,t) , < , {T}_{N}}$$

(7)

Where S is the set of all spots, while distances between spots and structures (d(s,t)) are defined and computed as described in the section “Features as a function of distance” above. In this study, we set the distance threshold (T_N) to 142 pixels. This threshold equals 400 µm and represents the longest distance between three consecutive spot centers in the same row. Having formed the neighborhoods, their associated expression profiles for a feature (x_N(t)) were assembled accordingly:

$${x}_{N(t)}=mathop{sum}_{sin N(t)}{w}_{ts}{x}_{s}$$

(8)

Where w_ts are the distance-based weights given by:

$${w}_{ts}=frac{{hat{w}}_{ts}}{{sum }_{kin N(t)}{hat{w}}_{tk}},{hat{w}}_{ts}=exp(-d(s,t)/sigma )$$

(9)

In this analysis, σ was set to 20. As multiple features are used, NEPs are represented by a vector of N (the number of features used) elements, denoted as x_N(t). Each NEP was then given a class label, portal or central, based on the associated structure’s annotations. The task of predicting class labels from the NEPs then surmounts to a multivariate binary classification problem, for which a logistic regression model was employed. Implementation-wise the logistic regression was performed by using the LogisticRegression class from sklearn’s (v 0.23.1) linear_model module, a l2 penalty was used (regularization strength 1), the number of max iterations was set to 1000, default values were used for all other parameters.

In short, the logistic model considers the class label (z_t) of a structure t as Bernoulli variable conditioned on the NEPs, i.e:

$${z}_{t}sim {Ber}left({p}_{t}right),{log }left[frac{{p}_{t}}{1-{p}_{t}}right]=beta {x}_{Nleft(tright)}+{beta }_{0}$$

(10)

Fitting the model equates to finding the maximum likelihood estimates of β and β₀ given the observed data and regularization terms. Once fitted, the class of a structure t is taken as class 1 if p_t ≤ 0.5 and class 2 if p_t > 0.5. However, p_t is a continuous value that also can be interpreted as the probability of a structure belonging to each class (low values indicate more similarities with class 1 and vice versa)—offering a form of soft classification.

To validate performance, cross-validation strategies were implemented at two different levels: section and sample. In the former K-sections were set aside forming a test set while the remaining sections constituted the training set. The model was trained on the training set and evaluated on the K sections in the test set. This procedure was iterated for all combinations of the pairs. Cross validation on the sample level was conducted in a similar fashion, but splitting w.r.t. samples rather than individual sections; here setting aside a sample is equivalent to excluding all sections associated with the given sample.

As the number of samples—and thus structures—were fairly low, using the complete expression profiles (i.e., all genes) would likely have led to an overfitted model (n_features ≫ n_samples). Thus, a reduced set of genes were used to construct the NEPs, extracted from the set of marker genes identified in the previously described differential gene expression analysis—this set of genes can be found in Supplementary Fig. 15.

Statistics & Reproducibilty

Samples were chosen according to the number of the available experiments. Each Spatial Transcriptomics slide contains 6 sub-arrays. All samples resulting in cDNA libraries of correct size and concentration of all performed experiments were considered. No statistical method was used to pre-determine sample size. Only data that was handled the same way during samples freezing and library preparation was considered for consistent final analysis. The experiments were not randomized and the investigators were not blinded to allocation during experiments and outcome assessment.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Source link

Vasiprak Blog