Preloader

Single-nuclei isoform RNA sequencing unlocks barcoded exon connectivity in frozen brain tissue

Experimental model and subject details

Cortex samples for SnISOr-Seq

Two healthy human mid-frontal cortices used here were obtained from tissue banks maintained by the Center for Neurodegenerative Disease Research and the University of Pennsylvania Alzheimer’s Disease Core Center, according to institutional review board-approved protocols. Neither subject had pre-existing neurodegenerative or neurological disease. Postmortem intervals were 14 h for Cortex1 (age 68, male) and 6 h for Cortex2 (age 61, male). Tissues were flash-frozen and kept at −80 °C until processing.

Pre-frontal cortex samples for Illumina sequencing of bulk nuclei

Pre-frontal cortex (PFC) samples from two patients with Alzheimer’s disease used for Illumina sequencing were obtained from the Human Brain Tissue Bank (HBTB; Semmelweis University), which is a member of the BrainNet Europe II. HBTB’s activity is authorized by the Committee of Science and Research Ethic of the Ministry of Health Hungary (ETT TUKEB: 189/KO/02.6008/2002/ETT) and the Semmelweis University Regional Committee of Science and Research Ethic (32/1992/TUKEB), including human brain tissue sample removal, collecting and storing and applying for research. Human brain microdissection procedures were approved by the Regional and Institutional Committee of Science and Research Ethics of Scientific Council of Health (ethical license: 34/2002/TUKEB-13716/2013/EHR) and the Code of Ethics of the World Medical Association (Declaration of Helsinki). Genetic testing and international transportation samples were authorized by the Semmelweis University Regional Committee of Science and Research Ethics (34/2002/TUKEB). The postmortem interval was 6.5 h for PFC_S1 (age 93, female) and 5 h for PFC_S2 (age 81, male). In both cases, the tissue samples were microdissected from the dorsolateral PFC (middle frontal gyrus, Brodmann area 9). The micropunch procedure consisted of slicing the PFC into serial coronal sections, micropunching from both the surface and the deep (wall of the superior frontal sulcus) parts of the gyrus and collecting tissue pellets. Until processing, the brains were frozen and kept at −80 °C.

Fetal human samples for qRT–PCR validation

Neurons (Thy1+ cells) and astrocytes (HepaCAM+ cells) were isolated from fetal human brain tissue (n = 3, gestational weeks 19–20) using the immunopanning method50. The fetal human brain tissue samples were obtained with informed consent under a Stanford University institutional review board-approved protocol.

Single-nuclei isolation and 10x Genomics 3′ library construction

Single-nuclei suspension was isolated from fresh-frozen human brain samples with modifications from a previous protocol51,52.

Next, ~30 mg of frozen tissue per sample was dissected in a sterile dish on dry ice and transferred to a 2-ml glass tube containing 1.5 ml of nuclei pure lysis buffer (MilliporeSigma, L9286) on ice. Tissue was completely minced and homogenized to nuclei suspension by sample grinding with Dounce homogenizers (Sigma, D8938-1SET) with 20 strokes with pestle A and 18 strokes with pestle B. The nuclei suspension was filtered by loading through a 35-µm-diameter filter and followed by centrifuging for 5 min at 600g and 4 °C. The nuclei pellet was collected and washed with cold wash buffer, which consisted of the following reagents: 1× PBS (Corning, 46-013-CM), 20 mM DTT (Thermo Fisher Scientific, P2325), 1% BSA (NEB, B9000S) and 0.2 U µl−1 of RNase inhibitor (Ambion, AM2682) for three times. After removing the supernatant from the last wash, nuclei were resuspended in 1 ml of 0.5 µg ml−1 of DAPI (Sigma, D9542) containing wash buffer to stain for 15 min. The nuclei suspension was prepared for sorting by filtering cell aggregates and particles out with a diameter of 35 µm. Nuclei were sorted to remove cell debris and fractured nuclei using the Sony MA900 sorter with FlowJo version 10 software (Supplementary Fig. 12a–c). These were collected by centrifuging for 5 min at 600g and 4 °C and then resuspended in wash buffer to reach a final concentration of 1 × 10−6 nuclei per milliliter after counting in trypan blue (Thermo Fisher Scientific, T10282) using a Countess II cell counter (Thermo Fisher Scientific, A27977).

10x Genomics 3′ library construction was performed by following the manufacturer’s instructions with single-nuclei suspension obtained from the last step. 10x Genomics 3′ libraries of Cortex1 and Cortex2 were loaded on an Illumina NovaSeq 6000 with PE 2 × 50 paired-end kits by using the following read length: 28 cycles Read1, eight cycles i7 index and 91 cycles Read2.

Linear/asymmetric PCR steps to remove non-barcoded cDNA

The first round PCR protocol (95 °C for 3 min, 12 cycles of 98 °C for 20 s, 64°C for 30 s and 72 °C for 60 s) was performed by applying 12 cycles of linear/asymmetric amplification to preferentially amplify one strand of the cDNA template (30 ng of cDNA generated by using 10x Genomics Chromium Single Cell 3ʹ GEM kit) with primer ‘Partial Read1’, and then the product was purified with 0.8× SPRIselect beads (Beckman Coulter, B23318) and washed twice with 80% ethanol. The second round PCR is performed by applying four cycles of exponential amplification under the same condition with forward primer ‘Partial Read1’ and reverse primer ‘Partial TSO’, and then the product was purified with 0.6× SPRIselect beads and washed twice with 80% ethanol and eluted in 30 µl of buffer EB (Qiagen, 19086). Sequences of primers: Partial Read1 (5′-CTACACGACGCTCTTCCGATCT-3′) and Partial TSO (5′-AAGCAGTGGTATCAACGCAGAGTACAT-3′). KAPA HiFi HotStart PCR Ready Mix (2×) (Roche, KK2601) was used as polymerase for all the PCR amplification steps in this paper, except for the 10x Genomics 3′ library construction part.

Exome capture to enrich for spliced cDNA

Exome enrichment was applied to the cDNA purified from the previous step by using probe kit SSELXT Human All Exon V8 (Agilent, 5191-6879) and the reagent kit SureSelectXT HSQ (Agilent, G9611A), according to the manufacturer’s manual. First, the block oligo mix was made by mixing an equal amount (1 µl of each per reaction) of primers Partial Read1 (5′-CTACACGACGCTCTTCCGATCT-3′) and Partial TSO (5′-AAGCAGTGGTATCAACGCAGAGTACAT-3′) with the concentration of 200 ng µl−1 (IDT), resulting in 100 ng µl−1. Next, 5 µl of 100 ng µl−1 cDNA diluted from the previous step was combined with 2 µl of block mix and 2 µl of nuclease free water (NEB, AM9937), and then the cDNA block oligo mix was incubated on a thermocycler under the following conditions to allow block oligo mix to bind to the 5′ end and the 3′ end of the cDNA molecule: 95 °C for 5 min, 65 °C for 5 min and 65 °C on hold. For the next step, the hybridization mix was prepared by combining 20 ml of SureSelect Hyb1, 0.8 ml of SureSelect Hyb2, 8.0 ml of SureSelect Hyb3 and 10.4 ml of SureSelect Hyb4 and kept at room temperature. Once the reaction reached to 65 °C on hold, 5 µl of probe, 1.5 µl of nuclease-free water, 0.5 µl of 1:4 diluted RNase Block and 13 µl of the hybridization mix were added to the cDNA block oligo mix and incubated for 24 h at 65 °C. When the incubation reached the end, the hybridization reaction was transferred to room temperature. Simultaneously, an aliquot of 75 µl of M-270 Streptavidin Dynabeads (Thermo Fisher Scientific, 65305) were prepared by washing three times and resuspended with 200 µl of binding buffer. Next, the hybridization reaction was mixed with all the M-270 Dynabeads and placed on a Hula mixer for 30 min at room temperature. During the incubation, 600 µl of wash buffer 2 (WB2) was transferred to three wells of a 0.2-ml PCR tube and incubated in a thermocycler on hold at 65 °C. After the 30-min incubation, the buffer was replaced with 200 µl of wash buffer 1 (WB1). Then, the tube containing the hybridization product bound to M-270 Dynabeads was put back into the Hula mixer for another 15-min incubation with low speed. Next, the WB1 was replaced with WB2, and the tube was transferred to the thermocycler for the next round of incubation. Overall, the hybridization product bound to M-270 Dynabeads was incubated in WB2 for 30 min at 65 °C, and the buffer was replaced with fresh pre-heated WB2 every 10 min. When the incubation was over, WB2 was removed, and the beads were resuspended in 18 µl of nuclease-free water and stored at 4 °C. Next, the spliced cDNA, which bound with the M-270 Dynabeads, was amplified with primers Partial Read1 and Partial TSO by using the following PCR protocol: 95 °C for 3 min, 12 cycles of 98 °C for 20 s, 64 °C for 60 s and 72 °C for 3 min. The amplified spliced cDNA was isolated from M-270 beads as supernatant and followed by a purification with 0.6× SPRIselect beads.

Library preparation for PacBio

HiFi SMRTbell libraries of Cortex1 and Cortex2 were constructed according to the manufacturer’s manual by using SMRTbell Express Template Prep Kit 2.0 (PacBio, 100-938-900). For both samples, ~500 ng of cDNA obtained by performing LAP-CAP from the previous step was used for library preparation. The library construction includes DNA damage repair (37 °C for 30 min), end-repair/A-tailing (20 °C for 30 min and 65 °C for 30 min), adaptor ligation (20 °C for 60 min) and purification with 0.6× SPRIselect beads.

Library preparation for ONT

For both samples, ~75 fmol cDNA processed through LAP-CAP underwent ONT library construction by using the Ligation Sequencing Kit (ONT, SQK-LSK110), according to the manufacturer’s protocol (Nanopore Protocol, Amplicons by Ligation, version ACDE_9110_v110_revC_10Nov2020). The ONT library was loaded onto a PromethION sequencer by using PromethION Flow Cell (ONT, FLO-PRO002) and sequenced for 72 h. Base-calling was performed with Guppy by setting the base quality score >7.

RNA extraction and cDNA synthesis for Illumina short-read sequencing

RNA was extracted from the single-nuclei suspension containing 300,000 nuclei by using the RNeasy Mini Kit (74104), which involved on-column gDNA digestion before RNA elution. cDNA was synthesized and amplified with NEBNext Single Cell/Low Input cDNA Synthesis & Amplification Module (E6421S) by following the manufacturer’s protocol.

Short-read library preparation and sequencing with Illumina

With 100 ng of cDNA input per sample, Illumina library prep was conducted with the Illumina DNA Prep (M) Tagmentation Kit (20018704) and IDT for Illumina Nextera DNA Unique Dual Indexes Set D (20027216), according to the manufacturer’s manual. Libraries were loaded on an Illumina NextSeq 500 with 2 × 150 bp Mid Output Kit by using the following read length: ten cycles Read1, ten cycles i7 index and 76 cycles Read2.

Validation of exon coordination in PTK2 using qRT–PCR

Neurons (Thy1+) and astrocytes (HepaCAM+) were isolated from fetal human brain tissue (n = 3, gestational weeks 19–20) using immunopanning50. All fetal human brain tissue samples were obtained with informed consent under a Stanford University institutional review board-approved protocol. RNA was extracted from about 5 million purified neurons or astrocytes with QIAzol Lysis Reagent (Qiagen, 79306). First-strand cDNA was reverse transcribed using SuperScript IV Reverse Transcriptase (Invitrogen, 18090050). RT–qPCR was performed using 15 ng of cDNA as template per sample, validated primers (see below) and PowerUp SYBR Green Master Mix (Applied Biosystems, A25742) on a QuantStudio 3 Real-Time PCR System (Thermo Fisher Scientific). Primers for RT–qPCR were designed by using Primer-BLAST (https://www.ncbi.nlm.nih.gov/tools/primer-blast/) and synthesized by Thermo Fisher Scientific. The primers either targeting the control exons or spanning alternatively spliced PTK2 exon 1 or exon 2 are listed below. The specificity of each primer pair was validated through the observation of a single band on an electrophoresis gel under a fixed melting temperature of PCR condition. The efficiency of each primer pair was evaluated as 85–115%. Comparisons were made using the comparative CT method53 and normalized to neurons, shown as fold change in Supplementary Fig. 10c.

RT–qPCR primers

PTK2_Alternative_exon1

5′-CACGCTGTCCGAAGTACAGT-3′ and 5′-ATGGAATAGATGAAGCCAGGG-3′

PTK2_Alternative_exon2

5′-AACCGCCAAAGCTGGATTCT-3′ and 5′-TGAAATTAGTGGGGACGAAACA-3′

PTK2_Mutual_exons (control)

5′-GCCTTCTCCAATACATCGTCCA-3′ and 5′-GATACTTACACCATGCCCTCA-3′

Proteomic validation of cell-type-specific coordination of ASD-associated exons

Mass spectrometry raw data from the ProteomeXchange dataset PXD001250 were searched against the UniProt mouse proteome (7 March 2021; 63,682 entries) with MaxQuant49 2.0.3.0 using default settings. We normalized the peptide abundance matrix (label-free quantification) by median sample abundance. Relative abundances in neurons, astrocytes and oligodendrocytes were compared between exon Ψ (PSI) and corresponding peptide abundances (proteomics) using the subset of tryptic peptides from an in silico digest of the exon sequences that were also identified and quantified in the proteomics dataset. Peptides that ambiguously map to multiple genes in an in silico digest of the UniProt mouse proteome were discarded. For both—the Ψ values and the proteomics peptide abundances (mean over replicates)—we set a relative abundance threshold at 95% (of maximum abundance over cell types) to define their respective enriched cell type(s) and subsequently tested for overlap between both data sources.

Data processing and quality control for single-cell short-read analysis

The 10x Cell Ranger pipeline (version 3.1.0) was run on raw Illumina sequencing data to obtain single-cell expression matrices that were analyzed using Seurat version 3.1.1 (ref. 31). For both samples, nuclei that had unique gene counts of >7,500 or <200 or >4% mitochondrial gene expression were discarded. This yielded 7,314 nuclei for Cortex1 and 6,486 nuclei for Cortex2. UMI numbers and mitochondrial gene expression percentages were regressed from each nucleus, and the matrix was log-normalized and scaled to 10,000 reads per cell. Next, we clustered cells using the Louvain algorithm, setting the resolution parameter to 0.6. We performed both t-distributed stochastic neighbor embedding (tSNE) and uniform manifold approximation and projection (UMAP) non-linear reduction techniques. Cell types were assigned by identifying canonical marker genes for each cluster13,54,55,56. This cell type annotation was confirmed by aligning to the Allen Brain Atlas human cortical data13.

Alignment of PacBio long-read data

Using default SMRT-Link parameters, we performed circular consensus sequencing (CCS) with IsoSeq3 with the following parameters: maximum subread length 14,000 bp, minimum subread length 10 bp and minimum number of passes 3.

Long-read CCS fastq sequences with PacBio were mapped and aligned to the reference genome (GRCh38) using STARlong and parameters described previously33

In silico simulation of poly(dT) and random hexamer priming

Using GENCODE version 35 transcripts and ten copies per transcript, we simulated cDNA synthesis: introns were retained with P = 0.15, and 30 As were added to each transcript, which were cut into 2-kb fragments and shorter ends, with 1.9-kb mean resulting fragment size. Each fragment was then classified as (1) 3′-end fragment (with polyA tail) or (2) internal fragment (without polyA tail). For both types, random hexamer priming was simulated by choosing a random (uniform) position along the transcript. The sequence to the right of that position was kept as a sequenced molecule, and the remainder was discarded. For both types, poly(dT) priming was simulated by choosing the longest A-rich sequence with ≥8 As in a 10-bp window. The fragment to the right of the A-rich sequence was kept as the sequenced molecule, and the remainder was discarded. Note that more stringent criteria (≥9 As) would lead to more fragments being lost. These sequenced molecules were then mapped back to the annotation, and the fraction of covered transcript was reported.

Alignment of ONT long-read data

Long reads sequenced on the ONT PromethION were mapped using minimap2 (version 2.17-r943-dirty) using the previously described parameters33.

Calculation of on-target rate

For both long-read technologies, the on-target rate was calculated using the ‘intersect’ function from BEDTools (version 2.27.0) with this definition:

$$On-target,rate = frac{{No.,of,mapped,reads,that,overlap,annotated,exons}}{{Total,no.of,mapped,reads}}$$

Calculation of normalized transcript coverage

Normalized transcript coverage was calculated using the ‘CollectRnaSeqMetrics’ function from Picard tools (version 2.25.7). A ‘refFlat’ gene annotation file was downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/refFlat.txt.gz.

Subsampling of sequencing libraries

Reads were subsampled using the ‘sample’ command from seqtk (version 1.3-r106).

Calculation of a per-read exon ratio

The expected number of exons per GENCODE gene was obtained by counting exons of each transcript and averaging for all annotated transcripts. Subsequently, the observed exons per read were divided by the expected number, yielding a ratio for each read.

In silico simulation of cDNA fragmentation

SnISOr-Seq long reads were truncated in silico so that a random number of 3′ nucleotides remained (normal distribution, mean = 250 and s.d. = 50) to simulate cDNA fragmentation. Then, 76 bp from the 5′ end of the remaining fragment were isolated to simulate the 76 bp R2 of the 10x Illumina library. Normalized sequencing coverage was then calculated.

Barcode detection and identification of unique molecules from PacBio data

Cellular barcodes (16 nt) were detected using the ‘GetBarcodes’ function in scisorseqr33 (version 0.1.2). Given PCR duplication, one transcript per molecule—that is, barcode+UMI+gene—was chosen for analysis.

Barcode detection for long-read transcripts obtained from ONT

Perfect matching barcodes were obtained similarly to the PacBio reads, however with some tolerance for sequencing errors, using the mapping information per read with white-listed UMIs as done previously57 with modifications:

  • For each Illumina-sequenced UMI, all barcode–UMI 28mers were grouped by gene as a reference set.

  • For every mapped ONT read, we compared only to the reference list for that gene.

  • Sliding windows identified barcode–UMI candidates allowing ≤1 mismatch in the first 22 bp of each reference 28mer. We then allowed only ≤2 mismatches in the 28mer.

These steps were performed using a custom script.

Identification of unique molecules from ONT data

Given the ONT error rate, reads were more likely to undergo ‘molecule inflation’—that is, errors could result in one UMI being perceived as two different ones. To combat this, we proceeded as follows:

  • Reads were grouped by barcode–UMI–gene and ordered by frequency.

  • The Levenshtein distance (LevD) to the nearest barcode–UMI pair from the same gene was obtained.

  • If LevD = 0, it was retained as an Illumina-confirmed molecule.

  • If LevD = 1 or 2, the 28-bp sequence was corrected to match the Illumina reference, and, if the barcode–UMI–gene triplet was novel with respect to ONT data, it was retained.

  • If LevD > 3 and the edit distance to any other already accepted UMI was >5, the molecule was considered novel and retained.

  • If LevD > 3 but the edit distance to any other already accepted UMI was 1 or 2, this UMI was corrected to the accepted UMI.

Following this sequencing error correction, only one read per barcode–UMI–gene triplet was retained. These steps were performed using a custom script.

Assigning TSS and polyA site to reads

We assigned the closest published TSS within 50 bp of the 5′ end of the read mapping58 as previously done33. Likewise, we assigned the closest published polyA site within 50 bp of the 3′ end of the read mapping59.

PhastCons scores for exons, TSSs and polyA sites from 17 primates

PhastCons scores for 16 primate genomes aligned to the human genome were obtained from the UCSC website47,60. The scores were averaged over internal exons, TSSs and polyA sites using the bigWigAverageOverBed script from the UCSC Utilities package.

Disease-associated exons for ASD, schizophrenia and ALS

ASD-associated exons (n = 3,482) were summarized from two studies and one review: 1,776 skipped exons from a comparison of ASD cases with controls (P < 0.05)39, 1,723 neural regulated alternatively spliced exons from ASD brains35 and 33 microexons associated with ASD and characterized functionally40. Schizophrenia-associated exons that were classified as alternative exon skipping events covering 1,107 exons were collected38. The list of 506 ALS-associated cassette exons was identified by comparing C9orf72 ALS brains with control brains41.

Alternative exon counting and categorization

Using all exons appearing as internal exon in a read, we calculated:

  1. 1.

    The number of long-read UMIs containing this exon with identity of both splice sites: Xin

  2. 2.

    The number of long-read UMIs assigned to the same gene as the exon, which skipped the exon and ≥50 bases on both sides: Xout

  3. 3.

    The number of long-read UMIs supporting the acceptor of the exon and ending on the exon: Xacc_In

  4. 4.

    The number of long-read UMIs supporting the donor of the exon and ending on the exon: Xdon_In

  5. 5.

    The number of long-read UMIs overlapping the exon: Xtot

Non-annotated exons with one or two annotated splice sites, ≥70 bases of non-exonic (in the annotation) bases, were excluded as intron retention events or alternative acceptors/donors.

We then calculated

  • ({Psi}_{overall} = frac{{X_{in} + X_{acc_In} + X_{don_In}}}{{X_{in} + X_{acc_In} + X_{don_In} + X_{out}}})

  • ({Psi}_{acceptor} = frac{{X_{in} + X_{acc_In}}}{{X_{in} + X_{acc_In} + X_{out}}})

  • ({Psi}_{donor} = frac{{X_{in} + X_{don_In}}}{{X_{in} + X_{don_In} + X_{out}}})

If

  • (0.05 le {Psi}_{condition} le 0.95,where,condition in { overall,acceptor,donor})

  • (frac{{X_{in} + X_{acc_In} + X_{don_In} + X_{out}}}{{X_{tot}}} ge 0.8)

the exon was kept.

We then calculated the Ψoverall for each cell type from all long-read UMIs for that cell type if, and only if, Xtot ≥ 10 for the exon and cell type in question. Otherwise, Ψoverall for the exon and cell type was set to ‘NA’.

Exon variability analysis

For each replicate, we defined a set of alternative exons that met each of the following criteria: (1) ≥10 supporting reads (inclusion + exclusion) in the pseudo-bulk; (2) 0.05 < Ψ < 0.95 at the pseudo-bulk level; and (3) intron retention events were excluded. These steps yielded 5,855 (Cortex1) and 5,273 (Cortex2) alternative exons. We defined a subset of alternative exons with ≥10 supporting reads in each of four major cell types (excitatory neurons, inhibitory neurons, astrocytes and oligodendrocytes). We divided alternative exons into three variability categories: (1) (maxΨ − minΨ) ≤ 0.25; (2) 0.25 < (maxΨ − minΨ) ≤ 0.75; and (3) (maxΨ − minΨ) > 0.75. For each category, we plotted the exon length density using ggplot2. Then, for each disease, we compared disease-associated exons with all other alternative exons for inclusion variability. Microexons were defined as exons with a length of ≤27 bp. Novel exons were defined as exons that are not described in the GENCODE version 34 annotation. To define a subset of novel exons that show high inclusion and/or high cell type variability, we plotted (maxΨ − minΨ) against pseudo-bulk Ψ and fit a loess curve to the data.

Gene variability analysis

Genes with disease-associated exons were isolated. log-normalized expression (transcripts per million (TPM)) values were obtained from the short-read 10x data. Variability per gene was defined as the minimum value across the broad cell types considered subtracted from the maximum value. P values were obtained by a two-sided Wilcoxon rank-sum test.

Testing for exon coordination

Testing for exon coordination can be done at the pseudo-bulk level or at the cell type level. For every exon pair passing the criteria for sufficient depth, a 2 × 2 matrix of association for a given sample—that is, cell type or pseudo-bulk—was generated. This matrix contained counts for inclusion of both exons (in–in), inclusion of the first exon and exclusion of the second (in–out), exclusion of the first exon and inclusion of the second (out–in) and exclusion of both exons (out–out).

The co-inclusion score of an exon was defined as the double inclusion (in–in) divided by the total counts for that exon pair. An exon pair that was deemed ‘coordinated’ was assessed using the χ2 test of association. The effect size was calculated as the |log10(odds ratio)|. The odds ratio was calculated by setting 0 values to 0.5 and dividing the product of double inclusion and double exclusion by the product of single inclusion—that is, [(in–in) × (out–out)] / [(in–out) × (out–in)]. Finally, we used a Benjamini–Yekutieli correction for multiple testing and reported the FDR value.

Conservation analysis for exon pairs

PhastCons scores from primates were obtained as described above. For every gene used in the pseudo-bulk analysis, the exon pairs with the smallest |log10(odds ratio)| were retained. The minimum PhastCons score for each pair was extracted and plotted against the absolute value of the log-odds ratio.

Cell-type-specific conservation for exon pairs

Exon coordination count data were split by cell type, including astrocytes, excitatory neurons, inhibitory neurons, oligodendrocytes or OPCs. Together with the log-odds ratio, we calculated an exon inclusion ratio, which is defined as the number of times both exons pairs are included in the sequencing data (in–in) divided by the sum of the in–in and out–out counts per exon pair. The minimum PhastCons value for each exon pair was selected and placed into two groups ([0,0.5] and [0.5,1]). We then plotted these two groups against the |log10(odds ratio)| and the exon inclusion ratio as box plots.

Obtaining counts for exon–end site combinations

We obtained counts for exon–TSS and exon–polyA site combinations using a custom script. Specifically, per sample we counted the number of reads assigned to a given TSS and divided them into reads including a particular exon or skipping the exon. We proceeded similarly for exon–polyA site pairs. Only genes with ≥25 reads were used for further analysis.

Testing for exon–end site coordination

Testing for exon–end site coordination (here, with a χ2 test) can be done either in pseudo-bulk or in each cell type. For each test, a n × 2 matrix per internal exon was generated, with the n TSS forming rows and inclusion and exclusion counts forming columns. An exon–TSS pair was tested only if the TSS was upstream of the intron preceding the exon, and the read extended to beyond the end of the following intron. For effect size, we used Δ∏ (the maximum change in exon inclusion between reads associated to distinct TSS). Finally, we used a Benjamini–Yekutieli correction for multiple testing and reported the FDR value. We proceeded similarly for exon–polyA sites.

The χ2 criterion and testability

To categorize exon pairs or exon–end site pairs as testable, we employed the following metrics. For each matrix M with elements mij,

  • The expected value for each element mij was defined as (frac{{mathop {sum }nolimits_{k = 1}^i m_{kj} cdot mathop {sum }nolimits_{k = 1}^j m_{ik}}}{{{sum} M }}).

  • If the expected value in 80% (rounded to nearest integer) of elements is ≥5, and the expected value of all elements is ≥1, the χ2 criterion is met, and the P value is calculated.

  • If the median expected value is <5 in any row or any column, then the RNA variable (that is, TSS, exon or polyA site) in that row or column is said to be constitutive.

  • If none is met, we classify them as ‘low counts’.

Conservation analysis for exon–end site pairs

PhastCons scores for all TSSs were extracted as described above. For every gene in the pseudo-bulk analysis, the TSS–exon pair with the smallest Δ∏ was chosen. For such exons, PhastCons scores of the associated TSSs were sorted by value. The TSS with the minimum PhastCons score was reported for that exon, and the Pearson’s product–moment correlation between the PhastCons score and Δ∏ for that TSS–exon pair was calculated. Similar analysis was conducted for the exon–polyA site pairs.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Source link