RNA modifications detection by comparative Nanopore direct RNA sequencing

Cell culture and KD/KO experiments

The RNA from WT and METTL3 KD MOLM13 cells was obtained from Barbieri et al.³². Briefly, cells were cultured in RPMI1640 (Invitrogen) supplemented with 10% FBS and 1% penicillin/streptomycin/glutamine. Conditional knock-downs (KD) using METTL3-targeting or scrambled shRNAs were performed as previously described³². For lentivirus production, 293T cells were transfected with PLKO.1 lentiviral vector containing the shRNA sequences (Table S2), together with the packaging plasmids psPAX2 (Addgene Plasmid #12260), and VSV.G (Addgene Plasmid #14888) for METTL3 KD or Pax2 (Addgene Plasmid #35002), at a 1:1.5:0.5 ratio, using Lipofectamine 2000 reagent (Invitrogen) according to the manufacturer’s instructions. Supernatant was harvested 48 and 72 h after transfection. 1 × 10⁶ cells and viral supernatant were mixed in 2 ml culture medium supplemented with 8 μg/ml polybrene (Millipore), followed by spinfection (60 min, 900g, 32 °C) and further incubated overnight at 37 °C. The medium was refreshed on the following day and the transduced cells were cultured further. MOLM13 cells (5 × 10⁵) were infected using PLKO-TETon-Puro lentiviral vectors expressing shRNAs. After 24 h of infection, the cells were replated in fresh medium containing 1 μg/ml of puromycin and kept in selection medium for 7 days. shRNA expression was induced by treatment with 200 ng/ml doxycycline for 4 days for METTL3 KD. Near complete loss of METTL3 RNA and protein was confirmed by Western Blot and qPCR by Barbieri et al.³². For METTL3 knock-out (KO) experiments, lentiviruses were produced in HEK293 cells using ViraPower Lentiviral Expression System (Invitrogen) according to manufacturer’s instructions. MOLM13 cells stably expressing Cas9 were transduced with lentiviral gRNA vectors expressing either empty or METTL3 gRNAs (Table S2) and selected with puromycin from day 2 to day 5. At day 5 post-transduction, the cells were suspended in fresh medium without puromycin. At day 6, cells were harvested for RNA extraction.

The diploid S. cerevisiae strains used for generating the ime4Δ mutant were derived from the SK1 background. The ime4Δ strain was generated using the one-step gene replacement method described previously⁴⁶.

RNA purification and in vitro transcription

Total RNA was isolated from MOLM13 cells using the RNeasy midi kit (Quiagen) and polyA+ RNA was purified from 30 μg total RNA using the Dynabeads mRNA Purification Kit (Thermo Fisher Scientific) according to the manufacturer’s instructions. For production of unmodified 7SK RNA, synthetic double stranded DNA template for in vitro transcription (IVT) was produced by hybridization of synthetic Megamer® Single-Stranded DNA Fragments (IDT) containing the 7SK sequence downstream of a T7 promoter (Table S3). 500ng of double stranded DNA template were used in 20 μl IVT reactions for 1h using the TranscriptAid T7 High Yield Transcription Kit (Thermo Fisher Scientific), following the manufacturer’s instructions. The RNA product was purified using the RNA Clean & Concentrator kit (Zymo Research). Wild Type and ime4Δ yeast cells were collected after 4h in sporulation medium and total RNA was extracted with acid phenol:chloroform:isoamyl alcohol as previously described⁴⁷. polyA+ RNA was purified from total RNA using the Dynabeads mRNA Purification Kit (Thermo Fisher Scientific) as above.

miCLIP

miCLIP was performed in duplicates with RNA isolated from wild type and METTL13 KO MOLM13 cells. The protocol is conceptually related to the original m6A miCLIP protocol⁴⁸, but uses total RNA as input and follows a more recent variant of iCLIP protocol⁴⁹. 4 μg of total RNA were fragmented with RNA fragmentation reagents (ThermoFisher) following the manufacturer’s instructions. Fragmented RNA was then incubated with 2.5 μg anti-m6A antibody (Abcam, ab151230) in IP buffer (50 mM Tris-HCl pH 7.4, 100 mM NaCl, 0.05% NP-40) at 4 °C for 2 h, in rotation. Subsequently, the solution was placed in 6-well plates on ice and irradiated twice with 0.3 J cm−2 UV light (254 nm) in a Stratalinker crosslinker. 30 μl protein G beads (Dynabeads) per sample were washed twice with IP buffer and then incubated with the RNA-antibody solution at 4 °C for 1.5 h, in rotation. After the IP, the RNA-antibody-beads complexes were washed twice with High-Salt Wash buffer (50 mM Tris-HCl pH 7.4, 1M NaCl, 1 mM EDTA, 1% Igepal CA-630, 0.1% SDS, 0.5% sodium deoxycholate), once with IP buffer and once with PNK Wash buffer (20 mM Tris-HCl pH 7.4, 10 mM MgCl₂, 0.2% Tween-20). The beads then proceeded to 3′ dephosphorylation and the rest of the iCLIP protocol. The 3’ adapters for on-bead ligation carry the sequences found in Table S4. Samples were mixed after the adapter removal step. Following the SDS-PAGE gel, the membrane was cut from 45 kDa to 185 kDa and RNA was extracted. The following sequence of the RT primer was used: /5Phos/WWW CGTAT NNNN AGATCGGAAGAGCGTCGTGAT/iSp18/GGATCC/iSp18/TACTGAACCGC. cDNA libraries were sequenced with single end 100bp reads on Illumina HiSeq4000.

Nanopore direct-RNA sequencing (DRS)

RNA sequencing was performed following the instruction provided by Oxford Nanopore Technologies (Oxford, UK), using R9.4 chemistry flowcells (FLO-MIN106) and direct-RNA chemistry sequencing kits (SQK-RNA001 or SQK-RNA002). For polyA+ transcriptome sequencing, we followed the conventional DRS protocol using the provided polyT (RTA) adapter. For the targeted sequencing, we ordered custom reverse transcription adapters complementary to the 3’ end of 4 selected noncoding RNAs, and followed the sequence-specific DRS protocol (Table S5). For library preparation of IVT 7SK, we used 500ng of unmodified IVT RNA prepared as described above, using the adapter complementary to the 3′end of 7SK.

m⁶A RNA immunoprecipitation and qRT-PCR

Cell nuclei were obtained from MOLM13 WT (six independent biological replicates) or METTL3-KD cells (six independent biological replicates for each shRNA) six days after doxycycline administration. Cell lysis was performed in 10 mM TRIS pH = 7.8, 140 mM NaCl, 1.5 mM MgCl₂, 10 mM EDTA, 0.5% NP40 and RNase inhibitor (RNaseOUT™, Thermo Fisher Scientific, 10777019, lot # 2232786) for 30 min on ice followed by centrifugation at 3,000 × g for 3 min. Nuclear RNA fraction was then purified using the RNAeasy midi kit (Qiagen). Successively, 4 μg of nuclear RNA were fragmented for 3 min and 30 second at 70 °C using the RNA fragmentation Reagents (Thermo Fisher Scientific, AM8740, lot # 00786992). Fragmented nuclear RNA was then purified using the RNA Clean & Concentrator™-5 kit (Zymo Research, R1016). meRIP qRT-PCR was performed, as previously described⁵⁰ with some modifications. Briefly, for each immunoprecipitation reaction 1 μg of fragmented nuclear RNA was incubated 2 h at 4 °C in rotation with anti-m⁶A (Abcam, ab151230, lot #GR3319501-1) or anti-GFP Antibodies (Abcam, ab290, lot #GR3321575-1) in a final volume of 1ml RIP Buffer (RIP buffer 5×, ddH₂O, RNaseOUT™)⁵⁰, and subsequently incubated 2h at 4 °C in rotation with 50 µL of BSA-coated Dynabeads G (Thermo Fisher Scientific, 10004D). A total of 5% of each immunoprecipitation reaction was saved as input control. To elute RIP-RNA, beads were incubated twice 30 min at 37 °C in a thermo-shaker (1100rpm) in 40 μl of elution buffer (RIP buffer 1×, 6.7 mM N6-Methyladenosine 5′-monophosphate (Santa Cruz Biotechnology, sc-215524, lot # L1820), RNaseOUT™). Input and RIP samples were finally purified using the RNA Clean & Concentrator™-5 kit (Zymo Research, R1016). cDNA was obtained using the high-capacity cDNA reverse transcription kit (Thermo Fisher Scientific, 4368814). The levels of 7SK were measured using a QuantStudio 6 Flex real-time PCR machine and PowerUp™ SYBR™ Green PCR master mix (Thermo Fisher Scientific, A25780) according to the manufacturer’s instructions. Statistical testing for differences between KD and Control was done with the one-tailed Welch’s t-test. qRT-PCR primers: 7sk (22-73): Fwd 5′-GCGACATCTGTCACCCCATT-3′; Rev 5′-CAGCCAGATCAGCCGAATCA-3′. 7sk (50-160): Fwd 5′-GGGTTGATTCGGCTGATCT-3′; Rev 5′-GGGGATGGTCGTCCTCTT-3′ 7sk (258-308): Fwd 5’-CGTAGGGTAGTCAAGCTTCCA-3’; Rev 5’-CAGCGCCTCATTTGGATGTG-3′

Western blotting

Western blot experiments were performed as previously described (Barbieri, Nature 2017) using the following antibodies: anti-METTL3 (Abcam, ab195352, lot #GR3247121-3) and anti-beta Actin (Abcam, ab8227, lot #GR3255609-1).

In silico simulated datasets

Unmodified RNA model

We used an in vitro transcribed human RNA DRS dataset released by the Nanopore WGS consortium as a ground truth for non-modified RNA bases (https://github.com/nanopore-wgs-consortium/NA12878). This dataset contains all possible 5-mers on average 58,307 times. The reads were aligned on gencode release 28 human reference transcriptome with Minimap2 v2.14 and we realigned the signal to the reference sequence using Nanopolish eventalign v0.10.1 followed by NanopolishComp Eventalign_collapse v0.5 . Next, we collected the median intensity and dwell time data for each 5mers and tried to fit 44 different distributions. We selected distributions minimising the sum of square root error for all kmers between the observed and modelled data. In addition, we also based our selection on the possibility to easily change the parameters of the distributions to simulate the presence of modifications. We selected the Wald distribution and the Logistic distribution for dwell time and median intensity, respectively. Finally, we generated a model file containing the parameters of the observed and model distributions for each 5-mer. The up-to-date model file is distributed with Nanocompore. The detailed analysis is available in the following Jupyter notebook: https://github.com/tleonardi/nanocompore_paper_analyses/blob/master/in_silico_dataset/01_IVT_Kmer_Model.ipynb.

Simulated reference sequence

We generated a set of in silico reference sequences. In order to maximise the sequence diversity and kmer coverage we used a “guided” random sequence generator. In brief, the sequences are generated base per base using a random function, but the program keeps track of the number of times each kmer was already used. The sequence is extended, based on a random function with a weighted probability for each kmer inversely proportional to their occurrence in the sequences already generated. This ensures that all kmers are represented as uniformly as possible, but it leaves some space to randomness. We generated a set of 2000 sequences 500 bases long each maximising the 9-mers coverage. We excluded any homopolymers longer than 5 bases, as they are likely to be miscalled in nanopore data. Kmer coverage in the final sequence set are summarised in Table S6. The detailed analysis is available in the following Jupyter notebook: https://github.com/tleonardi/nanocompore_paper_analyses/blob/master/in_silico_dataset/02_Random_guided_ref_gen.ipynb.

Simulated modified and unmodified datasets

Nanocompore comes with a companion tool called SimReads which can generate simulated read data based on a fasta reference and a kmer model file. Essentially, SimReads walks along the reference sequence and generates intensity and dwell time values corresponding to each 5-mers. To do so, it uses a probability density random generator using the kmer model values (location and scale) bounded by the extreme observed values. This tool can also offset the model mean by a fraction of the distribution standard deviation to simulate the effect of RNA modifications. This can be done for all the reads or only on a subpopulation of reads. SimReads generates files similar to the output of NanopolishComp EventalignCollapse. This means that the datasets can be directly used as input for NanoCompore SampComp. Using Nanocompore v1.0.0rc3 with the previously described simulated reference sequence set we generated 144 in silico datasets with various amplitude of modification of the median signal intensity and the dwell time (0, 1, 2, 3, and 4 standard deviation) as well as different fractions of modified reads (10%, 25%, 50%, 75%, 90%, and 100%). All the datasets were simulated in duplicate with a uniform coverage depth of 100 reads. The detailed analysis is available in the following Jupyter notebook: https://github.com/tleonardi/nanocompore_paper_analyses/blob/master/in_silico_dataset/03_Simulated_dataset_gen.ipynb.

Analysis of simulated datasets

We compare the 144 datasets containing simulated modifications against the reference dataset generated from the unmodified model with Nanocompore v1.0.0rc3 (See Nanocompore section after). The analysis was performed with all the statistical methods supported by Nanocompore using a sequence context of 2 nucleotides (https://github.com/tleonardi/nanocompore_paper_analyses/blob/master/in_silico_dataset/04_nanocompore.sh). The result database was subsequently parsed and the predicted modified sites were compared with the position of the known simulated positions. A hit was considered true positive (TP) when we found a significant p-value within 3 nucleotides of a known modified position. A significant hit outside of this window was counted as a false positive (FP). Finally, we plotted the Receiver Operating Characteristic (ROC) curves corresponding to the TP rate compared with the FP rate for every Nanocompore comparison performed (https://github.com/tleonardi/nanocompore_paper_analyses/blob/master/in_silico_dataset/05_calc_roc.sh).

Sequencing and analysis of synthetic modified oligos

The four PAGE-purified, synthetic oligonucleotides of 100nt were ordered through Horizon Discovery LTD at a concentration of 0.2 µmol. Oligo1, 2, and 3 carried 3 modified nucleotides each, whereas Oligo4 was the unmodified control. All the oligonucleotides have the same sequence, but they contain different modifications sufficiently spaced (23 bases) to avoid interactions between modifications. The sequence was chosen in order to combine all the know consensus of the modifications in a single oligo sequence in order to be able to use a single non-modified reference for all oligos:

m6A: GGACU (strong DRACH consensus)
m6A: CGACC (Weak NRACH consensus)
m6A: CUAGC (Anti DRACH consensus)
Inosine: UUAGC (loose motif in editing-enriched regions (EERs) – from Blango and Bass 2016, and Eggington et al. 2011).
PseudoU: UGUAG (from Pus7’s UGΨAR motif, and 7SK IVT peak…)
m62A: GUGAACC (from the 18S rRNA modified sequence)
m5C: CCCGGG (from Huang et al. 2019)
m1G: CAGGTCG (from the tRNA m1G37 position)
2’OmeA: GAGAGAA (from rRNA doi: 10.1093/nar/gkw810)

The motifs were all expanded to 7 bases and combined in a sequence separated by a randomly generated buffer of 9 bases. We generated all possible permutations of the blocks and 1000 different versions of the randomly generated buffer sequences (disallowing homopolymers), totalling 216,000 candidate sequences. We then computationally folded all of the candidate sequences using RNAfold v2.4.15 from the Vienna package. Finally, we calculated a combined score taking into account the folding score and the base composition balance and picked the best candidate:

m6A_strong-Inosine-m62A-m6A_anti-m5C-m1G-m6A_weak-PseudoU-2OmeA|seed=802

>control

AUACUCGACAUAGAUAGGACUCUUUAGCUAGUGAACCCUAGCCUCCGGAGACAGGUCGCGACCUGUGUAGAUGAGAGAACUGAGUGCACAAAAAAAAAAA

>mod_set_1

AUACUCGACAUAGAUAGG(m6A)CUCUUUAGCUAGUGAACCCU(m6A)GCCUCCGGAGACAGGUCGCG(m6A)CCUGUGUAGAUGAGAGAACUGAGUGCACAAAAAAAAAAA

>mod_set_2

AUACUCGACAUAGAUAGGACUCUUU(I)GCUAGUGAACCCUAGCCUC(m5C)GGAGACAGGUCGCGACCUGUG(PseudoU)AGAUGAGAGAACUGAGUGCACAAAAAAAAAAA

>mod_set_3

AUACUCGACAUAGAUAGGACUCUUUAGCUAGUG(m62A)ACCCUAGCCUCCGGAGACAG(m1G)UCGCGACCUGUGUAGAUGAG(2’OmeA)GAACUGAGUGCACAAAAAAAAAAA

The full design analysis is now provided in the online companion analysis repository https://github.com/tleonardi/nanocompore_paper_analyses/tree/master/control_oligos_design

DRS libraries were prepared from 500 ng of each oligo using the SQK-RNA002 kit (ONT) and following the standard protocol. Libraries were then sequenced in individual FLO-MIN106 flowcells on a GridION instrument. The data was then basecalled with Guppy (v3.2.10) with default parameters. A known limitation of DRS is the poor data normalisation for short reads. To overcome this limitation and reduce noise, we only retained for further analysis the Guppy pass reads of at least 100nt in length (i.e., full length oligos and fusion reads). Filtered reads were then mapped to the reference unmodified sequence using minimap2 (-k 9 -m 5), the signal data was then resquiggled with Nanopolish and the aligned events table was collapsed with NanopolishComp as outlined before. The filtered datasets for Oligo1, 2, and 3 were then analysed with Nanocompore (v1.0.0rc3, -min_coverage 30, -downsample_high_coverage 5000). The Nanocompore signal peaks were generated as described in Peak Calling section using a p-value threshold of 0.01.

We then generated artificial datasets containing variable fractions of unmodified and modified reads, covering all possible combinations of 3 factors:

1.

f, the fraction of modified reads in experimental condition, ranging from 0 to 1 in 0.1 increments
2.

r, the fraction of modification reduction in control condition. Values 1, 0.8 or 0.5
3.

n, the read coverage ranging from 16 to 4096 and doubling at each step.

For each dataset to be generated, we created 4 NanopolishComp index files:

1.

A file referencing a random sample of n*f reads from the dataset containing the modification
2.

A file referencing a random sample of n*(1-f) reads from the unmodified dataset
3.

A file referencing a random sample of n*f*r reads from the dataset containing the modification
4.

A file referencing a random sample of n*(1-f*r) reads from the unmodified dataset

This procedure was repeated 100 times for each combination of n, f, and r and analysed in 81.000 distinct Nanocompore runs using the combined files 1 and 2 as the experimental sample and the combined files 3 and 4 as the reference sample. We then analysed the results of Nanocompore in order to calculate, for each combination of n, f, and r, the mean number of True Positives, False Positives, True Negatives, and False Negatives identified. For this purpose, True positives were defined as the number of known modification sites with at least 1 significant kmer; False positives were defined as the number of significant kmers outside of the known modification sites; True negatives as the number of known unmodified sites that didn’t have any significant kmer and False negatives as the number of known modification sites not supported by any significant kmer.

Direct-RNA datasets analysis

Reference files

For this study we used the following Human reference files all obtained from Ensembl:

Data preprocessing

All the datasets were preprocessed using an automated analysis NextFlow pipeline, before running Nanocompore (https://github.com/tleonardi/nanocompore_pipeline). Raw reads FAST5 files were basecalled with ONT Guppy v3.1.5 and the basecalled reads were saved in FASTQ format. A post-basecalling quality control was performed with pycoQC (v2.2.4)⁵¹ to verify the consistency of the sequencing runs. A transcriptome reference FASTA file was created from the annotation BED file and genome FASTA file with Bedparse (v0.2.2)⁵². Reads were then aligned on the transcriptome reference with Minimap2 (v2.16)⁵³ in unspliced mode (-x map-ont). The resulting aligned reads were filtered with samtools (v1.9)⁵⁴ to keep only primary alignments mapped on the forward strand (-F 2324) and the raw signal was realigned on reads using Nanopolish eventalign (v0.11.1)⁵⁵. Finally, the data was processed by NanopolishComp Eventalign_collapse (v0.6.2)⁵⁶ to generate a random access indexed tabulated file containing realigned median intensity and dwell time values for each kmer of each read.

Signal comparison with nanocompore

Nanocompore is a Python3 package dedicated to comparative analysis of DRS nanopore sequencing raw signal in order to identify potential RNA modification sites. Signal analysis and complex statistical tests are generally resource-intensive, but Nanocompore takes advantage of a multiprocessing architecture to process transcripts in parallel and has a relatively small memory footprint. Nanocompore requires at least 1 indexed tabulated file generated with NanopolishComp Eventalign_collapse for each of the 2 conditions to compare. The program will run with a single replicate per condition, but we recommend at least 2 to take full advantage of the advanced statistical framework. The analysis flow is divided in three steps: (1) white-listing of transcripts with sufficient coverage, (2) parallel processing and statistical testing of transcripts position per position, (3) post-processing and saving.

Transcripts whitelisting

In order to reduce the computational burden, Nanocompore first filters out transcripts with insufficient coverage. This is achieved by a rapid tally of reads mapped per transcripts followed by selection of transcripts having at least 30 reads mapped in all of the samples provided. Users can modify the threshold but the default value allows to get reproducible results. Optionally, one can provide a custom list of transcripts to include or exclude.

Statistical analysis

White-listed transcripts are processed in parallel to take advantage of multi-threaded architecture. First, the data corresponding to the reads mapped on each transcript is loaded in memory and transposed in the transcript space in a position-wise fashion. The current implementation of Nanocompore only uses the median signal intensity and the scaled log10 transformed dwell time, but the framework is flexible enough to aggregate more variables, such as the error rate or additional Nanopolish HMM states. The 2 experimental conditions are compared positions per position using a range of statistical tests. We included the Kolmogorov-Smirnov (KS) test as a robust univariate pairwise statistical test on current intensity and dwell time. These tests are performed independently on the median intensity and the dwell time. We also implemented a Gaussian mixture model (GMM) clustering-based method. For a given position we fit a bivariate 2 components GMM to all the data points observed (x=median intensity, y=dwell time), irrespective of the sample label. We then assign each data point to one of the two clusters and test for differences in the distribution of reads between clusters across conditions. To this purpose, testing is implemented in two ways: 1) by default, we fit a Logit model to the data using the formula predicted_cluster~1+sample_label and report the coefficient’s p-value. 2) As an optional alternative we do a one-way ANOVA test comparing the log odds of data points belonging to cluster one between the two conditions. After testing, it’s optionally also possible to aggregate the p-values of neighbouring kmers to account for the fact that modified bases affect the signal of multiple kmers. To this end, and due to the fact that neighbouring p-values are non-independent, we implemented in python a method that extends the Fisher’s statistic X=-2log(P₁^w1 P₂^w2 … P_k^wk) to approximate the distribution of the weighted combination of non-independent probabilities²⁶. The combined p-values are computed all along the sequence using a sliding window of a given length. This method greatly reduces the prediction noise (false positive rate) at the expense of spatial resolution, while giving more weight to sites for which the effect of RNA modifications on the signal is spread over several kmers.

Post-processing, saving and data exploration with Nanocompore interactive plotting API

Results generated by the statistical module are collected and written in a simple key/value GDBM database. Although this data structure has limitations in terms of portability and concurrent access, it is natively supported by python and allows storing complicated data structures. For each test previously performed p-values are temporarily loaded in memory and corrected for multiple tests with the Benjamini-Hochberg procedure. Users can then obtain a tabulated text dump of the database containing all the statistical results for all the positions in the transcripts space or a BED file with the positions of significant hits found by Nanocompore converted in the genome space. Finally, we provide a convenient python wrapper over the GDBM database, allowing users to interactively access simple high level functions to plot and export the results (https://nanocompore.rna.rocks/demo/SampCompDB_usage/). The wrapper was initially developed for Jupyter but can essentially work with any python IDE. At the time of publication the wrapper allows to generate 6 different types of publication ready plots for a given transcript including (1) the distribution of p-values, (2) the distribution of signal intensity and dwell time, (3) the overall coverage per sample, (4) the nanopolish HMM states, (5) the kernel density of the signal and dwell time for a specific position and (6) the sharkfin plot of the p-values compared with Log Odds Ratio (for the GMM method).

Downstream analyses

The code for all generic analyses, plots and metrics is available at https://github.com/tleonardi/nanocompore_paper_analyses/. The transcript intersection plot for the MOLM13 polyA dataset had been generated with UpsetR^57,58.

Peak calling

Given that a single modification can affect the signal for multiple overlapping kmers, we developed a peak calling method to refine our predictions. Briefly, we first converted p-values in -log10 so that peaks correspond to positions with higher probability of being modified. We then defined a dynamic threshold per transcripts corresponding to the median of all the values above 2 (p-values <0.01). In the case where no significant p-values were found, the threshold was set to 2. Peaks were called using scipy.signal.find_peaks using the dynamic threshold described before as a minimal height and a minimal distance of 9 between 2 peaks (5 overlapping 5-mers). Examples can be found in Figure S12.

Metagene m6A coverage

The metagene m6A coverage analysis was done considering all nanocompore kmers with GMM logit p-value<0.01 and a log odds ratio >0.5. The plot was produced in R/Bioconductor with the Guitar package using the TxDb.Hsapiens.UCSC.hg38.knownGene package for the human transcriptome annotation and the SK1 reference transcriptome GFF for the yeast annotation.

Motif enrichment analysis of m6A sites

For the motif enrichment analysis of m6A sites identified by Nanocompore analysis of METTL3 KD, we extracted the sequence of all kmers tested by Nanocompore and having a p-value<0.5 (GMM-logit). The sequences were then sorted by p-value and analysed with Sylamer for the identification of over-represented words, using a word size of 5 and a growth parameter of 100. The Sylamer results were then imported in R for plotting. To produce a combined profile of the DRACH motif, the per-window p-values of all DRACH kmers were combined using Fisher’s method. For visualisation purposes, the final plot only reports the lines for the top 100 motifs with the greatest area under the sylamer curve, with the top one represented in colour.

Single molecule identification of m6A sites

To assign an m6A probability at A652, A1324 and A1535 for each read covering the β-actin transcript, we developed a dedicated post-processing script available at https://github.com/tleonardi/nanocompore_paper_analyses/m6acode/parse_sampcomdb.py. Briefly, for each of the three positions of interest, we extract the GMM model saved in sampCompDB, and for each read we then predict the probability that it belongs to each of the two clusters. To define which of the two clusters corresponds to m6A modified reads, we consider which of the two clusters has negative log odds of data points belonging to it in the KD condition (i.e., we consider which of the two clusters shrinks in the KD condition). To test the independence of the methylation events at these three sites, we performed a chi-squared test of independence comparing the expected number of molecules for each of the 8 combinations of modifications to the observed number of molecules. The results reported are obtained using a probability threshold of 0.75 (as predicted by the GMM) to consider a read as methylated. However, to ensure robustness of these results, the chi-squared test was repeated for all thresholds between 0.1 and 1 (0.05 steps) and p-values were adjusted accordingly using the Benjamini–Hochberg procedure. Adjusted p-values were >0.39 for all thresholds used.

7SK structures

The 7SK multiple alignments and consensus secondary structure were obtained from Rfam (RF00100). Secondary structure plots were produced with R2R^49,59 and a custom python script to annotate p-values as colour shading (available at https://github.com/tleonardi/nanocompore_paper_analyses/blob/master/ncRNAs_structures/create_annotations.py)

miCLIP analysis

miCLIP data and corresponding input data was analysed using the iMaps web server (https://imaps.genialis.com/). Briefly, raw reads were demultiplexed and trimmed (for adaptors and quality), before being mapped to a tRNA and rRNA index using STAR (v2.4.0.1)⁶⁰. Unmapped reads were then mapped to GRCh38 GENCODE primary assembly, using GENCODE annotation v30. STAR parameter -alignEndsType Extend5pOfRead1 was used to ensure no soft clipping of cDNA start sites. PCR duplicates were removed based on unique molecular identifier (UMI) and mapping position. cDNA start -1 positions were taken as crosslink sites. Significant Nanocompore clusters were determined by merging overlapping kmers with a GMM_logit_contex_2 p-value < 0.001 using bedtools merge (v2.28.0). Control sites were selected as those with a context 2 p-value of 1 and split into those that did or did not contain DRACH within a 100nt window around the center of the cluster. Due to large differences in library size, miCLIP crosslinks were first filtered to remove intergenic and ncRNA sites and then subsampled using GNU coreutils shuf, to generate libraries equal in size to the smallest library, totalling 47,012 crosslinks. Crosslink counts were divided by gene TPMs calculated from either WT or KO mock miCLIP samples. BigWig files were generated from the normalised bedgraphs, which were used as the input to deepTools⁶¹ (v3.3.0) computeMatrix and plotHeatmap to generate metaprofiles -1000 to +1000bp around the center of Nanocompore clusters with a bin size of 2bp. The resulting tabular output was further analysed in R. Shaded regions on the plot represent the mean +/- the standard deviation at each position in the profile (WT miCLIP n=4, KO n=2). Both the mean and bounds were smoothed using loess regression with a span of 0.6. In order to test for a significant difference between WT and KO profile, mean values from WT and KO miCLIP between positions -20 to +20 around nanopore sites were subjected to a Mann–Whitney U test.

Modification prediction Comparison with MetaCompore

In order to compare Nanocompore against most of the other tools available for RNA modification detection in a reproducible way, we wrote a snakemake pipeline called MetaCompore (https://github.com/a-slide/MetaCompore). For this study, we used MetaCompore v0.1.2, which includes the latest version of following tools : Epinano 1.2.0, Eligos 2.0.0, Tombo 1.5.1, differr_nanopore_drs (latest version), Mines (latest version) and Nanocompore 1.0.3. MetaCompore preprocess the data for all the tools, including Basecalling with ONT-Guppy 4.2.2 (except for Epinano which required the older 3.1.5 version), read alignment to the reference transcriptome with Minimap2 2.17, alignments filtering with pyBiotools 0.2.7 and signal realignment with f5c 0.6. For portability and reproducibility reasons, every module of MetaCompore is provided within its own singularity container and all the options used for a run are tracked in a YAML configuration file. Nanocompore and Epinano are the only tools to support experimental replicates. For all the other tools we merged the data obtained from replicates. Since every tool outputs a different kind of statistics/format, MetaCompore filters the data following the respective authors recommendations and when possible converts the result in a similar format containing the significant site associated with their p-value and Effect size. For Nanocompore and Tombo which both work in signal space, we added a peak calling denoising step to narrow down the results.

For the comparison in this paper we used a Yeast SK1 dataset comparing 2 replicates of WT yeast against 2 replicates of an IME4 KO mutant (m6A writer in Yeast). We used the Yeast SK1 reference transcriptome (https://www.yeastgenome.org/strain/SK1). Prior to modification detection, we ran an optional pipeline step to filter out any reference transcript with less than 30 reads in all replicates. The command line options used for all the tools are available in the MetaCompore configuration file provided as supplementary material.

Benchmarks of MetaCompore results against known yeast m6A sites

We compiled an orthogonal reference set of m6A sites from SK1 yeast by taking m6A-Seq sites from Schwartz et al.³⁰ and MAZTER-seq sites from Garcia-Campos et al.²⁹. The sites and surrounding sequence were mapped to the MvO SK1 genome fasta to obtain the equivalent genomic coordinates. ACA sites annotated with MAZTER-seq confidence group > 1 or supported by m6A-Seq were taken as single nucleotide positions. Non-ACA sites were taken from m6A-Seq. If an m6A-Seq window overlapped with one or more single nucleotide sites it was removed from the reference set. In total this produced a set of 882 single nucleotide positions, and 415 80nt windows, amounting to 1297 reference m6A positions.

The comparison of each method in MetaCompore with this orthogonal reference dataset was based on our S. cerevisiae DRS data and limited to transcripts with a coverage of at least 30 reads. The calculations of the TPR/FPR/F1 score/Precision of each method was done at a p-value threshold of 0.01. For each method we constructed a confusion matrix using the following criteria:

True Positives: the number of ground-truth m6A sites overlapping at least one significant kmer according to the given method. The True Positive Rate was further defined as the number of True Positives divided by the total number of m6A sites in the ground-truth set.
True Negatives: the number of not significant DRACH kmers in the transcriptome (limited to transcripts present in the DRS dataset)
False Positives: the number of significant kmers that do not overlap a ground-truth m6A site. The False Positive Rate was further defined as the number of False Positives divided by the sum of False Positives and True Negatives.
False Negatives: the number of ground-truth m6A sites not overlapped by any significant kmer

For the purposes of the calculations above, we used the results tables produced by each method (prior to Metacompore postfiltering) and applied the following criteria to consider a kmer as significant:

Eligos2: reported p-value<=0.01 and odds ratio>1.2 (as recommended by the authors)

Diff_err: reported p-value<0.01 (diff_err results are already filtered by p-value and G-test)

MINES: all sites (MINES only reports significant sites)

Epinano: sites classified as modified (modification probability >0.5)

Tombo: reported p-value<0.01 after Benjamini-Hochberg adjustment

Nanocompore: reported p-value<0.01 and GMM log odds ratio>0.5 (for GMM method only)

For the benchmarks above, the single nucleotide sites identified by each method were extended to 10nt prior to overlapping them with the ground-truth set.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Source link

Vasiprak Blog