Spectrum preprocessing
pGlyco3 can process HCD and ‘HCD+ETxxD’ (HCD-pd-ETxxD, or HCD followed by ETxxD) data for N-/O-glycopeptide identification; here, ETxxD could be either ETD, EThcD or ETciD. Note that pGlyco3 is optimized for both glycan and peptide fragment analysis using sceHCD; hence, sceHCD is always recommended, as discussed in our previous work8. If HCD and ETxxD spectra are generated for the same precursor, pGlyco3 will automatically merge them into a single spectrum for searching. Peaks from HCD and EThcD spectra are merged on the basis of a specified mass tolerance set by the users (±20 ppm by default) in the spectrum preprocessing step. pGlyco3 then deisotopes and deconvolutes all MS2 spectra and removes the precursor ions. pGlyco3 uses pParse to determine the precursor mono ions and to export chimera spectra. The pParse module has been also used in peptide, glycopeptide and crosslinked peptide identification41,42. pGlyco3 filters out nonglycopeptide spectra by checking glycopeptide-diagnostic ions, then searches the glycan parts with the glycan ion-indexing technique. The diagnostic ions can be defined by the users, the default ion is 204.087 m/z for both N-glycosylation and O-glycosylation.
Glycan-first search and glycan ion-indexing
In pGlyco3, each glycan structure is represented by a canonical string in the glycan database. pGlyco3 provides quite a few built-in N- and O-glycan databases, and it supports the conversion of the glycan structures of GlycoWorkbench36 into the canonical strings of pGlyco3. For modified saccharide units, pGlyco3 will automatically substitute one or several unmodified monosaccharides in each canonical string into modified forms, making it very convenient for the modified saccharide unit analysis (Supplementary Note 1).
For each peak in an MS2 spectrum, pGlyco3 assumes it is a Y ion of a glycopeptide with the peptide attached to the database search. In pGlyco 2.0, we calculated the peptide mass for each glycan in the glycan database by ‘precursor mass − glycan mass’, and theoretical Y ion masses of each glycan could be deduced and matched against the MS2 spectrum. This glycan search algorithm has to match the same spectrum for N times (O(N), N is the number of glycans in the glycan database). But pGlyco3 searches for Y-complementary ions (‘precursor mass − Y ion mass’) instead of Y ions; the key observation was that:
$${it{mathrm{peak}}},{it{mathrm{mass}}} = {it{mathrm{peptide}}},{it{mathrm{mass}}} + {it{mathrm{glycan}}},{it{mathrm{Y}}},{it{mathrm{ion}}},{it{mathrm{mass}}} + {it{mathrm{peak}}},{it{mathrm{mass}}},{it{mathrm{error}}}left( { + {it{mathrm{proton}}}} right)$$
$${it{mathrm{precursor}}},{it{mathrm{mass}}} = {it{mathrm{peptide}}},{it{mathrm{mass}}} + {it{mathrm{glycan}}},{it{mathrm{mass}}} + {it{mathrm{precursor}}},{it{mathrm{mass}}},{it{mathrm{error}}}left( { + {it{mathrm{proton}}}} right)$$
$$begin{array}{*{20}{c}} { Rightarrow {mathrm{precursor}},{mathrm{mass}} – {mathrm{peak}},{mathrm{mass}}} & = & {mathrm{glycan},{mathrm{mass}} – {mathrm{glycan}},{mathrm{Y}},{rm{ion}},{mathrm{mass}} + {mathrm{mass}},{mathrm{error}}} \ {} & = & {mathrm{Y-}}{{mathrm{complementary}},{mathrm{mass}} + {rm{mass}},{mathrm{error}}} end{array}$$
To accelerate the Y-complementary ion search, pGlyco3 builds the ion-indexing table for Y-complementary ions of all possible Y ions for the glycan database (Supplementary Note 2). The Y-complementary ion composition is defined as ‘full glycan composition – Y-ion glycan composition.’ A table of all possible unique Y-complementary ion compositions is generated, and the list of glycan IDs where the Y-complementary compositions originate is also recorded in the table. For the glycan-first search, core Y ions (for example, trimannosyl core Y ions in N-glycans; Supplementary Table 2) are the key ions for N-glycan scoring. Hence, pGlyco3 encodes the glycan ID using 31 bits of the 32-bit integer and uses the extra bit to record whether the corresponding Y ion is the core ion of the glycan (Supplementary Note 2). The table is then sorted and hashed by the masses for a fast query (within the O(1) query time). As a result, it takes only O(#Peak) time to obtain the matched ion counting scores as well as the matched core ion counting scores of all glycans for every spectrum; here, #Peak refers to the number of peaks in a spectrum. pGlyco3 retains the glycans with ≥ n core Y ions matched (n = 2 for an N-glycan and n = 1 for an O-glycan). Glycans are further filtered out if they contain a specific monosaccharide but are not supported by corresponding monosaccharide-diagnostic ions (Supplementary Table 1). Finally, pGlyco3 retains the top 100 candidate glycan compositions (sum of the ion counting score and the core ion counting score) for the peptide search. pGlyco3 also retains all small glycans (number of monosaccharides ≤ 3) for the peptide search because small glycans have too few Y ions to obtain high glycan scores.
Peptide search and glycopeptide FDR estimation
In protein sequence processing, every Asn (N) with the sequon ‘N-X-S/T/C (X is not P)’ in all protein sequences is converted to ‘J’ while keeping the same mass and chemical elements as N. J is then the candidate N-glycosylation site, and S/T are the candidate O-glycosylation sites. Proteins are digested into peptide sequences, and peptide modifications are added to the peptide sequences. Modified peptides are also indexed by their masses for O(1) time access. For the given spectrum and each candidate glycan, the peptide mass is deduced by ‘precursor mass − glycan mass’. pGlyco3 then queries the peptide mass from the mass-indexed peptides. For the peptide search, pGlyco3 considers b/y ions and ‘b/y + HexNAc’ in the HCD mode. pGlyco3 further considers c/z ions as well as their hydrogen rearrangement in the merged spectra for the HCD+ETxxD mode. The candidate glycan is also fine-scored by the matched Y ions. The glycan and peptide scoring schemes of pGlyco3 are the same as those of pGlyco 2.0, but some parameters were tuned in pGlyco3 to obtain better identification performance (Supplementary Fig. 6). Only the top-ranked glycopeptide is retained as the final result for each spectrum. For potential chimeric spectra, pGlyco3 removes unreliable mixed glycopeptides by determining whether one’s precursor is another’s isotope. For example, if NeuAc(1) and Fuc(2) are simultaneously identified in the same MS2 scan but with different precursors, the Fuc(2)-glycopeptide will be removed because ‘NeuAc(1) + 1 Da = Fuc(2).’ pGlyco3 also uses the pGlyco 2.0 method to estimate the FDRs for all GPSMs at the glycan, peptide and glycopeptide levels. pGlyco3 skips the glycan FDR estimation step for small glycans.
Fast glycosylation site localization with pGlycoSite
pGlyco3 determines not only the glycosylation sites but also the composition of the glycan attached to each site. For a given spectrum with the identified peptide and glycan composition, enumerating all possible glycopeptide forms and generating their c/z ions for site localization is not computationally easy. The worst computation complexity could be (Oleft( {L times mathop {prod }limits_{i = 1}^T C_{S + G_i – 1}^{G_i – 1}} right)), where L is the peptide length, T is the number of monosaccharide types, S is the number of candidate glycosylation sites, and Gi is the number of the ith monosaccharide type (Supplementary Note 3). The enumeration complexity would be exponentially large as S and Gi increase, as illustrated in Supplementary Note 3.
In pGlyco3, the pGlycoSite algorithm is designed to avoid enumeration. The key observation for the pGlycoSite algorithm is that, regardless of how many glycopeptide forms there are for a given peptide and glycan composition, the number of all possible c or z ions is. at most. F × (L − 1). Here, F is the number of subglycan compositions for the identified glycan composition, and the subglycan is defined in Supplementary Note 3. pGlyco3 generates a c/z ion table in which each cell contains the glycan-attached c/z ions (Supplementary Note 3). After removing all Y and b/y ions from the given spectrum, the ion table with c/z ions is then matched and scored against the spectrum (called the ScoreTable; Fig. 3b and Supplementary Note 3). pGlycoSite currently uses c/z ion counting scores for each cell of the table, but other comprehensive scoring schemes could be supported in the table if they could achieve better performance.
The best-scored path starting from bottom left [g0,0] to top right [G,L] (Fig. 3c and Supplementary Note 3) is then calculated by a dynamic programming algorithm:
$$begin{array}{l}mathrm{BestPath}left[ {g,p} right] =\ left{ {begin{array}{*{20}{c}} {begin{array}{*{20}{c}} {mathop {{max }}limits_{forall g_s le g} mathrm{BestPath}left[ {g_s,p – 1} right] + mathrm{ScoreTable}left[ {g,p} right]} & {mathrm{if},mathrm{IsValidPath}left( {g_s,g,p} right)} end{array}} \ {begin{array}{*{20}{c}} times & {mathrm{if},mathrm{not},exists g_s le g,mathrm{IsValidPath}left( {g_s,g,p} right)} end{array}} end{array}} right.end{array}$$
where G is the identified full glycan composition, g refers to a subglycan composition of G, g0 refers to the zero-glycan composition and p refers to pth position of the peptide sequence. Here, all glycan compositions (from g0 to G) are represented as vectors and hence can be compared with each other. Therefore, gs ≤ g means that gs is the subglycan of g. (mathrm{IsValidPath}left( {g_s,g,p} right)) is designed to check whether the path starting from (left[ {g_s,p – 1} right]) to [g,p] is valid (Supplementary Note 3). pGlycoSite sets BestPath[g,0]=0 ((forall g:g_0 le g le G)) and iteratively calculates the BestPath table for all g0 ≤g ≤G and 0 <p ≤ L. BestPath[G,L] is then the final best path score that will be solved. Finally, pGlycoSite deduces all the paths that can reach the BestPath[G,L] score by backtracking the BestPath table from [g0,0] to [G,L]. If the best-scored path contains the cell [gs,p − 1] and [g,p] with gs < g, then the pth amino acid is localized as a site with glycan g − gs. pGlycoSite introduces the ‘site-group’ if multiple paths can achieve the same BestPath[G,L] score (Fig. 3c and Supplementary Note 3). The time complexity of SSGL in pGlycoSite, including the dynamic programming and backtracking for a GPSM, is only O(L × F2) (Supplementary Note 3).
Site localization probability estimation with pGlycoSite
Glycosylation site probability refers to the probability that a site is correctly localized. As the peptide and glycan compositions have been identified for a given MS2 spectrum, an incorrect localization would result from the random assignment of randomly selected subglycans to random sites for the same peptide and glycan compositions. To simulate the incorrect localization for each localized site, pGlycoSite randomly samples 1000 paths from bottom left to top right on the ScoreTable. For a given site or site-group to be estimated, the random paths could overlap with the BestPath, but they must not contain the path that can determine this site or site-group (that is, path from [gs,pi] to [g,pj] for site pi(j = i + 1) or site-group {pi,pj–1} (j > i + 1)). pGlycoSite then calculates 1,000 ion counting scores of these paths and estimates a Poisson distribution from these random scores. It estimates the P values on the basis of the Poisson distribution for the BestPath[G,L] and the best random score (denoted as RandomBest) and then estimates the probability as follows:
$${mathrm{Prob}}_{mathrm{Poisson}} = frac{{{{{mathrm{log}}}}left( {P{mathrm{value}}left( {{mathrm{BestPath}}left[ {G,L} right]} right)} right)}}{{log left( {P{mathrm{value}}left( {{mathrm{BestPath}}left[ {G,L} right]} right)} right) + {{{mathrm{log}}}}left( {P{mathrm{value}}left( {mathrm{RandomBest}} right)} right)}}$$
To ensure the localized glycopeptide-spectrum-matching quality, pGlycoSite adds a regularization factor to the estimated ProbPoisson, and the final localized probability becomes
$${mathrm{Prob}} = {mathrm{Prob}}_{mathrm{Poisson}} times r = {mathrm{Prob}}_{mathrm{Poisson}} times left( {frac{{mathrm{BestPath}left[ {G,L} right]}}{{2left( {L – 1} right)}}} right)^alpha ,$$
where α is set as a small value (0.05) to ensure that it does not affect the value of ProbPoisson. However, when BestPath[G,L] obtains a very small score, r will be close to zero, hence limiting the final Prob value. L − 1 is the number of considered c/z ions.
Validation of the N-glycopeptide search with 15N-/13C-labeled fission yeast data
The protein sequence database used was the fission yeast protein sequence database (Schizosaccharomyces pombe, Swiss-Prot, 2018_08) concatenated with the mouse protein sequence database (Mus musculus, Swiss-Prot, 2018_08). Identified GPSMs with mouse peptides would be falsely identified and, hence, mouse peptide GPSMs could be used to test the peptide-level error rates. The N-glycan database for MetaMorpheus, MSFragger and Byonic is the 182-glycan database, which includes 74 NeuAc-containing N-glycan compositions. The N-glycan database for pGlyco3 is the built-in mouse N-glycan database, which contains 1,234 N-glycan compositions (6,662 structures) and has 659 NeuAc-contained compositions. NeuAc-containing N-glycan compositions identified in fission yeast data would be falsely identified and thus could be used to test glycan-level error rates. The detailed search parameters are listed in Supplementary Data. For each software tool, all spectra were regarded as unlabeled spectra while searching, and the identified GPSMs were then validated by using their 15N-/13C-labeled precursor signals in the MS1 spectra (Fig. 2a). This validation method was also used in our previous works for peptide, glycopeptide and crosslinked peptide identification8,41,42. Peptide-level and glycan-level FDRs were also tested by using mouse peptides and NeuAc-containing glycans, respectively (Fig. 2a).
Validation of the O-glycopeptide search with IHMO data
In inhibitor-initiated homogenous mucin-type O-glycosylation (IHMO), an O-glycan elongation inhibitor, benzyl-N-acetyl-galactosaminide (GalNAc-O-bn), was applied to truncate the O-glycan elongation pathway during cell culture, generating cells with only truncated HexNAc(1) or HexNAc(1)NeuAc(1) O-glycans. sceHCD-pd-EThcD spectra were generated after O-glycopeptides were enriched by FASP43, and experimental details are shown in Supplementary Note 4. IHMO in HEK-293 cells was then verified by laser confocal microscopy, as displayed in Supplementary Fig. 4. Spectra were then searched by pGlyco3, MetaMorpheus, MSFragger and Byonic, and the search parameters are listed in Supplementary Data. For all software tools, Hex-containing O-glycopeptides could be still identified due to the inhibitor’s imperfect efficiencies. The Hex-contained O-GPSMs were further validated by the summed intensities of the Hex-diagnostic ions (163.060 and 366.139 m/z) in their HCD spectra. The summed intensity threshold was set as 10% of the base peak.
Validation of the pGlycoSite algorithm
For the given SSGL probabilities (Prob) of all identified GPSMs, the SSGL-FDR could be estimated as follows:
$$widehat {mathrm{FDR}}_{SL}left( x right) = frac{{mathop {sum }nolimits_{forall i:1 le i le N,,mathrm{Prob}_i ge x} left( {1 – mathrm{Prob}_i} right)}}{{mathop {sum }nolimits_{i = 1}^N Ileft( {mathrm{Prob}_i ge x} right)}},$$
where (widehat {mathrm{FDR}}_{SL}(x)) is the estimated SSGL-FDR for a given probability threshold x, N is the total number of localized sites and I(bool) is the indicator function that returns 1 when bool is true and 0 otherwise. It is not easy to validate the estimated SSGL probability for a given site, but we can validate the accuracy of (widehat {mathrm{FDR}_{SL}}left( x right)), enabling SSGL probability validation from another perspective. In this work, we designed four methods to validate (widehat {mathrm{FDR}}_{SL}(x)): synthetic double-site N-glycopeptide validation, multienzyme-based validation, entrapment-based validation and OpeRATOR-based validation.
A double-site N-glycopeptide ‘NVN[H(5)N(4)]ISYTVN[H(5)N(4)]DSFFPQRPQK’ was synthesized and its 3+ and 4+ HCD+ETxxD spectra were continuously acquired by using PRM. To enable SSGL validation, we searched the spectra against the human glycan database and a protein database with only the synthetic peptide sequence. Thus, we could always identify the same peptide sequence with different localized glycans. SSGL would be true if the localized glycan was H(5)N(4); otherwise, it was false. The real SSGL-FDR could then be calculated as ({mathrm{FDR}}_{syn} = frac{# {mathrm{False}}}{# {mathrm{False}} space + space # {mathrm{True}}}).
To validate SSGL for double-site N-glycopeptides under more comprehensive situations, we used pGlyco3 to identify and localize the double-site N-glycopeptides with the peptide sequence ‘K.THTN(272)ISESHPN(279)ATF.S’ of IGHM digested with chymotrypsin and trypsin. For all identified site-specific N-glycans, the theoretical masses of further ‘trypsin+Glu-C’-digested glycopeptides were calculated. Then, we used PRM to trigger the HCD spectra of ‘trypsin+Glu-C’-digested glycopeptides and identified these single-site spectra to verify the double-site N-glycopeptides. The SSGL would be true if the localized N-glycan on ‘K.THTN(272)ISESHPN(279)ATF.S’ could be identified by the single-site spectra; otherwise, it was false. The Glu-C-suggested SSGL-FDR could then be calculated as ({mathrm{FDR}}_{mathrm{GluC}} = frac{# {mathrm{False}}}{# {mathrm{False}} space + space # {mathrm{True}}}).
For entrapment-based validation, after N-glycopeptide data were searched, the sites of GPSMs were localized using pGlycoSite by regarding the candidate sites as ‘J/S/T’, which could be enabled by setting ‘glycosylation_sites=JST’ in the search parameter file. For a given GPSM, it would be a true positive (TPtrap) if J were the only localized sites; otherwise, all sites were false positives (FPtrap). The entrapment-based SSGL-FDR could be calculated as
$${mathrm{FDR}}_{mathrm{trap}}left( x right) = frac{{# {mathrm{FP}}_{mathrm{trap}}left( {mathrm{Prob}}space gespace x right)}}{{# {mathrm{FP}}_{mathrm{trap}}left( {{mathrm{Prob}}space gespace x} right)space +space {# {mathrm{TP}}_{mathrm{trap}}}left( {{mathrm{Prob}}space ge space x} right)}}$$
for a given probability threshold x. Then, the SSGL probabilities of pGlycoSite could be validated by comparing (widehat {mathrm{FDR}}_{mathrm{SL}}left( x right)) with FDRtrap(x).
For OpeRATOR-based validation, we used the data digested by OpeRATOR37. Only the GPSMs with their peptides starting with ST at the N-terminal were used for the validation. Then, for a given GPSM, we regarded it as the true positive (TPOpR) if localized sites contained a site that was at the N-terminal S/T otherwise, it was regarded as a false positive (FPOpR). The OpeRATOR-based SSGL-FDR (FDROpR(x)) could be calculated from TPOpR(x) and FPOpR(x), which is similar to FDRtrap(x).
Comparisons of the (widehat {mathrm{FDR}}_{mathrm{SL}}left( x right)) of pGlycoSite with FDRtrap(x) and FPOpR(x) are displayed in Fig. 3d and Supplementary Fig. 2. We also compared (widehat {mathrm{FDR}}_{mathrm{SL}}left( x right)) of MetaMorpheus with FDROpR(x), as shown in Supplementary Fig. 2.
Analysis of aH-glycopeptides in yeast samples
‘aH’ is defined as a Hex with an ammonia adduct. Peptides were searched by the yeast protein sequence databases (S. pombe for fission yeast and Saccharomyces cerevisiae for budding yeast, Swiss-Prot, 2018_08). N-glycan parts were searched against the high-mannose-only N-glycan database, and O-Man glycan parts were searched against the Hex-only glycan database. aH was regarded as a modified Hex for the pGlyco3 search, and the maximal number of aHs was set as two per glycan. For the O-Man-glycopeptide search, the glycopeptide-diagnostic ion was set as Hex (163.060 m/z). The 15N-/13C-labeled fission yeast (PXD005565) results were also validated by the 15N-labeled and 13C-labeled precursor signals in the MS1 spectra. For the 15N/13C validation of aH identifications, as the ammonia adduct may be introduced during sample processing or MS steps, it could not be labeled by 15N; hence, we computationally designed a new element called ‘14N’, which would not be converted into 15N for MS1 signal extraction. The element composition is recorded in the ‘element.ini’ file in the software package, demonstrating the flexibility of pGlyco3 for the analysis of new monosaccharides or modified saccharide units. We also verified the aH-N-glycans by analyzing the MS data of released N-glycans in fission yeast samples (Supplementary Note 4).
O-Man glycopeptides of fission yeast were also analyzed by HCD followed by EThcD to investigate the O-mannosylation sites. The data were searched by the ‘HCD+EThcD’ mode of pGlyco3, and the sites were localized by pGlycoSite. See Supplementary Note 4 for details.
Reporting Summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

