Metagenomic analysis of hot spring sediment
We employed a sequencing-based metagenomics approach to mine CBH genes from environmental DNA isolated from hot spring sediments in Miyagi Prefecture, Japan. The first DNA sample, named AR19, was sequenced in triplicate using 454 pyrosequencing, which yielded a total of 2,766,332 reads with an average sequence length of 400 ± 55 bp, totalling 1.1 Gbp of sequencing data (Table S1). Of these, 17,991,567 reads (68.4%) were assembled into contigs ≥1 kb (595,602 contigs). The largest contig was 278,185 bp. Phylogenetic binning of all contigs and singletons in AR19 was performed using BLAST, and the results were compared with the KEGG database36 to classify the data into bacterial, archaeal, eukaryotic, viral or unclassified sets. Contigs and singletons classified as bacterial and archaeal accounted for 59.9 and 3.0 Mbp, respectively, whereas the unclassified set comprised 266.7 Mbp, which suggests that most (80.8%) of the sequences obtained from AR19 were of unknown origin. Eukaryotic and viral sequences made only minor contributions.
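The data volumes quoted above follow from simple arithmetic on the read statistics. A minimal sketch using only numbers stated in the text; the function name is illustrative and not part of any pipeline used in the study, and the eukaryotic and viral sets are left out because their sizes are not given here:

```python
def total_bases(n_reads: int, mean_read_length_bp: float) -> float:
    """Approximate sequencing yield in base pairs."""
    return n_reads * mean_read_length_bp


# Read count and mean read length as reported for sample AR19.
yield_gbp = total_bases(2_766_332, 400) / 1e9

# Binned sequence per class in Mbp. The minor eukaryotic and viral sets are
# not quantified in the text, so the share below is a slight overestimate.
binned_mbp = {"bacteria": 59.9, "archaea": 3.0, "unclassified": 266.7}
unclassified_share = binned_mbp["unclassified"] / sum(binned_mbp.values())

print(f"yield: {yield_gbp:.2f} Gbp")
print(f"unclassified share of binned sequence: {unclassified_share:.1%}")
```

The share computed this way is marginally above the reported 80.8%, as expected when the small eukaryotic and viral sets are omitted from the denominator.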
Table S2 shows the carbohydrate-active enzyme (CAZy) annotation of cellulases predicted with high significance (E-value < 1 × 10⁻⁵) to correspond to an enzyme in the CAZy database5. The table also shows the CAZy family-associated protein domain (Pfam) annotation. Overall, we predicted the presence of 3378 GHs (1.96% of all open reading frames (ORFs)). A total of 75 ORFs were identified as putative cellulases (endoglucanases and cellobiohydrolases) belonging to the families GH5, GH6, GH9, GH44 and GH48. This corresponds to 6.1% of all GH enzymes. Excluding GH44, 70 of these ORFs (5.7% of all GHs) were considered CBH candidates.
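The selection logic described above (an E-value cutoff, membership of the cellulase families, and exclusion of GH44 for the CBH candidates) can be sketched as a simple filter. The `Hit` record, the ORF identifiers and the example E-values are hypothetical; only the cutoff and the family names come from the text.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-5
CELLULASE_FAMILIES = {"GH5", "GH6", "GH9", "GH44", "GH48"}
CBH_CANDIDATE_FAMILIES = CELLULASE_FAMILIES - {"GH44"}


@dataclass(frozen=True)
class Hit:
    """One annotated ORF (hypothetical record layout)."""
    orf_id: str
    family: str
    e_value: float


def select(hits, families, cutoff=E_VALUE_CUTOFF):
    """Keep hits below the E-value cutoff that belong to the given families."""
    return [h for h in hits if h.e_value < cutoff and h.family in families]


# Made-up hits for illustration only.
hits = [
    Hit("orf_001", "GH6", 3e-42),
    Hit("orf_002", "GH44", 1e-20),
    Hit("orf_003", "GH5", 2e-3),    # fails the E-value cutoff
    Hit("orf_004", "GH13", 1e-50),  # significant, but not a cellulase family
]

putative_cellulases = select(hits, CELLULASE_FAMILIES)
cbh_candidates = select(hits, CBH_CANDIDATE_FAMILIES)
print([h.orf_id for h in putative_cellulases], [h.orf_id for h in cbh_candidates])
```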
DNA sequences of the 70 ORFs were amplified from the environmental DNA by PCR with specific primers designed for each ORF and cloned into E. coli vectors. Since the contig sequences were assembled from a mixture of closely related sequences present in the environmental DNA, the cloned sequence of a particular ORF may not be identical to the sequence predicted from the assembled contigs. Thus, we sequenced at least 10 clones for each ORF to confirm the sequences. In most cases, multiple genetic variants were identified for each ORF.
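Tallying the clones sequenced for one ORF against the contig-derived prediction is straightforward. The sketch below assumes that clone sequences are already trimmed to the ORF and compared as exact strings; the sequences themselves are invented for illustration.

```python
from collections import Counter


def tally_clone_variants(clone_sequences, predicted_sequence):
    """Summarise how many distinct sequences were recovered for one ORF."""
    counts = Counter(seq.upper() for seq in clone_sequences)
    return {
        "clones": sum(counts.values()),
        "distinct_variants": len(counts),
        "identical_to_prediction": counts[predicted_sequence.upper()],
        "ranked": counts.most_common(),
    }


# Ten hypothetical clones of a toy ORF fragment.
predicted = "ATGGCTCCGAGC"
clones = ["ATGGCTCCGAGC"] * 6 + ["ATGGCTTCGAGC"] * 3 + ["ATGGCTCCGAGT"]

summary = tally_clone_variants(clones, predicted)
print(summary["clones"], summary["distinct_variants"], summary["identical_to_prediction"])
```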
Enzyme characterisation
The DNA sequences that were predicted as CBHs, together with their genetic variants, were expressed in E. coli. Enzymatic assays of hydrolase activity with phosphoric acid-swollen Avicel (PSA) were then performed. Most showed no or very weak activity toward the substrate, which is consistent with reports that the expression of active CBHs is difficult in E. coli37, likely owing to improper assembly of the proteins in the host. As a result, none of the CBH candidates belonging to GH families 5, 9 or 48 were PCR-cloned or expressed in E. coli. Nevertheless, two GH6 CBH genes showed significant activity, and were eventually identified and cloned for heterologous expression in E. coli. The observed activity was also confirmed using the crystalline cellulose Avicel. The catalytic domains of the two genes shared a high degree of amino acid sequence identity (80% identity over 344 equivalent residues).
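An identity figure such as "80% over 344 equivalent residues" (roughly 275 identical positions) is a count over aligned, non-gap columns. A minimal sketch, assuming the two sequences have already been aligned to equal length with `-` as the gap character; the toy alignment is made up, and the alignment method used in the study is not specified in this section.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-"):
    """Return (percent identity, number of equivalent positions).

    Columns in which either sequence has a gap are not counted as
    equivalent residues.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no equivalent positions to compare")
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs), len(pairs)


identity, equivalent = percent_identity("ACD-EF", "ACDGEY")
print(f"{identity:.1f}% identity over {equivalent} equivalent residues")
```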
Among these, a GH6-family hydrolase, named HmCel6A (hot spring metagenome-derived cellulase family 6A), showed the highest activity. We therefore focused on HmCel6A and its genetic variants. The genes appeared to encode a full-length CBH catalytic domain. Homology searches against multiple databases showed that HmCel6A shared 76% amino acid sequence similarity with the GH6 catalytic domain from Ardenticatena maritima, a ferric iron- and nitrate-reducing bacterium belonging to the phylum Chloroflexi38. The phylogenetic tree suggests that this CBH is of bacterial origin (Fig. 1).


Bootstrap values at branch points are indicated for 10,000 replicates and shown as percentages. Scale bar = 0.1 amino acid substitutions per site. Branches corresponding to partitions that were reproduced in <50% of bootstrap replicates are collapsed. The tree is drawn to scale, with branch lengths in the same units as those of the evolutionary distances that were used for phylogenetic tree construction. The sequences of three structure-determined bacterial CBHs and CBH II from Trichoderma reesei (Hypocrea jecorina) were also included. Species belonging to high G + C Gram-positive bacteria, green non-sulfur bacteria and γ-proteobacteria are coloured green, blue and olive, respectively.
This recombinant HmCel6A showed hydrolytic activity against the crystalline cellulose Avicel and PSA (Table S3). The optimum temperature (Topt) with PSA and the melting temperature (Tm) were 75 °C and 80 °C, respectively, at the optimum pH of 5.5 (Figs. 2a, b and S1). The addition of calcium ions to the reaction mixture improved the thermostability, as seen in a thermal shift assay39 (Fig. 2b). Further metagenomic analysis included the identification and activity characterisation of 12 genetic variants of HmCel6A (Table S4). HmCel6A-3SNP, isolated from the metagenomic sample OSJ2, had three amino acid replacements relative to HmCel6A (P88S/L230F/F414S), exhibited the highest Topt of 95 °C with PSA as a substrate (Fig. 2a) and had a Tm of 96.0 °C in the presence of calcium (Fig. 2b). This provided us with a unique opportunity to investigate the effect of individual amino acid residues on thermostability.
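The variant label P88S/L230F/F414S uses the usual one-letter substitution notation (wild-type residue, position, replacement). A small parser makes the convention explicit; the class and function names are illustrative only.

```python
import re
from typing import NamedTuple


class Substitution(NamedTuple):
    wild_type: str
    position: int
    variant: str


_SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_substitutions(label: str, separator: str = "/"):
    """Parse a label such as 'P88S/L230F/F414S' into Substitution records."""
    parsed = []
    for token in label.split(separator):
        match = _SUBSTITUTION.match(token.strip())
        if match is None:
            raise ValueError(f"not a single-residue substitution: {token!r}")
        parsed.append(Substitution(match.group(1), int(match.group(2)), match.group(3)))
    return parsed


for sub in parse_substitutions("P88S/L230F/F414S"):
    print(f"position {sub.position}: {sub.wild_type} -> {sub.variant}")
```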


a Optimum temperature (Topt) of HmCel6A and its 3SNP variant with the PSA substrate in the presence (solid squares and circles) and absence (white squares and circles) of 3 mM CaCl2. b Melting temperature (Tm) of wild-type and mutant enzymes in the presence (black bars) and absence (grey bars) of 3 mM Ca2+/40 mM EDTA, as determined in a thermal shift assay. Each error bar represents the standard error.
Overall structure of HmCel6A
The crystal structure showed a (β/α)8-barrel core (Figs. 3, S2a) and putative catalytic residues, such as the catalytic acid Asp140, that are generally conserved among the GH6 enzymes23,40,41. Although GH6 includes both EG and CBH II, all the GH6 CBHs share the active-site loop and the extended bottom loop, which form the active-site tunnel23 (Fig. S2b). These are also known as the N-terminal and C-terminal loops in fungal enzymes. The structure of HmCel6A is most similar to those of three bacterial CBH II enzymes: TfCel6B from the soil cellulolytic actinomycete Thermobifida fusca23, CfCel6B from the cellulolytic facultative anaerobe Cellulomonas fimi42 and XooCbsA from the phytopathogenic bacterium Xanthomonas oryzae pv. oryzae43. In particular, the three loops located around the substrate entry and exit sites were common and characteristic among the bacterial enzymes.


a, b Two views (in ribbon representation) of the Ca2+-bound structure. The protein chain is blue to red from the N- to C-terminus. Calcium ions, CA1 and CA2, are displayed as green spheres. The active site is enclosed in a tunnel formed by interactions between the extended bottom loop and the active site loop. Catalytically important residues and disulphide bonds are shown in stick representation. a Ribbon representation of HmCel6A overlaid on the Connolly surface representation. b View showing the β/α barrel structure with a central β-barrel comprising nine numbered strands.
On the other hand, we identified several unique characteristics of this enzyme that presumably contribute to thermostability. In terms of global properties, HmCel6A is rich in hydrophobic clusters and charge–charge interactions (Table S5). Hydrophobic clusters are mostly observed in the major lobe of the GH6 enzymes. Among all the known GH6 CBH structures, HmCel6A has the largest overlapping area and HmCel6A-3SNP has the largest cluster, consisting of 145 contacts. The charge interactions were reflected in an increased number of salt bridges and in the lowest free energy of protein charge–charge interactions, as formulated with the Tanford–Kirkwood Solvent Accessibility (TKSA) model19, which accounts for the effects of solvent polarization on charged atoms in proteins. The number of hydrogen bonds is not notably different, but specific hydrogen bond networks are observed. Further, we found several structural elements that may contribute to thermostability: an additional calcium ion, a disulphide bond located on the protein surface, interactions between the active-site loop and the bottom loop, and two shortened loops located at the substrate entry and exit sites (Fig. 3). Some details are described in the next section.
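Salt bridges are commonly counted with a simple distance criterion between oppositely charged side-chain atoms. The sketch below is one such criterion and not the procedure behind Table S5: the 4.0 Å cutoff, the atom sets and the toy coordinates are all assumptions made for illustration.

```python
import math

# Side-chain atoms treated as charged (PDB atom names); histidine is omitted.
ACIDIC = {"ASP": {"OD1", "OD2"}, "GLU": {"OE1", "OE2"}}
BASIC = {"LYS": {"NZ"}, "ARG": {"NE", "NH1", "NH2"}}
CUTOFF_ANGSTROM = 4.0  # assumed distance criterion, not taken from the paper


def find_salt_bridges(atoms, cutoff=CUTOFF_ANGSTROM):
    """Return the set of (acidic residue, basic residue) pairs in contact.

    atoms: iterable of (res_name, res_number, atom_name, (x, y, z)).
    A residue pair is counted once, however many atom pairs are in range.
    """
    atoms = list(atoms)
    acids = [a for a in atoms if a[2] in ACIDIC.get(a[0], ())]
    bases = [a for a in atoms if a[2] in BASIC.get(a[0], ())]
    bridges = set()
    for acid in acids:
        for base in bases:
            if math.dist(acid[3], base[3]) <= cutoff:
                bridges.add(((acid[0], acid[1]), (base[0], base[1])))
    return bridges


# Toy coordinates in angstroms; residue labels are used only as examples.
toy_atoms = [
    ("ASP", 371, "OD1", (0.0, 0.0, 0.0)),
    ("ASP", 371, "OD2", (1.0, 0.5, 0.0)),
    ("ARG", 408, "NH1", (2.9, 0.0, 0.0)),
    ("ARG", 408, "NH2", (3.0, 1.0, 0.0)),
    ("LYS", 378, "NZ", (10.0, 0.0, 0.0)),
]
print(find_salt_bridges(toy_atoms))
```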
Key structural elements for thermostability
One unique structural feature is calcium binding. Unlike the fungal GH6 CBHs, HmCel6A, as a bacterial enzyme, has metal-binding sites, in which the bound metals were identified as calcium ions originating from the crystallization condition. In addition to the CA1 site shared with TfCel6B (Fig. S2d), a unique metal-binding site (CA2) is located on the loop between β6 and α1 (Fig. S2c). The effect of calcium was experimentally verified by adding a calcium salt to the enzyme solution, which enhanced its thermostability (Fig. 2): Topt with PSA was 75 °C and 80 °C in the absence and presence of 3 mM calcium, respectively, and Tm was 80.5 °C and 85.5 °C, respectively. In the crystal structure of the 3SNP variant, neither metal ion was observed, as the crystals were grown in a metal-free solution, but the effect of calcium on the enzyme activity was retained. The effects of other metal ions were also examined, as shown in Fig. S3. We did not observe any improvement in Topt, but manganese, rather than calcium, enhanced enzyme activity, whereas ferric and zinc ions reduced the activity under these conditions. This result is largely consistent with previous reports for other CBHs44,45.
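Tm values from a thermal shift assay are often read off as the temperature at which the melt curve rises most steeply. The sketch below applies that first-derivative estimate to a synthetic two-state curve; how the Tm values reported here were actually extracted is not stated in this section, and the curve shape, slope parameter and temperature grid are assumptions.

```python
import math


def synthetic_melt_curve(tm, slope=1.5, t_start=60.0, t_end=100.0, step=0.5):
    """Idealised sigmoidal unfolding signal sampled on a regular grid."""
    n_points = int(round((t_end - t_start) / step)) + 1
    temps = [t_start + i * step for i in range(n_points)]
    signal = [1.0 / (1.0 + math.exp((tm - t) / slope)) for t in temps]
    return temps, signal


def estimate_tm(temps, signal):
    """Temperature at which the central-difference derivative is largest."""
    best_temp, best_slope = None, float("-inf")
    for i in range(1, len(temps) - 1):
        slope = (signal[i + 1] - signal[i - 1]) / (temps[i + 1] - temps[i - 1])
        if slope > best_slope:
            best_temp, best_slope = temps[i], slope
    return best_temp


# Simulated curves centred on the Tm values quoted above (without and with calcium).
tm_without_ca = estimate_tm(*synthetic_melt_curve(80.5))
tm_with_ca = estimate_tm(*synthetic_melt_curve(85.5))
print(tm_without_ca, tm_with_ca, tm_with_ca - tm_without_ca)
```

With real data, the derivative is usually smoothed first, because measurement noise makes a point-by-point maximum unreliable.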
Another distinctive feature of HmCel6A is its disulphide bonds. Two of the disulphide bonds observed in the crystal structure of HmCel6A (Cys92-Cys154 and Cys331-Cys383) are typically found in GH6 CBHs23,40,41, and presumably stabilise the tunnel-forming loops. An additional third bond (Cys295-Cys300) in HmCel6A forms a short ring structure consisting of six residues. This additional bond is not present in other GH6 CBHs (Fig. S4). The ring fills a cavity in the molecular surface and interacts with other structural elements, thereby possibly contributing to the enzyme’s structural stabilisation (Fig. S2e). Indeed, when the ring was opened by the C295A mutation, Tm decreased by 6.5 °C, and deletion of the ring itself further decreased Tm by 11.5 °C (Fig. 2b).
The three mutations in the most thermostable variant, 3SNP, affected only the local structure, although some hydrophobic interactions were replaced by charged interactions relative to the wild-type enzyme. Phe414, located in a hydrophobic core, was replaced with Ser, introducing the Trp409-Gln23-Glu415-Thr27 hydrogen bond network at the molecular surface. This replacement was the most effective of the three mutations, since it increased Tm as a single mutation. Ser88 introduced an intramolecular water molecule and might compensate for the cavity around the residue. Phe230 might form π-π and/or anion-π interactions with the neighbouring residues, Tyr352 and Glu231. Together, these structural features appear to improve the thermostability of HmCel6A, and could be engineered into other GH6 enzymes; however, this replacement reduced the relative activity to 20%–30% at the Topt of the wild-type (WT) enzyme (Fig. 2a).
Structural basis of catalytic cycle
The catalytic cycle in GH6 CBHs consists of four modes: pre-slide mode, slide mode, Michaelis complex and substrate-product complex35. We identified three of these modes in HmCel6A, but were unable to capture the slide mode in the crystal structures. The mobility of the well-conserved active-site loop and its open/closed flexibility are thought to contribute to processive hydrolysis by driving the catalytic cycle. Ser97, the key residue for this motion, forms hydrogen bonds with the main chain atoms of Gly99 and with the proton acceptor Asp222 in the open conformation, and moves toward subsite −1 in the closed conformation after the substrate slides to subsites −1 and −2 (ref. 35).
The pre-slide mode was identified in chains B and C of the crystal structure in complex with cellotriose (Glc3). In this complex, the substrate occupied only the +1 to +3 subsites, and each glucose moiety adopted the 4C1 conformation, as observed in other GH6 enzymes. The active-site loop adopted the open conformation. In HmCel6A, the main chain carbonyl of Ser97 uniquely formed a hydrogen bond with Nζ of Lys378, located in the bottom loop. Thus, the active-site loop opened slightly toward the solvent region, adopting the so-called ‘even more open’ conformation.
The Michaelis complex mode was observed in the crystal of the complex between cellohexaose (Glc6) and the inactivated enzyme, in which the catalytic acid residue was mutated (Asp140Ala). The substrate occupied subsites −3 to +4, with partial occupancy at both ends. Ligand binding affected the active-site loop, in which the Asn98 side chain adopted the closed form. The puckering conformation of the glucose moiety at the −1 subsite was 2SO. This conformation has been well documented, since it plays a central role in the activation of the substrate and product expulsion.
The substrate-product complex was obtained as the structure of chain A in the Glc3 complex, in which one cellotriose was bound to subsites +1 to +3, and another cellotriose was bound in two binding modes, at either subsites −4 to −2 or subsites −3 to −1 (Fig. 4). Even though the electron density at the −1 subsite was weakened by partial occupancy, the glucose moiety there appeared to have undergone cleavage of the covalent bond between subsites +1 and −1, and to adopt a half-chair or envelope conformation (2H1 or 2E: φ = 105.421°, θ = 50.914°, Q = 0.682). This conformation differed from the skew-boat conformation (2SO), but was similar to 2,5B, as observed in HiCel6A complexed with a cellobiose derivative46. The broad electron density around the O1 atom and the residual electron density around the C1 atoms were considered to partly reflect the Michaelis complex (Fig. S5a).
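The parameters φ, θ and Q quoted above are Cremer–Pople puckering coordinates. A minimal sketch of how they are obtained for a six-membered ring is given below, following the standard Cremer–Pople definitions. The atom order is an assumption (for a pyranose, conventionally O5, C1, C2, C3, C4, C5), and the value of φ and the sign convention of θ depend on that order; the idealised test rings are synthetic and are not coordinates from the structures reported here.

```python
import math


def cremer_pople(ring):
    """Puckering parameters (Q, theta in degrees, phi in degrees) of a six-membered ring.

    ring: six (x, y, z) tuples listed in ring order.
    """
    n = len(ring)
    if n != 6:
        raise ValueError("this sketch handles six-membered rings only")
    centre = [sum(p[k] for p in ring) / n for k in range(3)]
    pts = [[p[k] - centre[k] for k in range(3)] for p in ring]

    # Mean-plane normal from the two auxiliary vectors R' and R''.
    r1 = [sum(pts[j][k] * math.sin(2 * math.pi * j / n) for j in range(n)) for k in range(3)]
    r2 = [sum(pts[j][k] * math.cos(2 * math.pi * j / n) for j in range(n)) for k in range(3)]
    normal = [
        r1[1] * r2[2] - r1[2] * r2[1],
        r1[2] * r2[0] - r1[0] * r2[2],
        r1[0] * r2[1] - r1[1] * r2[0],
    ]
    length = math.sqrt(sum(c * c for c in normal))
    z = [sum(p[k] * normal[k] for k in range(3)) / length for p in pts]

    # Puckering amplitudes for m = 2 and m = 3.
    q2_cos = math.sqrt(2 / n) * sum(z[j] * math.cos(4 * math.pi * j / n) for j in range(n))
    q2_sin = -math.sqrt(2 / n) * sum(z[j] * math.sin(4 * math.pi * j / n) for j in range(n))
    q3 = math.sqrt(1 / n) * sum((-1) ** j * z[j] for j in range(n))

    q2 = math.hypot(q2_cos, q2_sin)
    total_q = math.hypot(q2, q3)
    theta = math.degrees(math.atan2(q2, q3))
    phi = math.degrees(math.atan2(q2_sin, q2_cos)) % 360.0
    return total_q, theta, phi


def hexagon(z_offsets, radius=1.4):
    """Regular hexagon with per-atom out-of-plane displacements (test helper)."""
    return [
        (radius * math.cos(math.pi * j / 3), radius * math.sin(math.pi * j / 3), dz)
        for j, dz in enumerate(z_offsets)
    ]


chair = cremer_pople(hexagon([0.25, -0.25] * 3))
boat = cremer_pople(hexagon([0.5, -0.25, -0.25, 0.5, -0.25, -0.25]))
print(chair, boat)
```

An ideal chair lies at a pole of the puckering sphere (θ of 0° or 180°) and boats and skew-boats lie on the equator (θ = 90°), so a θ near 51° corresponds to the half-chair and envelope region in between.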


a Glc3 oligomers bound to the wild-type enzyme. b Glc6 oligomer bound to the D140A mutant of HmCel6A. Electron density was calculated as an omit Fo-Fc map and contoured at 3σ.
While the active-site loop is generally indispensable for catalysis in the GH6 CBH II enzymes, the bottom loop might contribute only to tunnel formation. Its sequences are diverse and have been provisionally categorised into seven groups47. The bottom loop of HmCel6A could not be assigned to any of these groups, and we found some unique structural features related to its rigidity: (i) it uniquely contained three prolines (QPGIVDPDDPNKK), (ii) it had Lys378, which maintained the open conformation of the active-site loop and (iii) it had three ionic interactions (Asp371 formed a salt bridge with Arg408, the main chain carbonyl of Asp373 interacted with Arg50, and Asp374 interacted with Lys377). Nonetheless, the bottom loop moved cooperatively to close the active-site loop, and introduced several hydrogen bonds that fix the reaction intermediates, such as Arg50-Asp374, Asn198-Lys378, Asp380-Ala95 NH2, Asn376-Gly99 CO and Arg90-Ser97 CO.
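The residue numbers quoted for the bottom loop can be cross-checked against its sequence. The starting residue number (366) is not stated in the text; it is inferred here as the only offset at which Asp371, Asp373, Asp374, Asn376, Lys377 and Lys378 all line up with the quoted sequence, and should be treated as an assumption.

```python
BOTTOM_LOOP = "QPGIVDPDDPNKK"
LOOP_START = 366  # assumed: inferred from the residues named in the text


def number_residues(sequence: str, start: int) -> dict:
    """Map residue numbers to one-letter codes for a contiguous segment."""
    return {start + i: aa for i, aa in enumerate(sequence)}


numbered = number_residues(BOTTOM_LOOP, LOOP_START)
proline_positions = [pos for pos, aa in numbered.items() if aa == "P"]

# Residues named in the text and the one-letter code each should carry.
named = {371: "D", 373: "D", 374: "D", 376: "N", 377: "K", 378: "K"}
consistent = all(numbered[pos] == aa for pos, aa in named.items())

print(proline_positions, consistent)
```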
The smooth expulsion of the reaction product from each subsite is essential to avoid product inhibition and to obtain the highest enzyme efficiency. GH6 CBH enzymes might have originally evolved from EG, which has resulted in a high binding affinity at subsites −3 and −4 that tends to retain the end product there. TfCel6B has an extended exit loop that acts as a “gatekeeper,” moving substantially into this region upon substrate binding but without any direct interaction with the saccharide at subsite −2 (ref. 47). Most fungal enzymes have no exit loop, so product binding cannot be ignored, as observed in a crystal structure of HiCel6A, PDB-ID 1OCB (24). HmCel6A has a short exit loop fixed by a salt bridge between Arg58 in the loop and Asp15 in the α1 helix of the β/α barrel core; thus, its mobility and gatekeeper role might be lost. Asp49, Arg50 and Glu360 possibly contribute to saccharide binding at subsite −3. All these residues are found in TfCel6B, and the aspartate and the glutamate are also observed in a bacterial EG, TfCel6A. To examine the role of the extended exit loop of TfCel6B, its insertion into HmCel6A was tested. The extension tended to reduce activity (Fig. S6), which might relate to product inhibition at higher temperatures, but further investigation is required.
Characteristics of substrate recognition
As described above, the substrate recognition scheme of HmCel6A is almost the same as that of other GH6 CBHs. To reveal this scheme in more detail, we analysed the binding of substrates with different degrees of polymerisation to the WT enzyme and the catalytically deficient D140A mutant using surface plasmon resonance (SPR) (Fig. 5a). The results for the WT enzyme might underestimate the dissociation rate constant (koff) because of product binding. Cellobiose (Glc2) and Glc3 were not hydrolysed, or were hydrolysed only very slowly, by the enzyme. In contrast, PSA and cellotetraose (Glc4) or longer substrates can be hydrolysed, and the resulting products may then bind in a manner similar to Glc2 and Glc3.


a SPR kon–koff plot for substrates with various degrees of polymerisation. Cyan and orange ellipsoids correspond to enzymes with and without the catalytically deficient mutation (D140A), respectively. Each error bar represents the standard error. b Enzyme activity against temperature. Relative activity against the wild-type enzyme and enzyme units are drawn as bars and lines, respectively. Each of the two tryptophans affects turnover at lower temperatures and enzyme stability at higher temperatures.
Glc2 is not hydrolysed even by the catalytically active enzymes; therefore, its affinities relate directly to product inhibition. Its affinities for the WT and D140A enzymes were almost identical. This is reasonable, since almost all the substrates stayed away from subsite −1 in the reported crystal structures of catalytically active GH6 CBHs, where Glc2 was mostly found at the +1 and +2 subsites41. Of course, Asp140 contributes to substrate recognition at the +1 subsite; therefore, the affinity at subsite +1 might be lowered in the mutant. On the other hand, subsites −1 and +1 generally have a lower affinity, because the saccharide ring is distorted into an energetically unfavourable conformation for the activation of the scissile bond. The other sites, especially the −2 and +2 subsites, accommodate short chains and hold the sugar chain in place for activity35. Glc3, towards which HmCel6A exhibits almost negligible catalytic activity, showed a slightly higher affinity owing to the additional interactions of one more sugar unit with the enzyme. In fact, the structure of the Glc3 complex of the WT showed Glc3 at subsites +1 to +3, as described above. This binding mechanism at the exit site might not be directly related to product inhibition by Glc2. If this ligand came not from the entry side of the substrate tunnel but from its exit side, binding of Glc2 at subsites +2 and +1 would require Glc2 to first occupy subsites +3 and +2 and then not only slide but also rotate.
Total affinity increased with chain length for the longer, hydrolysable oligomers from Glc4 to Glc6, as clearly observed in the results for the inactivated D140A mutant. As described above, the −2 and +2 subsites may play a predominant role in the affinity of the enzyme, since the tetrasaccharide has a stronger affinity than the shorter oligomers. Of course, even though the D140A mutation inactivates the enzyme, it might also affect binding affinity at the −1 or +1 subsites. In fact, the sugar was skewed at the −1 subsite in our Glc6 complex crystal structure. An important residue for maintaining this skewing is Tyr169 in TrCel6A, corresponding to Tyr85 in HmCel6A, because its replacement with phenylalanine introduces space around the catalytic site and reduces the constriction of the sugar40. Similarly, the shortened side chain at position 140, which flanks the tyrosine, might somewhat increase the affinity of the −1 and +1 subsites by the same mechanism. The WT enzyme showed mostly similar binding affinities for all these oligomers, which gradually degrade during measurement. The koff values reflected the shorter chains that were produced, whereas the kon values were maintained despite the influence of these shorter chains. The PSA koff value was slightly higher than that of Glc6 in the D140A mutant. It is unclear whether the difference was caused by the presence of additional subsites. We concluded that the differences in the kon values depend on the crystallinity of these substrates.
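The kon and koff values discussed above combine into an equilibrium dissociation constant, Kd = koff/kon, so a lower koff or a higher kon both mean tighter binding. A minimal sketch with made-up rate constants (none of these numbers are measurements from this study):

```python
def dissociation_constant(k_on: float, k_off: float) -> float:
    """Equilibrium dissociation constant Kd = koff / kon.

    k_on in M^-1 s^-1 and k_off in s^-1 give Kd in M.
    """
    if k_on <= 0:
        raise ValueError("k_on must be positive")
    return k_off / k_on


# Hypothetical rate constants for a short and a long oligomer.
kd_short = dissociation_constant(k_on=1.0e4, k_off=1.0e-1)
kd_long = dissociation_constant(k_on=1.0e4, k_off=1.0e-2)
print(f"Kd short: {kd_short:.1e} M, Kd long: {kd_long:.1e} M")
```

In this toy case the two ligands associate equally fast, and the tenfold tighter binding of the longer one comes entirely from its slower dissociation.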
Effects of tryptophan on catalysis
Tryptophan is a well-known residue that supports saccharide binding in GHs. Its effect has been confirmed in both CBH I and II48. In HmCel6A, there are five tryptophans around the active-site tunnel (Trp47, Trp189, Trp192, Trp255 and Trp330). Among them, we focused on Trp192 and Trp255, located on the entry side of the tunnel, to examine the catalysis in further detail (Fig. S5b). This is because binding of the cellulose chain at the entry side initiates enzymatic processing, and both residues might therefore determine the ability to capture the substrate. Trp192 is well conserved in CBH II and forms subsite +4 (refs. 49,50), while Trp255 is observed in bacterial CBHs and seems to attract substrates as an additional subsite +6 (ref. 23). We constructed catalytically active and catalytically deficient (D140A) mutants for both residues, and then analysed them using an enzyme assay and surface plasmon resonance (SPR) (Fig. 5).
Both mutations reduced the overall affinity, which indicates that the tryptophans contribute to substrate recognition. In fact, the affinity for non-hydrolysable Glc2 was reduced in both mutants of the catalytic enzyme (Fig. 5a), which might simply reflect their contribution to the affinity. A similar SPR experiment on a GH18 chitinase showed that mutation of the tryptophans forming the +2 or −3 subsites resulted in losses of 8.2 or 5.7 kJ/mol, respectively, as calculated from the Kd for (GlcNAc)4 (ref. 51). Our inactivated mutants did not show any significant differences. The reason for this is not clear, but it may be due to compensation by the affinity gained through the D140A mutation at the +1 and −1 subsites, as described above.
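Energy losses of this kind are obtained from the ratio of dissociation constants through ΔΔG = RT ln(Kd,mutant / Kd,reference). The sketch below shows the conversion in both directions; the temperature of 25 °C is an assumption, since the measurement temperature of the cited experiment is not given in this section.

```python
import math

GAS_CONSTANT = 8.314462618  # J mol^-1 K^-1


def ddg_from_kd(kd_mutant: float, kd_reference: float, temperature_k: float = 298.15) -> float:
    """Loss in binding free energy (kJ/mol); positive means weaker binding."""
    return GAS_CONSTANT * temperature_k * math.log(kd_mutant / kd_reference) / 1000.0


def kd_fold_change(ddg_kj_per_mol: float, temperature_k: float = 298.15) -> float:
    """Fold increase in Kd that corresponds to a given free energy loss."""
    return math.exp(ddg_kj_per_mol * 1000.0 / (GAS_CONSTANT * temperature_k))


print(f"10-fold weaker binding: {ddg_from_kd(1e-5, 1e-6):.2f} kJ/mol")
for loss in (8.2, 5.7):
    print(f"{loss} kJ/mol corresponds to about {kd_fold_change(loss):.0f}-fold higher Kd")
```

At this assumed temperature, a loss of 5.7 kJ/mol is close to a tenfold increase in Kd, and 8.2 kJ/mol to a roughly 27-fold increase.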
The kon values for Glc4 to Glc6 decreased in the Trp192 mutants of the catalytic enzyme (Fig. 5a). In fact, both crystal structures in complex with cello-oligomers showed highly occupied +2 and +3 subsites; thus, Trp192 might contribute to the binding of Glc4 to Glc6. This is quite similar to the corresponding mutation of Trp332 in TfCel6B, which weakened binding by 20–100-fold in terms of Kd, as analysed by fluorescence titration50. For polymeric substrates, the mutation increased activity on PSA at 30–50 °C, as shown in Fig. 5b, since PSA is a soluble polymer with characteristics rather similar to those of soluble cello-oligomers. A similar phenomenon was reported for the corresponding Trp272 mutation of TrCel6A, which increased turnover (kcat) on cello-oligomers. This is explained by the removal of some non-productive binding modes, which prolong the retention of the substrate49. On Avicel, the activity of the HmCel6A mutant decreased, which could be explained by reduced adsorption to the substrate at higher temperatures (Fig. S7). Similarly, in TfCel6B, the corresponding mutation impairs the enzyme’s function against bacterial microcrystalline cellulose (BMCC)50. These results suggest that Trp192, which is located near the entrance of the active-site tunnel, may assist in the hydrolysis of crystalline cellulose by helping a substrate chain enter the active site.
Trp255 is located farther from the active centre than Trp192, so only a limited energy gain is expected for soluble oligosaccharides. Indeed, a low but significant effect on affinity was observed (Fig. 5a). Nevertheless, the affinity and activity for the polymeric substrate were reduced to the same extent as with the Trp192 mutation. Furthermore, the double mutation of Trp192 and Trp255 caused an additional decrease. It has been inferred that the corresponding residue, Trp394, could form a +6 subsite in TfCel6B (ref. 23). In addition, the Trp394 residue has a stronger affinity for longer substrates23. It has also been argued that the preceding residue, Asp393, corresponds to Glu254 in our structure. Residues corresponding to this pair of an acidic amino acid and a tryptophan are also observed in Cel6C from the basidiomycete Coprinopsis cinerea52. In the binding model of TfCel6B to crystalline cellulose, Trp332/Trp192 supports a cellulose fibre pulled from the crystalline structure in order to introduce it into the active-site tunnel47, while Trp394/Trp255 may maintain its interaction with the surface of the crystal. These residues may have roles similar to those in CBH I. Trp40 in TrCel7A, a CBH I in GH7, forms the subsite −7 at the entrance edge of the active-site tunnel, and is thought to initiate the degradation of crystalline cellulose48, although the +6 subsite of GH6 CBH II is exposed to the solvent in a different manner.
In addition, we observed that both tryptophans contributed to the thermostability of the enzyme. The alanine mutants of Trp192, Trp255 and both residues showed Tm values of 75.2, 74.0 and 71.0 °C, respectively, which are up to 9.5 °C lower than that of the WT. Further investigation of these residues in the high-temperature enzymatic saccharification process is warranted, since high temperatures cause changes in the protein energy landscape.
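The Tm shifts of the individual tryptophan mutants can be worked out against the WT value. The reference used below (80.5 °C, the WT Tm quoted earlier for the calcium-free condition) is an assumption about which WT measurement the comparison refers to; the mutant values are those given in the text.

```python
WT_TM_C = 80.5  # assumed reference: WT Tm without added calcium

MUTANT_TM_C = {
    "W192A": 75.2,
    "W255A": 74.0,
    "W192A/W255A": 71.0,
}


def delta_tm(mutant_tm_c: float, reference_tm_c: float) -> float:
    """Tm shift relative to the reference, rounded to 0.1 °C."""
    return round(mutant_tm_c - reference_tm_c, 1)


shifts = {name: delta_tm(tm, WT_TM_C) for name, tm in MUTANT_TM_C.items()}
for name, shift in shifts.items():
    print(f"{name}: {shift:+.1f} °C")
```

Under this assumption, only the double mutant reaches the full 9.5 °C decrease, while the single mutants lose about 5.3 and 6.5 °C.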
Conclusions
The newly identified cellobiohydrolase, HmCel6A, can be expressed in the heterologous host E. coli. A variant of HmCel6A displayed an optimum temperature as high as 95 °C. This enzyme has unique structural features, such as metal binding, disulphide bonds and shortened loops around the substrate tunnel, among which the bottom loop has a novel sequence. An additional tryptophan, Trp255, is located at the enzyme’s tunnel entrance, and might contribute to catalysis and thermostability. With these features, this enzyme may contribute to the establishment of an efficient, high-temperature saccharification process for cellulose, which may allow for large-scale, industrial use. Indirectly, these features can help to improve CBH II via protein engineering techniques.

