Learning protein fitness models from evolutionary and assay-labeled data
