Doudna, J. A. & Charpentier, E. The new frontier of genome engineering with CRISPR–Cas9. Science 346, 1258096 (2014).
Hsu, P. D., Lander, E. S. & Zhang, F. Development and applications of CRISPR–Cas9 for genome engineering. Cell 157, 1262–1278 (2014).
Chalfie, M., Tu, Y., Euskirchen, G., Ward, W. W. & Prasher, D. C. Green fluorescent protein as a marker for gene expression. Science 263, 802–805 (1994).
Leader, B., Baca, Q. J. & Golan, D. E. Protein therapeutics: a summary and pharmacological classification. Nat. Rev. Drug Discov. 7, 21–39 (2008).
Pollegioni, L., Schonbrunn, E. & Siehl, D. Molecular basis of glyphosate resistance–different approaches through protein engineering. FEBS J. 278, 2753–2766 (2011).
Joo, H., Lin, Z. & Arnold, F. H. Laboratory evolution of peroxide-mediated cytochrome P450 hydroxylation. Nature 399, 670–673 (1999).
Heim, R. & Tsien, R. Y. Engineering green fluorescent protein for improved brightness, longer wavelengths and fluorescence resonance energy transfer. Curr. Biol. 6, 178–182 (1996).
Binz, H. K., Amstutz, P. & Plückthun, A. Engineering novel binding proteins from nonimmunoglobulin domains. Nat. Biotechnol. 23, 1257–1268 (2005).
Arnold, F. H. Design by directed evolution. Acc. Chem. Res. 31, 125–131 (1998).
Alford, R. F. et al. The Rosetta all-atom energy function for macromolecular modeling and design. J. Chem. Theory Comput. 13, 3031–3048 (2017).
Karplus, M. & Kuriyan, J. Molecular dynamics and protein function. Proc. Natl Acad. Sci. USA 102, 6679–6685 (2005).
Rocklin, G. J. et al. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science 357, 168–175 (2017).
Russ, W. P. et al. An evolution-based model for designing chorismate mutase enzymes. Science 369, 440–445 (2020).
Romero, P. A., Krause, A. & Arnold, F. H. Navigating the protein fitness landscape with Gaussian processes. Proc. Natl Acad. Sci. USA 110, E193–E201 (2013).
Wittmann, B. J., Yue, Y. & Arnold, F. H. Informed training set design enables efficient machine learning-assisted directed protein evolution. Cell Syst. 12, 1026–1045 (2021).
Bryant, D. H. et al. Deep diversification of an AAV capsid protein by machine learning. Nat. Biotechnol. 39, 691–696 (2021).
Brookes, D., Park, H. & Listgarten, J. Conditioning by adaptive sampling for robust design. In Proc. International Conference on Machine Learning (eds Chaudhuri, K. & Salakhutdinov, R.) 773–782 (PMLR, 2019).
Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).
Sinai, S. et al. AdaLead: a simple and robust adaptive greedy search algorithm for sequence design. Preprint at https://arxiv.org/abs/2010.02141 (2020).
Schymkowitz, J. et al. The FoldX web server: an online force field. Nucleic Acids Res. 33, W382–W388 (2005).
Dehouck, Y., Kwasigroch, J. M., Gilis, D. & Rooman, M. PoPMuSiC 2.1: a web server for the estimation of protein stability changes upon mutation and sequence optimality. BMC Bioinform. 12, 151 (2011).
Capriotti, E., Fariselli, P. & Casadio, R. I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Res. 33, W306–W310 (2005).
Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).
Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).
Sim, N.-L. et al. SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res. 40, W452–W457 (2012).
Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010).
Shihab, H. A. et al. Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models. Hum. Mutat. 34, 57–65 (2013).
Mann, J. K. et al. The fitness landscape of HIV-1 Gag: advanced modeling approaches and validation of model predictions by in vitro testing. PLoS Comput. Biol. 10, e1003776 (2014).
Cheng, R. R., Morcos, F., Levine, H. & Onuchic, J. N. Toward rationally redesigning bacterial two-component signaling systems using coevolutionary information. Proc. Natl Acad. Sci. USA 111, E563–E571 (2014).
Figliuzzi, M., Jacquier, H., Schug, A., Tenaillon, O. & Weigt, M. Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1. Mol. Biol. Evol. 33, 268–280 (2016).
Araya, C. L. et al. A fundamental protein property, thermodynamic stability, revealed solely from large-scale measurements of protein function. Proc. Natl Acad. Sci. USA 109, 16858–16863 (2012).
Olson, C. A., Wu, N. C. & Sun, R. A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain. Curr. Biol. 24, 2643–2651 (2014).
Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016).
Melamed, D., Young, D. L., Gamble, C. E., Miller, C. R. & Fields, S. Deep mutational scanning of an RRM domain of the Saccharomyces cerevisiae poly(A)-binding protein. RNA 19, 1537–1551 (2013).
Wu, N. C., Dai, L., Olson, C. A., Lloyd-Smith, J. O. & Sun, R. Adaptation in protein fitness landscapes is facilitated by indirect paths. eLife 5, e16965 (2016).
Otwinowski, J., McCandlish, D. M. & Plotkin, J. B. Inferring the shape of global epistasis. Proc. Natl Acad. Sci. USA 115, E7550–E7558 (2018).
Shanehsazzadeh, A., Belanger, D. & Dohan, D. Is transfer learning necessary for protein landscape prediction? Preprint at https://arxiv.org/abs/2011.03443 (2020).
Rao, R. et al. Evaluating protein transfer learning with TAPE. In Proc. Advances in Neural Information Processing Systems (eds Wallach, H. et al.) 9689–9701 (Curran Associates, Inc., 2019).
Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
Madani, A. et al. Deep neural language modeling enables functional protein generation across families. Preprint at bioRxiv https://doi.org/10.1101/2021.07.18.452833 (2021).
Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021).
Shamsi, Z., Chan, M. & Shukla, D. TLmutation: predicting the effects of mutations using transfer learning. J. Phys. Chem. B 124, 3845–3854 (2020).
Barrat-Charlaix, P., Figliuzzi, M. & Weigt, M. Improving landscape inference by integrating heterogeneous data in the inverse Ising problem. Sci. Rep. 6, 37812 (2016).
Howard, J. & Ruder, S. Universal language model fine-tuning for text classification. In Proc. 56th Annual Meeting of the Association for Computational Linguistics, Vol. 1: long papers (eds Gurevych, I. & Miyao, Y.) 328–339 (Association for Computational Linguistics, 2018).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1: long and short papers, 4171–4186 (2019).
Suzek, B. E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015).
Elnaggar, A. et al. ProtTrans: towards cracking the language of life’s code through self-supervised deep learning and high performance computing. Preprint at bioRxiv https://doi.org/10.1101/2020.07.12.199554 (2020).
Aghazadeh, A. et al. Epistatic net allows the sparse spectral regularization of deep neural networks for inferring fitness functions. Nat. Commun. 12, 5225 (2021).
Starita, L. M. et al. Activity-enhancing mutations in an E3 ubiquitin ligase identified by high-throughput mutagenesis. Proc. Natl Acad. Sci. USA 110, E1263–E1272 (2013).
Finn, R. D., Clements, J. & Eddy, S. R. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 39, W29 (2011).
Järvelin, K. & Kekäläinen, J. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20, 422–446 (2002).
Gelman, S. et al. Neural networks to learn protein sequence-function relationships from deep mutational scanning data. Proc. Natl Acad. Sci. USA 118, e2104878118 (2021).
Gray, V. E., Hause, R. J., Luebeck, J., Shendure, J. & Fowler, D. M. Quantitative missense variant effect prediction using large-scale mutagenesis data. Cell Syst. 6, 116–124 (2018).
Ingraham, J., Garg, V., Barzilay, R. & Jaakkola, T. Generative models for graph-based protein design. In Proc. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019) Vol. 32 (NeurIPS, 2019).
Hardt, M. & Recht, B. Patterns, predictions, and actions: a story about machine learning. Preprint at https://arxiv.org/abs/2102.05242 (2021).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Fannjiang, C. & Listgarten, J. Autofocused oracles for model-based design. In Proc. 33rd Conference on Neural Information Processing Systems (NeurIPS 2020) Vol. 33 (NeurIPS, 2020).
Sugiyama, M., Krauledat, M. & Müller, K.-R. Covariate shift adaptation by importance weighted cross validation. J. Mach. Learn. Res. 8, 985–1005 (2007).
Georgiev, A. G. Interpretable numerical descriptors of amino acid space. J. Comput. Biol. 16, 703–723 (2009).
Kawashima, S. et al. AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 36, D202–D205 (2007).
Eddy, S. R. Profile hidden Markov models. Bioinformatics 14, 755–763 (1998).
Besag, J. Statistical analysis of non-lattice data. J. R. Stat. Soc. Ser. D (Statistician) 24, 179–195 (1975).
Stein, R. R., Marks, D. S. & Sander, C. Inferring pairwise interactions from biological data using maximum-entropy probability models. PLoS Comput. Biol. 11, e1004182 (2015).
Blondel, M., Teboul, O., Berthet, Q. & Djolonga, J. Fast differentiable sorting and ranking. In Proc. International Conference on Machine Learning (eds Daumé, H. III & Singh, A.) 950–959 (PMLR, 2020).

