Datasets
In this work, the popular GB1 benchmark library was used to test CLADE, together with a PhoQ library used in an early MLDE study42. For both datasets, fitness values were normalized to the range [0, 1] before being used in CLADE.
The GB1 dataset4 is an empirical fitness landscape for protein G domain B1 (GB1; PDB 2GI9) binding to an antibody. Fitness was defined as the enrichment of folded protein bound to the antibody IgG-Fc. This dataset contains 149,361 experimentally labeled variants out of 20^4 = 160,000 at four amino-acid sites (V39, D40, G41 and V54). The fitness of the remaining 10,639 unlabeled variants has been imputed, but those values are not considered in this study. After normalizing fitness to its global maximum, 92% of variants have fitness lower than 0.01 and 99.3% of variants have fitness lower than 0.3 (Supplementary Fig. 3a).
For the PhoQ dataset52, a high-throughput assay for the signaling of a two-component regulatory system—the PhoQ–PhoP sensor kinase and response regulator—was developed with a yellow fluorescent protein (YFP) reporter expressed from a PhoP-dependent promoter. Extracellular magnesium concentration stimulates the phosphatase or kinase activity of PhoQ, which can be reported by YFP levels. The combinatorial library was constructed at four sites (A284, V285, S288 and T289) located at the protein–protein interface between the sensor domain and kinase domain of PhoQ. Two libraries were constructed using different extracellular magnesium treatments. In each library, variants with YFP levels comparable to wild type were selected by fluorescence-activated cell sorting (FACS) and used for enrichment-ratio calculations; comparable YFP levels were strictly defined by two thresholds. The PhoQ dataset was previously studied with an MLDE model42. In this work, we took the enrichment ratios from the library with high extracellular magnesium treatment as fitness. The fitness value correlates with the probability that a variant fluoresces in the given range, with this range defined as wild-type-like activity in the original PhoQ work52 (Supplementary Fig. 10). The fitness landscape has nearly complete coverage, with 140,517 quality-read variants out of 20^4 = 160,000. Like GB1, the PhoQ dataset is overwhelmed by low- or zero-fitness variants, with 92% of variants having fitness lower than 0.01 and 99.96% having fitness lower than 0.3; high-fitness variants are even rarer than in the GB1 dataset (Supplementary Fig. 3a).
For the MLDE algorithms alone, both the GB1 and PhoQ datasets, using enrichment ratios as fitness, provide suitable labels to learn and optimize. In applications, fitness optimization usually aims to improve a meaningful protein property. The fitness in GB1 directly correlates with a specific protein activity, namely the binding affinity between GB1 and the antibody IgG-Fc, making it an excellent benchmark. However, fitness in the PhoQ dataset may only weakly correlate with protein activities, such as PhoQ–PhoP interaction strength and YFP fluorescence level. As such, the results from MLDE on the PhoQ dataset cannot be directly interpreted as a meaningful protein property.
Sequence encoding
In this work, two physicochemical sequence encoding methods—AA and Georgiev—were used to test CLADE. The encoding matrix of the combinatorial library was standardized via StandardScaler() in scikit-learn53 before further use. The same encoding matrix was used for both the unsupervised clustering and the supervised learning models (Supplementary Section 1). First, the AA encoding consists of four physicochemical descriptors: molecular mass, hydropathy, surface area and volume (Supplementary Table 2). Molecular mass, hydropathy and surface area were obtained from the AAindex database43 and volume from experimental work54. This encoding was previously used in predictions of protein stability changes12. Instead of picking a subset of the AAindex database, the Georgiev encoding44,45 comprehensively integrates over 500 amino-acid indices from the AAindex database and gives a low-dimensional, 19-dimensional representation of these indices.
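As an illustration of this encoding pipeline, the sketch below builds an AA-style encoding for a toy four-site library and standardizes it with scikit-learn. The descriptor values are placeholders for illustration only, not the exact entries of Supplementary Table 2 or the AAindex database.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative AA-style descriptors per residue:
# (molecular mass, hydropathy, surface area, volume).
# Placeholder values, not the exact AAindex/experimental entries.
AA_DESCRIPTORS = {
    "V": [99.1, 4.2, 155.0, 140.0],
    "D": [115.1, -3.5, 150.0, 111.1],
    "G": [57.1, -0.4, 75.0, 60.1],
    "A": [71.1, 1.8, 115.0, 88.6],
}

def encode_library(variants):
    """Concatenate per-site descriptors for each four-site variant."""
    return np.array(
        [np.concatenate([AA_DESCRIPTORS[aa] for aa in v]) for v in variants]
    )

variants = ["VDGV", "ADGV", "VDGA", "GDGV"]     # toy combinatorial library
X = encode_library(variants)                    # shape (4, 16): 4 sites x 4 descriptors
X_std = StandardScaler().fit_transform(X)       # zero mean per column
```

The standardized matrix `X_std` would then feed both the clustering and the supervised models, as in the text.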
Gaussian process
The GP regression model55 was used to infer the value of an unknown function f(x) at a novel point x, given a set of observations X with labels Y. The posterior distribution of f(x) given by the GP is characterized by its mean μ(x) and standard deviation σ(x). GP regression was implemented with the scikit-learn package53, using the default radial basis function (RBF) kernel and other default parameters.
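A minimal sketch of this setup, with synthetic data standing in for encoded sequences, might look as follows:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 3))                 # stand-in encoded sequences
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=40)  # stand-in fitness

# RBF kernel as in the text; all other parameters left at their defaults.
gp = GaussianProcessRegressor(kernel=RBF())
gp.fit(X_train, y_train)

X_new = rng.normal(size=(5, 3))                    # unlabeled points
mu, sigma = gp.predict(X_new, return_std=True)     # posterior mean and std
```

The returned `mu` and `sigma` are the μ(x) and σ(x) used by the acquisition functions below.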
The next round of sequence selection was prioritized by the values of an acquisition function α(x), where the sequence with the largest acquisition value in the unlabeled set X0 is screened first:
$$x_{*} = \mathop{\arg\max}_{x \in X_{0}} \alpha(x).$$
(2)
Specifically, in this work, we selected the batch of unlabeled sequences with the top acquisition values for the next batch of screening.
The design of the acquisition function depends on the posterior mean and variance. The simple greedy acquisition is defined by the posterior mean, which maximizes and exploits the expected fitness at each round:
$$\alpha_{\mathrm{g}}(x) = \mu(x).$$
(3)
On the other hand, with the acquisition identical to the posterior standard deviation, we can explore uncertain regions to increase the knowledge and accuracy of the regression model. To balance the exploration–exploitation dilemma between these two extreme cases, the ϵ-greedy acquisition combines them38:
$$\alpha_{\epsilon}(x) = \begin{cases} \mu(x), & \text{with probability } 1-\epsilon, \\ \sigma(x), & \text{with probability } \epsilon, \end{cases}$$
(4)
where ϵ is a hyperparameter that mediates this trade-off. In this work, we took ϵ as a constant and explored its values; an alternative design would let ϵ decrease sequentially to enhance exploitation.
Another popular acquisition, UCB, can both exploit samples with large mean and explore samples with large variance, and has a substantial theoretical background39. It takes the form
$$\alpha_{\mathrm{UCB}}(x) = \mu(x) + \sqrt{\beta}\,\sigma(x).$$
(5)
The trade-off parameter β determines the size of the confidence interval to be considered. For example, the acquisition function considers a 95% confidence interval when β = 4.
Thompson sampling exploits the labels through random sampling according to the posterior mean and variance. The acquisition function is sampled from a normal distribution:
$$\alpha_{\mathrm{T}}(x) \sim \mathcal{N}\left(\mu(x), \sigma(x)^{2}\right).$$
(6)
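The four acquisition functions above (equations (3)–(6)) and the batch selection of equation (2) can be sketched as follows. The per-point coin flip in the ϵ-greedy variant is one possible reading of equation (4), and all function names are illustrative:

```python
import numpy as np

def greedy(mu, sigma):                      # equation (3): exploit the mean
    return mu

def epsilon_greedy(mu, sigma, eps=0.1, rng=None):
    """Equation (4): mean with probability 1 - eps, std with probability eps
    (applied here independently per candidate point)."""
    if rng is None:
        rng = np.random.default_rng()
    use_sigma = rng.random(len(mu)) < eps
    return np.where(use_sigma, sigma, mu)

def ucb(mu, sigma, beta=4.0):               # equation (5): 95% CI at beta = 4
    return mu + np.sqrt(beta) * sigma

def thompson(mu, sigma, rng=None):          # equation (6): sample the posterior
    if rng is None:
        rng = np.random.default_rng()
    return rng.normal(mu, sigma)

def select_batch(alpha, batch_size):
    """Indices of the top-`batch_size` acquisition values (equation (2))."""
    return np.argsort(alpha)[::-1][:batch_size]

mu = np.array([0.2, 0.9, 0.5])
sigma = np.array([0.3, 0.1, 0.4])
batch = select_batch(ucb(mu, sigma), 2)     # -> indices [2, 1]
```

In a batched setting, the selected indices are the next variants sent for screening.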
Zero-shot predictions
Zero-shot predictions were calculated following the ftMLDE package2. In this work, we tested two zero-shot predictors: EVmutation40 and the MSA transformer with a mask-filling protocol22.
Before these zero-shot predictions were calculated, the EVcouplings webapp56 was used to generate MSAs and train an EVmutation model for the target protein; the sequence of the target protein is the only required input. The alignments were searched against the UniRef100 database. Except for the bitscore, all parameters were kept at their default values (search iterations = 5, position filter = 70%, sequence fragment filter = 50%, removing similar sequences = 90%, downweighting similar sequences = 80%). The entire 56-residue sequence of GB1 (PDB 2GI9) was used for alignments:
MQYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE
The bitscore was taken as 0.4 according to ref. 2, resulting in 56 redundancy-reduced sequences. The sequence of PhoQ (UniProtKB P23837) has 486 residues:
MKKLLRLFFPLSLRVRFLLATAAVVLVLSLAYGMVALIGYSVSFDKTTFRLLRGESNLFY TLAKWENNKLHVELPENIDKQSPTMTLIYDENGQLLWAQRDVPWLMKMIQPDWLKSNGFH EIEADVNDTSLLLSGDHSIQQQLQEVREDDDDAEMTHSVAVNVYPATSRMPKLTIVVVDT IPVELKSSYMVWSWFIYVLSANLLLVIPLLWVAAWWSLRPIEALAKEVRELEEHNRELLN PATTRELTSLVRNLNRLLKSERERYDKYRTTLTDLTHSLKTPLAVLQSTLRSLRSEKMSV SDAEPVMLEQISRISQQIGYYLHRASMRGGTLLSRELHPVAPLLDNLTSALNKVYQRKGV NISLDISPEISFVGEQNDFVEVMGNVLDNACKYCLEFVEISARQTDEHLYIVVEDDGPGI PLSKREVIFDRGQRVDTLRPGQGVGLAVAREITEQYEGKIVAGESMLGGARMEVIFGRQH SAPKDE
The four mutational sites (A284, V285, S288 and T289) are located at the interface between the sensor domain and kinase domain. In EVcouplings, we took the 189 residues around the protein–protein interface (positions 188–376 of the sequence above) to search for more relevant homologs that cover the mutational sites. The authors of EVcouplings suggest generating ≥10L redundancy-reduced sequences40,56. By tuning the bitscore, we took it to be 0.5, resulting in 2,998 redundancy-reduced sequences.
The zero-shot predictions from EVmutation were calculated for the combinatorial libraries using the model downloaded from the EVcouplings webapp. When applying the MSA transformer, MSAs may need to be subsampled to keep the model memory efficient. We used the hhfilter function in the HHsuite package57 to subsample the alignments by maximizing diversity, as suggested by the original MSA transformer publication22. Because the MSA of GB1 contained only 56 sequences, subsampling was omitted. For the MSA of PhoQ, the -diff parameter in hhfilter was taken as 100, which generates 128 sequences. The zero-shot predictions using the MSA transformer were calculated by the mask-filling protocol using naive probability2.
Unsupervised clustering and clustering sampling
In this work, two unsupervised clustering algorithms, K-means23 and Louvain25, were tested on CLADE. K-means clustering was computed using the scikit-learn package with default kmeans++ initialization53. Louvain clustering was computed on a shared nearest-neighbor graph implemented by the Seurat package58 (Supplementary Section 6).
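A minimal K-means run on a standardized encoding matrix, using scikit-learn's default kmeans++ initialization, might look like this (the data here are synthetic stand-ins for the encoded library):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))    # stand-in for standardized physicochemical encodings

# init="k-means++" is scikit-learn's default, matching the text.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
labels = km.labels_              # cluster index assigned to each variant
```

The per-variant `labels` are what the cluster-wise sampling probabilities below are computed over.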
In clustering sampling, a cluster is selected according to cluster-wise sampling probabilities, which depend on the average fitness of the selected variants in each cluster: a cluster with higher average fitness has a higher probability of being selected. For the kth cluster at the ith hierarchy, the sampling probability is given by
$$P_{k}^{(i)} = \frac{\frac{1}{\#C_{k}^{(i)}} \sum_{j \in C_{k}^{(i)}} y_{j}}{\sum_{l} \frac{1}{\#C_{l}^{(i)}} \sum_{j \in C_{l}^{(i)}} y_{j}},$$
(7)
where $C_{l}^{(i)} \subset I$ is the index set of the lth cluster at the ith hierarchy and I is the index set of the combinatorial library, which gives each variant a unique index. Here, $y_{j}$ is the fitness of the jth variant. Once a cluster is selected, in-cluster sampling is used to select a variant within it. One approach, random sampling, picks a variant uniformly. The other approach is GP-based sampling: the GP model is trained on all labeled sequences, and, in contrast to conventional GP selection, the acquisition function is maximized only over variants within the selected cluster instead of over the whole library.
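Equation (7) can be sketched as follows. As in the text, the averages run over the variants already screened in each cluster; clusters with no screened variants receive zero probability in this sketch, which is an assumption since the text does not specify that case:

```python
import numpy as np

def cluster_probabilities(labels, measured):
    """Equation (7): sampling probability of each cluster, proportional to
    the average measured fitness of screened variants in that cluster.

    labels   : cluster index for every variant in the library
    measured : dict mapping screened variant index -> fitness y_j
    """
    n_clusters = int(labels.max()) + 1
    means = np.zeros(n_clusters)
    for k in range(n_clusters):
        ys = [y for j, y in measured.items() if labels[j] == k]
        if ys:                                  # unscreened clusters keep 0 (assumption)
            means[k] = np.mean(ys)
    return means / means.sum()

labels = np.array([0, 0, 1, 1, 2, 2])
measured = {0: 0.8, 1: 0.4, 2: 0.1, 4: 0.3}     # four screened variants
p = cluster_probabilities(labels, measured)     # -> [0.6, 0.1, 0.3]
```

A cluster is then drawn according to `p`, after which in-cluster sampling picks a variant within it.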
In deep hierarchical clustering, only K-means is applied because it is easy to control the number of clusters with a single hyperparameter K. For maximum hierarchy N, the increment of clusters at the ith (i ≤ N) hierarchy is given by $K_{i}$. The total number of clusters at the maximum hierarchy is the sum of these numbers, $\sum_{i=1}^{N} K_{i}$. At a new hierarchy, clusters with higher average fitness are divided into more subclusters, and clusters with low average fitness are divided into fewer subclusters or not divided. The kth parent cluster at the (i − 1)th hierarchy will be divided into $L_{k}^{(i)}$ subclusters at the ith hierarchy, and $L_{k}^{(i)}$ is given by
$$L_{k}^{(i)} = \begin{cases} [P_{k}^{(i)} K_{i}] + 1, & \text{if } k \ne k_{0}, \\ K_{i} - \sum_{j \ne k_{0}} [P_{j}^{(i)} K_{i}] + 1, & \text{if } k = k_{0}, \end{cases}$$
(8)
where $k_{0} = \mathop{\arg\max}_{k} \frac{1}{\#C_{k}^{(i)}} \sum_{j \in C_{k}^{(i)}} y_{j}$ is the index of the cluster having the largest average fitness from selected variants over all clusters, and $[x]$ represents the largest integer not greater than x.
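The allocation rule of equation (8) can be sketched as below; with this form, the total number of subclusters equals the number of parent clusters plus the increment K_i:

```python
import numpy as np

def allocate_subclusters(P, K_i, k0):
    """Equation (8): number of subclusters L_k for each parent cluster.

    P   : cluster-wise sampling probabilities (equation (7))
    K_i : increment of clusters at this hierarchy
    k0  : index of the cluster with the largest average fitness
    """
    L = np.floor(P * K_i).astype(int) + 1      # [P_k K_i] + 1 for k != k0
    others = sum(np.floor(P[j] * K_i).astype(int)
                 for j in range(len(P)) if j != k0)
    L[k0] = K_i - others + 1                   # remainder goes to the best cluster
    return L

P = np.array([0.6, 0.1, 0.3])
L = allocate_subclusters(P, K_i=10, k0=0)      # -> [7, 2, 4]
```

Here the 3 parent clusters yield 13 subclusters in total, that is, an increment of K_i = 10 over the parent count.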
Here we summarize the workflow of clustering sampling together with the required hyperparameters. The structure of the clusters needs to be determined before the sampling process, with N + 1 hyperparameters: the maximum hierarchy N and the increments of clusters at each hierarchy Ki. The batch size, NUMbatch, is taken to be the number of variants screened in parallel in the experiment. The batch size determines how frequently the sampling probabilities and the clusters at a new hierarchy are updated; a lower batch size usually leads to more accurate CLADE predictions but higher experimental cost.

During sampling, the first round selects NUM1st variants, picked equally over clusters to give a rough coverage of all clusters. After the first round, the cluster-wise sampling probabilities are updated after every batch according to equation (7), and a new hierarchy is generated after every set of NUMhierarchy variants is screened, until the maximum hierarchy N is reached. The sampling method used to pick variants from the selected clusters can be either random sampling or GP-based sampling. The sampling process generates NUMtrain labeled variants to train the downstream supervised learning model, and the top M variants predicted by CLADE are experimentally screened. These numbers—NUM1st, NUMhierarchy, NUMtrain and M—are all required to be multiples of the batch size NUMbatch.

Two batch sizes, 96 and 1, were used in this work. Batch size 96 follows the 96-well plates commonly used in many experimental systems3,33 and is referred to as a medium-throughput system in this work. Batch size 1 simulates systems with extremely low throughput, in which variants need to be screened one by one. The hyperparameters for the medium- and low-throughput systems are provided in Supplementary Table 3. In applications, NUMbatch can be picked according to the experimental protocol and the training size NUMtrain according to the screening capacity; the other three numbers can be selected according to our experiments and scaled to suitable values.
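A small helper to enforce the multiples-of-NUMbatch constraint described above might look like this; the particular budget values are illustrative, and the actual hyperparameters are listed in Supplementary Table 3:

```python
def validate_hyperparameters(num_batch, num_first, num_hierarchy, num_train, top_m):
    """NUM1st, NUMhierarchy, NUMtrain and M must all be multiples of NUMbatch."""
    budgets = {"NUM1st": num_first, "NUMhierarchy": num_hierarchy,
               "NUMtrain": num_train, "M": top_m}
    bad = {name: n for name, n in budgets.items() if n % num_batch != 0}
    if bad:
        raise ValueError(f"not multiples of NUMbatch={num_batch}: {bad}")
    return True

# Illustrative medium-throughput settings (NUMbatch = 96).
validate_hyperparameters(96, 96 * 2, 96 * 4, 96 * 5, 96)
```

Such a check would fail fast before any screening rounds are scheduled.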
For clustering sampling using zero-shot predictions, we only sample within a subspace of the combinatorial library given by the top-ranking variants from the zero-shot predictions. The other steps are identical to the case without using zero-shot predictions.
Ensemble supervised learning
The MLDE package2 was used for the supervised learning model in this work. An ensemble of 17 regression models optimized by Bayesian hyperparameter optimization was used. Fivefold cross-validation was performed on the training data to evaluate the performance of each model, measured by mean squared error. Bayesian hyperparameter optimization was performed to find the best-performing hyperparameters for each model. After hyperparameter optimization, the top three models were picked and their predictions averaged to predict the fitness of unlabeled variants. Details are provided in Supplementary Section 2 and Supplementary Tables 4 and 5.
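A scaled-down sketch of this ensembling scheme, with three stand-in regressors instead of the 17 MLDE models and with Bayesian hyperparameter optimization omitted, is shown below:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                   # stand-in encoded training variants
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)   # stand-in fitness labels

# Stand-ins for the 17 MLDE regressors; hyperparameter optimization omitted.
models = [Ridge(), RandomForestRegressor(random_state=0), KNeighborsRegressor()]

# Rank models by fivefold cross-validated mean squared error (lower is better).
mse = [-cross_val_score(m, X, y, cv=5, scoring="neg_mean_squared_error").mean()
       for m in models]
top = np.argsort(mse)[:3]                       # with 17 models, keep the best three

# Average the top models' predictions on unlabeled variants.
X_unlabeled = rng.normal(size=(10, 8))
preds = np.mean([models[i].fit(X, y).predict(X_unlabeled) for i in top], axis=0)
```

The averaged `preds` play the role of the ensemble fitness predictions from which the top M variants are chosen.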
Evaluating metrics
Various metrics were used to evaluate training-data diversity and the CLADE outcome. Mean fitness and maximum fitness were calculated for three sets: the training data, the top M predicted variants and their union. In selecting the top M predicted variants, only variants that could be constructed by recombination of variants in the training data were considered; this enhances the confidence of the predictions by reducing extrapolation, especially when a less diverse training set is used. The 'global maximal fitness hit rate' is the frequency with which the global maximal variant is successfully picked across multiple independent repeats. 'Normalized discounted cumulative gain' (NDCG) is a measure of ranking quality used to evaluate the predictive performance of CLADE on all unlabeled data. Its value lies between 0 and 1; a value close to 1 indicates that the ranking of variants by predicted fitness is similar to the ranking by ground-truth fitness. Mean squared error and Pearson correlation are used to evaluate the performance of the supervised learning for both cross-validation and testing. 'Modified functional attribute diversity' (MFAD) is a quantity used to measure data diversity59; in this Article, we use it to measure the fitness and sequence diversity of the training data. If T is the training data size, MFAD is given by
$$\mathrm{MFAD} = \frac{\sum_{i=1}^{T} \sum_{j=1}^{T} d_{ij}}{T},$$
(9)
where dij represents the dissimilarity between the ith and jth samples. For fitness diversity, the dissimilarity is calculated by the difference of fitness between two samples:
$$d_{ij}^{\mathrm{fitness}} = \left| y_{i} - y_{j} \right|.$$
(10)
For sequence diversity, the dissimilarity is calculated by the Euclidean distance between two samples of the physicochemical encoding:
$$d_{ij}^{\mathrm{sequence}} = \left\| x_{i} - x_{j} \right\|_{2},$$
(11)
where $x_{i}$ is the physicochemical encoding feature vector of the ith variant and $\|\cdot\|_{2}$ denotes the Euclidean norm.
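Equations (9)–(11) can be computed directly; a minimal sketch on toy data:

```python
import numpy as np

def mfad(d):
    """Equation (9): sum of pairwise dissimilarities divided by sample count T."""
    T = d.shape[0]
    return d.sum() / T

# Fitness diversity, equation (10): absolute fitness differences.
y = np.array([0.0, 0.5, 1.0])
d_fitness = np.abs(y[:, None] - y[None, :])

# Sequence diversity, equation (11): Euclidean distances between encodings.
X = np.array([[0.0, 0.0], [3.0, 4.0]])
d_seq = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

mfad_fitness = mfad(d_fitness)   # (2 * (0.5 + 1.0 + 0.5)) / 3 = 4/3
mfad_seq = mfad(d_seq)           # (5 + 5) / 2 = 5.0
```

Note that each pair (i, j) contributes twice to the double sum, consistent with equation (9).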

