SemiBin pipeline
SemiBin is developed with Python. The input to SemiBin consists of a set of contigs and mapped short reads (in the form of BAM files) from one or multiple samples. The output is a collection of bins, that is, clusters of the original contigs.
Preprocessing
Prior to binning, contigs are filtered to remove short contigs, with the default minimal length for binning being 2500 bp. If the contigs in the range 1000–2500 bp represent <5% of the total number of basepairs (this parameter can be changed by the user), then the threshold is reduced to 1000 bp.
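As a minimal sketch of this length-threshold rule (the function and parameter names below are ours, not SemiBin's API):

```python
def choose_min_length(contig_lengths, default=2500, fallback=1000, frac=0.05):
    """Choose the minimum contig length for binning (sketch of the rule above).

    If contigs of 1,000-2,500 bp account for less than `frac` (5% by default)
    of the total basepairs, the 2,500 bp cutoff is relaxed to 1,000 bp.
    """
    total_bp = sum(contig_lengths)
    short_bp = sum(l for l in contig_lengths if fallback <= l < default)
    return fallback if short_bp < frac * total_bp else default
```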
After filtering, every contig is represented by k-mer relative frequencies K and abundance distribution A. SemiBin uses k=4 (also used in other tools18,19,20,21,22), so the dimension of K is 136 (k-mers that are the reverse complement of each other are considered equivalent). K-mer vectors are normalized by the lengths of each contig after the addition of a small pseudo-count, namely:
$$k_i^{\prime}=\frac{k_i+10^{-5}}{\sum_{j=1}^{136}k_j+10^{-5}}$$
(1)
The abundance a is defined as the average number of reads per base mapping to the contig. We calculated the abundance with BEDTools (version 2.29.1, genomecov command)47. If the number of samples N used in the binning is greater than or equal to 5, SemiBin scales the abundance features from the N samples to a similar order of magnitude as the k-mer frequencies (which are restricted to the 0–1 range, by construction). In particular, the constant s used to scale was defined as $100\times\lceil\frac{a_{mean}}{100}\rceil$. After the processing, the input to the semi-supervised model is the vector $\mathbf{Z}=[k_1^{\prime},k_2^{\prime},\ldots,k_{136}^{\prime};\frac{a_1}{s_1},\frac{a_2}{s_2},\ldots,\frac{a_N}{s_N}]$. Otherwise, when the number of samples used is smaller than 5, the input to the semi-supervised model is the vector $\mathbf{Z}=[k_1^{\prime},k_2^{\prime},\ldots,k_{136}^{\prime}]$.
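The feature construction can be sketched as follows (a simplified illustration; the handling of ambiguous bases and the per-sample computation of a_mean are our assumptions):

```python
import itertools
import math

import numpy as np

BASES = "ACGT"
COMP = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Collapse a k-mer and its reverse complement into one canonical form."""
    return min(kmer, kmer.translate(COMP)[::-1])

# the 136 canonical 4-mers (reverse complements considered equivalent)
CANON_4MERS = sorted({canonical("".join(p)) for p in itertools.product(BASES, repeat=4)})
KMER_INDEX = {k: i for i, k in enumerate(CANON_4MERS)}

def kmer_frequencies(seq, pseudo=1e-5):
    """Relative canonical 4-mer frequencies with a small pseudo-count (Eq. 1)."""
    counts = np.zeros(len(CANON_4MERS))
    for i in range(len(seq) - 3):
        idx = KMER_INDEX.get(canonical(seq[i:i + 4]))
        if idx is not None:            # skip windows with ambiguous bases (e.g., N)
            counts[idx] += 1
    return (counts + pseudo) / (counts.sum() + pseudo)

def scale_abundance(abundances_one_sample):
    """Scale one sample's abundances by s = 100 * ceil(a_mean / 100) (used when N >= 5)."""
    a = np.asarray(abundances_one_sample, dtype=float)
    s = 100 * max(1, math.ceil(a.mean() / 100))   # guard against all-zero abundances
    return a / s
```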
Generating must-link and cannot-link constraints
SemiBin uses taxonomic annotation results to generate cannot-link constraints and breaks up contigs to build must-link constraints. By default, MMseqs245,46 (version 13.45111, default parameters) is used to annotate contigs against GTDB28 reference genomes. We defined a cannot-link constraint between pairs of contigs with different annotations at the species level (with both scores above 0.95) or at the genus level (with both scores above 0.80). If the number of cannot-link constraints is too large (the default limit is four million), they are randomly subsampled to speed up the training process. SemiBin also employs the same method as Maxbin219, with FragGeneScan (version 1.30)48 and HMMER (version 3.1.b1)49, to predict seed contigs for different species using single-copy marker genes and generates cannot-link constraints between them.
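A hedged sketch of the taxonomy-based cannot-link generation (the per-contig annotation layout shown here is an assumption for illustration, not SemiBin's internal format):

```python
import itertools
import random

def cannot_link_from_taxonomy(annotations, max_constraints=4_000_000, seed=0):
    """Generate cannot-link pairs from per-contig taxonomic annotations.

    `annotations`: contig -> (species, species_score, genus, genus_score).
    Two contigs are cannot-linked if their species differ with both scores > 0.95,
    or their genera differ with both scores > 0.80.
    """
    pairs = []
    for (c1, (sp1, ss1, ge1, gs1)), (c2, (sp2, ss2, ge2, gs2)) in \
            itertools.combinations(annotations.items(), 2):
        if sp1 != sp2 and ss1 > 0.95 and ss2 > 0.95:
            pairs.append((c1, c2))
        elif ge1 != ge2 and gs1 > 0.80 and gs2 > 0.80:
            pairs.append((c1, c2))
    if len(pairs) > max_constraints:       # subsample to keep training fast
        pairs = random.Random(seed).sample(pairs, max_constraints)
    return pairs
```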
To generate must-link constraints, SemiBin artificially breaks long contigs into two fragments of equal length and generates a must-link constraint between the two resulting fragments. By default, SemiBin uses a heuristic to automatically determine the minimum size of contigs to break up (alternatively, the user can specify the threshold): it selects the minimum size such that contigs at least as long as this threshold contain most (98%) of the basepairs of the input contigs. If this minimum size is smaller than 4000 bp, it is set to 4000 bp.
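A sketch of this must-link heuristic (the fragment naming convention is illustrative):

```python
def must_link_threshold(contig_lengths, frac=0.98, floor=4000):
    """Smallest length such that contigs at least this long still cover ~98%
    of all basepairs; never below 4,000 bp."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    covered, threshold = 0, floor
    for length in lengths:
        covered += length
        threshold = length
        if covered >= frac * total:
            break
    return max(threshold, floor)

def break_contig(name, seq):
    """Split a long contig into two equal halves that form a must-link pair."""
    half = len(seq) // 2
    return [(f"{name}_1", seq[:half]), (f"{name}_2", seq[half:])]
```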
Semi-supervised deep siamese neural network architecture
We used a semi-supervised siamese neural network (see Supplementary Figs. 1 and 2)24 to incorporate the must-link and cannot-link constraints. The network consists of two identical subnetworks with shared weights.
As it is semi-supervised, this neural network is trained with two loss functions. The supervised loss is a contrastive loss that is used to classify pairs of inputs as must-link or cannot-link (in the contrastive loss function, we termed the must-link constraints as positive labels and cannot-link constraints as negative labels):
$$L^{Con}=\frac{1}{|\mathbf{M}_x\cup\mathbf{C}_x|}\sum_{(x_1,x_2)\in\mathbf{M}_x\cup\mathbf{C}_x}y\,d(x_1,x_2)^2+(1-y)\max\{1-d(x_1,x_2),0\}^2$$
(2)
where Mx denotes the must-link pairs and Cx the cannot-link pairs in the training set, d(x1, x2) is the Euclidean distance of the embedding of (x1, x2), and y is an indicator variable for the (x1, x2) pair (with the value 1 if (x1, x2) ∈ Mx, and 0 if (x1, x2) ∈ Cx).
The goal of the supervised embedding is to transform the input space so that contigs from the same genome have a smaller distance than pairs of contigs from different genomes. To ensure that the embedding learns structure that is shared by all genomes, we also used an autoencoder50 to reconstruct the original input from the embedding representation with an unsupervised mean square error (MSE) loss function:
$$L^{MSE}=\frac{1}{|\mathbf{X}|}\sum_{x\in\mathbf{X}}(x-\hat{x})^2$$
(3)
where x is the original input, (hat{x}) is the reconstructed input, and X are all contigs in the dataset.
The model used here is a dense deep neural network (see Supplementary Fig. 2). The dimension of the input features, F, depends on the number of samples used in the binning (see “Preprocessing” Section). The first two layers of the encoding and the decoding network are each followed by batch normalization51, a leaky rectified linear unit52, and a dropout layer53 (dropout rate 0.2). Training optimizes the contrastive loss and the MSE loss simultaneously with the Adam54 optimization algorithm. The model was implemented in PyTorch55.
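The architecture and joint training objective can be sketched in PyTorch as follows (the hidden-layer size of 512 is our assumption; the 100-dimensional embedding, the dropout rate of 0.2, and the combination of the two losses follow the description above):

```python
import torch
from torch import nn

class SemiSiamese(nn.Module):
    """Encoder-decoder pair; the two siamese branches share these weights."""
    def __init__(self, n_features, embedding_dim=100, hidden=512, dropout=0.2):
        super().__init__()
        def block(n_in, n_out):
            return nn.Sequential(nn.Linear(n_in, n_out), nn.BatchNorm1d(n_out),
                                 nn.LeakyReLU(), nn.Dropout(dropout))
        self.encoder = nn.Sequential(block(n_features, hidden), block(hidden, hidden),
                                     nn.Linear(hidden, embedding_dim))
        self.decoder = nn.Sequential(block(embedding_dim, hidden), block(hidden, hidden),
                                     nn.Linear(hidden, n_features))

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

def contrastive_loss(z1, z2, y, margin=1.0):
    """Eq. (2): y = 1 for must-link pairs, y = 0 for cannot-link pairs."""
    d = torch.norm(z1 - z2, dim=1)
    return (y * d ** 2 + (1 - y) * torch.clamp(margin - d, min=0) ** 2).mean()

def train_step(model, optimizer, x1, x2, y):
    """One simplified step: both branches use the same weights; both losses
    (contrastive + reconstruction MSE) are optimised jointly with Adam."""
    z1, r1 = model(x1)
    z2, r2 = model(x2)
    loss = (contrastive_loss(z1, z2, y)
            + nn.functional.mse_loss(r1, x1) + nn.functional.mse_loss(r2, x2))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```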
Similarity between contigs
The similarity between two contigs is defined as
$$S(x_1,x_2)=1-\begin{cases}\min\left(d(\hat{x}_1,\hat{x}_2),\,1\right)&\text{if }\geq 5\text{ samples}\\ \min\left(d(\hat{x}_1,\hat{x}_2)\cdot a(x_1,x_2),\,1\right)&\text{otherwise}\end{cases}$$
(4)
where $d(\hat{x}_1,\hat{x}_2)$ is the Euclidean distance of the semi-supervised embedding. The embedding of a contig is the 100-dimensional output of the encoder of the semi-supervised siamese neural network (see Supplementary Fig. 19). When there are fewer than five samples, the embedding distance only contains k-mer information. In this case, we modeled the number of reads per base of each contig as a normal distribution56 and used the Kullback–Leibler divergence57 to measure the divergence between the distributions of two contigs, denoted a(x1, x2) above.
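For illustration, the closed-form KL divergence between the two per-contig normal distributions and its use in Eq. (4) can be sketched as below; how the (asymmetric) KL values are combined into a(x1, x2) is not detailed above, so the symmetrisation shown is an assumption:

```python
import numpy as np

def kl_normal(mu1, var1, mu2, var2):
    """KL( N(mu1, var1) || N(mu2, var2) ) for univariate normal distributions."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def abundance_divergence(mu1, var1, mu2, var2):
    """Illustrative symmetrised divergence standing in for a(x1, x2) in Eq. (4)."""
    return 0.5 * (kl_normal(mu1, var1, mu2, var2) + kl_normal(mu2, var2, mu1, var1))

def similarity(emb1, emb2, a=None):
    """Eq. (4): embedding distance alone (>= 5 samples), otherwise scaled by a(x1, x2)."""
    d = np.linalg.norm(np.asarray(emb1) - np.asarray(emb2))
    return 1.0 - min(d if a is None else d * a, 1.0)
```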
Clustering the contigs
The binning problem is modeled as clustering on a graph. First, SemiBin considers the fully connected graph with contigs as nodes and the similarity between contigs as edge weights. To make the community detection task tractable, this fully connected graph is converted to a sparse graph. A parameter (max_edges, defaulting to 200) controls the sparsity: for each node, only the max_edges edges with the highest weights are kept. To remove potentially artefactual edges introduced by the embedding procedure, SemiBin builds another graph with the same procedure using the original features (k-mer frequencies and abundance features). Edges in the graph built from the embedding that do not exist in the graph built from the original features are also removed.
After building the sparse graph, Infomap (python-igraph package58, version 0.9.7), an information-theory-based algorithm to reveal community structure in weighted graphs29, is used to reconstruct bins from the graph. If the user requests it, SemiBin can use single-copy genes of the reconstructed bins to independently re-bin those whose mean number of single-copy genes is greater than one19. For this, SemiBin uses a weighted k-means algorithm to recluster bins according to the embedding and the abundance features. Finally, SemiBin outputs the binning results, removing bins smaller than a user-definable threshold (which defaults to 200 kbp).
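The sparse-graph construction and Infomap clustering can be sketched with python-igraph (a simplified version that omits the cross-check against the graph built from the original features):

```python
import numpy as np
import igraph as ig

def cluster_contigs(similarity, max_edges=200):
    """Keep each contig's `max_edges` strongest edges and run Infomap.

    `similarity` is a symmetric (n x n) array of pairwise contig similarities.
    Returns one cluster (bin) label per contig.
    """
    n = similarity.shape[0]
    edges, weights = [], []
    for i in range(n):
        row = similarity[i].copy()
        row[i] = -np.inf                              # no self-loops
        for j in np.argsort(row)[::-1][:max_edges]:   # strongest neighbours first
            if row[j] > 0:
                edges.append((i, int(j)))
                weights.append(float(row[j]))
    graph = ig.Graph(n=n, edges=edges, edge_attrs={"weight": weights})
    partition = graph.community_infomap(edge_weights="weight")
    return partition.membership
```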
SemiBin(pretrain)
The default pipeline of SemiBin is to (1) generate must-link/cannot-link constraints, (2) train the siamese neural network model for this sample, and (3) bin based on the embeddings. Because contig annotation and model training require significant computational time, and because trained models can be transferred between samples or even projects, we propose to (1) train a model with constraints from one or several samples and (2) apply this model to other samples without further learning. This approach is termed SemiBin(pretrain). To use SemiBin(pretrain), users can train a model from their own datasets or use one of our 10 built-in pretrained models.
Binning modes
We have evaluated SemiBin in three binning modes: single-sample, co-assembly, and multi-sample binning20. Single-sample binning means binning each sample into inferred genomes after independent assembly. This mode allows for parallel binning of samples, but it does not use information across samples.
Co-assembly binning means that samples are first co-assembled and the resulting contigs are then binned using abundance information across samples. This mode can generate longer contigs and exploit co-abundance information, but co-assembly may produce inter-sample chimeric contigs1 and binning based on co-assembly cannot retain sample-specific variation20.
Multi-sample binning means that the resulting bins are sample-specific (as in single-sample binning), but the information is aggregated across samples (in our case, abundance information). This mode requires more computational resources as it requires mapping reads back to a database consisting of contigs from all the samples.
In single-sample and co-assembly binning, we calculate the k-mer frequencies and abundance for every sample and bin contigs from every sample independently. For multi-sample binning, we first concatenate contigs from every sample into a single database to which short reads are mapped to generate abundance profiles. Unlike VAMB (which introduced the multi-sample binning concept), this concatenated database is then re-split by sample and the contigs from each sample are binned separately.
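A sketch of the concatenation step used for multi-sample binning (the '<sample>:<contig>' header convention used here is illustrative):

```python
def concatenate_contigs(sample_fastas, out_path, sep=":"):
    """Merge per-sample contig FASTAs into one database for read mapping.

    Headers are prefixed with the sample name so that, after mapping,
    contigs can be re-split and binned per sample.
    """
    with open(out_path, "w") as out:
        for sample, fasta_path in sample_fastas.items():
            with open(fasta_path) as fh:
                for line in fh:
                    if line.startswith(">"):
                        contig_id = line[1:].split()[0]
                        out.write(f">{sample}{sep}{contig_id}\n")
                    else:
                        out.write(line)
```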
Data used
For the benchmarking of binners, we used five simulated datasets from the CAMI challenges (CAMI I and CAMI II)25,26 and six real metagenomic datasets. CAMI I includes three datasets: low complexity, medium complexity, and high complexity. The low complexity dataset has 40 genomes in a single simulated sample. The medium complexity dataset has 132 genomes in two simulated samples with different insert sizes; here, we used the samples with 150 bp insert size. The high complexity dataset has 596 genomes in 5 samples. We also used the skin and oral cavity datasets from the toy Human Microbiome Project data in CAMI II. The Skin dataset contains 10 samples with 610 genomes, while the Oral dataset contains 10 samples with 799 genomes. We used the low complexity dataset to evaluate the single-sample binning mode of our method, the medium and high complexity datasets to evaluate the co-assembly binning mode, and the Skin and Oral datasets to evaluate the multi-sample binning mode. We used fastANI59 (version 1.32, default parameters) to calculate ANI values between genomes for every sample from the CAMI II datasets.
We also used six real microbiome projects from different environments:
1. a German human gut dataset with 82 samples31 (study accession PRJEB27928),
2. a dog gut dataset with 129 samples32 (study accession PRJEB20308),
3. a marine dataset from the Tara Oceans project with 109 ocean surface samples33 (study accessions PRJEB1787, PRJEB1788, PRJEB4352, and PRJEB4419),
4. a soil dataset with 101 samples34 (for accessions, see Supplementary Data 2),
5. a second German human gut dataset with 92 samples39 (study accession PRJNA290729),
6. a non-Westernized African human gut dataset with 50 samples7 (study accession PRJNA504891).
We used the first four datasets to evaluate single-sample and multi-sample binning mode and the last two human gut projects as hold-out datasets to evaluate transferring with a pretrained model in SemiBin. We also used samples of cat gut (n = 30)60, human oral (n = 30)61, mouse gut (n = 30)62, pig gut (n = 30)63, built environment (n = 30)64 and wastewater (n = 17)65 environments from GMGCv110 to generate pretrained models. There are four other habitats in GMGCv1, but they did not contain enough deeply sequenced samples for training and testing. For the human oral dataset, we removed repeated samples from the same individual to ensure that samples are independent. For every environment, we used 20 samples for training the model (except for the case of wastewater, where seven were used) and 10 samples for testing the results (no overlap between the training and testing samples). We additionally trained a model from all environments (training from all samples used to generate the pretrained model for the 10 environments considered here). We termed this model SemiBin(global).
For the simulated datasets, we used the gold standard contigs provided as part of the challenge. For the real datasets (except PRJNA504891), short reads were first filtered with NGLess66 (version 1.0.1, default parameters) to remove low-quality reads and host-matching reads (human reads for the human gut datasets, dog reads for the dog gut dataset), and were then assembled into contigs using Megahit67 (version 1.2.4, default parameters). For PRJNA504891, we used Megahit67 version 1.2.8 (default parameters). We mapped reads to the contigs with Bowtie268 (version 2.4.1, default parameters) to generate the BAM files used in the binning. For the multi-sample binning mode, contigs from all samples were collated into a single FASTA file, and reads from every sample were then mapped to this concatenated database to obtain one BAM file per sample.
Methods included in the benchmarking
We compared SemiBin to other methods in three binning modes. For single-sample and co-assembly binning of the CAMI I datasets, we compared our method to Maxbin2 (version 2.2.6)19, Metabat2 (version 2)18, VAMB (version 3.0.2)20, COCACOLA (git version 707e284a74b9a9257bec6dfe08205939a210ea31)21, and SolidBin (version 1.3)22. SolidBin is the only other existing semi-supervised binning method and offers several modes. Here, we focused on the modes that use information from reference genomes: SolidBin-coalign (which generates must-link constraints from reference genomes), SolidBin-CL (which generates cannot-link constraints from reference genomes), and SolidBin-SFS-CL (which generates must-link constraints from feature similarity and cannot-link constraints from reference genomes). We also included SolidBin-naive (without additional information) to show the influence of the different semi-supervised modes.
For multi-sample binning of the CAMI II datasets, we compared to VAMB, the existing multi-sample binning tool, which clusters concatenated contigs based on co-abundance across samples and then splits the clusters according to the original samples, and to default Metabat2. For more comprehensive benchmarking, we also adapted Metabat2 to a multi-sample mode: we used jgi_summarize_bam_contig_depths (with default parameters) to calculate depth values from the BAM files of every sample mapped to the concatenated contig database, and then ran Metabat2 to bin contigs for every sample with abundance information across samples, which is similar in spirit to the multi-sample mode in SemiBin. This adaptation, however, led only to modest improvements (see Supplementary Fig. 8a).
We also benchmarked single-sample and multi-sample binning modes on real datasets. For single-sample binning, we compared the performance of SemiBin to Maxbin2, Metabat2, and VAMB; for multi-sample binning, we compared to VAMB. These tools have been shown to perform well in real metagenomic projects (see Supplementary Table 3). For VAMB, we set the minimum contig length to 2000 bp. For SolidBin, we ran SolidBin with constraints generated from the MMseqs245,46 annotation results and used the binning results after postprocessing with CheckM. For SemiBin, we ran the whole pipeline described in the Methods (with default parameters). All other tools were run with default parameters.
To evaluate the effectiveness of the semi-supervised learning in SemiBin, we also benchmarked modified versions of SemiBin:
1. NoSemi, where we removed the semi-supervised learning step in the SemiBin pipeline and clustered based on the original inputs.
2. SemiBin_m, where we removed the semi-supervised learning step, but directly used the must-link constraints in the sparse graph (by adding a link between must-link contigs).
3. SemiBin_c, analogous to SemiBin_m, but using the cannot-link constraints by removing links in the sparse graph between cannot-link contigs.
4. SemiBin_mc, combining the operations of SemiBin_m and SemiBin_c.
By default, SemiBin is trained on each sample to obtain the embeddings that are used for extracting the final bins. For SemiBin(pretrain), we trained the model on several samples and applied the pretrained model to the whole project (see Fig. 3). We also used two hold-out projects to show the generalization of the pretrained model (see Supplementary Fig. 16). For the other benchmarks that used 10 samples as the testing set to evaluate the pretrained model, there was no overlap between training and testing samples (see Figs. 3a, 5 and Supplementary Fig. 12).
Computational resource usage
For evaluating computational resource usage under standardized conditions, we used two Amazon Web Services (AWS) virtual machines. We used the machine type g4ad.4xlarge, with one CPU (2nd-generation AMD EPYC) providing 8 physical cores and 16 logical cores, and 64 GB of RAM, to run Maxbin2, Metabat2, VAMB, and SemiBin in CPU mode. Additionally, we used the type g4dn.4xlarge, which contains an NVIDIA Tesla T4 GPU, to run VAMB and SemiBin in GPU mode.
Evaluation metrics
For simulated datasets in CAMI, we used AMBER (version 2.0.1)69 to calculate the completeness (recall), purity (precision), and F1-score to evaluate the performance of different methods.
In the real datasets, as ground truth is not available, we evaluated the completeness and contamination of the predicted bins with CheckM35 (version 1.1.3, using the lineage_wf workflow with default parameters). We defined high-quality bins as those with completeness >90% and contamination <5%27 that also pass the chimerism detection implemented in GUNC36 (version 0.1.1, with default parameters). Medium-quality bins are defined by completeness ≥50% and contamination <10%, and low-quality bins by completeness <50% and contamination <10%27.
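These thresholds translate directly into a small classification rule (a sketch; the "unclassified" label for bins outside the three tiers is ours):

```python
def quality_tier(completeness, contamination, passes_gunc=True):
    """Assign a quality tier from CheckM estimates (in percent) and the GUNC check."""
    if completeness > 90 and contamination < 5 and passes_gunc:
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    if contamination < 10:
        return "low"
    return "unclassified"
```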
Model transfer between environments
To evaluate the generalization of the learned models, we selected models trained on single samples from the human gut, dog gut, and ocean microbiome datasets. From each dataset, we selected the models trained on the samples that generated the highest, median, and lowest numbers of high-quality bins. For the human gut dataset, we termed them human_high, human_median, and human_low, with models from the other environments named analogously. For every environment, we also randomly selected 10 samples from the remaining samples as testing sets (no overlap between training and testing sets). We then transferred these models to the testing sets from the same or different environments and used their embeddings to bin the contigs.
To evaluate the effect of training with different numbers of samples, for every environment, we also randomly chose 10 samples as testing sets and trained the model on different numbers of training samples (randomly chosen 1, 3, 5, 10, 15, and 20 samples; no overlap between the training sets and testing sets). For each number of samples, we randomly chose samples and trained the model 5 times.
To evaluate the pretraining approach, we used another two human gut datasets. We termed SemiBin with a pretrained model from the human gut dataset used before as SemiBin(pretrain; external). We also trained a model from 20 randomly chosen samples from the hold-out datasets and applied it to the same dataset; an approach we termed SemiBin(pretrain; internal). We benchmarked SemiBin(pretrain; external), Metabat2, original SemiBin, and SemiBin(pretrain; internal).
MAG analyses
To identify the overlap between the bins obtained with single-sample binning by SemiBin and Metabat2, we used Mash (version 2.2, with default parameters)38 to calculate the distance between bins from the two methods. Bins from the two methods with a Mash distance ≤0.01 were matched and considered the same genome in subsequent comparisons. After matching the bin sets, we classified the high-quality bins from each method into 4 classes: HQ-HQ, also high-quality in the other method; HQ-MQ, medium-quality in the other; HQ-LQ, low-quality or worse in the other; and HQ-Miss, not found in the other. We calculated the recall, precision, and F1-score for the HQ-HQ class. Recall and precision are the completeness and 1 − contamination values estimated by CheckM, respectively; the F1-score is 2 × (recall × precision)/(recall + precision). To evaluate the species diversity recovered by the different methods, we annotated high-quality bins with GTDB-Tk (version 1.4.1, using the classify_wf workflow with default parameters)37.
For the analysis of B. vulgatus bins, average nucleotide identity (ANI) comparisons were calculated using fastANI59 (version 1.32, -fragLen 1000). B. vulgatus bins were annotated with Prokka (version 1.14.5, with default parameters)70. Pan-genome analyses were carried out using Roary (version 3.13.0, -i 95 -cd 100)71. We used Scoary (version 1.6.16, -c BH)72 to identify genes with significant differences between the human and dog gut microbiome datasets. Core-gene phylogenies were reconstructed with IQ-TREE (version 1.6.9, -m MFP -bb 1000 -alrt 1000)73, using automatic model selection and 1000 ultrafast bootstrap pseudoreplicates, and visualized with the ggtree package (version 1.8.154)74. Principal component analysis was done using the prcomp function from the stats75 package.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

