Data collection
The first step of the work consisted in data collection from the literature, which represents the first dataset. It should be stated that currently in the literature no established database is available related to LbL coating thickness, and the available experimental data is relatively scarce compared to the number of LbL related articles. The second dataset was based on the QCM-D experiments done in the laboratory. For the data extracted from the literature, different ways of thickness calculation/estimation were used, such as AFM, ellipsometry, confocal microscopy. Coating thickness determined by these methods may differ, but due to the limited amount of the available data, we selected to include all the data regardless of the thickness measurement method. All the results extracted from the literature, as well as obtained in the laboratory, were entered in the tables describing different parameters (Table S1). Of note, all the multilayers used in the study were prepared by dip-coating or a similar technique (simple polymer solution deposition on the substrate followed by adsorption time and rinsing). Thus, the coating preparation method was not among the parameters influencing thickness.
Influence of different construction parameters on coating thickness
As a first step, distribution of the thickness values in the coatings made of different polymers was studied (Fig. 2). The results show that for polycations, the coatings made of poly(L-lysine) (PLL) have the greatest thickness median values with large interquartile range (IQR), which overlaps the thickness distribution of chitosan (CHI)-made coatings (Fig. 2A). In Fig. 2B, coatings having poly(L-glutamic acid) (PGA) have the greatest thickness median value with IQR overlapping with hyaluronic acid (HA)-containing coatings thickness distribution. The large thickness distribution of the LbL films containing the aforementioned polymers is probably due to the high frequency of their utilization and therefore to the wide variety of molecular weights (MW) that have been used.


Boxplots showing the distribution of the thickness values in the coatings made of different polymers. (A) Thickness depending on polycations. (B) Thickness depending on polyanions. COL: collagen, PEI: polyethylenimine, PLL: poly(L-lysine), PAH: poly(allylamine hydrochloride), CHI: chitosan, glyc-CHI: glycol-chitosan, PDADMA: poly(diallyldimethylammonium chloride), PAR: poly(L-arginine), PTEMC: poly(trimethylammonium ethyl methacrylate chloride), HA: hyaluronic acid, ALG: alginate, PGA: poly(L-glutamic acid), CSA: chondroitin sulfate, FUC: fucoidan, PSS: poly(styrene sulfonate), PCBS: poly[1-[4-(3-carboxy-4-hydroxyphenylazo)benzenesulfonamido]-1,2-ethanediyl, sodium salt, CARlambda: λ-carrageenan, CARkappa: κ-carrageenan, PSSMA: poly(4-styrenesulfonic acid-co-maleic acid), PAA: poly(acrylic acid), DEX: dextran, HEP: heparin.
Next, linear relationships between the coating thickness and different parameters were evaluated using the Pearson correlation, and non-linear relationships were analyzed with the predictive power score (PPS) method (Fig. 3). The PPS allows seeing non-symmetrical relationships between variables: features located on the x-axis are independent variables (influencers), and features on the y-axis are dependent variables (influenced by x-features).


Pearson correlation (A) and Predictive Power Scores (PPS) (B) calculated for the final thickness of the coating and coating features. The first seven Pearson correlations are statistically significant (p ≤ 0.05). Only features having PPS > 0.001 with the target value are included. Full names of polymer features are provided in SI Table S3.
The results show that concentrations of polymers and the number of bilayers in the coating have a strong positive linear relationship with the resulting thickness (Fig. 3A), while polyanion molecular weight and buffer properties have some non-linear relationships with this property (Fig. 3B). The effect of concentration and the number of bilayers are intuitive, however, their relative importance in an overall tendency sense cannot be extracted from a single or few types of LbL films, which is the currently common practice. Similarly, the discrepancy between the relative importance of polyanion (PA) MW and polycation (PC) MW is also not evident.
One of the most important features which have strong linear relationships with the coating thickness is the number of bilayers in the coating (Fig. 3).
As stated above, this dependency is largely obvious, and therefore, the data had to be unified by this number. For this, we decided to calculate the thickness at eight bilayers for each coating in the literature-generated data using the growth function (Eq. 1).
The growth function, describing changes in coating thickness d with the number of layers, N, has three coefficients, a0, a, and b, which vary depending on the coatings16,17. In this function parameter, b defines function curvature: for b ≥ 0.05, the growth is exponential, for b < 0.05, it is nearly linear.
$$d = a_{0} + a*e^{bN}$$
(1)
We extracted data on the dynamics of each coating growth from the original research papers. Then we used these data to calculate coefficients of growth function. Having these coefficients, we defined the thickness of coatings having eight bilayers (N = 16). With this, we created a new target value, the thickness of the 8-bilayer coating, which was not dependent on the number of bilayers.
In this configuration, the type of polyanion, its concentration and its molecular weight were found to positively influence the thickness of the coating, while the charge density of polyanion negatively correlated with the coating thickness (Fig. 4A). Three features (on the x-axis) contributed to the coating thickness (on the y-axis) (Fig. 4B): type of polyanion, polyanion unit molecular weight, and resulting molecular weight. The reason why polyanion characteristics appear more important than those of polycation can be explained by pH values at which polyelectrolyte multilayers were built (pH values are close to the pKa of the acid groups).


Pearson correlation (A) and Predictive Power Scores (PPS) (B) calculated for the thickness of the 8-bilayered coating and coating features. The first three and the last Pearson correlations are statistically significant (p ≤ 0.05). Only features having PPS > 0.001 with the target value are included. Full names of polymer features are provided in SI Table S3.
Coating thickness predictions
After the determination of the most influential parameters, the next step was the build-up of a predictive model. We constructed a Bagging Regressor model18 to make predictions about coating thickness using ten features from the original data set: presence of HA, presence of poly(styrene sulfonate) (PSS), resulting MW of the polycation and of the polyanion, unit MW of the polycation and of the polyanion, concentration of polycation and concentration of polyanion, the concentration of NaCl, and charge density of polyanion. Two quantitative structure–property relationship (QSPR) regression models were constructed using the selected features, Bagging Regressor and support vector regression (SVR) (Table 1).
These models have smaller root-mean-square error (RMSE) values for the validation set than the Bagging Regressor constructed using the original dataset features. This fact indicates their greater potential for generalization. The model performance was evaluated with RMSE values calculated for the training set, test set, and validation set. All the models were fivefold cross-validated. All metrics are mean values for scores from different folds from cross-validation.
However, we encountered a classical Machine Learning challenge: overfitting, when the model makes good predictions for the instances it was built on (training set), but fails to “generalize”, i.e. make good predictions for the unseen items (validation and test sets). Therefore, the large gap between training and test/validation error values is the major sign of overfitting. This is the case for the Bagging Regressor constructed on the original data set features. As we can see, this model generates a large RMSE value for the validation set, which is 4 times larger than the error for the training set. From here, we conclude that polymers as specific chemical entities are not good features by themselves, and generating numerical features that describe the chemical properties of polymers can improve the model performance.
To get features of a molecule, firstly we had to get information about its structure, which is commonly represented in simplified molecular-input line-entry system (SMILES) format, available in the PubChem database. This information is further used to predict molecular features by deep learning pre-trained models available in the DeepChem library.
For each polycation and polyanion, 123 molecular descriptors were generated, therefore each coating in the dataset was characterized by 246 molecular descriptors. Many molecular features have a moderate correlation with the thickness of the coating, so we assume that they can be used to predict this target value (Fig. 5). Mostly, molecular descriptors with high correlations with the coating thickness are related to polyanions, not polycations; this is in line with the observations in the previous section.


(A) Schematic presentation of the model building process. (B) Correlation between RDKit descriptors and thickness of the coating. Only descriptors with r ≥ 0.25 are shown. All coefficients are statistically significant (p ≤ 0.05).
In the next step, we combined features generated by DeepChem and numerical features of the polymers from the original data set. After that, we performed feature elimination with the recursive feature elimination (RFE) algorithm leaving ten features that will be used by the models. These features are MW of the polycation, MW of the polyanion, NaCl concentration, six polyanion RDKit descriptors (MinAbsEStateIndex, FpDensityMorgan3, BalabanJ, PEOE_VSA8, VSA_EState2, VSA_EState6), and one polycation descriptor Kappa1. The most significant molecular descriptors demonstrate the importance of the topological features of the polyanions (such as Balaban distance connectivity index (BalabanJ) and also polycations (Kappa1) in the formation of the supramolecular LbL assemblies in addition to the electrostatic interactions which are described by molecular operation environment (MOE) type electrotopological descriptors. Descriptors related to van der Walls forces and also partial charges underline the highly intricate nature of the LbL film formation at molecular level.
As the last step, in order to further test the generalization capacity of the final model, we have tested polymers which were not in the training set that are described solely by molecular descriptors.
In this configuration, we observed more inaccurate predictions (Table 2, Fig. 6). However, this could be expected, given the size of the training dataset. As we can see, the best results are obtained with the use of the QSPR RFE/Bagging regressor. It can predict the thickness of the coating better than two other models. We can also see that predictions of QSPR models are better than for the model RFE/Bagging constructed with original features, which predicts almost constant thickness values for all coatings (302 ± 25 nm). Predictions of QSPR models are correlating, and there are four coatings for which both regressors failed to predict thickness correctly (relative error > 100%): PAR30/PSS0.2, PAR30/CARiota_2, PAR30/PSS4, PAR30/CARiota_1 (Fig. 6B).


(A) Real (ground truth) and predicted thickness of coatings in the validation data set. (B) Relative errors for predictions made with the best of constructed models (QSPR RFE/Bagging).
Two other coatings, that have inaccurate predictions, contain CARiota polymer which was not in the dataset during training. The fact that the model fails to make predictions for the coatings which are made of unseen polymers confirms the assumption the generalization is not complete. It is interesting that despite this, there is one coating with CARiota for which relative error is small (47%), and the model was able to make an accurate prediction in this case.
Below, we discuss some of the possible reasons that may have caused large errors in the predictions made by the best of the constructed models, QSPR RFE/Bagging. For the PAR30/PSS0.2 and PAR30/PSS4, a large error may be caused by the too low molecular weight of the PSS polyanion compared to the one that the model has seen in the training set. The smallest polyanion in the dataset used for training was DEX with MW = 7.2 kDa, hereas in the validation set we have coatings where PSS has lower MW values (0.2 and 4 kDa). Moreover, in the training set, PSS MW was greater than in the validation set (60–70 kDa), so the combination of polyanion/MW of polyanion is new to the model. Because the model has made inaccurate predictions for the kind of polymers that it did not see during training, we assume that it can not generalize well and this is the reason for the prediction failures.
In the end, we observe that the model makes acceptable predictions for the coatings made of combinations of polymers that were present in the data set (like CHI/FUC, CHI/HA, and PAR/DEX). However, it fails to do so for the unseen polymers tested, due to the lack of extensive training data. We believe that the model potentially can be improved by generating and using more data for training. More data points will cover more chemical parameters, and the dependencies between features and thickness will be more informative and will have more predicting power.

