
Predicting cognitive impairment in outpatients with epilepsy using machine learning techniques

Patient demographics

To keep the influence of missing data relatively unbiased and controllable after imputation, eight of the 441 cases, each missing more than 30% of clinically relevant information, were excluded (ref. 15). A total of 433 patients with complete information were analyzed and divided into a training group (n = 304) and a test group (n = 129). All patients aged 12 years or older were included. Wilcoxon rank sum tests showed that age differed between the MMSE groups (training group: W = 5378, p = 1.295E-05; test group: W = 927.5, p = 0.003) but not between the MoCA groups (training group: W = 9284.5, p = 0.290; test group: W = 1595, p = 0.244). Given this result, we did not stratify by age group; multivariate statistics can analyze the statistical relationships among multiple objects and indicators when they are correlated with each other.
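The exclusion rule above can be sketched as follows. This is a minimal illustration only, not the authors' pipeline: the field names, the toy records, and the helper names `missing_fraction` and `drop_sparse_cases` are all hypothetical, and missing values are assumed to be stored as `None`.

```python
def missing_fraction(record, fields):
    """Fraction of the listed fields that are missing (None) in one record."""
    missing = sum(1 for f in fields if record.get(f) is None)
    return missing / len(fields)


def drop_sparse_cases(records, fields, max_missing=0.30):
    """Keep records whose missing fraction does not exceed max_missing."""
    return [r for r in records if missing_fraction(r, fields) <= max_missing]


# Hypothetical field names and toy records, for illustration only.
fields = ["age", "sex", "age_of_onset", "seizure_frequency"]
records = [
    {"age": 34, "sex": "F", "age_of_onset": 20, "seizure_frequency": 2},
    # 1 of 4 fields missing (25%): kept for later imputation
    {"age": 51, "sex": "M", "age_of_onset": None, "seizure_frequency": 1},
    # 2 of 4 fields missing (50%): excluded
    {"age": None, "sex": "M", "age_of_onset": None, "seizure_frequency": 4},
]

kept = drop_sparse_cases(records, fields)
print(len(kept))  # 2
```

The counts reported in the text are consistent with this rule: 441 - 8 = 433 cases retained, split into 304 + 129 = 433.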

The database consisted of 12 features in two categories: two features provided demographic information, and the other 10 described clinical symptoms used to monitor clinical outcomes. Descriptive statistical analysis was performed on all 12 features. Table 1 shows the details of the selected features.

Table 1 Features of epileptic patients.

Prediction modeling for MMSE scores of patients with diagnosed epilepsy

The training dataset contained 304 patients. We identified patients with cognitive impairment using the MMSE scale according to educational level. Seventy patients with epilepsy were found to have cognitive impairment. The accuracy, positive predictive value, and specificity of the logistic regression (LR), decision tree (DT), random forest (RF), and support vector machine (SVM) models are reported in Table 2. The AUC values of these four models (LR, DT, RF, and SVM) were 0.67, 0.63, 0.72, and 0.70, respectively. The mean AUC of the RF model after internal cross-validation (0.72) was significantly higher than that of the other models, so RF was selected as the optimal modeling approach.
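For readers unfamiliar with the AUC, the sketch below computes it as the probability that a randomly chosen impaired case receives a higher predicted score than a randomly chosen unimpaired case, and then averages it over cross-validation folds. It is an illustration under stated assumptions: the function names, labels, and scores are made up, and the paper does not say how its AUC values were computed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_cv_auc(folds):
    """Average the AUC over folds, each given as a (labels, scores) pair."""
    aucs = [roc_auc(labels, scores) for labels, scores in folds]
    return sum(aucs) / len(aucs)


# Toy held-out predictions for two folds (1 = impaired, 0 = normal).
fold_a = ([1, 1, 0, 0, 0], [0.9, 0.4, 0.5, 0.3, 0.1])
fold_b = ([1, 0], [0.8, 0.2])

print(roc_auc(*fold_a))               # 5 of 6 pairs ranked correctly
print(mean_cv_auc([fold_a, fold_b]))  # mean of the two fold AUCs
```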

Table 2 Results of cross-validation for different machine learning algorithms using MMSE data.

Furthermore, we ranked the importance of each feature in the RF model. The top ten features were age, age of onset, sex, usage of drugs, history of brain trauma or surgery, seizure type, seizure frequency, epileptiform discharge in EEG, brain MRI abnormalities, and other brain diseases (Table 3). Importance was measured as the mean decrease in accuracy: the higher the value, the greater the importance of the variable.
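The idea behind mean decrease in accuracy can be sketched with a simple permutation scheme: shuffle one feature at a time and record how much the accuracy drops. This is a simplified, model-agnostic illustration, not the authors' implementation; random forest software typically computes this measure on out-of-bag samples, whereas the sketch below uses a fixed dataset, a hypothetical `predict` function, and toy data.

```python
import random


def accuracy(predict, rows, labels):
    """Share of rows for which predict(row) matches the label."""
    hits = sum(1 for x, y in zip(rows, labels) if predict(x) == y)
    return hits / len(labels)


def mean_decrease_accuracy(predict, rows, labels, n_repeats=20, seed=0):
    """Average accuracy drop after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [x[j] for x in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            permuted = [x[:j] + [v] + x[j + 1:] for x, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy model that only looks at feature 0; feature 1 is ignored.
def predict(x):
    return int(x[0] > 0.5)


rows = [[0.1, 5], [0.9, 3], [0.2, 7], [0.8, 1],
        [0.3, 2], [0.7, 8], [0.4, 6], [0.6, 4]]
labels = [predict(x) for x in rows]

importances = mean_decrease_accuracy(predict, rows, labels)
print(importances)  # feature 0 shows a drop; feature 1 shows none
```

A feature the model never uses has an importance of exactly zero here, because shuffling it cannot change any prediction.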

Table 3 Ranking of feature importance in the RF model (MMSE).

Prediction modeling for MoCA scores of patients with diagnosed epilepsy

We used the same strategy described above to build the four machine learning models for MoCA outcomes. The AUC values of the four models (LR, DT, SVM, and RF) were 0.62, 0.61, 0.60, and 0.71, respectively (Table 4). In terms of generalization, the RF model therefore performed better than the other algorithms on the MoCA data, with an AUC of 0.71.

Table 4 Results of cross-validation for different machine learning algorithms using MoCA data.

We determined the variable importance in the RF model. The top ten features in the ranking of feature importance were status epilepticus, history of brain trauma or surgery, seizure frequency, sex, usage of drugs, age, epileptiform discharge in EEG, family history, brain MRI abnormalities and age of onset (Table 5).

Table 5 Ranking of feature importance in the RF model (MoCA).

RDA contributes to variable constraint

Unlike methods that handle a single outcome, redundancy analysis (RDA) can explain the comprehensive relationship between dual outcomes and exposure features. It can determine the degree to which the critical features explain the two outcomes and remove redundant information. Therefore, the RDA model was used in this study to examine the correlation between the outcome variable matrix and the exposure variable matrix. The eigenvalues of RDA1 and RDA2 were 0.719 and 0.281, respectively. The accumulated constrained eigenvalues showed the contribution of each feature (Table 6, Fig. 2).

Table 6 The contribution of features in the RDA model.
Figure 2

RDA plot. The length of each arrow shows the strength of the correlation between the feature and the outcome variables: the longer the arrow, the stronger the correlation. The vertical distance also reflects the correlation between them: the smaller the distance, the stronger the correlation.

Ranked by RDA1 values, the ten features with the highest contribution rates for the MMSE and MoCA outcomes were age, age of onset, sex, usage of drugs, seizure frequency, epileptiform discharge in EEG, brain MRI abnormalities, status epilepticus, seizure type, and family history.

Selection of the optimal combination of features

In the RF modeling of the MMSE and MoCA data, we obtained the top ten features by contribution rate for each outcome. We also selected the top ten features by contribution rate to the bivariate outcome from the RDA model. The optimal candidate features were filtered by Venn analysis, which yielded seven overlapping features: sex, age, age of onset, seizure frequency, brain MRI abnormalities, epileptiform discharge in EEG, and usage of drugs (Fig. 3).

Figure 3

In the Venn diagram, each circle represents the variables selected by one model; the numbers in the overlapping regions give the variables shared between models, and the numbers in the non-overlapping regions give the variables unique to each model (purple: MMSE; yellow: MoCA; green: RDA).
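The Venn filtering amounts to a set intersection of the three top-ten lists reported in this section. The sketch below reproduces it from those lists; the variable names are illustrative, and the feature labels are copied from the text.

```python
# Top-ten lists as reported in the text for each model.
rf_mmse = {
    "age", "age of onset", "sex", "usage of drugs",
    "history of brain trauma or surgery", "seizure type", "seizure frequency",
    "epileptiform discharge in EEG", "brain MRI abnormalities",
    "other brain diseases",
}
rf_moca = {
    "status epilepticus", "history of brain trauma or surgery",
    "seizure frequency", "sex", "usage of drugs", "age",
    "epileptiform discharge in EEG", "family history",
    "brain MRI abnormalities", "age of onset",
}
rda = {
    "age", "age of onset", "sex", "usage of drugs", "seizure frequency",
    "epileptiform discharge in EEG", "brain MRI abnormalities",
    "status epilepticus", "seizure type", "family history",
}

# Features shared by all three rankings (the centre of the Venn diagram).
optimal = rf_mmse & rf_moca & rda
print(len(optimal))  # 7
for feature in sorted(optimal):
    print(feature)
```

The intersection contains the same seven features named in the text.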

Validation for the optimal combination of features

To validate the optimal combination of features, we compared four feature sets in external validation, using the model that performed best in the internal validation of binary classification: the optimal combination of features, the top ten features of RDA modeling, the top ten features for MoCA outcomes in RF modeling, and the top ten features for MMSE outcomes in RF modeling.

Validation for MMSE outcomes

Verification results for the RF models built on the various feature combinations showed that the optimal combination of features had the highest area under the ROC curve (0.786; Fig. 4). Detailed results for all feature combinations are given in Table 7; the optimal combination of features had specificity, accuracy, and precision values of 0.90, 0.82, and 0.61, respectively.
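The specificity, accuracy, and precision reported above all derive from the confusion matrix of a binary classifier. The sketch below shows the standard definitions on toy labels; the function names and data are illustrative and do not reproduce the study's results.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, with 1 = impaired."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Safe division: None when the denominator is zero."""
    return numerator / denominator if denominator else None


def classification_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "precision": ratio(tp, tp + fp),    # positive predictive value
    }


# Toy example: 3 impaired and 7 normal cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With imbalanced classes, as here, specificity can be high while precision stays modest, which is the pattern seen in the MMSE validation results.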

Figure 4

ROC curves of the MMSE prediction model (red: the optimal combination of variables; blue: the top ten features of RDA; green: the top ten features of MMSE RF analysis; purple: the top ten features of MoCA RF analysis).

Table 7 Validation of all feature combinations on the validation dataset (MMSE).

Validation for MoCA outcomes

Verification results for the various feature-combination models showed that the optimal combination of features had the highest area under the ROC curve (0.702; Fig. 5). Its specificity and precision, 0.90 and 0.90, respectively, were also the highest among all combinations of features (Table 8).

Figure 5

ROC curves of the MoCA prediction model (red: the optimal combination of variables; blue: the top ten features of RDA; green: the top ten features of MMSE RF analysis; purple: the top ten features of MoCA RF analysis).

Table 8 Validation of all feature combinations on the validation dataset (MoCA).

Evaluation for dual MMSE and MoCA outcomes

Different candidate variables had different clinical values. The optimal combination of features was used not only to predict MMSE and MoCA scores individually but also to predict the two outcomes jointly. We built the confusion matrix of the four feature combinations for predicting the MMSE and MoCA outcomes together and calculated the accuracy (Table 9). The column names in the table denote MMSE outcomes (0 indicates normal and 1 indicates cognitive impairment) and MoCA outcomes (0 indicates normal and 1 indicates cognitive impairment).

Table 9 Probability that each candidate feature set predicts both outcomes correctly at the same time.

The results showed that the best predictor of normal cognitive function (MMSE = 0, MoCA = 0) was the top ten features of RDA, followed by the candidate variable combination, with prediction accuracies of 38.47% and 38.1%, respectively. In particular, the best predictor of cognitive impairment (MMSE = 1, MoCA = 1) was the optimal combination of features, with an accuracy of 50.00% (Table 9). Together, these results indicate that the candidate features form the optimal combination of features for predicting not only the MMSE or MoCA outcome individually but also both outcomes together.
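One way to read the joint accuracies above is as the share of cases in each true (MMSE, MoCA) combination for which both predictions are correct at the same time. The sketch below implements that reading on toy data. It is an assumption about how Table 9 was built, not a confirmed description, and the function name and data are illustrative.

```python
from collections import defaultdict


def joint_accuracy(true_pairs, pred_pairs):
    """Per true (MMSE, MoCA) combination, the share of cases predicted
    correctly on both outcomes simultaneously."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(true_pairs, pred_pairs):
        totals[truth] += 1
        if truth == pred:  # both outcomes must match
            correct[truth] += 1
    return {combo: correct[combo] / totals[combo] for combo in totals}


# Toy data: each tuple is (MMSE outcome, MoCA outcome), 1 = impaired.
true_pairs = [(0, 0), (0, 0), (1, 1), (1, 1), (0, 1), (1, 0)]
pred_pairs = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 1), (0, 0)]

result = joint_accuracy(true_pairs, pred_pairs)
print(result[(0, 0)])  # 0.5
print(result[(1, 1)])  # 0.5
```

Because a case counts only when both outcomes are right, joint accuracies are never higher than the accuracy on either outcome alone.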
