Machine learning for predicting the outcomes and risks of cardiovascular diseases in patients with hypertension: results of ESSE-RF in the Primorsky Krai

Aim. To assess the prospects of using artificial intelligence technologies in predicting the outcomes and risks of cardiovascular diseases (CVD) in patients with hypertension (HTN). Material and methods. A software application was created for semi-automatic data mining from respondent profiles; libraries for data preprocessing were analyzed. We analyzed 35 main and additional parameters of CVD risk factors in 2131 people as part of the ESSE-RF study (2014-2019). The forecasting model was built in the high-level language Python 2.7, using object-oriented programming and exception handling with multithreading support. Using randomization, learning (n=488) and test (n=245) samples were formed from the data of patients with an established diagnosis of HTN. Results. The prevalence of HTN among subjects was 34,39%. The following factors were significant for predicting CVD: anthropometric parameters, smoking, and biochemical profile (total cholesterol, ApoA, ApoB, glucose, D-dimer, C-reactive protein). Over the 5-year follow-up, CVD was found in 235 people (32,06%) with HTN and 187 people (13,38%) without HTN; mortality rates were 1,27% in subjects with HTN and 1,12% in those without HTN. The absolute mortality risk among participants with HTN (0,037) was significantly higher (p<0,05) than in participants without HTN (0,017). To create a neural network (NN), the basic Sequential model from the Keras library was used. During machine learning, 26 variables important for CVD development were used as input and 9 neurons as output, corresponding to the number of established cardiovascular events. The created NN had a predictive value of up to 97,9%, which exceeded the SCORE value (34,9%). Conclusion. The data obtained indicate the importance of risk factor phenotyping using anthropometric markers and biochemical profile for determining their significance in the top 20 predictors of CVD. Python-based machine learning provides CVD prediction according to standard risk assessments.

Most often, multivariate regression models that combine data on a limited number of established risk factors (RF) are developed to predict the risk of cardiovascular diseases (CVD). Such an algorithm assumes that all included RF are linearly associated with CVD outcomes and have limited or no interaction with each other. Because of this restrictive approach to modeling and predictors, these algorithms, in particular the Framingham, SCORE, and DECODE equations, demonstrate insufficient prognostic efficiency [1]. In various areas, including medicine, the most effective prognostic approach is data mining, especially deep neural networks (DNN). Many ready-to-use libraries now make it possible to apply DNN in practice. Such methods, based on machine learning (ML), increase the efficiency of risk prediction by using data warehouses and independently identifying new risk predictors and complex interactions between them. There are few studies on the prospects of using ML to predict CVD risk. Some studies showed that, compared with the above equations, ML significantly increases the accuracy of CVD risk prediction and, as a result, the number of patients who could benefit from preventive measures before the onset of severe manifestations [2][3][4].
The current study presents the potential value of using ML to develop a model for CVD risk prediction using blood pressure (BP) data. We prospectively analyzed data obtained by a cross-sectional examination of 2800 residents of Primorsky Krai without CVD. This examination was conducted from 2014 to 2019 as part of the ESSE-RF study. To develop a risk prediction model, we used the high-level language Python and the open-source neural network library Keras. Learning and optimization of the DNN were carried out using the Adam algorithm. The prognostic value of the DNN in the healthy general population, including a clinically significant subgroup of patients with hypertension (HTN), was assessed.
The aim of the study was to assess the prospects of using artificial intelligence technologies in predicting the outcomes and risks of CVD in patients with HTN.

Material and methods
As part of the ESSE-RF study (2014-2019), a cross-sectional examination of Primorsky Krai residents was performed [5]. The study was performed in accordance with the Declaration of Helsinki and Good Clinical Practice standards. To form a representative sample, we used the continuous method with individual invitation of participants. The inclusion criteria were as follows: signed informed consent, age 24-65 years, full completion of the questionnaire, and available data on cardiovascular RF. The exclusion criteria were refusal to participate and active cancer. A total of 2800 people were included in the study; 2131 of them (76,1%) completed the program by 2019. Patterned sampling using the data adjustment algorithm was carried out in a computer program that extracts data from respondents' questionnaires in a semi-automatic mode (Figure 1).
We analyzed the prevalence of the main RF: overweight, assessed by body mass index (BMI) and waist circumference (WC); BP and pulse pressure (PP) levels; heart rate (HR); smoking; sedentary lifestyle; and SCORE 10-year mortality risk (in individuals aged 40 to 65 years) based on gender, age, systolic blood pressure (SBP), total cholesterol (TC) and smoking status. BP levels were evaluated in accordance with the guidelines [6], with BP ≥140/90 mm Hg classified as HTN. Family history of heart disease, smoking and alcohol status were determined by history taking. Lipid profile parameters (TC, triglycerides (TG), low-density lipoproteins (LDL), high-density lipoproteins (HDL), lipoprotein(a) (LP(a)), apolipoprotein A (ApoA), and apolipoprotein B (ApoB)) as well as glucose, creatinine, uric acid, D-dimer, and C-reactive protein (CRP) levels were determined.
For neural network data analysis, the high-level language Python 2.7 (Python Software Foundation License) was used, with object-oriented programming, an exception handling mechanism, and multithreading support. After an analysis of Python libraries (TensorFlow, Keras), Keras was chosen for ML. Learning and optimization of the DNN were carried out with the Adam algorithm (adaptive moment estimation), which calculates an adaptive learning rate for each parameter. Like AdaDelta, Adam keeps an exponentially decaying average of past squared gradients; it also keeps an exponentially decaying average of past gradients m_t, similar to momentum.
Statistical processing was carried out using the software packages Stata 11.2 (StataCorp LP, USA) and R 3.2.1. Continuous variables are represented as medians with interquartile intervals; the comparison was carried out using the Student's t-test. To compare discrete variables, the chi-squared test or Fisher's exact test was used. The cumulative probabilities of CVD were estimated using the Kaplan-Meier method and compared using the log-rank test. To assess the impact of various variables on CVD risk, univariate and multivariate Cox proportional-hazards regression models were used. Hazard ratios and 95% confidence intervals with corresponding p-values are presented. Differences were considered significant at p<0,05. The effectiveness of ML prediction algorithms was assessed using a validation coefficient. The study was supported by the grant of Russian Foundation for Basic Research (№ 19-29-01077).

Results and discussion
Characteristics of the study population. Complete information on 2131 participants was determined using a randomized algorithm for computer data adjustment (Table 1). The mean age of participants at [...]
Table 1. Clinical and laboratory characteristics of the subjects
[...] -10), angina was detected in 51,06% of people with HTN; atrial fibrillation and flutter in 11,06% and 14,44% of people with and without HTN, respectively; old myocardial infarction in 5,53% and 9,09% of people with and without HTN, respectively; and unspecified stroke in 6,81% of people with HTN. In the HTN group, the absolute mortality risk was significantly higher than in individuals without HTN (0,037 vs 0,017, respectively; p<0,05); the relative mortality risk was 2,146.
CVD risk predictors. The revealed statistical differences in the studied RF between HTN (experimental) and non-HTN (comparison) groups are presented in Table 1.
The PP level exceeded the threshold level among HTN participants, with the maximum value (68,88±0,71 mm Hg) observed in women. The mean HR in the groups was within the acceptable range.
The mean TC level exceeded the normal value in all subjects. The highest mean LDL value (3,88±0,05 mmol/L) was noted in women with HTN. A significant difference in HDL levels was found between HTN and non-HTN women (p=0,007). The mean TG level exceeded the norm only in HTN men (1,77±0,08 mmol/L).
Fasting glucose >5,6 mmol/L is considered an RF for diabetes and CVD. Significant differences between the groups were found (p<0,001). The threshold level was exceeded in all HTN participants.
The mean creatinine level did not exceed acceptable values in 100% of cases. However, the studied groups differed significantly in this RF (p=0,006). The highest mean creatinine level was 318,80±4,96 μmol/L in HTN women.
LP(a) is an atherogenic lipid variant with a high prognostic value for atherosclerosis and CVD, in particular coronary artery disease. Acceptable values of LP(a) are in the range of 5-18 mg/dl. There were no significant differences in this RF between HTN and non-HTN groups (20,62±0,93 vs 20,19±0,65 mg/dl, respectively; p=0,704). Non-HTN men had a slightly higher mean LP(a) value (20,70±1,09 mg/dl) than HTN men (18,16±1,28 mg/dl), but this difference was not significant. It is assumed that ApoA and ApoB levels may be decisive in determining atherosclerosis risk, especially when other lipid parameters do not exceed the norm and/or there are no manifestations of vascular damage [7]. There were significant differences between the experimental and comparison groups in ApoA (p=0,025) and ApoB (p=0,00001) levels.
The mean values of CRP were higher in HTN individuals than in non-HTN individuals, regardless of gender. The differences between groups were significant (p=0,0001).
Thus, statistical processing revealed the following significant factors: anthropometric parameters (height, weight, BMI, WC) and blood biochemical parameters (levels of TC, fasting glucose, ApoA, ApoB, D-dimer and CRP).
ML model for predicting CVD outcomes in HTN patients. Various programming languages that support basic mathematical operators and multidimensional arrays are used to create DNN. These include interpreted languages such as Python, which we used for ML. To develop the DNN, we used the basic Sequential model from the Keras library, which is a stack of layers forming a Rumelhart multilayer perceptron. Using the randomization function X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.40, random_state=42), 2 samples were formed from the total data array: learning (n=488) and test (n=245, Figure 1). These included data from patients with established HTN. Of all the subjects with HTN (n=733), 144 participants were smokers, 170 were former smokers, and 419 were non-smokers. The input layer of the prediction model included the 26 most important variables (Table 2, Figure 2). Hidden layers were determined empirically: the first layer, where the matrix of weighting coefficients is multiplied by the matrix of input data from the previous neurons (15 neurons); the second layer, which contained the result of minimizing the error (8 neurons); and the third layer, which was used to refine the prognosis (10 neurons). The output layer consisted of 9 neurons, each of which corresponded to the number of events belonging to an ICD-10 diagnosis (Table 3).
Table 3. Stratification of hypertensive subjects aged 24-65 years without CVD at the study beginning, depending on the presence of first cardiovascular event after a 5-year follow-up
Learning and optimization of the DNN were carried out according to the Adam algorithm, which calculates adaptive learning rates for each parameter. Like AdaDelta, Adam keeps an exponentially decaying average of past squared gradients; it also keeps an exponentially decaying average of past gradients m_t, similar to momentum. The Adam algorithm differs from other adaptive methods in its rapid learning rate and efficiency. Changes in DNN accuracy during learning and testing are presented in Figure 3.
The sample size for ML was 66,6% of all HTN subjects. Learning and optimization of the DNN were carried out over 1000 epochs. As a result of learning with the Adam algorithm, the DNN accuracy reached 97,9%, and the loss value was in the range 10^-7 to 10^-8 (Figure 3). During testing, accuracy decreased to 95,5% (Figure 3).
Classification analysis. To assess the clinical significance of our results, we compared our model with the SCORE model in predicting CVD risk. At this operating point, the basic SCORE model correctly predicted 145 CVD out of 465 cases (sensitivity 61,7%, predictor coefficient 1,5%). Our ML model correctly predicted 230 CVD out of 733 subjects (sensitivity 97,9%). The resulting difference is an increase of 36,2 percentage points in the accuracy of CVD prediction with ML methods.
The Multi-Ethnic Study of Atherosclerosis (MESA) showed that indicators such as age, inflammation, and vascular diseases prevail in death prognosis. It also indicated that impaired glucose metabolism and HTN are associated with stroke prognosis, and that markers of subclinical atherosclerosis are central to the prognosis of various CVD [8]. The ML method we used is unique in that it demonstrates patterns of predictor changes that differ for specific disease outcomes. Relatively high accuracy values (from 86 to 98%) indicate the acceptability of using this ML method in cardiovascular risk estimation. The advantage of the current study is the consideration of anthropometric data, laboratory test results and other important predictors of CVD. Thus, the combination of ML and advanced phenotyping increases the accuracy of predicting cardiovascular events in the HTN population. The developed approaches allow a more accurate understanding of markers of subclinical diseases without a priori assumptions about their nature.

Conclusion
The study showed that ML methods can be effectively used for cardiovascular risk prediction. The Python-based method provides CVD prediction using standard risk assessments. Using the randomization function to select variables, followed by Cox regression methods, improves prediction. The results also indicate the importance of advanced phenotyping of the subjects using anthropometric markers and blood biochemical parameters when determining the top 20 predictors of CVD.
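The sample sizes and layer widths reported in the ML model subsection can be checked with a few lines of plain Python. The sketch below is illustrative only: it uses the standard-library random module as a stand-in for scikit-learn's train_test_split, it assumes fully connected layers with one bias per neuron (the text does not state this), and the helper names split_indices and dense_parameter_count are hypothetical.

```python
import random

N_HTN = 733                        # subjects with HTN, as reported
N_TEST = 245                       # size of the test sample, as reported
LAYER_SIZES = [26, 15, 8, 10, 9]   # input, three hidden layers, output


def split_indices(n_total, n_test, seed=42):
    """Shuffle record indices and hold out n_test of them for testing.

    Stand-in for train_test_split: the seed only mirrors random_state=42
    and does not reproduce scikit-learn's actual shuffle.
    """
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


def dense_parameter_count(sizes):
    """Weights plus biases of a fully connected network (assumed layout)."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


train_idx, test_idx = split_indices(N_HTN, N_TEST)
learning_share = round(100 * len(train_idx) / N_HTN, 1)
total_parameters = dense_parameter_count(LAYER_SIZES)

print(len(train_idx), len(test_idx))  # 488 245
print(learning_share)                 # 66.6
print(total_parameters)               # 722
```

Under these assumptions, the learning share reproduces the 66,6% quoted in the text, and the network has 722 trainable parameters.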
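The Adam update mentioned in the ML model subsection can be sketched for a single parameter in plain Python. This follows the standard published formulation of Adam rather than the Keras source. The hyperparameter values and the quadratic toy objective are assumptions chosen for illustration, since the text does not report the optimizer settings used in the study, and the function name adam_step is hypothetical.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad        # decaying average of past gradients (m_t)
    v = beta2 * v + (1 - beta2) * grad ** 2   # decaying average of past squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# First step from theta = 0 with gradient -6: the move is close to lr itself,
# because the update is normalized by the gradient magnitude.
first_theta, _, _ = adam_step(0.0, -6.0, 0.0, 0.0, 1, lr=0.05)

# Toy objective (assumed): f(theta) = (theta - 3)^2, with gradient 2 * (theta - 3).
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 1001):
    grad = 2.0 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)

print(first_theta, theta)
```

Because each step is scaled by the running estimate of the squared gradient, the effective step size differs per parameter, which is the adaptive learning rate referred to in the text.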
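The sensitivities quoted in the classification analysis are consistent with both models being scored against the 235 first cardiovascular events recorded in the HTN group. That denominator is an assumption inferred from the reported percentages, not a statement from the text, and the function name sensitivity is illustrative.

```python
def sensitivity(true_positives, total_events):
    """Share of actual events that a model flagged: TP / (TP + FN)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return true_positives / total_events


CVD_EVENTS_HTN = 235  # assumed denominator: CVD events among HTN subjects

score_sens = sensitivity(145, CVD_EVENTS_HTN)  # events correctly predicted by SCORE
ml_sens = sensitivity(230, CVD_EVENTS_HTN)     # events correctly predicted by the ML model
gain_pp = 100 * (ml_sens - score_sens)         # difference in percentage points

print(f"SCORE: {100 * score_sens:.1f}%  ML: {100 * ml_sens:.1f}%  gain: {gain_pp:.1f} pp")
# SCORE: 61.7%  ML: 97.9%  gain: 36.2 pp
```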
Relationships and activities: the study was supported by a grant from the Russian Foundation for Basic Research (№ 19-29-01077).