Prediction of heart disease based on logistic regression

Heart disease is a major threat to human health, has a variety of contributing factors, and is not easily cured. This paper uses a dataset from a cardiovascular study of residents of Framingham, Massachusetts. First, three models (logistic regression, random forest, and decision tree) are compared on accuracy, precision, recall, and F1 value. ROC curves are then plotted, and AUC is used as a reference criterion for assessing the predictive effectiveness of the models; on this basis the logistic regression model was selected. Then the raw data were preprocessed, including the handling of missing values. Finally, a logistic regression model was developed to analyze the factors influencing heart disease. The purpose of this study is to use the results of the logistic model to help doctors and patients in heart disease treatment. The results show that the model has a good predictive effect.


Introduction
Heart disease afflicts many individuals and families. As technology develops and living standards improve, more and more people are paying attention to their health. In recent years, the incidence of heart disease in many regions has been on the rise, and the loss of life caused by heart disease is also rising year by year. The World Health Organization estimates that 12 million people die of heart disease each year globally. For example, in some developed countries, such as the United States, more than half of the inhabitants die of cardiovascular diseases. To reduce the incidence of heart disease and the associated mortality, the factors contributing to heart disease should be studied further so that targeted interventions can be designed.
First, many researchers believe that reducing the incidence of acute postoperative lung injury in neonates with heart disease can significantly improve child survival [1]. Among adults, many bad lifestyle habits may also be a major factor in the predisposition to heart disease. For example, it has been suggested that in China the incidence of cardiovascular disease is higher among smokers than in the nonsmoking population [2]. Metabolic conditions such as high fasting plasma glucose (HFPG) are significant risk factors for cardiovascular disease in humans [3,4]. In China, with the gradual development of the economy, the lifestyle and nutritional structure of the population have changed dramatically, and lifestyle habits such as excessive sugar intake and lack of exercise have led to an increasing prevalence of HFPG [5].
The disease burden of ischemic heart disease (IHD) attributable to HFPG in Chinese residents has obvious gender and age group characteristics. From a gender perspective, all the disease burden indicators of the female population are lower than those of the male population, and the trend of disease burden in the total population is more strongly influenced by the male group, which may be related to female physiology [6]. The main reasons for lower life expectancy in men also include behavioral factors such as smoking and alcohol consumption, genetic and physiological factors, and higher rates of injury mortality [7,8]. However, some findings run contrary to popular belief: light drinkers are less likely to develop aortic stenosis than never-drinkers [9,10]. For example, if a person drinks 60 grams of alcohol per day, he may have a lower risk of developing the disease than someone who drinks 10 grams of alcohol per day [9,10].
Wang et al. have shown in their studies that heart disease is often closely related to disability in the elderly [11]. When older adults were selected for the study, the results showed that the risk of the disease increased twofold for every 10 years of age [12]. In terms of education level, Ni concluded that the risk of developing activity of daily living (ADL) limitations in elderly cardiac patients with elementary school or higher education was 0.666 times that of elderly cardiac patients who had never attended school [13]. Married, cohabiting and educated urban elderly cardiac patients had a lower risk of ADL limitation [13].
In summary, it was initially determined that the prevalence of heart disease is related to several factors such as age, gender, genetic factors, amount of smoking, amount of alcohol consumption, level of education, marital status, and current status of social development. This study will predict which types of patients are most likely to develop heart disease in the future by analyzing given characteristics, comparing differences between patients, and making predictions about future trends, with the ultimate goal of providing a basis for reducing the incidence of heart disease.

Data source and description
This study utilizes a dataset provided by the Kaggle platform, which is derived from an ongoing cardiovascular study of residents in the town of Framingham, Massachusetts. The dataset has a total of 4,239 samples, each with 16 variables. Fifteen of the variables are independent, with each variable's attribute being a potential risk factor. The last variable, "TenYearCHD", is the dependent variable, indicating whether the patient is at risk of having coronary heart disease (CHD) in the next ten years.

Selection and description of indicators
The variables include both quantitative variables such as "Age" and "CigsPerDay" and categorical variables such as "Male" and "Education". Because the variables are of different types, this paper describes them according to their data type. Each quantitative variable is shown in Table 1 and each categorical variable is shown in Table 2.

Method introduction
There are many ways to predict whether or not a patient will suffer from heart disease. However, the predicted results are sometimes very different from the real situation, and because an incorrect prediction can affect whether the patient receives timely treatment, or even the patient's life, making a correct prediction or judgment is crucial [14]. Logistic regression is a probabilistic regression model and a kind of generalized linear model; it is widely used in probabilistic prediction and classification and is simple, efficient, and highly interpretable [15,16]. In this study, the samples in the above dataset were modeled using logistic regression, and the fitted model was then analyzed to obtain the main factors influencing the diagnosis of heart disease. In statistics, logistic regression is a type of regression analysis used to predict the outcome of a categorical dependent variable from predictors, or independent variables. In binary logistic regression, as used here, the dependent variable takes one of two values. Below is the logistic regression equation:

$$P(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n)}}$$

After inserting all the variables, the author gets the following equation:

$$\ln\frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{15} X_{15}$$

where $Y$ denotes the explained variable, which in the logistic regression model denotes whether or not heart disease is diagnosed; $X = (X_1, \ldots, X_{15})$ denotes the explanatory variables, which in the model are the factors influencing whether or not one has heart disease; and $\beta = (\beta_0, \beta_1, \ldots, \beta_{15})$ is the vector of parameters to be estimated.
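To make the logistic regression equation concrete, the following minimal Python sketch evaluates P(Y = 1 | X) for one patient. The function name `logistic_probability`, the two predictors, and all coefficient values are assumptions chosen purely for illustration; they are not the fitted values reported in Table 4.

```python
import math


def logistic_probability(intercept, coefficients, features):
    """Return P(Y = 1 | X) for a logistic regression model."""
    # Linear predictor: beta_0 + sum_i beta_i * x_i
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    # The logistic (sigmoid) function maps the linear predictor into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters, for illustration only (not the paper's estimates)
beta_0 = -5.0
betas = [0.06, 0.5]   # assumed weights for age and male
patient = [50, 1]     # a 50-year-old man

p = logistic_probability(beta_0, betas, patient)
print(round(p, 3))  # linear predictor is -1.5, so p is about 0.182
```

With these assumed numbers the linear predictor is -1.5, so the predicted probability falls below 0.5 and such a patient would be classified as not at risk under the usual 0.5 threshold.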

Correlation analysis
Figure 1 shows a heat map of the relationships between the features, from which the correlation between any two features can be read directly. The heat map shows the correlation coefficient for every pair of variables, which ranges from -1 to 1: a value greater than 0 indicates that the two variables are positively correlated, a value less than 0 indicates that they are negatively correlated, and a value equal to 0 indicates that they are not correlated. The larger the absolute value, the stronger the correlation, and the smaller the absolute value, the weaker the correlation. As can be seen from Figure 1, the four variables diaBP, sysBP, prevalentStroke, and age are positively correlated with TenYearCHD and have larger coefficients than the other variables, indicating that they are more closely related to whether or not the disease is present.
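The values behind such a heat map are pairwise correlation coefficients. The sketch below assumes the Pearson coefficient (the text does not name the coefficient used) and computes a small correlation matrix in plain Python. The helper names and the toy rows are invented for illustration and are not taken from the Framingham data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map every pair of column names to its correlation coefficient."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy data, invented for illustration only
toy = {
    "age": [39, 46, 48, 61, 46, 43],
    "sysBP": [106.0, 121.0, 127.5, 150.0, 130.0, 180.0],
    "TenYearCHD": [0, 0, 0, 1, 0, 0],
}

matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Each diagonal entry is 1 because a variable is perfectly correlated with itself, and the matrix is symmetric, which is why a heat map of this kind is mirrored across its diagonal.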

Comparison of different models
In this paper, the effectiveness of the logistic regression model is assessed by comparing it with two commonly used models, random forest and decision tree. The models were compared on four indicators: accuracy, precision, recall, and F1 value. The results are shown in Table 3. The comparative ROC curves of the three models are plotted in Figure 2. According to these results, no model outperforms the others on all indicators. However, on a comprehensive consideration, the logistic regression model ranks first in accuracy (0.835) and precision (0.538) and second in recall and F1 value. According to the ROC curve, the area under the curve (AUC) of the logistic regression model is 0.65, which is not the highest but differs from that of the random forest model by only 0.02. This result indicates that the logistic regression model has a good predictive effect on the heart disease data used in the present study, and it is also relevant for the subsequent prediction of heart disease data used for similar purposes.
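The four indicators used in this comparison can all be derived from the counts of true and false positives and negatives. The following sketch shows the standard definitions; the function name and the label vectors are hypothetical and do not reproduce the values in Table 3.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels, for illustration only
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case the accuracy (0.625) is noticeably higher than the recall (about 0.33), which illustrates why accuracy alone can be misleading when positive cases are the minority and why several indicators are reported side by side.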
It is important to choose the model with better results, and after a comprehensive evaluation, this paper decides to use the logistic regression model for the subsequent research.
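The AUC used as a reference criterion in this comparison can be read as the probability that a randomly chosen positive sample receives a higher predicted score than a randomly chosen negative one. The sketch below computes it directly from that definition, which is equivalent to the area under the ROC curve; the function name and the scores are assumptions for illustration, not outputs of the models in this paper.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities, for illustration only
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.80, 0.30, 0.45, 0.50, 0.10, 0.70]

auc = roc_auc(y_true, scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The pairwise loop is quadratic in the number of samples, so for a dataset of several thousand rows a rank-based implementation from a statistics library would usually be preferred.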

Logistic regression results
Before the logistic regression is performed, some data preprocessing steps are required. First, missing values are handled by removing the records that contain null values, with the aim of ensuring that the data is clean and usable. The processed dataset is then divided into two parts, a training set and a test set: the training set is used to train the logistic regression model, and the test set is used to evaluate the performance of the model. In this study, the 15 factors affecting the determination of heart disease were used as independent variables in a binomial logistic regression. The regression results are organized in Table 4. Table 4 gives the estimated values of the parameters and the mean square error corresponding to the values, in addition to the p-value and the odds ratio (OR). A variable is considered significant when p is less than 0.05. The OR value compares the probability of an event occurring with the probability of it not occurring, which in this paper is expressed as the ratio of having a heart attack to not having a heart attack under the condition of that independent variable.
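The two preprocessing steps described here, removing records with null values and splitting the remainder into training and test sets, can be sketched as follows. The paper does not state the split ratio or whether the rows were shuffled, so the 20% test fraction, the random seed, and the toy records are all assumptions made for illustration.

```python
import random


def drop_missing(rows):
    """Keep only the rows in which no field is None."""
    return [row for row in rows if all(value is not None for value in row.values())]


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy records, invented for illustration only
raw = [
    {"age": 39, "glucose": 77.0, "TenYearCHD": 0},
    {"age": 46, "glucose": None, "TenYearCHD": 0},   # missing value, will be dropped
    {"age": 48, "glucose": 70.0, "TenYearCHD": 0},
    {"age": 61, "glucose": 103.0, "TenYearCHD": 1},
    {"age": 46, "glucose": 85.0, "TenYearCHD": 0},
    {"age": 43, "glucose": 99.0, "TenYearCHD": 0},
]

clean = drop_missing(raw)
train, test = train_test_split(clean, test_fraction=0.2, seed=0)
print(len(raw), len(clean), len(train), len(test))  # 6 5 4 1
```

Removing incomplete records is the simplest strategy and matches the description in the text, at the cost of discarding the information in the other fields of those records.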

Discussion
From the regression results in Table 4, it can be seen that male, age, education, cigsPerDay, prevalentStroke, prevalentHyp, diabetes, sysBP, diaBP, BMI and heartRate have a statistically significant (p<0.05) effect on having heart disease; these variables are closely linked to heart disease. In contrast, currentSmoker, BPMeds, totChol, and glucose did not have a significant effect on the presence of heart disease (p>0.05), so they were not the main influencing factors for the final confirmation of heart disease.
According to the signs of the regression coefficients, there is a negative correlation between the level of education and the ten-year risk of heart attack, indicating that a higher level of education may reduce the risk, which can also be seen in Figure 2. The coefficients for diaBP, BMI, and heartRate are also negative, indicating that these variables have a negative effect on the diagnosis of heart disease. The results also show that gender has a significant effect on the final diagnosis of heart disease: men may have a higher 10-year risk of heart attack than women, which may be related to the different lifestyles of men and women, for example, far more men than women choose to smoke or drink alcohol. In addition, the rest of the influencing factors have a positive effect on the ten-year risk of heart attack, with the slopes of age, cigsPerDay, and sysBP being relatively flat and the slopes of prevalentStroke, prevalentHyp and diabetes being larger, indicating that these variables affect the final diagnosis of heart disease to varying degrees.
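The link between the sign of a coefficient and its effect on risk can be illustrated through the odds ratio, which for a logistic regression coefficient is its exponential. In the sketch below the coefficient values are hypothetical and chosen only to show one negative, one small positive, and one larger positive case; they are not the estimates from Table 4.

```python
import math


def odds_ratio(coefficient):
    """Odds ratio for a one-unit increase in the corresponding variable."""
    return math.exp(coefficient)


# Hypothetical coefficients, for illustration only
hypothetical = {"education": -0.10, "age": 0.06, "prevalentStroke": 0.90}

for name, beta in hypothetical.items():
    ratio = odds_ratio(beta)
    direction = "lowers" if ratio < 1 else "raises"
    print(f"{name}: OR = {ratio:.3f} ({direction} the odds)")
```

A negative coefficient gives an odds ratio below 1, a positive coefficient gives one above 1, and a larger absolute coefficient moves the odds ratio further from 1, which corresponds to the steeper slopes mentioned in the discussion.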

Conclusion
Heart disease is an important problem that threatens human health, has various contributing factors, and is not easy to cure. To further analyze the causative factors of heart disease, this paper compares multiple models and finally uses logistic regression to model 15 variables that affect heart disease. The model aims to predict the probability of developing coronary heart disease over a ten-year period based on demographics, lifestyle and health-related factors. The results show that male, age, education, cigsPerDay, prevalentStroke, prevalentHyp, diabetes, sysBP, diaBP, BMI and heartRate are important factors in the diagnosis of heart disease. Finally, based on the ROC curve and AUC, it can be seen that the logistic regression model performs well for the prediction of heart disease. It is hoped that the conclusions drawn from this study will be helpful in the field of cardiology, provide a reference for both doctors and patients, and gain valuable time to save patients' lives.

Figure 2. ROC curve of three models.

Table 1. Overview of quantitative variables.

Table 2. Overview of categorical variables.

Table 3. Comparison of three models.