time_months
sex
cancer
smoking
obesity
hypertension
dyslipidemia
diabetes
cardiovascular disease
creatinine
outcome
Dataset Preparation
- All code was written in Python 3.10.12 using the following libraries:
  - pandas, NumPy, lifelines, Matplotlib, seaborn, SHAP, and scikit-learn
- For modeling, the dataset was split into an 80% training set and a 20% test set.
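
The 80/20 split can be sketched with the standard library alone. The source does not say how the split was drawn, so the shuffling, the seed, the rounding rule, and the helper name `split_indices` are all assumptions made for illustration; with scikit-learn the usual call would be `train_test_split(X, y, test_size=0.2)`.

```python
import math
import random


def split_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into (train, test) lists.

    Hypothetical helper: the seed and the choice to round the test size up
    are assumptions, not details taken from the source.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    n_test = math.ceil(n_rows * test_fraction)    # size of the held-out set
    return indices[n_test:], indices[:n_test]


# 1,186 is the total patient count reported in the outcome table.
train_idx, test_idx = split_indices(1186)
print(len(train_idx), len(test_idx))
```

Splitting indices rather than rows keeps the sketch independent of how the data is stored; the same index lists can then be used to select rows from a DataFrame.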
Basic Data Exploration
| Time interval | Lived | Died | Total Count | Lived (%) | Died (%) | Total (%) |
|---|---|---|---|---|---|---|
| 0-1_years | 12 | 8 | 20 | 60 | 40 | 100 |
| 1-2_years | 12 | 14 | 26 | 46.15 | 53.85 | 100 |
| 2-3_years | 14 | 12 | 26 | 53.85 | 46.15 | 100 |
| 3-4_years | 24 | 11 | 35 | 68.57 | 31.43 | 100 |
| 4-5_years | 31 | 11 | 42 | 73.81 | 26.19 | 100 |
| 5-6_years | 39 | 10 | 49 | 79.59 | 20.41 | 100 |
| 6-7_years | 68 | 19 | 87 | 78.16 | 21.84 | 100 |
| 7-8_years | 101 | 9 | 110 | 91.82 | 8.18 | 100 |
| 8-9_years | 305 | 14 | 319 | 95.61 | 4.39 | 100 |
| 9-10_years | 445 | 4 | 449 | 99.11 | 0.89 | 100 |
| 10_years_plus | 23 | 0 | 23 | 100 | 0 | 100 |
| Total Count | 1,074 | 112 | 1,186 | 90.56 | 9.44 | 100 |
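
The row percentages in the outcome table follow directly from the counts, and the interval labels from `time_months`. A minimal sketch of both steps is given here; the helper names `interval_label` and `row_percentages` are hypothetical, and the left-inclusive binning (12 months falls in `1-2_years`) is an assumption, since the source does not state how boundaries were handled.

```python
def interval_label(time_months):
    """Map follow-up time in months to a yearly interval label.

    Assumption: bins are left-inclusive, so 12 months maps to "1-2_years".
    """
    years = int(time_months // 12)
    if years >= 10:
        return "10_years_plus"
    return f"{years}-{years + 1}_years"


def row_percentages(lived, died):
    """Return (lived %, died %) for one row, rounded to two decimals."""
    total = lived + died
    return round(100 * lived / total, 2), round(100 * died / total, 2)


print(interval_label(7))            # a patient followed for 7 months
print(row_percentages(12, 14))      # counts from the 1-2_years row
print(row_percentages(1074, 112))   # counts from the Total Count row
```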
Outcomes by Risk Factors/Comorbidities
Cox Proportional-Hazards Model
\begin{equation}
\scriptsize
\begin{aligned}
h(t | x) = h_0(t) \times \exp(b_1x_1 + b_2x_2 + \cdots + b_px_p)
\end{aligned}
\end{equation}
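
One consequence of this form is worth spelling out: for a binary covariate $x_j$ with all other covariates held fixed, the baseline hazard cancels and the hazard ratio reduces to $\exp(b_j)$. The symbol $\mathrm{HR}_j$ is introduced here for illustration and does not appear in the source.

\begin{equation}
\scriptsize
\begin{aligned}
\mathrm{HR}_j &= \frac{h(t | x_j = 1)}{h(t | x_j = 0)} \\
&= \frac{h_0(t) \times \exp\bigl(b_j + \sum_{k \neq j} b_kx_k\bigr)}{h_0(t) \times \exp\bigl(\sum_{k \neq j} b_kx_k\bigr)} \\
&= \exp(b_j)
\end{aligned}
\end{equation}

A positive coefficient therefore corresponds to a hazard ratio above 1 (higher hazard), and a negative coefficient to a hazard ratio below 1.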
\begin{equation}
\scriptsize
\begin{aligned}
h(t | X) = h_0(t) \times \exp\biggl(&0.09X_{\text{sex}} + 0.62X_{\text{cancer}} \\
&+ 0.09X_{\text{smoking}} - 0.71X_{\text{obesity}} \\
&+ 0.87X_{\text{hypertension}} - 1.10X_{\text{dyslipidemia}} \\
&+ 0.57X_{\text{diabetes}} + 1.39X_{\text{cardiovascular disease}}\biggr)
\end{aligned}
\end{equation}
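
To make the fitted coefficients easier to read, they can be exponentiated into hazard ratios. The sketch below copies the eight coefficients from the fitted equation and assumes every covariate is coded 0/1, which the source does not state explicitly; the names `coefficients`, `hazard_ratios`, and `relative_hazard` are hypothetical.

```python
import math

# Coefficients copied from the fitted Cox equation.
coefficients = {
    "sex": 0.09,
    "cancer": 0.62,
    "smoking": 0.09,
    "obesity": -0.71,
    "hypertension": 0.87,
    "dyslipidemia": -1.10,
    "diabetes": 0.57,
    "cardiovascular_disease": 1.39,
}

# Hazard ratio per covariate: exp(b_j), other covariates held fixed.
hazard_ratios = {name: math.exp(b) for name, b in coefficients.items()}


def relative_hazard(patient):
    """Hazard relative to a patient whose covariates are all 0.

    `patient` maps covariate names to 0/1 values (assumed coding);
    covariates that are missing from the dict are treated as 0.
    """
    linear_predictor = sum(
        b * patient.get(name, 0) for name, b in coefficients.items()
    )
    return math.exp(linear_predictor)


for name, hr in sorted(hazard_ratios.items(), key=lambda item: -item[1]):
    print(f"{name:24s} {hr:.2f}")

print(relative_hazard({"hypertension": 1, "diabetes": 1}))
```

Because the baseline hazard $h_0(t)$ cancels in the ratio, this gives only relative risk between patients, not an absolute survival probability.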
Model Prediction Comparisons
Performance Assessment
| Metric | Logistic Regression | Support Vector Machine | Random Forest Classifier | ExtraTrees Classifier | Best Model |
|---|---|---|---|---|---|
| AUC ROC | 0.897520 | 0.863178 | 0.916831 | 0.914088 | Random Forest Classifier |
| PR AUC | 0.500029 | 0.431027 | 0.518698 | 0.466860 | Random Forest Classifier |
| Precision | 0.666667 | 0.562500 | 0.590909 | 0.466667 | Logistic Regression |
| Recall | 0.476190 | 0.428571 | 0.619048 | 0.333333 | Random Forest Classifier |
| Specificity | 0.976959 | 0.967742 | 0.958525 | 0.963134 | Logistic Regression |
| Average Precision | 0.520694 | 0.466152 | 0.545936 | 0.484984 | Random Forest Classifier |
| Brier Score | 0.053218 | 0.062154 | 0.057231 | 0.058575 | Logistic Regression |
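
The threshold-based metrics in the table are simple ratios of confusion-matrix counts, and the Brier score is a mean squared error on predicted probabilities. A minimal sketch follows. The counts used for logistic regression (TP=10, FP=5, FN=11, TN=212) are not given in the source; they are inferred here as one set of counts that reproduces the reported precision, recall, and specificity, and should be read as an assumption. The helper names are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, and specificity from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),      # share of predicted deaths that occurred
        "recall": tp / (tp + fn),         # share of actual deaths that were caught
        "specificity": tn / (tn + fp),    # share of survivors correctly identified
    }


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome.

    Lower is better; 0 means perfectly confident and correct predictions.
    """
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Inferred (not reported) counts that are consistent with the
# Logistic Regression column of the table.
lr = classification_metrics(tp=10, fp=5, fn=11, tn=212)
for name, value in lr.items():
    print(f"{name}: {value:.6f}")

# Toy illustration of the Brier score on two made-up predictions.
print(brier_score([0, 1], [0.1, 0.8]))
```

With only 9.44% of patients dying, specificity is high for every model almost by default, which is why precision, recall, and the precision-recall summaries are the more informative rows for comparing models.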
Decision Tree from Random Forest Model
Feature Importance
Partial Dependence (Creatinine vs. Time)
References
- Al-Shamsi, S., Govender, R. D., & King, J. (2021). Predictive value of creatinine-based equations of kidney function in the long-term prognosis of United Arab Emirates patients with vascular risk. Oman Medical Journal, 36(1), e217. https://doi.org/10.5001/omj.2021.07
- Al-Shamsi, S., Govender, R. D., & King, J. (2019). Predictive value of creatinine-based equations of kidney function in the long-term prognosis of United Arab Emirates patients with vascular risk [Dataset]. Mendeley Data, V1. https://data.mendeley.com/datasets/ppfwfpprbc/1