Kidney Function and Mortality in the United Arab Emirates

A Predictive Modeling Approach
by Leonid Shpaner
Reproducing and Extending Original Work by Al-Shamsi et al.



Size and Scope

1,186 patients | 11 columns

Features

time_months
sex
cancer
smoking
obesity
hypertension
dyslipidemia
diabetes
cardiovascular disease
creatinine
outcome

Dataset Preparation

  • All coding was done in Python 3.10.12 using the following libraries:
      • pandas, NumPy, lifelines, Matplotlib, seaborn, SHAP, and scikit-learn
  • For modeling, the dataset was split into an 80% training set and a 20% test set.
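
The 80/20 allocation can be sketched as follows. This is a minimal, standard-library illustration only: the helper name `split_indices`, the seed, and the plain shuffled (non-stratified) split are assumptions, since the slides do not state how the split was drawn.

```python
import math
import random


def split_indices(n_rows, test_fraction=0.20, seed=42):
    """Shuffle row indices and split them into train and test lists.

    Hypothetical helper: the seed and the non-stratified shuffle are
    assumptions, not details taken from the original analysis.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_fraction)  # round the test size up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(1186)
print(len(train_idx), len(test_idx))
```

With 1,186 patients and the test size rounded up, this gives 948 training rows and 238 test rows.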

Basic Data Exploration

Outcome          Lived   Died   Total Count   Lived (%)   Died (%)   Total (%)
0-1_years           12      8            20       60.00      40.00         100
1-2_years           12     14            26       46.15      53.85         100
2-3_years           14     12            26       53.85      46.15         100
3-4_years           24     11            35       68.57      31.43         100
4-5_years           31     11            42       73.81      26.19         100
5-6_years           39     10            49       79.59      20.41         100
6-7_years           68     19            87       78.16      21.84         100
7-8_years          101      9           110       91.82       8.18         100
8-9_years          305     14           319       95.61       4.39         100
9-10_years         445      4           449       99.11       0.89         100
10_years_plus       23      0            23      100.00       0.00         100
Total            1,074    112         1,186       90.56       9.44         100
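
The percentage columns follow directly from the counts. The short sketch below recomputes them for a few rows of the table; the dictionary layout and the helper name `row_percentages` are illustrative choices, not part of the original code.

```python
# Counts copied from the table above: interval -> (lived, died)
counts = {
    "1-2_years": (12, 14),
    "8-9_years": (305, 14),
    "Total": (1074, 112),
}


def row_percentages(lived, died):
    """Return (lived %, died %) for one row, rounded to two decimals."""
    total = lived + died
    return round(100 * lived / total, 2), round(100 * died / total, 2)


for interval, (lived, died) in counts.items():
    lived_pct, died_pct = row_percentages(lived, died)
    print(f"{interval}: lived {lived_pct}%, died {died_pct}%")
```

The "Total" row reproduces the overall mortality of 9.44% (112 of 1,186 patients).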

Outcomes by Risk Factors/Comorbidities

Cox Proportional-Hazards Model

\begin{equation}
\scriptsize
\begin{aligned}
h(t \mid X) = h_0(t) \times \exp(b_1X_1 + b_2X_2 + \cdots + b_pX_p)
\end{aligned}
\end{equation}

\begin{equation}
\scriptsize
\begin{aligned}
h(t \mid X) = h_0(t) \times \exp\biggl(&0.09X_{\text{sex}} + 0.62X_{\text{cancer}} \\
&+ 0.09X_{\text{smoking}} - 0.71X_{\text{obesity}} \\
&+ 0.87X_{\text{hypertension}} - 1.10X_{\text{dyslipidemia}} \\
&+ 0.57X_{\text{diabetes}} + 1.39X_{\text{cardiovascular disease}}\biggr)
\end{aligned}
\end{equation}
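
Exponentiating a coefficient gives the hazard ratio for that covariate, and exponentiating the full linear predictor gives a patient's hazard relative to the baseline h_0(t). The sketch below works through both using the coefficients from the fitted equation. It assumes each covariate is coded 0/1; the helper `relative_hazard` and the example patient are hypothetical.

```python
import math

# Coefficients copied from the fitted equation above
coefficients = {
    "sex": 0.09,
    "cancer": 0.62,
    "smoking": 0.09,
    "obesity": -0.71,
    "hypertension": 0.87,
    "dyslipidemia": -1.10,
    "diabetes": 0.57,
    "cardiovascular_disease": 1.39,
}

# Hazard ratio per covariate: exp(b_j)
hazard_ratios = {name: math.exp(b) for name, b in coefficients.items()}


def relative_hazard(profile):
    """Hazard relative to baseline h_0(t) for a 0/1 covariate profile.

    Covariates missing from the profile are treated as 0 (assumption).
    """
    linear_predictor = sum(coefficients[k] * v for k, v in profile.items())
    return math.exp(linear_predictor)


# Hypothetical patient: hypertension and diabetes only
example = {"hypertension": 1, "diabetes": 1}
print(round(hazard_ratios["cardiovascular_disease"], 2))
print(round(relative_hazard(example), 2))
```

Under these assumptions, the cardiovascular disease coefficient of 1.39 corresponds to a hazard ratio of about 4.0, while the negative coefficients for obesity and dyslipidemia correspond to hazard ratios below 1.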

Model Calibration

Model Prediction Comparisons

Performance Assessment

Metric              Logistic Regression   Support Vector Machine   Random Forest Classifier   ExtraTrees Classifier   Best Model
AUC ROC             0.897520              0.863178                 0.916831                   0.914088                Random Forest Classifier
PR AUC              0.500029              0.431027                 0.518698                   0.466860                Random Forest Classifier
Precision           0.666667              0.562500                 0.590909                   0.466667                Logistic Regression
Recall              0.476190              0.428571                 0.619048                   0.333333                Random Forest Classifier
Specificity         0.976959              0.967742                 0.958525                   0.963134                Logistic Regression
Average Precision   0.520694              0.466152                 0.545936                   0.484984                Random Forest Classifier
Brier Score         0.053218              0.062154                 0.057231                   0.058575                Logistic Regression
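
Precision, recall, and specificity are simple ratios of confusion-matrix counts. The sketch below shows the arithmetic. The counts are an assumption: they were back-calculated so that the ratios match the Logistic Regression column and are not reported in the slides. The helper name `threshold_metrics` is likewise illustrative.

```python
def threshold_metrics(tp, fp, fn, tn):
    """Precision, recall, and specificity from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Assumed counts, back-calculated from the Logistic Regression column;
# they are not reported in the slides.
logistic_regression = threshold_metrics(tp=10, fp=5, fn=11, tn=212)
for name, value in logistic_regression.items():
    print(f"{name}: {value:.6f}")
```

Note that the Brier score is an error measure, so the lowest value in the table is the best one.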

AUROC Curves

Confusion Matrix

Decision Tree from Random Forest Model


Feature Importance

Partial Dependence (Creatinine vs. Time)

Click here for interactive 3D plot

References

  • Al-Shamsi, S., Govender, R. D., & King, J. (2021). Predictive value of creatinine-based equations of kidney function in the long-term prognosis of United Arab Emirates patients with vascular risk. Oman Medical Journal, 36(1), e217. https://doi.org/10.5001/omj.2021.07

  • Al-Shamsi, S., Govender, R. D., & King, J. (2019). Predictive value of creatinine-based equations of kidney function in the long-term prognosis of United Arab Emirates patients with vascular risk [Dataset]. Mendeley Data, V1. https://data.mendeley.com/datasets/ppfwfpprbc/1

Thank you!

Questions?

Email: Lshpaner@ucla.edu