Project Overview: This project uses logistic regression to predict diabetes occurrence based on demographic and medical data. By identifying high-risk individuals early, this model aids in the proactive management of diabetes, supporting healthcare providers in prevention efforts.
Objective: The goal is to develop a predictive model that classifies individuals as diabetic or non-diabetic, contributing to early detection and better management strategies.
Project Methodology:
Dataset: The dataset, sourced from Kaggle, contains approximately 100,000 records with features such as Age, BMI, HbA1c level, and Blood Glucose level.
Model Code: Below is the Python code used to build and evaluate the logistic regression model:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
# Load the dataset and drop rows with missing values
data = pd.read_csv('path_to_data.csv')
data = data.dropna()
# Feature engineering: binary flag for current smokers
# (note: this flag is not part of the feature set X selected below)
data['smoking_history_trans'] = (data['Smoking History'] == 'smoker').astype(int)
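# Model building: select the predictors and the target
# Gender_Male, Hypertension_1 and Heart_Disease_1 are assumed to be
# one-hot encoded columns that already exist in the data
X = data[['Age', 'BMI', 'HbA1c_level', 'Blood_Glucose_level', 'Gender_Male', 'Hypertension_1', 'Heart_Disease_1']]
y = data['Diabetes']
# Hold out 20% of the records for testing; random_state fixes the split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)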
# Logistic Regression Model
model = LogisticRegression()
model.fit(X_train, y_train)
# Predicting and evaluating the model
predictions = model.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
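The feature names Gender_Male, Hypertension_1, and Heart_Disease_1 used above look like one-hot encoded columns, but the encoding step is not shown. Here is a minimal sketch of how such columns could be produced with pandas. The raw column names Gender, Hypertension, and Heart_Disease and the toy values are assumptions for illustration, not taken from the actual dataset.

```python
import pandas as pd

# Toy rows standing in for the raw data (hypothetical column names and values)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Hypertension": [0, 1, 0],
    "Heart_Disease": [1, 0, 0],
})

# drop_first=True drops the first category of each column, so a
# two-level column yields a single 0/1 indicator such as Gender_Male
encoded = pd.get_dummies(
    toy,
    columns=["Gender", "Hypertension", "Heart_Disease"],
    drop_first=True,
    dtype=int,
)

print(encoded)
```

With two-level columns, dropping the first category avoids carrying two perfectly redundant indicators into the regression.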
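Results: The model achieved an accuracy of 94.75%, a specificity of 97.77%, and a sensitivity of 67.98%. In other words, it rarely flags a non-diabetic individual as diabetic, but it misses roughly a third of diabetic cases. Age, BMI, HbA1c level, and Blood Glucose level were identified as significant predictors.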
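Sensitivity and specificity are not printed directly by the code above, but both can be derived from the confusion matrix. The sketch below shows one way to do that. The helper name sensitivity_specificity and the toy labels are made up for illustration and do not reproduce the project's figures.

```python
from sklearn.metrics import confusion_matrix


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity, accuracy) for binary 0/1 labels."""
    # For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)   # share of actual positives that were caught
    specificity = tn / (tn + fp)   # share of actual negatives correctly cleared
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy


# Toy labels: 1 = diabetic, 0 = non-diabetic (illustrative only)
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_toy = [0, 0, 0, 1, 1, 0, 1, 0, 1, 0]

sens, spec, acc = sensitivity_specificity(y_true_toy, y_pred_toy)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```

In the project code, the same helper could be called as sensitivity_specificity(y_test, predictions).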
Conclusion: This predictive model effectively identifies high-risk individuals, aiding in early intervention strategies for diabetes management.
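When the aim is to catch as many high-risk individuals as possible, one common option is to lower the probability cutoff used to label someone as diabetic, trading some specificity for higher sensitivity. The sketch below illustrates the idea on synthetic data generated with scikit-learn. The data, the 0.3 cutoff, and the helper recall_at are assumptions for illustration, not part of the original project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the real dataset (about 10% positives)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=6, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Predicted probability of the positive class for each test row
proba = clf.predict_proba(X_te)[:, 1]


def recall_at(threshold):
    """Sensitivity (recall) when rows with probability >= threshold are flagged."""
    return recall_score(y_te, (proba >= threshold).astype(int))


for t in (0.5, 0.3):
    print(f"threshold={t}: sensitivity={recall_at(t):.3f}")
```

Lowering the cutoff can only keep or increase sensitivity, since every row flagged at 0.5 is also flagged at 0.3. The cost is more false positives, so the choice of cutoff depends on how the predictions will be used.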
Technical Stack: Python, Pandas, Scikit-learn, and Jupyter Notebook.
You can find this project on my GitHub: GitHub Link