Diabetes Prediction Using Logistic Regression

Project Overview: This project uses logistic regression to predict diabetes occurrence based on demographic and medical data. By identifying high-risk individuals early, this model aids in the proactive management of diabetes, supporting healthcare providers in prevention efforts.

Objective: The goal is to develop a predictive model that classifies individuals as diabetic or non-diabetic, contributing to early detection and better management strategies.

Project Methodology:

  • Data Preparation: Cleaned and preprocessed data, handling missing values and sampling a subset for analysis.
  • Exploratory Data Analysis (EDA): Used histograms and boxplots to analyze distributions and identify key features.
  • Model Selection: Logistic regression was chosen for its suitability in binary classification, with variable selection based on significance to the diabetes outcome.
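To illustrate why logistic regression suits a binary outcome, the following minimal sketch shows how a linear score is mapped to a probability and then to a class label. The coefficients, the intercept, and the helper names (`sigmoid`, `predict_label`) are hypothetical and chosen only for illustration; they are not the fitted values from this project.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, intercept, threshold=0.5):
    """Return (probability, label) for one individual."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(score)
    return probability, int(probability >= threshold)

# Hypothetical coefficients for (Age, BMI, HbA1c level); not fitted values.
weights = [0.04, 0.09, 1.2]
intercept = -13.0

prob, label = predict_label([55, 31.0, 6.8], weights, intercept)
print(f"probability={prob:.3f}, label={label}")
```

Lowering `threshold` below 0.5 would flag more individuals as diabetic, trading specificity for sensitivity.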

Dataset: The dataset includes approximately 100,000 records with features like Age, BMI, HbA1c level, Blood Glucose level, and more, sourced from Kaggle.

  • Medical Features: BMI, Blood Glucose level, Hypertension, Heart Disease, HbA1c level.
  • Demographic Features: Age, Gender, and Smoking History.
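Because Gender and Smoking History are categorical, they need a numeric encoding before they can enter a logistic regression. A minimal sketch of one common approach (one-hot encoding with pandas) follows; the toy rows, the category values, and the column names are assumptions for illustration and may not match the Kaggle file.

```python
import pandas as pd

# Toy rows with assumed column names and category values (not real records).
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "Smoking History": ["never", "smoker", "former"],
    "Age": [44, 61, 37],
})

# One-hot encode Gender; drop_first removes the baseline level ("Female")
# so the remaining indicator is not collinear with the intercept.
encoded = pd.get_dummies(toy, columns=["Gender"], drop_first=True, dtype=int)

# Collapse Smoking History into a single binary flag.
encoded["smoking_history_trans"] = (encoded["Smoking History"] == "smoker").astype(int)

print(encoded[["Age", "Gender_Male", "smoking_history_trans"]])
```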

Model Code: Below is the Python code used to build and evaluate the logistic regression model:

                    
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Load and prepare the dataset
data = pd.read_csv('path_to_data.csv')
data.dropna(inplace=True)  # Handling missing data

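# Feature engineering: binary flag for smoking history
# (created here but not part of the feature set used below)
data['smoking_history_trans'] = (data['Smoking History'] == 'smoker').astype(int)

# One-hot encode the categorical columns so that the indicator columns
# referenced below (Gender_Male, Hypertension_1, Heart_Disease_1) exist.
# Assumption: the raw columns are named 'Gender', 'Hypertension', 'Heart_Disease'.
data = pd.get_dummies(data, columns=['Gender', 'Hypertension', 'Heart_Disease'], drop_first=True)

# Model building
X = data[['Age', 'BMI', 'HbA1c_level', 'Blood_Glucose_level', 'Gender_Male', 'Hypertension_1', 'Heart_Disease_1']]
y = data['Diabetes']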
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

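# Logistic Regression Model
# max_iter is raised above the default of 100 because the solver can
# fail to converge on unscaled features
model = LogisticRegression(max_iter=1000)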
model.fit(X_train, y_train)

# Predicting and evaluating the model
predictions = model.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
                    
                

Results: The model achieved 94.75% accuracy, with a specificity of 97.77% and a sensitivity of 67.98%. Age, BMI, HbA1c level, and Blood Glucose level were identified as significant predictors.
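Since accuracy, specificity, and sensitivity are all derived from the confusion matrix, a short sketch of the arithmetic may help. The counts below are hypothetical and do not reproduce the project's results; `metrics_from_confusion` is an illustrative helper, not part of scikit-learn.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Compute accuracy, sensitivity, and specificity from the four cell counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of diabetic cases detected
        "specificity": tn / (tn + fp),  # share of non-diabetic cases correctly cleared
    }

# Hypothetical counts for an imbalanced test set (not the project's data).
result = metrics_from_confusion(tn=900, fp=20, fn=30, tp=50)
for name, value in result.items():
    print(f"{name}: {value:.2%}")
```

With scikit-learn, the four counts for binary labels 0 and 1 can be unpacked from `confusion_matrix(y_test, predictions).ravel()` in the order tn, fp, fn, tp. When most records are non-diabetic, accuracy can stay high even if sensitivity is considerably lower, which is why the three metrics are best reported together.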

Conclusion: This predictive model effectively identifies high-risk individuals, aiding in early intervention strategies for diabetes management.

Technical Stack: Python, Pandas, Scikit-learn, and Jupyter Notebook.

You can find this project on my GitHub: GitHub Link