# Breast Cancer Prediction Algorithm
## Overview
This project focuses on building a machine learning algorithm using Python's Pandas, Seaborn, and Scikit-Learn libraries to predict whether a breast tumor is malignant or benign based on various attributes. The dataset used comprises features extracted from breast cancer biopsies.
## Data Exploration and Preprocessing
The project began by importing the dataset using Pandas, followed by a comprehensive exploration of the data. Initial steps involved examining the dataset's structure, such as the number of entries, data types, and summary statistics. To ensure clean data, redundant or unnecessary columns were dropped.
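The exploration and cleanup steps might look roughly like the sketch below. It is illustrative only: the tiny inline CSV stands in for the real file, its values are made up, and the names of the redundant columns ('id' and 'Unnamed: 32') are assumptions rather than something this write-up confirms.

```python
import io

import pandas as pd

# Made-up stand-in for the real CSV file; the values are illustrative only.
csv_text = """id,diagnosis,radius_mean,texture_mean,Unnamed: 32
1,M,18.0,10.4,
2,B,13.5,14.4,
3,M,20.6,17.8,
4,B,11.4,20.4,
"""
df = pd.read_csv(io.StringIO(csv_text))

# Structure: number of entries, data types, and summary statistics.
print(df.shape)
df.info()
print(df.describe())

# Drop columns that carry no predictive information
# (assumed here: an identifier column and an empty trailing column).
redundant = [col for col in ("id", "Unnamed: 32") if col in df.columns]
df = df.drop(columns=redundant)
print(df.columns.tolist())
```

With the real dataset, the 'io.StringIO(...)' argument would be replaced by the path to the downloaded CSV file.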
## Data Visualization
Visualizing the dataset was crucial to understanding the distribution and characteristics of individual features. Seaborn was used to create histograms and boxplots, which gave insight into the distributions of features such as 'radius_mean' and 'texture_mean'.
## Model Development
The predictor matrix ('X') and the target variable ('Y') were defined first, and the dataset was then split into training (70%) and testing (30%) sets using Scikit-Learn's 'train_test_split' function. A Decision Tree Classifier from Scikit-Learn was chosen as the predictive model.
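The 70/30 split and model choice can be sketched as follows. The bundled Scikit-Learn dataset again stands in for the Kaggle CSV, and the 'random_state' and 'stratify' arguments are assumptions added for reproducibility and balanced classes, not settings taken from the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data source (assumption): X holds the predictors, Y the diagnosis labels.
X, Y = load_breast_cancer(return_X_y=True, as_frame=True)

# Hold out 30% of the rows for testing.
# random_state and stratify are illustrative choices, not the author's settings.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42, stratify=Y
)

# Untrained Decision Tree Classifier; hyperparameters left at their defaults.
model = DecisionTreeClassifier(random_state=42)

print(X_train.shape, X_test.shape)
```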
## Model Training and Evaluation
The algorithm was trained using the training set ('X_train' and 'Y_train') and then tested for its predictive accuracy using the test set ('X_test' and 'Y_test'). Metrics such as precision, recall, and F1-score were calculated using the 'classification_report' from Scikit-Learn's 'metrics' module.
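As a rough illustration of the training and evaluation step, the sketch below fits the tree and prints a classification report. It uses the bundled Scikit-Learn dataset as a stand-in for the Kaggle CSV and an arbitrary 'random_state', so the numbers it prints will not necessarily match those of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data source (assumption); in this copy, 0 = malignant and 1 = benign.
X, Y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42
)

# Train on the training set only.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, Y_train)

# Predict on the held-out test set and compare with the true labels.
Y_pred = model.predict(X_test)
report = classification_report(
    Y_test, Y_pred, target_names=["malignant", "benign"]
)
print(report)
```

The report lists precision, recall, and F1-score per class, which matters here because a missed malignant case and a false alarm are not equally costly.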
## Visualizing the Decision Tree
To gain a deeper understanding of the Decision Tree model, a graphical representation was created using the 'plot_tree' function from Scikit-Learn's 'tree' module. This visual representation provides insight into how the algorithm makes decisions based on various features.
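A sketch of the plotting call is shown below. The 'max_depth=3' limit is an assumption made only to keep the figure readable, and the bundled Scikit-Learn dataset stands in for the Kaggle CSV. The original tree may be deeper.

```python
import matplotlib

matplotlib.use("Agg")  # only for running as a script without a display; omit in Colab or Jupyter

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Stand-in data source (assumption).
data = load_breast_cancer(as_frame=True)

# Shallow tree so the plot stays legible (illustrative choice).
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(data.data, data.target)

fig, ax = plt.subplots(figsize=(16, 8))
# Each node shows its split rule, impurity, sample counts, and majority class.
annotations = plot_tree(
    model,
    feature_names=list(data.data.columns),
    class_names=list(data.target_names),
    filled=True,
    ax=ax,
)
```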
## Conclusion
The trained Decision Tree predicts whether a tumor is malignant or benign from the provided features; the classification report gives the precision, recall, and F1-score achieved on the test set. This project serves as an illustrative example of applying machine learning techniques in healthcare, specifically in cancer prediction, which holds significant real-world implications.
This project showcases not only the technical skills in data preprocessing, machine learning modeling, and evaluation but also the ability to translate findings into actionable insights, vital for data scientists.
---
Source Code:
Dataset: Breast Cancer Kaggle Dataset
I recommend running this on Google Colab or Jupyter to see all the graphics, decision tree…
Thanks for reading!
Gerard Puche