# Breast Cancer Prediction Algorithm

## Overview

This project focuses on building a machine learning algorithm using Python's Pandas, Seaborn, and Scikit-Learn libraries to predict whether a breast tumor is malignant or benign based on various attributes. The dataset used comprises features extracted from breast cancer biopsies.

## Data Exploration and Preprocessing

The project began by importing the dataset with Pandas, followed by an exploration of its structure: the number of entries, the data types, and summary statistics. To clean the data, the unneeded 'Unnamed: 32' column was dropped.
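The inspect-then-drop step can be sketched on a tiny stand-in file. The rows below are made up for illustration, and the sketch assumes the unneeded column is an empty one created by a trailing comma on each CSV line, which the column name suggests but the write-up does not confirm.

```python
import io

import pandas as pd

# Made-up CSV in the same shape as the real file. Every line ends with a
# trailing comma, which pandas reads as an extra, empty "Unnamed: N" column.
raw = io.StringIO(
    "id,diagnosis,radius_mean,texture_mean,\n"
    "1,M,17.9,10.4,\n"
    "2,B,13.5,14.4,\n"
    "3,B,12.1,18.0,\n"
)
df = pd.read_csv(raw)

# Structure checks: number of rows and columns, then the type of each column.
print(df.shape)
print(df.dtypes)

# Find the columns that contain no values at all and drop them.
empty_columns = [name for name in df.columns if df[name].isna().all()]
cleaned = df.drop(columns=empty_columns)
print(list(cleaned.columns))
```

Detecting empty columns rather than naming them keeps the cleanup working if the column's position, and therefore its generated name, changes.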

## Data Visualization

Visualizing the dataset helped in understanding the distribution and characteristics of individual features. Histograms and boxplots, drawn with Seaborn and Pandas, gave insight into features such as 'radius_mean' and 'texture_mean'.
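To make the boxplot easier to read, here is a rough sketch of the numbers it draws: the quartiles, the interquartile range, and the 1.5 × IQR fences beyond which points are shown as outliers. The helper name 'boxplot_summary' and the sample values are hypothetical and not taken from the dataset.

```python
import statistics


def boxplot_summary(values):
    """Return the statistics a standard boxplot displays for one feature."""
    ordered = sorted(values)
    # "inclusive" quartiles use linear interpolation over the observed data.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in ordered if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "outliers": outliers,
    }


# Made-up texture-like values with one unusually large measurement.
sample = [10, 12, 14, 16, 18, 20, 22, 24, 40]
summary = boxplot_summary(sample)
print(summary)
```

In this sample the value 40 lies above the upper fence, so a boxplot would draw it as an isolated point rather than extend the whisker to it.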

## Model Development

The dataset was split into training (70%) and testing (30%) sets using Scikit-Learn's 'train_test_split' function. The predictor matrix ('X') holds every column except 'diagnosis', and the target variable ('Y') is the 'diagnosis' column. For this project, a Decision Tree Classifier from Scikit-Learn was chosen as the predictive model.
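As an illustration of what a shuffled 70/30 split does, the sketch below partitions row indices using only the standard library. It is not Scikit-Learn's implementation, so the indices it picks will differ from those of 'train_test_split'. The function name 'split_indices' is hypothetical, and 569 is an assumed row count that the text above does not state.

```python
import random


def split_indices(n_rows, train_fraction=0.7, seed=0):
    """Shuffle row indices reproducibly and cut them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Round the training size down; the remaining rows form the test set.
    n_train = int(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]


# 569 rows is an assumption made for illustration.
train_idx, test_idx = split_indices(569)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split repeatable, which is the role 'random_state' plays in 'train_test_split'. For imbalanced classes, that function's 'stratify' parameter may be worth considering so both sets keep similar class proportions.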

## Model Training and Evaluation

The algorithm was trained using the training set ('X_train' and 'Y_train') and then tested for its predictive accuracy using the test set ('X_test' and 'Y_test'). Metrics such as precision, recall, and F1-score were calculated using the 'classification_report' from Scikit-Learn's 'metrics' module.
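The metrics in the report follow directly from counts of correct and incorrect predictions. The sketch below computes them by hand for one class. The function name 'binary_report' and the label lists are made up for illustration, and treating 'M' as the positive class is an assumption.

```python
def binary_report(y_true, y_pred, positive="M"):
    """Compute precision, recall and F1-score for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels: three malignant cases found, one missed, one false alarm.
y_true = ["M", "M", "M", "B", "B", "B", "B", "M"]
y_pred = ["M", "M", "B", "B", "B", "M", "B", "M"]
precision, recall, f1 = binary_report(y_true, y_pred)
print(precision, recall, f1)
```

For a screening task like this one, recall on the malignant class is often the figure to watch, since a missed malignant tumor is usually the costlier mistake.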

## Visualizing the Decision Tree

To gain a deeper understanding of the Decision Tree model, a graphical representation was created using the 'plot_tree' function from Scikit-Learn's 'tree' module. This visual representation provides insight into how the algorithm makes decisions based on various features.
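Each box in the plotted tree reports an impurity value, which by default in Scikit-Learn's 'DecisionTreeClassifier' is the Gini impurity. The sketch below shows how that value is computed and how a candidate threshold is scored. It is a simplified illustration, not the library's code, and the function names and measurements are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure node, 0.5 for an even two-class mix."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(values, labels, threshold):
    """Weighted Gini impurity of the two children produced by a threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Made-up radius values and diagnoses, for illustration only.
radius = [11.0, 12.5, 13.0, 14.0, 17.5, 19.0, 20.5, 21.0]
diagnosis = ["B", "B", "B", "M", "B", "M", "M", "M"]

print(gini(diagnosis))
print(split_impurity(radius, diagnosis, 13.5))
print(split_impurity(radius, diagnosis, 15.0))
```

The tree prefers the threshold with the lower weighted impurity, which in this example is 13.5 because it leaves the left child containing only benign cases.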

## Conclusion

The classification report on the held-out test set shows how reliably the model separates malignant from benign tumors using the provided features; because the tree is built without a fixed 'random_state', the exact scores can differ slightly from run to run. This project serves as an illustrative example of applying machine learning techniques in healthcare, specifically in cancer prediction, which holds significant real-world implications.

This project showcases not only the technical skills in data preprocessing, machine learning modeling, and evaluation but also the ability to translate findings into actionable insights, vital for data scientists.

---

## Source Code

import pandas as pd

df = pd.read_csv("BreastCancer (2).csv")

df.head(20)

df.describe()

df.info()

df = df.drop(columns=["Unnamed: 32"])

#Number of samples in each diagnosis class

df.diagnosis.value_counts()

#Graphics

import seaborn as sns

#Histogram of radius_mean with a kernel density estimate overlaid
sns.histplot(df["radius_mean"], kde=True, stat="density")

df.boxplot(column="texture_mean")

#Split the data into a training set (70%) and a test set (30%)

from sklearn.model_selection import train_test_split

#Define the predictor matrix X and the target Y
X = df.drop("diagnosis", axis=1)
Y = df.diagnosis

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, train_size= 0.7, random_state=0)

#Algorithm Training: Decision Tree

from sklearn.tree import DecisionTreeClassifier

algorithm = DecisionTreeClassifier()

#Training

algorithm.fit(X_train, Y_train)

#Algorithm Testing

#Predict the diagnosis for every row of the test set

prediction = algorithm.predict(X_test)

#Compare the predictions with the true labels (precision, recall, F1-score)

from sklearn import metrics

print(metrics.classification_report(Y_test, prediction))

#Graphic representation of the Decision Tree

from matplotlib import pyplot as plt
from sklearn import tree

fig = plt.figure(figsize=(25,20))
#Take feature names from X; class names must follow the order of algorithm.classes_
DecisionTree = tree.plot_tree(algorithm, feature_names=list(X.columns),
                 class_names=list(algorithm.classes_), filled=True)

X_test.info()

#Predict a single sample: the input must have the same columns as X, so use one row of X_test
algorithm.predict(X_test.iloc[[0]])

Dataset: Breast Cancer Kaggle Dataset

I recommend running this on Google Colab or Jupyter to see all the graphics, decision tree…

Thanks for reading!

Gerard Puche