Breast Cancer Early Prediction and Biomarker Analysis using Data Analytics, Visualization, and Machine Learning¶

Tutorial by Chris Jose

Introduction - Motivation¶

Interested in data science? Integrating ML and Data Science into Healthcare? You are in the right place. Welcome to my Data Science Pipeline! Here, you will gain a step by step, meticulous introduction to the basic ingredients encompassing the Data Science project, from data collection, information preprocessing/feature engineering, exploratory analysis, and supervised learning which you will set you on the path to becoming a Data Science Practicioner in no time!

Specifically, today, we're aiming to create a robust classifier for predicting 2 different classes of Breast Cancer malignancies. Breast Cancer is the 2nd most common cancer among women in the United States, with over 3.8 million people having a history of breast cancer in 2021 and an estimated 281,550 new cases of invasive breast cancer in the first half of 2021 alone, according to the CDC. The steady increase in cases, combined with the excessive burden it places on fragile healthcare ecosystems especially in developing countries, difficulties in early diagnosis and prevention, and ever short supply of radiologists to meet the needs of hospitals receiving exponential amounts of data, machine learning, data science and deep learning methodologies are some of the most effective analytical tools to effectively meet some of these challenges. Through this tutorial, I will give a detailed, step-by-step introduction to the DataOps pipeline and hopefully ease your introduction into the exciting world of AI+healthcare.

The dataset I will be using has been extracted from the UCI Machine Learning Repository, found at the the following link: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29. In this csv file, firstly, important attributes of Patient ID Number along with Diagnosis(M for malignant and B for benign) are listed. Then, for each sample input, 10-real values features are computed from a digitized image of a fine needly aspirate (FNA) of a breast mass. These 10 important features describe fundamental characteristics of the cell nuclei presented in histopathological images. The quantitative descriptive statistics mean, standard error, and largest(average of 3 largest values in a column) of each feature is also given. The 10 features are as follows:

a) radius (mean of distances from center to points on the perimeter)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter^2 / area - 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry

j) fractal dimension ("coastline approximation" - 1)

The distribution of the dataset is approximately 212 malignant and 357 benign samples. The principle research question we want to ask is whether we can accurately predict, given diagnostic information extracted from a digitized histopathological image, whether the sample is benign or malignant and provide some insight into the top most important biomarkers which correlate with different variations of breat cancer, using powerful data science, data munching, and machine learning hypothesis testing toolkits at our disposal. I hope to further shed some light into the how these ML-predictive models can best generalize to new sets of multi-modal data. Without further a due, let's get started!

#import necessary packages 
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf
import seaborn as sns
import sklearn as sk
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, learning_curve, train_test_split
from sklearn.metrics import precision_score, recall_score, confusion_matrix, roc_curve, precision_recall_curve, accuracy_score

1. Data Collection/Extraction¶

Data was cleaned and extracted from the UCI Machine Learning Repository's Breast Cancer Diagnostic DataSet, which was nicely formatted into a CSV consisting of id's, 30 features, and the binary classification of malignant vs. benign, after some backend additional SQL instructions. These features were computed from a digitized image of a fine needle aspirate (FNA) of a cell nuclei in each image. While there are various publicly available datasets for breast cancer screening, including several PLCO datasets available through https://cdas.cancer.gov/datasets/plco/19/, for my specific purposes of doing data analytics and training a ML model for achieving high scores for binary classification, the UCI dataset was the best one due to the sizable number of features and data points included as well as ease of use.

Once the data was downloaded from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic), the master file "data.csv" containing all datapoints of interest was saved locally.

In the subsequent few sections, data is extracted and organized effectively using the Pandas DataFrame and the library's diverse functionalities for data wrangling.

#raw_data_path = os.path.join('uci_breast_cancer', 'csv','data.csv' )
df = pd.read_csv("data.csv")
df.head(10)

features.describe().transpose()

The pandas describe() function helps us get a whole side picture of several key aspects of our dataset. One of the key aspects we can observe is that many of the features are not standardized/normalized - for example, compactness and area are on different scales. This should ring some bells since ML models dealing with differences in scale across input variables have a very difficult time modelling the dataset patterns, and can led to poor performance or a high generalization error. We will due to this prior to implementing our supervised classifiers.

As you can see, this dataset has a total of 569 data points, 30 features of interest for our purposes of binary classification, and a target column with 2 labels, M=malignant and B=benign. All data entries all filled and complete, and the extra columns of id and "Unnamed" are removed from the dataframe. The data is then further split into 2 dataframes, one containing the features(independent variables) and the other containing the target column(dependent variable).

# features contains the independent variables
# Y_diagnosis variable contains the dependent variable
df.info()
df = df.drop(['Unnamed: 32', 'id'], axis=1)
df.replace("?", np.nan, inplace=True)
features = df

# Use LabelEncoder() to transform the target into quantitative data - 0 for benign and 1 for malignant
le = LabelEncoder()
Y_diagnosis = pd.Series(le.fit_transform(features['diagnosis']))

# differentiate between the 3 sets of statistics that are computed for each of the 10 features
means = [col for col in features.columns if col.endswith("mean")]
se = [col for col in features.columns if col.endswith("se")]
worst = [col for col in features.columns if col.endswith("worst")]

print("----------------------------------------------------")
print("Number of Training Samples: ", len(df.index))
print("Number of Features: ", len(df.columns))
print("Set of Features: ", df.columns)
print("----------------------------------------------------")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
id                         569 non-null int64
diagnosis                  569 non-null object
radius_mean                569 non-null float64
texture_mean               569 non-null float64
perimeter_mean             569 non-null float64
area_mean                  569 non-null float64
smoothness_mean            569 non-null float64
compactness_mean           569 non-null float64
concavity_mean             569 non-null float64
concave points_mean        569 non-null float64
symmetry_mean              569 non-null float64
fractal_dimension_mean     569 non-null float64
radius_se                  569 non-null float64
texture_se                 569 non-null float64
perimeter_se               569 non-null float64
area_se                    569 non-null float64
smoothness_se              569 non-null float64
compactness_se             569 non-null float64
concavity_se               569 non-null float64
concave points_se          569 non-null float64
symmetry_se                569 non-null float64
fractal_dimension_se       569 non-null float64
radius_worst               569 non-null float64
texture_worst              569 non-null float64
perimeter_worst            569 non-null float64
area_worst                 569 non-null float64
smoothness_worst           569 non-null float64
compactness_worst          569 non-null float64
concavity_worst            569 non-null float64
concave points_worst       569 non-null float64
symmetry_worst             569 non-null float64
fractal_dimension_worst    569 non-null float64
Unnamed: 32                0 non-null float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
----------------------------------------------------
Number of Training Samples:  569
Number of Features:  31
Set of Features:  Index(['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')
----------------------------------------------------

As shown above, we have a total of 569 data points available, with all possible entries being filled with valid(non-null) data. While we do have the option of using a specific outlier detection methodology like IQR or Z-scores to remove outliers that are present due to the natural variability in this medical dataset, I will not utilize it here due to how compact the dataset is currently.

In this dataset, 10 predictor variables(the features) were computed from the digital nuclei image. Furthermore, the mean, standard error(se) and the mean of the top 3 largest values("worst") were also computed and registered for each feature in this data pool.

2. Exploratory Data Analysis¶

One of the critical roles EDA and Visualization plays to analyze this particular dataset is to (1) understand relations between our different feature sets and the target and (2) to select the best subset of features that we can train our classifiers on. The Seaborn and Matplotlib libraries are ideal candidates of choice that I will use to create clear, high level graphs for visualization purposes. More information on several cool features of seaborn can be found here: https://seaborn.pydata.org/examples/index.html. Definitely check it out if you're a graphics or visualization enthusiast!

I first decided to do a heatmap of correlations visualized on a correlation matrix to see whether there were any subtle dependencies or patterns between the 30 features(independent variable) and the dependent variables that were elucidated above. Before implementing a supervised ML classifier, locating subtle correlations and dependency relations in the data is an important part for our further steps of feature engineering and dimensionality reduction.

Independent Variables(in the dataset, the mean, SE, and largest values of the features below were reported, yielding a total= of 30 features) a) radius (mean of distances from center to points on the perimeter)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter^2 / area - 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry

j) fractal dimension ("coastline approximation" - 1)

Dependent Variable:

Diagnosis - Benign vs. malignant

features['diagnosis'].value_counts(normalize=True)

B    0.627417
M    0.372583
Name: diagnosis, dtype: float64

As shown above, it looks like we have a class imbalance among benign vs. malignant. To address this, we could try out a variety of approaches, from resampling the data, using different evaluation metrics like AUC and MCC(correlation coefficient between the observed and predicted binary classifications), or use K-fold cross validation for resampling different portions of the dataset.

#Create a heatmap
correlation_matrix = features.corr()
sns.heatmap(correlation_matrix, vmin=correlation_matrix.values.min(), vmax=1, linewidths=0.3, square=True)
plt.show()

Based off of the heatmap above, it looks like radius, perimeter, and area are highly correlated(as seen by the lighter shades). Also, the concave points_mean, concavity_mean, and compactness_mean are another group of 3 features which are highly correlated. This is what we expect since concavity, radius, perimeter, and area of a nuclei are critical factors that we utilize to distinguish between a malignant vs. benign cell mass. These correlations can be further confirmed by conducting a Pearson's Correlation test and by graphing a joint_plot, as shown below.

This is a prime example of multicollinearity, which can be a problem in an ML model because we would not be able to distinguish the individual effects of the independent variable on the dependent variable. Thus, we can sample a single feature from these groups of features with high correlation to account for the multicollinearity between some features.

Additionally, since we're working with a slightly imbalanced dataset(357 malignant, 212 benign), we need to employ a sufficient sampling methodology to avoid including inherent bias when training a classifier. In particular, a feature selection algorithm must be used to eliminate any unnecessary features.

Based off of a high-level glance at this heatmap, we can gain some initial intuition on how to go about feature engineering/selection. Often times, traditional feature selection algorithms employ statistical tests to choose discriminative features which don't fall into the trap of multicollinearity. For example, one of the most prominent groups of correlated features are radius_mean, area_mean, and perimeter_mean, so we can decide to drop the others and instead choose area_mean since, as we'll see later on, it seems to be very effective at discriminating between our 2 binary classes.

#this plot produces the pairwise correlations among 4 different chosen features
sns.pairplot(features, hue='diagnosis', vars = ["radius_mean", "texture_mean", "smoothness_mean", "concavity_mean"])

<seaborn.axisgrid.PairGrid at 0x1525748d160>

Y_diagnosis = pd.Series(Y_diagnosis)
plt.figure(figsize=(15,10))
#plot the correlations between the independent variables and the target variable
features.drop('diagnosis', axis=1).corrwith(Y_diagnosis).plot(kind='bar', grid=True)
plt.title("Correlation of the Mean Features with the Diagnosis")

Text(0.5, 1.0, 'Correlation of the Mean Features with the Diagnosis')

The features smoothness_se, fractal_dim_mean, fractal_dim_se, and texture_se seems to be the features that are least correlated with our target, which is the diagnosis. All other features are well correlated, especially concave_points_worst, concave_points_mean, radius, and perimeter.

#function plotting the correlation between the 2 features
def plt_feat1_feat2(feat1, feat2):
    # group the data(predictor, taget) variables for our feature of interest
    x_m = features[feat1].groupby(features['diagnosis']).apply(list)[1]
    y_m = features[feat2].groupby(features['diagnosis']).apply(list)[1]
    x_b = features[feat1].groupby(features['diagnosis']).apply(list)[0]
    y_b = features[feat2].groupby(features['diagnosis']).apply(list)[0]
    fig, ax = plt.subplots(figsize=(8,8))

    
    #create subplots showing the correlations between the 2 features with 2 labels
    plt.scatter(x_m, y_m, color='red', label='malignant')
    plt.scatter(x_b, y_b, color='blue',label='benign')
    plt.xlabel(feat1)
    plt.ylabel(feat2)
    plt.title("Correlation between {} vs. {}".format(feat1, feat2))
    plt.legend()
    plt.show()
    
plt_feat1_feat2("perimeter_mean", "radius_worst")
plt_feat1_feat2("area_worst", "radius_worst")
plt_feat1_feat2("texture_mean", "texture_worst")
plt_feat1_feat2("area_mean", "radius_worst")

def violin_plots(lower_bound, upper_bound):

    num_features = len(features.columns)
    input_std = features.drop('diagnosis', axis=1)
    # averaging each input value and concatenate into a new dataframe
    input_std = (input_std - input_std.mean())/(input_std.std())
    input_std = pd.concat([Y_diagnosis, input_std.iloc[:, lower_bound:upper_bound]], axis=1)
    #flatten the input to be used in the violin plot
    input_flatten = pd.melt(input_std, id_vars="diagnosis", var_name="features", value_name='value')
    input_flatten

    plt.figure(figsize=(15,8))
    sns.violinplot(x='features', y='value', hue='diagnosis', data = input_flatten, split=True, inner='quart')
    #plt.xticks(fontsize=15)
    plt.xticks(rotation=45)
    plt.legend(loc=2, prop={'size': 10})

#plot the features which have the mean attribute
violin_plots(0,10)

Observation from the violin plot above: Both the smoothness_mean and the symmetry_mean features seem to have nearly identical distributions! It might be useful to use our correlation plots above to determine whether there is a particular dependency relation between the two, and if there si, we can reduce redundancy and data dimensionality by dropping one of the columns, since most of the variance of the dataset is incorporated into the other features.

#plot the features which have the "SE" attribute
violin_plots(10,20)

#plot the feature values with the "worst" attribute
violin_plots(20, 30)

The violin plots above combine the best aspects of a box and a kernel density plot and depict the median, the interquartile range, and the 1st/3rd quartiles which can be used to identify and extract outliers.

In the 1st violin plot, one observation we can make is that the median value the malignant and benign sections of features like texture_mean, perimeter_mean, and area_mean don't coincide - compare it with the fractal_dimension_mean for example, and you can see that the medians of the malignant and benign subcategories coincide and might be difficult for the classifier to distinguish between the 2 classes. Features with non-coinciding medians and ranges could potentially make this a good feature that a classifier could train on to distinguish between the 2 classes.

Principal Component Analysis and Dimensionality Reduction¶

The next step after EDA is dimensionality reduction! We'll be employing the PCA technique in order to reduce the dimensions of our dataset and highlight any inherent patterns that we can learn from through meaningful graphs. PCA specifically can merge different variables into an assortment of linear combinations that can best encompass the total variability of this dataset. From these linear combinations, we can reduce the complexity of a 30-dim model by extract the linear combination that yields a good chunk (~90%) of the covariance within the data.

Additionally, through our EDA above, from the covariance matrix, we can see that features such as "area mean" and "area worst" have descriptive statistics that are on operating on different scales. Unscaled statistics of our feature set can be detrimental to the performance of our ML classifier, and can lead to amplified variances. Thus, we can use sklearn's StandardScaler() interface to fix this issue.

target = pd.DataFrame(Y_diagnosis)
data = features.drop('diagnosis', axis=1)

#scale the input using StandardScaler()
X_standardized = StandardScaler().fit_transform(data.values)
pca = PCA(svd_solver='full')
X_pca = pca.fit_transform(X_standardized)
#merge the PCA-extracted features into a new dataframe
pca_df = pd.DataFrame(X_pca)
pca_df = pca_df.merge(target, left_index=True, right_index=True, how='left')
pca_df.head()

X_pca = PCA().fit(X_standardized)
#plot the cumulative variance ratio that is encoded in the PCA extracted features
plt.plot(np.cumsum(X_pca.explained_variance_ratio_))
plt.xlabel("#Components")
plt.ylabel("Cumulative Explained Variance Ratio")
pca = PCA(n_components=6)
pca.fit(X_standardized)
transformed_df = pd.DataFrame(pca.transform(X_standardized), columns = ['PC1', 'PC2', 'PC3', 
                                       'PC4', 'PC5', 'PC6'])
transformed_df.head()

One of the foremost powers of PCA is to be able to estimate how many components are needed to describe the data and encompass as much of the variance in the dataset as possible. This curve helps quantify how much of the total 30-dimensional variance is contained within the first N components. As shown above, we can see that with respect to features, the first 6 features contain about 90% of the variance, while you need about 20 components to desribe close to 100% of the variance. Thus, we would need about 6 components in order to retain 90% of the variance.

# Binary Classifier
pca = PCA(n_components=2) 
# fit PCA with 2 components
pca = pca.fit_transform(X_standardized)
plt.figure(figsize = (9,5))
# plot first 2 PCA components
plt.scatter(pca[:,0],pca[:,1], c = Y_diagnosis, cmap = "RdBu_r", edgecolor = "Red", alpha=0.35)
plt.colorbar()
plt.title('PCA Scatter Plot')

Text(0.5, 1.0, 'PCA Scatter Plot')

Above, I created a scatter plot of observations of PCA component 1 and 2. Though, it looks like there is not a definite, clear cut or discriminative boundary between the 2 features(there is a bit of overlap). This could be attributed to PCA component 1 not explaining as much of the variance in the original data as we want it to.

from scipy import stats
# plot histograms showing the distribution of the each feature in the dataset
data.hist(figsize=(25,20), color='r')
plt.show()

As shown above, a good chunk of the distributions are skewed to the left, since they rise very sharply in the beginning for values near 0 and have a long right trail. This shows how scaling/normalization is essential, especially if we implement regressor algorithms like kNN or linear regression, in order to avoid overfitting and optimize model performance and generalizability. We could employ 2 types of normalization techniques, Log Scaling and the MinMaxScaler. Log scaling is particularly useful if you have a distribution skewed as above, where a handful of values have many points, while most other values at the tail end have fewer points. This type of distribution is typically referred to as a power law distribution. On the other hand, the MinMaxScaler usually scales the data to a fixed range, with a drawback being that we will end up with smaller deviations and it will overall suppress the effect of outliers since this type of scaler is sensitive to outliers.

Machine Learning and Statistical Analysis¶

#split the dataset into a training and holdout test set - 75% for training and 25% for validation/testing
X_train, X_test, y_train, y_test = train_test_split(data, Y_diagnosis, test_size=0.25, random_state=42)

# function for printing evaluation metrics given a grid with optimized paramters along with a confusion matrix
def print_evaluation_metrics(grid, cf_matrix):
    #returns a classification report comparing model performance with test data
    print(classification_report(y_test, y_pred))
    #print the best parameters from GridSearch and the best score the model attains using those tuned hyperparameters
    print("Tuned Hyperparameters: {}".format(grid.best_params_))
    print("Accuracy Rate: {}".format(grid.score(X_test, y_test)))
    print("Optimal Score: {}".format(grid.best_score_))
    group_names = ['True Neg','False Pos','False Neg','True Pos']
    #label the confusion matrix and standardize the TP/TN/FP/FN values
    group_counts = ['{0:0.0f}'.format(value) for value in
                    cf_matrix.flatten()]
    group_percentages = ['{0:.2%}'.format(value) for value in  
                         cf_matrix.flatten()/np.sum(cf_matrix)]
    labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in
              zip(group_names,group_counts,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    #Produce a heatmap to clearly visualize model evaluation metrics
    sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')

Random Forest and Recursive Feature Elimination¶

RFE is a wrapper-style selection algorithm that employs filter-based feature selection internally. It works by searching for a subset of features. In conjunction with the random forest classifier, it assigns weights to each features and recursively prunes out the subset of features with the smallest absolute weights until we achieve our optimal feature subset. Here, we will choose the top 10 features computed by RFE.

from sklearn.feature_selection import RFECV
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
#from sklearn.ensemble import RandomForestClassifier, DecisionTreeClassifier, GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn import tree
from sklearn.tree import export_graphviz

rf_model = RandomForestClassifier()

#perform an exhaustive search through these input parameters using GridSearch 
parameters = {'n_estimators': [4, 6, 9, 10, 15], 
              'max_features': ['log2', 'sqrt','auto'], 
              'criterion': ['entropy', 'gini'],
              'max_depth': [2, 3, 5, 10], 
              'min_samples_split': [2, 3, 5],
              'min_samples_leaf': [1, 5, 8]
             }

#fit and score the estimator on the training set with the tuned parameters
param_tuning = GridSearchCV(rf_model, parameters)
param_tuning = param_tuning.fit(X_train, y_train)
print("Best Parameters for Random Forest Classifier: {}".format(param_tuning.best_params_))

rf_model = param_tuning.best_estimator_
rf_model.fit(X_train, y_train)

cf_matrix = confusion_matrix(y_test, y_pred)
print_evaluation_metrics(grid_tree, cf_matrix)

#utilize the embedded methods of random forest to identify the most important features
# While training a tree-based classifier, it progressively identifies important features by averaging the impurity reduction caused by an individual feature
plt.figure(figsize=(15,12))
sorted_ids = rf_model.feature_importances_.argsort()
plt.barh(data.columns[sorted_ids], rf_model.feature_importances_[sorted_ids])
plt.xlabel("Random Forest Feature Importance Ranking")
plt.show()

#run the Recursive Feature Elimination Algorithm which identifies the top k features and exclusively trains on that subset
rfe = RFE(estimator = model, n_features_to_select=10)
rfe = rfe.fit(X_train, y_train)
#print(X_train.shape)
for i in range(X_train.shape[1]):
    if rfe.support_[i]:
        print('Column: %s, Selected %s, Rank: %.3f' % (X_train.columns[i], rfe.support_[i], rfe.ranking_[i]))


#feature_names = list(X_train.columns[rfe.support_])
#out = export_graphviz(dt, feature_names=feature_names, out_file=file, class_names=['Benign', 'Malignant'], filled=True)

#fig, ax = plt.subplots(figsize=(30,20))
#tree.plot_tree(dt)

C:\Users\ChrisPC\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.
  warnings.warn(CV_WARNING, FutureWarning)
C:\Users\ChrisPC\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py:841: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)

Best Parameters for Random Forest Classifier: {'criterion': 'gini', 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 3, 'n_estimators': 9}
              precision    recall  f1-score   support

           0       0.98      0.98      0.98        89
           1       0.96      0.96      0.96        54

   micro avg       0.97      0.97      0.97       143
   macro avg       0.97      0.97      0.97       143
weighted avg       0.97      0.97      0.97       143

Tuned Hyperparameters: {'criterion': 'gini', 'max_depth': 6, 'min_samples_leaf': 2}
Accuracy Rate: 0.958041958041958
Optimal Score: 0.9460093896713615

Column: texture_mean, Selected True, Rank: 1.000
Column: concave points_mean, Selected True, Rank: 1.000
Column: radius_se, Selected True, Rank: 1.000
Column: area_se, Selected True, Rank: 1.000
Column: radius_worst, Selected True, Rank: 1.000
Column: texture_worst, Selected True, Rank: 1.000
Column: perimeter_worst, Selected True, Rank: 1.000
Column: area_worst, Selected True, Rank: 1.000
Column: smoothness_worst, Selected True, Rank: 1.000
Column: concave points_worst, Selected True, Rank: 1.000

As shown above, we have selected the top 10 optimal features. The Accuracy overall was about ~95%.

print('Best Features', list(X_train.columns[rfe.support_]))

Best Features ['texture_mean', 'concave points_mean', 'perimeter_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'concavity_worst', 'concave points_worst']

y_pred = rfe.predict(X_test)

print("Training Accuracy: ", rfe.score(X_train, y_train))
print("Testing Accuracy: ", rfe.score(X_test, y_test))

Training Accuracy:  0.9976525821596244
Testing Accuracy:  0.965034965034965

Decision Tree Classifier¶

# run Sklearn's DecisionTreeClassifier wrapper method
dt = DecisionTreeClassifier(random_state=40)
param_dict = {"criterion":['gini', 'entropy'],
             "max_depth":range(1,10),
             "min_samples_leaf":range(1,5)}
# run GridSearchCV to tune the criterion, max_depth, and min_samples_leaf hyperparameters
grid_tree = GridSearchCV(dt, param_grid=param_dict, cv=10, verbose=1, n_jobs=-1)
# evaluate the best combination of parameter values on the training set
grid_tree.fit(X_train, y_train)
# generate and print the evaluation metrics on the test data for this model
cf_matrix = confusion_matrix(y_test, y_pred)
print_evaluation_metrics(grid_tree, cf_matrix)

Fitting 10 folds for each of 72 candidates, totalling 720 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   11.2s
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:   12.7s finished
C:\Users\ChrisPC\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py:841: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)

              precision    recall  f1-score   support

           0       0.98      0.98      0.98        89
           1       0.96      0.96      0.96        54

   micro avg       0.97      0.97      0.97       143
   macro avg       0.97      0.97      0.97       143
weighted avg       0.97      0.97      0.97       143

Tuned Hyperparameters: {'criterion': 'gini', 'max_depth': 6, 'min_samples_leaf': 2}
Accuracy Rate: 0.958041958041958
Optimal Score: 0.9460093896713615

Support Vector Classifier(SVC)¶

from sklearn.svm import SVC
from sklearn.metrics import classification_report
#run a transformation pipeline with our SVC wrapper function
pipeline = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
parameters = {'svm__C':[1,10,100],
             'svm__gamma': [0.1, 0.01, 0.001],
             'svm__kernel': ['linear', 'rbf', 'poly']}
#run GridSearch with cross validation to find best combination of parameters yielding optimal classifier performance
grid_svc = GridSearchCV(pipeline, parameters, cv=4, refit=True, verbose=2)
#fit this optimized supervised learner on training data and evaluate on the holdout set
grid_svc.fit(X_train, y_train)
y_pred = grid_svc.predict(X_test)
#generate and print evaluation metrics for the model
cf_matrix = confusion_matrix(y_test, y_pred)
print_evaluation_metrics(grid_svc, cf_matrix)

from sklearn import svm
X = data.values[:, :2]
y = Y_diagnosis.values

h = 0.02 #step size in the mesh
C = 1.0  # SVM regularization parameter
svc = SVC(kernel='linear', C=C).fit(X, y)
rbf_svc = SVC(kernel='rbf', gamma=0.7, C=C).fit(X, y)
poly_svc = SVC(kernel='poly', degree=3, C=C).fit(X, y)
lin_svc = svm.LinearSVC(C=C).fit(X, y)

# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# title for the plots
titles = ['SVC with linear kernel',
          'LinearSVC (linear kernel)',
          'SVC with RBF kernel',
          'SVC with polynomial (degree 3) kernel']

plt.figure(figsize=(15,12))
for i, clf in enumerate((svc, lin_svc, rbf_svc, poly_svc)):
    #Plot the decision boundary. For that, we will assign a color to each
    #point in the mesh [x_min, x_max]x[y_min, y_max].
    plt.subplot(2, 2, i + 1)
    plt.subplots_adjust(wspace=0.4, hspace=0.4)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # ut the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)

    #Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
    plt.xlabel('radius_mean', size=15)
    plt.ylabel('texture_mean', size=15)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])

plt.show()

C:\Users\ChrisPC\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ChrisPC\Anaconda3\lib\site-packages\sklearn\svm\base.py:931: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)

Above, I have plotted the hyperplanes from the SVC() algorithm with different kernels as a point of comparison. Through GridSearch algorithm, we have found out how the algorithm performs optimally with the linear kernel. Looking at the graph, we can see how the linear hyperplane, which maximizing the margin between the support vectors of the 2 different classes, is best able to discriminate between the 2 classes through its decision boundary.

KFold CrossValidation and Comparison of Classifier Performance¶

from sklearn import model_selection
# run the 3 classifiers through the optimal subset of features extracted by PCA as a point of comparison
svc_pca = grid_svc.fit(transformed_df, Y_diagnosis)
tree_pca = grid_tree.fit(transformed_df, Y_diagnosis)
rf_pca = rf_model.fit(transformed_df, Y_diagnosis)
#compare SVC, Decision Tree, and Random Forest Models with all the features vs top 6 PCA-extracted features
models_pca = [('svc', svc_pca), ('decision tree',tree_pca), ('random forest', rf_pca)]
models = [('svc', grid_svc), ('decision tree', grid_tree), ('random forest', rf_model)]
eval_metrics = []
names = []
for (name, model) in models:
    #use the split method of scikit's cross-validation generators to yield the train/test indices for each iteration of KFold cross validation
    kfold = model_selection.KFold(n_splits=15, random_state=6)
    #train the estimator using the training set and compute the scores based on the testing set on each iteration
    cv_res = model_selection.cross_val_score(model, data, Y_diagnosis, cv=kfold, scoring='accuracy')
    eval_metrics.append(cv_res)
    names.append(name)

eval_metrics_pca = []
names_pca = []
for (name, model) in models_pca:
    kfold = model_selection.KFold(n_splits=15, random_state=6)
    cv_res = model_selection.cross_val_score(model, data, Y_diagnosis, cv=kfold, scoring='accuracy')
    #compile the model evaluation results into a list
    eval_metrics_pca.append(cv_res)
    names_pca.append(name)

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
# plot boxplots showing the distribution in accuracies across the different iterations of KFold cross validation for each classifier
plt.title("Supervised Classifier Predictive Performance Comparison")
ax.set_xticklabels(['SVC', "Decision Tree", "Random Forest"])
plt.boxplot(eval_metrics)
plt.show()

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
#plot boxplots comparing the accuracies for each classifier trained on the PCA-reduced feature set
plt.title("Supervised Classifier Predictive Performance Comparison with PCA-Reduced Features")
ax.set_xticklabels(['SVC', "Decision Tree", "Random Forest"])
plt.boxplot(eval_metrics_pca)
plt.xlabel("SVC, Decision Tree, and Random Forest")
plt.ylabel("Accuracy")
plt.show()

Conclusions and Future Directions¶

In conclusion, our original purpose was to implement a robust classifier to conduct a 2-way classification between the breast cancer classes of malignant vs. benign. First, after some initial data cleaning, data analysis and EDA were a useful starting point in order to develop some fundamental intuition on patterns and correlative relationships among the independent variables and the independent variables vs. the target, and also to gain insight onto the distribution of the dataset and necessary preprocessing that must be employed before implementing a supervised classifier. By utilizing the insight gained from EDA, we were able to utilize the PCA dimensionality reduction technique to extract the top 10 relevant features that encode >90% of the dataset variance.

Moving towards the machine learning side of the project, we were able to successfully train 3 different supervised classifiers - Random Forest, Support Vector Classifier, and Decision Tree - tune its hyperparameters, evaluate performance metrics, and run the RFE algorithm on a fixed number of features to maximize the averaged accuracies. Overall, the SVC model performed optimally and reached an average classification accuracy of 97%, which was validated using multiple iterations of KFold Cross Validation. Training and testing the classifiers on the top 10 PCA components did not alter performance as much, except for the Random Forest Classifier. Though, it is important to keep in mind that our classifier could be subject to the problem of overfitting, especially due to the relatively constrained size of the dataset and number of features that we are working with. Increasing the training set sample size or trying different ensemble methods, such as bagging and boosting, can be useful to complete a robust, comparative analysis among even more supervised classifiers.

One possible way we could extend this research direction is by transitioning from the 2-class binary benign/malignant classification to a multi-classification model which, when given the same series of features from digital histopathological datasets, can discriminate between different subordinate classes of breast cancer(Ductal carcinoma, Firboadenoma, Lobular carcinoma, ect. Additionally, we could use novel deep learning techniques such as convolutional neural networks(CNN) or ensemble frameworks to represent and learn feature maps directly from the images, which is a considerably more challenging task but could open new doors to integrating such a robust, generalizable AI system in the clinical hemisphere for real-time diagnostic results. As the Google AI team has shown with their clinically-tested models for diseases like diabetic retinopathy and breast cancer, Data Science together with Machine Learning/Deep Learning are critical tools that healthcare increasingly needs for efficient, real-time data processing as well as gain clinically-relevant diagnostic information to guide fundamental decisions. If you are interested in learning more about this interesting area of AI+Healthcare, I encourage you to check out the Nature Journal - https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwig7KrWltHwAhWsFlkFHT-RDRwQFjAXegQIIxAD&url=https%3A%2F%2Fwww.nature.com%2Farticles%2Fs41591-018-0316-z&usg=AOvVaw1sv9RHIUuzFaBR1JqTwvEZ - which thoroughly elucidates the most recent advances in deep learning in Healthcare and Radiology. I also recommend you check out the AI for Medicine Specialization - https://www.deeplearning.ai/program/ai-for-medicine-specialization/ - to receive further guidance and training to learn novel Data Analytics and Deep Learning skills that todays healthcare sphere urgently needs. Thanks for visiting this page!

Resources¶

Please feel free to peruse through these helpful resources for understanding this dataset, the real world applicability of ML-driven breast classifiers in healthcare, and the latest research involving breast cancer datasets such as this one!

Statistical Analysis

HeatMaps - https://seaborn.pydata.org/generated/seaborn.heatmap.html

Correlation Matrices - https://towardsdatascience.com/better-heatmaps-and-correlation-matrix-plots-in-python-41445d0f2bec

Principal Component Analysis - https://builtin.com/data-science/step-step-explanation-principal-component-analysis

Recursive Feature Elimination - https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html

Random Forest Classifier - https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

DataSet Information - https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29

Latest Research in the Field!

https://arxiv.org/abs/2011.08001

https://arxiv.org/abs/2104.12021

https://healthitanalytics.com/news/deep-learning-may-detect-breast-cancer-earlier-than-radiologists

https://ieeexplore.ieee.org/document/8644835

	id	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean	smoothness_mean	compactness_mean	concavity_mean	concave points_mean	...	texture_worst	perimeter_worst	area_worst	smoothness_worst	compactness_worst	concavity_worst	concave points_worst	symmetry_worst	fractal_dimension_worst	Unnamed: 32
0	842302	M	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.30010	0.14710	...	17.33	184.60	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890	NaN
1	842517	M	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.08690	0.07017	...	23.41	158.80	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902	NaN
2	84300903	M	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.19740	0.12790	...	25.53	152.50	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758	NaN
3	84348301	M	11.42	20.38	77.58	386.1	0.14250	0.28390	0.24140	0.10520	...	26.50	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.17300	NaN
4	84358402	M	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.19800	0.10430	...	16.67	152.20	1575.0	0.1374	0.2050	0.4000	0.1625	0.2364	0.07678	NaN
5	843786	M	12.45	15.70	82.57	477.1	0.12780	0.17000	0.15780	0.08089	...	23.75	103.40	741.6	0.1791	0.5249	0.5355	0.1741	0.3985	0.12440	NaN
6	844359	M	18.25	19.98	119.60	1040.0	0.09463	0.10900	0.11270	0.07400	...	27.66	153.20	1606.0	0.1442	0.2576	0.3784	0.1932	0.3063	0.08368	NaN
7	84458202	M	13.71	20.83	90.20	577.9	0.11890	0.16450	0.09366	0.05985	...	28.14	110.60	897.0	0.1654	0.3682	0.2678	0.1556	0.3196	0.11510	NaN
8	844981	M	13.00	21.82	87.50	519.8	0.12730	0.19320	0.18590	0.09353	...	30.73	106.20	739.3	0.1703	0.5401	0.5390	0.2060	0.4378	0.10720	NaN
9	84501001	M	12.46	24.04	83.97	475.9	0.11860	0.23960	0.22730	0.08543	...	40.68	97.65	711.4	0.1853	1.0580	1.1050	0.2210	0.4366	0.20750	NaN

	count	mean	std	min	25%	50%	75%	max
radius_mean	569.0	14.127292	3.524049	6.981000	11.700000	13.370000	15.780000	28.11000
texture_mean	569.0	19.289649	4.301036	9.710000	16.170000	18.840000	21.800000	39.28000
perimeter_mean	569.0	91.969033	24.298981	43.790000	75.170000	86.240000	104.100000	188.50000
area_mean	569.0	654.889104	351.914129	143.500000	420.300000	551.100000	782.700000	2501.00000
smoothness_mean	569.0	0.096360	0.014064	0.052630	0.086370	0.095870	0.105300	0.16340
compactness_mean	569.0	0.104341	0.052813	0.019380	0.064920	0.092630	0.130400	0.34540
concavity_mean	569.0	0.088799	0.079720	0.000000	0.029560	0.061540	0.130700	0.42680
concave points_mean	569.0	0.048919	0.038803	0.000000	0.020310	0.033500	0.074000	0.20120
symmetry_mean	569.0	0.181162	0.027414	0.106000	0.161900	0.179200	0.195700	0.30400
fractal_dimension_mean	569.0	0.062798	0.007060	0.049960	0.057700	0.061540	0.066120	0.09744
radius_se	569.0	0.405172	0.277313	0.111500	0.232400	0.324200	0.478900	2.87300
texture_se	569.0	1.216853	0.551648	0.360200	0.833900	1.108000	1.474000	4.88500
perimeter_se	569.0	2.866059	2.021855	0.757000	1.606000	2.287000	3.357000	21.98000
area_se	569.0	40.337079	45.491006	6.802000	17.850000	24.530000	45.190000	542.20000
smoothness_se	569.0	0.007041	0.003003	0.001713	0.005169	0.006380	0.008146	0.03113
compactness_se	569.0	0.025478	0.017908	0.002252	0.013080	0.020450	0.032450	0.13540
concavity_se	569.0	0.031894	0.030186	0.000000	0.015090	0.025890	0.042050	0.39600
concave points_se	569.0	0.011796	0.006170	0.000000	0.007638	0.010930	0.014710	0.05279
symmetry_se	569.0	0.020542	0.008266	0.007882	0.015160	0.018730	0.023480	0.07895
fractal_dimension_se	569.0	0.003795	0.002646	0.000895	0.002248	0.003187	0.004558	0.02984
radius_worst	569.0	16.269190	4.833242	7.930000	13.010000	14.970000	18.790000	36.04000
texture_worst	569.0	25.677223	6.146258	12.020000	21.080000	25.410000	29.720000	49.54000
perimeter_worst	569.0	107.261213	33.602542	50.410000	84.110000	97.660000	125.400000	251.20000
area_worst	569.0	880.583128	569.356993	185.200000	515.300000	686.500000	1084.000000	4254.00000
smoothness_worst	569.0	0.132369	0.022832	0.071170	0.116600	0.131300	0.146000	0.22260
compactness_worst	569.0	0.254265	0.157336	0.027290	0.147200	0.211900	0.339100	1.05800
concavity_worst	569.0	0.272188	0.208624	0.000000	0.114500	0.226700	0.382900	1.25200
concave points_worst	569.0	0.114606	0.065732	0.000000	0.064930	0.099930	0.161400	0.29100
symmetry_worst	569.0	0.290076	0.061867	0.156500	0.250400	0.282200	0.317900	0.66380
fractal_dimension_worst	569.0	0.083946	0.018061	0.055040	0.071460	0.080040	0.092080	0.20750

	0_x	1	2	3	4	5	6	7	8	9	...	21	22	23	24	25	26	27	28	29	0_y
0	9.192837	1.948583	-1.123166	3.633731	-1.195110	1.411424	2.159370	-0.398407	-0.157118	-0.877402	...	0.068850	0.084519	-0.175256	-0.151020	-0.201503	-0.252585	-0.033914	0.045648	-0.047169	1
1	2.387802	-3.768172	-0.529293	1.118264	0.621775	0.028656	0.013358	0.240988	-0.711905	1.106995	...	-0.094578	-0.217718	0.011290	-0.170510	-0.041129	0.181270	0.032624	-0.005687	-0.001868	1
2	5.733896	-1.075174	-0.551748	0.912083	-0.177086	0.541452	-0.668166	0.097374	0.024066	0.454275	...	-0.060309	-0.074291	0.102762	0.171158	0.004735	0.049569	0.047026	0.003146	0.000751	1
3	7.122953	10.275589	-3.232790	0.152547	-2.960878	3.053422	1.429911	1.059565	-1.405440	-1.116975	...	-0.203266	-0.124105	0.153430	0.077496	-0.275225	0.183462	0.042484	-0.069295	-0.019937	1
4	3.935302	-1.948072	1.389767	2.940639	0.546747	-1.226495	-0.936213	0.636376	-0.263805	0.377704	...	-0.017650	0.139454	-0.005332	0.003062	0.039254	0.032168	-0.034786	0.005038	0.021214	1

	PC1	PC2	PC3	PC4	PC5	PC6
0	9.192837	1.948583	-1.123166	3.633731	-1.195110	1.411424
1	2.387802	-3.768172	-0.529293	1.118264	0.621775	0.028656
2	5.733896	-1.075174	-0.551748	0.912083	-0.177086	0.541452
3	7.122953	10.275589	-3.232790	0.152547	-2.960878	3.053422
4	3.935302	-1.948072	1.389767	2.940639	0.546747	-1.226495