Tutorial by Chris Jose
Interested in data science? In integrating ML and data science into healthcare? You are in the right place. Welcome to my Data Science Pipeline! Here, you will gain a meticulous, step-by-step introduction to the basic ingredients of a data science project, from data collection, preprocessing/feature engineering, and exploratory analysis to supervised learning, which will set you on the path to becoming a Data Science practitioner in no time!
Specifically, today we're aiming to create a robust classifier for predicting 2 different classes of breast cancer malignancies. Breast cancer is the 2nd most common cancer among women in the United States, with over 3.8 million people having a history of breast cancer in 2021 and an estimated 281,550 new cases of invasive breast cancer in the first half of 2021 alone, according to the CDC. Given the steady increase in cases, the excessive burden the disease places on fragile healthcare ecosystems (especially in developing countries), the difficulties in early diagnosis and prevention, and the ever-shorter supply of radiologists available to hospitals receiving exponentially growing amounts of data, machine learning, data science, and deep learning methodologies are some of the most effective analytical tools for meeting these challenges. Through this tutorial, I will give a detailed, step-by-step introduction to the DataOps pipeline and hopefully ease your introduction into the exciting world of AI + healthcare.
The dataset I will be using has been extracted from the UCI Machine Learning Repository, found at the following link: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29. In this CSV file, the important attributes Patient ID Number and Diagnosis (M for malignant and B for benign) are listed first. Then, for each sample, 10 real-valued features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. These 10 features describe fundamental characteristics of the cell nuclei present in the histopathological images. For each feature, three descriptive statistics are reported: the mean, the standard error, and the "worst" value (the mean of the 3 largest values in that column). The 10 features are as follows:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
The distribution of the dataset is approximately 212 malignant and 357 benign samples. The principal research question we want to ask is whether we can accurately predict, given diagnostic information extracted from a digitized histopathological image, whether the sample is benign or malignant, and provide some insight into the most important biomarkers that correlate with different variations of breast cancer, using the powerful data science, data munging, and machine learning hypothesis-testing toolkits at our disposal. I also hope to shed some light on how these ML predictive models can best generalize to new sets of multi-modal data. Without further ado, let's get started!
#import necessary packages
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf
import seaborn as sns
import sklearn as sk
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, learning_curve, train_test_split
from sklearn.metrics import precision_score, recall_score, confusion_matrix, roc_curve, precision_recall_curve, accuracy_score, classification_report
Data was cleaned and extracted from the UCI Machine Learning Repository's Breast Cancer Diagnostic dataset, which is nicely formatted into a CSV consisting of IDs, 30 features, and the binary classification of malignant vs. benign, after some additional backend SQL preprocessing. These features were computed from a digitized image of a fine needle aspirate (FNA) and describe the cell nuclei present in each image. While there are various publicly available datasets for breast cancer screening, including several PLCO datasets available through https://cdas.cancer.gov/datasets/plco/19/, for my specific purposes of doing data analytics and training an ML model to achieve high scores on binary classification, the UCI dataset was the best one due to the sizable number of features and data points it includes as well as its ease of use.
Once the data was downloaded from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic), the master file "data.csv" containing all datapoints of interest was saved locally.
In the subsequent few sections, data is extracted and organized effectively using the Pandas DataFrame and the library's diverse functionalities for data wrangling.
#raw_data_path = os.path.join('uci_breast_cancer', 'csv','data.csv' )
df = pd.read_csv("data.csv")
df.head(10)
df.describe().transpose()
The pandas describe() function gives us a broad picture of several key aspects of our dataset. One key observation is that many of the features are not standardized/normalized - for example, compactness and area are on very different scales. This should ring some bells, since ML models dealing with differences in scale across input variables have a very difficult time modelling the dataset's patterns, which can lead to poor performance or a high generalization error. We will address this prior to implementing our supervised classifiers.
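To make the scale mismatch concrete, here is a minimal check (an illustrative snippet I added, assuming the standard UCI column names) that prints the ranges of two features: area_mean spans values in the hundreds to thousands, while smoothness_mean stays well below 1.
#quick sanity check of feature scales (illustrative snippet only)
print(df[['area_mean', 'smoothness_mean']].agg(['min', 'max']))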
As you can see, this dataset has a total of 569 data points, 30 features of interest for our purposes of binary classification, and a target column with 2 labels, M = malignant and B = benign. All data entries are filled and complete, and the extra columns "id" and "Unnamed: 32" are removed from the dataframe. The data is then further split into 2 pieces, one containing the features (independent variables) and the other containing the target column (dependent variable).
# features contains the independent variables
# Y_diagnosis variable contains the dependent variable
df.info()
df = df.drop(['Unnamed: 32', 'id'], axis=1)
df.replace("?", np.nan, inplace=True)
features = df
# Use LabelEncoder() to transform the target into quantitative data - 0 for benign and 1 for malignant
le = LabelEncoder()
Y_diagnosis = pd.Series(le.fit_transform(features['diagnosis']), name='diagnosis')
# differentiate between the 3 sets of statistics that are computed for each of the 10 features
means = [col for col in features.columns if col.endswith("mean")]
se = [col for col in features.columns if col.endswith("se")]
worst = [col for col in features.columns if col.endswith("worst")]
print("----------------------------------------------------")
print("Number of Training Samples: ", len(df.index))
print("Number of Features: ", len(df.columns))
print("Set of Features: ", df.columns)
print("----------------------------------------------------")
As shown above, we have a total of 569 data points available, with all entries filled with valid (non-null) data. While we do have the option of using an outlier detection methodology like the IQR rule or Z-scores to remove outliers that arise from the natural variability in this medical dataset, I will not do so here given how compact the dataset already is.
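If you did want to flag potential outliers, a minimal IQR-based sketch (purely illustrative; it is not applied to the data downstream) could look like this:
#illustrative IQR outlier check (not applied downstream): count values outside 1.5*IQR per feature
numeric = df.drop('diagnosis', axis=1)
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
outlier_mask = (numeric < (q1 - 1.5 * iqr)) | (numeric > (q3 + 1.5 * iqr))
print(outlier_mask.sum().sort_values(ascending=False).head())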
In this dataset, 10 predictor variables (the features) were computed from the digitized nuclei image. Furthermore, the mean, standard error (se), and the mean of the top 3 largest values ("worst") were computed and recorded for each feature in this data pool.
One of the critical roles EDA and visualization play in analyzing this particular dataset is to (1) understand relations between our different feature sets and the target and (2) select the best subset of features to train our classifiers on. The Seaborn and Matplotlib libraries are the tools of choice I will use to create clear, high-level graphs for visualization purposes. More information on several cool features of seaborn can be found here: https://seaborn.pydata.org/examples/index.html. Definitely check it out if you're a graphics or visualization enthusiast!
I first decided to visualize a heatmap of the correlation matrix to see whether there were any subtle dependencies or patterns among the 30 features (independent variables) elucidated above. Before implementing a supervised ML classifier, locating subtle correlations and dependency relations in the data is an important part of our further steps of feature engineering and dimensionality reduction.
Independent variables: the mean, SE, and worst values of the 10 nuclei features listed earlier (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension), yielding a total of 30 features.
Dependent variable: the diagnosis (M = malignant, B = benign).
features['diagnosis'].value_counts(normalize=True)
As shown above, it looks like we have a mild class imbalance between benign and malignant samples. To address this, we could try a variety of approaches, from resampling the data, to using different evaluation metrics like AUC and MCC (the correlation coefficient between the observed and predicted binary classifications), to using K-fold cross-validation to resample different portions of the dataset.
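As one illustration of the resampling option (a sketch only; I do not use it in the pipeline below), the minority malignant class could be upsampled with sklearn.utils.resample:
from sklearn.utils import resample
#illustrative upsampling sketch (not used in the rest of the pipeline)
benign = features[features['diagnosis'] == 'B']
malignant = features[features['diagnosis'] == 'M']
#resample the malignant rows with replacement until both classes have the same count
malignant_upsampled = resample(malignant, replace=True, n_samples=len(benign), random_state=42)
balanced = pd.concat([benign, malignant_upsampled])
print(balanced['diagnosis'].value_counts())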
#Create a heatmap
correlation_matrix = features.drop('diagnosis', axis=1).corr()
sns.heatmap(correlation_matrix, vmin=correlation_matrix.values.min(), vmax=1, linewidths=0.3, square=True)
plt.show()
Based on the heatmap above, it looks like radius, perimeter, and area are highly correlated (as seen by the lighter shades). Also, concave points_mean, concavity_mean, and compactness_mean are another group of 3 features which are highly correlated. This is what we expect, since the concavity, radius, perimeter, and area of a nucleus are critical factors used to distinguish between a malignant and a benign cell mass. These correlations can be further confirmed by conducting a Pearson's correlation test and by graphing a joint plot, as shown below.
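Here is a minimal sketch of that confirmation (my own illustrative addition, choosing radius_mean vs. perimeter_mean as the pair), using scipy's pearsonr and seaborn's jointplot:
from scipy.stats import pearsonr
#Pearson's correlation test for one highly correlated pair (illustrative choice of pair)
r, p_value = pearsonr(features['radius_mean'], features['perimeter_mean'])
print("Pearson r = {:.3f}, p-value = {:.3g}".format(r, p_value))
#joint plot visualizing the same relationship
sns.jointplot(data=features, x='radius_mean', y='perimeter_mean', kind='reg')
plt.show()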
This is a prime example of multicollinearity, which can be a problem in an ML model because we would not be able to distinguish the individual effects of the independent variables on the dependent variable. Thus, we can keep a single feature from each group of highly correlated features to account for the multicollinearity.
Additionally, since we're working with a slightly imbalanced dataset (357 benign, 212 malignant), we need to employ a sensible sampling methodology to avoid introducing bias when training a classifier. In particular, a feature selection algorithm should be used to eliminate any unnecessary features.
Based on a high-level glance at this heatmap, we can gain some initial intuition on how to go about feature engineering/selection. Oftentimes, traditional feature selection algorithms employ statistical tests to choose discriminative features that do not fall into the trap of multicollinearity. For example, one of the most prominent groups of correlated features is radius_mean, area_mean, and perimeter_mean, so we could decide to drop two of them and keep only area_mean, since, as we'll see later on, it seems to be very effective at discriminating between our 2 binary classes.
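A simple sketch of this idea (illustrative only; RFE, later in the tutorial, does the actual selection) is to list every feature pair whose absolute correlation exceeds some threshold, say 0.9, and consider dropping one member of each pair:
#illustrative multicollinearity screen (a sketch; RFE is used later for the actual selection)
corr = features.drop('diagnosis', axis=1).corr().abs()
#keep only the upper triangle so each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_corr_pairs = upper.stack().sort_values(ascending=False)
print(high_corr_pairs[high_corr_pairs > 0.9])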
#this plot produces pairwise scatter plots among 4 chosen features, colored by diagnosis
sns.pairplot(features, hue='diagnosis', vars = ["radius_mean", "texture_mean", "smoothness_mean", "concavity_mean"])
Y_diagnosis = pd.Series(Y_diagnosis)
plt.figure(figsize=(15,10))
#plot the correlations between the independent variables and the target variable
features.drop('diagnosis', axis=1).corrwith(Y_diagnosis).plot(kind='bar', grid=True)
plt.title("Correlation of the Mean Features with the Diagnosis")
The features smoothness_se, fractal_dimension_mean, fractal_dimension_se, and texture_se seem to be the features least correlated with our target, the diagnosis. All other features are well correlated, especially concave points_worst, concave points_mean, radius, and perimeter.
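To put numbers behind that reading of the bar chart, we can sort the same correlations (a small companion snippet to the plot above):
#print the features least and most correlated with the encoded diagnosis
target_corr = features.drop('diagnosis', axis=1).corrwith(Y_diagnosis).sort_values()
print("Least correlated:\n", target_corr.head())
print("Most correlated:\n", target_corr.tail())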
#function plotting the correlation between the 2 features
def plt_feat1_feat2(feat1, feat2):
    # group the (predictor, target) values for our features of interest by diagnosis label
    x_m = features.loc[features['diagnosis'] == 'M', feat1]
    y_m = features.loc[features['diagnosis'] == 'M', feat2]
    x_b = features.loc[features['diagnosis'] == 'B', feat1]
    y_b = features.loc[features['diagnosis'] == 'B', feat2]
    fig, ax = plt.subplots(figsize=(8,8))
    #create a scatter plot showing the relationship between the 2 features for the 2 labels
    plt.scatter(x_m, y_m, color='red', label='malignant')
    plt.scatter(x_b, y_b, color='blue', label='benign')
    plt.xlabel(feat1)
    plt.ylabel(feat2)
    plt.title("Correlation between {} vs. {}".format(feat1, feat2))
    plt.legend()
    plt.show()
plt_feat1_feat2("perimeter_mean", "radius_worst")
plt_feat1_feat2("area_worst", "radius_worst")
plt_feat1_feat2("texture_mean", "texture_worst")
plt_feat1_feat2("area_mean", "radius_worst")
def violin_plots(lower_bound, upper_bound):
    input_std = features.drop('diagnosis', axis=1)
    # standardize each feature (z-score) and concatenate the encoded target back in as a new dataframe
    input_std = (input_std - input_std.mean())/(input_std.std())
    input_std = pd.concat([Y_diagnosis, input_std.iloc[:, lower_bound:upper_bound]], axis=1)
    #flatten (melt) the dataframe into long form to be used in the violin plot
    input_flatten = pd.melt(input_std, id_vars="diagnosis", var_name="features", value_name='value')
    plt.figure(figsize=(15,8))
    sns.violinplot(x='features', y='value', hue='diagnosis', data=input_flatten, split=True, inner='quart')
    plt.xticks(rotation=45)
    plt.legend(loc=2, prop={'size': 10})
#plot the features which have the mean attribute
violin_plots(0,10)
Observation from the violin plot above: both the smoothness_mean and the symmetry_mean features seem to have nearly identical distributions! It might be useful to use our correlation plots above to determine whether there is a particular dependency relation between the two, and if there is, we can reduce redundancy and data dimensionality by dropping one of the columns, since most of the variance of the dataset is incorporated in the other features.
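A one-line check of that hypothesis (illustrative only) is to compute the correlation between the two features directly:
#how correlated are the two features with near-identical violin shapes?
print(features['smoothness_mean'].corr(features['symmetry_mean']))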
#plot the features which have the "SE" attribute
violin_plots(10,20)
#plot the feature values with the "worst" attribute
violin_plots(20, 30)
The violin plots above combine the best aspects of a box plot and a kernel density plot, depicting the median and the 1st/3rd quartiles (the interquartile range), which can be used to identify and extract outliers.
In the 1st violin plot, one observation we can make is that the median values of the malignant and benign halves of features like texture_mean, perimeter_mean, and area_mean do not coincide. Compare this with fractal_dimension_mean, for example, where the medians of the malignant and benign subcategories coincide, which might make it difficult for a classifier to distinguish between the 2 classes. Features with non-coinciding medians and ranges could potentially be good features for a classifier to train on to distinguish between the 2 classes.
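We can quantify the "medians do not coincide" observation by comparing per-class medians directly (a quick illustrative check on one well-separated feature and one poorly separated feature):
#per-class medians for a well-separated feature and a poorly separated one
print(features.groupby('diagnosis')[['area_mean', 'fractal_dimension_mean']].median())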
The next step after EDA is dimensionality reduction! We'll be employing PCA in order to reduce the dimensions of our dataset and highlight any inherent patterns that we can learn from through meaningful graphs. PCA merges the original variables into linear combinations that best capture the total variability of the dataset. From these linear combinations, we can reduce the complexity of a 30-dimensional model by extracting the subset of components that accounts for a good chunk (~90%) of the variance in the data.
Additionally, through our EDA above, we saw that features such as "area mean" and "area worst" are operating on very different scales. Unscaled features can be detrimental to the performance of our ML classifier and can let a few features dominate simply because of their larger variances. Thus, we can use sklearn's StandardScaler() interface to fix this issue.
target = pd.DataFrame(Y_diagnosis)
data = features.drop('diagnosis', axis=1)
#scale the input using StandardScaler()
X_standardized = StandardScaler().fit_transform(data.values)
pca = PCA(svd_solver='full')
X_pca = pca.fit_transform(X_standardized)
#merge the PCA-extracted features into a new dataframe
pca_df = pd.DataFrame(X_pca)
pca_df = pca_df.merge(target, left_index=True, right_index=True, how='left')
pca_df.head()
X_pca = PCA().fit(X_standardized)
#plot the cumulative variance ratio that is encoded in the PCA extracted features
plt.plot(np.cumsum(X_pca.explained_variance_ratio_))
plt.xlabel("#Components")
plt.ylabel("Cumulative Explained Variance Ratio")
pca = PCA(n_components=6)
pca.fit(X_standardized)
transformed_df = pd.DataFrame(pca.transform(X_standardized), columns = ['PC1', 'PC2', 'PC3',
'PC4', 'PC5', 'PC6'])
transformed_df.head()
One of the foremost powers of PCA is the ability to estimate how many components are needed to describe the data while capturing as much of the variance in the dataset as possible. This curve quantifies how much of the total 30-dimensional variance is contained within the first N components. As shown above, the first 6 components contain about 90% of the variance, while you need about 20 components to describe close to 100% of the variance. Thus, we retain 6 components in order to keep roughly 90% of the variance.
# Binary Classifier
pca = PCA(n_components=2)
# fit PCA with 2 components
pca = pca.fit_transform(X_standardized)
plt.figure(figsize = (9,5))
# plot first 2 PCA components
plt.scatter(pca[:,0],pca[:,1], c = Y_diagnosis, cmap = "RdBu_r", edgecolor = "Red", alpha=0.35)
plt.colorbar()
plt.title('PCA Scatter Plot')
Above, I created a scatter plot of the observations along PCA components 1 and 2. However, it looks like there is not a definite, clear-cut, discriminative boundary between the 2 classes (there is a bit of overlap). This could be attributed to the first 2 PCA components not explaining as much of the variance in the original data as we would like.
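We can quantify that by printing the explained variance ratio of the first 2 components (a small check; X_pca, the full PCA fitted above, already holds these values):
#fraction of the total variance captured by the first 2 principal components
print(X_pca.explained_variance_ratio_[:2], X_pca.explained_variance_ratio_[:2].sum())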
from scipy import stats
# plot histograms showing the distribution of the each feature in the dataset
data.hist(figsize=(25,20), color='r')
plt.show()
As shown above, a good chunk of the distributions are right-skewed: they rise very sharply for values near 0 and have a long right tail. This shows why scaling/normalization is essential, especially for scale-sensitive algorithms like kNN or linear/logistic regression, in order to optimize model performance and generalizability. We could employ 2 types of transformation here: log scaling and the MinMaxScaler. Log scaling is particularly useful for a skewed distribution like the ones above, where a handful of values account for most of the points while the values at the tail end have far fewer points; this type of distribution is typically referred to as a power-law distribution. The MinMaxScaler, on the other hand, scales the data to a fixed range, with the drawback that it is sensitive to outliers: a few extreme values compress the remaining data into a narrow band and suppress its variation.
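Here is a small, hedged sketch of both options applied to one skewed feature (purely illustrative; the pipeline below relies on StandardScaler instead):
from sklearn.preprocessing import MinMaxScaler
#illustrative transformations of one right-skewed feature (not used in the main pipeline)
area = data[['area_mean']]
area_log = np.log1p(area)                          # log scaling (log(1 + x))
area_minmax = MinMaxScaler().fit_transform(area)   # scale to the [0, 1] range
print(area_log.describe())
print(pd.DataFrame(area_minmax, columns=['area_mean_minmax']).describe())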
#split the dataset into a training and holdout test set - 75% for training and 25% for validation/testing
X_train, X_test, y_train, y_test = train_test_split(data, Y_diagnosis, test_size=0.25, random_state=42)
# function for printing evaluation metrics given a fitted grid search with tuned parameters along with a confusion matrix
def print_evaluation_metrics(grid, cf_matrix):
    #print a classification report comparing model predictions with the test labels
    print(classification_report(y_test, grid.predict(X_test)))
    #print the best parameters from GridSearch and the best score the model attains using those tuned hyperparameters
    print("Tuned Hyperparameters: {}".format(grid.best_params_))
    print("Accuracy Rate: {}".format(grid.score(X_test, y_test)))
    print("Optimal Score: {}".format(grid.best_score_))
    group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
    #label the confusion matrix with the raw and percentage TP/TN/FP/FN values
    group_counts = ['{0:0.0f}'.format(value) for value in cf_matrix.flatten()]
    group_percentages = ['{0:.2%}'.format(value) for value in cf_matrix.flatten()/np.sum(cf_matrix)]
    labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in zip(group_names, group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    #produce a heatmap to clearly visualize the confusion matrix
    sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')
RFE (Recursive Feature Elimination) is a wrapper-style feature selection algorithm that searches for the best-performing subset of features. In conjunction with the random forest classifier, it ranks features by importance and recursively prunes away the least important ones until the desired number of features remains. Here, we will keep the top 10 features computed by RFE.
from sklearn.feature_selection import RFECV
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
#from sklearn.ensemble import RandomForestClassifier, DecisionTreeClassifier, GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn import tree
from sklearn.tree import export_graphviz
rf_model = RandomForestClassifier()
#perform an exhaustive search through these input parameters using GridSearch
parameters = {'n_estimators': [4, 6, 9, 10, 15],
'max_features': ['log2', 'sqrt','auto'],
'criterion': ['entropy', 'gini'],
'max_depth': [2, 3, 5, 10],
'min_samples_split': [2, 3, 5],
'min_samples_leaf': [1, 5, 8]
}
#fit and score the estimator on the training set with the tuned parameters
param_tuning = GridSearchCV(rf_model, parameters)
param_tuning = param_tuning.fit(X_train, y_train)
print("Best Parameters for Random Forest Classifier: {}".format(param_tuning.best_params_))
rf_model = param_tuning.best_estimator_
rf_model.fit(X_train, y_train)
#generate and print the evaluation metrics on the test data for the tuned random forest
y_pred = rf_model.predict(X_test)
cf_matrix = confusion_matrix(y_test, y_pred)
print_evaluation_metrics(param_tuning, cf_matrix)
#utilize the embedded methods of random forest to identify the most important features
# While training a tree-based classifier, it progressively identifies important features by averaging the impurity reduction caused by an individual feature
plt.figure(figsize=(15,12))
sorted_ids = rf_model.feature_importances_.argsort()
plt.barh(data.columns[sorted_ids], rf_model.feature_importances_[sorted_ids])
plt.xlabel("Random Forest Feature Importance Ranking")
plt.show()
#run the Recursive Feature Elimination algorithm, which identifies the top k features and exclusively trains on that subset
rfe = RFE(estimator=rf_model, n_features_to_select=10)
rfe = rfe.fit(X_train, y_train)
for i in range(X_train.shape[1]):
    if rfe.support_[i]:
        print('Column: %s, Selected %s, Rank: %d' % (X_train.columns[i], rfe.support_[i], rfe.ranking_[i]))
As shown above, we have selected the top 10 features. The overall accuracy with this reduced feature set, printed below, is about 95%.
print('Best Features', list(X_train.columns[rfe.support_]))
y_pred = rfe.predict(X_test)
print("Training Accuracy: ", rfe.score(X_train, y_train))
print("Testing Accuracy: ", rfe.score(X_test, y_test))
# set up sklearn's DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=40)
param_dict = {"criterion":['gini', 'entropy'],
"max_depth":range(1,10),
"min_samples_leaf":range(1,5)}
# run GridSearchCV to tune the criterion, max_depth, and min_samples_leaf hyperparameters
grid_tree = GridSearchCV(dt, param_grid=param_dict, cv=10, verbose=1, n_jobs=-1)
# evaluate the best combination of parameter values on the training set
grid_tree.fit(X_train, y_train)
# generate and print the evaluation metrics on the test data for this model
y_pred = grid_tree.predict(X_test)
cf_matrix = confusion_matrix(y_test, y_pred)
print_evaluation_metrics(grid_tree, cf_matrix)
from sklearn.svm import SVC
from sklearn.metrics import classification_report
#run a transformation pipeline with our SVC wrapper function
pipeline = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
parameters = {'svm__C':[1,10,100],
'svm__gamma': [0.1, 0.01, 0.001],
'svm__kernel': ['linear', 'rbf', 'poly']}
#run GridSearch with cross validation to find best combination of parameters yielding optimal classifier performance
grid_svc = GridSearchCV(pipeline, parameters, cv=4, refit=True, verbose=2)
#fit this optimized supervised learner on training data and evaluate on the holdout set
grid_svc.fit(X_train, y_train)
y_pred = grid_svc.predict(X_test)
#generate and print evaluation metrics for the model
cf_matrix = confusion_matrix(y_test, y_pred)
print_evaluation_metrics(grid_svc, cf_matrix)
from sklearn import svm
# keep only the first two features (radius_mean, texture_mean) so the decision boundaries can be visualized in 2-D
X = data.values[:, :2]
y = Y_diagnosis.values
h = 0.02 #step size in the mesh
C = 1.0 # SVM regularization parameter
svc = SVC(kernel='linear', C=C).fit(X, y)
rbf_svc = SVC(kernel='rbf', gamma=0.7, C=C).fit(X, y)
poly_svc = SVC(kernel='poly', degree=3, C=C).fit(X, y)
lin_svc = svm.LinearSVC(C=C).fit(X, y)
# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# title for the plots
titles = ['SVC with linear kernel',
'LinearSVC (linear kernel)',
'SVC with RBF kernel',
'SVC with polynomial (degree 3) kernel']
plt.figure(figsize=(15,12))
for i, clf in enumerate((svc, lin_svc, rbf_svc, poly_svc)):
    #Plot the decision boundary. For that, we will assign a color to each
    #point in the mesh [x_min, x_max]x[y_min, y_max].
    plt.subplot(2, 2, i + 1)
    plt.subplots_adjust(wspace=0.4, hspace=0.4)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
    # also plot the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
    plt.xlabel('radius_mean', size=15)
    plt.ylabel('texture_mean', size=15)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])
plt.show()
Above, I have plotted the decision boundaries from the SVC() algorithm with different kernels as a point of comparison. Through the grid search, we found that the classifier performs optimally with the linear kernel. Looking at the graph, we can see how the linear hyperplane, which maximizes the margin between the support vectors of the 2 different classes, is best able to discriminate between the 2 classes through its decision boundary.
from sklearn import model_selection
# refit the 3 classifiers on the 6 principal components extracted by PCA as a point of comparison
svc_pca = grid_svc.fit(transformed_df, Y_diagnosis)
tree_pca = grid_tree.fit(transformed_df, Y_diagnosis)
rf_pca = rf_model.fit(transformed_df, Y_diagnosis)
#compare SVC, Decision Tree, and Random Forest Models with all the features vs top 6 PCA-extracted features
models_pca = [('svc', svc_pca), ('decision tree',tree_pca), ('random forest', rf_pca)]
models = [('svc', grid_svc), ('decision tree', grid_tree), ('random forest', rf_model)]
eval_metrics = []
names = []
for (name, model) in models:
    #use scikit's cross-validation generators to yield the train/test indices for each iteration of KFold cross validation
    kfold = model_selection.KFold(n_splits=15, shuffle=True, random_state=6)
    #train the estimator on the training folds and score it on the held-out fold at each iteration
    cv_res = model_selection.cross_val_score(model, data, Y_diagnosis, cv=kfold, scoring='accuracy')
    eval_metrics.append(cv_res)
    names.append(name)
eval_metrics_pca = []
names_pca = []
for (name, model) in models_pca:
    kfold = model_selection.KFold(n_splits=15, shuffle=True, random_state=6)
    #score each classifier on the 6 PCA components rather than the full feature set
    cv_res = model_selection.cross_val_score(model, transformed_df, Y_diagnosis, cv=kfold, scoring='accuracy')
    #compile the model evaluation results into a list
    eval_metrics_pca.append(cv_res)
    names_pca.append(name)
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
# plot boxplots showing the distribution in accuracies across the different iterations of KFold cross validation for each classifier
plt.title("Supervised Classifier Predictive Performance Comparison")
plt.boxplot(eval_metrics)
ax.set_xticklabels(['SVC', 'Decision Tree', 'Random Forest'])
plt.show()
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
#plot boxplots comparing the accuracies for each classifier trained on the PCA-reduced feature set
plt.title("Supervised Classifier Predictive Performance Comparison with PCA-Reduced Features")
plt.boxplot(eval_metrics_pca)
ax.set_xticklabels(['SVC', 'Decision Tree', 'Random Forest'])
plt.xlabel("SVC, Decision Tree, and Random Forest")
plt.ylabel("Accuracy")
plt.show()
In conclusion, our original purpose was to implement a robust classifier to conduct a 2-way classification between the breast cancer classes of malignant vs. benign. First, after some initial data cleaning, data analysis and EDA were a useful starting point for developing some fundamental intuition about patterns and correlative relationships among the independent variables and between the independent variables and the target, and also for gaining insight into the distribution of the dataset and the preprocessing necessary before implementing a supervised classifier. By utilizing the insight gained from EDA, we were able to use PCA to reduce the data to the top 6 principal components, which encode roughly 90% of the dataset variance.
Moving towards the machine learning side of the project, we successfully trained 3 different supervised classifiers - Random Forest, Support Vector Classifier, and Decision Tree - tuned their hyperparameters, evaluated performance metrics, and ran the RFE algorithm to select a fixed number of features that maximize the averaged accuracies. Overall, the SVC model performed best, reaching an average classification accuracy of about 97%, which was validated using multiple iterations of K-fold cross-validation. Training and testing the classifiers on the top 6 PCA components did not substantially alter performance, except for the Random Forest classifier. That said, it is important to keep in mind that our classifiers could be subject to overfitting, especially given the relatively constrained size of the dataset and the number of features we are working with. Increasing the training set sample size or trying different ensemble methods, such as bagging and boosting, would be useful for completing a robust, comparative analysis across even more supervised classifiers.
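As a hedged sketch of that last extension (illustrative only, using default hyperparameters rather than a tuned grid), the same cross-validation protocol could be pointed at sklearn's bagging and boosting ensembles:
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
#illustrative comparison of bagging and boosting ensembles with the same 15-fold CV protocol
kfold = model_selection.KFold(n_splits=15, shuffle=True, random_state=6)
for name, clf in [('bagging', BaggingClassifier(random_state=6)),
                  ('gradient boosting', GradientBoostingClassifier(random_state=6))]:
    scores = model_selection.cross_val_score(clf, data, Y_diagnosis, cv=kfold, scoring='accuracy')
    print("{}: mean accuracy = {:.3f}".format(name, scores.mean()))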
One possible way we could extend this research direction is by transitioning from the 2-class benign/malignant classification to a multi-class model which, given the same kinds of features from digital histopathological datasets, can discriminate between different subordinate classes of breast cancer (ductal carcinoma, fibroadenoma, lobular carcinoma, etc.). Additionally, we could use deep learning techniques such as convolutional neural networks (CNNs) or ensemble frameworks to learn feature maps directly from the images, which is a considerably more challenging task but could open new doors to integrating a robust, generalizable AI system into the clinical sphere for real-time diagnostic results. As the Google AI team has shown with their clinically tested models for diseases like diabetic retinopathy and breast cancer, data science together with machine learning/deep learning are critical tools that healthcare increasingly needs for efficient, real-time data processing and for extracting clinically relevant diagnostic information to guide fundamental decisions. If you are interested in learning more about this exciting area of AI + healthcare, I encourage you to check out this Nature Medicine article - https://www.nature.com/articles/s41591-018-0316-z - which thoroughly covers recent advances in deep learning in healthcare and radiology. I also recommend the AI for Medicine Specialization - https://www.deeplearning.ai/program/ai-for-medicine-specialization/ - to receive further guidance and training in the data analytics and deep learning skills that today's healthcare sphere urgently needs. Thanks for visiting this page!
Please feel free to peruse these helpful resources for understanding this dataset, the real-world applicability of ML-driven breast cancer classifiers in healthcare, and the latest research involving breast cancer datasets such as this one!
Statistical Analysis
HeatMaps - https://seaborn.pydata.org/generated/seaborn.heatmap.html
Correlation Matrices - https://towardsdatascience.com/better-heatmaps-and-correlation-matrix-plots-in-python-41445d0f2bec
Principal Component Analysis - https://builtin.com/data-science/step-step-explanation-principal-component-analysis
Recursive Feature Elimination - https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
Random Forest Classifier - https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
DataSet Information - https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29
Latest Research in the Field!
https://arxiv.org/abs/2011.08001
https://arxiv.org/abs/2104.12021
https://healthitanalytics.com/news/deep-learning-may-detect-breast-cancer-earlier-than-radiologists