Breast Cancer Early Prediction and Biomarker Analysis using Data Analytics, Visualization, and Machine Learning

Tutorial by Chris Jose

Introduction - Motivation

Interested in data science? Integrating ML and Data Science into Healthcare? You are in the right place. Welcome to my Data Science Pipeline! Here, you will gain a step by step, meticulous introduction to the basic ingredients encompassing the Data Science project, from data collection, information preprocessing/feature engineering, exploratory analysis, and supervised learning which you will set you on the path to becoming a Data Science Practicioner in no time!

Specifically, today, we're aiming to create a robust classifier for predicting 2 different classes of Breast Cancer malignancies. Breast Cancer is the 2nd most common cancer among women in the United States, with over 3.8 million people having a history of breast cancer in 2021 and an estimated 281,550 new cases of invasive breast cancer in the first half of 2021 alone, according to the CDC. The steady increase in cases, combined with the excessive burden it places on fragile healthcare ecosystems especially in developing countries, difficulties in early diagnosis and prevention, and ever short supply of radiologists to meet the needs of hospitals receiving exponential amounts of data, machine learning, data science and deep learning methodologies are some of the most effective analytical tools to effectively meet some of these challenges. Through this tutorial, I will give a detailed, step-by-step introduction to the DataOps pipeline and hopefully ease your introduction into the exciting world of AI+healthcare.

The dataset I will be using has been extracted from the UCI Machine Learning Repository, found at the the following link: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29. In this csv file, firstly, important attributes of Patient ID Number along with Diagnosis(M for malignant and B for benign) are listed. Then, for each sample input, 10-real values features are computed from a digitized image of a fine needly aspirate (FNA) of a breast mass. These 10 important features describe fundamental characteristics of the cell nuclei presented in histopathological images. The quantitative descriptive statistics mean, standard error, and largest(average of 3 largest values in a column) of each feature is also given. The 10 features are as follows:

a) radius (mean of distances from center to points on the perimeter)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter^2 / area - 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry

j) fractal dimension ("coastline approximation" - 1)

The distribution of the dataset is approximately 212 malignant and 357 benign samples. The principle research question we want to ask is whether we can accurately predict, given diagnostic information extracted from a digitized histopathological image, whether the sample is benign or malignant and provide some insight into the top most important biomarkers which correlate with different variations of breat cancer, using powerful data science, data munching, and machine learning hypothesis testing toolkits at our disposal. I hope to further shed some light into the how these ML-predictive models can best generalize to new sets of multi-modal data. Without further a due, let's get started!

In [205]:
#import necessary packages 
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf
import seaborn as sns
import sklearn as sk
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, learning_curve, train_test_split
from sklearn.metrics import precision_score, recall_score, confusion_matrix, roc_curve, precision_recall_curve, accuracy_score

1. Data Collection/Extraction

Data was cleaned and extracted from the UCI Machine Learning Repository's Breast Cancer Diagnostic DataSet, which was nicely formatted into a CSV consisting of id's, 30 features, and the binary classification of malignant vs. benign, after some backend additional SQL instructions. These features were computed from a digitized image of a fine needle aspirate (FNA) of a cell nuclei in each image. While there are various publicly available datasets for breast cancer screening, including several PLCO datasets available through https://cdas.cancer.gov/datasets/plco/19/, for my specific purposes of doing data analytics and training a ML model for achieving high scores for binary classification, the UCI dataset was the best one due to the sizable number of features and data points included as well as ease of use.

Once the data was downloaded from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic), the master file "data.csv" containing all datapoints of interest was saved locally.

In the subsequent few sections, data is extracted and organized effectively using the Pandas DataFrame and the library's diverse functionalities for data wrangling.

In [206]:
#raw_data_path = os.path.join('uci_breast_cancer', 'csv','data.csv' )
df = pd.read_csv("data.csv")
df.head(10)
Out[206]:
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean ... texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst Unnamed: 32
0 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.30010 0.14710 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 NaN
1 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.08690 0.07017 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 NaN
2 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.19740 0.12790 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 NaN
3 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.24140 0.10520 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 NaN
4 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.19800 0.10430 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 NaN
5 843786 M 12.45 15.70 82.57 477.1 0.12780 0.17000 0.15780 0.08089 ... 23.75 103.40 741.6 0.1791 0.5249 0.5355 0.1741 0.3985 0.12440 NaN
6 844359 M 18.25 19.98 119.60 1040.0 0.09463 0.10900 0.11270 0.07400 ... 27.66 153.20 1606.0 0.1442 0.2576 0.3784 0.1932 0.3063 0.08368 NaN
7 84458202 M 13.71 20.83 90.20 577.9 0.11890 0.16450 0.09366 0.05985 ... 28.14 110.60 897.0 0.1654 0.3682 0.2678 0.1556 0.3196 0.11510 NaN
8 844981 M 13.00 21.82 87.50 519.8 0.12730 0.19320 0.18590 0.09353 ... 30.73 106.20 739.3 0.1703 0.5401 0.5390 0.2060 0.4378 0.10720 NaN
9 84501001 M 12.46 24.04 83.97 475.9 0.11860 0.23960 0.22730 0.08543 ... 40.68 97.65 711.4 0.1853 1.0580 1.1050 0.2210 0.4366 0.20750 NaN

10 rows × 33 columns

In [210]:
features.describe().transpose()
Out[210]:
count mean std min 25% 50% 75% max
radius_mean 569.0 14.127292 3.524049 6.981000 11.700000 13.370000 15.780000 28.11000
texture_mean 569.0 19.289649 4.301036 9.710000 16.170000 18.840000 21.800000 39.28000
perimeter_mean 569.0 91.969033 24.298981 43.790000 75.170000 86.240000 104.100000 188.50000
area_mean 569.0 654.889104 351.914129 143.500000 420.300000 551.100000 782.700000 2501.00000
smoothness_mean 569.0 0.096360 0.014064 0.052630 0.086370 0.095870 0.105300 0.16340
compactness_mean 569.0 0.104341 0.052813 0.019380 0.064920 0.092630 0.130400 0.34540
concavity_mean 569.0 0.088799 0.079720 0.000000 0.029560 0.061540 0.130700 0.42680
concave points_mean 569.0 0.048919 0.038803 0.000000 0.020310 0.033500 0.074000 0.20120
symmetry_mean 569.0 0.181162 0.027414 0.106000 0.161900 0.179200 0.195700 0.30400
fractal_dimension_mean 569.0 0.062798 0.007060 0.049960 0.057700 0.061540 0.066120 0.09744
radius_se 569.0 0.405172 0.277313 0.111500 0.232400 0.324200 0.478900 2.87300
texture_se 569.0 1.216853 0.551648 0.360200 0.833900 1.108000 1.474000 4.88500
perimeter_se 569.0 2.866059 2.021855 0.757000 1.606000 2.287000 3.357000 21.98000
area_se 569.0 40.337079 45.491006 6.802000 17.850000 24.530000 45.190000 542.20000
smoothness_se 569.0 0.007041 0.003003 0.001713 0.005169 0.006380 0.008146 0.03113
compactness_se 569.0 0.025478 0.017908 0.002252 0.013080 0.020450 0.032450 0.13540
concavity_se 569.0 0.031894 0.030186 0.000000 0.015090 0.025890 0.042050 0.39600
concave points_se 569.0 0.011796 0.006170 0.000000 0.007638 0.010930 0.014710 0.05279
symmetry_se 569.0 0.020542 0.008266 0.007882 0.015160 0.018730 0.023480 0.07895
fractal_dimension_se 569.0 0.003795 0.002646 0.000895 0.002248 0.003187 0.004558 0.02984
radius_worst 569.0 16.269190 4.833242 7.930000 13.010000 14.970000 18.790000 36.04000
texture_worst 569.0 25.677223 6.146258 12.020000 21.080000 25.410000 29.720000 49.54000
perimeter_worst 569.0 107.261213 33.602542 50.410000 84.110000 97.660000 125.400000 251.20000
area_worst 569.0 880.583128 569.356993 185.200000 515.300000 686.500000 1084.000000 4254.00000
smoothness_worst 569.0 0.132369 0.022832 0.071170 0.116600 0.131300 0.146000 0.22260
compactness_worst 569.0 0.254265 0.157336 0.027290 0.147200 0.211900 0.339100 1.05800
concavity_worst 569.0 0.272188 0.208624 0.000000 0.114500 0.226700 0.382900 1.25200
concave points_worst 569.0 0.114606 0.065732 0.000000 0.064930 0.099930 0.161400 0.29100
symmetry_worst 569.0 0.290076 0.061867 0.156500 0.250400 0.282200 0.317900 0.66380
fractal_dimension_worst 569.0 0.083946 0.018061 0.055040 0.071460 0.080040 0.092080 0.20750

The pandas describe() function helps us get a whole side picture of several key aspects of our dataset. One of the key aspects we can observe is that many of the features are not standardized/normalized - for example, compactness and area are on different scales. This should ring some bells since ML models dealing with differences in scale across input variables have a very difficult time modelling the dataset patterns, and can led to poor performance or a high generalization error. We will due to this prior to implementing our supervised classifiers.

As you can see, this dataset has a total of 569 data points, 30 features of interest for our purposes of binary classification, and a target column with 2 labels, M=malignant and B=benign. All data entries all filled and complete, and the extra columns of id and "Unnamed" are removed from the dataframe. The data is then further split into 2 dataframes, one containing the features(independent variables) and the other containing the target column(dependent variable).

In [208]:
# features contains the independent variables
# Y_diagnosis variable contains the dependent variable
df.info()
df = df.drop(['Unnamed: 32', 'id'], axis=1)
df.replace("?", np.nan, inplace=True)
features = df

# Use LabelEncoder() to transform the target into quantitative data - 0 for benign and 1 for malignant
le = LabelEncoder()
Y_diagnosis = pd.Series(le.fit_transform(features['diagnosis']))

# differentiate between the 3 sets of statistics that are computed for each of the 10 features
means = [col for col in features.columns if col.endswith("mean")]
se = [col for col in features.columns if col.endswith("se")]
worst = [col for col in features.columns if col.endswith("worst")]

print("----------------------------------------------------")
print("Number of Training Samples: ", len(df.index))
print("Number of Features: ", len(df.columns))
print("Set of Features: ", df.columns)
print("----------------------------------------------------")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
id                         569 non-null int64
diagnosis                  569 non-null object
radius_mean                569 non-null float64
texture_mean               569 non-null float64
perimeter_mean             569 non-null float64
area_mean                  569 non-null float64
smoothness_mean            569 non-null float64
compactness_mean           569 non-null float64
concavity_mean             569 non-null float64
concave points_mean        569 non-null float64
symmetry_mean              569 non-null float64
fractal_dimension_mean     569 non-null float64
radius_se                  569 non-null float64
texture_se                 569 non-null float64
perimeter_se               569 non-null float64
area_se                    569 non-null float64
smoothness_se              569 non-null float64
compactness_se             569 non-null float64
concavity_se               569 non-null float64
concave points_se          569 non-null float64
symmetry_se                569 non-null float64
fractal_dimension_se       569 non-null float64
radius_worst               569 non-null float64
texture_worst              569 non-null float64
perimeter_worst            569 non-null float64
area_worst                 569 non-null float64
smoothness_worst           569 non-null float64
compactness_worst          569 non-null float64
concavity_worst            569 non-null float64
concave points_worst       569 non-null float64
symmetry_worst             569 non-null float64
fractal_dimension_worst    569 non-null float64
Unnamed: 32                0 non-null float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
----------------------------------------------------
Number of Training Samples:  569
Number of Features:  31
Set of Features:  Index(['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')
----------------------------------------------------

As shown above, we have a total of 569 data points available, with all possible entries being filled with valid(non-null) data. While we do have the option of using a specific outlier detection methodology like IQR or Z-scores to remove outliers that are present due to the natural variability in this medical dataset, I will not utilize it here due to how compact the dataset is currently.

In this dataset, 10 predictor variables(the features) were computed from the digital nuclei image. Furthermore, the mean, standard error(se) and the mean of the top 3 largest values("worst") were also computed and registered for each feature in this data pool.

2. Exploratory Data Analysis

One of the critical roles EDA and Visualization plays to analyze this particular dataset is to (1) understand relations between our different feature sets and the target and (2) to select the best subset of features that we can train our classifiers on. The Seaborn and Matplotlib libraries are ideal candidates of choice that I will use to create clear, high level graphs for visualization purposes. More information on several cool features of seaborn can be found here: https://seaborn.pydata.org/examples/index.html. Definitely check it out if you're a graphics or visualization enthusiast!

I first decided to do a heatmap of correlations visualized on a correlation matrix to see whether there were any subtle dependencies or patterns between the 30 features(independent variable) and the dependent variables that were elucidated above. Before implementing a supervised ML classifier, locating subtle correlations and dependency relations in the data is an important part for our further steps of feature engineering and dimensionality reduction.

Independent Variables(in the dataset, the mean, SE, and largest values of the features below were reported, yielding a total= of 30 features) a) radius (mean of distances from center to points on the perimeter)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter^2 / area - 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry

j) fractal dimension ("coastline approximation" - 1)

Dependent Variable:

  1. Diagnosis - Benign vs. malignant
In [42]:
features['diagnosis'].value_counts(normalize=True)
Out[42]:
B    0.627417
M    0.372583
Name: diagnosis, dtype: float64

As shown above, it looks like we have a class imbalance among benign vs. malignant. To address this, we could try out a variety of approaches, from resampling the data, using different evaluation metrics like AUC and MCC(correlation coefficient between the observed and predicted binary classifications), or use K-fold cross validation for resampling different portions of the dataset.

In [179]:
#Create a heatmap
correlation_matrix = features.corr()
sns.heatmap(correlation_matrix, vmin=correlation_matrix.values.min(), vmax=1, linewidths=0.3, square=True)
plt.show()

Based off of the heatmap above, it looks like radius, perimeter, and area are highly correlated(as seen by the lighter shades). Also, the concave points_mean, concavity_mean, and compactness_mean are another group of 3 features which are highly correlated. This is what we expect since concavity, radius, perimeter, and area of a nuclei are critical factors that we utilize to distinguish between a malignant vs. benign cell mass. These correlations can be further confirmed by conducting a Pearson's Correlation test and by graphing a joint_plot, as shown below.

This is a prime example of multicollinearity, which can be a problem in an ML model because we would not be able to distinguish the individual effects of the independent variable on the dependent variable. Thus, we can sample a single feature from these groups of features with high correlation to account for the multicollinearity between some features.

Additionally, since we're working with a slightly imbalanced dataset(357 malignant, 212 benign), we need to employ a sufficient sampling methodology to avoid including inherent bias when training a classifier. In particular, a feature selection algorithm must be used to eliminate any unnecessary features.

Based off of a high-level glance at this heatmap, we can gain some initial intuition on how to go about feature engineering/selection. Often times, traditional feature selection algorithms employ statistical tests to choose discriminative features which don't fall into the trap of multicollinearity. For example, one of the most prominent groups of correlated features are radius_mean, area_mean, and perimeter_mean, so we can decide to drop the others and instead choose area_mean since, as we'll see later on, it seems to be very effective at discriminating between our 2 binary classes.

In [14]:
#this plot produces the pairwise correlations among 4 different chosen features
sns.pairplot(features, hue='diagnosis', vars = ["radius_mean", "texture_mean", "smoothness_mean", "concavity_mean"])
Out[14]:
<seaborn.axisgrid.PairGrid at 0x1525748d160>
In [238]:
Y_diagnosis = pd.Series(Y_diagnosis)
plt.figure(figsize=(15,10))
#plot the correlations between the independent variables and the target variable
features.drop('diagnosis', axis=1).corrwith(Y_diagnosis).plot(kind='bar', grid=True)
plt.title("Correlation of the Mean Features with the Diagnosis")
Out[238]:
Text(0.5, 1.0, 'Correlation of the Mean Features with the Diagnosis')

The features smoothness_se, fractal_dim_mean, fractal_dim_se, and texture_se seems to be the features that are least correlated with our target, which is the diagnosis. All other features are well correlated, especially concave_points_worst, concave_points_mean, radius, and perimeter.

In [211]:
#function plotting the correlation between the 2 features
def plt_feat1_feat2(feat1, feat2):
    # group the data(predictor, taget) variables for our feature of interest
    x_m = features[feat1].groupby(features['diagnosis']).apply(list)[1]
    y_m = features[feat2].groupby(features['diagnosis']).apply(list)[1]
    x_b = features[feat1].groupby(features['diagnosis']).apply(list)[0]
    y_b = features[feat2].groupby(features['diagnosis']).apply(list)[0]
    fig, ax = plt.subplots(figsize=(8,8))

    
    #create subplots showing the correlations between the 2 features with 2 labels
    plt.scatter(x_m, y_m, color='red', label='malignant')
    plt.scatter(x_b, y_b, color='blue',label='benign')
    plt.xlabel(feat1)
    plt.ylabel(feat2)
    plt.title("Correlation between {} vs. {}".format(feat1, feat2))
    plt.legend()
    plt.show()
    
plt_feat1_feat2("perimeter_mean", "radius_worst")
plt_feat1_feat2("area_worst", "radius_worst")
plt_feat1_feat2("texture_mean", "texture_worst")
plt_feat1_feat2("area_mean", "radius_worst")
In [19]:
def violin_plots(lower_bound, upper_bound):

    num_features = len(features.columns)
    input_std = features.drop('diagnosis', axis=1)
    # averaging each input value and concatenate into a new dataframe
    input_std = (input_std - input_std.mean())/(input_std.std())
    input_std = pd.concat([Y_diagnosis, input_std.iloc[:, lower_bound:upper_bound]], axis=1)
    #flatten the input to be used in the violin plot
    input_flatten = pd.melt(input_std, id_vars="diagnosis", var_name="features", value_name='value')
    input_flatten

    plt.figure(figsize=(15,8))
    sns.violinplot(x='features', y='value', hue='diagnosis', data = input_flatten, split=True, inner='quart')
    #plt.xticks(fontsize=15)
    plt.xticks(rotation=45)
    plt.legend(loc=2, prop={'size': 10})

#plot the features which have the mean attribute
violin_plots(0,10)

Observation from the violin plot above: Both the smoothness_mean and the symmetry_mean features seem to have nearly identical distributions! It might be useful to use our correlation plots above to determine whether there is a particular dependency relation between the two, and if there si, we can reduce redundancy and data dimensionality by dropping one of the columns, since most of the variance of the dataset is incorporated into the other features.

In [20]:
#plot the features which have the "SE" attribute
violin_plots(10,20)
In [21]:
#plot the feature values with the "worst" attribute
violin_plots(20, 30)

The violin plots above combine the best aspects of a box and a kernel density plot and depict the median, the interquartile range, and the 1st/3rd quartiles which can be used to identify and extract outliers.

In the 1st violin plot, one observation we can make is that the median value the malignant and benign sections of features like texture_mean, perimeter_mean, and area_mean don't coincide - compare it with the fractal_dimension_mean for example, and you can see that the medians of the malignant and benign subcategories coincide and might be difficult for the classifier to distinguish between the 2 classes. Features with non-coinciding medians and ranges could potentially make this a good feature that a classifier could train on to distinguish between the 2 classes.

Principal Component Analysis and Dimensionality Reduction

The next step after EDA is dimensionality reduction! We'll be employing the PCA technique in order to reduce the dimensions of our dataset and highlight any inherent patterns that we can learn from through meaningful graphs. PCA specifically can merge different variables into an assortment of linear combinations that can best encompass the total variability of this dataset. From these linear combinations, we can reduce the complexity of a 30-dim model by extract the linear combination that yields a good chunk (~90%) of the covariance within the data.

Additionally, through our EDA above, from the covariance matrix, we can see that features such as "area mean" and "area worst" have descriptive statistics that are on operating on different scales. Unscaled statistics of our feature set can be detrimental to the performance of our ML classifier, and can lead to amplified variances. Thus, we can use sklearn's StandardScaler() interface to fix this issue.

In [145]:
target = pd.DataFrame(Y_diagnosis)
data = features.drop('diagnosis', axis=1)

#scale the input using StandardScaler()
X_standardized = StandardScaler().fit_transform(data.values)
pca = PCA(svd_solver='full')
X_pca = pca.fit_transform(X_standardized)
#merge the PCA-extracted features into a new dataframe
pca_df = pd.DataFrame(X_pca)
pca_df = pca_df.merge(target, left_index=True, right_index=True, how='left')
pca_df.head()
Out[145]:
0_x 1 2 3 4 5 6 7 8 9 ... 21 22 23 24 25 26 27 28 29 0_y
0 9.192837 1.948583 -1.123166 3.633731 -1.195110 1.411424 2.159370 -0.398407 -0.157118 -0.877402 ... 0.068850 0.084519 -0.175256 -0.151020 -0.201503 -0.252585 -0.033914 0.045648 -0.047169 1
1 2.387802 -3.768172 -0.529293 1.118264 0.621775 0.028656 0.013358 0.240988 -0.711905 1.106995 ... -0.094578 -0.217718 0.011290 -0.170510 -0.041129 0.181270 0.032624 -0.005687 -0.001868 1
2 5.733896 -1.075174 -0.551748 0.912083 -0.177086 0.541452 -0.668166 0.097374 0.024066 0.454275 ... -0.060309 -0.074291 0.102762 0.171158 0.004735 0.049569 0.047026 0.003146 0.000751 1
3 7.122953 10.275589 -3.232790 0.152547 -2.960878 3.053422 1.429911 1.059565 -1.405440 -1.116975 ... -0.203266 -0.124105 0.153430 0.077496 -0.275225 0.183462 0.042484 -0.069295 -0.019937 1
4 3.935302 -1.948072 1.389767 2.940639 0.546747 -1.226495 -0.936213 0.636376 -0.263805 0.377704 ... -0.017650 0.139454 -0.005332 0.003062 0.039254 0.032168 -0.034786 0.005038 0.021214 1

5 rows × 31 columns

In [151]:
X_pca = PCA().fit(X_standardized)
#plot the cumulative variance ratio that is encoded in the PCA extracted features
plt.plot(np.cumsum(X_pca.explained_variance_ratio_))
plt.xlabel("#Components")
plt.ylabel("Cumulative Explained Variance Ratio")
pca = PCA(n_components=6)
pca.fit(X_standardized)
transformed_df = pd.DataFrame(pca.transform(X_standardized), columns = ['PC1', 'PC2', 'PC3', 
                                       'PC4', 'PC5', 'PC6'])
transformed_df.head()
Out[151]:
PC1 PC2 PC3 PC4 PC5 PC6
0 9.192837 1.948583 -1.123166 3.633731 -1.195110 1.411424
1 2.387802 -3.768172 -0.529293 1.118264 0.621775 0.028656
2 5.733896 -1.075174 -0.551748 0.912083 -0.177086 0.541452
3 7.122953 10.275589 -3.232790 0.152547 -2.960878 3.053422
4 3.935302 -1.948072 1.389767 2.940639 0.546747 -1.226495

One of the foremost powers of PCA is to be able to estimate how many components are needed to describe the data and encompass as much of the variance in the dataset as possible. This curve helps quantify how much of the total 30-dimensional variance is contained within the first N components. As shown above, we can see that with respect to features, the first 6 features contain about 90% of the variance, while you need about 20 components to desribe close to 100% of the variance. Thus, we would need about 6 components in order to retain 90% of the variance.

In [234]:
# Binary Classifier
pca = PCA(n_components=2) 
# fit PCA with 2 components
pca = pca.fit_transform(X_standardized)
plt.figure(figsize = (9,5))
# plot first 2 PCA components
plt.scatter(pca[:,0],pca[:,1], c = Y_diagnosis, cmap = "RdBu_r", edgecolor = "Red", alpha=0.35)
plt.colorbar()
plt.title('PCA Scatter Plot')
Out[234]:
Text(0.5, 1.0, 'PCA Scatter Plot')

Above, I created a scatter plot of observations of PCA component 1 and 2. Though, it looks like there is not a definite, clear cut or discriminative boundary between the 2 features(there is a bit of overlap). This could be attributed to PCA component 1 not explaining as much of the variance in the original data as we want it to.

In [142]:
from scipy import stats
# plot histograms showing the distribution of the each feature in the dataset
data.hist(figsize=(25,20), color='r')
plt.show()

As shown above, a good chunk of the distributions are skewed to the left, since they rise very sharply in the beginning for values near 0 and have a long right trail. This shows how scaling/normalization is essential, especially if we implement regressor algorithms like kNN or linear regression, in order to avoid overfitting and optimize model performance and generalizability. We could employ 2 types of normalization techniques, Log Scaling and the MinMaxScaler. Log scaling is particularly useful if you have a distribution skewed as above, where a handful of values have many points, while most other values at the tail end have fewer points. This type of distribution is typically referred to as a power law distribution. On the other hand, the MinMaxScaler usually scales the data to a fixed range, with a drawback being that we will end up with smaller deviations and it will overall suppress the effect of outliers since this type of scaler is sensitive to outliers.

Machine Learning and Statistical Analysis

In [79]:
#split the dataset into a training and holdout test set - 75% for training and 25% for validation/testing
X_train, X_test, y_train, y_test = train_test_split(data, Y_diagnosis, test_size=0.25, random_state=42)
In [126]:
# function for printing evaluation metrics given a grid with optimized paramters along with a confusion matrix
def print_evaluation_metrics(grid, cf_matrix):
    #returns a classification report comparing model performance with test data
    print(classification_report(y_test, y_pred))
    #print the best parameters from GridSearch and the best score the model attains using those tuned hyperparameters
    print("Tuned Hyperparameters: {}".format(grid.best_params_))
    print("Accuracy Rate: {}".format(grid.score(X_test, y_test)))
    print("Optimal Score: {}".format(grid.best_score_))
    group_names = ['True Neg','False Pos','False Neg','True Pos']
    #label the confusion matrix and standardize the TP/TN/FP/FN values
    group_counts = ['{0:0.0f}'.format(value) for value in
                    cf_matrix.flatten()]
    group_percentages = ['{0:.2%}'.format(value) for value in  
                         cf_matrix.flatten()/np.sum(cf_matrix)]
    labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in
              zip(group_names,group_counts,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    #Produce a heatmap to clearly visualize model evaluation metrics
    sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')

Random Forest and Recursive Feature Elimination

RFE is a wrapper-style selection algorithm that employs filter-based feature selection internally. It works by searching for a subset of features. In conjunction with the random forest classifier, it assigns weights to each features and recursively prunes out the subset of features with the smallest absolute weights until we achieve our optimal feature subset. Here, we will choose the top 10 features computed by RFE.

In [166]:
from sklearn.feature_selection import RFECV
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
#from sklearn.ensemble import RandomForestClassifier, DecisionTreeClassifier, GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn import tree
from sklearn.tree import export_graphviz

rf_model = RandomForestClassifier()

#perform an exhaustive search through these input parameters using GridSearch 
parameters = {'n_estimators': [4, 6, 9, 10, 15], 
              'max_features': ['log2', 'sqrt','auto'], 
              'criterion': ['entropy', 'gini'],
              'max_depth': [2, 3, 5, 10], 
              'min_samples_split': [2, 3, 5],
              'min_samples_leaf': [1, 5, 8]
             }

#fit and score the estimator on the training set with the tuned parameters
param_tuning = GridSearchCV(rf_model, parameters)
param_tuning = param_tuning.fit(X_train, y_train)
print("Best Parameters for Random Forest Classifier: {}".format(param_tuning.best_params_))

rf_model = param_tuning.best_estimator_
rf_model.fit(X_train, y_train)

cf_matrix = confusion_matrix(y_test, y_pred)
print_evaluation_metrics(grid_tree, cf_matrix)

#utilize the embedded methods of random forest to identify the most important features
# While training a tree-based classifier, it progressively identifies important features by averaging the impurity reduction caused by an individual feature
plt.figure(figsize=(15,12))
sorted_ids = rf_model.feature_importances_.argsort()
plt.barh(data.columns[sorted_ids], rf_model.feature_importances_[sorted_ids])
plt.xlabel("Random Forest Feature Importance Ranking")
plt.show()

#run the Recursive Feature Elimination Algorithm which identifies the top k features and exclusively trains on that subset
rfe = RFE(estimator = model, n_features_to_select=10)
rfe = rfe.fit(X_train, y_train)
#print(X_train.shape)
for i in range(X_train.shape[1]):
    if rfe.support_[i]:
        print('Column: %s, Selected %s, Rank: %.3f' % (X_train.columns[i], rfe.support_[i], rfe.ranking_[i]))


#feature_names = list(X_train.columns[rfe.support_])
#out = export_graphviz(dt, feature_names=feature_names, out_file=file, class_names=['Benign', 'Malignant'], filled=True)

#fig, ax = plt.subplots(figsize=(30,20))
#tree.plot_tree(dt)
C:\Users\ChrisPC\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.
  warnings.warn(CV_WARNING, FutureWarning)
C:\Users\ChrisPC\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py:841: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
Best Parameters for Random Forest Classifier: {'criterion': 'gini', 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 3, 'n_estimators': 9}
              precision    recall  f1-score   support

           0       0.98      0.98      0.98        89
           1       0.96      0.96      0.96        54

   micro avg       0.97      0.97      0.97       143
   macro avg       0.97      0.97      0.97       143
weighted avg       0.97      0.97      0.97       143

Tuned Hyperparameters: {'criterion': 'gini', 'max_depth': 6, 'min_samples_leaf': 2}
Accuracy Rate: 0.958041958041958
Optimal Score: 0.9460093896713615
Column: texture_mean, Selected True, Rank: 1.000
Column: concave points_mean, Selected True, Rank: 1.000
Column: radius_se, Selected True, Rank: 1.000
Column: area_se, Selected True, Rank: 1.000
Column: radius_worst, Selected True, Rank: 1.000
Column: texture_worst, Selected True, Rank: 1.000
Column: perimeter_worst, Selected True, Rank: 1.000
Column: area_worst, Selected True, Rank: 1.000
Column: smoothness_worst, Selected True, Rank: 1.000
Column: concave points_worst, Selected True, Rank: 1.000

As shown above, we have selected the top 10 optimal features. The Accuracy overall was about ~95%.

In [98]:
print('Best Features', list(X_train.columns[rfe.support_]))
Best Features ['texture_mean', 'concave points_mean', 'perimeter_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'concavity_worst', 'concave points_worst']
In [82]:
y_pred = rfe.predict(X_test)

print("Training Accuracy: ", rfe.score(X_train, y_train))
print("Testing Accuracy: ", rfe.score(X_test, y_test))
Training Accuracy:  0.9976525821596244
Testing Accuracy:  0.965034965034965

Decision Tree Classifier

In [157]:
# run Sklearn's DecisionTreeClassifier wrapper method
dt = DecisionTreeClassifier(random_state=40)
param_dict = {"criterion":['gini', 'entropy'],
             "max_depth":range(1,10),
             "min_samples_leaf":range(1,5)}
# run GridSearchCV to tune the criterion, max_depth, and min_samples_leaf hyperparameters
grid_tree = GridSearchCV(dt, param_grid=param_dict, cv=10, verbose=1, n_jobs=-1)
# evaluate the best combination of parameter values on the training set
grid_tree.fit(X_train, y_train)
# generate and print the evaluation metrics on the test data for this model
cf_matrix = confusion_matrix(y_test, y_pred)
print_evaluation_metrics(grid_tree, cf_matrix)
Fitting 10 folds for each of 72 candidates, totalling 720 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   11.2s
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:   12.7s finished
C:\Users\ChrisPC\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py:841: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
              precision    recall  f1-score   support

           0       0.98      0.98      0.98        89
           1       0.96      0.96      0.96        54

   micro avg       0.97      0.97      0.97       143
   macro avg       0.97      0.97      0.97       143
weighted avg       0.97      0.97      0.97       143

Tuned Hyperparameters: {'criterion': 'gini', 'max_depth': 6, 'min_samples_leaf': 2}
Accuracy Rate: 0.958041958041958
Optimal Score: 0.9460093896713615

Support Vector Classifier(SVC)

In [169]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report
#run a transformation pipeline with our SVC wrapper function
pipeline = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
parameters = {'svm__C':[1,10,100],
             'svm__gamma': [0.1, 0.01, 0.001],
             'svm__kernel': ['linear', 'rbf', 'poly']}
#run GridSearch with cross validation to find best combination of parameters yielding optimal classifier performance
grid_svc = GridSearchCV(pipeline, parameters, cv=4, refit=True, verbose=2)
#fit this optimized supervised learner on training data and evaluate on the holdout set
grid_svc.fit(X_train, y_train)
y_pred = grid_svc.predict(X_test)
#generate and print evaluation metrics for the model
cf_matrix = confusion_matrix(y_test, y_pred)
print_evaluation_metrics(grid_svc, cf_matrix)
precision recall f1-score support 0 0.98 0.98 0.98 89 1 0.96 0.96 0.96 54 micro avg 0.97 0.97 0.97 143 macro avg 0.97 0.97 0.97 143 weighted avg 0.97 0.97 0.97 143 Tuned Hyperparameters: {'svm__C': 1, 'svm__gamma': 0.1, 'svm__kernel': 'linear'} Accuracy Rate: 0.972027972027972 Optimal Score: 0.9741784037558685
In [141]:
from sklearn import svm
X = data.values[:, :2]
y = Y_diagnosis.values

h = 0.02 #step size in the mesh
C = 1.0  # SVM regularization parameter
svc = SVC(kernel='linear', C=C).fit(X, y)
rbf_svc = SVC(kernel='rbf', gamma=0.7, C=C).fit(X, y)
poly_svc = SVC(kernel='poly', degree=3, C=C).fit(X, y)
lin_svc = svm.LinearSVC(C=C).fit(X, y)

# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# title for the plots
titles = ['SVC with linear kernel',
          'LinearSVC (linear kernel)',
          'SVC with RBF kernel',
          'SVC with polynomial (degree 3) kernel']

plt.figure(figsize=(15,12))
for i, clf in enumerate((svc, lin_svc, rbf_svc, poly_svc)):
    #Plot the decision boundary. For that, we will assign a color to each
    #point in the mesh [x_min, x_max]x[y_min, y_max].
    plt.subplot(2, 2, i + 1)
    plt.subplots_adjust(wspace=0.4, hspace=0.4)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # ut the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)

    #Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
    plt.xlabel('radius_mean', size=15)
    plt.ylabel('texture_mean', size=15)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])

plt.show()
C:\Users\ChrisPC\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\ChrisPC\Anaconda3\lib\site-packages\sklearn\svm\base.py:931: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)

Above, I have plotted the hyperplanes from the SVC() algorithm with different kernels as a point of comparison. Through GridSearch algorithm, we have found out how the algorithm performs optimally with the linear kernel. Looking at the graph, we can see how the linear hyperplane, which maximizing the margin between the support vectors of the 2 different classes, is best able to discriminate between the 2 classes through its decision boundary.

KFold CrossValidation and Comparison of Classifier Performance

In [181]:
from sklearn import model_selection
# run the 3 classifiers through the optimal subset of features extracted by PCA as a point of comparison
svc_pca = grid_svc.fit(transformed_df, Y_diagnosis)
tree_pca = grid_tree.fit(transformed_df, Y_diagnosis)
rf_pca = rf_model.fit(transformed_df, Y_diagnosis)
#compare SVC, Decision Tree, and Random Forest Models with all the features vs top 6 PCA-extracted features
models_pca = [('svc', svc_pca), ('decision tree',tree_pca), ('random forest', rf_pca)]
models = [('svc', grid_svc), ('decision tree', grid_tree), ('random forest', rf_model)]
eval_metrics = []
names = []
for (name, model) in models:
    #use the split method of scikit's cross-validation generators to yield the train/test indices for each iteration of KFold cross validation
    kfold = model_selection.KFold(n_splits=15, random_state=6)
    #train the estimator using the training set and compute the scores based on the testing set on each iteration
    cv_res = model_selection.cross_val_score(model, data, Y_diagnosis, cv=kfold, scoring='accuracy')
    eval_metrics.append(cv_res)
    names.append(name)

eval_metrics_pca = []
names_pca = []
for (name, model) in models_pca:
    kfold = model_selection.KFold(n_splits=15, random_state=6)
    cv_res = model_selection.cross_val_score(model, data, Y_diagnosis, cv=kfold, scoring='accuracy')
    #compile the model evaluation results into a list
    eval_metrics_pca.append(cv_res)
    names_pca.append(name)
In [237]:
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
# plot boxplots showing the distribution in accuracies across the different iterations of KFold cross validation for each classifier
plt.title("Supervised Classifier Predictive Performance Comparison")
ax.set_xticklabels(['SVC', "Decision Tree", "Random Forest"])
plt.boxplot(eval_metrics)
plt.show()

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
#plot boxplots comparing the accuracies for each classifier trained on the PCA-reduced feature set
plt.title("Supervised Classifier Predictive Performance Comparison with PCA-Reduced Features")
ax.set_xticklabels(['SVC', "Decision Tree", "Random Forest"])
plt.boxplot(eval_metrics_pca)
plt.xlabel("SVC, Decision Tree, and Random Forest")
plt.ylabel("Accuracy")
plt.show()

Conclusions and Future Directions

In conclusion, our original purpose was to implement a robust classifier to conduct a 2-way classification between the breast cancer classes of malignant vs. benign. First, after some initial data cleaning, data analysis and EDA were a useful starting point in order to develop some fundamental intuition on patterns and correlative relationships among the independent variables and the independent variables vs. the target, and also to gain insight onto the distribution of the dataset and necessary preprocessing that must be employed before implementing a supervised classifier. By utilizing the insight gained from EDA, we were able to utilize the PCA dimensionality reduction technique to extract the top 10 relevant features that encode >90% of the dataset variance.

Moving towards the machine learning side of the project, we were able to successfully train 3 different supervised classifiers - Random Forest, Support Vector Classifier, and Decision Tree - tune its hyperparameters, evaluate performance metrics, and run the RFE algorithm on a fixed number of features to maximize the averaged accuracies. Overall, the SVC model performed optimally and reached an average classification accuracy of 97%, which was validated using multiple iterations of KFold Cross Validation. Training and testing the classifiers on the top 10 PCA components did not alter performance as much, except for the Random Forest Classifier. Though, it is important to keep in mind that our classifier could be subject to the problem of overfitting, especially due to the relatively constrained size of the dataset and number of features that we are working with. Increasing the training set sample size or trying different ensemble methods, such as bagging and boosting, can be useful to complete a robust, comparative analysis among even more supervised classifiers.

One possible way we could extend this research direction is by transitioning from the 2-class binary benign/malignant classification to a multi-classification model which, when given the same series of features from digital histopathological datasets, can discriminate between different subordinate classes of breast cancer(Ductal carcinoma, Firboadenoma, Lobular carcinoma, ect. Additionally, we could use novel deep learning techniques such as convolutional neural networks(CNN) or ensemble frameworks to represent and learn feature maps directly from the images, which is a considerably more challenging task but could open new doors to integrating such a robust, generalizable AI system in the clinical hemisphere for real-time diagnostic results. As the Google AI team has shown with their clinically-tested models for diseases like diabetic retinopathy and breast cancer, Data Science together with Machine Learning/Deep Learning are critical tools that healthcare increasingly needs for efficient, real-time data processing as well as gain clinically-relevant diagnostic information to guide fundamental decisions. If you are interested in learning more about this interesting area of AI+Healthcare, I encourage you to check out the Nature Journal - https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwig7KrWltHwAhWsFlkFHT-RDRwQFjAXegQIIxAD&url=https%3A%2F%2Fwww.nature.com%2Farticles%2Fs41591-018-0316-z&usg=AOvVaw1sv9RHIUuzFaBR1JqTwvEZ - which thoroughly elucidates the most recent advances in deep learning in Healthcare and Radiology. I also recommend you check out the AI for Medicine Specialization - https://www.deeplearning.ai/program/ai-for-medicine-specialization/ - to receive further guidance and training to learn novel Data Analytics and Deep Learning skills that todays healthcare sphere urgently needs. Thanks for visiting this page!

Resources

In [ ]: