Breast Cancer Classification: Support Vector Machines

5 min readNov 16, 2021

Breast Cancer is the most common cancer among women world-wide, affecting 2.3 million women in 2020 and 685 000 deaths globally.

Detecting breast cancer early detection significantly increases the chances of survival.

A key challenge in cancer detection is classifying tumors into malignant or benign.

Research dictates that most experienced physicians can diagnose cancer with 79% accuracy. While 91% (up to 97%) accurate diagnosis is achieved using Machine Learning techniques.

In this case study our task is to classify tumors into malignant or benign tumors using features obtained by several cell images.

Here we see the Cancer Diagnosis Procedure, whereby cells are extracted using Fine Needle Aspiration. We are then presented with our data set of images/extractions.

These extractions are then fed into features such as Radius, Texture, Perimeter, Area, Smoothness, etc. (note for this case study we will use a total of 30 features to train our model).

These features are then fed into our Machine Learning Model, our model is fitted with the Support Vector Machine Algorithm (SVM).

A SVM utilizes our features and depicts a maximum margin hyperplane using support vectors in order to classify whether our extraction is malignant or benign.

Below is an example graph of how a SVM would classify data.

We will no begin to train our model using a Jupyter Notebook on Google Colab.

Importing our data set.

From our visualization we can detect that (as per our image examples above) that benign cases our more tightly coupled in that they have a smaller mean area, radius, and perimeter. While our malignant cases are larger and have more texture.

We will look at how many samples for each case we have in our data set, in order to get a picture of the ratio between our malignant and benign cases.

The below heat map explores the correlation between our features.

As we can see any value close to 1 has a very high correlation value.

Now we will begin training our model by splitting our data set into training and testing data.

Splitting our data set into training and testing data.

Fitting our model with the training data sets.

Using the SkLearn libraries SVM we fit our model with the above training data sets.

Evaluating our model using a confusion matrix.

Below we have a confusion matrix fitted with our test data sets, in order to evaluate its predictions.

Our confusion matrix depicts:

Upper-Left — number of our correctly classified points.

Upper-Right — the amount of incorrectly classified points, (type — 1 error) where our model would predict a malignant tumor when in fact it is benign.

Lower-Left — where we have 0 incorrectly classified predictions for cases, where our model would predict the tumors as benign when they were in fact malignant.

Lower-Right — overall (summation) amount of correct classifications our model has predicted.

Normalizing our data set to improve our models predictions.

We will now improve our models predictions through data normalization.

As we can see our mean smoothness is very small (0–0.16) and our mean area is extremely large in comparison (0–2500) this can have adverse effects on our models predictions.

Now that we have normalized our data using Unity-based normalization which as seen brings all our values into range [0,1], we will train our model with our normalized data sets and evaluate our model using a confusion matrix again.

Reevaluating our model.

As we can see our model has improved slightly with less incorrectly classified predictions after normalization.

Below we see the classification report of our model.

Our model has an overall accuracy of 96%.

Conclusion

As we have seen we can use Support Vector Machines in order to predict whether a tumor is malignant or benign with 96% accuracy.
This technique is able to evaluate breast masses and classify them rapidly using automation.
Early breast cancer detection can save lives, and offer physicians a second opinion on detection.
We can improve our Machine Learning technique by using Deep Learning/Computer Vision techniques in order to derive their own features for evaluation as we have inputted our own features with this technique.

View the Notebook on Github