Breast Cancer Classification: Support Vector Machines

Sachin Lendis
5 min read · Nov 16, 2021

Breast cancer is the most common cancer among women worldwide. In 2020 it affected 2.3 million women and caused 685,000 deaths globally.

Detecting breast cancer early significantly increases the chances of survival.

A key challenge in cancer detection is classifying tumors as malignant or benign.

Research suggests that experienced physicians can diagnose cancer with 79% accuracy, while machine learning techniques achieve 91% (up to 97%) diagnostic accuracy.

In this case study, our task is to classify tumors as malignant or benign using features obtained from several cell images.

Here we see the Cancer Diagnosis Procedure, whereby cells are extracted using Fine Needle Aspiration. We are then presented with our data set of images/extractions.

From these extractions, features such as Radius, Texture, Perimeter, Area, Smoothness, etc. are computed (note that for this case study we will use a total of 30 features to train our model).

These features are then fed into our machine learning model, which is fitted with the Support Vector Machine (SVM) algorithm.

An SVM uses our features to find a maximum-margin hyperplane, defined by its support vectors, that separates the two classes. The side of the hyperplane an extraction falls on determines whether it is classified as malignant or benign.

Below is an example graph of how an SVM would classify data.
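The same idea can be sketched in code on a toy two-dimensional data set. The six points below are made up for illustration and are not from the article's data; the linear kernel and the large `C` value are assumptions chosen so that the fitted boundary approximates a hard maximum-margin hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters in 2D (made-up points).
X = np.array([[1, 1], [1, 2], [2, 1], [4, 4], [4, 5], [5, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel with a large C approximates a hard-margin SVM.
clf = SVC(kernel="linear", C=1e3)
clf.fit(X, y)

# The support vectors are the training points closest to the boundary.
print("Support vectors:\n", clf.support_vectors_)

# For a linear SVM the margin width is 2 / ||w||.
w = clf.coef_[0]
print("Margin width:", 2 / np.linalg.norm(w))

# New points are classified by the side of the hyperplane they fall on.
predictions = clf.predict([[1.5, 1.5], [4.5, 4.5]])
print("Predictions:", predictions)
```

Only the support vectors determine where the boundary sits; moving any of the other points (without crossing the margin) would leave the classifier unchanged.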

We will now begin to train our model using a Jupyter Notebook on Google Colab.

Importing our data set.
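A minimal sketch of this step, assuming the data is the Wisconsin breast cancer data set that ships with scikit-learn (it has 30 features and uses the same 0 = malignant, 1 = benign encoding). The notebook may load its data differently, and the variable names `cancer` and `df` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the data set bundled with scikit-learn (assumed to match the article's data).
cancer = load_breast_cancer()

# Put the 30 features in a DataFrame and append the label column.
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df["target"] = cancer.target  # 0 = malignant, 1 = benign

print(df.shape)
print(df.head())
```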

0 — malignant, 1 — benign

From our visualization we can see (as in the image examples above) that benign cases are more tightly clustered, with a smaller mean area, radius, and perimeter, while malignant cases are larger and have more texture.

We will look at how many samples for each case we have in our data set, in order to get a picture of the ratio between our malignant and benign cases.

0 — malignant, 1 — benign
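The class counts can also be computed directly rather than read off a plot. This is a sketch that again assumes scikit-learn's bundled breast cancer data set stands in for the article's data.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
target = pd.Series(cancer.target, name="target")

# Count the samples per class, then relabel 0/1 with the class names.
counts = target.value_counts().sort_index()
counts.index = [cancer.target_names[i] for i in counts.index]

print(counts)
print((counts / counts.sum()).round(3))  # class proportions
```

A noticeable imbalance between the classes is worth keeping in mind when reading accuracy figures later on.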

The below heat map explores the correlation between our features.

As we can see, pairs of features whose value is close to 1 are very strongly correlated.
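The numbers behind such a heat map are a correlation matrix. The sketch below computes it with pandas on scikit-learn's bundled breast cancer data set (an assumption about the data source) and prints a few entries instead of plotting.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# Pearson correlation between every pair of the 30 features.
corr = df.corr()

# Radius, perimeter, and area all measure size, so they move together;
# smoothness is included for contrast.
print(corr.loc["mean radius", ["mean perimeter", "mean area", "mean smoothness"]])
```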

Now we will begin training our model by splitting our data set into training and testing data.

Splitting our data set into training and testing data.

X_train: training features
X_test: testing features
y_train: training labels
y_test: testing labels
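A sketch of the split using scikit-learn's `train_test_split`. The 80/20 ratio and the `random_state` value are assumptions for illustration; the notebook's actual settings are not shown in the text.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Features (X) and labels (y); 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the samples for testing (assumed ratio and seed).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5
)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```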

Fitting our model with the training data sets.

Using scikit-learn's SVM implementation, we fit our model to the training data sets above.
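Fitting could look roughly like the following. `SVC` with its default parameters is an assumption, as are the split settings; the notebook may use a different kernel or hyperparameters.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5  # assumed split settings
)

# Fit a support vector classifier with default settings (an assumption).
model = SVC()
model.fit(X_train, y_train)

# Predict labels for the held-out test set.
y_pred = model.predict(X_test)
print("Test accuracy:", model.score(X_test, y_test))
```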

Evaluating our model using a confusion matrix.

Below we have a confusion matrix computed from our model's predictions on the test data sets, which lets us evaluate those predictions.

Our confusion matrix depicts:

Upper-Left: the number of correctly classified points for one of the two classes.

Upper-Right: the number of incorrectly classified points (Type I errors, or false positives), where our model predicts a malignant tumor when it is in fact benign.

Lower-Left: the number of incorrectly classified points (Type II errors, or false negatives), where our model predicts a benign tumor when it is in fact malignant. Here we have 0 such predictions.

Lower-Right: the number of correctly classified points for the other class. Together with the upper-left cell, it gives the total number of correct classifications.
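The four cells can be computed with scikit-learn's `confusion_matrix`, sketched below under the same assumed data set, split, and default `SVC` as before. Note that `confusion_matrix` puts true labels on the rows and predicted labels on the columns, ordered by label value, so its cell positions may not match the layout of the figure described above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5  # assumed split settings
)
y_pred = SVC().fit(X_train, y_train).predict(X_test)

# Rows are true labels, columns are predicted labels (0 = malignant, 1 = benign).
cm = confusion_matrix(y_test, y_pred)
print(cm)

print("Malignant, predicted malignant:", cm[0, 0])
print("Malignant, predicted benign:   ", cm[0, 1])
print("Benign, predicted malignant:   ", cm[1, 0])
print("Benign, predicted benign:      ", cm[1, 1])

# The diagonal holds the correct predictions for each class.
print("Total correct:", cm.trace(), "of", cm.sum())
```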

Normalizing our data set to improve our model's predictions.

We will now improve our model's predictions through data normalization.

Before normalization

As we can see, our mean smoothness is very small (0–0.16), while our mean area is extremely large in comparison (0–2500). This difference in scale can have adverse effects on our model's predictions.

After normalization

We have now normalized our data using unity-based (min-max) normalization, which, as seen above, brings all our values into the range [0, 1]. Next, we will train our model on the normalized data sets and evaluate it with a confusion matrix again.
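Unity-based normalization rescales each feature as (x - min) / (max - min). The sketch below is one way to apply it, assuming the same data set, split, and default `SVC` as in the earlier sketches. It computes the minimum and range on the training data only and reuses them for the test data, which is a design choice made here and not necessarily what the notebook does.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5  # assumed split settings
)

# Per-feature minimum and range, computed on the training data only.
train_min = X_train.min(axis=0)
train_range = X_train.max(axis=0) - train_min

# Apply the same transformation to both sets.
X_train_scaled = (X_train - train_min) / train_range
X_test_scaled = (X_test - train_min) / train_range

print("Training range:", X_train_scaled.min(), "to", X_train_scaled.max())

# Refit the model on the normalized features.
model = SVC()
model.fit(X_train_scaled, y_train)
print("Test accuracy:", model.score(X_test_scaled, y_test))
```

Because the test set is scaled with the training statistics, a few test values can fall slightly outside [0, 1]. That is expected and avoids leaking information from the test set into training.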

Reevaluating our model.

As we can see, our model has improved slightly, with fewer incorrectly classified predictions after normalization.

Below we see the classification report of our model.

Our model has an overall accuracy of 96%.
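A report of this kind can be produced with scikit-learn's `classification_report`. The sketch below reuses the assumed data set, split, min-max scaling, and default `SVC` from the earlier sketches, so its numbers will not necessarily match the 96% reported above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=5  # assumed settings
)

# Min-max scale using statistics from the training data.
train_min = X_train.min(axis=0)
train_range = X_train.max(axis=0) - train_min
X_train_scaled = (X_train - train_min) / train_range
X_test_scaled = (X_test - train_min) / train_range

y_pred = SVC().fit(X_train_scaled, y_train).predict(X_test_scaled)

# Precision, recall, and F1 score per class.
report = classification_report(y_test, y_pred, target_names=cancer.target_names)
print(report)
```

For a screening task, recall on the malignant class is the figure to watch most closely, since a missed malignant tumor is the costlier error.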

Conclusion

  • As we have seen, we can use Support Vector Machines to predict whether a tumor is malignant or benign with 96% accuracy.
  • This technique can evaluate breast masses and classify them rapidly and automatically.
  • Early breast cancer detection can save lives, and this technique can offer physicians a second opinion.
  • We could improve on this approach with Deep Learning/Computer Vision techniques, which derive their own features from the images, whereas here we supplied the features ourselves.

View the Notebook on Github
