Effects of PCA Dimensionality Reduction on Training a Handwritten Digit Classifier

Created by Sunny Lee ([email protected]) and AJ Liberatore ([email protected])

Introduction

The objective of this project is to examine the effects of dimensionality reduction on training neural networks, specifically how applying PCA before training an MLPClassifier to recognize handwritten digits can save both space and time.

Data

We used the MNIST handwritten digit data set. MNIST comes with 60,000 samples of 28x28 handwritten digit images for training and an additional 10,000 samples for testing. Here are 100 samples from the training set:

data screenshot
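A minimal sketch of the loading step, assuming the Keras `mnist` loader and matplotlib; the 10x10 grid mirrors the figure above, and the variable names are our own choices for illustration:

```python
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import mnist

# Load the MNIST training and test splits (28x28 grayscale images).
(X_train, y_train), (X_test, y_test) = mnist.load_data()
print(X_train.shape)  # (60000, 28, 28)
print(X_test.shape)   # (10000, 28, 28)

# Plot the first 100 training images in a 10x10 grid.
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
for ax, image in zip(axes.ravel(), X_train[:100]):
    ax.imshow(image, cmap="gray")
    ax.axis("off")
plt.show()
```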

Methods

Tools:

  • Keras to import the MNIST data set
  • Scikit-learn for the PCA dimensionality reduction and MLPClassifier training
  • GitHub for version control
  • Jupyter as IDE

Methods used within scikit-learn:

  • MLPClassifier model
  • PCA
  • StandardScaler

Results

In order to use this data, we must first format it as a two-dimensional array. To do this, we flatten each image into one long vector, leaving us with a training matrix of shape (60000, 784). To perform our dimensionality reduction, we scale our data using StandardScaler and use the built-in scikit-learn PCA function.

pca screenshot
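The preprocessing described above could look like the sketch below, assuming the arrays loaded in the earlier snippet; the exact pipeline in the notebook may differ:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Flatten each 28x28 image into a 784-element row vector.
X_train_flat = X_train.reshape(len(X_train), -1)  # (60000, 784)
X_test_flat = X_test.reshape(len(X_test), -1)     # (10000, 784)

# Standardize features, fitting the scaler on the training set only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_flat)
X_test_scaled = scaler.transform(X_test_flat)

# Fit PCA on the scaled training data and keep the first 100 components.
pca = PCA(n_components=100)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
```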

After fitting the PCA to our data, we reduce the dimensionality down to just 100 components. From the pca.explained_variance_ratio_ attribute, we see that the first 100 components capture quite a lot of the variance over the entire training set, and that the components are ordered from most variance explained to least:

variance ratio screenshot
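One way to inspect this, as a sketch building on the objects above, is to take the cumulative sum of `pca.explained_variance_ratio_`, since the components are sorted by explained variance:

```python
import numpy as np

# Cumulative fraction of total variance captured by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative[:10])  # variance captured by the first 10 components
print(cumulative[-1])   # total variance captured by all 100 components
```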

To see how good an approximation the first 100 components provide, we can inverse transform them to get an approximation of the original dataset:

inverse transform screenshot
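A sketch of that reconstruction, using `inverse_transform` on both the PCA and the scaler from the earlier snippets so the approximations can be compared side by side with the originals (only the first ten digits are reconstructed here to keep it light):

```python
import matplotlib.pyplot as plt

# Map the first 10 reduced samples back to 784 features, then undo the scaling.
X_train_approx = scaler.inverse_transform(pca.inverse_transform(X_train_pca[:10]))

fig, axes = plt.subplots(2, 10, figsize=(10, 2.5))
for i in range(10):
    axes[0, i].imshow(X_train_flat[i].reshape(28, 28), cmap="gray")    # original
    axes[1, i].imshow(X_train_approx[i].reshape(28, 28), cmap="gray")  # 100-component approximation
    axes[0, i].axis("off")
    axes[1, i].axis("off")
plt.show()
```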

So we see that the first 100 components capture the original dataset quite well. We will also see that training an MLPClassifier on these components yields nearly identical results to training it on the original dataset:

time screenshot
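The timing comparison could be reproduced with a sketch like the one below; the MLPClassifier hyperparameters here are scikit-learn defaults rather than necessarily those used in our notebook, so exact accuracies and times will vary:

```python
import time
from sklearn.neural_network import MLPClassifier

datasets = {
    "PCA (100 components)": (X_train_pca, X_test_pca),
    "Original (784 features)": (X_train_scaled, X_test_scaled),
}

# Train the same model on the reduced and the full data, timing each fit.
for name, (Xtr, Xte) in datasets.items():
    clf = MLPClassifier(random_state=0)
    start = time.time()
    clf.fit(Xtr, y_train)
    elapsed = time.time() - start
    print(f"{name}: accuracy={clf.score(Xte, y_test):.4f}, time={elapsed:.1f}s")
```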

The reduced train and test sets achieved an accuracy of 95.95% in about 18 seconds, while the original train and test sets achieved an accuracy of 96.15% in about 64 seconds. We can also see how quickly the accuracy increases with the number of components:

acc vs comp
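The accuracy-versus-components curve can be sketched by refitting PCA and the classifier for a handful of component counts; the particular counts below are illustrative choices, not the ones used to produce the plot above:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

component_counts = [10, 25, 50, 100, 200, 400, 784]
accuracies = []

for k in component_counts:
    # Reduce to k components, then train and score an MLP on the reduction.
    pca_k = PCA(n_components=k)
    Xtr_k = pca_k.fit_transform(X_train_scaled)
    Xte_k = pca_k.transform(X_test_scaled)
    clf = MLPClassifier(random_state=0)
    clf.fit(Xtr_k, y_train)
    accuracies.append(clf.score(Xte_k, y_test))

plt.plot(component_counts, accuracies, marker="o")
plt.xlabel("Number of PCA components")
plt.ylabel("Test accuracy")
plt.show()
```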

As can be seen, we reach an accuracy of more than 80% using fewer than 50 components, which is not far from the accuracy achieved when all 784 components are used.

Discussion

Dimensionality reduction is a great way to save time and resources if done correctly. Just because there is a large volume of data does not mean it is all useful or relevant to your objective. Here we show that, using only about 1/8 of the original dimensions (100 of 784 components), we can achieve an accuracy nearly identical to that of a model trained on the full data set while taking a fraction of the time to train.

We are interested in how these observations generalize to other problems in computer vision. PCA can be applied to any data that can be arranged as a matrix, so it is extremely versatile in the kinds of data it can handle. More advanced topics might include Robust Principal Component Analysis, which can be used when observations are heavily corrupted. We could also look into convolutional neural networks to drastically increase the performance of our model.

Summary

This project shows how PCA can be used to reduce the time and space needed to train neural networks. We achieved almost the same accuracy with 100 components as we did with our original data set, in a fraction of the time.

References

[1] MNIST dataset with Keras

[2] PCA from Scikit-Learn

[3] MLPClassifier from Scikit-Learn
