Cracking 99%: Insights from Training a Neural Network on MNIST

Deon | Nov. 4, 2023, 7:12 a.m.

The MNIST dataset has become synonymous with the starting point for anyone delving into the world of machine learning and neural networks. Often referred to as the "Hello World" of deep learning, it serves as a benchmark for testing and comparing various algorithms and techniques. With its 70,000 images of handwritten digits, MNIST provides a perfect playground for budding data scientists and seasoned professionals alike to hone their skills. When I first embarked on my journey to train a model on MNIST, I aimed for the elusive 99% accuracy—a milestone that signifies a deep understanding of both the data and the model's architecture. In this blog post, I'll walk you through the steps I took to achieve this goal, sharing insights, challenges, and the strategies that led to this impressive result. Whether you're new to machine learning or looking to refine your techniques, I hope my experience will offer valuable guidance and inspiration for your own projects.

MNIST Data Structure:

The MNIST dataset consists of 70,000 images of handwritten digits, each structured as a 28x28 pixel grid. Each pixel in the grid represents a grayscale value, creating a total of 784 features per image. Because we are not using a Convolutional Neural Network (CNN) for this task, we must flatten each 28x28 image into a one-dimensional array of 784 inputs. This transformation allows us to feed the data into a standard fully connected neural network. The output of our model is designed to classify the digit in the image, which ranges from 0 to 9. Consequently, our output size is set to 10, corresponding to the 10 possible digit classes. This parameter is crucial as it directly influences the architecture of the final layer of our neural network. While some aspects of our model can be tuned as hyperparameters, such as the number of hidden layers and neurons, the output size of 10 is fixed based on the nature of the classification task.

Flattening The Data:

In the context of the MNIST dataset, flattening refers to the process of converting the two-dimensional array representing an image into a one-dimensional array. This is necessary because many machine learning algorithms, such as traditional feedforward neural networks, require a one-dimensional input. Each image in the MNIST dataset is a 28x28 pixel array; flattening converts it into a single vector of 784 values so that the network can treat each pixel as one input feature. Flattening changes only the shape of the data, not its content, though it does discard the explicit spatial arrangement of the pixels. When using convolutional neural networks (CNNs), flattening the input is not required, because CNNs are designed to work directly with two-dimensional arrays and can therefore exploit the spatial structure of the image.
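As a concrete illustration (a minimal NumPy sketch of my own, not the post's actual preprocessing code), flattening is just a reshape from a 28x28 grid to a 784-element vector, and the 10-class target is commonly one-hot encoded:

```python
import numpy as np

# A stand-in for one MNIST image: a 28x28 grid of grayscale values in [0, 255].
image = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)

# Flatten the 2D grid into a 1D vector of 784 input features.
flat = image.reshape(-1)
print(flat.shape)  # (784,)

# The classification target is one of 10 digit classes (0-9),
# often represented as a one-hot vector of length 10.
label = 7
one_hot = np.zeros(10)
one_hot[label] = 1.0
print(one_hot)
```

Note that the reshape preserves every pixel value; only the arrangement changes, which is why a fully connected network can still learn from the flattened input.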

My Model:

My configuration of activation functions and key hyperparameters for the TensorFlow/Keras neural network trained on MNIST was as follows. A batch size of 500 was chosen to strike a balance between epoch computation time and the precision of weight updates; smaller batch sizes typically lead to longer training durations. The number of epochs was increased from 5 to 10, which proved optimal for achieving 99% accuracy with the selected activation functions. The hidden layer size, the width of each hidden layer, was set to 100 neurons, while the number of hidden layers determines the network's depth. Wider hidden layers enhance the model's capacity to discern complex patterns in the data, but can also elevate the risk of overfitting if not properly regularized. The model incorporates six hidden layers, employing two different activation functions: the first five layers use the rectified linear unit (ReLU), while the sixth uses the Leaky ReLU activation function. The choice of Leaky ReLU for that final hidden layer was deliberate, to mitigate the risk of encountering the vanishing gradient problem, since Leaky ReLU still passes a small gradient for negative inputs.
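The original code snippet did not survive here, so the following is a hedged reconstruction in Keras of the architecture as described (six hidden layers of 100 units, ReLU on the first five, Leaky ReLU on the sixth, a 10-unit softmax output); the optimizer, loss, and Leaky ReLU slope are my assumptions, not details stated in the post:

```python
import tensorflow as tf

HIDDEN_SIZE = 100   # width of each hidden layer
OUTPUT_SIZE = 10    # one unit per digit class 0-9

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),  # flattened 28x28 image
    # First five hidden layers use ReLU.
    *[tf.keras.layers.Dense(HIDDEN_SIZE, activation='relu') for _ in range(5)],
    # Sixth hidden layer uses Leaky ReLU to keep a small gradient
    # flowing for negative pre-activations.
    tf.keras.layers.Dense(HIDDEN_SIZE),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.Dense(OUTPUT_SIZE, activation='softmax'),
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Training with the hyperparameters described above (data loading omitted):
# model.fit(x_train, y_train, batch_size=500, epochs=10,
#           validation_data=(x_val, y_val))
```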

Results of Model Training:

Across ten epochs, the neural network achieved a final accuracy of 0.9888 and a validation loss of 0.0581. The slight increase in validation loss at the last epoch may indicate the beginning of overfitting, where the model starts to memorize the training data instead of generalizing well to unseen data. This suggests that further training may lead to diminishing returns or even a decrease in performance on new data. Regularization techniques, such as dropout or weight decay, could be employed to combat overfitting and improve the model's generalization ability. Additionally, monitoring the validation loss closely in future training sessions can help determine the optimal number of epochs and prevent overfitting.
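A sketch of how those mitigations look in Keras (this is illustrative, not the configuration the post actually used; the dropout rate and patience values are assumptions): dropout layers between the hidden layers discourage memorization, and an early-stopping callback halts training once validation loss stops improving.

```python
import tensorflow as tf

# Dropout randomly zeroes a fraction of activations each training step,
# discouraging the network from memorizing the training set.
# The 0.2 rate here is illustrative.
regularized = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Early stopping halts training once val_loss has not improved for
# `patience` consecutive epochs, restoring the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=2, restore_best_weights=True)

# regularized.fit(x_train, y_train, batch_size=500, epochs=10,
#                 validation_data=(x_val, y_val), callbacks=[early_stop])
```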

Training Analysis & Conclusion:

Increasing the batch size from 1 to 500 decreased the computation time per epoch by over 50%. Larger batch sizes enhance training efficiency, though there is a trade-off between batch size and the precision of weight updates. Additionally, increasing the number of epochs helps the model reduce both training and validation loss: each additional epoch gives backpropagation another pass over the data, letting gradient descent use the partial derivatives of the loss with respect to each weight to fine-tune the network and further minimize error. Through this process, the model becomes more accurate and better at generalizing from the training data to unseen data.