Dive into the world of Convolutional Neural Networks (CNNs) by building your own image classifier using PyTorch. This blog explores the steps involved, referencing the tiesen243/cnn repository on GitHub.
Tags: CNN, PyTorch, Image Classification, Deep Learning
Introduction
Convolutional Neural Networks (CNNs) are a powerful type of deep learning architecture that excel at image recognition and classification tasks. In this blog, we'll embark on a hands-on journey to build a CNN model using PyTorch, a popular deep learning framework.
We'll be referencing the tiesen243/cnn repository on GitHub as a guide. This repository provides a basic framework for building a CNN model, allowing us to understand the core concepts without getting bogged down in complex details.
Getting Started
Prerequisites
Before we dive into building our image classifier, make sure you have the following prerequisites installed on your system:
Anaconda or Miniconda
Nvidia GPU (optional but recommended for faster training) with CUDA support. You can check if your GPU has CUDA support by running nvidia-smi in the terminal.
Setting Up the Environment
To get started, we'll create a new conda environment with packages from the environment.yml file provided in the repository. Run the following commands in your terminal:
```bash
conda env create -f environment.yml
conda activate ml
```
This will set up a new conda environment named ml with all the necessary packages for our project.
Dataset Preparation
For this project, we'll be using the MNIST dataset, a popular dataset of handwritten digits available through PyTorch's torchvision package. The dataset consists of 60,000 training images and 10,000 test images.
We can load the MNIST dataset using PyTorch's torchvision library and create data loaders for training and testing as follows:
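The repository's loader code isn't reproduced here, so the following is a minimal sketch using the standard torchvision API, assuming a batch size of 64 and the same (0.5, 0.5) normalization the web UI uses later in this post:

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms
from tqdm import tqdm

# Use the GPU if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Convert images to tensors and normalize them
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])

train_data = datasets.MNIST(root="data", train=True, download=True, transform=transform)
test_data = datasets.MNIST(root="data", train=False, download=True, transform=transform)

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=64, shuffle=False)
```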
Now that we have our dataset ready, let's build our CNN model. We'll define a simple CNN architecture with two convolutional layers followed by two fully connected layers. Here's the code snippet for building the model:
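Here are the constructor and helper methods (the complete class appears under "Putting It All Together" below); the summary helper is assumed to come from the torchsummary package:

```python
from torchsummary import summary  # assumed source of the summary helper

class CNN(nn.Module):
    def __init__(self, layers: list[nn.Module]):
        super().__init__()
        self.history = []
        self.layers = nn.ModuleList(layers)
        self.to(device)

    def add(self, layer: nn.Module):
        self.layers.append(layer)

    def summary(self, input_shape, batch_size):
        summary(self, input_shape, batch_size)
```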
In this __init__ method, we define the layers of our CNN model and move the model to the GPU if CUDA is available.
history stores the training history: a (train loss, validation loss, validation accuracy) tuple per epoch.
layers is an nn.ModuleList holding the model's layers, initialized from the list passed to the constructor.
The add method lets us append layers to the model dynamically.
Finally, the summary method provides a summary of the model architecture. For example, we can call model.summary((1, 28, 28), 64) to get a summary of the model with input shape (1, 28, 28) and batch size 64.
```python
def forward(self, x):
    for layer in self.layers:
        x = layer(x)
    return x
```
In this method, we pass the input x through each layer of the model sequentially to get the final output. This is the core functionality of our CNN model.
Define the configuration for the CNN model:
```python
class CNN(nn.Module):
    def config(self, loss: nn.Module, optimizer: optim.Optimizer):
        if not isinstance(loss, nn.Module):
            raise TypeError("loss must be a torch.nn.Module instance")
        if not isinstance(optimizer, optim.Optimizer):
            raise TypeError("optimizer must be a torch.optim.Optimizer instance")
        self.criterion = loss
        self.optimizer = optimizer
```
This method allows us to configure the loss function and optimizer for our model. We can set the loss function and optimizer using this method before training the model.
loss is the loss function used for training the model (e.g., CrossEntropyLoss, MSELoss). For classification tasks, we typically use nn.CrossEntropyLoss.
optimizer is the optimization algorithm used to update the model parameters during training (e.g., SGD, Adam). For example: model.config(nn.CrossEntropyLoss(), optim.Adam(model.parameters(), lr=0.001)).
Define the training loop:
```python
class CNN(nn.Module):
    def fit(self, train_loader: DataLoader, epochs: int = 10, verbose: bool = True):
        for epoch in range(epochs):
            self.train()

            # Split the train set into train and validation set
            train_set, val_set = random_split(train_loader.dataset, [50000, 10000])
            train_set = DataLoader(train_set, batch_size=64, shuffle=True)
            val_set = DataLoader(val_set, batch_size=64, shuffle=True)

            loss_list = []
            for images, labels in tqdm(train_set, desc=f"Epoch {epoch+1}/{epochs}"):
                images, labels = images.to(device), labels.to(device)

                self.optimizer.zero_grad()
                outputs = self(images)
                loss = self.criterion(outputs, labels)
                loss_list.append(loss.item())
                loss.backward()
                self.optimizer.step()

            with torch.no_grad():
                self.eval()
                total = 0
                accuracy = 0
                val_loss = []
                for images, labels in val_set:
                    images, labels = images.to(device), labels.to(device)
                    outputs = self(images)

                    total += labels.size(0)
                    predicted = torch.argmax(outputs, dim=1)
                    accuracy += (predicted == labels).sum().item()
                    val_loss.append(self.criterion(outputs, labels).item())

            # Calculate the mean loss and accuracy
            mean_val_loss = sum(val_loss) / len(val_loss)
            mean_val_acc = 100 * (accuracy / total)
            loss = sum(loss_list) / len(loss_list)
            self.history.append((loss, mean_val_loss, mean_val_acc))

            if verbose:
                print(
                    f"Loss: {loss:.4f}, Val Loss: {mean_val_loss:.4f}, Val Accuracy: {mean_val_acc:.2f}%"
                )

        return self.history
```
This method trains the model on the training set for a specified number of epochs. It also calculates the validation loss and accuracy after each epoch.
In each epoch, we iterate over the training set, compute the loss, backpropagate the gradients, and update the model parameters using the optimizer. We also evaluate the model on the validation set to monitor its performance.
First, we set the model to training mode using self.train(). Then, we split the training set into a new training set and a validation set using random_split. We create new data loaders for the training and validation sets.
Next, we iterate over the training set using tqdm to display a progress bar. We move the images and labels to the GPU if available. We zero out the gradients using self.optimizer.zero_grad().
We pass the images through the model to get the outputs and calculate the loss using the specified loss function. We append the loss to the loss_list for tracking.
We backpropagate the loss and update the model parameters using self.optimizer.step().
After each epoch's training pass, we evaluate the model on the validation set, calculating the validation loss and accuracy by comparing the predicted labels with the ground truth labels.
Finally, we calculate the mean loss and accuracy for the epoch and store them in the self.history list. We print the loss, validation loss, and validation accuracy if verbose is set to True.
The method returns the training history, which contains the training loss, validation loss, and validation accuracy for each epoch, e.g., [(1.6583, 1.4880, 97.40), (1.4849, 1.4832, 97.85), ...].
Define the prediction method:
```python
class CNN(nn.Module):
    def predict(self, x):
        predicted = []
        with torch.no_grad():
            self.eval()
            for images, _ in x:
                images = images.to(device)
                outputs = self(images)
                predicted.append(torch.argmax(outputs, 1))
        return predicted
```
This method makes predictions with the trained model: it iterates over a data loader, passes each batch of images through the model, and returns a list of predicted-label tensors, one per batch.
Putting It All Together:
```python
class CNN(nn.Module):
    def __init__(self, layers: list[nn.Module]):
        super().__init__()
        self.history = []
        self.layers = nn.ModuleList(layers)
        self.to(device)

    def add(self, layer: nn.Module):
        self.layers.append(layer)

    def forward(self, x: torch.Tensor):
        for layer in self.layers:
            x = layer(x)
        return x

    def summary(self, input_shape, batch_size):
        summary(self, input_shape, batch_size)

    def config(self, loss: nn.Module, optimizer: optim.Optimizer):
        if not isinstance(loss, nn.Module):
            raise TypeError("loss must be a torch.nn.Module instance")
        if not isinstance(optimizer, optim.Optimizer):
            raise TypeError("optimizer must be a torch.optim.Optimizer instance")
        self.criterion = loss
        self.optimizer = optimizer

    def fit(self, train_loader: DataLoader, epochs: int = 10, verbose: bool = True):
        for epoch in range(epochs):
            self.train()

            # Split the train set into train and validation set
            train_set, val_set = random_split(train_loader.dataset, [50000, 10000])
            train_set = DataLoader(train_set, batch_size=64, shuffle=True)
            val_set = DataLoader(val_set, batch_size=64, shuffle=True)

            loss_list = []
            for images, labels in tqdm(train_set, desc=f"Epoch {epoch+1}/{epochs}"):
                images, labels = images.to(device), labels.to(device)

                self.optimizer.zero_grad()
                outputs = self(images)
                loss = self.criterion(outputs, labels)
                loss_list.append(loss.item())
                loss.backward()
                self.optimizer.step()

            with torch.no_grad():
                self.eval()
                total = 0
                accuracy = 0
                val_loss = []
                for images, labels in val_set:
                    images, labels = images.to(device), labels.to(device)
                    outputs = self(images)

                    total += labels.size(0)
                    predicted = torch.argmax(outputs, dim=1)
                    accuracy += (predicted == labels).sum().item()
                    val_loss.append(self.criterion(outputs, labels).item())

            # Calculate the mean loss and accuracy
            mean_val_loss = sum(val_loss) / len(val_loss)
            mean_val_acc = 100 * (accuracy / total)
            loss = sum(loss_list) / len(loss_list)
            self.history.append((loss, mean_val_loss, mean_val_acc))

            if verbose:
                print(
                    f"Loss: {loss:.4f}, Val Loss: {mean_val_loss:.4f}, Val Accuracy: {mean_val_acc:.2f}%"
                )

        return self.history

    def predict(self, x):
        predicted = []
        with torch.no_grad():
            self.eval()
            for images, _ in x:
                images = images.to(device)
                outputs = self(images)
                predicted.append(torch.argmax(outputs, 1))
        return predicted
```
Training the Model
Now that we have defined our CNN model, we can train it on the MNIST dataset. Here's how you can train the model:
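The repository's exact snippet isn't reproduced here, so the following is a sketch matching the layer-by-layer description below; the Flatten layer and the ReLU between the two fully connected layers are my additions to make the shapes line up:

```python
layers = [
    nn.Conv2d(1, 32, kernel_size=3, padding=1),   # first convolutional layer
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # second convolutional layer
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),        # 28x28 -> 14x14
    nn.Flatten(),                                 # assumed: flatten for the fully connected layers
    nn.Linear(64 * 14 * 14, 128),                 # first fully connected layer
    nn.ReLU(),                                    # assumed activation between the FC layers
    nn.Linear(128, 10),                           # second fully connected layer (10 classes)
    nn.Softmax(dim=1),                            # class probabilities
]
model = CNN(layers)
model.summary((1, 28, 28), 64)
```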
This code snippet defines the layers for the CNN model. We use two convolutional layers followed by ReLU activation functions, max-pooling, and fully connected layers. The model architecture is summarized using the summary method.
The first convolutional layer has 1 input channel, 32 output channels, a kernel size of 3, and padding of 1.
The second convolutional layer has 32 input channels, 64 output channels, a kernel size of 3, and padding of 1.
The max-pooling layer has a kernel size of 2 and a stride of 2.
The first fully connected layer has 64 × 14 × 14 input features and 128 output features.
The second fully connected layer has 128 input features and 10 output features (corresponding to the 10 classes in the MNIST dataset).
The softmax layer is used to compute the class probabilities.
Next, we configure the model with the loss function and optimizer:
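Using the config method defined earlier:

```python
model.config(nn.CrossEntropyLoss(), optim.Adam(model.parameters(), lr=0.001))
```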
We use the CrossEntropyLoss as the loss function and the Adam optimizer with a learning rate of 0.001.
Finally, we train the model on the MNIST dataset:
```python
# Train the model
history = model.fit(train_loader, epochs=10)
```
This code snippet trains the model on the training set for 10 epochs. The training loop prints the training loss, validation loss, and validation accuracy after each epoch.
Results:
```
Epoch 1/10: 100%|██████████| 782/782 [00:05<00:00, 137.83it/s]
Loss: 1.6583, Val Loss: 1.4880, Val Accuracy: 97.40%
Epoch 2/10: 100%|██████████| 782/782 [00:05<00:00, 140.73it/s]
Loss: 1.4849, Val Loss: 1.4832, Val Accuracy: 97.85%
Epoch 3/10: 100%|██████████| 782/782 [00:05<00:00, 132.55it/s]
Loss: 1.4787, Val Loss: 1.4797, Val Accuracy: 98.15%
Epoch 4/10: 100%|██████████| 782/782 [00:05<00:00, 142.57it/s]
Loss: 1.4766, Val Loss: 1.4770, Val Accuracy: 98.45%
Epoch 5/10: 100%|██████████| 782/782 [00:05<00:00, 135.03it/s]
Loss: 1.4744, Val Loss: 1.4749, Val Accuracy: 98.62%
Epoch 6/10: 100%|██████████| 782/782 [00:05<00:00, 142.32it/s]
Loss: 1.4734, Val Loss: 1.4752, Val Accuracy: 98.59%
Epoch 7/10: 100%|██████████| 782/782 [00:05<00:00, 148.41it/s]
Loss: 1.4723, Val Loss: 1.4731, Val Accuracy: 98.84%
Epoch 8/10: 100%|██████████| 782/782 [00:05<00:00, 139.67it/s]
Loss: 1.4717, Val Loss: 1.4701, Val Accuracy: 99.11%
Epoch 9/10: 100%|██████████| 782/782 [00:05<00:00, 134.47it/s]
Loss: 1.4712, Val Loss: 1.4710, Val Accuracy: 99.03%
Epoch 10/10: 100%|██████████| 782/782 [00:05<00:00, 138.27it/s]
Loss: 1.4702, Val Loss: 1.4725, Val Accuracy: 98.90%
```
As you can see, the model needs only 2-3 epochs to reach 98% accuracy. I trained for 10 epochs to make the trend easier to read; you can use fewer epochs to speed up training.
Then, you can plot the training history to visualize the training and validation loss and accuracy:
```python
import matplotlib.pyplot as plt

# Plot the training history
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.plot([x[0] for x in history], label="Train Loss")
plt.plot([x[1] for x in history], label="Val Loss")
plt.xlabel("Epoch")
plt.legend()

plt.subplot(1, 2, 2)
plt.plot([x[2] for x in history], label="Val Accuracy")
plt.xlabel("Epoch")
plt.legend()

plt.show()
```
This code snippet plots the training and validation loss on the left and the validation accuracy on the right. You can visualize how the model performs during training.
Testing the Model
After training the model, we can evaluate its performance on the test set. Here's how you can test the model:
```python
x = [images for images, _ in test_loader]
y_true = [labels for _, labels in test_loader]
y_pred = model.predict(test_loader)
```
This code snippet gets the images and labels from the test loader and makes predictions using the trained model. We compare the predicted labels with the ground truth labels to evaluate the model's performance.
Next, we calculate the accuracy of the model on the test set:
```python
total = 0
correct = 0
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for true, pred in zip(y_true, y_pred):
    true, pred = true.to(device), pred.to(device)
    total += len(true)
    correct += (true == pred).sum().item()

accuracy = 100 * (correct / total)
print(f"Test Accuracy: {accuracy:.2f}%")  # Test Accuracy: 98.58%
```
Finally, we plot a random selection of test images along with their true and predicted labels:
```python
_, axs = plt.subplots(2, 5, figsize=(15, 6))
for i in range(10):
    random_idx = torch.randint(0, 64, (1, 2)).squeeze()
    axs[i // 5, i % 5].imshow(x[random_idx[0]][random_idx[1]].squeeze(), cmap="gray")
    color = "green" if y_true[random_idx[0]][random_idx[1]] == y_pred[random_idx[0]][random_idx[1]] else "red"
    axs[i // 5, i % 5].set_title(
        f"True: {y_true[random_idx[0]][random_idx[1]]}, Pred: {y_pred[random_idx[0]][random_idx[1]]}",
        color=color,
    )
    axs[i // 5, i % 5].axis("off")
plt.show()
```
This code snippet plots a grid of 10 randomly chosen test images along with their true and predicted labels. Titles are shown in green for correct predictions and in red for incorrect ones.
Save and Load the Model
You can save the trained model to a file and load it later for inference. Here's how you can save and load the model:
```python
torch.save(model.state_dict(), "mnist_cnn.pth")
```
This code snippet saves the model's state dictionary to a file named mnist_cnn.pth.
To load the model for inference, you can use the following code:
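A minimal sketch, assuming the same CNN class and the layers list from the training section:

```python
# Rebuild the same architecture, then load the trained weights
model = CNN(layers)
model.load_state_dict(torch.load("mnist_cnn.pth"))
model.eval()  # set to evaluation mode for inference
```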
This code snippet loads the model architecture and the trained weights from the file mnist_cnn.pth. The model is then set to evaluation mode for inference.
Web UI for Image Classification
You can create a simple web UI to interact with the image classifier. Here's a basic example using Flask:
app.py
```python
# Import libraries
import cv2
import flask
import flask_cors
import numpy as np
from cnn import CNN
from torch import nn, load
from torchvision import transforms

# Load the model (in the previous section)
model = ...

# Initialize the Flask app
app = flask.Flask(__name__)
flask_cors.CORS(app)

def preprocess_image(image):
    preprocess = transforms.Compose(
        [
            transforms.ToTensor(),
            transforms.Normalize((0.5,), (0.5,)),
        ]
    )
    image = cv2.imdecode(np.frombuffer(image.read(), np.uint8), cv2.IMREAD_GRAYSCALE)  # Read the image
    image = cv2.normalize(image, None, alpha=0, beta=255, norm_type=cv2.NORM_MINMAX)  # Normalize the image
    image = cv2.resize(image, (28, 28))  # Resize the image to 28x28
    return preprocess(image).unsqueeze(0).to("cuda")

@app.route("/predict", methods=["POST"])
def predict():
    image = flask.request.files.get("image")
    image = preprocess_image(image)
    prediction = model(image).argmax().item()
    return flask.jsonify({"prediction": prediction})

@app.route("/")
def index():
    return flask.render_template("index.html")

if __name__ == "__main__":
    app.run()
```
Then, create an HTML template for the web UI. I use Tailwind CSS for styling, so make sure to include the Tailwind CSS CDN in your HTML file.
Finally, create a JavaScript file to handle the interactions with the web UI and make predictions using the model. In this example, I use Vue.js for the frontend logic:
This JavaScript file sets up a canvas for drawing digits, captures the drawn image, sends it to the Flask server for prediction, and displays the predicted digit on the web UI. The predict method posts the drawn image to the Flask server, the clear method clears the canvas and resets the prediction, and the initDrawings method initializes the canvas and wires up the mouse events for drawing. The mounted lifecycle hook calls initDrawings when the Vue app is mounted on the #app element in the HTML template.
You can run the Flask app by executing the following command in your terminal:
```bash
python src/app.py
```
This will start the Flask app, and you can access it in your browser at http://localhost:5000.
Conclusion
In this blog, we've explored the process of building a Convolutional Neural Network (CNN) image classifier using PyTorch. We've covered the steps involved in defining the CNN model, training it on the MNIST dataset, and evaluating its performance on the test set.
We've also demonstrated how to save and load the trained model for future use and create a simple web UI for interacting with the image classifier.
I hope this blog has provided you with a solid foundation for building your own image classifier using CNNs and PyTorch. Feel free to experiment with different model architectures, datasets, and hyperparameters to further enhance your understanding of CNNs and deep learning.
For more details, you can refer to the tiesen243/cnn repository on GitHub.