Training a Deep Learning Model for Solving Captchas

Posted on Saturday, 13 September 2025

Hi all. Several months ago, I trained a deep neural network (DNN) model for solving captcha from a certain website. Previously, I have created a python project for scraping some data from this website, but the user must manually input the captcha for logging into the website, which is irritating because the web-scraper script need to be run every day. With this model, the user doesn’t have to do that anymore.

The model pipeline is like this:

Open the login page and retrieve the captcha image.
Preprocess the image by segmenting each character, pad each image to be square, and shrunk the size of the image.
For each character image, feed it into the neural net model, and get the answer. Concatenate the characters to get the captcha word
Send the login credentials along with the captcha answer

The hardest problem is to train the model so that it can accurately predict the characters.

To get a sense of the problem at hand, here are some examples of the captcha:

Captcha examples

I noticed the following properties of those captchas:

The characters are randomly skewed or rotated slightly
The characters are perfectly separable (i.e., no overlapping characters)
The characters are only limited to this list of 43 characters: 2345678ABDEFHJMNQRTYabcdefhijkmnpqrstuvwxyz

Now, why didn’t I use Tesseract OCR (a very good open-source OCR program)? Well, I tried it at first, but I found the result to be too inaccurate to be acceptable. In many cases, Tesseract couldn’t even accurately predict how many characters there are in the image.

Building the training data

As many of you already know, training a machine learning model requires a lot of data, especially for a DNN model. So, I wrote a python script to download a bunch of captcha images from the website. In the end, I ended up with 398 images, which I then proceeded to rename manually so that each file name is equal to the content of the image, like so:

Captcha image files

The labeling process was a little bit tedious, but still manageable.

Preprocessing the image

To make the model training significantly easier, I need to separate each image into individual characters. In other words, instead of having to process a single picture containing five letters, I want the model to only need to process each character individually.

Separating each character is trivial since the characters have no overlap in the X-axis. All I had to do was to project the pixels to the X-axis, and locate where there are changes in the pixel’s color. For example, given this image …

Captcha image

… we can project the pixels to the X-axis to get something like this:

Captcha image projection

After that, I isolated the characters in the yellow blocks above, and padded each character so that all of them have equal dimension, like so:

Captcha image separated

As you can see, now we have five images with equal dimension, and in each image there’s only one character.

The complete algorithm is as follows:

from PIL import Image
import numpy as np

N_CHARS = 5

def preprocess_image(im: Image.Image) -> np.ndarray:
    """
    Convert image into (n_chars x width x height) numpy array, where
    n_chars is equal to 5. It is assumed that each letter in the image
    is perfectly separable using vertical lines.

    Parameters
    ----------
    im : Image.Image
        The input PIL Image object containing the text.

    Returns
    -------
    np.ndarray
        A NumPy array of shape `(5, width, height)`, representing each character segment
        as a separate (width x height) array.

    Notes
    -----
    - This function assumes that the image contains exactly 5 characters arranged horizontally.
    - Any necessary preprocessing, such as resizing or grayscale conversion, should be handled
      before calling this function if required.

    Example
    -------
    >>> from PIL import Image
    >>> im = Image.open("path/to/image.png")
    >>> X = preprocess_image(im)
    """
    width_scale = 1 # times the height
    scale = 0.5
    n_chars = N_CHARS
    threshold = 180

    im = im.resize(
        size=(int(im.width * scale), int(im.height * scale))
    )
    arr = np.array(im)[:,:,0] # only use the 1st channel
    x_projection = (arr < threshold).any(axis=0)
    is_cutoffs = (x_projection[1:] != x_projection[:-1])

    assert is_cutoffs.sum() == n_chars * 2

    width = int(width_scale * arr.shape[0])
    cutoffs = 1 + np.where(is_cutoffs)[0]
    cutoffs = cutoffs.reshape(n_chars, 2)

    x_list = []

    for _, (x1, x2) in enumerate(cutoffs):
        subset = arr[:,x1:x2]
        pad_size = width - subset.shape[1]
        assert pad_size > 0
        pad_left = pad_size // 2
        pad_right = pad_size - pad_left
        subset = np.pad(subset, [
            (0, 0),
            (pad_left, pad_right)
        ], constant_values=255)

        x_list.append(subset)

    X = np.array(x_list)
    return X

Synthesizing new data using augmentation

Recall that I scraped 398 captcha images, and each image contains five characters, which means that I have 398 x 5 = 1990 character images in total. There are 43 possible characters, so on average each character has ~46 images. However, the distribution of each character is not balanced, so some characters got less than that, and some got more. I believed that the number of training data is not enough for training a DNN model. Since I didn’t want to scrape hundreds more images and manually label them again, I resorted to data augmentation technique. What this means is that I used the existing image data and applied some random transformations to them (i.e., rotate or skew slightly). Luckily, pytorch already provided a ready-made function to do that (via the torchvision module).

Here’s the code that I ended up writing:

from torchvision.transforms import v2
import torch

def transform_tensor(X: torch.Tensor, affine_dict: dict[str, Any]) -> torch.Tensor:
    '''
    Transform the given N x H x W tensors using some affine transformations.
    The transformation is done on each N separately. `affine_dict` is input to
    v2.RandomAffine.

    Parameters
    ----------
    X : torch.Tensor
        Tensor of shape (N, H, W) where N
        is the number of samples, H the height,
        and W the weight

    Returns
    -------
    torch.Tensor
        Transformed tensor of shape (N, H, W)
    '''
    X = X[:,None,:] # add channel layer
    transform = v2.RandomAffine(**affine_dict)
    # index [0] to get the first channel layer
    out = torch.stack([transform(xi)[0] for xi in X])
    return out

def augment_data(
    X: torch.Tensor,
    y: torch.Tensor,
    affine_dict: dict[str, Any],
    n: int,
    shuffle: bool,
    keep_original: bool,
) -> tuple[torch.Tensor, torch.Tensor]:
    '''
    Generate new data such that each unique
    class in y consist of n observations in the
    output. The generated data came from X but applied
    some random transformations.

    Parameters
    ----------
    X : torch.Tensor
        Tensor of shape (N, H, W)
    y : torch.Tensor
        Tensor of shape (N) of type int
    n : int
        Number of generated samples per class
    shuffle : bool
        Whether to shuffle the generated data or not
    keep_original : bool
        Whether to include the original data or not

    Returns
    -------
    output : torch.Tensor
        Tensor of shape (K * n, H, W) where
        K is the number of classes in y
    '''
    X_list = []
    y_list = []
    n_class = y.max()
    for k in range(n_class):
        ix = torch.where(y == k)[0]
        # oversample ix
        ix = ix[torch.randint(len(ix), (n, ))]
        X_k = X[ix]
        y_k = y[ix]
        X_list.append(X_k)
        y_list.append(y_k)
    X_out = torch.cat(X_list)
    # apply transformations
    X_out = transform_tensor(X_out, affine_dict)
    y_out = torch.cat(y_list)

    if keep_original:
        X_out = torch.cat([X_out, X])
        y_out = torch.cat([y_out, y])

    # shuffle the data
    if shuffle:
        ix = torch.randperm(X_out.shape[0])
        X_out = X_out[ix]
        y_out = y_out[ix]
    return X_out, y_out

Equipped with the above function, I was able to increase the number of training data by several thousands. Incidentally, augmenting the data like this also serves to improve out-of-sample performance (i.e., reducing overfitting).

Model architecture and training procedure

I tried several models, but in the end I settled with the following procedures:

First, preprocess the 2D image into a flat 1D array so that it can be processed by the neural network. Since this approach already worked very well, I didn’t need to use more complex method like convolution (which could process the 2D image directly).
Then, for the neural network:
- Use a single hidden layer with 200 neurons, and use tanh function for the non-linearity
- Apply batch normalization before feeding into the tanh function
- Use dropout layer and L1 penalty as regularization (useful to get better out-of-sample performance)
- Use batch size=128 and 25k number of epochs

Below is the model architecture:

Model architecture

How many parameters does this model have? Let’s calculate:

The input image dimension is 25 x 25 pixels, which means there are 625 numbers as input
The hidden layer has 200 neurons
The output layer has 43 neurons (since there are 43 possible characters)
Now we only need to multiply all the above numbers: 625 x 200 x 43 = 5,375,000

As you can see, this model have more than 5 million parameters, which means that without proper regularization or with very limited data, the model would likely overfit the training samples.

Here’s the code for setting up the model and loss function:

from torch import nn

model = nn.Sequential(
    nn.Flatten().float(),
    nn.Linear(height * width, n_neurons),
        nn.BatchNorm1d(n_neurons),
        nn.Tanh(),
        nn.Dropout(p=0.4),
    nn.Linear(n_neurons, n_class)
)
loss_fn = nn.CrossEntropyLoss()

Meanwhile, below is the training function:

import torch
from torch import nn

def train(
    model: nn.Module,
    X_train: torch.Tensor,
    y_train: torch.Tensor,
    X_valid: torch.Tensor,
    y_valid: torch.Tensor,
    loss_fn: nn.CrossEntropyLoss,
    batch_size: int,
    n_epochs: int,
    learning_rate: float,
    l1_lambda: float,
    seed: int,
) -> pd.DataFrame:
    torch.manual_seed(seed)
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate) # type: ignore
    train_losses = []
    valid_losses = []

    for param in model.parameters():
        param.requires_grad = True

    for i in range(n_epochs):
        ix = torch.randint(0, len(X_train), (batch_size, ))
        X_train_batch = X_train[ix]
        y_train_batch = y_train[ix]
        logits = model(X_train_batch)
        loss = loss_fn(logits, y_train_batch)

        # add L1 penalty as regularization
        for layer_name,layer_param in model.named_parameters():
            if 'weight' in layer_name:
                loss += l1_lambda * layer_param.abs().sum()

        loss.backward()
        optimizer.step()

        # save train and validation loss
        with torch.no_grad():
            model.eval()
            train_loss = loss_fn(model(X_train_batch), y_train_batch)
            valid_loss = loss_fn(model(X_valid), y_valid)
            train_losses.append(train_loss.item())
            valid_losses.append(valid_loss.item())
        model.train()

        # print training progress
        if i % 200 == 0 or i == n_epochs - 1:
            print(f'epoch:{i:<8}train_loss:{train_loss.item():.4f}  valid_loss:{valid_loss.item():.4f}')

    print('Training finished')
    model.eval()
    losses_df = pd.DataFrame({
        'train': train_losses,
        'valid': valid_losses
    })
    return losses_df

When the training was finished, I plot the time series of the training and validation loss:

Training and validation loss

As you can see, the loss gradually declines overtime, which is a good sign that the model is capable of learning the patterns in the image.

In the end, the best model achieved 98.61% accuracy out-of-sample (on the validation set). Below, I have plotted the characters that are mislabeled by the model:

Mislabeled characters

In the plot above, the X-axis indicate the predicted label, while the Y-axis indicate the actual label. For example, the number “2” in the most bottom row & third column means that the model predicted the letter “z” as the letter “E” two times among all the validation set.

Apparently, the most confusing letter for the model is the letter “z”, since it’s confused with the letter “E”, “r”, and “s”.

Model accuracy

As I have stated, the best model achieved a 98.61% accuracy out-of-sample. However, we need to predict five characters at a time. So, what’s the probability that we get all five characters correct? It’s equal to 0.9861⁵ = 0.9324, or only 93.24%, which I thought was not good enough.

But then, I remembered that the website allows me to retry inputting the captcha several times without any penalty (when retrying, there’s simply a new captcha image). So, what’s the probability that we’ll get the captcha right in at most three attempts? It’s equal to (1-(1-0.9324)³) = 0.999691, or 99.97%, which is very good, so I decided to stop here.

In fact, when the model was deployed to predict new data, it only need two attempts at most, but mostly succeed at first attempt.

Deploying the model

To deploy the model, I simply did the following:

Save the model object into an ONNX file (see this documentation to see how); I named the file model.onnx
Manually copy the model.onnx file into the web scraper project, and use onnxruntime for running the model.

Roughly speaking, the code for running the model looks something like this:

import onnxruntime
from PIL import Image

def solve_captcha(img: Image.Image) -> str:
    model_file = "path/to/model.onnx"
    ort = onnxruntime.InferenceSession(model_file)
    # preprocess the image into array of 2D arrays, and convert to float
    X_arr = preprocess_image(img)
    X = X_arr.astype('float32') / 255.0
    # feed the data into the neural network
    logit_pred: np.ndarray = self.ort.run(None, {'X': X})[0]
    # make prediction
    y_pred = logit_pred.argmax(axis=1)
    chars_pred = [chars[i] for i in y_pred]
    # note: chars = list of chars used for training the model
    captcha_pred = ''.join(chars_pred)
    return captcha_pred

I admit that the deploying process is a bit janky. However, it works well enough for my purpose, so I didn’t purse the matter further.

Anyway, I’m happy to report that the model has been deployed now and used everyday at my office with very good accuracy.

Closing remarks

I think that’s it for this post, I hope you learn something new today.