How I Broke My Own Image Classifier

Last week, I trained a neural network to recognize handwritten digits using the classic MNIST dataset. It achieved 98% accuracy on test data — better than I expected for my first real deep learning project.

Feeling confident, I showed it a clear image of the number "3."

MODEL OUTPUT:

Prediction: 3

Confidence: 97.2%

Then I changed exactly one pixel in that image. Not enough for a human to notice — in fact, I had to zoom in to see the difference myself. I showed the model this "new" image.

MODEL OUTPUT:

Prediction: 8 (was 3)

Confidence: 94.6%

This wasn't a bug. It wasn't a fluke. This is called an adversarial attack, and it reveals something fundamental about how AI systems fail.

The Setup: Why You Should Care

You've probably heard about "AI hallucinations" or models giving confident wrong answers. Adversarial attacks reveal something deeper: models don't "see" the way humans do.

They find patterns in high-dimensional space that have nothing to do with what we care about. This connects to alignment: if we can't make a model robustly recognize a digit, how do we make it robustly follow human values?

My positioning: I'm learning AI safety from scratch, documenting as I go. This week I'm following UCSD's AI Safety course (Week 1 & Week 4 materials). No PhD needed — just curiosity and some Python.

Part 1: Building the Classifier

First, I needed a working image classifier. I followed UCSD's Homework 1: "vibe-code a ResNet image classifier for MNIST."

What is MNIST?

MNIST is a dataset of 70,000 handwritten digits (0-9), each 28×28 pixels. It's the "Hello World" of machine learning. If your model can't handle MNIST, it definitely can't handle real-world problems.

Read-Only Demo

Sample MNIST Digits

Here are examples from the dataset. My model was trained to recognize these:

✓ Model Accuracy on Test Set: 98.2% (9,820 correct out of 10,000)

The Model Architecture

I used a ResNet-18 architecture — a convolutional neural network that's become standard for image classification:

# This is the part where we train the model
# ResNet is "fancy" but the concept is simple:
# Stack of layers that learn patterns in images

import torch
from torchvision import models

model = models.resnet18(pretrained=False)
# Train on MNIST data...
trained_model = train(model, mnist_data)

# Test accuracy
accuracy = evaluate(trained_model, test_data)
print(f"Accuracy: {accuracy:.2%}")  # 98.2%
                

Everything looked good! The model "understands" digits... or does it?

Part 2: Breaking Everything with FGSM

Now comes the fun part. I implemented the FGSM (Fast Gradient Sign Method) attack.

How the Attack Works

1. Take a correctly classified image (e.g., a "3")
2. Calculate the gradient: which direction makes the model MORE wrong?
3. Nudge the image slightly in that direction
4. The model now confidently misclassifies it

def fgsm_attack(image, epsilon, data_grad):
    # Collect the sign of the gradient
    sign_data_grad = data_grad.sign()
    
    # Create perturbed image by adding small noise
    perturbed_image = image + epsilon * sign_data_grad
    
    # Clip to valid pixel range [0, 1]
    perturbed_image = torch.clamp(perturbed_image, 0, 1)
    
    return perturbed_image
                

Read-Only Results

Attack Results: Before & After

Here's what happened when I ran the attack on correctly classified digits:

Original Image

Prediction: 3

Confidence: 97.2%

Perturbation

The noise added

Looks like static

Attacked Image

Prediction: 8

Confidence: 94.6%

Key Insight: The original and attacked images look identical to the human eye, but the model's prediction completely changes. The model isn't "seeing" digits the way we do.

Figure 1.0 // Adversarial Attack in High-Dimensional Decision Space

What's happening? The model isn't "seeing" the shape of the digit. It's found shortcuts — patterns in pixel space that correlate with labels during training, but break under adversarial pressure.

This is the alignment problem in miniature: the model optimized for our metric (accuracy on test set) but not for what we actually care about (robust understanding of what a digit IS).

Part 3: Now You Try It

Okay, enough reading. Time to break things yourself. Draw a digit, run the attack, and see how easy it is to fool an AI. Everything runs in your browser.

Interactive Playground

Attack Your Own Digit

Instructions: Draw a digit in the canvas, click "Classify", then "Run Attack" to see adversarial perturbation in action.

1. Draw Here

2. Prediction

Draw & classify first

3. After Attack

Run attack to see results

Attack Strength (ε) 0.10

Higher = more visible, more effective

Note: The model running in your browser is simplified. A full model would be even more vulnerable.

Part 4: My Unique Twist — The Jawi Experiment

Here's where I deviate from the standard UCSD curriculum. Standard MNIST uses Western Arabic numerals (0-9). But what about Jawi numerals — the Eastern Arabic-Indic digits used in Malaysia?

Cultural Context

Eastern Arabic-Indic Numerals (Jawi)

٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩

These are the digits used in many Islamic contexts — from Quranic verse numbering to Malaysian government documents. They look similar to Western numerals, but have subtle differences.

My hypothesis: If I fine-tune my MNIST classifier on Jawi numerals, does it become MORE or LESS robust to adversarial attacks?

Why This Matters

→ Tests whether cultural context affects adversarial robustness
→ Connects to my broader research: epistemic frameworks might have different robustness properties
→ Practical: if we deploy AI in Malaysian schools (Dynamic Textbook project), we need robustness to non-Western data

Status: I'm building a small Jawi handwritten digit dataset (targeting 500 images). Interested in contributing samples or collaborating? DM me on Instagram. Results in the next post!

Why This Tiny Experiment Matters

Let me bring this back to the big picture. Three takeaways:

1. Technical: Adversarial Robustness Isn't Solved

Even for simple tasks like digit recognition, we don't have robust models. Scaling to LLMs with billions of parameters makes this exponentially harder.

2. Epistemic: Different Training = Different Failures

This connects to my research on epistemic frameworks: if models trained on different knowledge traditions have different "shortcuts," they might have different failure modes. The Jawi experiment tests this.

3. Accessibility: You Don't Need a PhD

This experiment is simple enough that a motivated high school student could replicate it. That's the point. AI safety shouldn't require elite credentials to start exploring.

The model optimized for our metric, but not for what we actually care about. This is the alignment problem in every AI system, from digit classifiers to AGI.

What's Next?

Next in this series: I'm diving into jailbreaking and prompt injection (UCSD Week 4). I'll test whether my AsistenKeluarga concept (family AI assistant) is vulnerable to prompt injection attacks in Malay vs. English.

Want to Contribute?

Join the Jawi Dataset Project

I'm building the Jawi handwritten digit dataset and would love help from the Malaysian AI community. Interested in:

→ Contributing handwriting samples
→ Collaborating on the experiment
→ Testing robustness on other cultural datasets

Reach out: [email protected] or @shafiranoh on Instagram