Last week, I trained a neural network to recognize handwritten digits using the classic MNIST dataset. It achieved 98% accuracy on test data — better than I expected for my first real deep learning project.
Feeling confident, I showed it a clear image of the number "3."
Then I changed exactly one pixel in that image. Not enough for a human to notice — in fact, I had to zoom in to see the difference myself. I showed the model this "new" image.
This wasn't a bug. It wasn't a fluke. This is called an adversarial attack, and it reveals something fundamental about how AI systems fail.
The Setup: Why You Should Care
You've probably heard about "AI hallucinations" or models giving confident wrong answers. Adversarial attacks reveal something deeper: models don't "see" the way humans do.
They find patterns in high-dimensional space that have nothing to do with what we care about. This connects to alignment: if we can't make a model robustly recognize a digit, how do we make it robustly follow human values?
Part 1: Building the Classifier
First, I needed a working image classifier. I followed UCSD's Homework 1: "vibe-code a ResNet image classifier for MNIST."
What is MNIST?
MNIST is a dataset of 70,000 handwritten digits (0-9), each 28×28 pixels. It's the "Hello World" of machine learning. If your model can't handle MNIST, it definitely can't handle real-world problems.
Sample MNIST Digits
Here are examples from the dataset. My model was trained to recognize these:
The Model Architecture
I used a ResNet-18 architecture — a convolutional neural network that's become standard for image classification:
Everything looked good! The model "understands" digits... or does it?
Part 2: Breaking Everything with FGSM
Now comes the fun part. I implemented the FGSM (Fast Gradient Sign Method) attack.
How the Attack Works
- 1. Take a correctly classified image (e.g., a "3")
- 2. Calculate the gradient: which direction makes the model MORE wrong?
- 3. Nudge the image slightly in that direction
- 4. The model now confidently misclassifies it
Attack Results: Before & After
Here's what happened when I ran the attack on correctly classified digits:
What's happening? The model isn't "seeing" the shape of the digit. It's found shortcuts — patterns in pixel space that correlate with labels during training, but break under adversarial pressure.
Part 3: Now You Try It
Okay, enough reading. Time to break things yourself. Draw a digit, run the attack, and see how easy it is to fool an AI. Everything runs in your browser.
Attack Your Own Digit
Note: The model running in your browser is simplified. A full model would be even more vulnerable.
Part 4: My Unique Twist — The Jawi Experiment
Here's where I deviate from the standard UCSD curriculum. Standard MNIST uses Western Arabic numerals (0-9). But what about Jawi numerals — the Eastern Arabic-Indic digits used in Malaysia?
Eastern Arabic-Indic Numerals (Jawi)
These are the digits used in many Islamic contexts — from Quranic verse numbering to Malaysian government documents. They look similar to Western numerals, but have subtle differences.
My hypothesis: If I fine-tune my MNIST classifier on Jawi numerals, does it become MORE or LESS robust to adversarial attacks?
Why This Matters
- → Tests whether cultural context affects adversarial robustness
- → Connects to my broader research: epistemic frameworks might have different robustness properties
- → Practical: if we deploy AI in Malaysian schools (Dynamic Textbook project), we need robustness to non-Western data
Status: I'm building a small Jawi handwritten digit dataset (targeting 500 images). Interested in contributing samples or collaborating? DM me on Instagram. Results in the next post!
Why This Tiny Experiment Matters
Let me bring this back to the big picture. Three takeaways:
1. Technical: Adversarial Robustness Isn't Solved
Even for simple tasks like digit recognition, we don't have robust models. Scaling to LLMs with billions of parameters makes this exponentially harder.
2. Epistemic: Different Training = Different Failures
This connects to my research on epistemic frameworks: if models trained on different knowledge traditions have different "shortcuts," they might have different failure modes. The Jawi experiment tests this.
3. Accessibility: You Don't Need a PhD
This experiment is simple enough that a motivated high school student could replicate it. That's the point. AI safety shouldn't require elite credentials to start exploring.
What's Next?
Next in this series: I'm diving into jailbreaking and prompt injection (UCSD Week 4). I'll test whether my AsistenKeluarga concept (family AI assistant) is vulnerable to prompt injection attacks in Malay vs. English.
Join the Jawi Dataset Project
I'm building the Jawi handwritten digit dataset and would love help from the Malaysian AI community. Interested in:
- → Contributing handwriting samples
- → Collaborating on the experiment
- → Testing robustness on other cultural datasets
Reach out: [email protected] or @shafiranoh on Instagram