Khmer OCR Project

Our Pipeline

We built a custom detection model based on YOLOv8-nano rather than relying on a generic pretrained model. This lets us tune it for Khmer text while keeping the model small and inference fast.

Step 01: YOLOv8 Detection

Step 02: CRNN Recognition

Step 03: Pipeline Output
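
The three steps above can be sketched as a small piece of glue code. This is only an illustration of how detection, recognition, and output might fit together, not our actual implementation. The names `detect`, `recognize`, `crop`, `reading_order`, and `run_pipeline` are hypothetical, the image is a plain list of pixel rows standing in for a real image type, and the row-bucket ordering rule is a simplifying assumption.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), an assumed box format
Image = List[List[int]]          # grayscale rows, a placeholder image type


def crop(image: Image, box: Box) -> Image:
    """Cut the region covered by one detected box out of the image."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]


def reading_order(boxes: List[Box], line_tolerance: int = 10) -> List[Box]:
    """Sort boxes top to bottom, then left to right.

    Boxes whose y falls in the same bucket of `line_tolerance` pixels are
    treated as one text line. This is a crude assumption; real layouts
    may need proper line clustering.
    """
    return sorted(boxes, key=lambda b: (b[1] // line_tolerance, b[0]))


def run_pipeline(
    image: Image,
    detect: Callable[[Image], List[Box]],
    recognize: Callable[[Image], str],
) -> List[str]:
    boxes = reading_order(detect(image))                 # Step 01: detection
    texts = [recognize(crop(image, b)) for b in boxes]   # Step 02: recognition
    return texts                                         # Step 03: output
```

In this sketch, `detect` would wrap the YOLOv8-based detector and `recognize` the CRNN model. How the recognized pieces are joined into final text is left to the caller.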

The OCR Challenge

Why building Khmer OCR is harder than building English OCR, and how we solved it.

Data-Scarce Language

There is no large, public Khmer OCR dataset with handwriting, layouts, or detailed annotations. Most Khmer documents are offline or locked in images and PDFs, so researchers have very little clean text or labeled examples to train modern models.

DIY Dataset Generator

Because ready-made datasets don’t exist, we generate our own. Using sources like the Khmer Dictionary 2022, we render paragraphs with random font sizes, line spacing, and padding onto different backgrounds—then automatically create bounding boxes and labels for every word.
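
A minimal sketch of the automatic labeling idea: because the generator decides where each word goes, it already knows every bounding box. The code below only computes the layout and the boxes. It does not draw anything, and `estimate_width` is a stand-in assumption for real font metrics. All names (`WordBox`, `layout_words`, `random_sample`) and the parameter ranges are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class WordBox:
    text: str
    x: int
    y: int
    w: int
    h: int


def estimate_width(word: str, font_size: int) -> int:
    # Stand-in for real font metrics (assumption): every code point
    # advances 0.6 * font_size. Khmer stacks marks above and below the
    # base character, so a real generator should measure the rendered word.
    return max(1, round(len(word) * font_size * 0.6))


def layout_words(words, page_width, font_size, line_spacing, padding):
    """Place words left to right with wrapping; return one box per word."""
    space = round(font_size * 0.3)
    line_height = round(font_size * line_spacing)
    x, y = padding, padding
    boxes = []
    for word in words:
        w = estimate_width(word, font_size)
        # Wrap when the word would cross the right padding, unless it is
        # already the first word on the line.
        if x + w > page_width - padding and x > padding:
            x = padding
            y += line_height
        boxes.append(WordBox(word, x, y, w, font_size))
        x += w + space
    return boxes


def random_sample(words, rng, page_width=640):
    """Draw random layout parameters, then lay out the words."""
    font_size = rng.randint(16, 48)
    line_spacing = rng.uniform(1.1, 1.8)
    padding = rng.randint(8, 40)
    return layout_words(words, page_width, font_size, line_spacing, padding)
```

Calling `random_sample(words, random.Random(seed))` with a list of dictionary words yields a different layout per seed, and each returned `WordBox` pairs a label with its box.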

Real-World Noise & Confusion

Real scans are never perfect, so our synthetic images aren’t either. We add motion blur, JPEG compression, brightness and color shifts, slight rotations, and even “negative” samples in other languages to teach the model to survive noisy cameras, bad scans, and mixed-language pages.
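
Two of these degradations are simple enough to sketch in plain Python. The example below treats an image as a list of grayscale pixel rows and shows a brightness shift and a horizontal box blur as a rough stand-in for motion blur. JPEG compression, color shifts, and rotation need an image library and are left out. The function names, the shift range, and the kernel sizes are illustrative assumptions, not the values we use.

```python
import random


def shift_brightness(img, delta):
    """Add delta to every pixel and clamp to the 0..255 range."""
    return [[min(255, max(0, p + delta)) for p in row] for row in img]


def horizontal_motion_blur(img, kernel=3):
    """Average each pixel with its horizontal neighbours (box kernel).

    Windows are truncated at the left and right image edges.
    """
    half = kernel // 2
    out = []
    for row in img:
        blurred = []
        for i in range(len(row)):
            window = row[max(0, i - half): i + half + 1]
            blurred.append(sum(window) // len(window))
        out.append(blurred)
    return out


def augment(img, rng):
    """Apply a random brightness shift, then blur with probability 0.5."""
    img = shift_brightness(img, rng.randint(-40, 40))
    if rng.random() < 0.5:
        img = horizontal_motion_blur(img, kernel=rng.choice([3, 5]))
    return img
```

Neither operation moves pixels to a new position, so word bounding boxes stay valid after `augment`. A rotation step would also have to transform the boxes.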

Our Purpose

Accelerating Khmer LLM Development

High-quality Khmer text is essential for training large language models, but most existing datasets are tiny or inconsistent. Our OCR pipeline helps generate cleaner, larger corpora so Khmer can be supported by the same advanced AI technologies available in other languages.

Research and Development

This page is part of an ongoing project to improve Khmer OCR performance. We’re experimenting with different models, datasets, and layouts to see what works best for real users. Your uploads and feedback help us understand where the system succeeds—and where we still need to improve.

UNLOCKING THE KHMER SCRIPT