Synthetic Data · 1 min read

Fixing Imbalanced Datasets with Generative AI

Fraud is rare. That makes it hard to train models to detect it. We show how to use GenAI to synthesize 'rare class' examples and balance your training set.

In the real world, the thing you want to find is often the needle in the haystack.

  • Fraud: 0.1% of transactions.
  • Cancer: 0.01% of MRI scans.
  • Manufacturing Defects: 0.001% of parts.

Standard ML models fail here. They just learn to guess “Normal” every time and achieve 99.9% accuracy, while missing every single fraudster.
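To see why, here is a minimal sketch using scikit-learn's DummyClassifier. The dataset, feature count, and 0.1% fraud rate are made up purely for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Illustrative dataset: 100,000 transactions, ~0.1% fraud (label 1)
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 5))            # placeholder features
y = (rng.random(100_000) < 0.001).astype(int)

# A "model" that always predicts the majority class ("Normal")
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print(f"Accuracy:     {accuracy_score(y, pred):.3f}")  # ~0.999
print(f"Fraud recall: {recall_score(y, pred):.3f}")    # 0.000 – misses every fraudster
```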

The Old Way: Oversampling

Historically, we just copied the rare rows 100 times (random oversampling). Or we used SMOTE, which interpolates new points along the lines between existing rare examples. This helps, but it doesn’t create genuinely new variety.
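Both approaches are one-liners with the imbalanced-learn library; this sketch reuses the illustrative X and y from above.

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Plain duplication of the rare rows
X_dup, y_dup = RandomOverSampler(random_state=0).fit_resample(X, y)

# SMOTE: interpolate between neighbouring rare-class points
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)

print(y_dup.mean(), y_smote.mean())  # both resampled sets are now ~50% fraud
```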

The New Way: Generative Augmentation

We can now train a small Generative Model (like a Diffusion Model) specifically on the rare class.

  • “Show me what a fraudulent transaction looks like.”
  • “Now generate 1,000 new, unique examples of fraud.”

This teaches the main model the structure of fraud, rather than just memorising the specific fraud examples we already caught.
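Here is a minimal sketch of the idea, again reusing the illustrative X and y from above. A scikit-learn Gaussian mixture stands in for the generative model (in practice you would reach for a tabular diffusion model or GAN): it is fitted only on the fraud rows, then sampled to top up the training set.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Train a generative model on the rare class only.
# (A Gaussian mixture stands in here for a tabular diffusion model or GAN.)
X_fraud = X[y == 1]
generator = GaussianMixture(n_components=3, random_state=0).fit(X_fraud)

# Generate 1,000 new, unique fraud-like examples and add them to the training data
X_synth, _ = generator.sample(1_000)
X_train_aug = np.vstack([X, X_synth])
y_train_aug = np.concatenate([y, np.ones(len(X_synth), dtype=int)])
```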

Results

In a recent project for a FinTech client, augmenting the training set with 20% synthetic fraud examples improved recall (detection rate) by 15% without increasing false positives.
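Your numbers will vary, but the evaluation recipe is simple to reproduce. A sketch, reusing the illustrative variables from above: train with and without the synthetic rows, then compare recall and precision on a held-out set of real transactions only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Hold out real data for evaluation – never put synthetic rows in the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

def fit_and_score(X_train, y_train):
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    pred = model.predict(X_te)
    return (recall_score(y_te, pred, zero_division=0),
            precision_score(y_te, pred, zero_division=0))

# Baseline vs. training set topped up with synthetic fraud examples
print("real only       :", fit_and_score(X_tr, y_tr))
print("real + synthetic:", fit_and_score(
    np.vstack([X_tr, X_synth]),
    np.concatenate([y_tr, np.ones(len(X_synth), dtype=int)])))
```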

Model failing on edge cases? Let’s balance your data. Chat to our Data Scientists.
