· Data Engineering  · 2 min read

Making Pipelines Unbreakable: The Power of Idempotency

If your pipeline crashes halfway through, can you just restart it? If not, you have a problem. We explain why Idempotency is the golden rule of Data Engineering.

Imagine you are running a script to insert yesterday’s sales into the database. It runs for 50 minutes and then crashes because of a network blip. You fix the connection and run it again.

Do you now have double the sales for yesterday?

If the answer is “yes,” your pipeline is not idempotent, and that is a disaster waiting to happen.

Defining Idempotency

In maths, an operation is idempotent if applying it multiple times has the same result as applying it once: $f(f(x)) = f(x)$. In Data Engineering, it means “Running the pipeline twice on the same input yields the same database state as running it once.”
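To make the definition concrete, here is a minimal Python sketch. The function names `append_load` and `overwrite_load` are hypothetical, and a plain list and dict stand in for a database table.

```python
# Hypothetical sample batch: (date, amount) pairs for one day of sales.
rows = [("2024-12-15", 100), ("2024-12-15", 250)]


def append_load(table: list, batch: list) -> list:
    """Not idempotent: every run adds the batch again."""
    table.extend(batch)
    return table


def overwrite_load(table: dict, batch: list) -> dict:
    """Idempotent: each run replaces the date's rows with the same batch."""
    for date in {d for d, _ in batch}:
        table[date] = [amount for d, amount in batch if d == date]
    return table


# Run each loader twice, as a retry after a crash would.
appended = append_load(append_load([], rows), rows)
overwritten = overwrite_load(overwrite_load({}, rows), rows)

print(len(appended))                   # 4 rows: the second run duplicated the sales
print(sum(overwritten["2024-12-15"]))  # 350: same state as after a single run
```

The second loader satisfies $f(f(x)) = f(x)$: running it again leaves the table exactly as the first run left it.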

How to Achieve It

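  1. Delete-Write Pattern: Before inserting data for date=2024-12-15, explicitly delete any existing data for date=2024-12-15. Where the database supports it, run both statements in one transaction so that a crash between them cannot leave the date empty.
    BEGIN;
    DELETE FROM sales WHERE date = '2024-12-15';
    INSERT INTO sales SELECT * FROM raw_sales WHERE date = '2024-12-15';
    COMMIT;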
  2. Merge / Upsert: Use MERGE statements to update records if they exist and insert them if they don’t, based on a unique Primary Key.
  3. Deterministic Logic: Ensure your transformations don’t rely on the current time (for example, CURRENT_TIMESTAMP) or random numbers, which change on every run. Pass the run date in as a parameter instead.
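The three patterns above can be sketched together using Python's built-in `sqlite3` module. This is an illustration under stated assumptions, not a production implementation: the `order_id`, `date`, and `amount` columns, the helper names `delete_write` and `upsert`, and the sample rows are all hypothetical. SQLite has no MERGE statement, so `INSERT ... ON CONFLICT DO UPDATE` (available from SQLite 3.24) stands in for it here.

```python
import sqlite3


def delete_write(conn: sqlite3.Connection, run_date: str) -> None:
    """Pattern 1: replace one date's rows inside a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM sales WHERE date = ?", (run_date,))
        conn.execute(
            "INSERT INTO sales (order_id, date, amount) "
            "SELECT order_id, date, amount FROM raw_sales WHERE date = ?",
            (run_date,),
        )


def upsert(conn: sqlite3.Connection, run_date: str) -> None:
    """Pattern 2: insert new keys and update existing ones."""
    with conn:
        conn.execute(
            "INSERT INTO sales (order_id, date, amount) "
            "SELECT order_id, date, amount FROM raw_sales WHERE date = ? "
            "ON CONFLICT(order_id) DO UPDATE SET "
            "date = excluded.date, amount = excluded.amount",
            (run_date,),
        )


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw_sales (order_id INTEGER PRIMARY KEY, date TEXT, amount REAL);
    CREATE TABLE sales     (order_id INTEGER PRIMARY KEY, date TEXT, amount REAL);
    INSERT INTO raw_sales VALUES (1, '2024-12-15', 100.0), (2, '2024-12-15', 250.0);
    """
)

# Pattern 3: the run date is an explicit argument, never "today",
# so a rerun next week still processes the same partition.
RUN_DATE = "2024-12-15"

for _ in range(2):  # simulate the original run plus a retry
    delete_write(conn, RUN_DATE)
    upsert(conn, RUN_DATE)

row_count, total = conn.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
print(row_count, total)  # 2 350.0
```

Either pattern alone is enough to make the load repeatable; both are run here only to show that mixing and repeating them still leaves two rows.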

Sleep Better at Night

When your pipelines are idempotent, on-call incidents become much easier. You don’t have to painstakingly work out which rows were inserted and which weren’t. You just hit “Retry” and go back to sleep.
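As a rough illustration of why “Retry” becomes safe, the sketch below simulates the network blip from the opening example. `load_day`, `run_with_retry`, and the in-memory `warehouse` dict are hypothetical stand-ins, not part of any real scheduler.

```python
warehouse: dict = {}   # hypothetical target: date -> list of sale amounts
attempts_made = 0


def load_day(run_date: str, amounts: list) -> None:
    """Idempotent load: clear the date's partition, then rewrite it."""
    global attempts_made
    attempts_made += 1
    warehouse[run_date] = []                # the "delete" step
    for i, amount in enumerate(amounts):
        if attempts_made == 1 and i == 1:
            # Fail partway through the first attempt only.
            raise ConnectionError("simulated network blip")
        warehouse[run_date].append(amount)  # the "write" step


def run_with_retry(task, max_attempts: int = 3) -> None:
    """Rerun the task until it succeeds; safe only because the task is idempotent."""
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            return
        except ConnectionError:
            if attempt == max_attempts:
                raise


run_with_retry(lambda: load_day("2024-12-15", [100, 250, 75]))
print(warehouse)  # {'2024-12-15': [100, 250, 75]}
```

The first attempt leaves a half-written partition behind, and the second attempt simply overwrites it, so nobody has to inspect which rows made it in.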

Are your pipelines fragile? We build resilient data systems. Contact our Data Engineering team.
