· Data Engineering · 2 min read
Making Pipelines Unbreakable: The Power of Idempotency
If your pipeline crashes halfway through, can you just restart it? If not, you have a problem. We explain why Idempotency is the golden rule of Data Engineering.
Imagine you are running a script to insert yesterday’s sales into the database. It runs for 50 minutes and then crashes because of a network blip. You fix the connection and run it again.
Do you now have double the sales for yesterday?
If the answer is “Yes,” your pipeline is not Idempotent. And that is a disaster waiting to happen.
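To make the failure concrete, here is a minimal sketch of a non-idempotent load, using Python's built-in `sqlite3` module and an in-memory database. The `sales` schema, the `naive_load` helper, and the amounts are hypothetical, and the "crash" is simplified to a full second run.

```python
import sqlite3

def naive_load(conn, rows):
    """Append rows blindly; every run adds another copy."""
    conn.executemany(
        "INSERT INTO sales (sale_date, amount) VALUES (?, ?)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")

# Hypothetical sales for one day; the true total is 350.0.
yesterday = [("2024-12-15", 100.0), ("2024-12-15", 250.0)]

naive_load(conn, yesterday)  # first attempt
naive_load(conn, yesterday)  # the "retry" after the network blip

naive_total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(naive_total)  # 700.0, double the real figure
```

Nothing in the script fails or warns. The only symptom is a number that is quietly wrong.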
Defining Idempotency
In maths, an operation is idempotent if applying it multiple times has the same result as applying it once: $f(f(x)) = f(x)$. In Data Engineering, it means “Running the pipeline twice yields the same database state as running it once.”
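The definition can be checked directly in code. The two functions below are illustrative examples chosen for this sketch, not part of any pipeline: one satisfies $f(f(x)) = f(x)$ and one does not.

```python
def normalize(values):
    """Deduplicate and sort. Applying it again changes nothing."""
    return sorted(set(values))

def append_marker(values):
    """Not idempotent: every call makes the list one element longer."""
    return values + ["x"]

data = [3, 1, 3, 2]

once = normalize(data)
twice = normalize(once)
print(once == twice)  # True: f(f(x)) == f(x)

grown_once = append_marker(data)
grown_twice = append_marker(grown_once)
print(grown_once == grown_twice)  # False: the second call changed the result
```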
How to Achieve It
- Delete-Write Pattern: Before inserting data for `date=2024-12-15`, explicitly delete any existing data for `date=2024-12-15`.

      DELETE FROM sales WHERE date = '2024-12-15';
      INSERT INTO sales SELECT * FROM raw_sales WHERE date = '2024-12-15';

- Merge / Upsert: Use `MERGE` statements to update records if they exist and insert them if they don’t, based on a unique Primary Key.
- Deterministic Logic: Ensure your transformations don’t rely on `ReviewTime()` or random numbers, which change on every run.
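The three patterns above can be sketched together with Python's built-in `sqlite3` module. Several things here are assumptions made for illustration: the `sale_id` and `amount` columns, the function names, and the sample rows are hypothetical. SQLite has no `MERGE`, so the sketch swaps in its `INSERT ... ON CONFLICT DO UPDATE` upsert (available from SQLite 3.24). Other databases will need their own syntax.

```python
import sqlite3

def load_day(conn, run_date):
    """Delete-write: replace one day's rows inside a single transaction.

    run_date is a parameter rather than a clock reading, so a rerun for
    the same date processes exactly the same slice.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("DELETE FROM sales WHERE date = ?", (run_date,))
        conn.execute(
            "INSERT INTO sales SELECT * FROM raw_sales WHERE date = ?",
            (run_date,),
        )

def upsert_day(conn, run_date):
    """Upsert keyed on the primary key (SQLite's stand-in for MERGE)."""
    with conn:
        conn.execute(
            """
            INSERT INTO sales (sale_id, date, amount)
            SELECT sale_id, date, amount FROM raw_sales WHERE date = ?
            ON CONFLICT(sale_id) DO UPDATE SET
                date = excluded.date,
                amount = excluded.amount
            """,
            (run_date,),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (sale_id INTEGER, date TEXT, amount REAL)")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, date TEXT, amount REAL)"
)
with conn:
    conn.executemany(
        "INSERT INTO raw_sales VALUES (?, ?, ?)",
        [(1, "2024-12-15", 100.0), (2, "2024-12-15", 250.0)],
    )

for _ in range(2):  # run, then rerun, with both patterns
    load_day(conn, "2024-12-15")
    upsert_day(conn, "2024-12-15")

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
day_total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(row_count, day_total)
```

Wrapping the delete and the insert in one transaction is a design choice in this sketch: if the insert fails, the delete is rolled back too, so a crash cannot leave the day empty.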
Sleep Better at Night
When your pipelines are idempotent, on-call incidents become much easier. You don’t have to painstakingly work out which rows were inserted and which weren’t. You just hit “Retry” and go back to sleep.
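Once a job is idempotent, the “Retry” button can be as simple as a loop. The sketch below is one possible shape, not a prescribed implementation: `run_with_retry`, `flaky_load`, and the in-memory `state` dictionary are hypothetical stand-ins for a scheduler, a pipeline step, and a database.

```python
import time

def run_with_retry(job, attempts=3, delay_seconds=0.0):
    """Re-run a job until it succeeds or the attempts run out.

    This is only safe because the job is assumed to be idempotent.
    """
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay_seconds)

state = {}             # stand-in for the target table, keyed by date
calls = {"count": 0}   # lets the demo fail only on the first call

def flaky_load():
    calls["count"] += 1
    rows = []
    state["2024-12-15"] = rows  # overwrite the day's data, never append to it
    rows.append(100.0)
    if calls["count"] == 1:
        raise ConnectionError("simulated network blip")
    rows.append(250.0)
    return len(rows)

result = run_with_retry(flaky_load)
print(result, state["2024-12-15"])
```

The first attempt leaves a partial write behind, and the second attempt simply replaces it. No one has to inspect which rows made it in.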
Are your pipelines fragile? We build resilient data systems. Contact our Data Engineering team.
