Sep 14, 2022 | 6 min read

Rolling Out Risky Crons In Production

A practical pattern for dangerous cron jobs: do not trust staging alone, start with a tiny allowlist, and widen the blast radius in phases.

cronproductiondata-safetyoperations

Cron jobs are some of the scariest code paths in an application.

Regular request paths usually affect one user, one account, one action. A cron is different. It can wake up and touch a lot of entities at once, often without a human watching it closely. If that cron is doing something destructive, like a retention cleanup or deletion flow, the blast radius gets uncomfortable very quickly.

That is not just theory.

There was a case where one query inside a cron behaved badly for one particular set of data and ended up removing important data from the database. Recovery was possible because backups existed and the data could be backfilled, but it was still the kind of incident that changes how these jobs are viewed. After that, cron safety stopped looking like "did this pass testing?" and started looking more like "how is the damage contained if production still surprises us?"

Why Staging Is Not Enough

Cron logic still needs to be tested locally and in staging. That is necessary. It catches obvious mistakes, broken assumptions, and the simple bugs that should not be shipping in the first place.

But for this kind of job, staging only gets you part of the way.

The hard part is that production has the real combinations. More accounts. More old data. More awkward states created by years of product changes, migrations, partial failures, and human behavior. A cron that looks perfectly fine in a clean staging setup can still behave badly when it meets one weird slice of real production data.

That is why "works in staging" should not be treated as the final safety check for dangerous cron jobs. It is better treated as the point where production rollout can begin carefully.

The Pattern That Works Better

The main thing to do is roll the cron out in phases.

Instead of enabling it for every account at once, we keep an explicit set of accounts or entities in the database that the cron is allowed to run against. The cron reads that allowlist and only processes that subset.

That gives us a much safer operating model:

the code can be deployed without exposing the whole system
the first production run can be intentionally tiny
after each run, we can inspect what happened and then widen the scope
if something looks wrong, the damage is limited to a very small set

The first rollout is usually just one or two accounts. If that looks right, we expand a little. Then a little more. Eventually, after enough runs across enough different kinds of data, we can be comfortable enabling it for everything.

This approach works because it accepts a simple truth: for some cron jobs, production itself is part of the test surface. The goal is not to pretend otherwise. The goal is to make that exposure gradual.

Picking The First Accounts Matters

The small sets should not be random.

The early batches should include both kinds of cases:

the entities that really should be touched by the cron
the entities that look similar but absolutely should not be touched

For deletion or retention jobs, that means the tiny test set should include records that should be deleted and records that should survive. If only obvious positive cases are picked, the dangerous part is not really being tested. The dangerous part is often the false positive.

So those early accounts need to be chosen carefully. Not just "something small," but "something representative."

That usually means looking for:

an internal or test account first
a low-risk account with real but non-critical data
accounts that contain known edge cases
accounts where both delete and no-delete outcomes exist side by side

That last one matters a lot. A cron can look correct when every row in the batch should be deleted. Confidence goes up much more when the same run includes rows it must skip.

The Rollout Order That Usually Works Best

The order is usually more important than the total number.

The first runs should happen where the cost of being wrong is lowest. So the progression is usually something like this:

local and staging for the obvious checks
internal test accounts in production
low-criticality customer accounts
broader batches with more variation
global enablement only after enough clean runs

This is not elegant, but it is practical.

A lot of production safety work is just refusing to create a large blast radius before you have earned the right to do that.

What Changes After Each Run

The important part is that the rollout does not happen in one leap.

After each cron run, the allowlist in the database gets updated and the set expands for the next run. That update is a deliberate step. It forces a pause to inspect what happened and decide whether the cron has earned more scope.

If the previous batch behaved correctly, the rollout moves forward.

If anything looks suspicious, it stops there.

That pause between runs is doing a lot of work. It turns the rollout into a sequence of small decisions instead of one big irreversible bet.

Why Crons Still Feel Uncomfortable

Cron jobs do not stop being scary just because they are tested.

They stay scary because they usually run outside the normal human feedback loop. No one clicks a button. No one is staring at the screen at the exact moment they wake up. They often run over old data, messy data, or cross-account data. And if they are destructive, they can be wrong in a very expensive way.

That is why the main safety story should not be "the query looked right when it was reviewed."

The safety story should be layered:

we tested the logic before shipping
we had backups in case reality still hurt us
we limited the first production runs to a very small set
we expanded only after seeing clean results

That combination feels much more honest than trying to be overly confident about staging coverage.

If A Dangerous Cron Is Shipping

This is the checklist worth coming back to:

test the logic locally and in staging, but assume production can still surprise you
make the cron read from a database-controlled allowlist of accounts or entities
start with one or two carefully chosen accounts, not a random sample
include both positive cases and negative cases in the first batches
begin with internal or low-criticality accounts before touching important customer data
inspect each run, then expand the allowlist gradually
keep backups and a recovery path ready before the cron is allowed to do destructive work everywhere

The main lesson is simple: do not let a risky cron discover the full production dataset on day one.

Let it earn its reach in phases.