Deterministic SQL rewrites that are always safe to apply

A SQL rewrite that is probably equivalent is worth nothing. If a tool changes the meaning of a query even once — a row dropped, a column reordered, a NULL coerced — you can no longer trust any answer it returns, and you will spend more time auditing it than you ever saved. So chukei’s rewrite lever starts from the opposite premise: a rule is applied only when its output is provably equivalent to the original on this query, and otherwise the query passes through to Snowflake untouched.

chukei is an open source, Apache-2.0 cost optimization engine for Snowflake — a transparent wire-protocol proxy that runs in your own VPC. SQL rewriting is one of its six P0 levers. Of the six, it is the one most people assume needs machine learning. It does not. It needs an abstract syntax tree, a pack of equivalence-tested rules, and the discipline to do nothing when it isn’t sure.

01 / THE RULE-PACKAn AST, not a guess

When a query arrives on the wire, chukei parses it into an abstract syntax tree. The rewrite lever walks that tree and matches it against a fixed pack of deterministic rules. Each rule is a pure function from one AST to another, paired with a proof — encoded as a property test over generated inputs — that the two trees produce identical result sets under Snowflake semantics. If a rule’s preconditions hold, it fires; the transformed tree is re-serialised to SQL and forwarded. If they don’t, nothing happens.

A query AST walked against the deterministic rule-pack. Each rule carries an equivalence proof; a rule fires only when its preconditions hold, otherwise the original tree is forwarded verbatim.

There is no model in this loop, no inference call, no probability threshold. The same query produces the same rewrite every time, on every node, forever. That is the whole point.

02 / DETERMINISM VS THE “AI OPTIMIZER”Why a proof beats a prompt

The crowded end of this market sells LLM-based optimizers: feed the query to a model, let it suggest a faster version. That can surface genuinely clever rewrites a fixed rule-pack will never find. But it also inherits the model’s failure mode — it is plausible, not proven. A model that is right 99% of the time is catastrophic on the hot path, because the 1% silently returns wrong numbers to a dashboard nobody is re-checking, and you have no way to know which 1%.

A rewrite engine you cannot fully trust is one you have to fully audit — which costs more than the compute it saves.
— the design constraint

chukei makes the opposite trade deliberately. The rule-pack is smaller than what a model could propose, but every rule in it is closed-form and equivalence-tested, so the rewrite is correct by construction. Determinism here is not a limitation we apologise for — it is the feature. You get rewrites you never have to second-guess, and you keep the LLMs off the wire where they belong.

No LLM on the hot path — by invariant, not by preference. chukei’s rewrite lever runs in the deterministic query path with a ~2 ms p99 overhead inside a +5 ms budget. An inference call cannot meet that budget and cannot offer a proof, so it is never in the loop. LLM-based tools sit beside your warehouse and advise; chukei sits in front of it and acts only when it can prove it is safe.

03 / WHAT THE RULES ACTUALLY DOCheaper shapes, identical answers

The rules target query shapes that make Snowflake scan or shuffle more than the result requires. None of them change what the query means; they change what the warehouse has to do to answer it.

Rule	What it does	Why it’s cheaper	Skips when
`prune-select-star`	Replace `SELECT *` with the columns actually consumed	Less micro-partition scan, narrower shuffle	Output schema is observed downstream
`push-predicate`	Push `WHERE` filters below joins / into subqueries	Prunes partitions before the join	Predicate references a post-join expression
`drop-redundant-distinct`	Remove `DISTINCT` already guaranteed by a key	Skips a needless sort/aggregate	Uniqueness can’t be proven from constraints
`collapse-nested-subquery`	Flatten provably equivalent nested selects	Fewer materialisation steps	Any layer is non-deterministic
`eliminate-no-op-cast`	Drop casts to a column’s existing type	Avoids per-row work	Cast changes precision or semantics

Take the most common one. A BI tool emits SELECT * against a wide fact table, then the report uses three columns. Snowflake is columnar, so the wasted columns are wasted partition reads on every refresh:

-- before: the dashboard only renders order_id, amount, ts
SELECT *
FROM analytics.fct_orders
WHERE order_date = CURRENT_DATE;

-- after: provably equivalent for this consumer, far less scanned
SELECT order_id, amount, ts
FROM analytics.fct_orders
WHERE order_date = CURRENT_DATE;

The rewrite is only applied when chukei can see, from the query and its consumed projection, that the dropped columns are genuinely unused. If the result schema is observed in a way that would change — for instance the client depends on the full column set — the rule does not fire and the original SELECT * is forwarded byte-for-byte.

04 / FAIL OPEN, ALWAYSThe query is sacred

Every lever in chukei shares one invariant, and rewriting is no exception: when in doubt, do nothing. A parse error, an unrecognised dialect feature, a function chukei can’t prove deterministic, a precondition that doesn’t hold — any of these ends the rewrite attempt and the original SQL goes to Snowflake verbatim. The proxy never persists or logs credentials, and it never rewrites a query it cannot prove it is allowed to rewrite.

chukei rewrite · per-query decision
  parse ok? ───── no ──▶ passthrough (verbatim)
   │ yes
  rule matches? ─ no ──▶ passthrough (verbatim)
   │ yes
  equivalence preconditions hold? ─ no ──▶ passthrough (verbatim)
   │ yes
  apply rewrite ──▶ forward rewritten SQL

This is what makes the lever safe to leave on. The worst case is that chukei declines to optimise a query and you pay exactly what you would have paid without it. There is no failure mode in which a rewrite corrupts a result, because a rewrite that cannot be proven equivalent is, definitionally, never applied.

Because rewriting acts on query shape rather than repetition, it stacks with the verified cache and the auto-suspend levers rather than overlapping them. Across the rule-pack, rewriting contributes its share of the 15–30% savings band we ask every team to validate against their own workload — a target to confirm with the replay simulator before anything touches the path, never a promise.

Key takeaways

chukei rewrites Snowflake queries by walking an AST against a deterministic, equivalence-tested rule-pack — same input, same rewrite, every time.
A rule is applied only when it is provably equivalent on that query; otherwise the original SQL is forwarded verbatim.
No LLM on the hot path — determinism is the differentiator vs plausible-but-unproven AI optimizers, and it fits the ~2 ms p99 / +5 ms budget.
The lever is fail-open and false-positive-intolerant, so it is safe to leave on; the worst case is no rewrite, never a wrong answer.
Rewriting stacks with caching and suspend toward the 15–30% target you validate with replay first — never a guarantee.

The full rule-pack, its property tests, and the equivalence proofs are in the repository — every rule ships with the test that keeps it honest. If you have a query shape you think we should rewrite safely, open an issue with the (redacted) SQL and we will tell you whether it can be proven equivalent.

01 / THE RULE-PACKAn AST, not a guess

02 / DETERMINISM VS THE “AI OPTIMIZER”Why a proof beats a prompt

03 / WHAT THE RULES ACTUALLY DOCheaper shapes, identical answers

04 / FAIL OPEN, ALWAYSThe query is sacred

Key takeaways

Snowflake Query Optimization: Cut Compute Without Rewriting Dashboards

Snowflake Query Result Cache: How It Stays Correct

How chukei speaks Snowflake's wire protocol without breaking a single client