Deterministic SQL rewrites that are always safe to apply
chukei's SQL-rewrite lever proves each rule equivalent before it touches a query (pruning SELECT *, pushing predicates, eliminating redundant work) and fails open the instant it can't.
A SQL rewrite that is probably equivalent is worth nothing. If a tool changes the meaning of a query even once — a row dropped, a column reordered, a NULL coerced — you can no longer trust any answer it returns, and you will spend more time auditing it than you ever saved. So chukei’s rewrite lever starts from the opposite premise: a rule is applied only when its output is provably equivalent to the original on this query, and otherwise the query passes through to Snowflake untouched.
chukei is an open source, Apache-2.0 cost optimization engine for Snowflake — a transparent wire-protocol proxy that runs in your own VPC. SQL rewriting is one of its six P0 levers. Of the six, it is the one most people assume needs machine learning. It does not. It needs an abstract syntax tree, a pack of equivalence-tested rules, and the discipline to do nothing when it isn’t sure.
01 / THE RULE-PACK
An AST, not a guess
When a query arrives on the wire, chukei parses it into an abstract syntax tree. The rewrite lever walks that tree and matches it against a fixed pack of deterministic rules. Each rule is a pure function from one AST to another, paired with a proof — encoded as a property test over generated inputs — that the two trees produce identical result sets under Snowflake semantics. If a rule’s preconditions hold, it fires; the transformed tree is re-serialised to SQL and forwarded. If they don’t, nothing happens.
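To make "a pure function from one AST to another" concrete, here is a minimal sketch in Python. It is not chukei's code (the core is written in Rust); the `Column` and `Cast` node types and the `eliminate_no_op_cast` function are hypothetical names invented for illustration, and comparing declared types as exact strings is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Column:
    name: str
    sql_type: str  # declared type, e.g. "NUMBER(10,2)" (hypothetical metadata)


@dataclass(frozen=True)
class Cast:
    expr: "Expr"
    target_type: str


Expr = Union[Column, Cast]


def eliminate_no_op_cast(node: Expr) -> Optional[Expr]:
    """Return the rewritten node, or None when the rule does not apply.

    Precondition: the cast target is exactly the column's declared type.
    A different precision or a nested expression leaves the node alone.
    """
    if isinstance(node, Cast) and isinstance(node.expr, Column):
        if node.expr.sql_type == node.target_type:
            return node.expr
    return None


def rewrite(node: Expr) -> Expr:
    """Apply the rule if it fires; otherwise hand back the input unchanged."""
    rewritten = eliminate_no_op_cast(node)
    return node if rewritten is None else rewritten


amount = Column("amount", "NUMBER(10,2)")
same = rewrite(Cast(amount, "NUMBER(10,2)"))      # rule fires, cast removed
narrower = rewrite(Cast(amount, "NUMBER(10,0)"))  # precondition fails, untouched
```

Because the nodes are frozen dataclasses and the rule has no side effects, the same input tree always yields the same output tree, which is the property the paragraph above relies on.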
There is no model in this loop, no inference call, no probability threshold. The same query produces the same rewrite every time, on every node, forever. That is the whole point.
02 / DETERMINISM VS THE “AI OPTIMIZER”
Why a proof beats a prompt
The crowded end of this market sells LLM-based optimizers: feed the query to a model, let it suggest a faster version. That can surface genuinely clever rewrites a fixed rule-pack will never find. But it also inherits the model’s failure mode — it is plausible, not proven. A model that is right 99% of the time is catastrophic on the hot path, because the 1% silently returns wrong numbers to a dashboard nobody is re-checking, and you have no way to know which 1%.
A rewrite engine you cannot fully trust is one you have to fully audit — which costs more than the compute it saves.
— the design constraint
chukei makes the opposite trade deliberately. The rule-pack is smaller than what a model could propose, but every rule in it is closed-form and equivalence-tested, so the rewrite is correct by construction. Determinism here is not a limitation we apologise for — it is the feature. You get rewrites you never have to second-guess, and you keep the LLMs off the wire where they belong.
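What an equivalence test over generated inputs might look like can be sketched in a few lines of Python. This is an illustration under loud assumptions: SQLite stands in for Snowflake (a real harness would have to run against Snowflake semantics), the `orders` and `customers` tables are made up, and the harness is not chukei's actual test suite. It checks a push-predicate rewrite on an inner join, then shows the same harness rejecting an unsafe push across a LEFT JOIN.

```python
import random
import sqlite3

JOIN = "ON o.customer_id = c.customer_id"
ORIGINAL = f"SELECT o.order_id, c.region FROM orders o JOIN customers c {JOIN} WHERE o.amount > 50"
REWRITTEN = (
    "SELECT o.order_id, c.region "
    f"FROM (SELECT * FROM orders WHERE amount > 50) o JOIN customers c {JOIN}"
)
# Unsafe variant: filtering the nullable side of a LEFT JOIN before the join.
UNSAFE_ORIGINAL = f"SELECT o.order_id, c.region FROM orders o LEFT JOIN customers c {JOIN} WHERE c.region = 'eu'"
UNSAFE_REWRITE = (
    "SELECT o.order_id, c.region FROM orders o "
    f"LEFT JOIN (SELECT * FROM customers WHERE region = 'eu') c {JOIN}"
)


def random_database(rng: random.Random) -> sqlite3.Connection:
    """Build a small in-memory database with NULLs and boundary values."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
    conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount INTEGER)")
    for cid in range(rng.randint(0, 5)):
        conn.execute("INSERT INTO customers VALUES (?, ?)", (cid, rng.choice(["eu", "us", None])))
    for oid in range(rng.randint(0, 20)):
        row = (oid, rng.choice([None, 0, 1, 2, 3, 4, 7]), rng.choice([None, 10, 50, 51, 900]))
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", row)
    return conn


def result_multiset(conn: sqlite3.Connection, sql: str) -> list:
    # Neither query has ORDER BY, so compare rows as a sorted multiset.
    return sorted(conn.execute(sql).fetchall(), key=repr)


def check_equivalent(original: str, rewritten: str, trials: int = 200, seed: int = 0) -> bool:
    """Return False as soon as any generated database tells the queries apart."""
    rng = random.Random(seed)
    for _ in range(trials):
        conn = random_database(rng)
        try:
            if result_multiset(conn, original) != result_multiset(conn, rewritten):
                return False
        finally:
            conn.close()
    return True


safe_ok = check_equivalent(ORIGINAL, REWRITTEN)
unsafe_ok = check_equivalent(UNSAFE_ORIGINAL, UNSAFE_REWRITE)
```

The generator deliberately includes NULLs, empty tables, and values on either side of the predicate boundary, since those are the inputs most likely to expose a rewrite that only looks equivalent.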
No LLM on the hot path — by invariant, not by preference. chukei’s rewrite lever runs in the deterministic query path with a ~2 ms p99 overhead inside a +5 ms budget. An inference call cannot meet that budget and cannot offer a proof, so it is never in the loop. LLM-based tools sit beside your warehouse and advise; chukei sits in front of it and acts only when it can prove it is safe.
03 / WHAT THE RULES ACTUALLY DO
Cheaper shapes, identical answers
The rules target query shapes that make Snowflake scan or shuffle more than the result requires. None of them change what the query means; they change what the warehouse has to do to answer it.
| Rule | What it does | Why it’s cheaper | Skips when |
|---|---|---|---|
| prune-select-star | Replace SELECT * with the columns actually consumed | Less micro-partition scan, narrower shuffle | Output schema is observed downstream |
| push-predicate | Push WHERE filters below joins / into subqueries | Prunes partitions before the join | Predicate references a post-join expression |
| drop-redundant-distinct | Remove DISTINCT already guaranteed by a key | Skips a needless sort/aggregate | Uniqueness can’t be proven from constraints |
| collapse-nested-subquery | Flatten provably equivalent nested selects | Fewer materialisation steps | Any layer is non-deterministic |
| eliminate-no-op-cast | Drop casts to a column’s existing type | Avoids per-row work | Cast changes precision or semantics |
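The "Skips when" column is where each rule's safety lives. As one hedged illustration, the precondition for drop-redundant-distinct could be sketched as below. The function name and its inputs are hypothetical, the sketch covers a single-table SELECT only, and how chukei actually establishes uniqueness is not described here, so the key and nullability metadata are treated as trusted input.

```python
from typing import FrozenSet, Iterable, Set


def distinct_is_redundant(
    projected: Set[str],
    unique_keys: Iterable[FrozenSet[str]],
    not_null: Set[str],
) -> bool:
    """True when some unique key is fully projected and has no nullable column.

    Single-table SELECT only: a join can duplicate rows even when each
    table's key is unique. Nullable key columns are rejected because a
    unique constraint may admit several NULLs that DISTINCT would merge.
    """
    for key in unique_keys:
        if key and key <= projected and key <= not_null:
            return True
    return False


keys = [frozenset({"order_id"})]
fires = distinct_is_redundant({"order_id", "amount"}, keys, {"order_id"})
no_key = distinct_is_redundant({"amount"}, keys, {"order_id"})
nullable = distinct_is_redundant({"order_id"}, keys, set())
```

Only the first call returns True; the other two are cases where the rule would skip and leave the DISTINCT in place.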
Take the most common one. A BI tool emits SELECT * against a wide fact table,
then the report uses three columns. Snowflake is columnar, so the wasted columns
are wasted partition reads on every refresh:
-- before: the dashboard only renders order_id, amount, ts
SELECT *
FROM analytics.fct_orders
WHERE order_date = CURRENT_DATE;
-- after: provably equivalent for this consumer, far less scanned
SELECT order_id, amount, ts
FROM analytics.fct_orders
WHERE order_date = CURRENT_DATE;
The rewrite is applied only when chukei can see, from the query and its consumed
projection, that the dropped columns are genuinely unused. If the client observes
the result schema in a way the rewrite would change (for instance, it depends on
the full column set), the rule does not fire and the original SELECT * is
forwarded byte-for-byte.
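The shape of that check can be sketched as follows. This is a simplified illustration, not chukei's implementation: `prune_select_star`, the `schema_observed` flag, and the idea that the consumed columns arrive as a plain set are all assumptions made for the example.

```python
from typing import List, Optional, Sequence, Set


def prune_select_star(
    table_columns: Sequence[str],
    consumed: Set[str],
    schema_observed: bool,
) -> Optional[List[str]]:
    """Return the explicit projection, or None to forward SELECT * untouched."""
    if schema_observed:
        return None  # the client depends on the full result schema
    if not consumed:
        return None  # nothing known about consumption: do nothing
    if not consumed <= set(table_columns):
        return None  # consumer names a column we cannot resolve
    # Keep the table's own column order so the output order stays stable.
    return [col for col in table_columns if col in consumed]


columns = ["order_id", "customer_id", "amount", "ts", "order_date"]
pruned = prune_select_star(columns, {"order_id", "amount", "ts"}, schema_observed=False)
skipped = prune_select_star(columns, {"order_id", "amount", "ts"}, schema_observed=True)
print(pruned)   # ['order_id', 'amount', 'ts']
print(skipped)  # None
```

Every branch that returns None corresponds to "the rule does not fire": the caller forwards the original text rather than a best guess.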
04 / FAIL OPEN, ALWAYS
The query is sacred
Every lever in chukei shares one invariant, and rewriting is no exception: when in doubt, do nothing. A parse error, an unrecognised dialect feature, a function chukei can’t prove deterministic, a precondition that doesn’t hold — any of these ends the rewrite attempt and the original SQL goes to Snowflake verbatim. The proxy never persists or logs credentials, and it never rewrites a query it cannot prove it is allowed to rewrite.
chukei rewrite · per-query decision
parse ok? ───── no ──▶ passthrough (verbatim)
│ yes
rule matches? ─ no ──▶ passthrough (verbatim)
│ yes
equivalence preconditions hold? ─ no ──▶ passthrough (verbatim)
│ yes
apply rewrite ──▶ forward rewritten SQL
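The decision flow above can be written as a small fail-open wrapper. The sketch below is an assumption-heavy toy: the "AST" is just the SQL string, `toy_parse` accepts only SELECT statements, and the `Rule` type and the sample rule are hypothetical. What it does show faithfully is the control flow: any failed check, and any exception, returns the original SQL unchanged.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[str], bool]
    preconditions_hold: Callable[[str], bool]
    apply: Callable[[str], str]


def toy_parse(sql: str) -> str:
    """Stand-in parser: the 'tree' is the SQL text; non-SELECT is rejected."""
    if not sql.strip().upper().startswith("SELECT"):
        raise ValueError("unsupported statement")
    return sql


def rewrite_or_passthrough(sql: str, rules: Sequence[Rule]) -> Tuple[str, Optional[str]]:
    """Return (sql_to_forward, applied_rule_name); fail open on any doubt."""
    try:
        tree = toy_parse(sql)
        for rule in rules:
            if rule.matches(tree) and rule.preconditions_hold(tree):
                return rule.apply(tree), rule.name
    except Exception:
        pass  # parse error or rule failure: fall through to passthrough
    return sql, None  # original SQL, verbatim


NO_OP = "CAST(amount AS NUMBER(10,2))"
cast_rule = Rule(
    name="eliminate-no-op-cast",
    matches=lambda tree: NO_OP in tree,
    preconditions_hold=lambda tree: True,  # assume amount is declared NUMBER(10,2)
    apply=lambda tree: tree.replace(NO_OP, "amount"),
)

rewritten = rewrite_or_passthrough(f"SELECT {NO_OP} FROM t", [cast_rule])
untouched = rewrite_or_passthrough("UPDATE t SET x = 1", [cast_rule])
```

Catching every exception is deliberate in this sketch: an unexpected failure inside a rule is treated the same as a failed precondition, so the query still reaches the warehouse as written.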
This is what makes the lever safe to leave on. The worst case is that chukei declines to optimise a query and you pay exactly what you would have paid without it. There is no failure mode in which a rewrite corrupts a result, because a rewrite that cannot be proven equivalent is, definitionally, never applied.
Because rewriting acts on query shape rather than repetition, it stacks with the verified cache and the auto-suspend levers rather than overlapping them. Across the rule-pack, rewriting contributes its share of the 15–30% savings band we ask every team to validate against their own workload — a target to confirm with the replay simulator before anything touches the path, never a promise.
Key takeaways
- chukei rewrites Snowflake queries by walking an AST against a deterministic, equivalence-tested rule-pack — same input, same rewrite, every time.
- A rule is applied only when it is provably equivalent on that query; otherwise the original SQL is forwarded verbatim.
- No LLM on the hot path — determinism is the differentiator vs plausible-but-unproven AI optimizers, and it fits the ~2 ms p99 / +5 ms budget.
- The lever is fail-open and false-positive-intolerant, so it is safe to leave on; the worst case is no rewrite, never a wrong answer.
- Rewriting stacks with caching and suspend toward the 15–30% target you validate with replay first — never a guarantee.
The full rule-pack, its property tests, and the equivalence proofs are in the repository — every rule ships with the test that keeps it honest. If you have a query shape you think we should rewrite safely, open an issue with the (redacted) SQL and we will tell you whether it can be proven equivalent.
Builds the Rust wire-protocol core of chukei. Spends his time making sure the proxy adds milliseconds, never breakage.