Example-based tests prove your code works on the cases you thought of. Compliance bugs live in the cases you didn't. Property-based testing is the tool that attacks that gap, and it is especially powerful when you are building a rules engine you cannot fully benchmark.
The problem: you can't see the corpus
In regulated work, the reference test set is often private — a client's benchmark, or real data you cannot touch under GDPR or an NDA. You are asked to be confident your engine is correct without ever seeing the data it will be judged against. Hand-written examples cannot get you there; you will only test what you imagined.
Derive properties from the regulation
The move is to stop testing specific outputs and start testing invariants — statements that must hold for any input, derived from the regulation itself:
- Monotonicity — making a breach worse must never lower its severity.
- Conservation — measured durations sum to the covered window minus annotated gaps. No time is invented, none is lost.
- Idempotence — merging or de-duplicating the same input twice equals doing it once.
- Locality — an exception applied to one segment must not change the result elsewhere.
These are not guesses about behaviour; they are properties the regulation logically requires. If your engine violates one, it is wrong — even if every hand-written example passed.
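To make that concrete, here is a minimal sketch of three of these invariants written as executable checks. Everything in it is an assumption made for illustration: the `severity` tiers, the `merge` function, the 240-minute `WINDOW` and the hand-rolled random generator are hypothetical stand-ins, not rules from any real regulation. In a real project a framework such as Hypothesis or rapid would replace the loop in `check_properties`.

```python
import random

WINDOW = 240  # hypothetical covered window, in minutes


def severity(overrun_minutes):
    """Hypothetical tiers: map minutes over a limit to a severity level."""
    if overrun_minutes <= 0:
        return 0
    if overrun_minutes <= 15:
        return 1
    if overrun_minutes <= 60:
        return 2
    return 3


def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def random_intervals(rng):
    """Generate up to eight intervals that lie inside the window."""
    intervals = []
    for _ in range(rng.randint(0, 8)):
        start = rng.randint(0, WINDOW - 1)
        intervals.append((start, rng.randint(start + 1, WINDOW)))
    return intervals


def check_properties(trials=500, seed=0):
    """Run each property on `trials` random inputs; return the trial count."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Monotonicity: a worse breach never gets a lower severity.
        overrun = rng.randint(-30, 300)
        worse = overrun + rng.randint(0, 300)
        assert severity(worse) >= severity(overrun)

        intervals = random_intervals(rng)
        once = merge(intervals)

        # Idempotence: merging twice equals merging once.
        assert merge(once) == once

        # Conservation: measured time plus gaps equals the window,
        # with gaps counted minute by minute as an independent oracle.
        measured = sum(end - start for start, end in once)
        gaps = sum(
            1 for minute in range(WINDOW)
            if not any(s <= minute < e for s, e in intervals)
        )
        assert measured + gaps == WINDOW
    return trials
```

Note that none of the assertions names an expected output. Each one states a relationship that must hold whatever the input is, which is what lets the same check run against data nobody wrote by hand.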
How it works
A property-based framework (rapid or gopter in Go, Hypothesis in Python) generates large numbers of structured random inputs and checks each property against them. When it finds a violation, it shrinks the failing input: it repeatedly tries simpler variants and keeps any that still break the property, handing you a minimal reproduction instead of a haystack.
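The generate-then-shrink loop can be sketched in a few lines of standard-library Python. This is a toy, not how any particular framework is implemented: real shrinkers are considerably smarter. The function `buggy_total` and its cap at 100 are hypothetical, planted so the property has something to find.

```python
import random


def buggy_total(durations):
    """Hypothetical buggy rule: sums durations but silently caps each at 100."""
    return sum(min(d, 100) for d in durations)


def holds(durations):
    """Property: the reported total equals the true sum of the durations."""
    return buggy_total(durations) == sum(durations)


def find_failure(rng, trials=1000):
    """Generate random lists until one violates the property."""
    for _ in range(trials):
        candidate = [rng.randint(0, 500) for _ in range(rng.randint(0, 20))]
        if not holds(candidate):
            return candidate
    return None


def shrink(failing):
    """Greedily simplify a failing input for as long as it keeps failing."""
    current = list(failing)
    progress = True
    while progress:
        progress = False
        # First try dropping one element.
        for i in range(len(current)):
            smaller = current[:i] + current[i + 1:]
            if not holds(smaller):
                current, progress = smaller, True
                break
        if progress:
            continue
        # Then try lowering one element by one.
        for i, value in enumerate(current):
            if value > 0:
                smaller = current[:i] + [value - 1] + current[i + 1:]
                if not holds(smaller):
                    current, progress = smaller, True
                    break
    return current
```

For this planted bug, `shrink([7, 250, 3, 400])` returns `[101]`: one duration, one unit past the hidden cap. That is the reproduction you want to read in a bug report.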
Three layers that work together
- Fixtures — the canonical violation, the boundary cases one unit over and under each threshold, and the explicit non-violations.
- Property tests — the invariants above, run on adversarially generated input.
- Mutation / differential checks — vary one axis at a time; a monotonic detector should fire monotonically.
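The property layer is the one that needs a generator; the first and third layers can be plain code. A minimal sketch, assuming a hypothetical detector `is_breach` with an invented threshold `LIMIT` of 270 minutes and a strict "more than" rule (your regulation may well say "at least", which is exactly what the boundary fixtures pin down):

```python
LIMIT = 270  # hypothetical threshold, in minutes


def is_breach(minutes, limit=LIMIT):
    """Hypothetical detector: a breach is strictly more than the limit."""
    return minutes > limit


# Fixtures: canonical violation, one unit either side of the threshold,
# and an explicit non-violation.
FIXTURES = [
    (400, True),         # canonical violation
    (LIMIT + 1, True),   # one unit over
    (LIMIT, False),      # exactly at the threshold
    (LIMIT - 1, False),  # one unit under
    (0, False),          # explicit non-violation
]


def check_fixtures():
    """Check every fixture; return how many were checked."""
    for minutes, expected in FIXTURES:
        assert is_breach(minutes) == expected, minutes
    return len(FIXTURES)


def check_single_axis(lo=0, hi=600):
    """Sweep one axis; once the detector fires it must stay fired."""
    fired = False
    transitions = 0
    for minutes in range(lo, hi + 1):
        now = is_breach(minutes)
        # A monotonic detector never switches back off as the input worsens.
        assert not (fired and not now), minutes
        if now and not fired:
            transitions += 1
        fired = now
    return transitions
```

Sweeping a single axis keeps every other input fixed, so if the detector ever switches back off, the axis that caused it is already known.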
Why this earns confidence without the benchmark
If the engine satisfies invariants derived from the regulation on adversarially generated input, the surface where it can still be wrong against a private corpus is small, and when a diagnostic does come back, the failing property localises it fast. You are not hoping you guessed the test cases right. You are checking that the rules hold across thousands of inputs you never had to imagine, which is strong evidence rather than formal proof, and far more than a hand-written suite can offer.
I build rules engines with this testing discipline so they hold up against benchmarks I never get to see. If you have compliance logic that has to be right, get in touch.