Really fun work, and the writeup on the math is great. The Beta-Bernoulli conjugacy trick making the marginal likelihood closed-form is elegant.
We ran benchmarks comparing bisect vs bayesect across flakiness levels. At 90/10, bisect drops to ~44% accuracy while bayesect holds at ~96%. At 70/30 it's 9% vs 67%. The entropy-minimization selection is key here since naive median splitting converges much slower.
One thing we found, you can squeeze out another 10-15% accuracy by weighting the prior with code structure. Commits that change highly-connected functions (many transitive dependents in the call graph) are more likely culprits than commits touching isolated code. That prior is free, zero test runs needed.
Information-theoretically, the structural prior gives you I_prior bits before running any test, reducing the total tests needed from log2(n)/D_KL to (log2(n) - I_prior)/D_KL. On 1024-commit repos with 80/20 flakiness: 92% accuracy with graph priors vs 85% pure bayesect vs 10% git bisect.
We're building this into sem (https://github.com/ataraxy-labs/sem), which has an entity dependency graph that provides the structural signal.
> We ran benchmarks comparing bisect vs bayesect across flakiness levels. At 90/10, bisect drops to ~44% accuracy while bayesect holds at ~96%. At 70/30 it's 9% vs 67%.
I don't understand what you're comparing. Can't you increase bayesect accuracy arbitrarily by running it longer? When are you choosing to terminate? Perhaps I don't understand this after all.
Yes, bayesect accuracy increases with more iterations. The comparison was at a fixed budget(300 test runs) when I was running. Sorry should have clarified more on that.
You're right, at 300 tests bayesect converges to ~97-100% across the board. I reran with calibration.py and confirmed.
Went a step further and tested graph-weighted priors (per-commit weight proportional to transitive dependents, Pareto-distributed). The prior helps in the budget-constrained regime:
128 commits, 500 trials:
Budget=50, 70/30: uniform 22% → graph 33%
Budget=50, 80/20: uniform 71% → graph 77%
Budget=100, 70/30: uniform 56% → graph 65%
At 300 tests the gap disappears since there's enough data to converge anyway. The prior is worth a few bits, which matters when bits are scarce.
We ran benchmarks comparing bisect vs bayesect across flakiness levels. At 90/10, bisect drops to ~44% accuracy while bayesect holds at ~96%. At 70/30 it's 9% vs 67%. The entropy-minimization selection is key here since naive median splitting converges much slower.
One thing we found, you can squeeze out another 10-15% accuracy by weighting the prior with code structure. Commits that change highly-connected functions (many transitive dependents in the call graph) are more likely culprits than commits touching isolated code. That prior is free, zero test runs needed.
Information-theoretically, the structural prior gives you I_prior bits before running any test, reducing the total tests needed from log2(n)/D_KL to (log2(n) - I_prior)/D_KL. On 1024-commit repos with 80/20 flakiness: 92% accuracy with graph priors vs 85% pure bayesect vs 10% git bisect.
We're building this into sem (https://github.com/ataraxy-labs/sem), which has an entity dependency graph that provides the structural signal.