What is biston?
biston is a structural clone detector and refactor suggester for Python. It parses your code with tree-sitter, normalizes each function into a canonical AST, and finds groups of functions that are structurally similar — even when local names, literals, and argument order differ. For each match it can also propose an anti-unified template with typed “holes” that you could extract into a shared helper.
Written in Rust and distributed as a Python package, biston runs fast enough to drop into CI pipelines.
Who it’s for
- Python teams tracking copy-paste drift across modules as a codebase grows.
- CI pipelines that want SARIF output wired into code-quality dashboards.
- AI coding agents (and the humans reviewing their PRs) where boilerplate tends to accumulate function by function.
Next
- How It Works — the pipeline, from discovery to anti-unified templates.
Machine-readable docs
Every page on this site is also served as raw Markdown, following the llms.txt convention:
- llms.txt: compact index with links to every page as .md.
- llms-full.txt: all pages concatenated into a single document.
Drop either one into an LLM context window to give the model the full picture without scraping HTML.
How It Works
biston is a pipeline of small passes. Each pass has a single job: discover files, parse them, extract functions, normalize, hash, bucket by locality-sensitive-hashing, compare within buckets, optionally anti-unify matched pairs, and render the report. Nothing talks across pass boundaries except through plain data types, which makes the whole thing easy to test and cheap to run in parallel.
Pipeline overview
graph LR
A[discovery] --> B[parse]
B --> C[extract]
C --> D[normalize]
D --> E[hash + LSH]
E --> F[similarity]
F --> G[anti-unify]
G --> H[report]
Each stage lives in its own module:
| Stage | Module | What it does |
|---|---|---|
| discovery | src/discovery.rs | Walks the tree with the ignore crate; respects .gitignore, include/exclude globs. Test directories and migrations are excluded by default. |
| parse | src/parse.rs | Feeds each file into tree-sitter-python, yields a concrete syntax tree. |
| extract | src/extract.rs | Slices out every function_definition as a FunctionFragment. |
| normalize | src/normalize.rs | Converts each fragment into a NormalizedNode tree — a canonical form. |
| hash + LSH | src/hash.rs | xxhash3 over the normalized tree plus a banded LSH fingerprint. |
| similarity | src/similarity.rs | Pairs candidates that share LSH bands, scores them against a threshold. |
| anti-unify | src/antiunify.rs | Merges matched pairs into a template with typed holes (Phase 2, opt-in via --suggest). |
| report | src/report.rs | Emits CloneReport as text / JSON / SARIF. |
Supporting modules:
| Module | Role |
|---|---|
| src/config.rs | TOML config loader (biston.toml or [tool.biston] in pyproject.toml). |
| src/suppress.rs | Config-level file globs plus inline # biston: ignore comments. |
| src/stats.rs | Aggregate counts used by the stats subcommand. |
| src/lib.rs | Public scan() API; src/main.rs wraps it with a clap CLI. |
Normalization
Two functions can be “the same shape” and still differ in all the surface details — local variable names, literal values, the order of operands to a commutative operator. Normalization strips those details so the hash of a canonical tree is invariant under them.
What the pass does by default:
- Replaces local names with canonical placeholders (v0, v1, …).
- Drops decorators and type annotations.
- Optionally anonymizes literals and sorts commutative operators (toggled in config).
- Records the kind of each node as a &'static str so comparisons stay cheap.
Before (two clearly “the same” functions that differ only in naming and literal values):
def total_price(items):
    total = 0
    for item in items:
        total = total + item.price * 1.2
    return total
def sum_scores(entries):
    acc = 0
    for entry in entries:
        acc = acc + entry.value * 1.5
    return acc
After normalization (schematic — both functions now map to the same shape):
function_definition
  parameters(v0)
  body
    assign(v1, literal)
    for(v2 in v0)
      assign(v1, binary(add, v1, binary(mul, attr(v2, v3), literal)))
    return(v1)
With anonymize_literals = true and sort_commutative = true the two fragments hash to the same value. Without them they still land in the same LSH bucket because most of their structure coincides.
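The effect of the pass can be reproduced in miniature with Python's standard ast module. This is only a sketch under stated assumptions: biston itself normalizes tree-sitter trees in Rust, and the Normalizer class, the normalized_dump helper, and the "<literal>" placeholder are hypothetical names made up for illustration. Attribute names are canonicalized here only to mirror the schematic above, and operand sorting is left out.

```python
import ast


class Normalizer(ast.NodeTransformer):
    """Toy normalizer: canonical names, no decorators/annotations, optional literal anonymization."""

    def __init__(self, anonymize_literals=True):
        self.anonymize_literals = anonymize_literals
        self.names = {}  # original name -> placeholder, in first-seen order

    def _canon(self, name):
        # The first distinct name becomes v0, the next v1, and so on.
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = "f"            # the function's own name is not part of its shape
        node.decorator_list = []   # drop decorators
        node.returns = None        # drop the return annotation
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        node.annotation = None     # drop parameter annotations
        return node

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node

    def visit_Attribute(self, node):
        self.generic_visit(node)   # canonicalize the object first, then the attribute
        node.attr = self._canon(node.attr)
        return node

    def visit_Constant(self, node):
        if self.anonymize_literals:
            return ast.copy_location(ast.Constant(value="<literal>"), node)
        return node


def normalized_dump(source, anonymize_literals=True):
    """Return a canonical text form of the first function in `source`."""
    func = ast.parse(source).body[0]
    return ast.dump(Normalizer(anonymize_literals).visit(func))


SRC_A = """
def total_price(items):
    total = 0
    for item in items:
        total = total + item.price * 1.2
    return total
"""

SRC_B = """
def sum_scores(entries):
    acc = 0
    for entry in entries:
        acc = acc + entry.value * 1.5
    return acc
"""

same_shape = normalized_dump(SRC_A) == normalized_dump(SRC_B)
```

With literals anonymized the two dumps are identical; with anonymize_literals=False they differ only where 1.2 and 1.5 appear.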
Similarity via LSH bands
Comparing every function pairwise is O(n²) and unaffordable on a real repo. biston folds the problem into a locality-sensitive hash:
- The normalized tree is serialised into a stream of node-kind tokens.
- xxhash3 produces a 64-bit fingerprint over that stream, plus a handful of shorter per-band hashes over slices of the same sequence.
- Fragments whose fingerprints agree on any one band land in the same bucket.
- Pairs are scored only within buckets.
A higher band count means more candidate pairs (recall up, precision down); a longer band means fewer hits (recall down, precision up). The defaults are tuned so that a 1 000-file repo produces hundreds of candidate pairs, not millions.
Only pairs whose similarity meets the threshold (default 0.7) end up in the report.
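The bucketing idea can be sketched in a few lines of Python. Treat this as an illustration under assumptions, not biston's code: blake2b from the standard library stands in for xxhash3, difflib's sequence ratio stands in for the real similarity score (which is not specified here), and band_keys, candidate_pairs, clone_pairs, and the token lists are all hypothetical.

```python
import hashlib
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def h64(tokens):
    """64-bit digest of a token slice (blake2b as a stand-in for xxhash3)."""
    data = "|".join(tokens).encode()
    return hashlib.blake2b(data, digest_size=8).hexdigest()


def band_keys(tokens, bands=4):
    """Split the token stream into `bands` contiguous slices and hash each one."""
    size = max(1, -(-len(tokens) // bands))  # ceiling division
    keys = []
    for i in range(bands):
        chunk = tokens[i * size:(i + 1) * size]
        if chunk:  # short fragments may not fill every band
            keys.append((i, h64(chunk)))
    return keys


def candidate_pairs(fragments, bands=4):
    """Bucket fragments by (band index, band hash); only bucket-mates become candidates."""
    buckets = defaultdict(set)
    for name, tokens in fragments.items():
        for key in band_keys(tokens, bands):
            buckets[key].add(name)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def clone_pairs(fragments, threshold=0.7, bands=4):
    """Score candidates only, and keep those at or above the threshold."""
    kept = {}
    for a, b in sorted(candidate_pairs(fragments, bands)):
        score = SequenceMatcher(None, fragments[a], fragments[b]).ratio()
        if score >= threshold:
            kept[(a, b)] = score
    return kept


# Hypothetical node-kind streams: "a" and "b" differ in one node, "c" is unrelated.
FRAGMENTS = {
    "a": ["assign", "literal", "for", "assign", "binary", "binary", "attr", "return"],
    "b": ["assign", "literal", "for", "assign", "binary", "call", "attr", "return"],
    "c": ["if", "compare", "return", "if", "compare", "return", "return", "name"],
}

result = clone_pairs(FRAGMENTS)
```

Here "a" and "b" agree on three of four bands, so they share a bucket and get scored; "c" shares no band with either and is never compared at all.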
Anti-unification
With --suggest (or [suggest] enabled = true in config) biston takes each matched pair and anti-unifies them: it walks both normalized trees in lockstep and replaces every position where they disagree with a typed hole.
Holes are classified by what varied:
- literal: a constant differs (e.g. 1.2 vs 1.5).
- identifier: a name differs that survived normalization (e.g. a global or attribute).
- subtree: a whole subexpression differs.
Each template gets a quality score based on how much shared structure survived vs. how many holes were introduced. Templates with too many holes, or whose coverage falls below min_quality, are dropped — a template that is mostly holes is no better than the original clone.
A worked example. Given these two matched fragments:
def clamp_int(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x
def clamp_float(value, floor, ceiling):
    if value < floor:
        return floor
    if value > ceiling:
        return ceiling
    return value
The renderer produces a template such as:
def <hole:name>(<hole:id:a>, <hole:id:b>, <hole:id:c>):
    if <hole:id:a> < <hole:id:b>:
        return <hole:id:b>
    if <hole:id:a> > <hole:id:c>:
        return <hole:id:c>
    return <hole:id:a>
That’s a ready-made extraction target: three identifier holes, no literal or subtree holes, high coverage score.
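The lockstep walk is compact enough to sketch in Python. The version below works on toy trees made of nested tuples rather than biston's NormalizedNode, leaves the function name out of the tree, and uses a made-up quality formula (shared nodes divided by shared nodes plus distinct holes). Hole, anti_unify, clamp, and that formula are all assumptions for illustration; the real scoring is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hole:
    kind: str   # "identifier", "literal", or "subtree"
    index: int


def anti_unify(a, b):
    """Walk two toy trees in lockstep; every disagreement becomes a typed hole."""
    holes = {}  # (left, right) -> Hole, so a repeated disagreement reuses its hole

    def hole(kind, x, y):
        return holes.setdefault((x, y), Hole(kind, len(holes)))

    def walk(x, y):
        if x == y:
            return x  # shared structure survives as-is
        if x[0] == y[0]:
            if x[0] in ("name", "lit"):
                return hole("identifier" if x[0] == "name" else "literal", x, y)
            if len(x) == len(y):
                return (x[0],) + tuple(walk(p, q) for p, q in zip(x[1:], y[1:]))
        return hole("subtree", x, y)

    return walk(a, b), list(holes.values())


def shared_nodes(tree):
    """Count the nodes of a template that are not holes."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + sum(shared_nodes(child) for child in tree[1:])


def n(name):
    return ("name", name)


def clamp(x, lo, hi):
    """Toy tree for the clamp functions above (function name omitted)."""
    return ("def",
            ("params", n(x), n(lo), n(hi)),
            ("if", ("lt", n(x), n(lo)), ("return", n(lo))),
            ("if", ("gt", n(x), n(hi)), ("return", n(hi))),
            ("return", n(x)))


template, holes = anti_unify(clamp("x", "lo", "hi"),
                             clamp("value", "floor", "ceiling"))
shared = shared_nodes(template)
quality = shared / (shared + len(holes))  # hypothetical formula
```

For the clamp pair this yields three identifier holes and nine shared nodes, so the hypothetical score is 0.75. Keying holes on the disagreeing pair is what lets the same hole reappear at every position where x and value line up.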
Output
The report format is selected with --format or the [output] config section:
- text: the default, grouped by clone family, with source context.
- json: structured dump of CloneReport; easy to post-process.
- sarif: SARIF 2.1.0, for uploading to GitHub code-scanning, GitLab, or other CI dashboards.
The stats subcommand shares the pipeline but emits aggregate counts instead of individual findings.
Configuration & suppression
Config lives in biston.toml or under [tool.biston] in pyproject.toml. CLI flags override config values. File-level and function-level suppression is available via config globs or inline # biston: ignore / # biston: ignore-file comments. The full key-by-key reference lives in the project README.
Scanning tests
Test suites accumulate their own kind of duplication — near-identical cases that could collapse into @pytest.mark.parametrize, copy-pasted arrange/act/assert blocks, repeated fixture plumbing — but that noise usually drowns out production-code findings when mixed into the same report. biston splits the two:
- By default, the scan.exclude globs (tests/**, **/conftest.py, migrations/**) drop test files at the discovery stage, so biston scan and biston stats only see your application code.
- --tests-only (on both scan and stats) inverts the scope: include is replaced with common Python test patterns (**/test_*.py, **/*_test.py, **/conftest.py, tests/**/*.py, **/tests/**/*.py; the last covers monorepo layouts like backend/tests/helpers.py), and exclude is cleared. Other knobs (min_lines, threshold, normalization) are untouched; tune them in biston.toml if your tests want a different baseline than your production code.
Run the two passes separately (e.g. two CI steps, or two cached runs against the same repo) to keep the signal clean.
Focus scanning
For commit hooks and CI steps that only care about the diff, scan and stats accept --files <PATH> (repeatable) and --files-from <PATH|-> (list from file or stdin). Discovery and analysis still run over the whole tree — so a newly-introduced clone of an untouched helper is still found — but only pairs where at least one side lives in the focus set make it to the report. See Commit-hook integration for the git diff recipe.
The llms.txt surface
Every page on this site is also served as raw Markdown at its source path (for this page, how-it-works.md). Two roll-up files round it out:
- llms.txt: index following the llms.txt convention.
- llms-full.txt: all pages concatenated.
That way an LLM can ingest the full docs without scraping HTML, and the links stay stable across deploys.
Commit-hook integration
biston is designed to scan a whole repository, but when you wire it into a pre-commit hook you usually don’t want every unrelated pair in the codebase to fail a commit — only clones that involve the files the committer touched. The --focus-args / --files / --files-from flags narrow the report to those files while still scanning the whole tree, so cross-file clones between a changed file and the rest of the repo are still detected.
With pre-commit / prek
If you use the pre-commit framework (or prek), drop this into .pre-commit-config.yaml:
- repo: https://github.com/mojzis/biston
  rev: v0.5.0
  hooks:
    - id: biston
That wires up biston scan --focus-args, which receives staged Python files as positional arguments and narrows the report to clones involving any of them. An empty staged set (no Python files touched) passes silently. A companion biston-stats hook is available for CI gating on pair counts.
Heads up: if you write your own local hook definition for biston instead of using the repo above, you must set require_serial: true. Without it pre-commit may batch staged files into parallel invocations, and cross-file clones spanning batches will be silently missed, defeating the point of running biston as a hook.
The shell recipe
For raw .git/hooks/pre-commit scripts, or for CI integration outside of pre-commit:
git diff --name-only --diff-filter=ACM -- '*.py' \
| biston scan --files-from - .
What each piece does:
- git diff --name-only --diff-filter=ACM -- '*.py': list Python files that are Added, Copied, or Modified. As written, this compares the working tree against the index; swap in --cached to list staged files in a pre-commit hook, or HEAD~1..HEAD to list the files of the last commit, depending on hook timing.
- biston scan --files-from -: read that list from stdin (one path per line). Paths are resolved relative to the current working directory.
- The positional .: root of the scan. biston still discovers and parses everything under it; the focus list only restricts which pairs make it into the report.
An empty list (no Python files changed) correctly emits no pairs — the hook passes silently. That’s why --files-from - is the right shape for hooks: --files $(git diff --name-only) silently expands to nothing when the diff is empty, which reverts to a full-repo scan and can trip the hook on pre-existing clones unrelated to the commit.
Semantics
Given a repo with clones A ↔ B (inside the committer’s change) and C ↔ D (elsewhere):
| Invocation | Pairs emitted |
|---|---|
| biston scan . | A↔B, C↔D |
| biston scan --files A.py . | A↔B (and any A↔X with X anywhere in the repo) |
| biston scan --focus-args A.py | A↔B (same as --files A.py .) |
| biston scan --files-from - . with empty stdin | (none) |
| biston scan --focus-args (no positionals) | (none) |
The three focus modes — --files, --files-from, and --focus-args — are mutually exclusive; pass only one per invocation.
A focus path that can’t be resolved (e.g. a file deleted in the same changeset) is warned about and skipped, not treated as a fatal error — the scan continues with whatever focus paths did resolve.
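The rows of the table reduce to a single filter over the pairs found by the whole-tree scan. The Python below is a sketch of that rule, not biston's Rust implementation; filter_pairs and its focus parameter are hypothetical names. Here focus=None models an invocation with no focus flag at all, while an empty set models a focus flag that received an empty list.

```python
from typing import Iterable, List, Optional, Set, Tuple

Pair = Tuple[str, str]


def filter_pairs(pairs: Iterable[Pair], focus: Optional[Set[str]] = None) -> List[Pair]:
    """Keep a pair when at least one side lives in the focus set."""
    if focus is None:
        # No focus flag given: report every pair found in the tree.
        return list(pairs)
    # Focus flag given (possibly with an empty list): filter strictly.
    return [pair for pair in pairs if pair[0] in focus or pair[1] in focus]


ALL_PAIRS = [("A.py", "B.py"), ("C.py", "D.py")]

full_scan = filter_pairs(ALL_PAIRS)              # both pairs
focused = filter_pairs(ALL_PAIRS, {"A.py"})      # only the pair touching A.py
empty_focus = filter_pairs(ALL_PAIRS, set())     # nothing
```

Keeping "no focus" and "empty focus" as distinct states is the design point: an empty diff should silence the hook rather than fall back to reporting the whole repository.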
Tips
- Use --diff-filter=ACM to avoid passing deleted files. biston tolerates them, but it's clearer at the hook level.
- Combine with --format sarif if your CI wants to upload findings as annotations; the SARIF output is filtered the same way.
- stats supports the same flags, so you can gate a hook on a numeric threshold: biston stats --files-from - --format json . | jq '.clone_pairs'.
- For a dry run, drop --files-from to see the full-repo report and confirm the focused scan isn't hiding surprises.
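If jq isn't available, the numeric gate can live in a few lines of Python instead. This is a minimal sketch, assuming the JSON emitted by biston stats carries an integer count under the clone_pairs key: the jq filter above suggests the key, but the integer type is an assumption, and gate and max_pairs are hypothetical names.

```python
import json


def gate(stats_json: str, max_pairs: int) -> int:
    """Return a process exit code: 0 if the pair count is within budget, 1 otherwise."""
    # Assumption: "clone_pairs" holds an integer count in the stats JSON.
    count = json.loads(stats_json)["clone_pairs"]
    return 0 if count <= max_pairs else 1


# Hypothetical payloads standing in for real `biston stats --format json` output.
ok = gate('{"clone_pairs": 2}', max_pairs=5)
too_many = gate('{"clone_pairs": 9}', max_pairs=5)
```

In a hook, the string would come from the captured stdout of the stats command, and the return value would be passed to sys.exit.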
Repeated --files
For one-off use outside a hook, --files is repeatable:
biston scan --files src/auth.py --files src/session.py .
Each --files takes a single path; repeat the flag to add more. This form conflicts with --files-from — pick one per invocation.