What is biston?

biston is a structural clone detector and refactor suggester for Python. It parses your code with tree-sitter, normalizes each function into a canonical AST, and finds groups of functions that are structurally similar — even when local names, literals, and argument order differ. For each match it can also propose an anti-unified template with typed “holes” that you could extract into a shared helper.

Written in Rust and distributed as a Python package, biston runs fast enough to drop into CI pipelines.

Who it’s for

Python teams tracking copy-paste drift across modules as a codebase grows.
CI pipelines that want SARIF output wired into code-quality dashboards.
AI coding agents (and the humans reviewing their PRs) where boilerplate tends to accumulate function by function.

How It Works — the pipeline, from discovery to anti-unified templates.

Machine-readable docs

Every page on this site is also served as raw Markdown, following the llms.txt convention:

llms.txt — compact index with links to every page as .md.
llms-full.txt — all pages concatenated into a single document.

Drop either one into an LLM context window to give the model the full picture without scraping HTML.

How It Works

biston is a pipeline of small passes. Each pass has a single job: discover files, parse them, extract functions, normalize, hash, bucket by locality-sensitive hashing, compare within buckets, optionally anti-unify matched pairs, and render the report. Nothing talks across pass boundaries except through plain data types, which makes the whole thing easy to test and cheap to run in parallel.

Pipeline overview

graph LR
    A[discovery] --> B[parse]
    B --> C[extract]
    C --> D[normalize]
    D --> E[hash + LSH]
    E --> F[similarity]
    F --> G[anti-unify]
    G --> H[report]

Each stage lives in its own module:

Stage	Module	What it does
discovery	`src/discovery.rs`	Walks the tree with the `ignore` crate; respects `.gitignore`, include/exclude globs. Test directories and migrations are excluded by default.
parse	`src/parse.rs`	Feeds each file into tree-sitter-python, yields a concrete syntax tree.
extract	`src/extract.rs`	Slices out every `function_definition` as a `FunctionFragment`.
normalize	`src/normalize.rs`	Converts each fragment into a `NormalizedNode` tree — a canonical form.
hash + LSH	`src/hash.rs`	xxhash3 over the normalized tree plus a banded LSH fingerprint.
similarity	`src/similarity.rs`	Pairs candidates that share LSH bands, scores them against a threshold.
anti-unify	`src/antiunify.rs`	Merges matched pairs into a template with typed holes (Phase 2, opt-in via `--suggest`).
report	`src/report.rs`	Emits `CloneReport` as text / JSON / SARIF.

Supporting modules:

Module	Role
`src/config.rs`	TOML config loader (`biston.toml` or `[tool.biston]` in `pyproject.toml`).
`src/suppress.rs`	Config-level file globs plus inline `# biston: ignore` comments.
`src/stats.rs`	Aggregate counts used by the `stats` subcommand.
`src/lib.rs`	Public `scan()` API; `src/main.rs` wraps it with a `clap` CLI.
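
To make the extract stage concrete, here is a minimal Python sketch of the same idea. It is an approximation, not biston's code: it uses the standard-library ast module as a stand-in for tree-sitter, and the fields on FunctionFragment are illustrative assumptions rather than the layout of the Rust struct.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionFragment:
    # Hypothetical fields for illustration; the real struct may differ.
    name: str
    start_line: int
    end_line: int


def extract(source: str) -> list[FunctionFragment]:
    """Return one fragment per function definition, nested ones included."""
    tree = ast.parse(source)
    fragments = [
        FunctionFragment(node.name, node.lineno, node.end_lineno)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    # ast.walk is breadth-first, so sort to get source order.
    return sorted(fragments, key=lambda fragment: fragment.start_line)


SAMPLE = (
    "def a():\n"
    "    def b():\n"
    "        return 1\n"
    "    return b\n"
    "\n"
    "async def c():\n"
    "    pass\n"
)

for fragment in extract(SAMPLE):
    print(fragment)
```

The output is a list of plain values, which is the property the pipeline relies on: the next pass needs nothing from the parser beyond these records.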

Normalization

Two functions can be “the same shape” and still differ in all the surface details — local variable names, literal values, the order of operands to a commutative operator. Normalization strips those details so the hash of a canonical tree is invariant under them.

What the pass does by default:

Replaces local names with canonical placeholders (v0, v1, …).
Drops decorators and type annotations.
Optionally anonymizes literals and sorts commutative operators (toggled in config).
Records the kind of each node as a &'static str so comparisons stay cheap.

Before (two clearly “the same” functions that differ only in naming and literal values):

def total_price(items):
    total = 0
    for item in items:
        total = total + item.price * 1.2
    return total

def sum_scores(entries):
    acc = 0
    for entry in entries:
        acc = acc + entry.value * 1.5
    return acc

After normalization (schematic — both functions now map to the same shape):

function_definition
  parameters(v0)
  body
    assign(v1, literal)
    for(v2 in v0)
      assign(v1, binary(add, v1, binary(mul, attr(v2, v3), literal)))
    return(v1)

With anonymize_literals = true and sort_commutative = true the two fragments hash to the same value. Without them they still land in the same LSH bucket because most of their structure coincides.
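
The effect can be reproduced in a few lines of Python. This is a rough sketch under stated assumptions, not biston's normalizer: it uses the standard-library ast module instead of tree-sitter, canonicalizes attribute names as the schematic above does, and leaves out commutative sorting. The class and function names are made up for the example.

```python
import ast

TOTAL_PRICE = """
def total_price(items):
    total = 0
    for item in items:
        total = total + item.price * 1.2
    return total
"""

SUM_SCORES = """
def sum_scores(entries):
    acc = 0
    for entry in entries:
        acc = acc + entry.value * 1.5
    return acc
"""


class Canonicalize(ast.NodeTransformer):
    """Rename identifiers to v0, v1, ... in first-seen order."""

    def __init__(self, anonymize_literals=True):
        self.names = {}
        self.anonymize_literals = anonymize_literals

    def _placeholder(self, name):
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = "f"            # the function's own name is not compared
        node.decorator_list = []   # drop decorators
        node.returns = None        # drop the return annotation
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._placeholder(node.arg)
        node.annotation = None     # drop parameter annotations
        return node

    def visit_Name(self, node):
        # Simplification: globals and builtins are renamed too.
        node.id = self._placeholder(node.id)
        return node

    def visit_Attribute(self, node):
        self.generic_visit(node)   # rename the object first, then the attribute
        node.attr = self._placeholder(node.attr)
        return node

    def visit_Constant(self, node):
        if self.anonymize_literals:
            return ast.Constant(value="<literal>")
        return node


def normalize(source, anonymize_literals=True):
    """Return a canonical string for the first function in `source`."""
    func = ast.parse(source).body[0]
    return ast.dump(Canonicalize(anonymize_literals).visit(func))


print(normalize(TOTAL_PRICE) == normalize(SUM_SCORES))
print(normalize(TOTAL_PRICE, False) == normalize(SUM_SCORES, False))
```

With literals anonymized the two canonical strings are identical; with the option off they differ only at the 1.2 / 1.5 constant.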

Similarity via LSH bands

Comparing every function pairwise is O(n²) and unaffordable on a real repo. biston folds the problem into a locality-sensitive hash:

The normalized tree is serialised into a stream of node-kind tokens.
xxhash3 produces a 64-bit fingerprint over that stream, plus a handful of shorter per-band hashes over slices of the same sequence.
Fragments whose fingerprints agree on any one band land in the same bucket.
Pairs are scored only within buckets.

A higher band count means more candidate pairs (recall up, precision down); a longer band means fewer hits (recall down, precision up). The defaults are tuned so that a 1 000-file repo produces hundreds of candidate pairs, not millions.

Only pairs whose similarity meets the threshold (default 0.7) end up in the report.
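
The banding and scoring steps can be sketched in Python as follows. Several things here are assumptions made for illustration: blake2b from hashlib stands in for xxhash3, bands are equal-sized contiguous slices of the token stream, and difflib's SequenceMatcher ratio stands in for whatever similarity measure biston actually computes. The function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def band_keys(tokens, bands=4):
    """Yield (band index, digest) for contiguous slices of a token stream."""
    size = -(-len(tokens) // bands)  # ceiling division
    for i in range(bands):
        chunk = tokens[i * size:(i + 1) * size]
        if chunk:  # skip empty trailing slices so short fragments don't collide
            digest = hashlib.blake2b(" ".join(chunk).encode(), digest_size=8)
            yield i, digest.hexdigest()


def candidate_pairs(fragments, bands=4):
    """Pair up fragments that agree on at least one band."""
    buckets = defaultdict(set)
    for name, tokens in fragments.items():
        for key in band_keys(tokens, bands):
            buckets[key].add(name)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def clone_pairs(fragments, threshold=0.7, bands=4):
    """Score only the candidate pairs and keep those at or above threshold."""
    scored = {}
    for left, right in candidate_pairs(fragments, bands):
        score = SequenceMatcher(None, fragments[left], fragments[right]).ratio()
        if score >= threshold:
            scored[(left, right)] = score
    return scored


FRAGMENTS = {
    "a": ["function_definition", "parameters", "assign", "for",
          "assign", "binary", "return", "identifier"],
    "b": ["function_definition", "parameters", "assign", "for",
          "call", "call", "return", "identifier"],
    "c": ["function_definition", "lambda", "try", "with",
          "raise", "yield", "await", "pass"],
}

print(clone_pairs(FRAGMENTS))
```

Fragments a and b agree on three of their four bands and so get scored; c shares no band with either and is never compared, which is where the savings over pairwise comparison come from.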

Anti-unification

With --suggest (or [suggest] enabled = true in config) biston takes each matched pair and anti-unifies them: it walks both normalized trees in lockstep and replaces every position where they disagree with a typed hole.

Holes are classified by what varied:

literal — a constant differs (e.g. 1.2 vs 1.5).
identifier — a name differs that survived normalization (e.g. a global or attribute).
subtree — a whole subexpression differs.

Each template gets a quality score based on how much shared structure survived vs. how many holes were introduced. Templates with too many holes, or whose coverage falls below min_quality, are dropped — a template that is mostly holes is no better than the original clone.

A worked example. Given these two matched fragments:

def clamp_int(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

def clamp_float(value, floor, ceiling):
    if value < floor:
        return floor
    if value > ceiling:
        return ceiling
    return value

The renderer produces a template such as:

def <hole:name>(<hole:id:a>, <hole:id:b>, <hole:id:c>):
    if <hole:id:a> < <hole:id:b>:
        return <hole:id:b>
    if <hole:id:a> > <hole:id:c>:
        return <hole:id:c>
    return <hole:id:a>

That’s a ready-made extraction target: the function name plus three identifier holes, no literal or subtree holes, and a high coverage score.
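
The lockstep walk itself is short. Below is a minimal Python sketch of anti-unification over toy trees, assuming a made-up encoding: nested tuples with the node kind first, identifiers as strings and literals as numbers. The hole naming and the helper names are illustrative, not biston's renderer output.

```python
def hole_kind(left, right):
    """Classify a disagreement as subtree, identifier, or literal."""
    if isinstance(left, tuple) or isinstance(right, tuple):
        return "subtree"
    if isinstance(left, str) and isinstance(right, str):
        return "identifier"
    return "literal"  # toy encoding: anything else is a numeric constant


def anti_unify(left, right, holes):
    """Generalise two trees into one template, recording holes in `holes`."""
    if left == right:
        return left
    same_shape = (
        isinstance(left, tuple) and isinstance(right, tuple)
        and len(left) == len(right) and left[:1] == right[:1]
    )
    if same_shape:
        children = (anti_unify(l, r, holes) for l, r in zip(left[1:], right[1:]))
        return (left[0], *children)
    # Disagreement: reuse the hole if this exact pair was seen before.
    key = (left, right)
    if key not in holes:
        holes[key] = f"<hole:{hole_kind(left, right)}:{len(holes)}>"
    return holes[key]


LEFT = ("return", ("call", "round", ("mul", "v0", 1.2)))
RIGHT = ("return", ("call", "int", ("mul", "v0", 1.5)))

holes = {}
template = anti_unify(LEFT, RIGHT, holes)
print(template)
print(holes)
```

Reusing one hole per distinct pair of disagreeing values is what lets a template use the same hole at every position where the two fragments vary together, as in the clamp example above.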

Output

The report format is selected with --format or the [output] config section:

text — the default, grouped by clone family, with source context.
json — structured dump of CloneReport; easy to post-process.
sarif — SARIF 2.1.0, for uploading to GitHub code-scanning, GitLab, or other CI dashboards.

The stats subcommand shares the pipeline but emits aggregate counts instead of individual findings.

Configuration & suppression

Config lives in biston.toml or under [tool.biston] in pyproject.toml. CLI flags override config values. File-level and function-level suppression is available via config globs or inline # biston: ignore / # biston: ignore-file comments. The full key-by-key reference lives in the project README.

Scanning tests

Test suites accumulate their own kind of duplication — near-identical cases that could collapse into @pytest.mark.parametrize, copy-pasted arrange/act/assert blocks, repeated fixture plumbing — but that noise usually drowns out production-code findings when mixed into the same report. biston splits the two:

By default, the scan.exclude globs (tests/**, **/conftest.py, migrations/**) drop test files at the discovery stage, so biston scan and biston stats only see your application code.
--tests-only (on both scan and stats) inverts the scope: include is replaced with common Python test patterns (**/test_*.py, **/*_test.py, **/conftest.py, tests/**/*.py, **/tests/**/*.py — the last covering monorepo layouts like backend/tests/helpers.py), and exclude is cleared. Other knobs (min_lines, threshold, normalization) are untouched; tune them in biston.toml if your tests want a different baseline than your production code.

Run the two passes separately (e.g. two CI steps, or two cached runs against the same repo) to keep the signal clean.

Focus scanning

For commit hooks and CI steps that only care about the diff, scan and stats accept --files <PATH> (repeatable) and --files-from <PATH|-> (list from file or stdin). Discovery and analysis still run over the whole tree — so a newly-introduced clone of an untouched helper is still found — but only pairs where at least one side lives in the focus set make it to the report. See Commit-hook integration for the git diff recipe.

The llms.txt surface

Every page on this site is also served as raw Markdown at its source path (for this page, how-it-works.md). Two roll-up files round it out:

llms.txt — index following the llms.txt convention.
llms-full.txt — all pages concatenated.

That way an LLM can ingest the full docs without scraping HTML, and the links stay stable across deploys.

Commit-hook integration

biston is designed to scan a whole repository, but when you wire it into a pre-commit hook you usually don’t want every unrelated pair in the codebase to fail a commit — only clones that involve the files the committer touched. The --focus-args / --files / --files-from flags narrow the report to those files while still scanning the whole tree, so cross-file clones between a changed file and the rest of the repo are still detected.

With pre-commit / prek

If you use the pre-commit framework (or prek), drop this into .pre-commit-config.yaml:

  - repo: https://github.com/mojzis/biston
    rev: v0.5.0
    hooks:
      - id: biston

That wires up biston scan --focus-args, which receives staged Python files as positional arguments and narrows the report to clones involving any of them. An empty staged set (no Python files touched) passes silently. A companion biston-stats hook is available for CI gating on pair counts.

Heads up: if you write your own local hook definition for biston instead of using the repo above, you must set require_serial: true. Without it pre-commit may batch staged files into parallel invocations, and cross-file clones spanning batches will be silently missed — defeating the point of running biston as a hook.

The shell recipe

For raw .git/hooks/pre-commit scripts, or for CI integration outside of pre-commit:

git diff --name-only --diff-filter=ACM -- '*.py' \
  | biston scan --files-from - .

What each piece does:

git diff --name-only --diff-filter=ACM -- '*.py' lists Python files that are Added, Copied, or Modified. As written, the command compares the working tree against the index; add --cached to list staged files in a pre-commit hook, or pass HEAD~1..HEAD to inspect the last commit in CI.
biston scan --files-from - reads that list from stdin (one path per line). Paths are resolved relative to the current working directory.
The positional . is the root of the scan. biston still discovers and parses everything under it; the focus list only restricts which pairs make it into the report.

An empty list (no Python files changed) correctly emits no pairs — the hook passes silently. That’s why --files-from - is the right shape for hooks: --files $(git diff --name-only) silently expands to nothing when the diff is empty, which reverts to a full-repo scan and can trip the hook on pre-existing clones unrelated to the commit.
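
If the hook is easier to maintain in Python than in shell, the same recipe might look like the sketch below. It uses only flags already shown on this page plus git's --cached; the function name and the injectable run parameter are conveniences of the example, and the sketch simply forwards whatever exit code biston returns.

```python
import subprocess

# Staged Python files that were Added, Copied, or Modified.
GIT_CMD = ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM", "--", "*.py"]
# Read the focus list from stdin; scan the current directory.
BISTON_CMD = ["biston", "scan", "--files-from", "-", "."]


def run_hook(run=subprocess.run):
    """Pipe the staged file list into biston and return its exit code."""
    staged = run(GIT_CMD, capture_output=True, text=True, check=True).stdout
    # An empty list is still passed on stdin, so the focus set stays empty
    # instead of falling back to a full-repo report.
    result = run(BISTON_CMD, input=staged, text=True)
    return result.returncode
```

A script saved as .git/hooks/pre-commit would finish with sys.exit(run_hook()) after importing sys. Passing the list through stdin keeps the empty-diff behaviour described above.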

Semantics

Given a repo with clones A ↔ B (inside the committer’s change) and C ↔ D (elsewhere):

Invocation	Pairs emitted
`biston scan .`	`A↔B`, `C↔D`
`biston scan --files A.py .`	`A↔B` (and any `A↔X` with X anywhere in the repo)
`biston scan --focus-args A.py`	`A↔B` (same as `--files A.py .`)
`biston scan --files-from - .` with empty stdin	(none)
`biston scan --focus-args` (no positionals)	(none)

The three focus modes — --files, --files-from, and --focus-args — are mutually exclusive; pass only one per invocation.

A focus path that can’t be resolved (e.g. a file deleted in the same changeset) is warned about and skipped, not treated as a fatal error — the scan continues with whatever focus paths did resolve.
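
The rule behind the table fits in one small function. This Python sketch is a model of the semantics only, with pairs simplified to file paths and a hypothetical function name; note the difference between no focus flag at all (None) and a focus flag with an empty list.

```python
def filter_pairs(pairs, focus=None):
    """Keep pairs with at least one side in `focus`.

    focus=None models an invocation without any focus flag (report everything);
    an empty focus models e.g. --files-from - with empty stdin (report nothing).
    """
    if focus is None:
        return list(pairs)
    focus = set(focus)
    return [pair for pair in pairs if focus & set(pair)]


PAIRS = [("A.py", "B.py"), ("C.py", "D.py")]

print(filter_pairs(PAIRS))            # no focus flag
print(filter_pairs(PAIRS, ["A.py"]))  # focus on A.py
print(filter_pairs(PAIRS, []))        # focus flag given, nothing listed
```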

Tips

Use --diff-filter=ACM to avoid passing deleted files. biston tolerates them, but it’s clearer at the hook level.
Combine with --format sarif if your CI wants to upload findings as annotations — the SARIF output is filtered the same way.
stats supports the same flags, so you can gate a hook on a numeric threshold: biston stats --files-from - --format json . | jq '.clone_pairs'.
For a dry run, drop --files-from to see the full-repo report and confirm the focused scan isn’t hiding surprises.
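
Where jq is not available, the numeric gate from the stats tip can be written in Python. This sketch assumes clone_pairs is a top-level integer in the JSON output, which is inferred from the jq expression above rather than from a documented schema; the function name and the budget value are hypothetical.

```python
import json


def over_budget(stats_json: str, max_pairs: int) -> bool:
    """True when the reported clone-pair count exceeds the allowed budget."""
    stats = json.loads(stats_json)
    return int(stats["clone_pairs"]) > max_pairs


# Example payload with only the field this gate reads.
print(over_budget('{"clone_pairs": 3}', max_pairs=2))
```

A hook would feed it the stdout of biston stats --files-from - --format json . and exit non-zero when it returns True.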

Repeated `--files`

For one-off use outside a hook, --files is repeatable:

biston scan --files src/auth.py --files src/session.py .

Each --files takes a single path; repeat the flag to add more. This form conflicts with --files-from — pick one per invocation.
