# biston — Full Documentation

# What is biston?

biston is a structural clone detector and refactor suggester for Python. It parses your code with [tree-sitter](https://tree-sitter.github.io/tree-sitter/), normalizes each function into a canonical AST, and finds groups of functions that are structurally similar — even when local names, literals, and argument order differ. For each match it can also propose an anti-unified template with typed "holes" that you could extract into a shared helper. Written in Rust and distributed as a Python package, biston runs fast enough to drop into CI pipelines.

## Who it's for

- **Python teams** tracking copy-paste drift across modules as a codebase grows.
- **CI pipelines** that want SARIF output wired into code-quality dashboards.
- **AI coding agents** (and the humans reviewing their PRs) where boilerplate tends to accumulate function by function.

## Next

- [How It Works](how-it-works.md) — the pipeline, from discovery to anti-unified templates.

## Machine-readable docs

Every page on this site is also served as raw Markdown, following the [llms.txt](https://llmstxt.org) convention:

- [`llms.txt`](llms.txt) — compact index with links to every page as `.md`.
- [`llms-full.txt`](llms-full.txt) — all pages concatenated into a single document.

Drop either one into an LLM context window to give the model the full picture without scraping HTML.

---

# How It Works

biston is a pipeline of small passes. Each pass has a single job: discover files, parse them, extract functions, normalize, hash, bucket by locality-sensitive hashing, compare within buckets, optionally anti-unify matched pairs, and render the report. Nothing talks across pass boundaries except through plain data types, which makes the whole thing easy to test and cheap to run in parallel.
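The pass-boundary idea can be pictured as plain functions over plain data. A minimal Python sketch, with illustrative stand-ins for biston's internal types (the real passes are Rust and exchange richer data):

```python
from dataclasses import dataclass

# Illustrative stand-ins for biston's internal types; the real passes
# live in Rust and exchange richer data (trees, hashes, spans).
@dataclass(frozen=True)
class Fragment:
    path: str
    name: str
    source: str

def discover(entries):
    # discovery: keep only Python files
    return [e for e in entries if e[0].endswith(".py")]

def extract(entries):
    # extract: one Fragment per (path, function name, body) triple
    return [Fragment(path, name, src) for path, name, src in entries]

def normalize(fragment):
    # normalize: a whitespace-insensitive stand-in for the canonical tree
    return " ".join(fragment.source.split())

def run(entries):
    # every pass consumes only the previous pass's plain output
    return {f: normalize(f) for f in extract(discover(entries))}

report = run([
    ("a.py", "total_price", "return  total"),
    ("b.py", "sum_scores", "return total"),
    ("notes.txt", "-", "not python"),
])
print(len(report))  # 2: the notes.txt entry was dropped at discovery
```

Because every stage takes and returns plain values, each one can be tested in isolation, which is the property the real pipeline is built around.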
## Pipeline overview

```mermaid
graph LR
    A[discovery] --> B[parse]
    B --> C[extract]
    C --> D[normalize]
    D --> E[hash + LSH]
    E --> F[similarity]
    F --> G[anti-unify]
    G --> H[report]
```

Each stage lives in its own module:

| Stage | Module | What it does |
|-------|--------|--------------|
| discovery | `src/discovery.rs` | Walks the tree with the `ignore` crate; respects `.gitignore` and include/exclude globs. Test directories and migrations are excluded by default. |
| parse | `src/parse.rs` | Feeds each file into tree-sitter-python, yields a concrete syntax tree. |
| extract | `src/extract.rs` | Slices out every `function_definition` as a `FunctionFragment`. |
| normalize | `src/normalize.rs` | Converts each fragment into a `NormalizedNode` tree — a canonical form. |
| hash + LSH | `src/hash.rs` | xxhash3 over the normalized tree plus a banded LSH fingerprint. |
| similarity | `src/similarity.rs` | Pairs candidates that share LSH bands, scores them against a threshold. |
| anti-unify | `src/antiunify.rs` | Merges matched pairs into a template with typed holes (Phase 2, opt-in via `--suggest`). |
| report | `src/report.rs` | Emits `CloneReport` as text / JSON / SARIF. |

Supporting modules:

| Module | Role |
|--------|------|
| `src/config.rs` | TOML config loader (`biston.toml` or `[tool.biston]` in `pyproject.toml`). |
| `src/suppress.rs` | Config-level file globs plus inline `# biston: ignore` comments. |
| `src/stats.rs` | Aggregate counts used by the `stats` subcommand. |
| `src/lib.rs` | Public `scan()` API; `src/main.rs` wraps it with a `clap` CLI. |

## Normalization

Two functions can be "the same shape" and still differ in all the surface details — local variable names, literal values, the order of operands to a commutative operator. Normalization strips those details so the hash of a canonical tree is invariant under them.

What the pass does by default:

- Replaces local names with canonical placeholders (`v0`, `v1`, …).
- Drops decorators and type annotations.
- Optionally anonymizes literals and sorts commutative operators (toggled in config).
- Records the kind of each node as a `&'static str` so comparisons stay cheap.

Before (two clearly "the same" functions that differ only in naming and literal values):

```python
def total_price(items):
    total = 0
    for item in items:
        total = total + item.price * 1.2
    return total

def sum_scores(entries):
    acc = 0
    for entry in entries:
        acc = acc + entry.value * 1.5
    return acc
```

After normalization (schematic — both functions now map to the same shape):

```text
function_definition
  parameters(v0)
  body
    assign(v1, literal)
    for(v2 in v0)
      assign(v1, binary(add, v1, binary(mul, attr(v2, v3), literal)))
    return(v1)
```

With `anonymize_literals = true` and `sort_commutative = true` the two fragments hash to the same value. Without them they still land in the same LSH bucket because most of their structure coincides.

## Similarity via LSH bands

Comparing every function pairwise is O(n²) and unaffordable on a real repo. biston folds the problem into a locality-sensitive hash:

1. The normalized tree is serialised into a stream of node-kind tokens.
2. `xxhash3` produces a 64-bit fingerprint over that stream, plus a handful of shorter per-band hashes over slices of the same sequence.
3. Fragments whose fingerprints agree on *any one* band land in the same bucket.
4. Pairs are scored only within buckets.

A higher band count means more candidate pairs (recall up, precision down); a longer band means fewer hits (recall down, precision up). The defaults are tuned so that a 1,000-file repo produces hundreds of candidate pairs, not millions. Only pairs whose similarity meets the `threshold` (default `0.7`) end up in the report.

## Anti-unification

With `--suggest` (or `[suggest] enabled = true` in config) biston takes each matched pair and **anti-unifies** them: it walks both normalized trees in lockstep and replaces every position where they disagree with a typed *hole*.
Holes are classified by what varied:

- `literal` — a constant differs (e.g. `1.2` vs `1.5`).
- `identifier` — a name differs that survived normalization (e.g. a global or attribute).
- `subtree` — a whole subexpression differs.

Each template gets a quality score based on how much shared structure survived vs. how many holes were introduced. Templates with too many holes, or whose coverage falls below `min_quality`, are dropped — a template that is mostly holes is no better than the original clone.

A worked example. Given these two matched fragments:

```python
def clamp_int(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

def clamp_float(value, floor, ceiling):
    if value < floor:
        return floor
    if value > ceiling:
        return ceiling
    return value
```

The renderer produces a template such as:

```python
def <fn>(<id0>, <id1>, <id2>):
    if <id0> < <id1>:
        return <id1>
    if <id0> > <id2>:
        return <id2>
    return <id0>
```

That's a ready-made extraction target: three identifier holes, no literal or subtree holes, high coverage score.

## Output

The report format is selected with `--format` or the `[output]` config section:

- `text` — the default, grouped by clone family, with source context.
- `json` — structured dump of `CloneReport`; easy to post-process.
- `sarif` — [SARIF 2.1.0](https://sarifweb.azurewebsites.net/), for uploading to GitHub code-scanning, GitLab, or other CI dashboards.

The `stats` subcommand shares the pipeline but emits aggregate counts instead of individual findings.

## Configuration & suppression

Config lives in `biston.toml` or under `[tool.biston]` in `pyproject.toml`. CLI flags override config values. File-level and function-level suppression is available via config globs or inline `# biston: ignore` / `# biston: ignore-file` comments. The full key-by-key reference lives in the [project README](https://github.com/mojzis/biston#configuration).
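To make the lockstep walk concrete, here is a toy anti-unifier over nested tuples. It is a simplification of what `src/antiunify.rs` does over `NormalizedNode` trees; the tuple encoding and the hole representation are invented for illustration:

```python
def antiunify(a, b):
    """Return a template where mismatched positions become ('hole', a, b)."""
    # matching interior nodes: same tag, same arity -> recurse child-wise
    if (isinstance(a, tuple) and isinstance(b, tuple)
            and a and b and a[0] == b[0] and len(a) == len(b)):
        return (a[0],) + tuple(antiunify(x, y) for x, y in zip(a[1:], b[1:]))
    # identical leaves survive unchanged
    if a == b:
        return a
    # any other disagreement becomes a hole recording both sides
    return ("hole", a, b)

# two expressions that differ only in one literal, as in the 1.2 vs 1.5 example
left  = ("binary", "mul", ("name", "x"), ("lit", 1.2))
right = ("binary", "mul", ("name", "x"), ("lit", 1.5))

template = antiunify(left, right)
print(template)
# ('binary', 'mul', ('name', 'x'), ('lit', ('hole', 1.2, 1.5)))
```

Classifying each hole (literal vs identifier vs subtree) then falls out of looking at what the mismatched positions contained, and the quality score is a function of how many nodes survived versus how many became holes.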
## Scanning tests

Test suites accumulate their own kind of duplication — near-identical cases that could collapse into `@pytest.mark.parametrize`, copy-pasted arrange/act/assert blocks, repeated fixture plumbing — but that noise usually drowns out production-code findings when mixed into the same report. biston splits the two:

- **By default**, the `scan.exclude` globs (`tests/**`, `**/conftest.py`, `migrations/**`) drop test files at the discovery stage, so `biston scan` and `biston stats` only see your application code.
- **`--tests-only`** (on both `scan` and `stats`) inverts the scope: `include` is replaced with common Python test patterns (`**/test_*.py`, `**/*_test.py`, `**/conftest.py`, `tests/**/*.py`, `**/tests/**/*.py` — the last covering monorepo layouts like `backend/tests/helpers.py`), and `exclude` is cleared.

Other knobs (`min_lines`, `threshold`, normalization) are untouched; tune them in `biston.toml` if your tests want a different baseline than your production code. Run the two passes separately (e.g. two CI steps, or two cached runs against the same repo) to keep the signal clean.

## Focus scanning

For commit hooks and CI steps that only care about the diff, `scan` and `stats` accept `--files <path>` (repeatable) and `--files-from <file>` (a list of paths read from a file, or from stdin with `-`). Discovery and analysis still run over the whole tree — so a newly introduced clone of an untouched helper is still found — but only pairs where at least one side lives in the focus set make it to the report. See [Commit-hook integration](commit-hooks.md) for the `git diff` recipe.

## The llms.txt surface

Every page on this site is also served as raw Markdown at its source path (for this page, `how-it-works.md`). Two roll-up files round it out:

- [`llms.txt`](llms.txt) — index following the [llms.txt](https://llmstxt.org) convention.
- [`llms-full.txt`](llms-full.txt) — all pages concatenated.

That way an LLM can ingest the full docs without scraping HTML, and the links stay stable across deploys.
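The focus rule from the Focus scanning section (report a pair only if at least one endpoint is in the focus set) amounts to a simple filter over scored pairs. A sketch with a hypothetical pair representation; biston applies the equivalent logic internally after scoring:

```python
def filter_pairs(pairs, focus):
    """Keep pairs where at least one side is in the focus set.

    An empty focus set reports nothing, matching the empty-stdin
    behaviour of `--files-from -` described on the commit-hooks page.
    """
    focus = set(focus)
    return [(a, b) for a, b in pairs if a in focus or b in focus]

pairs = [("A.py", "B.py"), ("C.py", "D.py")]

print(filter_pairs(pairs, ["A.py"]))  # [('A.py', 'B.py')]
print(filter_pairs(pairs, []))        # []
```

Note that the filter runs after the full-tree scan, which is why a clone between a changed file and an untouched one is still reported.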
---

# Commit-hook integration

biston is designed to scan a whole repository, but when you wire it into a pre-commit hook you usually don't want every unrelated pair in the codebase to fail a commit — only clones that involve the files the committer touched. The `--focus-args` / `--files` / `--files-from` flags narrow the **report** to those files while still scanning the whole tree, so cross-file clones between a changed file and the rest of the repo are still detected.

## With pre-commit / prek

If you use the [`pre-commit`](https://pre-commit.com) framework (or [`prek`](https://github.com/j178/prek)), drop this into `.pre-commit-config.yaml`:

```yaml
- repo: https://github.com/mojzis/biston
  rev: v0.5.0
  hooks:
    - id: biston
```

That wires up `biston scan --focus-args`, which receives staged Python files as positional arguments and narrows the report to clones involving any of them. An empty staged set (no Python files touched) passes silently. A companion `biston-stats` hook is available for CI gating on pair counts.

> **Heads up:** if you write your own `local` hook definition for biston instead of using the repo above, you **must** set `require_serial: true`. Without it pre-commit may batch staged files into parallel invocations, and cross-file clones spanning batches will be silently missed — defeating the point of running biston as a hook.

## The shell recipe

For raw `.git/hooks/pre-commit` scripts, or for CI integration outside of `pre-commit`:

```bash
git diff --name-only --diff-filter=ACM -- '*.py' \
  | biston scan --files-from - .
```

What each piece does:

- `git diff --name-only --diff-filter=ACM -- '*.py'` — list Python files that are **A**dded, **C**opied, or **M**odified in the current index (swap in `HEAD~1..HEAD` or `--cached` depending on hook timing).
- `biston scan --files-from -` — read that list from stdin (one path per line). Paths are resolved relative to the current working directory.
- The positional `.` — root of the scan.
biston still discovers and parses everything under that root; the focus list only restricts which pairs make it into the report. An empty list (no Python files changed) correctly emits no pairs — the hook passes silently. That's why `--files-from -` is the right shape for hooks: `--files $(git diff --name-only)` silently expands to nothing when the diff is empty, which reverts to a full-repo scan and can trip the hook on pre-existing clones unrelated to the commit.

## Semantics

Given a repo with clones `A ↔ B` (inside the committer's change) and `C ↔ D` (elsewhere):

| Invocation | Pairs emitted |
|---|---|
| `biston scan .` | `A↔B`, `C↔D` |
| `biston scan --files A.py .` | `A↔B` (and any `A↔X` with X anywhere in the repo) |
| `biston scan --focus-args A.py` | `A↔B` (same as `--files A.py .`) |
| `biston scan --files-from - .` with empty stdin | *(none)* |
| `biston scan --focus-args` (no positionals) | *(none)* |

The three focus modes — `--files`, `--files-from`, and `--focus-args` — are mutually exclusive; pass only one per invocation. A focus path that can't be resolved (e.g. a file deleted in the same changeset) is warned about and skipped, not treated as a fatal error — the scan continues with whatever focus paths did resolve.

## Tips

- Use `--diff-filter=ACM` to avoid passing deleted files. biston tolerates them, but it's clearer at the hook level.
- Combine with `--format sarif` if your CI wants to upload findings as annotations — the SARIF output is filtered the same way.
- `stats` supports the same flags, so you can gate a hook on a numeric threshold: `biston stats --files-from - --format json . | jq '.clone_pairs'`.
- For a dry run, drop `--files-from` to see the full-repo report and confirm the focused scan isn't hiding surprises.

## Repeated `--files`

For one-off use outside a hook, `--files` is repeatable:

```bash
biston scan --files src/auth.py --files src/session.py .
```

Each `--files` takes a single path; repeat the flag to add more.
This form conflicts with `--files-from` — pick one per invocation.

---
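The `stats` gating tip can also live in a small Python wrapper, which is handy when a hook needs a custom failure message. The `clone_pairs` field name comes from the `jq` example above; the rest of the JSON shape is treated as unspecified here:

```python
import json
import subprocess
import sys

MAX_PAIRS = 0  # fail the hook on any focused clone pair

def gate(stats_json: str, limit: int = MAX_PAIRS) -> int:
    """Return a shell-style exit code from biston's stats JSON."""
    # clone_pairs is the field the jq recipe reads; default to 0 if absent
    pairs = json.loads(stats_json).get("clone_pairs", 0)
    if pairs > limit:
        print(f"biston: {pairs} clone pair(s) involve changed files",
              file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    # expects the changed-file list on stdin, e.g.:
    #   git diff --name-only --diff-filter=ACM -- '*.py' | python gate.py
    out = subprocess.run(
        ["biston", "stats", "--files-from", "-", "--format", "json", "."],
        stdin=sys.stdin, capture_output=True, text=True, check=True,
    )
    sys.exit(gate(out.stdout))
```

Separating the decision (`gate`) from the invocation keeps the threshold logic testable without running biston at all.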