Fine-Tuning JSONL Deduplicator

Deduplicate JSONL training data in seconds. Choose to compare entire lines or extract a specific field (prompt, completion, messages, or the first user turn), optionally normalise whitespace before hashing, and keep either the first or last occurrence. Row counts are shown live so you can see exactly how many duplicates were removed.

How to use the Fine-Tuning JSONL Deduplicator

Paste your JSONL data — one JSON object per line — into the input box, then click Deduplicate.

  1. Key selector: "Whole line" hashes the entire raw line (safe for any schema). Choose a field name (prompt, completion, messages) to deduplicate only on that field's value, ignoring differences in other fields. "First user message" extracts the first messages entry whose role is user — useful for chat-format datasets.
  2. Normalise: when checked, the key string is trimmed, lowercased, and internal runs of whitespace are collapsed to a single space before comparing. The emitted lines are not changed — only the comparison key is normalised, so the output still contains the original formatting.
  3. Keep first / last: controls which copy survives when duplicates are found. "First" is safest for append-heavy pipelines; "Last" lets you overwrite older versions.
  4. The result panel shows rows in, rows out, and the number of duplicates removed. Copy or download the result as .jsonl.

Why deduplication matters for LLM training data

JSONL (JSON Lines) is the dominant format for fine-tuning LLMs: every line is a self-contained JSON object, making streaming, shuffling, and sharding trivial. When datasets are assembled from multiple sources — web scrapes, human annotations, synthetic generation runs — exact or near-exact duplicates accumulate. Research on language model memorisation (Carlini et al., 2022) shows that duplicated examples are memorised at dramatically higher rates, making the model prone to verbatim regurgitation of training data rather than generalisation.

There are two levels of deduplication. Exact deduplication (what this tool does) catches character-for-character copies. Fuzzy deduplication (MinHash, SimHash) catches near-duplicates but requires more compute. For most fine-tuning workflows, exact deduplication on the prompt or first user message is the high-ROI starting point: it is fast, deterministic, and catches the most common contamination pattern — the same example appearing in both train and eval splits.

The normalisation option (trim + lowercase + collapse whitespace) bridges the gap between strict exact-match and fuzzy matching: it catches duplicates that differ only in leading/trailing spaces, capitalisation, or inconsistent whitespace — common when combining CSV exports with API-generated data. Because only the comparison key is normalised (not the emitted line), the output retains the original formatting your training pipeline expects.

Common use cases

  • Fine-tune dataset hygiene — remove copies that slipped in when merging datasets from different sources before a training run.
  • Train/eval contamination check — deduplicate on the prompt field across your train and eval JSONL files to ensure the model is not evaluated on questions it was trained on.
  • Synthetic data pipelines — LLM-generated datasets frequently produce near-identical outputs; deduplicating on completion or first user message keeps variety high.
  • Chat dataset cleaning — normalise + deduplicate on the first user turn to collapse repeated conversation starters across different sessions.
  • Incremental dataset updates — when appending new examples to an existing JSONL file, keep=last lets newer entries overwrite stale ones.

Frequently asked questions

Does the tool modify the JSON content of surviving lines?

No. The output lines are the original raw strings from the input. Normalisation (if enabled) only affects the key used for comparison — it never rewrites the data.

What happens if a line is not valid JSON?

Invalid lines are passed through unchanged and counted in the output. They are compared as raw strings (same as "Whole line" mode) regardless of the key setting, so two identical malformed lines would still be deduplicated.

How are messages or nested objects compared?

The field's value is serialised back to JSON using JSON.stringify before hashing. This is deterministic as long as key order is consistent — which it is for objects produced by the same serialiser.

Can I deduplicate files larger than the textarea can hold?

The textarea handles a few MB comfortably in modern browsers. For datasets in the hundreds of MB or GB range, use a command-line approach: awk '!seen[$0]++' for whole-line dedup, or jq + sort-uniq pipelines for field-level dedup.

Why does "Keep last" matter?

In annotation pipelines, later records are often corrections of earlier ones. Keeping the last occurrence means the most recently written version survives — useful when re-exporting a dataset after fixing labels.