Fine-Tuning JSONL Deduplicator
Deduplicate JSONL training data in seconds. Choose to compare entire lines or extract a specific field (prompt, completion, messages, or the first user turn), optionally normalise whitespace before hashing, and keep either the first or last occurrence. Row counts are shown live so you can see exactly how many duplicates were removed.
How to use the Fine-Tuning JSONL Deduplicator
Paste your JSONL data — one JSON object per line — into the input box, then click Deduplicate.
- Key selector: "Whole line" hashes the entire raw line (safe for any schema). Choose a field name (prompt, completion, messages) to deduplicate only on that field's value, ignoring differences in other fields. "First user message" extracts the first messages entry whose role is user — useful for chat-format datasets.
- Normalise: when checked, the key string is trimmed, lowercased, and internal runs of whitespace are collapsed to a single space before comparing. The emitted lines are not changed — only the comparison key is normalised, so the output still contains the original formatting.
- Keep first / last: controls which copy survives when duplicates are found. "First" is safest for append-heavy pipelines; "Last" lets you overwrite older versions.
- The result panel shows rows in, rows out, and the number of duplicates removed. Copy or download the result as .jsonl.
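The options above can be approximated offline with a short Python sketch. This is not the tool's actual source: the function names (extract_key, normalise, dedupe) and the option strings are hypothetical. Three behaviours are assumptions: lines that fail to parse or lack the chosen field fall back to the raw line as their key, non-string values are compared via sorted-key JSON serialisation, and with keep="last" the survivor stays at the position of its last occurrence.

```python
import json
import re


def extract_key(line, key="whole_line"):
    """Return the string a line is compared on."""
    if key == "whole_line":
        return line
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return line  # assumption: unparseable lines are keyed on the raw line
    if not isinstance(obj, dict):
        return line
    if key == "first_user_message":
        messages = obj.get("messages")
        if not isinstance(messages, list):
            return line
        for msg in messages:
            if isinstance(msg, dict) and msg.get("role") == "user":
                value = msg.get("content")
                break
        else:
            return line  # assumption: no user turn, so key on the raw line
    elif key in obj:
        value = obj[key]
    else:
        return line  # assumption: missing field, so key on the raw line
    if isinstance(value, str):
        return value
    # assumption: nested values are compared by canonical JSON text
    return json.dumps(value, sort_keys=True, ensure_ascii=False)


def normalise(text):
    """Trim, lowercase, and collapse whitespace runs to one space."""
    return re.sub(r"\s+", " ", text.strip().lower())


def dedupe(lines, key="whole_line", normalise_keys=False, keep="first"):
    """Return the surviving lines unchanged; only the keys are transformed."""
    rows = [ln for ln in lines if ln.strip()]  # assumption: blank lines are skipped
    if keep == "last":
        rows.reverse()  # walk backwards so the last copy is seen first
    seen, kept = set(), []
    for row in rows:
        k = extract_key(row, key)
        if normalise_keys:
            k = normalise(k)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    if keep == "last":
        kept.reverse()  # restore original relative order
    return kept


sample = [
    '{"prompt": "Hi", "completion": "a"}',
    '{"prompt": " hi ", "completion": "b"}',
    '{"prompt": "Bye", "completion": "c"}',
]
out = dedupe(sample, key="prompt", normalise_keys=True)
print(len(sample), len(out), len(sample) - len(out))  # rows in, rows out, removed
```

With the sample data, the first two rows share the normalised key "hi", so keep="first" retains the row with completion "a" and keep="last" retains the row with completion "b".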
Why deduplication matters for LLM training data
JSONL (JSON Lines) is the dominant format for fine-tuning LLMs: every line is a self-contained JSON object, making streaming, shuffling, and sharding trivial. When datasets are assembled from multiple sources — web scrapes, human annotations, synthetic generation runs — exact or near-exact duplicates accumulate. Research on language model memorisation (Carlini et al., 2022) shows that duplicated examples are memorised at dramatically higher rates, making the model prone to verbatim regurgitation of training data rather than generalisation.
There are two levels of deduplication. Exact deduplication (what this tool does) catches character-for-character copies. Fuzzy deduplication (MinHash, SimHash) catches near-duplicates but requires more compute. For most fine-tuning workflows, exact deduplication on the prompt or first user message is the high-ROI starting point: it is fast, deterministic, and catches the most common contamination pattern — the same example appearing in both train and eval splits.
The normalisation option (trim + lowercase + collapse whitespace) bridges the gap between strict exact-match and fuzzy matching: it catches duplicates that differ only in leading/trailing spaces, capitalisation, or inconsistent whitespace — common when combining CSV exports with API-generated data. Because only the comparison key is normalised (not the emitted line), the output retains the original formatting your training pipeline expects.
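As a small illustration of the trim + lowercase + collapse rule, the sketch below maps three differently formatted strings to one comparison key. The helper name normalise_key is hypothetical, and treating tabs and newlines as whitespace is an assumption about the tool's behaviour.

```python
import re


def normalise_key(text: str) -> str:
    """Trim, lowercase, then collapse each run of whitespace to one space."""
    return re.sub(r"\s+", " ", text.strip().lower())


variants = [
    "What is JSONL?",
    "  what is  jsonl? ",
    "WHAT IS\tJSONL?",
]
keys = {normalise_key(v) for v in variants}
print(keys)  # {'what is jsonl?'}
```

All three variants would count as duplicates of one another, while whichever copy survives is emitted with its original spacing and capitalisation.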
Common use cases
- Fine-tune dataset hygiene — remove copies that slipped in when merging datasets from different sources before a training run.
- Train/eval contamination check — deduplicate on the prompt field across your train and eval JSONL files to ensure the model is not evaluated on questions it was trained on.
- Synthetic data pipelines — LLM-generated datasets frequently produce near-identical outputs; deduplicating on completion or first user message keeps variety high.
- Chat dataset cleaning — normalise + deduplicate on the first user turn to collapse repeated conversation starters across different sessions.
- Incremental dataset updates — when appending new examples to an existing JSONL file, keep=last lets newer entries overwrite stale ones.
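For the train/eval contamination case, one possible offline check is sketched below: collect the prompt values from the eval file, then split the train rows into those whose prompt also appears in eval and those that do not. The function names and the sample rows are illustrative assumptions, and rows without a string value in the chosen field are simply treated as clean.

```python
import json


def field_value(line, field="prompt"):
    """Return the field as a string, or None if the line lacks it."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return None
    value = obj.get(field) if isinstance(obj, dict) else None
    return value if isinstance(value, str) else None


def split_by_overlap(train_lines, eval_lines, field="prompt"):
    """Split train rows into (clean, leaked) by exact match against eval."""
    eval_keys = {field_value(line, field) for line in eval_lines}
    eval_keys.discard(None)  # rows without the field never match anything
    clean, leaked = [], []
    for line in train_lines:
        if not line.strip():
            continue  # skip blank lines
        if field_value(line, field) in eval_keys:
            leaked.append(line)
        else:
            clean.append(line)
    return clean, leaked


train = [
    '{"prompt": "2+2?", "completion": "4"}',
    '{"prompt": "Capital of France?", "completion": "Paris"}',
]
evals = ['{"prompt": "2+2?", "completion": "four"}']
clean, leaked = split_by_overlap(train, evals)
print(len(clean), len(leaked))  # 1 1
```

This sketch compares exact strings only; applying the same normalisation to both sides before comparing would widen the match in the way the Normalise option does.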
Frequently asked questions
Does the tool modify the JSON content of surviving lines?
What happens if a line is not valid JSON?
How are messages or nested objects compared?
Can I deduplicate files larger than the textarea can hold?
For larger files, command-line tools are a better fit: awk '!seen[$0]++' for whole-line dedup, or jq + sort-uniq pipelines for field-level dedup.
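The awk one-liner keeps the first occurrence of each line. A rough Python equivalent that streams the input is sketched below. Storing a SHA-256 digest per line instead of the line itself is a design choice made here to bound memory, not something the tool is known to do, and the name unique_lines and the file names in the comment are placeholders.

```python
import hashlib


def unique_lines(lines):
    """Yield each distinct line once, in first-seen order."""
    seen = set()
    for line in lines:
        text = line.rstrip("\n")  # compare without the trailing newline
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)  # 32 bytes per distinct line, whatever its length
            yield text + "\n"


# Hypothetical usage; the file names are placeholders:
# with open("train.jsonl", encoding="utf-8") as src, \
#         open("train.dedup.jsonl", "w", encoding="utf-8") as dst:
#     dst.writelines(unique_lines(src))
```

Because the function is a generator over any iterable of lines, it never holds the whole file in memory; only the set of digests grows with the number of distinct lines.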