Jaccard Similarity Calculator (Text + Sets)

Jaccard similarity is the size of the intersection of two sets divided by the size of their union, a simple, intuitive way to measure how alike two collections are. Applied to text (with words, characters, or n-grams as the "items"), it gives a quick similarity score between 0 (no overlap) and 1 (identical token sets). It is useful for plagiarism detection, fuzzy matching, and deduplication.

Inputs: Text A, Text B
Options: Tokenization, Lowercase, Strip punctuation

How to use the Jaccard Similarity Calculator (Text + Sets)

Paste two texts, then choose a tokenization scheme: words for sentence-level comparison, or character bigrams/trigrams (also called n-gram shingling) for fuzzy substring matching that tolerates typos and small variations.

Output: Jaccard similarity (0-1), Jaccard distance (1 - similarity), and the intersection / union counts.

What is the Jaccard Similarity Calculator (Text + Sets)?

Jaccard similarity is defined as:

J(A, B) = |A ∩ B| / |A ∪ B|

For two sets of tokens, this is the count of items in both, divided by the count of items in either. Range: 0 (disjoint) to 1 (identical).
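
The definition translates almost directly into code. The following is a minimal Python sketch, not the calculator's actual implementation; the function name `jaccard_report` and the choice to score two empty sets as 1.0 are assumptions made for illustration.

```python
def jaccard_report(a, b):
    """Return similarity, distance, and intersection/union sizes for two iterables."""
    set_a, set_b = set(a), set(b)
    intersection = len(set_a & set_b)
    union = len(set_a | set_b)
    # Assumption: two empty sets are treated as identical (0/0 is otherwise undefined).
    similarity = intersection / union if union else 1.0
    return {
        "similarity": similarity,
        "distance": 1.0 - similarity,
        "intersection": intersection,
        "union": union,
    }


report = jaccard_report({"a", "b", "c"}, {"b", "c", "d"})
print(report)  # 2 shared items out of 4 distinct items
```

Converting the inputs with `set()` means duplicates are ignored, which matches the set-based definition above.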

For text, the choice of tokenization matters a lot:

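Word-level — fast and intuitive. After lowercasing, "The cat sat on the mat" vs "The dog sat on the mat" share 4 of 6 distinct words, so Jaccard = 4/6 ≈ 0.67 (the repeated "the" counts once in a set). Works well for prose comparison but breaks on small typos ("color" and "colour" are different tokens, so they contribute no overlap).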
Character n-grams (shingling) — robust to small variations. With trigrams, "color" → {col, olo, lor} and "colour" → {col, olo, lou, our}; Jaccard = 2/5 = 0.4. Better at fuzzy matching.
Word n-grams — captures phrasing. Two documents using identical phrases but different sentence orders have high word bigram similarity.
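
The three tokenization schemes above can be sketched in Python as follows. The helper names (`word_tokens`, `char_ngrams`, `word_ngrams`, `jaccard`) and the punctuation regex are assumptions for illustration, not the tool's own code. With set semantics a repeated word counts once, so the two "mat" sentences share 4 of 6 distinct words.

```python
import re


def word_tokens(text, lowercase=True, strip_punctuation=True):
    """Split text into word tokens, optionally lowercasing and dropping punctuation."""
    if lowercase:
        text = text.lower()
    if strip_punctuation:
        text = re.sub(r"[^\w\s]", "", text)
    return text.split()


def char_ngrams(text, n=3):
    """Set of overlapping character n-grams (shingles); empty if text is shorter than n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def word_ngrams(tokens, n=2):
    """Set of overlapping word n-grams, each stored as a tuple."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    a, b = set(a), set(b)
    union = a | b
    # Assumption: two empty sets score 1.0.
    return len(a & b) / len(union) if union else 1.0


words = jaccard(word_tokens("The cat sat on the mat"),
                word_tokens("The dog sat on the mat"))
trigrams = jaccard(char_ngrams("color"), char_ngrams("colour"))
print(round(words, 2), trigrams)  # 0.67 0.4
```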

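Jaccard is the basis of MinHash, a probabilistic technique that compresses each set into a fixed-length signature of k hash minima. Comparing two signatures estimates their Jaccard similarity in O(k) time regardless of set size (building each signature is still linear in the size of the set). MinHash was originally developed at AltaVista for near-duplicate detection of web pages.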
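
To make the idea concrete, here is a rough Python sketch of MinHash. It is one simple way to build the signatures, not a production implementation: salting `hashlib.blake2b` with a seed stands in for a family of k independent hash functions, and the function names and the default of 256 hashes are assumptions chosen for illustration.

```python
import hashlib


def minhash_signature(tokens, num_hashes=256):
    """Build a signature of num_hashes minima; tokens must be a non-empty set of strings."""
    signature = []
    for seed in range(num_hashes):
        # Assumption: a seed-salted BLAKE2b approximates an independent hash function.
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            hashlib.blake2b(tok.encode("utf-8"), digest_size=8, salt=salt).digest()
            for tok in tokens
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree: O(k) per comparison."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


set_a = {f"tok{i}" for i in range(100)}
set_b = {f"tok{i}" for i in range(50, 150)}
estimate = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
print(estimate)  # approximates the exact value 50/150; accuracy improves with more hashes
```

The probability that two sets share the same minimum under a random hash equals their Jaccard similarity, which is why the fraction of matching positions serves as the estimate.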

Common use cases

Near-duplicate detection — finding articles that are mostly the same text with minor changes.
Plagiarism scoring — quick first-pass before expensive deeper checks.
Fuzzy matching — matching product names, company names with slight variation.
Document clustering — group similar docs without embeddings.
Test-data deduplication — finding near-identical items in a dataset (Jaccard measures token overlap, not meaning).
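
As an illustration of the near-duplicate and fuzzy-matching use cases, the sketch below compares every pair of strings using character trigrams. The function name `near_duplicates`, the sample product names, and the 0.6 threshold are all hypothetical; pairwise comparison is quadratic in the number of items, so it only suits small collections.

```python
from itertools import combinations


def char_ngrams(text, n=3):
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_duplicates(texts, threshold=0.8):
    """Return (i, j, score) for every pair whose trigram Jaccard meets the threshold."""
    shingles = [char_ngrams(t.lower()) for t in texts]
    pairs = []
    for i, j in combinations(range(len(texts)), 2):
        score = jaccard(shingles[i], shingles[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Hypothetical product names, used only to exercise the function.
names = ["Acme Widget Pro 2000", "ACME Widget Pro-2000", "Globex Turbo Blender"]
pairs = near_duplicates(names, threshold=0.6)
print(pairs)  # only the two Acme variants are paired
```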

Frequently asked questions

Jaccard vs cosine similarity?

Jaccard treats each token as binary (present or not); cosine on TF-IDF vectors accounts for frequency. Jaccard is simpler and faster; cosine usually performs better for relevance ranking.
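
A small Python comparison shows the difference. For brevity this sketch computes cosine on raw term counts rather than TF-IDF weights, which is a simplification; the function names and sample strings are assumptions for illustration.

```python
import math
from collections import Counter


def jaccard(tokens_a, tokens_b):
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cosine(tokens_a, tokens_b):
    """Cosine similarity on raw term counts (no IDF weighting)."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


a = "spam spam spam spam eggs".split()
b = "spam eggs eggs eggs eggs".split()
j = jaccard(a, b)
c = cosine(a, b)
print(j, round(c, 3))  # 1.0 0.471
```

Both texts use the same two words, so Jaccard reports a perfect match, while cosine sees that the word frequencies differ sharply.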

What's a "good" Jaccard score?

It depends on the use case and the tokenization. As rough rules of thumb: for near-duplicate detection, >0.8 is "almost identical"; for fuzzy matching, >0.5 is "related"; for topic similarity, >0.2 may be enough.

Why character bigrams instead of just bigrams?

Different concepts. "Word bigrams" means pairs of adjacent words; "character bigrams" means pairs of adjacent characters. Character n-grams are robust to typos; word n-grams capture phrasing.
