TF-IDF Keyword Extractor
Paste one or more documents (separated by a blank line or a line containing only ---) and this tool computes TF-IDF scores for every term and n-gram phrase. TF (term frequency) measures how often a term appears in a document; IDF (inverse document frequency) down-weights terms that appear in many documents. Their product highlights the terms that characterise each document rather than merely the most frequent ones, which is why TF-IDF is a standard technique for keyword extraction and content analysis.
How to use the TF-IDF Keyword Extractor
Paste your text into the box. To analyse multiple documents, so that IDF is computed across a real multi-document corpus, separate them with a blank line or a line containing just ---. Select an n-gram size: 1 for individual words (best for topic detection), 2 for bigrams (good for collocations like "machine learning" or "user experience"), or 3 for trigrams (phrase detection).
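The separator rule can be sketched as follows. This is a minimal illustration, assuming that a separator is any line that is empty or contains only ---; split_documents is a hypothetical helper name, and joining the lines of a document with spaces is an assumption, not necessarily what the tool does internally.

```python
def split_documents(text):
    """Split pasted text into documents on blank lines or lines holding only ---."""
    docs, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in ("", "---"):
            # Separator line: close the document collected so far, if any.
            if current:
                docs.append(" ".join(current))
                current = []
        else:
            current.append(stripped)
    # Flush the final document when the text does not end with a separator.
    if current:
        docs.append(" ".join(current))
    return docs


pasted = "First article,\nsecond line.\n\nSecond article.\n---\nThird article."
print(split_documents(pasted))
# ['First article, second line.', 'Second article.', 'Third article.']
```

Consecutive separators produce no empty documents, so stray blank lines in a paste do not change the document count N.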
Enable Filter stop words to remove common English function words (the, is, and, of…) that inflate TF but carry little meaning. Set Top N to control how many terms appear in the results table.
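To illustrate why the stop-word filter matters, here is a minimal sketch that counts terms with and without filtering. STOP_WORDS is a small illustrative subset and top_terms is a hypothetical helper; the tool's actual stop-word list is not shown on this page, and the tool ranks by TF-IDF rather than by the raw counts used here.

```python
from collections import Counter

# Illustrative subset only; a real stop-word list is much longer.
STOP_WORDS = {"the", "is", "and", "of", "a", "to", "in", "on"}


def top_terms(tokens, top_n=10, filter_stop_words=True):
    """Return the top_n most frequent tokens, optionally dropping stop words."""
    if filter_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens).most_common(top_n)


tokens = "the cat sat on the mat and the cat slept".split()
print(top_terms(tokens, top_n=1, filter_stop_words=False))  # [('the', 3)]
print(top_terms(tokens, top_n=1))                           # [('cat', 2)]
```

Without the filter, the most frequent token is a function word; with it, the first content word surfaces.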
If you paste a single document, the tool splits it on sentence boundaries and treats each sentence as a sub-document so that it can compute a meaningful IDF. With only one document, every term would have N = df = 1, so IDF could not distinguish terms and the ranking would be no more informative than plain TF. A note appears in the output explaining this.
Results are shown as a table with four columns: term, TF (frequency / total words), IDF (log(N/df)), and TF-IDF score, sorted by TF-IDF descending. Copy the table output for use in reports or further analysis.
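The single-document fallback and the results table can be sketched roughly as below. This is an assumption-laden illustration, not the tool's implementation: the sentence splitter is a naive regex, TF is computed over the whole document, IDF uses the natural log over sentence sub-documents, and tokenize, split_sentences and tfidf_table are hypothetical names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tfidf_table(document, top_n=5):
    """Return rows of (term, tf, idf, tfidf), sorted by tfidf descending."""
    sub_docs = [tokenize(s) for s in split_sentences(document)]
    sub_docs = [d for d in sub_docs if d]
    n = len(sub_docs)
    all_tokens = [t for d in sub_docs for t in d]
    counts = Counter(all_tokens)
    # Document frequency: number of sentences that contain the term.
    df = Counter(t for d in sub_docs for t in set(d))
    rows = []
    for term, count in counts.items():
        tf = count / len(all_tokens)
        idf = math.log(n / df[term])
        rows.append((term, tf, idf, tf * idf))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows[:top_n]


for term, tf, idf, score in tfidf_table("Cats purr. Cats sleep. Dogs bark."):
    print(f"{term:<8}{tf:.3f}  {idf:.3f}  {score:.3f}")
```

In this example "cats" occurs in two of the three sentences, so its IDF is lower than that of the words that occur in only one sentence, and it ranks last despite having the highest TF.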
What TF-IDF measures and when to use it
TF-IDF (Term Frequency–Inverse Document Frequency) is a numerical statistic used in information retrieval and natural language processing to reflect how important a word is to a document within a collection (corpus). The TF component counts how often a term appears in the document, normalised by document length. The IDF component is the log of the ratio of total documents to the number of documents containing the term: log(N / df). A term that appears in every document (like "the") gets an IDF of zero, since log(N / N) = 0; a term that appears in only one document gets the highest possible IDF, log(N). Multiplying TF × IDF yields a score that is high for terms that are frequent in a specific document but rare across the corpus, which is exactly what makes a term "characteristic" of that document.
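As a worked example of the formulas above, the following sketch scores a three-document toy corpus. It assumes the unsmoothed definition given here (TF = count / document length, IDF = natural log of N / df); tf_idf is a hypothetical helper name, and other tools often use smoothed variants that give different numbers.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """corpus: list of token lists. Returns one {term: score} dict per document."""
    n = len(corpus)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        scores.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return scores


corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
scores = tf_idf(corpus)
print(scores[0]["the"])   # 0.0, because "the" occurs in all three documents
print(scores[0]["cat"])   # (1/6) * log(3/2), "cat" occurs in two documents
print(scores[0]["mat"])   # (1/6) * log(3/1), "mat" occurs in one document
```

"the" is the most frequent token in the first document yet scores zero, while "mat" outranks "cat" because it appears in fewer documents.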
The IDF weighting was proposed by Karen Spärck Jones in 1972, and TF-IDF has remained one of the most widely used text analysis techniques ever since because it is simple, fast, and interpretable. It underpins classic search engine ranking and is a common term-weighting step in document clustering, topic modelling, and content recommendation systems. For SEO content analysis, TF-IDF is used to compare a page against its ranking competitors to find "missing" semantically relevant terms. For editorial work, it quickly surfaces the key concepts an author has emphasised.
N-grams extend TF-IDF to multi-word phrases. A bigram like "neural network" has much more semantic specificity than either word alone, and its TF-IDF score captures how characteristic that phrase is of the document. Trigrams are less common but can catch meaningful three-word units like "support vector machine" or "large language model". Start with unigrams for a broad overview, then switch to bigrams to discover collocations.
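Generating n-grams is a sliding window over the token list. A minimal sketch, assuming tokens are already lowercased and split on whitespace; ngrams is a hypothetical helper name:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "large language model training".split()
print(ngrams(tokens, 1))  # ['large', 'language', 'model', 'training']
print(ngrams(tokens, 2))  # ['large language', 'language model', 'model training']
print(ngrams(tokens, 3))  # ['large language model', 'language model training']
```

Each resulting phrase can then be counted and scored exactly like a single word, with TF and df computed over phrases instead of tokens.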
Common use cases
- SEO content gap analysis: compare your article against top-ranking pages (paste each as a document) to see which semantically relevant terms your content is missing.
- Document summarisation: the top TF-IDF terms give a quick keyword-level summary of the document's core topics without reading the full text.
- Competitive blog analysis: paste several competitor articles to identify the terms each one uniquely emphasises, i.e. its content angle.
- Tag generation: automatically generate candidate tags or metadata keywords for a CMS by extracting the top-scoring unigrams and bigrams.
- Academic paper review: quickly surface the key concepts of a paper or section, useful when reviewing many papers in a literature survey.
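The content gap use case can be sketched as follows. This is one possible approach under stated assumptions, not the tool's own method: a "gap" is taken to be any term that appears in a competitor document but not in yours, and gaps are ranked first by how many competitors use the term and then by its best TF-IDF score. content_gap is a hypothetical helper name.

```python
import math
from collections import Counter


def content_gap(own, competitors, top_n=5):
    """Return (term, competitors_using_it, best_tfidf) for terms missing from own."""
    corpus = [own] + competitors
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))
    own_terms = set(own)
    gaps = {}
    for doc in competitors:
        for term, count in Counter(doc).items():
            if term in own_terms:
                continue
            score = (count / len(doc)) * math.log(n / df[term])
            used_by, best = gaps.get(term, (0, 0.0))
            gaps[term] = (used_by + 1, max(best, score))
    # Sort by number of competitors first, then by best TF-IDF score.
    ranked = sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)
    return [(term, used_by, score) for term, (used_by, score) in ranked[:top_n]]


own = "tf idf scores words".split()
competitors = [
    "tf idf ngrams stop words".split(),
    "tf idf ngrams corpus".split(),
]
for term, used_by, score in content_gap(own, competitors):
    print(term, used_by, round(score, 3))
```

Ranking by competitor count first is a deliberate choice in this sketch: plain TF-IDF gives a lower IDF to a term that every competitor uses, even though such a term is usually the most significant omission.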