TF-IDF Keyword Extractor

Paste one or more documents (separated by a blank line or ---) and this tool computes TF-IDF scores for every term and n-gram phrase. TF (term frequency) measures how often a term appears in a document; IDF (inverse document frequency) down-weights terms that appear in many of the documents. Their product identifies the terms that characterise each document, not just the most frequent ones, which makes it a standard technique for keyword extraction and content analysis.

How to use the TF-IDF Keyword Extractor

Paste your text into the box. To analyse multiple documents, so that IDF is computed across a real multi-document corpus, separate them with a blank line or a line containing just ---. Select an n-gram size: 1 for individual words (best for topic detection), 2 for bigrams (good for collocations like "machine learning" or "user experience"), or 3 for trigrams (phrase detection).
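As a rough illustration of the separator rule described above, the sketch below splits pasted text into documents on blank lines or lines containing only ---. The function name split_documents is hypothetical and the tool's own parsing may differ in its details.

```python
def split_documents(text: str) -> list[str]:
    """Split pasted text into documents.

    Assumption: a separator is any blank line or a line containing only '---'.
    Lines inside one document are joined with single spaces.
    """
    docs: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        if line.strip() in ("", "---"):
            # Separator line: close the document collected so far, if any.
            if current:
                docs.append(" ".join(current))
                current = []
        else:
            current.append(line.strip())
    if current:
        docs.append(" ".join(current))
    return docs


pasted = "first doc line one\nfirst doc line two\n\nsecond doc\n---\nthird doc"
print(split_documents(pasted))
```

Note that under this rule a blank line between two paragraphs of the same article also starts a new document, so remove blank lines inside a document before pasting if it should be scored as one unit.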

Enable Filter stop words to remove common English function words (the, is, and, of…) that inflate TF but carry little meaning. Set Top N to control how many terms appear in the results table.
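The two options can be pictured as a filter applied before scoring and a cut-off applied after it. This is a minimal sketch: STOP_WORDS is a tiny illustrative subset, not the tool's actual list, and filter_stop_words and top_n are hypothetical names.

```python
# Illustrative subset only; a real stop-word list is much longer.
STOP_WORDS = {"the", "is", "and", "of", "a", "an", "to", "in"}


def filter_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]


def top_n(scores: dict[str, float], n: int) -> list[tuple[str, float]]:
    """Return the n highest-scoring terms, ties broken alphabetically."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


tokens = ["the", "cat", "is", "on", "the", "mat"]
print(filter_stop_words(tokens))
print(top_n({"cat": 0.10, "mat": 0.30, "on": 0.20}, 2))
```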

If you paste a single document, the tool splits it on sentence boundaries and treats each sentence as a sub-document so that IDF is meaningful; otherwise N = df = 1 for every term, IDF = log(1/1) = 0, and only plain TF is left to rank by. A note appears in the output explaining this. Results are shown as a table with four columns: term, TF (frequency / total words), IDF (log(N/df)), and TF-IDF score, sorted by TF-IDF descending. Copy the table output for use in reports or further analysis.
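The single-document fallback might look something like the sketch below. It assumes a simple rule where a sentence ends at ".", "!" or "?" followed by whitespace; the tool's real sentence splitter is not documented here, and to_corpus is a hypothetical name.

```python
import re


def to_corpus(docs: list[str]) -> tuple[list[str], bool]:
    """Return (corpus, used_sentence_split).

    With exactly one document, split it into sentences and treat each
    sentence as a sub-document. Otherwise return the documents unchanged.
    """
    if len(docs) != 1:
        return docs, False
    # Assumed rule: split on whitespace that follows '.', '!' or '?'.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", docs[0]) if s.strip()]
    return sentences, True


corpus, was_split = to_corpus(["TF-IDF ranks terms. It needs a corpus! Does one document work?"])
print(was_split, corpus)
```

The boolean flag is what would drive the explanatory note in the output.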

What TF-IDF measures and when to use it

TF-IDF (Term Frequency–Inverse Document Frequency) is a numerical statistic used in information retrieval and natural language processing to reflect how important a word is to a document within a collection (corpus). The TF component counts how often a term appears in the document, normalised by document length. The IDF component is the log of the ratio of total documents to the number of documents containing the term: log(N / df). A term that appears in every document (like "the") has df = N and so gets an IDF of zero; a term that appears in only one document gets the highest possible IDF, log(N). Multiplying TF × IDF yields a score that is high for terms that are frequent in a specific document but rare across the corpus, which is exactly what makes a term "characteristic" of that document.
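The definition above can be written out in a few lines. This is a minimal sketch, not the tool's source: it assumes documents are already tokenised and uses the natural logarithm, since the log base is not stated. The function name tf_idf is hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus: list[list[str]]) -> list[dict[str, float]]:
    """Score every term in every document as (count / total) * log(N / df)."""
    n_docs = len(corpus)
    # df[term] = number of documents that contain the term at least once.
    df: Counter = Counter()
    for tokens in corpus:
        df.update(set(tokens))

    results = []
    for tokens in corpus:
        counts = Counter(tokens)
        total = len(tokens)
        results.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return results


corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]]
for doc_scores in tf_idf(corpus):
    print(doc_scores)
```

In this two-document example "the" appears in both documents, so its IDF and its score are zero, while "cat" and "dog" each appear in one document and receive a positive score.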

TF-IDF was introduced in the 1970s–80s and remains one of the most widely used text analysis techniques because it is simple, fast, and interpretable. It underpins classic search engine ranking, document clustering, topic modelling, and content recommendation systems. For SEO content analysis, TF-IDF is used to compare a page against its ranking competitors to find "missing" semantically relevant terms. For editorial work, it quickly surfaces the key concepts an author has emphasised.

N-grams extend TF-IDF to multi-word phrases. A bigram like "neural network" has much more semantic specificity than either word alone, and its TF-IDF score captures how characteristic that phrase is of the document. Trigrams are less common but can catch meaningful three-word units like "support vector machine" or "large language model". Start with unigrams for a broad overview, then switch to bigrams to discover collocations.
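To make the n-gram step concrete, the sketch below builds unigrams, bigrams and trigrams from a token list with a sliding window. The tokenizer is an assumption (lowercase, then runs of letters, digits and apostrophes); the tool may tokenise differently.

```python
import re


def tokenize(text: str) -> list[str]:
    """Assumed tokenizer: lowercase, keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens: list[str], n: int) -> list[str]:
    """Slide a window of n tokens over the list and join each window with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("Support vector machine models")
print(ngrams(tokens, 1))
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```

Each resulting phrase is then counted and scored like a single term.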

Common use cases

  • SEO content gap analysis — compare your article against top-ranking pages (paste each as a document) to see which semantically relevant terms your content is missing.
  • Document summarisation — the top-TF-IDF terms are a fast automatic summary of the document's core topics, without reading the full text.
  • Competitive blog analysis — paste several competitor articles to identify the terms each uniquely emphasises — i.e. their content angle.
  • Tag generation — automatically generate candidate tags or metadata keywords for a CMS by extracting the top-scoring unigrams and bigrams.
  • Academic paper review — quickly surface the key concepts of a paper or section, useful when reviewing many papers in a literature survey.
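For the content gap use case, one simple way to post-process the results is to list terms that several competitor documents use but your own page does not. This is a hypothetical sketch, not a feature of the tool: missing_terms and the min_docs threshold are illustrative names and values.

```python
from collections import Counter


def missing_terms(own_tokens: list[str],
                  competitor_docs: list[list[str]],
                  min_docs: int = 2) -> list[str]:
    """Terms found in at least min_docs competitor documents but absent from your own."""
    own = set(own_tokens)
    df: Counter = Counter()
    for tokens in competitor_docs:
        df.update(set(tokens))  # count each term once per document
    return sorted(t for t, count in df.items() if count >= min_docs and t not in own)


mine = ["seo", "guide"]
competitors = [
    ["seo", "keyword", "density"],
    ["keyword", "research"],
    ["seo", "density"],
]
print(missing_terms(mine, competitors))
```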

Frequently asked questions

Why does a single document give less useful results?

TF-IDF needs multiple documents for IDF to discriminate. With one document, N = 1 and df = 1 for all terms, so IDF = log(1/1) = 0 for every term: every TF-IDF product is zero and only plain TF is left to rank by. This tool works around it by splitting the document into sentences and treating each sentence as a "document"; the IDF then favours terms that are concentrated in specific sentences rather than spread uniformly. A note in the output explains this.
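The arithmetic is easy to check. The sketch below assumes the natural logarithm and a hypothetical four-sentence document; with a single document IDF is zero for every term and contributes nothing to the ranking, while after sentence splitting it separates concentrated terms from evenly spread ones.

```python
import math


def idf(n_docs: int, df: int) -> float:
    """IDF = log(N / df), natural log assumed."""
    return math.log(n_docs / df)


# One document: every term has df = 1 and N = 1.
print(idf(1, 1))

# The same text treated as 4 sentence "documents" (hypothetical counts).
print(idf(4, 4))  # term present in every sentence
print(idf(4, 2))  # term present in half of the sentences
print(idf(4, 1))  # term concentrated in a single sentence
```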

What n-gram size should I start with?

Start with n=1 (unigrams) to get an overview of key topics. Switch to n=2 (bigrams) to discover meaningful two-word phrases like "keyword density" or "responsive image". Trigrams are mainly useful for technical writing with common three-word compounds. You can run all three and compare.

Why are some important words scored low?

A word with a low TF-IDF score may be frequent in the document but also very common across your other documents, so IDF penalises it. This is by design: TF-IDF highlights what is distinctive to each document, not what is universally common. If you only have one document, the tool applies the sentence-split mode automatically; review the note in the output.

How is TF computed here?

TF = (number of times term appears in the document) / (total number of tokens in the document). This "relative TF" prevents long documents from unfairly dominating shorter ones. IDF = log(N / df) where N is the number of documents and df is the count of documents containing the term.
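A quick way to see why relative TF is used: a term that makes up the same share of a short and a long document gets the same TF, even though its raw count doubles. The counts below are made up for illustration and relative_tf is a hypothetical name.

```python
def relative_tf(count: int, total_tokens: int) -> float:
    """TF = occurrences of the term / total tokens in the document."""
    return count / total_tokens


short_doc = relative_tf(3, 150)   # 3 occurrences in a 150-token document
long_doc = relative_tf(6, 300)    # 6 occurrences in a 300-token document

print(short_doc, long_doc)
print(short_doc == long_doc)
```

With raw counts the longer document would score twice as high for this term; with relative TF the two are treated alike.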

Can I use the results for Google's NLP-based ranking?

TF-IDF is a useful heuristic for identifying terms Google may associate with a topic, but Google's ranking relies on neural language models such as BERT and MUM for semantic understanding, not on classic TF-IDF. Treat TF-IDF results as a starting list of candidate terms to review, not a definitive ranking signal, and validate them against the pages that actually rank for your query.