How to Count Word Frequency
Tokenization, stop-words, stemming, n-grams, and use cases (SEO, style checks, research corpora).
Word frequency is one of the oldest text-analysis techniques and still powers SEO, content audits, plagiarism detection, style checking, and basic NLP. Count how often each word appears in a document and you can spot overused terms, keyword coverage gaps, tonal tics, and the shape of a corpus. But a naive split(" ") breaks on punctuation, treats "run," "runs," and "running" as three different words, and conflates meaningful content with high-frequency filler. This guide covers tokenization rules, stop words, stemming, n-grams, and the specific tuning you need for SEO vs style vs research applications.
Tokenization: the first hard choice
How you cut text into words determines every count downstream. Naive whitespace split:
"Don't split on hyphens, maybe." -> ["Don't", "split", "on", "hyphens,", "maybe."]
Punctuation is attached. Better: split on non-word characters, then lowercase:
str.toLowerCase().match(/[\p{L}\p{N}'’]+/gu)
This keeps contractions ("don't", including the curly apostrophe) but strips commas and periods. Add hyphens to the class if you want "state-of-the-art" as one token.
Case folding
"The" and "the" should count as the same word unless you're analyzing capitalization patterns. toLowerCase is usually fine, but remember locale-specific rules (Turkish dotted/dotless i, German ß).
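Putting the tokenization and case-folding rules together, a minimal sketch might look like this (the apostrophe normalization and the exact character class are illustrative choices, not the only reasonable ones):

```javascript
// Unicode-aware tokenizer sketch: normalize, lowercase, then match
// runs of letters, digits, and apostrophes. Punctuation falls away.
function tokenize(text) {
  const normalized = text.normalize("NFKC").replace(/’/g, "'");
  return normalized.toLowerCase().match(/[\p{L}\p{N}']+/gu) || [];
}

tokenize("Don't split on hyphens, maybe.");
// ["don't", "split", "on", "hyphens", "maybe"]
```

Returning an empty array when match finds nothing keeps downstream code from tripping on null.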
Stop words
The most common words in English text — "the, of, and, a, to, in, is, you, that, it, he, was, for, on, are" — are rarely interesting. Standard stop-word lists strip them so the remaining counts reflect content.
const STOP = new Set([
  "the","a","an","and","or","but","if","then","else","of",
  "to","in","on","at","for","from","by","with","as","is",
  "are","was","were","be","been","being","have","has","had",
  "do","does","did","will","would","shall","should","may","might",
  "must","can","could","this","that","these","those","i","you",
  "he","she","it","we","they"
]);
tokens.filter(t => !STOP.has(t))
Customize the list for your domain. SEO stop-word lists usually keep more terms than research-corpus lists.
Stemming vs lemmatization
Both collapse word variants to a single form:
- Stemming — algorithmic, cheap, aggressive. “running” → “run”, “better” → “better” (doesn’t handle irregulars)
- Lemmatization — dictionary-based, accurate, slower. “running” → “run”, “better” → “good”
Stemming is usually good enough for frequency counting. The Porter stemmer is the classic; Snowball is its modern descendant. For accuracy-critical work, use spaCy or NLTK with WordNet.
Counting
Trivial with a map:
const counts = new Map();
for (const t of tokens) {
counts.set(t, (counts.get(t) || 0) + 1);
}
// sort desc
const sorted = [...counts.entries()]
  .sort((a, b) => b[1] - a[1]);
N-grams: beyond single words
Single-word counts miss phrases. “San Francisco” carries information that “san” + “francisco” separately doesn’t. Bigrams (2-word) and trigrams (3-word) capture this:
function ngrams(tokens, n) {
const out = [];
for (let i = 0; i <= tokens.length - n; i++) {
out.push(tokens.slice(i, i + n).join(" "));
}
return out;
}
const bigrams = ngrams(tokens, 2);
const trigrams = ngrams(tokens, 3);
Bigram stop-word filtering is trickier — "of the" is noise but "state of the art" is signal. Strip bigrams where both tokens are stop words, keep the rest.
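The both-tokens-are-stop-words rule can be sketched directly (the abbreviated STOP set here stands in for the fuller list above):

```javascript
// Keep a bigram unless BOTH of its tokens are stop words.
const STOP = new Set(["of", "the", "a", "in", "is", "to"]); // abbreviated for the example

function filterBigrams(bigrams, stop) {
  return bigrams.filter(bg => {
    const [a, b] = bg.split(" ");
    return !(stop.has(a) && stop.has(b));
  });
}

filterBigrams(["of the", "state of", "san francisco"], STOP);
// ["state of", "san francisco"] — "of the" dropped, mixed bigrams kept
```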
TF-IDF: frequency in context
Raw frequency favors stop words and common terms. TF-IDF (term frequency × inverse document frequency) measures how distinctive a word is to this document relative to a corpus.
tf(t, d) = count of t in d / total terms in d
idf(t) = log(N / n_t)        // N docs total, n_t docs with t
tfidf(t, d) = tf(t, d) × idf(t)
High TF-IDF = characteristic of the document. Great for tagging, topic extraction, and finding the “gist” words.
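The formulas above translate almost line-for-line into code. This sketch works over an in-memory corpus of token arrays and uses the plain log(N / n_t) idf; production systems usually smooth it to avoid division by zero and zero weights:

```javascript
// TF-IDF for one term in one document, relative to a corpus of token arrays.
function tfidf(term, docTokens, corpus) {
  const tf = docTokens.filter(t => t === term).length / docTokens.length;
  const nT = corpus.filter(doc => doc.includes(term)).length;
  if (nT === 0) return 0;
  return tf * Math.log(corpus.length / nT);
}

const corpus = [
  ["cats", "purr", "often"],
  ["dogs", "bark", "often"],
  ["cats", "and", "dogs"],
];
tfidf("purr", corpus[0], corpus);  // high: "purr" appears in only one document
tfidf("often", corpus[0], corpus); // lower: "often" appears in two of three
```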
SEO application: keyword density
Keyword density = (count of keyword / total words) × 100. Old SEO target was 1–3%. Modern consensus: natural language beats forced density. Use frequency counting to:
- Catch obvious keyword stuffing (>5% for any one term)
- Find coverage gaps where expected terms are missing
- Audit multi-page consistency
- Spot over-indexing on stop-word-like phrases
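The density formula above is a one-liner over a token array (the example numbers are hypothetical):

```javascript
// Keyword density: (count of keyword / total words) × 100.
function keywordDensity(tokens, keyword) {
  if (tokens.length === 0) return 0;
  const hits = tokens.filter(t => t === keyword.toLowerCase()).length;
  return (hits / tokens.length) * 100;
}

const tokens = ["best", "blender", "reviews", "blender", "guide", "blender",
                "blender", "blender", "blender", "tips"];
keywordDensity(tokens, "blender"); // 60 — far past the >5% stuffing threshold
```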
Style checking
Frequency counts reveal habitual tics: “really,” “just,” “very,” “that” overused as filler. Run your draft through a frequency pass and the top 30 content words show your patterns.
Research and corpus analysis
For larger corpora:
- Normalize to NFKC, lowercase, strip punctuation
- Apply a domain-specific stop-word list
- Stem with Snowball or Porter
- Generate uni/bi/trigrams, report top 50 of each
- For larger analyses, compute TF-IDF across the corpus to surface document-distinctive terms
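The pipeline steps above can be chained into one pass. The STOP set and the one-line stem here are abbreviated placeholders for the fuller versions earlier in the guide:

```javascript
// End-to-end sketch: normalize -> tokenize -> stop-filter -> stem ->
// n-grams -> sorted top-K counts.
const STOP = new Set(["the", "of", "and", "a", "to", "in", "is"]); // abbreviated
const stem = w => w.replace(/(ing|ed|s)$/, ""); // placeholder, not a real stemmer

function topTerms(text, n, topK = 50) {
  const tokens = (text.normalize("NFKC").toLowerCase().match(/[\p{L}\p{N}']+/gu) || [])
    .filter(t => !STOP.has(t))
    .map(stem);
  const grams = [];
  for (let i = 0; i <= tokens.length - n; i++) {
    grams.push(tokens.slice(i, i + n).join(" "));
  }
  const counts = new Map();
  for (const g of grams) counts.set(g, (counts.get(g) || 0) + 1);
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, topK);
}
```

Call it once per n (1, 2, 3) to get the uni/bi/trigram reports.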
Hapax legomena and Zipf’s law
Natural-language frequency distributions follow Zipf's law: the Nth most common word has frequency roughly proportional to 1/N. The single most common word appears roughly twice as often as the second, three times as often as the third, etc. Deviations from Zipf's law often indicate artificially generated or translated text. Hapax legomena (words that appear exactly once) typically make up 40–60% of the distinct vocabulary in a corpus — a useful sanity check.
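That sanity check is cheap to run against the counts Map built earlier:

```javascript
// Fraction of distinct words that occur exactly once (hapax legomena).
// Expect roughly 0.4–0.6 for natural-language corpora.
function hapaxRatio(counts) { // counts: Map of token -> frequency
  let hapax = 0;
  for (const c of counts.values()) if (c === 1) hapax++;
  return counts.size === 0 ? 0 : hapax / counts.size;
}
```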
Common mistakes
Splitting on whitespace only and keeping punctuation attached to tokens. Case-folding too early and losing proper-noun distinction. Applying an English stop-word list to non-English text. Counting “run,” “runs,” and “running” separately when you meant them as one concept. Forgetting that HTML tags, URLs, and numbers need separate handling. And confusing frequency rank with importance — Zipf’s law guarantees “the” wins every time.