How to Count Word Frequency
Tokenization, stop-words, stemming, n-grams, and use cases (SEO, style checks, research corpora).
Word frequency is one of the oldest text-analysis techniques and still powers SEO, content audits, plagiarism detection, style checking, and basic NLP. Count how often each word appears in a document and you can spot overused terms, keyword coverage gaps, tonal tics, and the shape of a corpus. But a naive split(" ") breaks on punctuation, treats "run," "runs," and "running" as three different words, and conflates meaningful content with high-frequency filler. This guide covers tokenization rules, stop words, stemming, n-grams, and the specific tuning you need for SEO vs style vs research applications.
Tokenization: the first hard choice
How you cut text into words determines every count downstream. Naive whitespace split:
"Don't split on hyphens, maybe." -> ["Don't", "split", "on", "hyphens,", "maybe."]
Punctuation is attached. Better: split on non-word characters, then lowercase:
str.toLowerCase().match(/[\p{L}\p{N}'’]+/gu)
This keeps contractions ("don't", including the curly apostrophe) but strips commas and periods. Add hyphens to the class if you want "state-of-the-art" as one token.
Case folding
"The" and "the" should count as the same word unless you're analyzing capitalization patterns. toLowerCase is usually fine, but remember locale-specific rules (Turkish dotted/dotless i, German ß).
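Putting the tokenization and case-folding rules together, a minimal sketch might look like this (the apostrophe normalization and the exact character class are illustrative choices, not the only reasonable ones):

```javascript
// Unicode-aware tokenizer sketch: normalize, lowercase, then match
// runs of letters, digits, and apostrophes. Punctuation falls away.
function tokenize(text) {
  const normalized = text.normalize("NFKC").replace(/’/g, "'");
  return normalized.toLowerCase().match(/[\p{L}\p{N}']+/gu) || [];
}

tokenize("Don't split on hyphens, maybe.");
// ["don't", "split", "on", "hyphens", "maybe"]
```

Returning an empty array when match finds nothing keeps downstream code from tripping on null.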
Stop words
The most common words in English text — "the, of, and, a, to, in, is, you, that, it, he, was, for, on, are" — are rarely interesting. Standard stop-word lists strip them so the remaining counts reflect content.
const STOP = new Set([
  "the","a","an","and","or","but","if","then","else","of",
  "to","in","on","at","for","from","by","with","as","is",
  "are","was","were","be","been","being","have","has","had",
  "do","does","did","will","would","shall","should","may","might",
  "must","can","could","this","that","these","those","i","you",
  "he","she","it","we","they"
]);
tokens.filter(t => !STOP.has(t))
Customize the list for your domain. SEO stop-word lists usually keep more terms than research-corpus lists.
Stemming vs lemmatization
Both collapse word variants to a single form:
- Stemming — algorithmic, cheap, aggressive. “running” → “run”, “better” → “better” (doesn’t handle irregulars)
- Lemmatization — dictionary-based, accurate, slower. “running” → “run”, “better” → “good”
Stemming is usually good enough for frequency counting. The Porter stemmer is the classic; Snowball is its modern descendant. For accuracy-critical work, use spaCy or NLTK with WordNet.
Counting
Trivial with a map:
const counts = new Map();
for (const t of tokens) {
counts.set(t, (counts.get(t) || 0) + 1);
}
// sort desc
const sorted = [...counts.entries()]
  .sort((a, b) => b[1] - a[1]);
N-grams: beyond single words
Single-word counts miss phrases. “San Francisco” carries information that “san” + “francisco” separately doesn’t. Bigrams (2-word) and trigrams (3-word) capture this:
function ngrams(tokens, n) {
const out = [];
for (let i = 0; i <= tokens.length - n; i++) {
out.push(tokens.slice(i, i + n).join(" "));
}
return out;
}
const bigrams = ngrams(tokens, 2);
const trigrams = ngrams(tokens, 3);
Bigram stop-word filtering is trickier — "of the" is noise but "state of the art" is signal. Strip bigrams where both tokens are stop words, keep the rest.
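The both-tokens-are-stop-words rule can be sketched directly (the abbreviated STOP set here stands in for the fuller list above):

```javascript
// Keep a bigram unless BOTH of its tokens are stop words.
const STOP = new Set(["of", "the", "a", "in", "is", "to"]); // abbreviated for the example

function filterBigrams(bigrams, stop) {
  return bigrams.filter(bg => {
    const [a, b] = bg.split(" ");
    return !(stop.has(a) && stop.has(b));
  });
}

filterBigrams(["of the", "state of", "san francisco"], STOP);
// ["state of", "san francisco"] — "of the" dropped, mixed bigrams kept
```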
TF-IDF: frequency in context
Raw frequency favors stop words and common terms. TF-IDF (term frequency × inverse document frequency) measures how distinctive a word is to this document relative to a corpus.
tf(t, d) = count of t in d / total terms in d
idf(t) = log(N / n_t)        // N docs total, n_t docs with t
tfidf(t, d) = tf(t, d) × idf(t)
High TF-IDF = characteristic of the document. Great for tagging, topic extraction, and finding the “gist” words.
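The formulas above translate almost line-for-line into code. This sketch works over an in-memory corpus of token arrays and uses the plain log(N / n_t) idf; production systems usually smooth it to avoid division by zero and zero weights:

```javascript
// TF-IDF for one term in one document, relative to a corpus of token arrays.
function tfidf(term, docTokens, corpus) {
  const tf = docTokens.filter(t => t === term).length / docTokens.length;
  const nT = corpus.filter(doc => doc.includes(term)).length;
  if (nT === 0) return 0;
  return tf * Math.log(corpus.length / nT);
}

const corpus = [
  ["cats", "purr", "often"],
  ["dogs", "bark", "often"],
  ["cats", "and", "dogs"],
];
tfidf("purr", corpus[0], corpus);  // high: "purr" appears in only one document
tfidf("often", corpus[0], corpus); // lower: "often" appears in two of three
```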
SEO application: keyword density
Keyword density = (count of keyword / total words) × 100. Old SEO target was 1–3%. Modern consensus: natural language beats forced density. Use frequency counting to:
- Catch obvious keyword stuffing (>5% for any one term)
- Find coverage gaps where expected terms are missing
- Audit multi-page consistency
- Spot over-indexing on stop-word-like phrases
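The density formula above is a one-liner over a token array (the example numbers are hypothetical):

```javascript
// Keyword density: (count of keyword / total words) × 100.
function keywordDensity(tokens, keyword) {
  if (tokens.length === 0) return 0;
  const hits = tokens.filter(t => t === keyword.toLowerCase()).length;
  return (hits / tokens.length) * 100;
}

const tokens = ["best", "blender", "reviews", "blender", "guide", "blender",
                "blender", "blender", "blender", "tips"];
keywordDensity(tokens, "blender"); // 60 — far past the >5% stuffing threshold
```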
Style checking
Frequency counts reveal habitual tics: “really,” “just,” “very,” “that” overused as filler. Run your draft through a frequency pass and the top 30 content words show your patterns.
Research and corpus analysis
For larger corpora:
- Normalize to NFKC, lowercase, strip punctuation
- Apply a domain-specific stop-word list
- Stem with Snowball or Porter
- Generate uni/bi/trigrams, report top 50 of each
- For larger analyses, compute TF-IDF across the corpus to surface document-distinctive terms
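The pipeline steps above can be chained into one pass. The STOP set and the one-line stem here are abbreviated placeholders for the fuller versions earlier in the guide:

```javascript
// End-to-end sketch: normalize -> tokenize -> stop-filter -> stem ->
// n-grams -> sorted top-K counts.
const STOP = new Set(["the", "of", "and", "a", "to", "in", "is"]); // abbreviated
const stem = w => w.replace(/(ing|ed|s)$/, ""); // placeholder, not a real stemmer

function topTerms(text, n, topK = 50) {
  const tokens = (text.normalize("NFKC").toLowerCase().match(/[\p{L}\p{N}']+/gu) || [])
    .filter(t => !STOP.has(t))
    .map(stem);
  const grams = [];
  for (let i = 0; i <= tokens.length - n; i++) {
    grams.push(tokens.slice(i, i + n).join(" "));
  }
  const counts = new Map();
  for (const g of grams) counts.set(g, (counts.get(g) || 0) + 1);
  return [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, topK);
}
```

Call it once per n (1, 2, 3) to get the uni/bi/trigram reports.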
Hapax legomena and Zipf’s law
Natural-language frequency distributions follow Zipf's law: the Nth most common word has frequency roughly proportional to 1/N. The single most common word appears roughly twice as often as the second, three times as often as the third, etc. Deviations from Zipf's law often indicate artificially generated or translated text. Hapax legomena (words that appear exactly once) typically make up 40–60% of the distinct vocabulary in a corpus — a useful sanity check.
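That sanity check is cheap to run against the counts Map built earlier:

```javascript
// Fraction of distinct words that occur exactly once (hapax legomena).
// Expect roughly 0.4–0.6 for natural-language corpora.
function hapaxRatio(counts) { // counts: Map of token -> frequency
  let hapax = 0;
  for (const c of counts.values()) if (c === 1) hapax++;
  return counts.size === 0 ? 0 : hapax / counts.size;
}
```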
Common mistakes
Splitting on whitespace only and keeping punctuation attached to tokens. Case-folding too early and losing proper-noun distinction. Applying an English stop-word list to non-English text. Counting “run,” “runs,” and “running” separately when you meant them as one concept. Forgetting that HTML tags, URLs, and numbers need separate handling. And confusing frequency rank with importance — Zipf’s law guarantees “the” wins every time.