How to Count Word Frequency
Count word frequency online instantly with our free text analyzer. Remove stop-words and explore n-grams for SEO research without registration.
Word frequency is one of the oldest text-analysis techniques and still powers SEO, content audits, plagiarism detection, style checking, and basic NLP. Count how often each word appears in a document and you can spot overused terms, keyword coverage gaps, tonal tics, and the shape of a corpus. But a naive split(" ") breaks on punctuation, treats “run,” “runs,” and “running” as three different words, and conflates meaningful content with high-frequency filler. This guide covers tokenization rules, stop words, stemming, n-grams, and the specific tuning you need for SEO vs style vs research applications.
Tokenization: the first hard choice
How you cut text into words determines every count downstream. Naive whitespace split:
"Don't split on hyphens, maybe." -> ["Don't", "split", "on", "hyphens,", "maybe."]
Punctuation is attached. Better: split on non-word characters, then lowercase:
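str.toLowerCase().match(/[\p{L}\p{N}']+/gu)
This keeps contractions (“don’t”) but strips commas and periods. The class only contains the straight apostrophe, so add ’ as well if your text uses typographic apostrophes. Add hyphens to the class if you want “state-of-the-art” as one token.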
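The tokenization rule above can be wrapped in a small helper. This is a minimal sketch, not a definitive tokenizer: the function name `tokenize` and the `keepHyphens` option are assumptions introduced for illustration.

```javascript
// Hypothetical helper: split text into lowercase word tokens.
// `keepHyphens` is an assumed option name, not something from the guide.
function tokenize(text, { keepHyphens = false } = {}) {
  // Fold typographic apostrophes to ASCII so "don’t" and "don't" match.
  const normalized = text.replace(/[\u2018\u2019]/g, "'").toLowerCase();
  const pattern = keepHyphens ? /[\p{L}\p{N}'-]+/gu : /[\p{L}\p{N}']+/gu;
  // match() returns null when nothing matches, so fall back to [].
  return normalized.match(pattern) || [];
}

console.log(tokenize("Don't split on hyphens, maybe."));
// → ["don't", "split", "on", "hyphens", "maybe"]
console.log(tokenize("A state-of-the-art parser", { keepHyphens: true }));
// → ["a", "state-of-the-art", "parser"]
```

With `keepHyphens` enabled, a free-standing dash surrounded by spaces would also come through as a token, so you may want to filter out tokens that contain no letters or digits.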
Case folding
“The” and “the” should count as the same word unless you’re analyzing capitalization patterns. toLowerCase() is usually fine, but remember locale-specific rules (Turkish dotted/dotless i, German ß).
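A short sketch of where the locale caveats show up in practice. The helper name `foldCase` is an assumption for illustration, and the Turkish line depends on the runtime supporting Turkish casing rules, so no output is claimed for it.

```javascript
// Locale-independent lowercasing: fine for most English text.
console.log("The THE the".toLowerCase()); // → "the the the"

// Turkish distinguishes dotted and dotless i. Passing "tr" asks the runtime
// to apply Turkish casing, where "I" lowercases to dotless "ı".
console.log("ISPARTA".toLocaleLowerCase("tr"));

// German ß is already lowercase; uppercasing expands it to "SS",
// so a round trip through uppercase changes the string.
console.log("straße".toUpperCase()); // → "STRASSE"

// Hypothetical helper: fold case for counting, with an optional locale.
function foldCase(token, locale) {
  return locale ? token.toLocaleLowerCase(locale) : token.toLowerCase();
}

console.log(foldCase("The")); // → "the"
```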
Stop words
The most frequent words in English text (“the, of, and, a, to, in, is, you, that, it, he, was, for, on, are” and so on) are rarely interesting. Standard stop-word lists strip them so the remaining counts reflect content.
const STOP = new Set([
  "the","a","an","and","or","but","if","then","else","of",
  "to","in","on","at","for","from","by","with","as","is",
  "are","was","were","be","been","being","have","has","had",
  "do","does","did","will","would","shall","should","may","might",
  "must","can","could","this","that","these","those","i","you",
  "he","she","it","we","they"
]);
tokens.filter(t => !STOP.has(t))
Customize the list for your domain. SEO stop-word lists usually keep more terms than research-corpus lists.
Stemming vs lemmatization
Both collapse word variants to a single form:
- Stemming — algorithmic, cheap, aggressive. “running” → “run”, “better” → “better” (doesn’t handle irregulars)
- Lemmatization — dictionary-based, accurate, slower. “running” → “run”, “better” → “good”
Stemming is usually good enough for frequency counting. Porter stemmer is the classic; Snowball is its modern descendant. For accuracy-critical work, use spaCy or NLTK with WordNet.
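To make the idea of algorithmic suffix stripping concrete, here is a deliberately crude sketch. It is not the Porter or Snowball algorithm: the function name `crudeStem` and its three rules are assumptions chosen to keep the example short, and a real project should use an established stemmer library.

```javascript
// Toy suffix stripper for illustration only. NOT Porter or Snowball;
// the rules and length thresholds below are assumptions.
function crudeStem(word) {
  let stem = word;
  if (stem.length > 5 && stem.endsWith("ing")) {
    stem = stem.slice(0, -3);
  } else if (stem.length > 4 && stem.endsWith("ed")) {
    stem = stem.slice(0, -2);
  } else if (stem.length > 3 && stem.endsWith("s") && !stem.endsWith("ss")) {
    stem = stem.slice(0, -1);
  }
  // Undouble a trailing consonant left behind by stripping ("runn" -> "run"),
  // but leave ll, ss, zz alone ("miss", "fall").
  if (stem !== word && /([^aeioulsz])\1$/.test(stem)) {
    stem = stem.slice(0, -1);
  }
  return stem;
}

console.log(["run", "runs", "running"].map(crudeStem)); // → ["run", "run", "run"]
console.log(crudeStem("better")); // → "better" (irregular forms are untouched)
```

Like any stemmer, this trades accuracy for speed: it collapses regular inflections but has no way to map “better” to “good”.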
Counting
Trivial with a map:
const counts = new Map();
for (const t of tokens) {
  counts.set(t, (counts.get(t) || 0) + 1);
}
// sort desc
const sorted = [...counts.entries()]
  .sort((a, b) => b[1] - a[1]);
N-grams: beyond single words
Single-word counts miss phrases. “San Francisco” carries information that “san” + “francisco” separately doesn’t. Bigrams (2-word) and trigrams (3-word) capture this:
function ngrams(tokens, n) {
  const out = [];
  for (let i = 0; i <= tokens.length - n; i++) {
    out.push(tokens.slice(i, i + n).join(" "));
  }
  return out;
}
const bigrams = ngrams(tokens, 2);
const trigrams = ngrams(tokens, 3);
Bigram stop-word filtering is trickier: “of the” is noise but “state of the art” is signal. Strip bigrams where both tokens are stop words, keep the rest.
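The filtering rule just described (drop an n-gram only when every token in it is a stop word) might be sketched as follows. The names `STOP_WORDS`, `ngramTokens`, and `contentNgrams` are assumptions for this example, and the stop list is a short stand-in for your own.

```javascript
// Assumed short stop-word set for the demo; substitute your own list.
const STOP_WORDS = new Set(["the", "of", "a", "in", "is", "and", "to"]);

// Build n-grams as token arrays so individual tokens stay inspectable.
function ngramTokens(tokens, n) {
  const out = [];
  for (let i = 0; i <= tokens.length - n; i++) {
    out.push(tokens.slice(i, i + n));
  }
  return out;
}

// Drop an n-gram only when every token in it is a stop word.
function contentNgrams(tokens, n, stopWords = STOP_WORDS) {
  return ngramTokens(tokens, n)
    .filter((gram) => !gram.every((t) => stopWords.has(t)))
    .map((gram) => gram.join(" "));
}

const sample = ["the", "state", "of", "the", "art"];
console.log(contentNgrams(sample, 2));
// → ["the state", "state of", "the art"]
```

Here “of the” is removed because both tokens are stop words, while the bigrams that touch “state” or “art” survive.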
TF-IDF: frequency in context
Raw frequency favors stop words and common terms. TF-IDF (term frequency-inverse document frequency) measures how distinctive a word is to this document relative to a corpus.
tf(t, d) = count of t in d / total terms in d
idf(t) = log(N / n_t)   // N docs total, n_t docs with t
tfidf(t, d) = tf(t, d) * idf(t)
High TF-IDF = characteristic of the document. Great for tagging, topic extraction, and finding the “gist” words.
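A minimal sketch of those formulas, assuming the corpus is already tokenized into an array of token arrays. The function name `tfidf` and the input shape are assumptions; the natural logarithm is used for idf.

```javascript
// docs: array of token arrays, one per document (assumed input shape).
function tfidf(docs) {
  const N = docs.length;
  // n_t: number of documents that contain each term at least once.
  const docFreq = new Map();
  for (const tokens of docs) {
    for (const t of new Set(tokens)) {
      docFreq.set(t, (docFreq.get(t) || 0) + 1);
    }
  }
  // One Map of term -> tf-idf score per document.
  return docs.map((tokens) => {
    const counts = new Map();
    for (const t of tokens) counts.set(t, (counts.get(t) || 0) + 1);
    const scores = new Map();
    for (const [t, c] of counts) {
      const tf = c / tokens.length;
      const idf = Math.log(N / docFreq.get(t));
      scores.set(t, tf * idf);
    }
    return scores;
  });
}

const docs = [
  ["cats", "chase", "mice"],
  ["dogs", "chase", "cats"],
  ["mice", "eat", "cheese"],
];
const scores = tfidf(docs);
console.log(scores[2].get("cheese")); // (1/3) * ln(3): appears in one document only
console.log(scores[0].get("cats"));   // (1/3) * ln(3/2): appears in two of three
```

A term that occurs in every document gets idf = log(1) = 0, which is how this weighting suppresses stop words without an explicit list.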
SEO application: keyword density
Keyword density = (count of keyword / total words) × 100. Old SEO target was 1–3%. Modern consensus: natural language beats forced density. Use frequency counting to:
- Catch obvious keyword stuffing (>5% for any one term)
- Find coverage gaps where expected terms are missing
- Audit multi-page consistency
- Spot over-indexing on stop-word-like phrases
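The density formula and the stuffing check can be sketched like this. The function names `keywordDensity` and `stuffedTerms` and the `thresholdPct` parameter are assumptions for illustration, and the sample page is synthetic.

```javascript
// Keyword density as a percentage of all tokens.
function keywordDensity(tokens, keyword) {
  if (tokens.length === 0) return 0;
  const hits = tokens.filter((t) => t === keyword).length;
  return (hits * 100) / tokens.length;
}

// List terms whose density exceeds thresholdPct (5 mirrors the stuffing
// heuristic in the list above). The parameter name is an assumption.
function stuffedTerms(tokens, thresholdPct = 5) {
  const counts = new Map();
  for (const t of tokens) counts.set(t, (counts.get(t) || 0) + 1);
  return [...counts]
    .map(([term, count]) => ({ term, pct: (count * 100) / tokens.length }))
    .filter((row) => row.pct > thresholdPct)
    .sort((a, b) => b.pct - a.pct);
}

// Synthetic 25-token page: "shoes" three times plus 22 distinct filler tokens.
const page = ["shoes", "shoes", "shoes", ...Array.from({ length: 22 }, (_, i) => `w${i}`)];
console.log(keywordDensity(page, "shoes")); // → 12
console.log(stuffedTerms(page)); // → [{ term: "shoes", pct: 12 }]
```

On very short texts almost every term exceeds 5%, so a threshold like this is only meaningful on page-length content, and stop words should be filtered out first.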
Style checking
Frequency counts reveal habitual tics: “really,” “just,” “very,” “that” overused as filler. Run your draft through a frequency pass and the top 30 content words show your patterns.
Research and corpus analysis
For larger corpora:
- Normalize to NFKC, lowercase, strip punctuation
- Apply a domain-specific stop-word list
- Stem with Snowball or Porter
- Generate uni/bi/trigrams, report top 50 of each
- For multi-document analysis, compute TF-IDF for each document against the whole corpus
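The steps above can be strung together in one pass. This sketch leaves out stemming (plug in a Snowball or Porter implementation where the tokens are produced) and TF-IDF; the names `PIPELINE_STOP` and `topNgrams` and the tiny stop list are assumptions for the demo.

```javascript
// Assumed minimal stop list for the demo; swap in a domain-specific one.
const PIPELINE_STOP = new Set(["the", "a", "an", "and", "of", "to", "in", "is"]);

function topNgrams(text, n, limit = 50, stopWords = PIPELINE_STOP) {
  // Normalize to NFKC, lowercase, and drop punctuation via the token regex.
  const tokens = text.normalize("NFKC").toLowerCase().match(/[\p{L}\p{N}']+/gu) || [];
  // Count n-grams, skipping any made up entirely of stop words.
  // For n = 1 this is the plain stop-word filter.
  const counts = new Map();
  for (let i = 0; i <= tokens.length - n; i++) {
    const gram = tokens.slice(i, i + n);
    if (gram.every((t) => stopWords.has(t))) continue;
    const key = gram.join(" ");
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Most frequent first; ties keep first-seen order (sort is stable).
  return [...counts].sort((a, b) => b[1] - a[1]).slice(0, limit);
}

const text = "The cat sat. The cat ran, and the dog sat.";
console.log(topNgrams(text, 1, 3)); // → [["cat", 2], ["sat", 2], ["ran", 1]]
console.log(topNgrams(text, 2, 2)); // → [["the cat", 2], ["cat sat", 1]]
```

Note that the bigrams here cross sentence boundaries (“sat the”); for research work you may prefer to split into sentences first.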
Hapax legomena and Zipf’s law
Natural-language frequency distributions roughly follow Zipf’s law: the Nth most common word has frequency roughly proportional to 1/N. So the most common word appears about twice as often as the second and about three times as often as the third. Deviations from Zipf’s law can indicate artificially generated or translated text. Hapax legomena (words that appear exactly once) typically make up 40–60% of the distinct vocabulary in a corpus, which makes their share a useful sanity check.
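Both checks are easy to run on a count map like the one built in the Counting section. This is a rough sketch with assumed helper names (`hapaxRatio`, `zipfProducts`) and made-up counts chosen so the arithmetic is easy to follow.

```javascript
// Share of the distinct vocabulary that occurs exactly once.
function hapaxRatio(counts) {
  if (counts.size === 0) return 0;
  let hapax = 0;
  for (const c of counts.values()) if (c === 1) hapax++;
  return hapax / counts.size;
}

// Under an ideal Zipf distribution, rank * frequency stays roughly constant.
function zipfProducts(counts, topK = 10) {
  return [...counts.values()]
    .sort((a, b) => b - a)
    .slice(0, topK)
    .map((freq, i) => ({ rank: i + 1, freq, product: (i + 1) * freq }));
}

// Made-up counts for illustration, not measurements from a real corpus.
const demoCounts = new Map([
  ["the", 60], ["of", 30], ["and", 20], ["zebra", 1], ["quokka", 1],
]);
console.log(hapaxRatio(demoCounts)); // → 0.4
console.log(zipfProducts(demoCounts, 3).map((r) => r.product)); // → [60, 60, 60]
```

On real text the products drift rather than staying flat; what matters is whether they stay in the same rough range across the top ranks.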
Common mistakes
Splitting on whitespace only and keeping punctuation attached to tokens. Case-folding too early and losing proper-noun distinction. Applying an English stop-word list to non-English text. Counting “run,” “runs,” and “running” separately when you meant them as one concept. Forgetting that HTML tags, URLs, and numbers need separate handling. And confusing frequency rank with importance — Zipf’s law guarantees “the” wins every time.