How to Detect Invisible Characters
Find the zero-width joiners, non-breaking spaces, and byte-order marks that break your regex after a paste.
Some of the most frustrating bugs in text processing are caused by characters you literally cannot see. Zero-width spaces, non-breaking spaces, byte-order marks, zero-width joiners, and the exotic tag characters used in Unicode hide in pasted text, survive regex cleanup, and silently break string matching, search indexes, and CSV parsing. A password field rejects your input; a regex doesn’t match what you know is there; a file has a mysterious first character. This guide covers the most common invisible characters, how they sneak into your workflow, and the detection and stripping patterns that actually work.
The usual suspects
- NBSP (U+00A0) — non-breaking space, looks like space
- ZWSP (U+200B) — zero-width space, takes no width
- ZWNJ (U+200C) — zero-width non-joiner
- ZWJ (U+200D) — zero-width joiner (used in emoji)
- BOM (U+FEFF) — byte-order mark, often at file start
- Soft hyphen (U+00AD) — only visible at line-break points
- LRM/RLM (U+200E/U+200F) — LTR/RTL marks
- Tag characters (U+E0000 block) — invisible hidden-message vector
- Variation selectors (U+FE00–U+FE0F)
- Ideographic space (U+3000) — full-width space
How they get in
Paste workflows are the main source:
- Copy from Microsoft Word → NBSP and smart quotes
- Copy from web pages → ZWSP for word-break hints
- Save from Excel CSV → BOM at file start
- Copy from terminal → ANSI escape sequences
- Paste from messaging apps → ZWJ in emoji sequences
- Malicious paste → intentionally hidden payloads
Detecting with a hex dump
The most reliable inspection: view the hex. In text you expect to be plain ASCII, any byte or code point outside the ASCII range is suspect.
// JS: dump each code point
[...str].forEach((c, i) => {
  const cp = c.codePointAt(0).toString(16).padStart(4, "0");
  console.log(i, c, "U+" + cp.toUpperCase());
});
On the command line: xxd, od -c, or hexyl.
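For a byte-level view similar to what a hex dump tool prints, the sketch below encodes a string as UTF-8 and formats each byte as hex. The helper name `utf8Hex` is made up for illustration, and it assumes a runtime where `TextEncoder` is a global (modern browsers and recent Node.js).

```javascript
// Hypothetical helper: return the UTF-8 bytes of a string as space-separated hex.
function utf8Hex(str) {
  const bytes = new TextEncoder().encode(str);
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

// A zero-width space (U+200B) hides between "a" and "b".
console.log(utf8Hex("a\u200Bb")); // 61 e2 80 8b 62

// A BOM (U+FEFF) at the start shows up as the bytes ef bb bf.
console.log(utf8Hex("\uFEFFa")); // ef bb bf 61
```

Any multi-byte run in a string that should be plain ASCII is worth a closer look.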
Regex detection
Match anything that renders with zero or ambiguous width:
const INVISIBLES = /[\u00A0\u00AD\u034F\u061C\u115F\u1160\u17B4\u17B5\u180E\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\u2060-\u2064\uFEFF\uFFF9-\uFFFB]/gu;
str.match(INVISIBLES);
This covers most of the characters Unicode lists as “default-ignorable”, plus the known invisible-space characters. It does not include the variation selectors or the U+E0000 tag block; add the tag block if you’re worried about hidden-message attacks.
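A bare `match` only says which characters were found, not where. As a rough sketch, the hypothetical helper `findInvisibles` below reports the position and code point of each hit. Its character class is an assumption modeled on the ranges above, with the tag block added, and is not an exhaustive list.

```javascript
// Assumed character class: common invisible characters plus the U+E0000 tag block.
const INVISIBLE_RE =
  /[\u00A0\u00AD\u034F\u061C\u115F\u1160\u17B4\u17B5\u180E\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\uFEFF\uFFF9-\uFFFB\u{E0000}-\u{E007F}]/gu;

// Hypothetical helper: list every invisible character with its offset and code point.
function findInvisibles(str) {
  const hits = [];
  for (const m of str.matchAll(INVISIBLE_RE)) {
    const cp = m[0].codePointAt(0);
    hits.push({
      index: m.index, // offset in UTF-16 code units
      codePoint: "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
    });
  }
  return hits;
}

console.log(findInvisibles("a\u200Bb\u00A0c"));
// [ { index: 1, codePoint: 'U+200B' }, { index: 3, codePoint: 'U+00A0' } ]
```

An empty result means none of the listed characters are present. It does not prove the text is free of every invisible character.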
Zero-width characters
U+200B–U+200D and U+FEFF take zero rendering width. They’re functionally invisible but affect:
- String length ("a\u200Bb".length === 3)
- Regex matching (/ab/ won’t match)
- Search indexes
- URL parsing and domain validation
- Password comparison
Strip aggressively for input normalization:
str.replace(/[\u200B-\u200D\uFEFF]/g, "")
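To see the effect end to end, here is a small sketch that wraps the same replacement in a helper (the name `stripZeroWidth` is an assumption) and checks length and matching before and after. Note that it also removes ZWJ and ZWNJ, so it is only appropriate where those characters carry no meaning.

```javascript
const ZERO_WIDTH = /[\u200B-\u200D\uFEFF]/g;

// Hypothetical helper: remove ZWSP, ZWNJ, ZWJ, and BOM wherever they appear.
function stripZeroWidth(s) {
  return s.replace(ZERO_WIDTH, "");
}

const pastedWord = "a\u200Bb"; // looks like "ab" on screen

console.log(pastedWord.length);                       // 3
console.log(/ab/.test(pastedWord));                   // false
console.log(stripZeroWidth(pastedWord).length);       // 2
console.log(/ab/.test(stripZeroWidth(pastedWord)));   // true
```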
The BOM problem
U+FEFF at the start of a file is a byte-order mark, used by some tools to signal UTF encoding. It causes:
- CSV parsers reading “\uFEFFName” instead of “Name” as the first column header
- JSON parsers failing with “unexpected character”
- Shell scripts failing because the shebang line is invalidated
- Diff tools showing a byte at position 0 that “isn’t there”
// strip BOM at start of file only
str.replace(/^\uFEFF/, "")
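The CSV and JSON failures are easy to reproduce. The sketch below uses a made-up two-line CSV and a hypothetical `stripBom` helper, and splits naively on commas and newlines, which is enough to show the header problem but is not a real CSV parser.

```javascript
// Hypothetical helper: remove a single leading BOM and leave the rest untouched.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

// Made-up sample data with a BOM in front of the header row.
const csv = "\uFEFFName,Email\nAda,ada@example.com";

const rawHeader = csv.split("\n")[0].split(",");
const cleanHeader = stripBom(csv).split("\n")[0].split(",");

console.log(rawHeader[0] === "Name");   // false
console.log(rawHeader[0].length);       // 5
console.log(cleanHeader[0] === "Name"); // true

// JSON.parse rejects a leading BOM, so strip it first.
console.log(JSON.parse(stripBom('\uFEFF{"ok":true}')).ok); // true
```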
Non-breaking space variants
NBSP (U+00A0) is the most common impostor. Looks identical to space. Breaks:
- / / regex
- str.split(" ")
- Trim (in older engines)
Other space variants to watch: U+2007 (figure space), U+2008 (punctuation space), U+202F (narrow no-break), U+3000 (ideographic space). Normalize all to regular space:
str.replace(/[\u00A0\u2000-\u200A\u202F\u205F\u3000]/g, " ")
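As an illustration of the split failure, the sketch below runs the same replacement through a hypothetical `normalizeSpaces` helper on a made-up name that contains an NBSP.

```javascript
const SPACE_VARIANTS = /[\u00A0\u2000-\u200A\u202F\u205F\u3000]/g;

// Hypothetical helper: map every space look-alike to a regular U+0020 space.
function normalizeSpaces(s) {
  return s.replace(SPACE_VARIANTS, " ");
}

const fullName = "Ada\u00A0Lovelace"; // NBSP between the two words

console.log(fullName.split(" ").length);          // 1
console.log(normalizeSpaces(fullName).split(" ")); // [ 'Ada', 'Lovelace' ]
```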
Tag characters — the hidden-message vector
The U+E0020–U+E007F range in Unicode’s Tags block mirrors ASCII but is default-ignorable. You can encode an entire message in “invisible” tag characters and append it to normal text. It survives most regex cleanup, most UI display, and most copy-paste. It has been used in watermarking and some attack scenarios. Strip these characters unless you have a specific reason to keep them.
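Before stripping, it can be useful to read what a hidden payload says. The following sketch, with the hypothetical helper `decodeTagPayload`, maps each tag character back to the ASCII character it mirrors by subtracting 0xE0000. It assumes the payload uses only the printable range U+E0020–U+E007E.

```javascript
// Hypothetical helper: recover the ASCII text hidden in tag characters.
function decodeTagPayload(str) {
  let hidden = "";
  for (const ch of str) { // iterates by code point, so surrogate pairs stay whole
    const cp = ch.codePointAt(0);
    if (cp >= 0xe0020 && cp <= 0xe007e) {
      hidden += String.fromCharCode(cp - 0xe0000);
    }
  }
  return hidden;
}

// "hello" followed by the tag-character forms of "h" and "i".
const sample = "hello" + String.fromCodePoint(0xe0068, 0xe0069);

console.log(sample.length);            // 9 (each tag character is two UTF-16 units)
console.log(decodeTagPayload(sample)); // hi
```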
str.replace(/[\u{E0020}-\u{E007F}]/gu, "")
Detection UI patterns
When a user complains “the form says my input is invalid but it looks fine,” show a character-by-character diagnostic:
- Render each character with its code point
- Highlight any not in a safe allow-list
- Suggest an auto-cleaned version
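The three steps above can be sketched as a single function. Everything here is illustrative: the name `diagnose`, the shape of the result, and especially the allow-list, which assumes only printable ASCII, tab, and newline are "safe". A real form would need a wider list, and might replace an NBSP with a space instead of dropping it.

```javascript
// Assumed allow-list: printable ASCII plus tab and newline.
const SAFE = /^[\x20-\x7E\n\t]$/;

// Hypothetical helper: per-character report plus a suggested cleaned string.
function diagnose(input) {
  const rows = [...input].map((ch) => {
    const cp = ch.codePointAt(0);
    return {
      char: ch,
      codePoint: "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
      suspicious: !SAFE.test(ch), // anything outside the allow-list is flagged
    };
  });
  // Suggested version: drop every flagged character.
  const cleaned = rows.filter((r) => !r.suspicious).map((r) => r.char).join("");
  return { rows, cleaned };
}

const report = diagnose("pass\u200Bword");
console.log(report.rows.filter((r) => r.suspicious));
console.log(report.cleaned); // password
```

The `rows` array can drive the highlighted display, and `cleaned` can be offered to the user as a one-click fix.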
Prevention at input boundaries
The fix is at ingest, not at query. On every user-text input:
function cleanInput(s) {
  return s
    .normalize("NFC")
    .replace(/^\uFEFF/, "")
    .replace(/[\u200B-\u200D\u2060-\u2064\uFEFF]/g, "")
    .replace(/[\u00A0\u2000-\u200A\u202F\u205F\u3000]/g, " ")
    .replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, "")
    .trim();
}
When invisible characters are wanted
Not always garbage. Legitimate uses include:
- ZWJ in emoji sequences (family, professions, flags)
- ZWNJ in Persian and Arabic text for correct letter-joining
- Soft hyphens for line-break hints in print typography
- LRM/RLM for fixing bidirectional text display
Context-aware stripping only. Don’t nuke ZWJ from emoji.
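One possible way to do this for ZWJ is sketched below: keep a ZWJ only when it sits between two pictographic characters (allowing a variation selector or skin-tone modifier before it), and remove it everywhere else. The helper name `stripStrayZwj` is hypothetical. The approach assumes an engine with regex lookbehind and Unicode property escapes, and it assumes the text contains no scripts that use ZWJ for shaping (some Indic scripts do), since those joiners would be removed too.

```javascript
// Matches a ZWJ that is NOT part of an emoji sequence:
//  - not preceded by a pictograph (optionally followed by U+FE0F or a skin-tone modifier), or
//  - not followed by a pictograph.
const STRAY_ZWJ =
  /(?<!\p{Extended_Pictographic}[\uFE0F\p{Emoji_Modifier}]?)\u200D|\u200D(?!\p{Extended_Pictographic})/gu;

// Hypothetical helper: remove stray ZWJ but leave emoji ZWJ sequences intact.
function stripStrayZwj(str) {
  return str.replace(STRAY_ZWJ, "");
}

// Stray joiner between letters is removed.
console.log(stripStrayZwj("a\u200Db") === "ab"); // true

// Family emoji (man ZWJ woman ZWJ girl) is left untouched.
const familyEmoji = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
console.log(stripStrayZwj(familyEmoji) === familyEmoji); // true
```

A similar guard would be needed for ZWNJ in Persian or Arabic text, which this sketch does not touch.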
Common mistakes
Writing a “strip whitespace” routine that misses NBSP and ideographic space. Not stripping the BOM from CSV imports and having a broken first column. Assuming trim() in every language handles Unicode whitespace (it doesn’t, in some older runtimes). Treating all zero-width characters as noise when some carry meaning. And not logging code points when a mystery bug lands — debugging invisible characters without a hex dump is agony.