How to Detect Invisible Characters
Zero-width joiner/non-joiner, BOM, non-breaking spaces, tag characters, and how they break regex and search after paste.
Some of the most frustrating bugs in text processing are caused by characters you literally cannot see. Zero-width spaces, non-breaking spaces, byte-order marks, zero-width joiners, and the exotic tag characters used in Unicode hide in pasted text, survive regex cleanup, and silently break string matching, search indexes, and CSV parsing. A password field rejects your input; a regex doesn’t match what you know is there; a file has a mysterious first character. This guide covers the most common invisible characters, how they sneak into your workflow, and the detection and stripping patterns that actually work.
The usual suspects
- NBSP (U+00A0) — non-breaking space, looks like space
- ZWSP (U+200B) — zero-width space, takes no width
- ZWNJ (U+200C) — zero-width non-joiner
- ZWJ (U+200D) — zero-width joiner (used in emoji)
- BOM (U+FEFF) — byte-order mark, often at file start
- Soft hyphen (U+00AD) — only visible at line-break points
- LRM/RLM (U+200E/U+200F) — LTR/RTL marks
- Tag characters (U+E0000 block) — invisible hidden-message vector
- Variation selectors (U+FE00–U+FE0F)
- Ideographic space (U+3000) — full-width space
How they get in
Paste workflows are the main source:
- Copy from Microsoft Word → NBSP and smart quotes
- Copy from web pages → ZWSP for word-break hints
- Save from Excel CSV → BOM at file start
- Copy from terminal → ANSI escape sequences
- Paste from messaging apps → ZWJ in emoji sequences
- Malicious paste → intentionally hidden payloads
Detecting with a hex dump
The most reliable inspection: view the hex. Any non-ASCII byte in text that looks like plain ASCII is suspect.
// JS: dump each code point
[...str].forEach((c, i) => {
  const cp = c.codePointAt(0).toString(16).padStart(4, "0");
  console.log(i, c, "U+" + cp.toUpperCase());
});
On the command line: xxd, od -c, or hexyl.
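When the suspect text is already in a JavaScript string, a byte-level view can be produced without leaving the runtime. The following is a minimal sketch; `utf8Hex` is a hypothetical helper name, and it assumes an environment where `TextEncoder` is a global (modern browsers and current Node.js).

```javascript
// Hypothetical helper: show the UTF-8 bytes of a string as hex pairs.
function utf8Hex(str) {
  const bytes = new TextEncoder().encode(str); // Uint8Array of UTF-8 bytes
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

console.log(utf8Hex("a\u00A0b")); // 61 c2 a0 62
console.log(utf8Hex("a\u200Bb")); // 61 e2 80 8b 62
```

Any byte of 80 or above in the output belongs to a non-ASCII character, which makes an NBSP or zero-width space easy to spot between ordinary letters.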
Regex detection
Match anything that renders with zero or ambiguous width:
const INVISIBLES = /[\u00A0\u00AD\u034F\u061C\u115F\u1160\u17B4\u17B5\u180E\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\u2060-\u2064\uFEFF\uFFF9-\uFFFB]/gu;
str.match(INVISIBLES);
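This covers many of the common code points Unicode lists as “default-ignorable” plus the known invisible-space characters, along with a few neighbors such as the line and paragraph separators (U+2028, U+2029). It does not include the variation selectors (U+FE00–U+FE0F) or the tag characters. Extend it with the E0000 tag block if you’re paranoid about hidden-message attacks.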
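A list of matches is more useful when it says where each character sits. The sketch below is one way to do that; `INVISIBLES_EXT` and `findInvisibles` are hypothetical names, and the extended class (variation selectors plus the whole Tags block) is an assumption about what you want flagged, not a canonical list.

```javascript
// Assumption: this extended class adds variation selectors and the Tags block
// to the ranges above; adjust it to your own threat model.
const INVISIBLES_EXT =
  /[\u00A0\u00AD\u034F\u061C\u115F\u1160\u17B4\u17B5\u180E\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\uFE00-\uFE0F\uFEFF\uFFF9-\uFFFB\u{E0000}-\u{E007F}]/gu;

// Hypothetical helper: list each invisible character with its position.
function findInvisibles(str) {
  return Array.from(str.matchAll(INVISIBLES_EXT), (m) => ({
    index: m.index, // UTF-16 offset into str
    codePoint:
      "U+" + m[0].codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
  }));
}

// Finds U+200B at index 1 and U+00A0 at index 3.
console.log(findInvisibles("a\u200Bb\u00A0c"));
```

Note that including U+FE00–U+FE0F will also flag ordinary emoji that carry a variation selector, so treat hits in that range as informational rather than as errors.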
Zero-width characters
U+200B–U+200D and U+FEFF take zero rendering width. They’re functionally invisible but affect:
- String length ("a\u200Bb".length === 3)
- Regex matching (/ab/ won’t match)
- Search indexes
- URL parsing and domain validation
- Password comparison
Strip aggressively for input normalization:
str.replace(/[\u200B-\u200D\uFEFF]/g, "")
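The effects listed above are easy to reproduce. This short sketch uses a made-up sample string and shows the length, regex, and equality checks before and after stripping.

```javascript
const dirty = "a\u200Bb"; // "a", ZERO WIDTH SPACE, "b"

console.log(dirty.length);     // 3
console.log(/ab/.test(dirty)); // false
console.log(dirty === "ab");   // false

const stripped = dirty.replace(/[\u200B-\u200D\uFEFF]/g, "");

console.log(stripped.length);     // 2
console.log(/ab/.test(stripped)); // true
console.log(stripped === "ab");   // true
```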
The BOM problem
U+FEFF at the start of a file is a byte-order mark, used by some tools to signal UTF encoding. It causes:
- CSV parsers reading “\uFEFFName” (which displays as “Name”) as the first column header, so lookups by “Name” fail
- JSON parsers failing with “unexpected character”
- Shell scripts failing because the shebang line is invalidated
- Diff tools showing a byte at position 0 that “isn’t there”
// strip BOM at start of file only
str.replace(/^\uFEFF/, "")
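The JSON failure is a good one to see in isolation. The sketch below simulates file contents with a leading BOM rather than reading a real file; `stripBom` is a hypothetical helper name.

```javascript
// Hypothetical helper: drop a single leading BOM, leave everything else alone.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

// Simulated file contents, as a decoder that keeps the BOM would return them.
const raw = "\uFEFF" + '{"name": "Ada"}';

let parsedRaw = null;
try {
  parsedRaw = JSON.parse(raw); // U+FEFF is not JSON whitespace
} catch (err) {
  console.log("raw parse failed:", err.name); // raw parse failed: SyntaxError
}

const parsed = JSON.parse(stripBom(raw));
console.log(parsed.name); // Ada
```

Checking only position 0 keeps any U+FEFF that appears later in the text, which matches the "start of file only" intent.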
Non-breaking space variants
NBSP (U+00A0) is the most common impostor. Looks identical to space. Breaks:
- / / regex
- str.split(" ")
- Trim (in older engines)
Other space variants to watch: U+2007 (figure space), U+2008 (punctuation space), U+202F (narrow no-break), U+3000 (ideographic space). Normalize all to regular space:
str.replace(/[\u00A0\u2000-\u200A\u202F\u205F\u3000]/g, " ")
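To make the split failure concrete, here is a small sketch with a made-up pasted string. `SPACE_VARIANTS` and `normalizeSpaces` are hypothetical names wrapping the same character class as above.

```javascript
const SPACE_VARIANTS = /[\u00A0\u2000-\u200A\u202F\u205F\u3000]/g;

// Hypothetical helper: map space look-alikes to U+0020.
function normalizeSpaces(s) {
  return s.replace(SPACE_VARIANTS, " ");
}

const pasted = "first\u00A0second\u3000third";

console.log(pasted.split(" ").length);                  // 1
console.log(normalizeSpaces(pasted).split(" ").length); // 3
console.log(pasted.split(/\s+/).length);                // 3
```

The last line works because `\s` in current JavaScript engines matches the Unicode space-separator characters, so splitting on `/\s+/` is a reasonable fallback when you cannot normalize the input first.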
Tag characters — the hidden-message vector
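The tag characters U+E0020–U+E007E, part of Unicode’s Tags block (U+E0000–U+E007F), mirror printable ASCII but are default-ignorable, so most renderers draw nothing for them. You can encode an entire message in “invisible” tag characters and append it to normal text. It passes through cleanup regexes that only target whitespace and zero-width characters, most UI display, and most copy-paste. Used in watermarking and some attack scenarios. Strip unless you have a specific reason to keep them, such as emoji subdivision flags (England, Scotland, Wales), which are built from tag sequences.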
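To see what such a payload looks like, the sketch below hides a short ASCII string in tag characters and then recovers it. `hideInTags` and `revealTags` are hypothetical helpers for illustration only, and `hideInTags` assumes its input is printable ASCII.

```javascript
// Tag characters U+E0020..U+E007E sit exactly 0xE0000 above printable ASCII.
function hideInTags(ascii) {
  return Array.from(ascii, (ch) =>
    String.fromCodePoint(ch.codePointAt(0) + 0xe0000)
  ).join("");
}

// Pull out any tag characters and shift them back down to ASCII.
function revealTags(str) {
  const tags = str.match(/[\u{E0020}-\u{E007E}]/gu) ?? [];
  return tags
    .map((t) => String.fromCodePoint(t.codePointAt(0) - 0xe0000))
    .join("");
}

const carrier = "Looks harmless." + hideInTags("secret");

console.log(carrier.length);      // 27: 15 visible + 6 tags at 2 UTF-16 units each
console.log(revealTags(carrier)); // secret
```

Stripping the range removes the payload without touching the visible text: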
str.replace(/[\u{E0020}-\u{E007F}]/gu, "")
Detection UI patterns
When a user complains “the form says my input is invalid but it looks fine,” show a character-by-character diagnostic:
- Render each character with its code point
- Highlight any not in a safe allow-list
- Suggest an auto-cleaned version
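The three steps above can be sketched as a single function that a form could call when validation fails. Everything here is hypothetical: the name `diagnose`, the shape of the rows, and especially the allow-list, which is assumed to be printable ASCII plus tab and newline and would need widening for any form that expects other scripts.

```javascript
// Assumption: the allow-list is printable ASCII plus tab and newline.
const SAFE = /^[\x20-\x7E\t\n]$/;

// Hypothetical helper: one row per code point, flagged if outside the allow-list.
function diagnose(input) {
  return Array.from(input, (ch) => ({
    char: ch,
    codePoint:
      "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
    suspicious: !SAFE.test(ch),
  }));
}

const report = diagnose("pa\u200Bss");

// Code points to highlight in the UI.
console.log(report.filter((r) => r.suspicious).map((r) => r.codePoint)); // [ 'U+200B' ]

// Suggested cleaned version: keep only the rows that passed the allow-list.
const suggestion = report
  .filter((r) => !r.suspicious)
  .map((r) => r.char)
  .join("");
console.log(suggestion); // pass
```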
Prevention at input boundaries
The fix is at ingest, not at query. On every user-text input:
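function cleanInput(s) {
  return s
    .normalize("NFC")
    .replace(/^\uFEFF/, "") // leading BOM
    .replace(/[\u200B-\u200D\u2060-\u2064\uFEFF]/g, "") // zero-width and invisible format characters (also removes ZWJ and ZWNJ)
    .replace(/[\u00A0\u2000-\u200A\u202F\u205F\u3000]/g, " ") // space look-alikes become U+0020
    .replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, "") // control characters except tab, LF, CR
    .trim();
}
When invisible characters are wanted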
Not always garbage. Legitimate uses include:
- ZWJ in emoji sequences (family, professions, flags)
- ZWNJ in Persian and Arabic text for correct letter-joining
- Soft hyphens for line-break hints in print typography
- LRM/RLM for fixing bidirectional text display
Context-aware stripping only. Don’t nuke ZWJ from emoji.
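One way to approximate context-aware stripping is to look at the neighbors of each joiner. The sketch below is a heuristic, not a full implementation of Unicode's emoji or joining rules: it keeps ZWJ only between emoji-like code points and ZWNJ only between Arabic-script letters (which covers Persian), and drops ZWSP and U+FEFF everywhere. The function name and the neighbor tests are assumptions made for illustration.

```javascript
// What may precede a ZWJ inside an emoji sequence: a pictograph,
// a skin-tone modifier, or the emoji variation selector U+FE0F.
const EMOJI_BEFORE = /[\p{Extended_Pictographic}\p{Emoji_Modifier}\uFE0F]/u;
const EMOJI_AFTER = /\p{Extended_Pictographic}/u;
const ARABIC = /\p{Script=Arabic}/u;

// Hypothetical helper: strip zero-width characters unless context suggests intent.
function stripZeroWidthInContext(s) {
  const cps = [...s]; // iterate by code point, not UTF-16 unit
  let out = "";
  for (let i = 0; i < cps.length; i++) {
    const c = cps[i];
    const prev = cps[i - 1] ?? "";
    const next = cps[i + 1] ?? "";
    if (c === "\u200D") {
      // Keep ZWJ only when it joins two emoji-like neighbors.
      if (EMOJI_BEFORE.test(prev) && EMOJI_AFTER.test(next)) out += c;
    } else if (c === "\u200C") {
      // Keep ZWNJ only between Arabic-script letters.
      if (ARABIC.test(prev) && ARABIC.test(next)) out += c;
    } else if (c !== "\u200B" && c !== "\uFEFF") {
      out += c;
    }
  }
  return out;
}

const astronaut = "\u{1F469}\u200D\u{1F680}"; // woman + ZWJ + rocket
console.log(stripZeroWidthInContext(astronaut) === astronaut); // true
console.log(stripZeroWidthInContext("a\u200Bb\u200Dc"));       // abc
```

Fields that must be compared byte for byte, such as usernames or passwords, may still be better served by strict stripping; this looser variant suits free-form text where emoji and Persian or Arabic content are expected.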
Common mistakes
Writing a “strip whitespace” routine that misses NBSP and ideographic space. Not stripping the BOM from CSV imports and having a broken first column. Assuming trim() in every language handles Unicode whitespace (it doesn’t, in some older runtimes). Treating all zero-width characters as noise when some carry meaning. And not logging code points when a mystery bug lands — debugging invisible characters without a hex dump is agony.