Free Tool Arena

How to Detect Invisible Characters

Zero-width joiner/non-joiner, BOM, non-breaking spaces, tag characters, and how they break regex and search after paste.

Updated April 2026 · 6 min read

Some of the most frustrating bugs in text processing are caused by characters you literally cannot see. Zero-width spaces, non-breaking spaces, byte-order marks, zero-width joiners, and Unicode’s exotic tag characters hide in pasted text, survive naive regex cleanup, and silently break string matching, search indexes, and CSV parsing. A password field rejects your input; a regex doesn’t match what you know is there; a file has a mysterious first character. This guide covers the most common invisible characters, how they sneak into your workflow, and the detection and stripping patterns that actually work.

The usual suspects

  • NBSP (U+00A0) — non-breaking space, looks like space
  • ZWSP (U+200B) — zero-width space, takes no width
  • ZWNJ (U+200C) — zero-width non-joiner
  • ZWJ (U+200D) — zero-width joiner (used in emoji)
  • BOM (U+FEFF) — byte-order mark, often at file start
  • Soft hyphen (U+00AD) — only visible at line-break points
  • LRM/RLM (U+200E/U+200F) — LTR/RTL marks
  • Tag characters (U+E0000 block) — invisible hidden-message vector
  • Variation selectors (U+FE00–U+FE0F)
  • Ideographic space (U+3000) — full-width space

How they get in

Paste workflows are the main source:

  • Copy from Microsoft Word → NBSP and smart quotes
  • Copy from web pages → ZWSP for word-break hints
  • Save from Excel CSV → BOM at file start
  • Copy from terminal → ANSI escape sequences
  • Paste from messaging apps → ZWJ in emoji sequences
  • Malicious paste → intentionally hidden payloads
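
The terminal case differs from the others in that each ANSI escape sequence spans several characters, and only the leading ESC (U+001B) is a control character. A minimal sketch for removing the common CSI form (colors, cursor movement) follows; the pattern name ANSI_CSI is made up for this guide, and other escape types such as OSC are not handled.

```javascript
// CSI sequences: ESC [ + parameter bytes (0x30-0x3F) + intermediate bytes
// (0x20-0x2F) + one final byte (0x40-0x7E). Other escape types are not covered.
const ANSI_CSI = /\x1B\[[0-?]*[ -\/]*[@-~]/g;

const fromTerminal = "\x1B[31merror\x1B[0m: file not found";
console.log(fromTerminal.replace(ANSI_CSI, "")); // error: file not found
```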

Detecting with a hex dump

The most reliable inspection is to view the hex. In text that should be plain ASCII, any byte above 0x7F is suspect; in other text, look for code points that have no visible glyph.

// JS: dump each code point
[...str].forEach((c, i) => {
  const cp = c.codePointAt(0).toString(16).padStart(4, "0");
  console.log(i, c, "U+" + cp.toUpperCase());
});

On the command line: xxd, od -c, or hexyl.
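
To compare against what those tools print, it can help to see the UTF-8 bytes from JavaScript as well. A minimal sketch, assuming a runtime that provides the standard TextEncoder (modern browsers and Node.js); the helper name utf8Hex is made up for this guide.

```javascript
// Hypothetical helper: UTF-8 bytes of a string as space-separated hex pairs.
function utf8Hex(str) {
  const bytes = new TextEncoder().encode(str);
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

console.log(utf8Hex("a\u200Bb")); // 61 e2 80 8b 62
console.log(utf8Hex("\uFEFFid")); // ef bb bf 69 64
```

The three bytes ef bb bf at the start of a dump are the UTF-8 encoding of the BOM; e2 80 8b is a zero-width space.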

Regex detection

Match anything that renders with zero or ambiguous width:

const INVISIBLES = /[\u00A0\u00AD\u034F\u061C\u115F\u1160\u17B4\u17B5\u180E\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\u2060-\u2064\uFEFF\uFFF9-\uFFFB]/gu;

str.match(INVISIBLES);

This covers many of the characters Unicode lists as “default-ignorable,” plus the common invisible spaces. It is not exhaustive: variation selectors and the tag block, for example, are left out. Extend with the E0000 tag block if you’re paranoid about hidden-message attacks.
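
A match array alone does not say where the characters are. As a sketch, the same character class can be wrapped in a helper that reports each hit with its position and code point. The names INVISIBLE_RE and findInvisibles are made up here, and String.prototype.matchAll is assumed to be available (ES2020).

```javascript
// Same characters as the INVISIBLES pattern above, redeclared so the snippet
// runs on its own.
const INVISIBLE_RE = /[\u00A0\u00AD\u034F\u061C\u115F\u1160\u17B4\u17B5\u180E\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\uFEFF\uFFF9-\uFFFB]/gu;

// Hypothetical helper: list each hit with its UTF-16 index and code point.
function findInvisibles(str) {
  return Array.from(str.matchAll(INVISIBLE_RE), (m) => ({
    index: m.index,
    codePoint: "U+" + m[0].codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
  }));
}

console.log(findInvisibles("price:\u00A010\u200B USD"));
// two hits: index 6 is U+00A0, index 9 is U+200B
```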

Zero-width characters

U+200B–U+200D and U+FEFF take zero rendering width. They’re functionally invisible but affect:

  • String length ("a\u200Bb".length === 3)
  • Regex matching (/ab/ won’t match)
  • Search indexes
  • URL parsing and domain validation
  • Password comparison

Strip aggressively for input normalization:

str.replace(/[\u200B-\u200D\uFEFF]/g, "")
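
A short before-and-after run makes the effect concrete. This is only an illustration; the variable names are arbitrary.

```javascript
const dirty = "a\u200Bb"; // "ab" with a zero-width space in the middle

console.log(dirty.length);     // 3
console.log(/ab/.test(dirty)); // false
console.log(dirty === "ab");   // false

const stripped = dirty.replace(/[\u200B-\u200D\uFEFF]/g, "");
console.log(stripped.length);     // 2
console.log(/ab/.test(stripped)); // true
console.log(stripped === "ab");   // true
```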

The BOM problem

U+FEFF at the start of a file is a byte-order mark, which some tools write to signal a Unicode encoding (UTF-8, or the byte order of UTF-16 and UTF-32). It causes:

  • CSV parsers reading “\uFEFFName” instead of “Name” as the first column header
  • JSON parsers failing with “unexpected character”
  • Shell scripts failing because the shebang line is invalidated
  • Diff tools showing a byte at position 0 that “isn’t there”

// strip BOM at start of file only
str.replace(/^\uFEFF/, "")
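
The JSON and CSV failures are easy to reproduce. The sketch below uses a made-up payload: JSON.parse rejects a leading U+FEFF because it is not JSON whitespace, and a naive header split keeps the BOM glued to the first column name.

```javascript
// A JSON payload with a leading BOM, as some editors and exporters write it.
const raw = "\uFEFF" + '{"name":"Ada"}';

let parsedWithBom = true;
try {
  JSON.parse(raw);
} catch (err) {
  parsedWithBom = false; // U+FEFF is not valid JSON whitespace
}
console.log(parsedWithBom); // false

const data = JSON.parse(raw.replace(/^\uFEFF/, ""));
console.log(data.name); // Ada

// The same BOM in a CSV header line silently changes the first column name.
const header = "\uFEFFName,Email".split(",")[0];
console.log(header === "Name"); // false
```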

Non-breaking space variants

NBSP (U+00A0) is the most common impostor. Looks identical to space. Breaks:

  • A literal-space regex such as / / (\s does match NBSP in JavaScript)
  • str.split(" ")
  • Trim (in older engines)

Other space variants to watch: U+2007 (figure space), U+2008 (punctuation space), U+202F (narrow no-break), U+3000 (ideographic space). Normalize all to regular space:

str.replace(/[\u00A0\u2000-\u200A\u202F\u205F\u3000]/g, " ")
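
As an illustration, here is a name with an NBSP between the two words, the way it might arrive from a word processor. The sample string is made up.

```javascript
const pasted = "John\u00A0Smith"; // NBSP between the names

console.log(pasted.split(" ").length); // 1
console.log(/ /.test(pasted));         // false
console.log(/\s/.test(pasted));        // true: \s matches NBSP in JavaScript

const normalized = pasted.replace(/[\u00A0\u2000-\u200A\u202F\u205F\u3000]/g, " ");
console.log(normalized.split(" ")); // ["John", "Smith"]
```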

Tag characters — the hidden-message vector

The Unicode Tags block (U+E0000–U+E007F) includes U+E0020–U+E007E, which mirror printable ASCII but are default-ignorable. You can encode an entire message in “invisible” tag characters and append it to normal text. It survives most regex, most UI display, and most copy-paste. Used in watermarking and some attack scenarios. Strip unless you have a specific reason to keep them, such as the England, Scotland, and Wales flag emoji, which are built from tag sequences.

str.replace(/[\u{E0020}-\u{E007F}]/gu, "")
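
To see why this matters, the sketch below hides a word in tag characters and then recovers it. The helper names toTags and revealTags are made up for this guide, and toTags assumes its input is printable ASCII.

```javascript
const TAG_OFFSET = 0xe0000; // tag character = ASCII code point + 0xE0000

// Hypothetical helper: hide printable ASCII as tag characters.
function toTags(ascii) {
  return Array.from(ascii, (ch) => String.fromCodePoint(ch.codePointAt(0) + TAG_OFFSET)).join("");
}

// Hypothetical helper: recover any tag-encoded ASCII hidden in a string.
function revealTags(str) {
  const hits = str.match(/[\u{E0020}-\u{E007E}]/gu) || [];
  return hits.map((ch) => String.fromCodePoint(ch.codePointAt(0) - TAG_OFFSET)).join("");
}

const carrier = "Looks normal." + toTags("secret");
console.log(carrier.length);      // 25: 13 visible + 6 tag characters (2 UTF-16 units each)
console.log(revealTags(carrier)); // secret
console.log(carrier.replace(/[\u{E0020}-\u{E007F}]/gu, "") === "Looks normal."); // true
```

Running something like revealTags before stripping is useful for logging what a suspicious paste actually contained.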

Detection UI patterns

When a user complains “the form says my input is invalid but it looks fine,” show a character-by-character diagnostic:

  • Render each character with its code point
  • Highlight any not in a safe allow-list
  • Suggest an auto-cleaned version
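
The three steps above can be sketched as one function that returns the rows for the UI plus a suggested cleaned value. The allow-list here (printable ASCII, tab, newline) is an assumption for illustration and would be far too strict for most real forms; the names SAFE and diagnose are made up. The suggestion simply drops flagged characters rather than converting look-alike spaces.

```javascript
// Assumption for this sketch: only printable ASCII, tab, and newline are "safe".
// A real form would widen this to the scripts it expects.
const SAFE = /^[\x20-\x7E\t\n]$/;

// Hypothetical helper: one row per code point, flagged if outside the allow-list.
function diagnose(input) {
  const rows = Array.from(input, (ch) => ({
    char: ch,
    codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
    suspicious: !SAFE.test(ch),
  }));
  const cleaned = rows.filter((r) => !r.suspicious).map((r) => r.char).join("");
  return { rows, cleaned };
}

const report = diagnose("pass\u200Bword");
console.log(report.rows.filter((r) => r.suspicious)); // one row, code point U+200B
console.log(report.cleaned); // password
```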

Prevention at input boundaries

The fix is at ingest, not at query. On every user-text input:

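function cleanInput(s) {
  return s
    // canonical composition, so equivalent sequences compare equal
    .normalize("NFC")
    // zero-width characters, word joiner and invisible operators, and any BOM
    // (leading or embedded); this also removes ZWJ/ZWNJ, so skip it for fields
    // that must preserve emoji sequences or Persian/Arabic text
    .replace(/[\u200B-\u200D\u2060-\u2064\uFEFF]/g, "")
    // space look-alikes become a plain space
    .replace(/[\u00A0\u2000-\u200A\u202F\u205F\u3000]/g, " ")
    // C0 control characters except tab, LF, CR, plus DEL
    .replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, "")
    .trim();
}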

When invisible characters are wanted

Not always garbage. Legitimate uses include:

  • ZWJ in emoji sequences (family, professions, and a few flags such as the rainbow flag)
  • ZWNJ in Persian and Arabic text for correct letter-joining
  • Soft hyphens for line-break hints in print typography
  • LRM/RLM for fixing bidirectional text display

Context-aware stripping only. Don’t nuke ZWJ from emoji.
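
One way to approximate this for ZWJ is to keep it only when it sits between two emoji. The sketch below assumes an engine with lookbehind and Unicode property escapes (ES2018) and treats \p{Extended_Pictographic} as a stand-in for "emoji"; the names STRAY_ZWJ and stripStrayZwj are made up. It is not suitable for text in scripts where ZWJ shapes letters, since it would remove those too.

```javascript
// Remove ZWJ unless it sits between two emoji. "Emoji" is approximated here by
// \p{Extended_Pictographic}, optionally followed by U+FE0F or a skin-tone modifier.
const STRAY_ZWJ =
  /(?<![\p{Extended_Pictographic}\uFE0F\u{1F3FB}-\u{1F3FF}])\u200D|\u200D(?!\p{Extended_Pictographic})/gu;

function stripStrayZwj(str) {
  return str.replace(STRAY_ZWJ, "");
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // man + ZWJ + woman + ZWJ + girl
console.log(stripStrayZwj(family) === family); // true: emoji sequence untouched
console.log(stripStrayZwj("ab\u200Dcd"));      // abcd
```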

Common mistakes

Writing a “strip whitespace” routine that misses NBSP and ideographic space. Not stripping the BOM from CSV imports and having a broken first column. Assuming trim() in every language handles Unicode whitespace (it doesn’t, in some older runtimes). Treating all zero-width characters as noise when some carry meaning. And not logging code points when a mystery bug lands — debugging invisible characters without a hex dump is agony.
