
How to Detect Invisible Characters

Find the zero-width joiners, non-breaking spaces, and byte-order marks that break your regex after a paste. Use this free instant guide for clean text processing online.

By FreeToolArena Staff · Updated June 2026 · 6 min read

Some of the most frustrating bugs in text processing are caused by characters you literally cannot see. Zero-width spaces, non-breaking spaces, byte-order marks, zero-width joiners, and the exotic tag characters used in Unicode hide in pasted text, survive regex cleanup, and silently break string matching, search indexes, and CSV parsing. A password field rejects your input; a regex doesn’t match what you know is there; a file has a mysterious first character. This guide covers the most common invisible characters, how they sneak into your workflow, and the detection and stripping patterns that actually work.

The usual suspects

NBSP (U+00A0) — non-breaking space, looks like space
ZWSP (U+200B) — zero-width space, takes no width
ZWNJ (U+200C) — zero-width non-joiner
ZWJ (U+200D) — zero-width joiner (used in emoji)
BOM (U+FEFF) — byte-order mark, often at file start
Soft hyphen (U+00AD) — only visible at line-break points
LRM/RLM (U+200E/U+200F) — LTR/RTL marks
Tag characters (U+E0000 block) — invisible hidden-message vector
Variation selectors (U+FE00–U+FE0F)
Ideographic space (U+3000) — full-width space

How they get in

Paste workflows are the main source:

Copy from Microsoft Word → NBSP and smart quotes
Copy from web pages → ZWSP for word-break hints
Save from Excel CSV → BOM at file start
Copy from terminal → ANSI escape sequences
Paste from messaging apps → ZWJ in emoji sequences
Malicious paste → intentionally hidden payloads

Detecting with a hex dump

The most reliable inspection is to look at the hex. In text you expect to be plain ASCII, any byte above 0x7F is suspect.

// JS: dump each code point
[...str].forEach((c, i) => {
  const cp = c.codePointAt(0).toString(16).padStart(4, "0");
  console.log(i, c, "U+" + cp.toUpperCase());
});

On the command line: xxd, od -c, or hexyl.
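The command-line tools show raw bytes, while the snippet above shows code points. A minimal sketch that shows both side by side, assuming a runtime with the standard TextEncoder (browsers and Node.js); `dumpChars` is a hypothetical helper name, not part of any library:

```javascript
// Hypothetical helper: list each code point together with its UTF-8 bytes in hex.
function dumpChars(str) {
  const encoder = new TextEncoder(); // TextEncoder always produces UTF-8
  return [...str].map((ch) => {
    const cp = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
    const bytes = [...encoder.encode(ch)]
      .map((b) => b.toString(16).padStart(2, "0"))
      .join(" ");
    return { char: ch, codePoint: "U+" + cp, utf8: bytes };
  });
}

// "a", a zero-width space, "b", a non-breaking space, "c"
console.table(dumpChars("a\u200Bb\u00A0c"));
```

The zero-width space shows up as U+200B with bytes e2 80 8b, which is the same sequence you would spot in xxd output.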

Regex detection

Match anything that renders with zero or ambiguous width:

const INVISIBLES = /[\u00A0\u00AD\u034F\u061C\u115F\u1160\u17B4\u17B5\u180E\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\u2060-\u2064\uFEFF\uFFF9-\uFFFB]/gu;

str.match(INVISIBLES);

This covers many of the code points Unicode lists as “default-ignorable,” plus the known invisible-space characters. It does not include variation selectors or the tag block; extend it with U+E0000–U+E007F if you’re worried about hidden-message attacks.
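A bare match() tells you which characters were found but not where. A sketch of a position-reporting variant, using the same character class with the tag block added; `INVISIBLES_EXT` and `findInvisibles` are hypothetical names introduced for illustration:

```javascript
// Same class as INVISIBLES, plus the tag block U+E0000-U+E007F.
const INVISIBLES_EXT = /[\u00A0\u00AD\u034F\u061C\u115F\u1160\u17B4\u17B5\u180E\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\u2060-\u2064\uFEFF\uFFF9-\uFFFB\u{E0000}-\u{E007F}]/gu;

// Hypothetical helper: report each hit with its UTF-16 index and code point.
function findInvisibles(str) {
  return [...str.matchAll(INVISIBLES_EXT)].map((m) => ({
    index: m.index,
    codePoint:
      "U+" + m[0].codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
  }));
}

// NBSP after the colon, ZWSP at the end
console.log(findInvisibles("price:\u00A0100\u200B"));
```

The indexes are UTF-16 offsets, so they line up with str.slice() and str.charAt() rather than with visible character positions.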

Zero-width characters

U+200B–U+200D and U+FEFF take zero rendering width. They’re functionally invisible but affect:

String length ("a\u200Bb".length === 3)
Regex matching (/ab/ won’t match)
Search indexes
URL parsing and domain validation
Password comparison

Strip aggressively for input normalization:

str.replace(/[\u200B-\u200D\uFEFF]/g, "")
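To see the effect concretely, here is a small before-and-after sketch. The variable names are illustrative only:

```javascript
const ZERO_WIDTH = /[\u200B-\u200D\uFEFF]/g;

const pasted = "a\u200Bb";        // renders as "ab"
console.log(pasted.length);       // 3
console.log(/ab/.test(pasted));   // false
console.log(pasted === "ab");     // false

const cleaned = pasted.replace(ZERO_WIDTH, "");
console.log(cleaned.length);      // 2
console.log(/ab/.test(cleaned));  // true
```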

The BOM problem

U+FEFF at the start of a file is a byte-order mark, used by some tools to signal UTF encoding. It causes:

CSV parsers reading “\uFEFFName” instead of “Name” as the first column header
JSON parsers failing with “unexpected character”
Shell scripts failing because the shebang line is invalidated
Diff tools showing a byte at position 0 that “isn’t there”

// strip BOM at start of file only
str.replace(/^\uFEFF/, "")
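The JSON failure is easy to reproduce, because U+FEFF is not one of the whitespace characters the JSON grammar allows. A minimal sketch; `parseJsonAllowingBom` is a hypothetical wrapper name:

```javascript
// Hypothetical helper: parse JSON text that may start with a BOM.
function parseJsonAllowingBom(text) {
  return JSON.parse(text.replace(/^\uFEFF/, ""));
}

const raw = "\uFEFF" + '{"name":"Ada"}';

let failed = false;
try {
  JSON.parse(raw); // throws: the BOM is not valid JSON whitespace
} catch (err) {
  failed = err instanceof SyntaxError;
}

console.log(failed);                          // true
console.log(parseJsonAllowingBom(raw).name);  // "Ada"
```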

Non-breaking space variants

NBSP (U+00A0) is the most common impostor. Looks identical to space. Breaks:

/ / regex
str.split(" ")
trim() (in older engines)

Other space variants to watch: U+2007 (figure space), U+2008 (punctuation space), U+202F (narrow no-break), U+3000 (ideographic space). Normalize all to regular space:

str.replace(/[\u00A0\u2000-\u200A\u202F\u205F\u3000]/g, " ")
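A short sketch of the split failure and the fix, using made-up sample text:

```javascript
const SPACE_VARIANTS = /[\u00A0\u2000-\u200A\u202F\u205F\u3000]/g;

// "first", NBSP, "second", ideographic space, "third"
const line = "first\u00A0second\u3000third";

console.log(line.split(" ").length); // 1: no regular space to split on

const normalized = line.replace(SPACE_VARIANTS, " ");
console.log(normalized.split(" "));  // ["first", "second", "third"]
```

In JavaScript the `\s` class already matches these space variants, so splitting on `/\s+/` is another option when you don't need to store the normalized string. A literal space in a regex or in split() does not.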

Tag characters — the hidden-message vector

The range U+E0020–U+E007F in Unicode’s Tags block mirrors ASCII but is default-ignorable. You can encode an entire message in “invisible” tag characters and append it to normal text. It survives most regex cleanup, most UI display, and most copy-paste. Used in watermarking and some attack scenarios. Strip unless you have a specific reason to keep them; one such reason is subdivision flag emoji (England, Scotland, Wales), which are built from tag characters.

str.replace(/[\u{E0020}-\u{E007F}]/gu, "")
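Before stripping, it can be useful to see what a payload actually says. A sketch that decodes tag characters back to the ASCII they mirror by subtracting 0xE0000; `revealTagPayload` is a hypothetical helper, and the sample string is constructed here purely for demonstration:

```javascript
const TAGS = /[\u{E0020}-\u{E007F}]/gu;

// Hypothetical helper: map each tag character to the ASCII character it mirrors.
function revealTagPayload(str) {
  return (str.match(TAGS) || [])
    .map((ch) => String.fromCodePoint(ch.codePointAt(0) - 0xe0000))
    .join("");
}

// Build a sample: visible text plus "hi" encoded as tag characters.
const hidden = [..."hi"]
  .map((ch) => String.fromCodePoint(ch.codePointAt(0) + 0xe0000))
  .join("");
const sample = "Hello" + hidden;

console.log(sample.length);             // 9: each tag character is two UTF-16 units
console.log(revealTagPayload(sample));  // "hi"
console.log(sample.replace(TAGS, ""));  // "Hello"
```

Logging the decoded payload alongside the cleaned text gives you an audit trail when the hidden content turns out to matter.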

Detection UI patterns

When a user complains “the form says my input is invalid but it looks fine,” show a character-by-character diagnostic:

Render each character with its code point
Highlight any not in a safe allow-list
Suggest an auto-cleaned version
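The three steps above can be sketched as a single function that a form could call when validation fails. The allow-list here (printable ASCII plus tab and newline) is an assumption chosen to keep the example small; a real form would likely allow far more. `SAFE` and `diagnose` are hypothetical names:

```javascript
// Hypothetical allow-list: printable ASCII plus tab and newline.
const SAFE = /^[\x20-\x7E\t\n]$/;

function diagnose(input) {
  // Step 1: one entry per code point, with its code point label.
  // Step 2: flag anything outside the allow-list.
  const chars = [...input].map((ch) => ({
    char: ch,
    codePoint:
      "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
    suspicious: !SAFE.test(ch),
  }));
  // Step 3: suggest a cleaned version by dropping flagged characters.
  const cleaned = chars
    .filter((c) => !c.suspicious)
    .map((c) => c.char)
    .join("");
  return { chars, cleaned, hasIssues: chars.some((c) => c.suspicious) };
}

const report = diagnose("pass\u200Bword");
console.log(report.hasIssues); // true
console.log(report.cleaned);   // "password"
```

This sketch drops flagged characters outright. For space look-alikes such as NBSP you may prefer to substitute a regular space instead, so that words don't run together.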

Prevention at input boundaries

The fix is at ingest, not at query. On every user-text input:

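function cleanInput(s) {
  return s
    // compose characters into canonical (NFC) form
    .normalize("NFC")
    // drop a leading BOM
    .replace(/^\uFEFF/, "")
    // drop zero-width and invisible formatting characters; this includes
    // ZWJ and ZWNJ, so narrow the class if emoji or Persian text must survive
    .replace(/[\u200B-\u200D\u2060-\u2064\uFEFF]/g, "")
    // map space look-alikes to a regular space
    .replace(/[\u00A0\u2000-\u200A\u202F\u205F\u3000]/g, " ")
    // drop control characters except tab, LF, and CR
    .replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, "")
    .trim();
}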

When invisible characters are wanted

Not always garbage. Legitimate uses include:

ZWJ in emoji sequences (family, professions, flags)
ZWNJ in Persian and Arabic text for correct letter-joining
Soft hyphens for line-break hints in print typography
LRM/RLM for fixing bidirectional text display

Context-aware stripping only. Don’t nuke ZWJ from emoji.
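One way to approximate context-aware stripping for ZWJ is to keep it only when it sits between two pictographs. This is a rough sketch, not a full emoji parser: it assumes a runtime with regex lookbehind and Unicode property escapes (current browsers and Node.js), and it assumes emoji are the only ZWJ use you need to preserve. Text in Indic scripts also uses ZWJ legitimately and would need its own rule. `STRAY_ZWJ` and `stripStrayZwj` are hypothetical names:

```javascript
// Match U+200D unless it is preceded by a pictograph (optionally followed by a
// skin-tone modifier or U+FE0F) AND followed by another pictograph.
const STRAY_ZWJ =
  /(?<!\p{Extended_Pictographic}[\p{Emoji_Modifier}\uFE0F]?)\u200D|\u200D(?!\p{Extended_Pictographic})/gu;

function stripStrayZwj(str) {
  return str.replace(STRAY_ZWJ, "");
}

// Woman + ZWJ + laptop: the ZWJ is part of an emoji sequence and is kept.
const technologist = "\u{1F469}\u200D\u{1F4BB}";
console.log(stripStrayZwj(technologist) === technologist); // true

// ZWJ between ordinary letters is treated as noise and removed.
console.log(stripStrayZwj("ab\u200Dcd")); // "abcd"
```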

Common mistakes

Writing a “strip whitespace” routine that misses NBSP and ideographic space. Not stripping the BOM from CSV imports and having a broken first column. Assuming trim() in every language handles Unicode whitespace (it doesn’t, in some older runtimes). Treating all zero-width characters as noise when some carry meaning. And not logging code points when a mystery bug lands — debugging invisible characters without a hex dump is agony.

