How to Strip Special Characters
Defining 'special' characters, ASCII-only cleanup, preserving spaces and punctuation, regex patterns, and URL-safe output.
“Strip special characters” is a deceptively fuzzy request. There’s no universal definition of a “special” character — it depends on what you’re cleaning text for. URL-safe output wants one character set; database primary keys want another; human-readable display wants a third. Running a blanket regex like /[^a-zA-Z0-9]/g will nuke spaces, accents, and punctuation you probably wanted to keep. This guide walks through defining “special” for your use case, the regex patterns for each, how to preserve common punctuation selectively, and how to produce ASCII-only or URL-safe output without destroying the meaning of the text.
Start by defining “special”
Pick the output constraint first, then derive the allow-list:
- Filename-safe — no / \ : * ? " < > |
- URL-safe — alphanumeric, dash, underscore, dot
- ASCII-only — strip or transliterate everything outside U+0000–U+007F
- Alphanumeric-only — letters and digits, nothing else
- Human-readable — keep punctuation, strip control chars
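Each constraint above can be sketched as a one-line filter (the function names here are illustrative, not a standard API):

```javascript
// Illustrative filters derived from each output constraint
const filters = {
  // Deny-list of the nine characters Windows forbids in filenames
  filenameSafe: (s) => s.replace(/[\/\\:*?"<>|]/g, ""),
  // Allow-list: alphanumeric, dash, underscore, dot
  urlSafe: (s) => s.replace(/[^A-Za-z0-9._-]/g, ""),
  // Allow-list: ASCII range only
  asciiOnly: (s) => s.replace(/[^\x00-\x7F]/g, ""),
  // Allow-list: letters and digits, nothing else
  alnumOnly: (s) => s.replace(/[^A-Za-z0-9]/g, ""),
};

filters.urlSafe("my file (1).txt"); // "myfile1.txt"
```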
Allow-list beats deny-list
Deny-listing (“remove these bad characters”) leaves you vulnerable to characters you didn’t think of — especially Unicode confusables, zero-width characters, and invisible tags. Allow-listing (“keep only these characters”) is safer.
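To see why, compare a hand-written deny-list against an allow-list on a string containing a zero-width space (U+200B) — a quick sketch:

```javascript
const dirty = "pass\u200Bword"; // renders as "password" on screen

// Deny-list: removes only the characters you thought of —
// the zero-width space survives
const denied = dirty.replace(/[!@#$%^&*()]/g, "");

// Allow-list: anything not explicitly permitted is gone
const allowed = dirty.replace(/[^\p{L}\p{N} ]/gu, "");

denied.length;  // 9 — invisible character still present
allowed.length; // 8 — "password"
```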
// Allow-list: alphanumeric + space + basic punctuation
str.replace(/[^\p{L}\p{N} .,!?'-]/gu, "")
// \p{L} = any letter (any script)
// \p{N} = any number
// u flag = enable Unicode property escapes
ASCII-only with transliteration
Don’t just strip non-ASCII — transliterate first so “café” becomes “cafe,” not “caf.” The trick: normalize to NFD (decomposed form), then strip combining marks, then strip anything still non-ASCII.
function toAscii(s) {
return s
.normalize("NFD")
.replace(/\p{M}/gu, "") // strip combining marks
.replace(/[^\x00-\x7F]/g, ""); // drop any remaining non-ASCII
}
toAscii("caf\u00e9") // "cafe"
toAscii("na\u00efve") // "naive"
toAscii("r\u00e9sum\u00e9") // "resume"
This handles accented Latin beautifully. It can’t transliterate non-Latin scripts — for Cyrillic, Greek, or CJK you need a dedicated library.
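One caveat worth knowing: a few Latin letters, such as Ł (U+0141), have no canonical decomposition, so NFD can’t split off a mark and the letter is simply dropped rather than transliterated:

```javascript
// Same toAscii approach as above
const toAscii = (s) =>
  s
    .normalize("NFD")
    .replace(/\p{M}/gu, "")       // strip combining marks
    .replace(/[^\x00-\x7F]/g, ""); // drop any remaining non-ASCII

toAscii("\u0141\u00f3d\u017a"); // "odz" — the Ł is lost, not turned into "L"
```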
URL-safe output
URLs allow a narrow character set. The standard pattern:
function toSlug(s) {
return s
.normalize("NFD")
.replace(/\p{M}/gu, "")
.toLowerCase()
.replace(/[^a-z0-9]+/g, "-")
.replace(/^-+|-+$/g, "");
}
toSlug("Hello, World!") // "hello-world"
toSlug("caf\u00e9 & bar") // "cafe-bar"
Preserve spaces but strip punctuation
Common for prepping text for tokenization or search indexing:
str.replace(/[^\p{L}\p{N}\s]/gu, "")
.replace(/\s+/g, " ")
.trim()
Removes punctuation but keeps word boundaries intact.
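Wrapped as a helper (the name is illustrative), the pipeline above behaves like this — note that apostrophes count as punctuation here:

```javascript
// Keep letters, digits, and whitespace; collapse runs of whitespace
const stripPunct = (s) =>
  s.replace(/[^\p{L}\p{N}\s]/gu, "").replace(/\s+/g, " ").trim();

stripPunct("  Hello, world!!  How's it going?  ");
// "Hello world Hows it going" — the apostrophe goes too
```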
Filename sanitization
Windows is the strictest. Safe filename regex:
function sanitizeFilename(name) {
return name
.replace(/[\/\\:*?"<>|]/g, "_")
.replace(/\s+/g, " ")
.trim()
.slice(0, 200); // reserve room for extension
}
Also check for reserved names on Windows: CON, PRN, AUX, NUL, COM1–COM9, LPT1–LPT9. These are forbidden as filenames both with and without an extension.
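A reserved-name check might look like this (a sketch; the regex covers the names listed above, matched with or without an extension):

```javascript
// Windows reserved device names, bare or with any extension
const RESERVED = /^(CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9])(\..*)?$/i;

function isReservedName(name) {
  return RESERVED.test(name);
}

isReservedName("con");        // true
isReservedName("NUL.txt");    // true
isReservedName("console.js"); // false — "console" only starts with CON
```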
Control character stripping
Control characters (U+0000–U+001F and U+007F) cause chaos in display, logs, and databases. Strip them universally unless you specifically need \t, \n, \r:
str.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, "")
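The ranges above deliberately skip \t (U+0009), \n (U+000A), and \r (U+000D), so line structure survives — a quick check:

```javascript
const stripControls = (s) =>
  s.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, "");

stripControls("line1\nline2\x00\x07");
// "line1\nline2" — newline kept, NUL and BEL dropped
```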
Preserving quotes and apostrophes
“Smart” quotes (U+2018, U+2019, U+201C, U+201D) vs straight (U+0027, U+0022) is a frequent headache. Pick one and normalize:
str
  .replace(/[\u2018\u2019\u201A\u201B]/g, "'")
  .replace(/[\u201C\u201D\u201E\u201F]/g, "\"");
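As a function (name illustrative):

```javascript
// Normalize curly quotes and apostrophes to their straight ASCII forms
const straightenQuotes = (s) =>
  s
    .replace(/[\u2018\u2019\u201A\u201B]/g, "'")
    .replace(/[\u201C\u201D\u201E\u201F]/g, '"');

straightenQuotes("\u201CIt\u2019s fine,\u201D she said.");
// "\"It's fine,\" she said."
```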
Category-based stripping with Unicode
Regex Unicode categories let you strip by meaning, not by codepoint:
- \p{L} — letters
- \p{N} — numbers
- \p{P} — punctuation
- \p{S} — symbols (math, currency, etc.)
- \p{M} — marks (combining diacritics)
- \p{C} — control and format characters
- \p{Z} — separators (spaces)
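\p{C} is the one that catches the invisible troublemakers a byte-range deny-list tends to miss — for example:

```javascript
// \p{C} matches control (Cc) and format (Cf) characters,
// including NUL and the zero-width space (U+200B)
"a\u200Bb\u0000c".replace(/\p{C}/gu, ""); // "abc"
```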
// Strip punctuation and symbols only
str.replace(/[\p{P}\p{S}]/gu, "")
Testing your filter
Always run it on a torture-test string:
const test = "Caf\u00e9 \u2014 'Na\u00efve' <3 😀 \u0000\u200B";
console.log(filter(test));
Check the output for smart quotes, combining marks, emoji, zero-width space, and control characters. If any slipped through, tighten your allow-list.
Common mistakes
- Using [^a-zA-Z0-9] without the Unicode flag and destroying all non-ASCII letters.
- Stripping combining marks without first normalizing to NFD, so “café” (one code point) stays intact but “cafe” + combining acute becomes “cafe” — inconsistent results across inputs.
- Forgetting zero-width characters exist.
- Writing a deny-list that misses a character class somebody pastes next week.
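The NFD pitfall in particular is easy to reproduce:

```javascript
const stripMarks = (s) => s.replace(/\p{M}/gu, "");

stripMarks("caf\u00e9");  // "café" — precomposed é (one code point) is untouched
stripMarks("cafe\u0301"); // "cafe" — decomposed e + combining acute loses its accent

// Normalizing first makes both inputs behave identically
stripMarks("caf\u00e9".normalize("NFD")); // "cafe"
```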