Free Tool Arena


How to Strip Special Characters

Defining 'special' characters, ASCII-only cleanup, preserving spaces and punctuation, regex patterns, and URL-safe output.

Updated April 2026 · 6 min read

“Strip special characters” is a deceptively fuzzy request. There’s no universal definition of a “special” character — it depends on what you’re cleaning text for. URL-safe output wants one character set; database primary keys want another; human-readable display wants a third. Running a blanket regex like /[^a-zA-Z0-9]/g will nuke spaces, accents, and punctuation you probably wanted to keep. This guide walks through defining “special” for your use case, the regex patterns for each, how to preserve common punctuation selectively, and how to produce ASCII-only or URL-safe output without destroying the meaning of the text.


Start by defining “special”

Pick the output constraint first, then derive the allow-list:

  • Filename-safe — no / \ : * ? " < > |
  • URL-safe — alphanumeric, dash, underscore, dot
  • ASCII-only — strip or transliterate everything outside U+0000–U+007F
  • Alphanumeric-only — letters and digits, nothing else
  • Human-readable — keep punctuation, strip control chars
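As a rough sketch, each constraint above maps to its own allow-list regex. These are illustrative starting points, not exhaustive filters; tighten them for your own data:

```javascript
// One allow-list per output constraint (illustrative, not exhaustive)
const filters = {
  filenameSafe: s => s.replace(/[\/\\:*?"<>|]/g, ""),   // drop forbidden filename chars
  urlSafe:      s => s.replace(/[^A-Za-z0-9._-]/g, ""), // keep alnum, dash, underscore, dot
  asciiOnly:    s => s.replace(/[^\x00-\x7F]/g, ""),    // keep U+0000..U+007F only
  alnumOnly:    s => s.replace(/[^A-Za-z0-9]/g, ""),    // letters and digits only
  readable:     s => s.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, ""), // strip control chars
};
```

Note that asciiOnly here strips rather than transliterates; the transliteration version below is usually what you want.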

Allow-list beats deny-list

Deny-listing (“remove these bad characters”) leaves you vulnerable to characters you didn’t think of — especially Unicode confusables, zero-width characters, and invisible tags. Allow-listing (“keep only these characters”) is safer.

// Allow-list: alphanumeric + space + basic punctuation
str.replace(/[^\p{L}\p{N} .,!?'-]/gu, "")

// \p{L} = any letter (any script)
// \p{N} = any number
// u flag = Unicode

ASCII-only with transliteration

Don’t just strip non-ASCII — transliterate first so “café” becomes “cafe,” not “caf.” The trick: normalize to NFD (decomposed form), then strip combining marks, then strip anything still non-ASCII.

function toAscii(s) {
  return s
    .normalize("NFD")
    .replace(/\p{M}/gu, "")       // strip combining marks
    .replace(/[^\x00-\x7F]/g, ""); // drop any remaining non-ASCII
}

toAscii("caf\u00e9")        // "cafe"
toAscii("na\u00efve")       // "naive"
toAscii("r\u00e9sum\u00e9") // "resume"

This handles accented Latin beautifully. It can’t transliterate non-Latin scripts — for Cyrillic, Greek, or CJK you need a dedicated library.

URL-safe output

URLs allow a narrow character set. The standard pattern:

function toSlug(s) {
  return s
    .normalize("NFD")
    .replace(/\p{M}/gu, "")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

toSlug("Hello, World!")       // "hello-world"
toSlug("caf\u00e9 & bar")     // "cafe-bar"

Preserve spaces but strip punctuation

Common for prepping text for tokenization or search indexing:

str.replace(/[^\p{L}\p{N}\s]/gu, "")
   .replace(/\s+/g, " ")
   .trim()

Removes punctuation but keeps word boundaries intact.

Filename sanitization

Windows is the strictest. Safe filename regex:

function sanitizeFilename(name) {
  return name
    .replace(/[\/\\:*?"<>|]/g, "_")
    .replace(/\s+/g, " ")
    .trim()
    .slice(0, 200);              // reserve room for extension
}

Also check for reserved device names on Windows: CON, PRN, AUX, NUL, COM1 through COM9, and LPT1 through LPT9. These are forbidden as base names, with or without an extension — both "nul" and "nul.txt" are invalid filenames.

Control character stripping

Control characters (U+0000–U+001F and U+007F) cause chaos in display, logs, and databases. Strip them universally unless you specifically need \t, \n, \r:

str.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, "")

Preserving quotes and apostrophes

“Smart” quotes (U+2018, U+2019, U+201C, U+201D) vs straight (U+0027, U+0022) is a frequent headache. Pick one and normalize:

str
  .replace(/[\u2018\u2019\u201A\u201B]/g, "'")
  .replace(/[\u201C\u201D\u201E\u201F]/g, "\"");

Category-based stripping with Unicode

Regex Unicode categories let you strip by meaning, not by codepoint:

  • \p{L} — letters
  • \p{N} — numbers
  • \p{P} — punctuation
  • \p{S} — symbols (math, currency, etc.)
  • \p{M} — marks (combining diacritics)
  • \p{C} — control and format characters
  • \p{Z} — separators (spaces)

// Strip punctuation and symbols only
str.replace(/[\p{P}\p{S}]/gu, "")

Testing your filter

Always run it on a torture-test string:

const test = "Caf\u00e9 \u2014 'Na\u00efve' <3 \u{1F600} \u0000\u200B";
console.log(filter(test));

Check the output for smart quotes, combining marks, emoji, zero-width space, and control characters. If any slipped through, tighten your allow-list.

Common mistakes

Using [^a-zA-Z0-9] without the Unicode flag and destroying all non-ASCII letters. Stripping combining marks without first normalizing to NFD, so a precomposed "café" (é as one code point) keeps its accent while a decomposed "café" (e plus combining acute) loses it — inconsistent results across inputs. Forgetting that zero-width characters exist. Writing a deny-list that misses a character class somebody pastes next week.

