How-To & Life · Guide · Developer Utilities
How to parse URLs
URL components, WHATWG URL spec vs legacy parsing, handling IDN, ports, userinfo, and the gotchas between browsers and servers.
URL parsing is one of those foundations that looks like a regex problem and is actually a specification problem. RFC 3986 defines the generic URI grammar, the WHATWG URL Living Standard defines what browsers do (which is not quite 3986), and the actual URL strings you meet in the wild contain backslashes, Unicode, and the occasional missing protocol. Writing your own parser is almost always a mistake. This guide covers the URL API built into every modern runtime, the eight components of a URL and how they map to the API’s properties, IDN domains and Punycode, file URLs and relative URL resolution against a base, the difference between URL and URLSearchParams, and the common bugs that live in the seams between path, query, and fragment.
Advertisement
The URL API
Every evergreen browser, Node 10+, Deno, and Bun ship the WHATWG URL constructor. It parses a string into a structured object with live getters and setters.
const u = new URL('https://user:pass@api.example.com:8080/v1/items?q=k#top');
u.protocol; // 'https:'
u.username; // 'user'
u.password; // 'pass'
u.host; // 'api.example.com:8080'
u.hostname; // 'api.example.com'
u.port; // '8080'
u.pathname; // '/v1/items'
u.search; // '?q=k'
u.hash; // '#top'
u.origin; // 'https://api.example.com:8080'If the input cannot be parsed — missing scheme, bad characters, invalid Unicode — the constructor throws a TypeError. Wrap in try/catch or use URL.canParse() (Node 19.9+, Chrome 120+) to check without throwing.
The eight components
Every URL is built from up to eight parts, separated by fixed delimiters. In the form scheme://userinfo@host:port/path?query#fragment:
protocol / scheme — ends with a colon. The only part before //. Case-insensitive; lower-cased by the parser.
userinfo — user:pass before @. Still valid syntactically but browsers have shown a security warning on navigation since 2017 and Chrome blocks credentials in navigations entirely.
host — domain name or IP. Case-insensitive; lower-cased.
port — 1-65535. The port property is "" when the URL uses the default port for its scheme (443 for https, 80 for http).
path — everything from the first / after the host up to ? or #. The root is "/", never the empty string.
query — starts with ?. Included in search; decomposed into searchParams.
fragment — starts with #. Client-side only; never sent to the server.
IDN domains and Punycode
Internationalized domain names (IDNs) like münchen.de or example.日本 are stored in DNS as ASCII using Punycode: xn--mnchen-3ya.de, example.xn--wgv71a. The URL constructor converts for you.
const u = new URL('https://münchen.de/info');
u.hostname; // 'xn--mnchen-3ya.de'
u.href; // 'https://xn--mnchen-3ya.de/info'When serving user-visible content, convert back using a Punycode library or new Intl.DisplayNames. For network calls, use the ASCII form — every DNS resolver speaks it natively.
Mixed-script IDN domains are the vector for homograph attacks: xn--pple-43d.com renders as аpple.com (Cyrillic a). Modern browsers show the Punycode form when scripts mix.
File URLs
file: URLs reference local filesystem paths. They look simple but pack surprises.
new URL('file:///Users/jay/notes.txt').pathname
// '/Users/jay/notes.txt'
new URL('file:///C:/Users/jay/notes.txt').pathname
// '/C:/Users/jay/notes.txt' -- note the leading slash
new URL('file://server/share/file.txt').host
// 'server'Three slashes (file:///) means localhost; two (file://) means the next segment is a host (UNC share on Windows). Convert pathname back to an OS path with url.fileURLToPath(u) in Node to handle Windows drive letters correctly.
Relative URLs and base resolution
The two-argument form resolves a relative URL against a base.
new URL('/help', 'https://example.com/blog/').href;
// 'https://example.com/help'
new URL('help', 'https://example.com/blog/').href;
// 'https://example.com/blog/help'
new URL('../site/', 'https://example.com/a/b/c').href;
// 'https://example.com/a/site/'
new URL('//cdn.example.com/js', 'https://example.com/').href;
// 'https://cdn.example.com/js'The relative-resolution rules come from RFC 3986 section 5. Leading / resets path to root; no slash appends to the current directory; .. walks up; // changes the host but keeps the scheme.
URL vs URLSearchParams
URL parses the whole thing. URLSearchParams parses just the query string. On a URL instance, url.searchParams gives you a live URLSearchParams — changes to it update url.search automatically.
const u = new URL('https://example.com/s?q=old');
u.searchParams.set('q', 'new');
u.searchParams.append('page', '2');
u.href; // 'https://example.com/s?q=new&page=2'Building URLs this way avoids every encoding mistake in the book. Manual string concatenation is where bugs live.
Normalization
The URL parser normalizes while it parses:
Lowercases scheme and host. Resolves .. and . segments in paths. Strips default ports. Decodes unreserved percent-encoded characters (%41 → A). Converts backslashes to forward slashes in the path (special WHATWG behavior for HTTP URLs).
Two input strings that normalize to the same href are semantically equivalent, which is what you compare on for canonicalization.
Data URLs and other schemes
Beyond http:, https:, and file:, the parser handles data:, blob:, javascript:, mailto:, and any scheme you invent. Non-special schemes have subtly different parsing — the host component is absent, and pathname gets the entire post-scheme content.
const d = new URL('data:text/plain;base64,SGVsbG8=');
d.pathname; // 'text/plain;base64,SGVsbG8='
d.host; // ''Common mistakes
Parsing with regex. “Just grab everything after ://” works until you meet a URL with userinfo, an IPv6 host in brackets, or an IDN. Use the URL constructor.
Assuming host equals hostname. host includes the port when present. hostname never does. The bug where a URL with explicit port :443 does not match an allow-list usually traces back to this.
Treating search and searchParams independently. Setting u.search = "?q=new"rebuilds the params. Mutating u.searchParams rewrites the search. Do not touch both in the same operation.
Forgetting the trailing slash on base URLs. new URL('help', 'https://x.com/blog')resolves to https://x.com/help, not /blog/help, because /blog has no trailing slash — it is a file, not a directory.
Not handling the constructor throwing. User input into URL parsers is a constant source of exceptions. Use URL.canParse() or wrap in try/catch.
Trusting URLs from untrusted input. A valid URL object does not mean a safe destination. Check protocol (no javascript:!) and origin against your allow-list before using the URL in window.location or an <a href>.
Run the numbers
Break any URL into its components with the URL parser. Pair with the query string parser when the query is the complicated part, and the URL encoder/decoder to check how a specific string encodes before embedding it in a URL component.
Advertisement