How to read and write XML
XML syntax, when to use XML vs JSON, namespaces, escaping, schemas (DTD, XSD), safe parsing (XXE, billion laughs), XPath and XSLT basics.
XML is the 30-year-old granddaddy of structured data — verbose, strict, and still everywhere you don’t expect: RSS feeds, SOAP APIs, SVG images, Office file formats (.docx, .xlsx), Android layouts, Maven builds, sitemaps, and enterprise data exchange. JSON won the web, but XML won the enterprise. This guide covers XML syntax (elements, attributes, namespaces), when XML is the right choice versus JSON, how to parse and generate it safely (entity escaping, injection risks), schemas (DTD, XSD, Relax NG), XPath and XSLT, and the gotchas that trip up even experienced developers.
XML basics — the syntax rules
Every XML document needs a single root element. Tags must nest properly. Attributes go in quotes. Case matters.
Minimal well-formed XML:
<?xml version="1.0" encoding="UTF-8"?>
<book id="42">
  <title>Dune</title>
  <author>Frank Herbert</author>
</book>
Elements vs attributes: elements are the boxes (<title>Dune</title>); attributes describe them (id="42"). Use elements for content, attributes for metadata. Content that might contain markup or needs structure goes in elements.
Self-closing: empty elements use <br/> or <br></br>. Both are valid.
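To see these rules in action, here is a minimal sketch that parses the book document with Python's standard-library ElementTree. The variable names are just for illustration, and other parsers behave similarly.

```python
import xml.etree.ElementTree as ET

# The sample document from above, as bytes so the encoding declaration applies.
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<book id="42">
  <title>Dune</title>
  <author>Frank Herbert</author>
</book>"""

root = ET.fromstring(DOC)
print(root.tag)                 # book
print(root.get("id"))           # 42 (attribute values are always strings)
print(root.findtext("title"))   # Dune

# Case matters: </Title> does not close <title>, so the parser rejects it.
try:
    ET.fromstring("<book><title>Dune</Title></book>")
    rejected = False
except ET.ParseError as exc:
    rejected = True
    print("rejected:", exc)
```

Note that the attribute comes back as the string "42", not a number. XML has no native types unless a schema adds them.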
XML vs JSON — when each wins
JSON wins for: web APIs, JavaScript-native workflows, simple data exchange, anything developer-facing today. Lighter, parseable natively, human-readable without effort.
XML wins for: document-centric formats (where mixed content matters — text with inline tags), schemas with strict validation (regulated industries), existing enterprise integrations (SOAP, SAP, IBM, banking), tools that require it (Office Open XML, Android layouts, SVG).
XML has things JSON doesn’t: namespaces (merging data from multiple sources without key collision), comments, mixed content (text + elements inline), strict schemas (XSD), attributes distinct from elements, XSLT transforms, XPath queries.
JSON has things XML doesn’t: array syntax, native number/boolean types, terseness, and built-in or standard-library support in nearly every modern language.
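Mixed content is the difference that is hardest to picture, so here is a small sketch of how it looks to a parser. It assumes Python's standard-library ElementTree, which stores the text before a child in .text and the text after it in .tail. The sample sentence is made up.

```python
import xml.etree.ElementTree as ET

# Text with an inline tag: natural in XML, awkward to model in JSON.
para = ET.fromstring("<p>Read <em>Dune</em> before the sequel.</p>")
em = para.find("em")

print(repr(para.text))  # 'Read '
print(repr(em.text))    # 'Dune'
print(repr(em.tail))    # ' before the sequel.'

# itertext() walks text and tails in document order.
full_text = "".join(para.itertext())
print(full_text)        # Read Dune before the sequel.
```

A JSON equivalent would need an invented convention, such as an array of alternating strings and objects, which is why document formats tend to stay on XML.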
Namespaces — the feature most people hate
Namespaces prevent collision when you mix XML from different sources. A <title> in HTML means something different from <title> in SVG.
Declaration: <root xmlns:svg="http://www.w3.org/2000/svg">
Usage: prefix element names: <svg:circle r="10"/>.
Namespaces are URIs but they’re just identifiers — the URL doesn’t have to resolve. Most developers copy namespaces from docs and never think about them again.
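Here is a short sketch of how a namespaced element looks from code, assuming Python's standard-library ElementTree. The extra unprefixed circle element is added only to show that the two names do not collide.

```python
import xml.etree.ElementTree as ET

DOC = """<root xmlns:svg="http://www.w3.org/2000/svg">
  <svg:circle r="10"/>
  <circle r="99"/>
</root>"""

# The prefix here is local to our queries; only the URI has to match the document.
NS = {"svg": "http://www.w3.org/2000/svg"}

root = ET.fromstring(DOC)
svg_circle = root.find("svg:circle", NS)
plain_circle = root.find("circle")  # the element in no namespace

print(svg_circle.tag)         # {http://www.w3.org/2000/svg}circle
print(svg_circle.get("r"))    # 10
print(plain_circle.get("r"))  # 99
```

The parser expands the prefix into the full URI, so the prefix itself (svg) is cosmetic. Two documents can use different prefixes for the same namespace and still match.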
Escaping — the #1 bug source
Five characters need escaping in XML:
< → &lt;
> → &gt;
& → &amp;
" → &quot; (in attribute values)
' → &apos; (in attribute values)
CDATA sections let you embed raw text without escaping:
<![CDATA[any < or & characters here are literal]]>
Great for embedding code snippets, HTML content, or already-formatted text. CDATA sections cannot be nested, and the content cannot contain the sequence ]]> because that ends the section.
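In practice you rarely escape by hand. The sketch below, using only Python's standard library, shows a serializer doing the escaping for both text and attribute values. The element name, attribute name, and sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# escape() handles &, < and > for text you must splice in manually.
escaped = escape("AT&T says 1 < 2")
print(escaped)  # AT&amp;T says 1 &lt; 2

# Better: build the tree and let the serializer escape everything.
book = ET.Element("book", {"note": 'He said "hi" & left'})
book.text = "AT&T says 1 < 2"
xml_text = ET.tostring(book, encoding="unicode")
print(xml_text)

# Parsing the output gives back the original, unescaped strings.
round_trip = ET.fromstring(xml_text)
```

The round trip is the useful check: whatever goes into .text or an attribute should come back unchanged after serialize and parse.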
XML schema — validating structure
DTD (Document Type Definition): the original schema format. Ancient syntax, limited type system. Still seen in HTML/SGML doctypes. Don’t write new DTDs.
XSD (XML Schema Definition): the modern standard. Supports types (int, date, string patterns), enumerations, cardinality, inheritance. Verbose but powerful. The schema is itself XML.
RELAX NG: cleaner schema syntax, popular in open-source projects (DocBook). Less common in enterprise.
When to validate: on input (reject malformed data) and on output (catch bugs before shipping). Validate selectively if schemas get huge — validating every message on a high-throughput pipe kills performance.
Parsing XML safely
XXE attacks: XML external entity attacks. An attacker sends XML with <!ENTITY xxe SYSTEM "file:///etc/passwd"> and your parser dutifully reads that file. Devastating if your parser loads external entities by default.
Fix: disable external entity resolution in your XML parser. Most ecosystems offer a switch or a hardened wrapper: Python’s defusedxml package, Java’s XMLInputFactory.setProperty(XMLInputFactory.SUPPORT_DTD, false), Node’s libxmljs with noent: false.
Billion laughs attack: nested entities that expand exponentially. Ten levels of 10x expansion turn one short entity into 10 billion copies of its text. Crashes naive parsers. Modern parsers mostly defend against this but check your library.
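One blunt defense against both attacks is to refuse any document that declares a DTD, since entities cannot be defined without one. The sketch below does this with Python's standard-library expat bindings. The names parse_untrusted and DTDForbidden are hypothetical, and for production code a maintained library such as defusedxml is the safer choice.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class DTDForbidden(ValueError):
    """Raised when untrusted XML carries a DOCTYPE declaration."""


def parse_untrusted(data: bytes) -> ET.Element:
    """Parse XML, rejecting any document that declares a DTD (hypothetical helper)."""

    def forbid_doctype(name, system_id, public_id, has_internal_subset):
        # Fires at the start of <!DOCTYPE ...>, before any entity is declared.
        raise DTDForbidden(f"DOCTYPE {name!r} is not allowed")

    checker = expat.ParserCreate()
    checker.StartDoctypeDeclHandler = forbid_doctype
    checker.Parse(data, True)    # first pass: only looks for a DOCTYPE
    return ET.fromstring(data)   # second pass: build the tree


# A tiny entity-expansion payload; a real attack nests many more levels.
BOMB = (b'<?xml version="1.0"?><!DOCTYPE lolz ['
        b'<!ENTITY lol "lol"><!ENTITY lol2 "&lol;&lol;">]>'
        b"<lolz>&lol2;</lolz>")

try:
    parse_untrusted(BOMB)
    blocked = False
except DTDForbidden:
    blocked = True

safe_root = parse_untrusted(b"<a><b>ok</b></a>")
```

Parsing twice is wasteful but keeps the sketch short. The trade-off of this approach is that legitimate documents with a DOCTYPE are rejected too, which is usually acceptable for API input.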
DOM vs SAX vs StAX:
DOM loads the whole document into memory. Easy, slow, bad for large files.
SAX is event-driven — parser calls your callbacks as it streams. Fast, memory-efficient, harder to write.
StAX (pull parser) is the modern middle ground. You pull events when you want them. Good for large files with selective processing.
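As a rough illustration of the streaming style, the sketch below uses iterparse from Python's standard-library ElementTree, which hands you elements as they finish instead of building the whole tree first. The library/book layout and the iter_titles name are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_titles(stream):
    """Yield each book title while reading the stream incrementally."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "book":
            yield elem.findtext("title")
            elem.clear()  # drop the finished book's children and text


# A small in-memory stand-in for a large file.
data = io.BytesIO(
    b"<library>"
    b"<book><title>Dune</title></book>"
    b"<book><title>Emma</title></book>"
    b"</library>"
)
titles = list(iter_titles(data))
print(titles)  # ['Dune', 'Emma']
```

With a DOM-style parse the entire library would sit in memory at once. Here each book is processed and then emptied, so memory use stays roughly flat as the file grows.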
XPath and XSLT
XPath is a query language for XML. Like CSS selectors but more powerful. Examples:
/book/title: the title child of the root book element
//title: any title anywhere in the document
//book[@id='42']/title: filtered by attribute
count(//book): a function call
XPath ships with most languages. Learn the 20% that handles 80% of queries: paths, attribute filters, text() nodes, position() filters.
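Here is a sketch of those query shapes using Python's standard-library ElementTree. Be aware that ElementTree implements only a subset of XPath: paths are relative to the element you call them on (hence the leading dot), and functions such as count() are not available, so the example uses len() instead. The sample library document is made up.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<library>
  <book id="42"><title>Dune</title></book>
  <book id="7"><title>Emma</title></book>
</library>""")

# .//title : any title below the current element
all_titles = [t.text for t in root.findall(".//title")]

# Attribute filter, then a child step
dune = root.findtext(".//book[@id='42']/title")

# Positional filter (XPath positions start at 1)
first = root.findtext("book[1]/title")

# No count() in this subset, so count the matches in Python
book_count = len(root.findall(".//book"))

print(all_titles, dune, first, book_count)
```

For full XPath 1.0, including functions and absolute paths, a third-party library such as lxml is the usual choice in Python.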
XSLT transforms XML into other XML, HTML, or text. Declarative template language. Powerful but esoteric — if you have a choice, use a regular language for transformations instead.
Formatting and indentation
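XML whitespace rules are subtle. A parser passes all whitespace in element content through to the application, so whether indentation between elements matters depends on the schema and on the consuming code. Whitespace inside CDATA sections is kept as-is, while tabs and line breaks inside attribute values are normalized to spaces. Pretty-printers typically use 2- or 4-space indentation.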
When XML is generated programmatically (no pretty-print), it often arrives as one giant line. Formatters help developers read it. Don’t assume the wire format matches what the formatter displays.
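As a small sketch of what a formatter does, the example below re-indents a one-line document with ET.indent from Python's standard library (available from Python 3.9). The wire string is a made-up payload.

```python
import xml.etree.ElementTree as ET

# What typically arrives over the wire: one long line.
wire = '<book id="42"><title>Dune</title><author>Frank Herbert</author></book>'

root = ET.fromstring(wire)
ET.indent(root, space="  ")  # inserts whitespace-only text nodes in place
pretty = ET.tostring(root, encoding="unicode")
print(pretty)

# The data is unchanged, but the bytes are not.
same_data = ET.fromstring(pretty).findtext("title") == "Dune"
same_bytes = pretty == wire
```

The formatter adds whitespace nodes that were never on the wire, so anything that depends on exact bytes (signatures, hashes, diffs) should work from the original payload, not the pretty-printed view.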
Common mistakes
Forgetting to escape. Ampersand in data → parse error. Angle bracket in data → parse error. Use the library’s serializer; don’t concatenate strings.
Wrong encoding declaration. Declaring UTF-8 but writing CP1252 (common on Windows). Characters above ASCII 127 break. Always save as UTF-8 without BOM.
Mixing tabs and spaces. XML itself doesn’t care, but diff tools and humans do. Pick one.
Using XML when JSON would do. If your data is “list of records, each with fields,” use JSON. XML earns its verbosity on document-shaped or schema-heavy problems.
Parsing XML with regex. Doesn’t work. XML is not a regular language. Use a parser.
Leaving XXE vulnerabilities open. Default parser settings in some languages are dangerous. Disable external entities explicitly.
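The encoding mistake is easy to reproduce. This sketch, using Python's standard-library ElementTree, builds the same document twice: once with bytes that match the UTF-8 declaration and once with CP1252 bytes under the same declaration. The name element is just sample data.

```python
import xml.etree.ElementTree as ET

HEADER = b'<?xml version="1.0" encoding="UTF-8"?>'
BODY = "<name>caf\u00e9</name>"  # "café" with a non-ASCII character

good = HEADER + BODY.encode("utf-8")    # bytes match the declaration
bad = HEADER + BODY.encode("cp1252")    # declaration lies about the bytes

good_text = ET.fromstring(good).text

try:
    ET.fromstring(bad)
    mismatch_detected = False
except ET.ParseError as exc:
    mismatch_detected = True
    print("rejected:", exc)
```

The CP1252 byte for é is not a valid UTF-8 sequence, so the parser stops with an error rather than guessing. Writing the file through a serializer with an explicit encoding avoids the mismatch in the first place.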
Run the numbers
Format and validate XML instantly with the XML formatter. Pair with the JSON formatter for comparing payload formats, and the HTML formatter when your XML is really HTML-flavored markup.