How to format XML
Well-formed vs valid XML, attribute order, CDATA sections, namespaces, pretty-printing vs minification, and catching parse errors early.
XML looks like HTML’s stricter cousin, but the stricter part matters. Unlike HTML, where browsers forgive mismatched tags and missing quotes, XML parsers reject anything malformed—a single unescaped ampersand halts processing. That strictness is the point: XML is used where data integrity across systems is more important than author convenience, which is why it still powers SOAP, SAML, RSS, SVG, OOXML, and a long tail of industry formats. Formatting XML for humans means balancing readability against parser-significant whitespace, attribute order stability for diffs, CDATA preservation, and namespace clarity. This guide covers the XML declaration, DOCTYPE, namespaces, attribute ordering, whitespace handling inside and outside content, CDATA blocks, and the difference between pretty-print and canonical XML.
The XML declaration
The declaration <?xml version="1.0" encoding="UTF-8"?> must be the very first thing in the document if present: no whitespace, comments, or other content may come before it, with the sole exception of a byte-order mark. It tells the parser which XML version applies (1.0 is universal; 1.1 adds support for more Unicode characters in names but is rarely used) and which character encoding the file uses. The declaration is optional for UTF-8 and UTF-16 content, but including it explicitly avoids ambiguity when files are transferred across systems that might re-encode silently.
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <child>content</child>
</root>
DOCTYPE and schema references
DOCTYPE declarations are used in XML to reference a DTD (Document Type Definition). They appear after the XML declaration and before the root element. Modern XML usage mostly replaces DTDs with XSD schemas, referenced via xsi:schemaLocation or xsi:noNamespaceSchemaLocation attributes on the root element. Formatters should leave DOCTYPE and schema references alone, because moving them or altering whitespace inside them can invalidate the document.
Namespaces
XML namespaces prevent element-name collisions when documents combine vocabularies from multiple sources. A namespace is declared with xmlns="URI" for the default namespace or xmlns:prefix="URI" for a prefixed one. The URI is an identifier, not a URL the parser fetches. Consistent prefix choices make documents easier to read: xs or xsd for XML Schema, xsi for XML Schema Instance, soap for SOAP Envelope, atom for Atom feeds. Formatters preserve the exact prefix and URI because changing them changes the semantics of every child element.
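To illustrate why the URI carries the meaning and the prefix is only shorthand, here is a minimal sketch using Python's standard xml.etree.ElementTree. The two sample documents and the variable names are made up for illustration; the Atom namespace URI is the real one.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# Two illustrative documents that bind the same namespace URI differently:
# one with a prefix, one as the default namespace.
doc_prefixed = f'<atom:feed xmlns:atom="{ATOM_NS}"><atom:title>News</atom:title></atom:feed>'
doc_default = f'<feed xmlns="{ATOM_NS}"><title>News</title></feed>'

root_a = ET.fromstring(doc_prefixed)
root_b = ET.fromstring(doc_default)

# ElementTree expands every name to "{namespace-uri}local-name",
# so the prefix chosen in the source text does not affect identity.
same_names = root_a.tag == root_b.tag

# Lookups take their own prefix map; "a" here is local to this code
# and unrelated to the prefix used in the document.
title_text = root_a.find("a:title", {"a": ATOM_NS}).text
```

Both roots end up with the same expanded tag, which is why a formatter may not rewrite a URI: doing so would rename every element in scope.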
Attribute order and equality
XML has no required attribute order: two elements with the same name and the same attributes in different orders are semantically equal. However, textual diffs care about order, so consistent attribute ordering reduces noise in version control. Alphabetical order is the simplest rule, and it is close to what XML Canonicalization (C14N) does: namespace declarations come first, then attributes sorted by namespace URI and local name. A more readable convention places xmlns declarations first, followed by id and name, then domain-specific attributes. Pick one and let the formatter enforce it.
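As a rough sketch of the alphabetical convention, the following Python uses xml.etree.ElementTree to show that attribute order does not affect equality, and then normalizes the order for stable diffs. The helper name sort_attributes and the sample elements are hypothetical. Note that ElementTree does not expose xmlns declarations as attributes, so this sketch only reorders ordinary attributes.

```python
import xml.etree.ElementTree as ET

def sort_attributes(elem):
    """Reorder attributes alphabetically on elem and all descendants, in place.

    Hypothetical helper: namespaced attributes sort by their expanded
    "{uri}local" key, which may differ from sorting by prefix.
    """
    for node in elem.iter():
        items = sorted(node.attrib.items())
        node.attrib.clear()
        node.attrib.update(items)

a = ET.fromstring('<user name="ada" id="7" role="admin"/>')
b = ET.fromstring('<user role="admin" id="7" name="ada"/>')

# Attribute mappings compare equal regardless of textual order.
same_attrs = a.attrib == b.attrib

# After sorting, both serialize to the same text, so diffs stay quiet.
sort_attributes(a)
sort_attributes(b)
out_a = ET.tostring(a, encoding="unicode")
out_b = ET.tostring(b, encoding="unicode")
```

This relies on ElementTree serializing attributes in insertion order, which is the behavior from Python 3.8 onward.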
Significant versus insignificant whitespace
Whitespace between tags outside text content is usually insignificant, and formatters add newlines and indentation freely there. Whitespace inside text content is significant by default—an XML parser reports the whitespace to the consuming application. The xml:space="preserve" attribute flags an element whose whitespace must not be altered, and xml:space="default" allows the formatter to treat whitespace normally. When pretty-printing, respect these attributes or you risk corrupting content that downstream systems interpret literally.
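One way a formatter can honor xml:space is to stop descending as soon as it meets an element marked "preserve". The sketch below does that on top of Python's xml.etree.ElementTree. The function name indent_respecting_space is hypothetical, and the logic is deliberately simple: it does not handle a descendant that switches back to xml:space="default", and it may add whitespace inside mixed content.

```python
import xml.etree.ElementTree as ET

# The reserved xml: prefix is always bound to this namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def indent_respecting_space(elem, level=0, step="  "):
    """Indent elem in place, leaving xml:space="preserve" subtrees untouched."""
    if elem.get(XML_SPACE) == "preserve":
        return  # do not touch this element's text or any descendant
    children = list(elem)
    if not children:
        return  # leaf element: its text is content, leave it alone
    # Only whitespace-only text and tails are replaced with indentation.
    if not (elem.text or "").strip():
        elem.text = "\n" + step * (level + 1)
    for i, child in enumerate(children):
        indent_respecting_space(child, level + 1, step)
        if not (child.tail or "").strip():
            is_last = i == len(children) - 1
            child.tail = "\n" + step * (level if is_last else level + 1)

doc = (
    '<root><code xml:space="preserve">\n'
    'def hello():\n'
    '    return "world"\n'
    '</code><description>Allow reformatting here.</description></root>'
)
root = ET.fromstring(doc)
indent_respecting_space(root)
formatted = ET.tostring(root, encoding="unicode")
```

The text inside the code element comes out exactly as it went in, while the elements around it gain newlines and indentation.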
<root>
  <code xml:space="preserve">
def hello():
    return "world"
  </code>
  <description>Allow reformatting here.</description>
</root>
CDATA sections
CDATA sections <![CDATA[ ... ]]> are used to include literal text that would otherwise need extensive escaping—code, HTML, or anything full of angle brackets and ampersands. Inside CDATA, all characters are taken literally except the closing ]]> sequence, which you have to split across two CDATA sections if your content contains it. Formatters must never reformat CDATA contents. The only valid transformation is converting CDATA to escaped text or vice versa, and that is a semantic change the formatter should not make automatically.
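The split for a literal ]]> can be sketched in a few lines of Python. The helper name to_cdata and the sample payload are made up for illustration; the round trip through xml.etree.ElementTree is only there to confirm that the parser reassembles the original text.

```python
import xml.etree.ElementTree as ET

def to_cdata(text):
    """Wrap text in CDATA, splitting any "]]>" so it never appears intact.

    Each "]]>" becomes "]]" + end of section + start of new section + ">".
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

# Illustrative payload containing angle brackets, ampersands, and "]]>".
payload = 'if (a[b[0]]> 1 && x < y) { run("<tag>"); }'

xml_text = "<script>" + to_cdata(payload) + "</script>"

# The parser joins the adjacent CDATA sections back into one string.
round_tripped = ET.fromstring(xml_text).text
```

Here the payload contains one ]]> sequence, so the output holds two CDATA sections, and parsing them yields the original string unchanged.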
Character escaping
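Five characters have special meaning in XML text content and attribute values: &, <, >, ' (apostrophe), and ". They are escaped as &amp;, &lt;, &gt;, &apos;, and &quot; respectively. Strictly, only & and < must always be escaped; > needs escaping only where it would complete a ]]> sequence in text, and a quote character only inside an attribute value delimited by that same quote. Escaping all five is always safe. Numeric character references like &#8212; (em dash) are legal in both text content and attribute values. A formatter should not silently switch between named and numeric references because the choice may be meaningful to consumers.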
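In Python, the standard library's xml.sax.saxutils module covers both cases: escape for text content and quoteattr for attribute values. The following is a small sketch with a made-up input string and element name, parsed back with xml.etree.ElementTree to check that nothing was lost.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

raw = 'Tom & "Jerry" <cat> vs \'mouse\''

# escape() handles &, <, and > for use in text content.
text_escaped = escape(raw)

# quoteattr() escapes the value and adds surrounding quotes,
# choosing a quote style that works for the data.
attr_quoted = quoteattr(raw)

# Build an illustrative element and parse it back.
xml_text = f"<note title={attr_quoted}>{text_escaped}</note>"
parsed = ET.fromstring(xml_text)
```

Both the attribute value and the text content parse back to the original string.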
Pretty-print versus canonical XML
Pretty-printing rewrites an XML document to be human-readable: consistent indentation, newlines between elements, and wrapped attributes. Canonical XML (C14N) rewrites a document into a byte-identical normalized form used for digital signatures and hash comparisons. C14N rules include: no XML declaration, sorted attribute order, normalized whitespace in attribute values, resolved namespace declarations, and replacement of empty-element tags with start-and-end-tag pairs. C14N output is not especially readable but is reproducible, which is what matters for cryptographic operations on SAML assertions or XML-DSig documents.
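The contrast can be seen with Python's xml.etree.ElementTree, which offers indent (Python 3.9+) for pretty-printing and canonicalize (Python 3.8+) for canonical output. This is a sketch on a made-up document. Note that canonicalize implements C14N 2.0, so treat it as an illustration of the idea, not a drop-in for signature stacks that require a specific C14N variant.

```python
import xml.etree.ElementTree as ET

source = '<order id="7" currency="EUR"><item sku="A1"/><note></note></order>'

# Pretty-print: parse, add indentation in place, then serialize.
root = ET.fromstring(source)
ET.indent(root, space="  ")
pretty = ET.tostring(root, encoding="unicode")

# Canonical form: attributes sorted, empty-element tags expanded,
# no XML declaration.
canonical = ET.canonicalize(source)
# canonical == '<order currency="EUR" id="7"><item sku="A1"></item><note></note></order>'

# A differently written but equivalent document canonicalizes identically.
variant = '<order currency="EUR"   id="7"><item sku="A1"></item><note/></order>'
same_canonical = ET.canonicalize(variant) == canonical
```

The pretty output changes the whitespace a consumer sees; the canonical output changes only the serialization details that do not affect meaning.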
Handling large XML files
Pretty-printing a 500 MB XML file in memory will exhaust most environments. Streaming formatters that use a pull parser (StAX in Java, xml.etree.ElementTree.iterparse in Python) can pretty-print arbitrarily large documents. For occasional cleanup, splitting the file on a known boundary element, formatting chunks, and reassembling works well enough. For production pipelines, prefer tools that emit formatted output during serialization rather than reformatting after the fact.
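A minimal streaming sketch with iterparse might look like the following. It assumes the document is a flat list of record elements directly under the root, that the root has no attributes or namespaces worth keeping, and that there is no text between records. The function name format_records and the sample data are hypothetical, and ET.indent requires Python 3.9+.

```python
import io
import xml.etree.ElementTree as ET

def format_records(source, out, record_tag):
    """Stream record_tag elements from source and write each one indented to out.

    Returns the number of records written.
    """
    count = 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the root element's start
    out.write(f"<{root.tag}>\n")
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            elem.tail = None  # ignore whitespace that followed the record
            ET.indent(elem, space="  ", level=1)
            out.write("  " + ET.tostring(elem, encoding="unicode") + "\n")
            root.clear()  # drop finished records so memory use stays flat
            count += 1
    out.write(f"</{root.tag}>\n")
    return count

# Small in-memory stand-in for a large file.
source = io.BytesIO(
    b"<log><entry id='1'><msg>a</msg></entry><entry id='2'><msg>b</msg></entry></log>"
)
out = io.StringIO()
written = format_records(source, out, "entry")
```

Only one record is held in memory at a time, which is the property that matters for large inputs. A real tool would also need to carry over the declaration, root attributes, and namespace declarations.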
Common mistakes
Reformatting xml:space="preserve" content. Strip whitespace there and you corrupt the semantics of the document. A good formatter honors the attribute automatically.
Breaking CDATA with indentation. Formatters that indent inside CDATA add leading whitespace to the literal content, which changes what consumers see. CDATA content must be untouched.
Changing attribute order in signed documents. SAML and XML-DSig rely on canonical form for signature verification. Reformatting a signed document breaks the signature unless you reapply canonicalization identically.
Forgetting to escape ampersands. A raw & in text or an attribute value produces a parser error. Always escape as &amp;.
Mixing attribute quote styles. XML allows either single or double quotes around attribute values, but mixed quoting makes diffs noisy. Pick double quotes and let the formatter enforce it.
Assuming empty-element collapse is free. <tag></tag> and <tag/> parse to identical infosets, but a formatter that collapses them can change the byte signature of a document. For most uses this is harmless; for canonical documents it matters.
Editing BOM-containing files without care. A byte-order-mark before the XML declaration is legal in some encodings and illegal in others. Formatters that silently add or remove a BOM can break downstream parsers.
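Several of these mistakes surface immediately if the document is parsed before it is committed or shipped. A minimal well-formedness check in Python could look like this; the function name check_well_formed and the sample strings are made up for illustration, and the check covers well-formedness only, not schema validity.

```python
import xml.etree.ElementTree as ET

def check_well_formed(xml_text):
    """Return None if xml_text parses, else a "line N, column M: message" string."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # where the parser gave up
        return f"line {line}, column {column}: {err}"
    return None

good = "<item>Fish &amp; chips</item>"
bad = "<item>Fish & chips</item>"  # raw ampersand in text content

good_result = check_well_formed(good)
bad_result = check_well_formed(bad)
```

Running a check like this in a pre-commit hook or CI step catches the unescaped ampersand before any downstream consumer sees it.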
Run the numbers
Pretty-print, collapse, or canonicalize XML in the browser with the XML formatter. Pair with the HTML formatter for XHTML and SVG documents where both specs overlap, and the JSON formatter when you are translating legacy SOAP/XML payloads into JSON REST equivalents.