How to convert HTML to Markdown
Tag mappings, handling nested lists, tables, images, and inline styles, cleaning up WYSIWYG soup, and preserving readability.
Converting HTML to Markdown sounds lossless until you hit your first <table> with merged cells, a <div> soup full of inline styles, or a legacy CMS export where every paragraph is wrapped in a decorative span. Markdown is a deliberately small language: it covers roughly the subset of HTML a plain-prose writer needs, and anything beyond that subset has to be dropped, approximated, or preserved as raw HTML passthrough. Getting the conversion right means knowing which elements map cleanly, which are lossy, and how to configure your converter to fail gracefully. This guide covers the safe-mapping elements, the lossy ones, how to handle nested HTML, and the best strategies for bulk-migrating content without losing structure.
The safe subset
A handful of HTML elements map directly to Markdown with essentially no information loss. Headings <h1> through <h6> become # through ######. Paragraphs become blank-line-separated text. <strong> and <b> both become **bold**. <em> and <i> become _italic_. <code> becomes backtick spans, <pre><code> becomes fenced code blocks, and <blockquote> becomes >-prefixed lines. If your source HTML consists only of these tags, your converter can produce clean Markdown that round-trips back to equivalent HTML, with the one caveat that <b> and <i> come back as <strong> and <em>.
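The mapping above can be sketched with nothing but the Python standard library. This is a minimal illustration, not a production converter: the class and function names are made up for this example, it only handles headings, paragraphs, and the inline tags, and it assumes inline tags always sit inside a heading or paragraph.

```python
from html.parser import HTMLParser

# Inline tags and the Markdown delimiter that wraps their content.
INLINE = {"strong": "**", "b": "**", "em": "_", "i": "_", "code": "`"}
# Block tags and the prefix that starts their line.
BLOCK = {f"h{n}": "#" * n + " " for n in range(1, 7)}
BLOCK["p"] = ""


class SafeSubsetConverter(HTMLParser):
    """Converts only the safe subset; unknown tags are ignored, their text kept."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []   # finished Markdown blocks
        self.current = []  # pieces of the block being built

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK:
            self.current = [BLOCK[tag]]
        elif tag in INLINE:
            self.current.append(INLINE[tag])

    def handle_endtag(self, tag):
        if tag in BLOCK:
            self.blocks.append("".join(self.current).strip())
            self.current = []
        elif tag in INLINE:
            self.current.append(INLINE[tag])

    def handle_data(self, data):
        self.current.append(data)


def html_to_markdown(html: str) -> str:
    parser = SafeSubsetConverter()
    parser.feed(html)
    parser.close()
    # Blocks are separated by one blank line, as Markdown paragraphs require.
    return "\n\n".join(block for block in parser.blocks if block)
```

Text that appears outside any heading or paragraph is dropped by this sketch, which is one reason real converters are considerably larger.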
Links and images
Links and images convert cleanly when the attributes are minimal. <a href="/about">About</a> becomes [About](/about), and <img src="x.png" alt="X"> becomes ![X](x.png). Extra attributes like class, id, rel, target, loading, or inline styles cannot be expressed in standard Markdown and must be dropped. If any of those attributes carry semantic meaning in your content (for example, target="_blank" on external links), your converter should be configured to either fall back to inline HTML or warn you so you can add the attributes back manually after conversion.
HTML: <a href="https://example.com" target="_blank" rel="noopener">Docs</a>
Lossy: [Docs](https://example.com)
Lossless: <a href="https://example.com" target="_blank" rel="noopener">Docs</a>
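One way to express the lossy-versus-lossless choice is a small function that emits Markdown only when nothing would be lost. The sketch below is hypothetical: the function name, the keep_html_on_loss flag, and the decision to treat href and title as the only expressible attributes are assumptions made for illustration.

```python
from html import escape

# Attributes that standard Markdown link syntax can represent.
EXPRESSIBLE = {"href", "title"}


def convert_link(text: str, attrs: dict, keep_html_on_loss: bool = True) -> str:
    """Return Markdown for a link, or raw HTML when attributes would be lost."""
    extra = set(attrs) - EXPRESSIBLE
    if extra and keep_html_on_loss:
        # Passthrough: rebuild the original tag so no attribute is dropped.
        rendered = " ".join(f'{key}="{escape(value)}"' for key, value in attrs.items())
        return f"<a {rendered}>{escape(text)}</a>"
    title = f' "{attrs["title"]}"' if attrs.get("title") else ""
    return f"[{text}]({attrs.get('href', '')}{title})"
```

Calling it with the attributes from the example and keep_html_on_loss=False gives the lossy form; the default gives the passthrough form.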
Lists and nesting
Unordered lists become -, *, or + lines; ordered lists become 1., 2., etc. Some converters normalize the numbers to 1. on every line because Markdown renderers auto-increment. Nested lists work, but the required indentation depends on the flavor: the original Markdown syntax expects four spaces (or one tab) per level, while CommonMark and GFM require nested content to line up with the parent item's text, which means two spaces under "- " and three under "1. ". The real trouble is when HTML lists contain block-level children such as paragraphs, code blocks, tables, or other lists. Those need a blank line and continuation-indented content inside the list item, which many naive converters get wrong.
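The indentation rule is easier to see in code. The sketch below assumes CommonMark-style alignment, where nested content is indented by the width of the parent's marker, and uses a made-up (text, children) tuple shape for list items. It does not handle block-level children.

```python
def render_list(items, ordered=False, indent=""):
    """Render nested items as Markdown lines.

    Each item is a (text, children) tuple; children is a list of the same shape.
    """
    lines = []
    marker = "1. " if ordered else "- "
    for text, children in items:
        lines.append(f"{indent}{marker}{text}")
        # Nested content is indented so it lines up with the parent's text.
        lines.extend(render_list(children, ordered, indent + " " * len(marker)))
    return lines
```

For example, "\n".join(render_list([("Fruit", [("Apple", [])])])) puts "- Apple" two spaces in, under the text of "- Fruit".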
Tables and other lossy elements
HTML tables support things Markdown tables cannot: merged cells via rowspan and colspan, captions, multiple tbody sections, headers in the left column, and arbitrary HTML inside cells. A converter has three options: flatten the table into a GFM pipe table and drop the unsupported features, emit the entire <table> as raw HTML inside the Markdown, or throw an error and ask for human review. For documentation content, raw HTML passthrough is usually the right call because preserving the structure matters more than source readability.
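A converter can make that choice per table by inspecting it first. The following is a rough sketch with hypothetical names; it only checks for merged cells, captions, and multiple tbody sections, and treats anything else as safe to flatten.

```python
from html.parser import HTMLParser


class TableInspector(HTMLParser):
    """Collects features of a <table> that a GFM pipe table cannot express."""

    def __init__(self):
        super().__init__()
        self.problems = set()
        self.tbody_count = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("td", "th"):
            for span in ("rowspan", "colspan"):
                # A span of 1 is the default and loses nothing when flattened.
                if (attrs.get(span) or "1").strip() != "1":
                    self.problems.add(span)
        elif tag == "caption":
            self.problems.add("caption")
        elif tag == "tbody":
            self.tbody_count += 1
            if self.tbody_count > 1:
                self.problems.add("multiple tbody")


def table_strategy(table_html: str) -> str:
    """Return 'pipe' when flattening looks safe, else 'passthrough'."""
    inspector = TableInspector()
    inspector.feed(table_html)
    inspector.close()
    return "passthrough" if inspector.problems else "pipe"
```

Returning the set of problems instead of a single string would also support the third option, flagging the table for human review.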
Nested and decorative HTML
Legacy CMS exports often wrap content in decorative <div> and <span> elements with classes like wp-block-paragraph, mce-container, or post-content. Those elements carry no meaning and should be stripped before conversion. A well-configured converter unwraps purely decorative containers and keeps their children. More aggressive converters apply a whitelist: anything outside a known-safe set of tags is either unwrapped or dropped entirely.
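Unwrapping can run as a pre-processing pass before the real conversion. The sketch below uses hypothetical names and assumes every <div> and <span> is decorative, which is a strong assumption: a span whose class does signal emphasis, as in the example that follows, would need its own rule rather than blanket unwrapping. Comments and doctype declarations are dropped by this sketch.

```python
from html.parser import HTMLParser

DECORATIVE = {"div", "span"}  # assumption: these never carry meaning here


class Unwrapper(HTMLParser):
    """Re-emits the input HTML but omits decorative wrapper tags."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in DECORATIVE:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        if tag not in DECORATIVE:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in DECORATIVE:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")  # keep entities exactly as written

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def unwrap_decorative(html: str) -> str:
    parser = Unwrapper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```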
Input:
<div class="wp-block-group">
<div class="wp-block-column">
<p>Hello <span class="emphasis">world</span>.</p>
</div>
</div>
Output:
Hello **world**.
Inline styles and color
Markdown has no syntax for color, font family, size, or any other CSS property. If your source uses <span style="color: red"> to flag errors or <mark> for highlights, you have to choose between dropping the styling, replacing it with an emoji or Unicode marker, or preserving the HTML as passthrough. For technical docs, passthrough is fine. For blog prose, dropping is usually cleaner and the styling should be reintroduced via CSS classes once the Markdown renders.
Code blocks and language hints
A well-formed <pre><code class="language-python"> block converts cleanly into a fenced code block with the language tag. But many editors emit class="lang-python", class="python", or nothing at all, and highlight.js and Prism use different class conventions. A good converter detects the language from any of the common class prefixes and falls back to an untagged fence when no language can be identified. Preserve indentation exactly: everything between the fences is rendered verbatim, so stray or missing whitespace shows up in the output.
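Language detection from the class attribute takes only a few lines. In this sketch the prefix list and the set of bare language names are illustrative assumptions, and the function names are invented for the example.

```python
# Class prefixes commonly seen on <code> elements.
PREFIXES = ("language-", "lang-")
# Bare class names accepted as languages; illustrative, extend as needed.
KNOWN = {"python", "javascript", "typescript", "bash", "html", "css", "json"}

FENCE = "`" * 3  # built this way to avoid a literal fence inside this example


def detect_language(class_attr):
    """Return the language named in a class attribute, or '' if none is found."""
    for token in (class_attr or "").split():
        token = token.lower()
        for prefix in PREFIXES:
            if token.startswith(prefix) and len(token) > len(prefix):
                return token[len(prefix):]
        if token in KNOWN:
            return token
    return ""


def fence(code, class_attr=None):
    """Wrap code in a fenced block, tagged with the language when known."""
    body = code.strip("\n")  # drop only surrounding newlines, keep indentation
    return f"{FENCE}{detect_language(class_attr)}\n{body}\n{FENCE}"
```

An unrecognized or missing class produces an untagged fence, which matches the fallback described above.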
Bulk migration strategy
For one-off conversions, paste the HTML into a converter and clean up the output by hand. For a bulk migration—say, a thousand CMS posts—script the conversion with a tool like Turndown (JavaScript) or html2text (Python), tune the rules to match your HTML patterns, and run the conversion against a sample of twenty posts first. Look for patterns that break: custom shortcodes, embedded widgets, <iframe> embeds, and anything generated by a WYSIWYG editor. Build transformations for those patterns before running the full batch, or you will spend weeks cleaning up the output.
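Before tuning converter rules, it can help to measure which troublesome patterns actually occur in the sample. The audit below is a sketch: the pattern names and regular expressions are illustrative guesses, and the shortcode pattern in particular will also match ordinary bracketed words in prose.

```python
import re
from collections import Counter

# Illustrative patterns; real ones depend on the CMS being migrated.
PATTERNS = {
    "iframe": re.compile(r"<iframe\b", re.I),
    "shortcode": re.compile(r"\[[a-z][\w-]*(?:\s[^\]]*)?\]", re.I),
    "inline_style": re.compile(r"\sstyle\s*=", re.I),
    "script": re.compile(r"<script\b", re.I),
}


def audit(posts):
    """Count how many posts in the sample contain each risky pattern."""
    hits = Counter()
    for html in posts:
        for name, pattern in PATTERNS.items():
            if pattern.search(html):
                hits[name] += 1
    return hits
```

Sorting the result with hits.most_common() gives a rough priority order for which custom transformations to build first.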
Round-trip testing
The fastest way to check converter quality is to convert a sample from HTML to Markdown and back. Elements that survive the round trip without changing are safe. Elements that degrade, lose attributes, or restructure are the ones you need to handle manually or with custom rules. A round-trip diff run on a representative sample gives you a concrete coverage number for your migration.
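A coverage number of this kind can be computed with a small harness. The sketch below is converter-agnostic: to_markdown and to_html are placeholders for whatever functions your tools expose, and the whitespace-only normalization is an assumption that will be too strict for converters that also rewrite tags.

```python
import re


def normalize(html: str) -> str:
    """Collapse whitespace so formatting-only differences do not count."""
    return re.sub(r"\s+", " ", html).strip()


def round_trip_coverage(samples, to_markdown, to_html):
    """Return (fraction that survives HTML -> Markdown -> HTML, failing samples)."""
    if not samples:
        return 1.0, []
    failed = [
        sample for sample in samples
        if normalize(to_html(to_markdown(sample))) != normalize(sample)
    ]
    return (len(samples) - len(failed)) / len(samples), failed
```

The list of failing samples is usually more useful than the number itself, because it points directly at the elements that need custom rules.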
Common mistakes
Assuming the conversion is lossless. Markdown covers maybe sixty percent of real-world HTML cleanly. The rest needs choices. Plan for review and cleanup in your migration schedule.
Dropping tables silently. If your converter turns a complex table into paragraphs of piped text, you lose the structure. Either force HTML passthrough for tables or flag them for manual review.
Keeping decorative wrappers. Converting every <div> and <span> into raw HTML passthrough defeats the point of going to Markdown. Strip the ones that carry no semantic weight.
Forgetting image paths change. Relative image paths resolve against the location of the rendered page, just as in HTML, so a site reorganization during migration often breaks every image. Rewrite image paths as part of the conversion, not after.
Ignoring whitespace inside code blocks. Leading tabs, trailing spaces, and blank lines in code matter. Converters that trim whitespace aggressively will corrupt your code samples.
Not handling HTML entities. Entities like &mdash;, &nbsp;, and &#8212; should be decoded to their literal characters during conversion unless they appear inside code blocks.
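The entity rule in the last item can be sketched as a pass over the source HTML. This example makes two assumptions beyond the text: code is identified with a simple regular expression rather than a real parser, and entities that decode to <, >, or & are left alone so that decoding cannot create new markup.

```python
import html
import re

# Matches <pre>...</pre> and <code>...</code> so they can be left untouched.
CODE_SPAN = re.compile(r"(<pre\b.*?</pre>|<code\b.*?</code>)", re.S | re.I)
ENTITY = re.compile(r"&(?:#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def _decode(match):
    entity = match.group(0)
    decoded = html.unescape(entity)
    # Keep entities whose literal form would be read as markup.
    return entity if decoded in ("<", ">", "&") else decoded


def decode_entities_outside_code(source: str) -> str:
    parts = CODE_SPAN.split(source)
    # With one capturing group, odd indexes hold the matched code spans.
    return "".join(
        part if index % 2 else ENTITY.sub(_decode, part)
        for index, part in enumerate(parts)
    )
```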
Run the numbers
Convert single files or whole batches with the HTML to Markdown converter. Pair with the Markdown to HTML converter for round-trip verification that your output renders the same as your input, and the HTML formatter to tidy up the source before conversion so the converter has a consistent input shape to work with.