HTML to Markdown: The Complete Reverse Conversion Guide

Updated Feb 24, 2026 · 9 min read

Most people think of markdown conversion as a one-way street. You write markdown, you convert it to HTML, and you move on. But the reverse direction, HTML to markdown, has become one of the most important conversion paths in modern software development. And the reason has everything to do with AI.

If you work with large language models, build RAG pipelines, or simply want to archive web content in a portable format, understanding HTML-to-markdown conversion will save you significant time and frustration.

Why convert HTML to markdown?

The obvious question: why would anyone want to go backward? HTML is the format of the web. Markdown is a simplified authoring format. Converting rendered content back to source format sounds like unnecessary work.

There are three major reasons this conversion matters more now than ever before.

AI and LLM context preparation. Large language models work best with clean, structured text. When you need to feed web content into a RAG pipeline, provide context to an AI assistant, or build a knowledge base from existing web pages, markdown is the ideal intermediate format. It preserves structure (headings, lists, tables, code blocks) while stripping away the noise of HTML tags, CSS classes, JavaScript, and layout markup. A 50KB HTML page might shrink to around 5KB of clean markdown, which means far fewer tokens for an LLM to process.

Content extraction and archiving. Web pages are fragile. They go offline, get redesigned, or disappear behind paywalls. Converting HTML to markdown creates a durable, portable archive that you can store in Git, search with grep, and read without a browser. Markdown files are plain text, so they should still be readable in 50 years.

Migration between tools. If you are moving from a CMS, wiki, or documentation platform to a markdown-based system (Obsidian, MkDocs, Docusaurus, Hugo), you need to convert existing HTML content to markdown. This is also common when extracting content from Google Docs, Confluence, or Notion for use in developer documentation.

The tools: a landscape overview

The HTML-to-markdown ecosystem has matured significantly. Here are the major tools, organized by approach.

Turndown.js

Turndown.js is the dominant library for client-side and Node.js HTML-to-markdown conversion. With over 2.5 million weekly npm downloads and 10,800+ GitHub stars, it is the de facto standard. The library weighs roughly 4KB minified and supports GFM (GitHub Flavored Markdown) tables, task lists, and strikethrough via its official plugin, turndown-plugin-gfm.

Turndown works by walking the DOM tree and applying conversion rules to each element. It ships with sensible defaults: <h2> becomes ## , <strong> becomes **text**, <a href="..."> becomes [text](url). Where it really shines is extensibility. You can write custom rules to handle any HTML pattern your source content uses.

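const TurndownService = require('turndown');

// Use ATX headings (## Title) and fenced code blocks
// instead of the Setext and indented defaults.
const turndown = new TurndownService({
  headingStyle: 'atx',
  codeBlockStyle: 'fenced'
});

// Custom rule: wrap any element with class "highlight" in ==...== markers
turndown.addRule('highlight', {
  filter: node => node.classList?.contains('highlight'),
  replacement: (content) => `==${content}==`
});

// Sample input for illustration; any HTML string works here
const htmlString = '<h2>Title</h2><p>Some <span class="highlight">marked</span> text.</p>';
const markdown = turndown.turndown(htmlString);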

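The main limitation: Turndown operates on a DOM tree. In the browser, it uses the native DOM directly. In Node.js, it has to build one first, either with the DOM implementation bundled in the package (domino) or from nodes you pass in from a parser like jsdom or linkedom.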
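The rule-walking idea can be illustrated without Turndown at all. The sketch below is a toy, not Turndown's implementation: the node shape (`tag`, `attrs`, `children`, `text`) and the `rules` table are made up for the example. It converts an element's children first, then hands the resulting text to the rule registered for that element's tag.

```javascript
// Toy rule table: one replacement function per tag name (hypothetical shape).
const rules = {
  h2: (content) => `## ${content}\n\n`,
  p: (content) => `${content}\n\n`,
  strong: (content) => `**${content}**`,
  a: (content, node) => `[${content}](${node.attrs.href})`
};

// Depth-first walk: convert the children, then apply the rule for this element.
function toMarkdown(node) {
  if (node.text !== undefined) return node.text; // text node
  const content = node.children.map(toMarkdown).join('');
  const rule = Object.hasOwn(rules, node.tag) ? rules[node.tag] : null;
  return rule ? rule(content, node) : content; // tags without a rule pass through
}

const tree = {
  tag: 'div',
  children: [
    { tag: 'h2', children: [{ text: 'Install' }] },
    {
      tag: 'p',
      children: [
        { text: 'See the ' },
        { tag: 'a', attrs: { href: 'https://example.com/docs' }, children: [{ text: 'docs' }] },
        { text: ' for ' },
        { tag: 'strong', children: [{ text: 'details' }] },
        { text: '.' }
      ]
    }
  ]
};

console.log(toMarkdown(tree).trim());
```

Adding support for a new element in this model means adding one entry to the table, which is the same reason custom rules are cheap to write in a rule-based converter.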

node-html-markdown

A newer alternative with around 333K weekly npm downloads. It takes a different approach by parsing HTML with a streaming tokenizer instead of building a full DOM tree. This makes it faster for large documents and avoids a heavyweight DOM dependency in Node.js. The trade-off is slightly less flexibility for custom rules compared to Turndown's rule and plugin system.

Pandoc (reverse mode)

Pandoc, the universal document converter, can convert HTML to markdown using pandoc -f html -t markdown. It handles edge cases that JavaScript libraries miss, like nested tables, complex list structures, and definition lists. The downside: Pandoc is a Haskell binary, so it requires a system-level installation. It is not suitable for browser-based or serverless environments without significant effort.

htmd (Rust)

A Rust-based HTML-to-markdown converter that prioritizes speed. Useful for batch processing thousands of pages, but the ecosystem and plugin support are less mature than Turndown.

The AI angle: web pages as LLM context

The biggest growth driver for HTML-to-markdown conversion in 2025 and 2026 is AI context preparation. Several purpose-built tools have emerged specifically for this use case.

Crawl4AI is an open-source web crawler that outputs markdown specifically for LLM consumption. It crawls websites, extracts main content (stripping navigation, ads, footers), and produces clean markdown files ready for RAG indexing or context injection. It is designed for the workflow of "I need to give my AI access to this website's content."

Firecrawl positions itself as a "Web Data API for AI." It handles JavaScript-rendered pages (which traditional crawlers miss), extracts content, and returns structured markdown. Its API is designed for developers building AI applications that need web content as input.

Jina Reader API provides a simple interface: pass a URL, get back markdown. It handles the entire pipeline of fetching, rendering, extracting, and converting. Particularly useful for one-off conversions or lightweight integrations.
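A minimal sketch of the "pass a URL, get back markdown" workflow follows. It assumes the reader endpoint works by prefixing the target URL with `https://r.jina.ai/`; check the current Jina documentation for authentication, rate limits, and response options before relying on it. The helper names `readerUrl` and `fetchMarkdown` are made up for this example.

```javascript
// Assumed endpoint: the target URL is appended to this prefix.
const READER_PREFIX = 'https://r.jina.ai/';

function readerUrl(targetUrl) {
  // Validate early so a typo fails here rather than as a confusing HTTP error.
  const parsed = new URL(targetUrl);
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error(`Unsupported protocol: ${parsed.protocol}`);
  }
  return READER_PREFIX + parsed.href;
}

// Fetch the markdown for a page. Uses the global fetch available in
// browsers and in Node.js 18 or later.
async function fetchMarkdown(targetUrl) {
  const response = await fetch(readerUrl(targetUrl));
  if (!response.ok) {
    throw new Error(`Reader request failed with status ${response.status}`);
  }
  return response.text();
}

console.log(readerUrl('https://example.com/post'));
```

Calling `fetchMarkdown('https://example.com/post')` returns a promise for the markdown text, so it can be awaited inside any async function.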

ReaderLM-v2 takes a fundamentally different approach. It is a 1.5-billion-parameter language model trained specifically for HTML-to-markdown conversion. Instead of rule-based DOM walking, it uses neural inference to understand page structure and produce markdown. This handles messy, inconsistent HTML better than rule-based approaches, at the cost of requiring GPU inference.

The two-step approach: extract then convert

For serious web-to-markdown pipelines, the best practice is a two-step process. First, use a content extraction tool like trafilatura (Python) to identify and extract the main content from a web page, stripping away navigation, headers, footers, ads, and boilerplate. Then, run the extracted HTML through a markdown converter like Turndown for precise structural conversion.

This separation of concerns produces cleaner results than trying to do both in a single pass. Content extraction is a classification problem (what is content vs. chrome). Markdown conversion is a structural transformation problem (how to represent this element). Different tools excel at each task.
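The separation can be sketched as two functions with a narrow interface between them. The extraction step below is a deliberately crude regex stand-in for a real extractor such as trafilatura (it breaks on nested tags of the same name and would also drop a `<header>` inside an article), and the function names are assumptions for this example. The point is the shape: extraction returns HTML, and conversion is injected as a separate step.

```javascript
// Step 1 (extraction): prefer <article> or <main> and drop obvious page chrome.
// A crude stand-in for a real content extractor.
function extractMainHtml(html) {
  const withoutChrome = html.replace(
    /<(script|style|nav|header|footer|aside)\b[^>]*>[\s\S]*?<\/\1>/gi,
    ''
  );
  const main = withoutChrome.match(/<(article|main)\b[^>]*>([\s\S]*?)<\/\1>/i);
  return (main ? main[2] : withoutChrome).trim();
}

// Step 2 (conversion): injected, so any converter with a string -> string
// interface can be plugged in, e.g. (html) => turndownService.turndown(html).
function pageToMarkdown(html, convert) {
  return convert(extractMainHtml(html));
}

const page = `
<header><a href="/">Home</a></header>
<nav><ul><li>Menu</li></ul></nav>
<article><h1>Title</h1><p>Body text.</p></article>
<footer>Copyright</footer>`;

// Placeholder converter so the sketch runs without dependencies.
const passthrough = (html) => html;
console.log(pageToMarkdown(page, passthrough));
```

Because the converter is a parameter, either half can be swapped or tested on its own without touching the other.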

Common conversion challenges

HTML-to-markdown conversion is not as straightforward as it appears. Several patterns cause problems across all tools.

Inline styles and CSS classes

HTML often uses CSS classes or inline styles to convey meaning that has no markdown equivalent. A <span class="warning"> or <span style="color: red"> carries semantic information that disappears in markdown. You either lose the information or need custom rules to map specific classes to markdown extensions like callouts or highlights.
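As one way to keep that information, here is a sketch of a rule in the `{ filter, replacement }` shape that Turndown's `addRule` accepts. Mapping `class="warning"` to a GitHub-style alert is this example's choice, not a standard, and it suits block-level elements; an inline `<span>` would need an inline marker instead. The `fakeNode` object only stands in for a DOM element so the rule can run without a parser.

```javascript
// Rule object in the { filter, replacement } shape used by Turndown's addRule.
const warningRule = {
  // Match any element whose class list contains "warning".
  filter: (node) =>
    typeof node.className === 'string' &&
    node.className.split(/\s+/).includes('warning'),
  // Prefix every line with "> " so multi-line content stays inside the callout.
  replacement: (content) => {
    const quoted = content
      .trim()
      .split('\n')
      .map((line) => `> ${line}`.trimEnd())
      .join('\n');
    return `\n\n> [!WARNING]\n${quoted}\n\n`;
  }
};

// Stand-in for a DOM element (hypothetical, for illustration only).
const fakeNode = { nodeName: 'DIV', className: 'note warning' };

if (warningRule.filter(fakeNode)) {
  console.log(warningRule.replacement('Back up your data first.\nThis cannot be undone.'));
}
```

With Turndown installed, the rule would be registered with `turndownService.addRule('warningCallout', warningRule)`.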

Nested and complex tables

Markdown tables are deliberately simple: single-row headers, basic alignment, no merged cells, no nested tables. When HTML tables use colspan, rowspan, or nested <table> elements, there is no clean markdown representation. Most tools either flatten the structure (losing information) or fall back to raw HTML passthrough.
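The "flatten" option can be sketched as follows. The input shape (rows as arrays of `{ text, colspan }` cells) is an assumption for this example, and `rowspan` is left out entirely. A cell that spans several columns keeps its text in the first one and pads the rest with empty cells, which keeps the grid rectangular but loses the fact that the cell was merged.

```javascript
// Expand one row: a cell with colspan N becomes its text plus N-1 empty cells.
function expandRow(cells) {
  return cells.flatMap(({ text, colspan = 1 }) => [
    text,
    ...Array(Math.max(colspan, 1) - 1).fill('')
  ]);
}

function toGfmTable(headerCells, bodyRows) {
  const header = expandRow(headerCells);
  const rows = bodyRows.map(expandRow);
  const width = Math.max(header.length, ...rows.map((r) => r.length));
  // Pad short rows and escape pipes, which would otherwise end a cell early.
  const line = (cells) => {
    const padded = [...cells, ...Array(width - cells.length).fill('')];
    return `| ${padded.map((c) => c.replace(/\|/g, '\\|')).join(' | ')} |`;
  };
  const divider = `| ${Array(width).fill('---').join(' | ')} |`;
  return [line(header), divider, ...rows.map(line)].join('\n');
}

const table = toGfmTable(
  [{ text: 'Region' }, { text: 'Sales', colspan: 2 }],
  [
    [{ text: 'North' }, { text: '10' }, { text: '12' }],
    [{ text: 'South' }, { text: 'n/a', colspan: 2 }]
  ]
);
console.log(table);
```

When the merged structure matters to readers, passing the original `<table>` through as raw HTML is the safer choice.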

Semantic vs. presentational HTML

Clean, semantic HTML (<article>, <section>, <nav>, <aside>) converts cleanly, because the tags describe what the content is. Presentational HTML (<div class="col-md-6">, <table> used for layout) converts poorly, because the tags describe only where things sit on the page. The quality of your output depends heavily on the quality of your input HTML.

Images and media

Markdown supports images with ![alt](src), but HTML images often use lazy loading (data-src), responsive sources (<picture>, srcset), or CSS background images. Most converters only handle the basic <img src="..."> case.
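A converter can recover some of these cases by choosing a source before emitting the image. The sketch below takes a plain object of `<img>` attributes (a hypothetical input shape) and prefers `data-src`, then the widest `srcset` candidate, then `src`. Note that `data-src` is a common convention rather than a standard, and splitting `srcset` on commas is a simplification that fails for URLs containing commas.

```javascript
// attrs: plain object holding an <img> element's attributes.
function pickImageSource(attrs) {
  // Lazy-loading scripts often keep the real URL in data-src
  // and leave a placeholder in src.
  if (attrs['data-src']) return attrs['data-src'];
  if (attrs.srcset) {
    // srcset is a comma-separated list of "url descriptor" entries;
    // take the candidate with the largest "w" descriptor.
    const candidates = attrs.srcset.split(',').map((entry) => {
      const [url, descriptor = ''] = entry.trim().split(/\s+/);
      return { url, width: descriptor.endsWith('w') ? parseInt(descriptor, 10) : 0 };
    });
    candidates.sort((a, b) => b.width - a.width);
    if (candidates[0].url) return candidates[0].url;
  }
  return attrs.src || '';
}

function imageToMarkdown(attrs) {
  const src = pickImageSource(attrs);
  if (!src) return '';
  const alt = (attrs.alt || '').replace(/[\[\]]/g, ''); // brackets would break the syntax
  return `![${alt}](${src})`;
}

console.log(imageToMarkdown({
  src: 'placeholder.gif',
  'data-src': 'https://example.com/photo.jpg',
  alt: 'A photo'
}));
console.log(imageToMarkdown({
  srcset: 'small.jpg 480w, large.jpg 1200w',
  alt: 'Responsive'
}));
```

CSS background images are out of reach for this approach, since the URL lives in a stylesheet rather than in the element's attributes.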

Whitespace and formatting

HTML collapses whitespace by default. Markdown is whitespace-sensitive (indentation matters for lists and code blocks). This mismatch produces subtle bugs, especially with nested lists and code blocks that appear inside other elements.
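The two halves of the mismatch can be shown in a few lines. This is a sketch under stated assumptions: the list is given as `{ text, children }` objects (a made-up shape standing in for a parsed `<ul>`), source text is collapsed roughly the way a browser treats inline whitespace, and nested items are indented by two spaces per level so they stay attached to their parent item.

```javascript
// Collapse runs of whitespace to one space, roughly what a browser does
// when rendering inline text with the default white-space handling.
function collapseWhitespace(text) {
  return text.replace(/[ \t\r\n\f]+/g, ' ').trim();
}

// items: array of { text, children? }. Each nesting level is indented so the
// child list stays inside its parent item instead of starting a new list.
function renderList(items, depth = 0) {
  const indent = '  '.repeat(depth);
  return items
    .map((item) => {
      const line = `${indent}- ${collapseWhitespace(item.text)}`;
      const hasChildren = item.children && item.children.length > 0;
      return hasChildren ? `${line}\n${renderList(item.children, depth + 1)}` : line;
    })
    .join('\n');
}

const list = [
  {
    text: '\n      Install   the\n      package\n    ',
    children: [{ text: '  npm  ' }, { text: '  yarn ' }]
  },
  { text: ' Run   it ' }
];
console.log(renderList(list));
```

Content inside `<pre>` must skip the collapsing step, since whitespace there is significant in both formats.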

MCP and AI tool integration

The Model Context Protocol (MCP) ecosystem has embraced HTML-to-markdown conversion. An html-to-markdown-mcp server exists that wraps Turndown for use as an MCP tool, allowing AI assistants like Claude to convert HTML to markdown as part of their workflow.

This fits naturally into the AI context preparation use case. An AI assistant can fetch a web page, convert it to markdown via MCP, and then use that markdown as context for answering questions or generating content. The conversion happens within the AI's tool-use loop, with no manual intervention required.
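Under the hood, an MCP tool invocation is a JSON-RPC 2.0 message using the `tools/call` method. The sketch below shows what such an exchange might look like for a conversion tool. The tool name `html_to_markdown` and the `html` argument key are hypothetical; a real server advertises its own names and input schema through `tools/list`, and the sample response is made up for the example.

```javascript
// Build a JSON-RPC 2.0 request for MCP's tools/call method.
// Tool name and argument key are hypothetical placeholders.
function buildToolCall(id, html) {
  return {
    jsonrpc: '2.0',
    id,
    method: 'tools/call',
    params: {
      name: 'html_to_markdown',
      arguments: { html }
    }
  };
}

// Pull the text out of a tools/call result, which carries a list of content items.
function readToolResult(response) {
  if (response.error) throw new Error(response.error.message);
  return response.result.content
    .filter((item) => item.type === 'text')
    .map((item) => item.text)
    .join('\n');
}

const request = buildToolCall(1, '<h1>Hello</h1>');
console.log(JSON.stringify(request, null, 2));

// A response shaped the way such a server might reply (illustrative only).
const sampleResponse = {
  jsonrpc: '2.0',
  id: 1,
  result: { content: [{ type: 'text', text: '# Hello' }] }
};
console.log(readToolResult(sampleResponse));
```

In practice the assistant's MCP client builds and sends these messages; the sketch only makes the wire format visible.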

The Remarkdown approach in Unmarkdown™

Unmarkdown™ includes a feature called Remarkdown that handles the reverse conversion direction. When you paste formatted HTML into the editor (rich text copied from a web page, Google Docs, or any application), Unmarkdown™ automatically detects the formatted content and converts it to clean markdown.

Under the hood, Remarkdown uses Turndown with 14 custom rules tailored for the specific patterns that real-world clipboard HTML produces. This includes handling for Google Docs formatting (which uses heavily nested <span> elements with inline CSS), Slack-style markup, and the quirks of various browser copy implementations.
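To give a sense of what a rule for this kind of clipboard HTML can look like, here is an illustrative sketch. It is not one of Remarkdown's actual rules, which are not shown in this article. It assumes bold and italic text arrives as `<span>` elements carrying `font-weight` and `font-style` in an inline `style` attribute, and it uses the `{ filter, replacement }` shape that Turndown's `addRule` accepts. The `span` object is a stand-in for a DOM element.

```javascript
// Parse an inline style attribute into a { property: value } map.
function parseStyle(styleText) {
  const style = {};
  for (const declaration of (styleText || '').split(';')) {
    const index = declaration.indexOf(':');
    if (index === -1) continue;
    const property = declaration.slice(0, index).trim().toLowerCase();
    style[property] = declaration.slice(index + 1).trim().toLowerCase();
  }
  return style;
}

// Rule: treat <span> elements styled as bold or italic as markdown emphasis.
const styledSpanRule = {
  filter: (node) => node.nodeName === 'SPAN' && Boolean(node.getAttribute('style')),
  replacement: (content, node) => {
    if (!content.trim()) return content; // never wrap whitespace in markers
    const style = parseStyle(node.getAttribute('style'));
    const weight = style['font-weight'];
    const bold = weight === 'bold' || parseInt(weight, 10) >= 600;
    const italic = style['font-style'] === 'italic';
    let result = content;
    if (italic) result = `*${result}*`;
    if (bold) result = `**${result}**`;
    return result;
  }
};

// Minimal stand-in for a DOM element so the rule runs without a parser.
const span = {
  nodeName: 'SPAN',
  getAttribute: (name) =>
    name === 'style' ? 'font-size:11pt;font-weight:700;font-style:italic' : null
};
console.log(styledSpanRule.replacement('important', span));
```

Spans with neither property fall through unchanged, so ordinary text wrapped in styling-only spans comes out as plain text.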

You can also open or drag-and-drop .html files directly into the editor. The same Turndown pipeline processes the file content and produces markdown that preserves the original document's structure: headings, lists, tables, code blocks, links, and images.

This is particularly useful for the workflow of "I have formatted content and I need it as markdown." Instead of manually rewriting or using an external tool, you paste and get clean markdown immediately.

Choosing the right tool

The right tool depends on your use case.

For browser-based conversion: Turndown.js. It uses the native DOM, weighs almost nothing, and has the largest ecosystem of plugins and custom rules.

For Node.js batch processing: Turndown.js, or node-html-markdown if you need speed and want to avoid building a full DOM tree.

For AI/LLM context preparation at scale: Crawl4AI or Firecrawl. They handle the full pipeline from URL to clean markdown, including JavaScript rendering and content extraction.

For one-off web page conversion: Jina Reader API. Pass a URL, get markdown. No setup required.

For complex documents with edge cases: Pandoc. It handles the widest range of HTML patterns, including nested tables, definition lists, and footnotes.

For pasting formatted content into a markdown editor: Unmarkdown™'s Remarkdown feature handles this automatically on paste.

The future of reverse conversion

HTML-to-markdown conversion is only going to become more important. As AI tools consume more web content, the demand for clean, structured text extraction will grow. The emergence of neural approaches like ReaderLM-v2 suggests that rule-based conversion may eventually be supplemented (or replaced) by models that understand document structure at a deeper level.

For now, Turndown.js remains the practical choice for most developers. It is fast, well-tested, extensible, and works everywhere JavaScript runs. Pair it with a content extraction step for web pages, and you have a pipeline that handles the vast majority of real-world HTML-to-markdown needs.
