Comparison of HTML parsers


HTML parsers are software for automated Hypertext Markup Language (HTML) parsing. They have two main purposes:

  • HTML traversal: offering an interface through which programmers can easily access and modify the HTML source. Canonical example: DOM parsers.
  • HTML cleaning: fixing invalid HTML and improving the layout and indentation of the resulting markup. Canonical example: HTML Tidy.
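
As an illustration of the traversal use case, here is a minimal sketch using Python's standard-library html.parser module. It is an event-driven parser rather than a DOM parser, and it is not one of the parsers compared on this page; the names LinkCollector and collect_links are hypothetical and chosen only for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> element seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, with value set to None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(html):
    """Return the href values of all links in an HTML string, in order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, collect_links('<p><A HREF="a.html">x</A> <a href="b.html">y</a></p>') returns ['a.html', 'b.html']. A DOM parser would instead build a tree that can be queried and modified after parsing.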

Parser             | License            | Implementation language(s) | Latest date*  | HTML parsing[1] | HTML5-compliant parsing | Clean HTML** | Update HTML***
-------------------|--------------------|----------------------------|---------------|-----------------|-------------------------|--------------|---------------
HTML Tidy          | W3C license        | ANSI C                     | 2021-03-24[2] | Yes[3]          | Yes                     | Yes[3]       | Yes
HtmlUnit           | Apache License 2.0 | Java                       | 2021-05-16[4] | Yes             | ?                       | No           | No
libxml2 HTMLparser | MIT License        | C                          | 2021-05-13[5] | Yes             | No                      | ?            | ?

* Date of the latest release with significant changes.
** Sanitizes (generates a standards-compliant web page, reduces spam, etc.) and cleans (strips out surplus presentational tags, removes XSS code, etc.) HTML code.
*** Updates HTML 4.x to XHTML or to HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
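
The CENTER-to-DIV conversion mentioned in the *** note can be sketched with Python's standard-library html.parser module. This is only an illustration under simplifying assumptions, not how HTML Tidy or any parser in the table works: the names CenterRewriter and update_center are hypothetical, comments and doctype declarations are dropped, and script or style content is not treated specially.

```python
from html import escape
from html.parser import HTMLParser


class CenterRewriter(HTMLParser):
    """Re-serializes HTML, replacing <center> with a styled <div>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            # Assumption: any attributes on <center> are discarded.
            tag, attrs = "div", [("style", "text-align:center;")]
        rendered = "".join(
            f" {name}" if value is None
            else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_startendtag(self, tag, attrs):
        # Emit "<br/>"-style tags as plain start tags, without an end tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape them.
        self.out.append(escape(data, quote=False))


def update_center(html):
    """Return html with every CENTER element rewritten as a centered DIV."""
    rewriter = CenterRewriter()
    rewriter.feed(html)
    rewriter.close()
    return "".join(rewriter.out)
```

For example, update_center('<CENTER>Hi &amp; bye</CENTER>') returns '<div style="text-align:center;">Hi &amp; bye</div>'. Tag names are lowercased on output because the parser reports them in lowercase.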


Edited: 2021-06-18 19:12:27