jsoup release 1.8.1

2014-Sep-27

jsoup 1.8.1 brings great performance improvements to text and tree serialization, the choice of HTML or XML output, and a range of other improvements and bug-fixes.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

Improvements

  • Introduced the ability to choose between HTML and XML output, and made HTML the default. This means img tags are output as <img>, not <img />. XML is the default when using the XmlTreeBuilder. Control this with the Document.OutputSettings.syntax() method.
  • Improved the performance of Element.text() by 3.2x.
  • Improved the performance of Element.html() by 1.7x.
  • Improved file read time by 2x, giving around a 10% speed improvement to file parses.
  • Added Element.cssSelector(), which returns a unique CSS selector/path for an element.
  • Tightened the scope of which characters are escaped in attributes and text nodes, to align with the spec. Also, when using the extended escape entities map, a character is only escaped if the current output charset does not support it. This produces smaller, more legible HTML, with greater control over the output (by setting charset and escape mode).
  • Element.html() no longer trims outer whitespace when pretty-print is disabled.
  • In the HTML Cleaner, allow span tags in the basic whitelist, and span and div tags in the relaxed whitelist.
  • Relaxed doctype validation, allowing doctypes to not specify a name.
  • Added support for quoted attribute values in CSS Selectors.
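
As a rough illustration of three of the items above (output syntax, Element.cssSelector(), and quoted attribute values in selectors), here is a minimal sketch. The class name, helper method names, and sample markup are made up for this example and are not part of jsoup; it assumes jsoup 1.8.1 or later is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ReleaseNotesSketch {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
        "<div id=\"main\"><p class=\"intro\">Hi <img src=\"a.png\" alt=\"a b\"></p><p>Bye</p></div>";

    /** Serializes the first img element using the given output syntax. */
    static String renderImg(Document.OutputSettings.Syntax syntax) {
        Document doc = Jsoup.parse(HTML);
        // HTML is the default for the HTML parser; XML closes void tags like img.
        doc.outputSettings().syntax(syntax);
        return doc.select("img").first().outerHtml();
    }

    /** Checks that the selector from cssSelector() finds exactly the element it came from. */
    static boolean selectorRoundTrips() {
        Document doc = Jsoup.parse(HTML);
        Element img = doc.select("img").first();
        String selector = img.cssSelector();
        return doc.select(selector).size() == 1
            && doc.select(selector).first() == img;
    }

    /** Counts elements matched by an attribute selector whose value is double-quoted. */
    static int quotedAttributeMatches() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("img[alt=\"a b\"]").size();
    }

    public static void main(String[] args) {
        System.out.println(renderImg(Document.OutputSettings.Syntax.html));
        System.out.println(renderImg(Document.OutputSettings.Syntax.xml));
        System.out.println(selectorRoundTrips());
        System.out.println(quotedAttributeMatches());
    }
}
```

The exact selector string returned by cssSelector() depends on the document structure, so the sketch only checks that it selects the original element rather than relying on a particular path.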

Bug Fixes

  • Fixed an issue where <svg><img/></svg> was parsed as <svg><image/></svg>.
  • Fixed an issue where a UTF-8 BOM character was not detected if the HTTP response did not specify a charset, and the HTML body did, leading to the head contents incorrectly being parsed into the body. Changed the behavior so that when the UTF-8 BOM is detected, it will take precedence for determining the charset to decode with.
  • Fixed an issue in parsing a base URI when loading a URL containing an http-equiv element.
  • Fixed an issue for Java 1.5 / Android 2.2 compatibility, and now verify that it doesn't regress.
  • Fixed an issue that would throw an NPE when trying to set invalid HTML into a title element.
  • Fixed support for nth-of-type selectors with unknown tags.
  • Added support for application/*+xml MIME types.
  • Fixed support for allowing script tags in cleaner whitelists.
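
To make the nth-of-type fix above concrete, here is a small sketch that selects by position among elements with an unknown tag name. The tag names foo and bar, the class name, and the helper method are invented for this example; it assumes jsoup 1.8.1 or later.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class NthOfTypeSketch {
    /** Returns the text of the second foo element among its siblings. */
    static String secondFooText() {
        // "foo" and "bar" are not HTML tags; the parser keeps them as unknown elements.
        Document doc = Jsoup.parse("<div><foo>one</foo><bar>x</bar><foo>two</foo></div>");
        // nth-of-type counts only siblings with the same tag name, so bar is skipped.
        return doc.select("foo:nth-of-type(2)").text();
    }

    public static void main(String[] args) {
        System.out.println(secondFooText());
    }
}
```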

Many thanks to everyone who contributed patches, suggestions, and bug reports. Sorry for the hiatus! If you have any suggestions for the next release, I would love to hear them; please get in touch via the mailing list or contact me directly.