jsoup 1.7.1 released: a powerful HTML parser for Java
jopen, 13 years ago
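jsoup is an HTML parser for Java that can parse HTML directly from a URL or from a string of HTML text. It provides a very convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.
jsoup's main features:
- Parse HTML from a URL, a file, or a string;
- Find and extract data using DOM traversal or CSS selectors;
- Manipulate HTML elements, attributes, and text.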
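The feature list above can be illustrated with a minimal sketch that parses a string, selects with a CSS query, and edits attributes. The class name `SelectorSketch`, its helper methods, and the sample markup are hypothetical and exist only for illustration; the jsoup calls used (`Jsoup.parse`, `select`, `attr`, `body().html()`) are part of the library's public API.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorSketch {

    // Parse an HTML string and collect the absolute URLs of all links
    // found inside the element with id="content".
    static List<String> contentLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Elements links = doc.select("#content a[href]");
        List<String> urls = new ArrayList<String>();
        for (Element link : links) {
            // The "abs:" prefix resolves the attribute against the base URI.
            urls.add(link.attr("abs:href"));
        }
        return urls;
    }

    // Set rel="nofollow" on every link and return the modified body HTML.
    static String addNofollow(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a[href]").attr("rel", "nofollow");
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<div id=content><a href='/a'>A</a></div><a href='/b'>B</a>";
        System.out.println(contentLinks(html, "http://example.com/"));
        System.out.println(addNofollow(html));
    }
}
```

Passing a base URI to `Jsoup.parse` is what lets `abs:href` turn relative links into absolute ones; without it the lookup returns an empty string for relative URLs.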
This site has also translated the official Cookbook into Chinese: http://www.open-open.com/jsoup
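Example code:

    import java.io.File;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // Parse a local file as UTF-8; the base URI is used to resolve relative links.
    File input = new File("/tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

    // Find the element with id="content", then every <a> element inside it.
    Element content = doc.getElementById("content");
    Elements links = content.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
    }

jsoup 1.7.1 has been released, with a number of performance and stability improvements. Downloads:
- jsoup-1.7.1.jar (core library)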
- jsoup-1.7.1-sources.jar (optional sources jar)
- jsoup-1.7.1-javadoc.jar (optional javadoc jar)
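Changes:
- Improved parse time: now 2.3x faster than the previous release, with lower memory consumption.
  - Reduced memory consumption and garbage collection when selecting elements.
  - Removed unnecessary synchronisation in Tag.valueOf, so multi-threaded parsing runs faster.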
  - Introduced finer granularity of exceptions in Jsoup.connect, including HttpStatusException and UnsupportedMimeTypeException, allowing programmers better control of error cases.
  - In Jsoup.clean, allow custom Document.OutputSettings, to control pretty printing, character set, and entity escaping.
  - Whitespace normalise document.title() output.
  - In Jsoup.connect, fail faster if the return content type is not supported.
  - Made entity decoding less greedy, so that non-entities are less likely to be incorrectly treated as entities.
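  - In Jsoup.connect, enforce a connection disconnect after every connect. This precludes keep-alive connections to the same host, but in practice many implementations will leak connections, particularly on error.
  - If a server doesn't specify a Content-Type header, treat that as OK.
  - If a server returns an unsupported character-set header, attempt to decode the content with the default charset (UTF-8), instead of bailing with an unsupported charset exception.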
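The finer-grained exceptions in Jsoup.connect mentioned in the list above might be handled along the following lines. This is a sketch only: the class `ConnectErrorSketch`, the helper names, and the message wording are hypothetical, while `HttpStatusException` and `UnsupportedMimeTypeException` are the exception types named in the changelog, both subclasses of `IOException`.

```java
import java.io.IOException;

import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.UnsupportedMimeTypeException;
import org.jsoup.nodes.Document;

public class ConnectErrorSketch {

    // Turn a fetch failure into a short, human-readable description.
    static String describe(IOException e) {
        if (e instanceof HttpStatusException) {
            // The server answered, but with an error status such as 404 or 500.
            HttpStatusException h = (HttpStatusException) e;
            return "HTTP " + h.getStatusCode() + " for " + h.getUrl();
        }
        if (e instanceof UnsupportedMimeTypeException) {
            // The response is not a content type jsoup will parse (an image, say).
            UnsupportedMimeTypeException u = (UnsupportedMimeTypeException) e;
            return "unsupported content type " + u.getMimeType() + " for " + u.getUrl();
        }
        // Anything else: timeouts, DNS failures, connection resets, and so on.
        return "I/O failure: " + e.getMessage();
    }

    // Fetch a page and return its title, or a description of what went wrong.
    static String fetchTitle(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            return doc.title();
        } catch (IOException e) {
            return describe(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchTitle("http://example.com/"));
    }
}
```

Separating `describe` from the network call keeps the classification logic easy to exercise without a live connection.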
  
  
  Bug fixes:
  - Fixed an issue when determining the Windows-1254 character-set from a meta tag when run in the Turkish locale.
  - Fixed whitespace preservation in textarea tags.
  - Fixed an issue that prevented frameset documents from being cleaned by the Cleaner.
  - Fixed an issue when normalising whitespace for strings containing high-surrogate characters.
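
One of the changes in this release is that Jsoup.clean accepts a custom Document.OutputSettings. A minimal sketch of how that might be used follows. The class `CleanSketch` and the method `cleanCompact` are hypothetical names, and the code assumes the `Whitelist` class from the 1.7.x line (later jsoup releases rename it to `Safelist`).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;
import org.jsoup.safety.Whitelist;

public class CleanSketch {

    // Sanitise untrusted HTML and emit it on a single line, without pretty printing.
    static String cleanCompact(String untrustedHtml) {
        Document.OutputSettings settings = new Document.OutputSettings()
                .prettyPrint(false)                      // no added newlines or indentation
                .charset("UTF-8")                        // output character set
                .escapeMode(Entities.EscapeMode.xhtml);  // escape only the minimal entity set
        // The empty base URI means relative links are not resolved.
        return Jsoup.clean(untrustedHtml, "", Whitelist.basic(), settings);
    }

    public static void main(String[] args) {
        System.out.println(cleanCompact("<p>Hi <b>there</b><script>alert(1)</script></p>"));
    }
}
```

`Whitelist.basic()` permits simple text formatting tags such as `p` and `b`, so the `script` element is dropped while the paragraph and bold text survive.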