Open-Source Search Engine: Nutch 1.9 Released

jopen, 10 years ago

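Nutch is an open-source search engine package written in Java that provides all the tools and functionality needed to build a search engine. With Nutch you can build a search engine for your own intranet as well as one for the entire web. Beyond the basics, Nutch has several distinguishing features of its own, such as Map-Reduce, Hadoop, and plugins.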

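Viewed as a whole, Nutch consists of three main parts: crawling, indexing, and searching. The Web DB is the set of URLs that Nutch starts from; the Fetcher is the crawler that downloads web pages; the indexer builds the index, generating index files and storing them in the system; and the searcher is the query component, which searches for a given term and returns the results.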
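
The crawl, index, and search stages described above can be illustrated with a toy sketch. This is not Nutch code and does not use any Nutch API: it is written in Python for brevity, the "web" is a hypothetical in-memory dictionary (`FAKE_WEB`), and the function names `crawl`, `build_index`, and `search` are made up for this example. It only shows how a seed URL set, a fetcher, an indexer, and a searcher relate to one another.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would fetch these over HTTP; this is an assumption for the sketch.
FAKE_WEB = {
    "http://a.example/": ("apache nutch crawler", ["http://b.example/"]),
    "http://b.example/": ("nutch search engine",
                          ["http://a.example/", "http://c.example/"]),
    "http://c.example/": ("hadoop map reduce", []),
}


def crawl(seed_urls, web):
    """Fetcher role: visit every page reachable from the seeds, each URL once."""
    frontier = deque(seed_urls)   # URLs waiting to be fetched
    fetched = {}                  # URL -> page text
    while frontier:
        url = frontier.popleft()
        if url in fetched or url not in web:
            continue              # skip duplicates and unknown URLs
        text, links = web[url]
        fetched[url] = text
        frontier.extend(links)    # newly discovered links join the frontier
    return fetched


def build_index(pages):
    """Indexer role: build an inverted index mapping each term to its URLs."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, term):
    """Searcher role: return the sorted URLs whose text contains the term."""
    return sorted(index.get(term.lower(), set()))


pages = crawl(["http://a.example/"], FAKE_WEB)  # seed set plays the Web DB role
index = build_index(pages)
print(search(index, "nutch"))
```

Starting from a single seed, the sketch reaches all three pages by following links, and a query for `nutch` returns the two pages that contain the term. Real Nutch distributes these stages as Hadoop jobs and persists the crawl state and index on disk, which this sketch deliberately leaves out.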

Apache Nutch 1.9 was recently released. The main changes include:

Improvements

  • [NUTCH-1502] - Test for CrawlDatum state transitions

  • [NUTCH-1561] - improve usability of parse-metatags and index-metadata

  • [NUTCH-1676] - Add rudimentary SSL support to protocol-http

  • [NUTCH-1745] - Upgrade to ElasticSearch 1.1.0

  • [NUTCH-1747] - Use AtomicInteger as semaphore in Fetcher

  • [NUTCH-1757] - ParserChecker to take custom metadata as input

  • [NUTCH-1758] - IndexChecker to send document to IndexWriters

  • [NUTCH-1772] - Injector does not need merging if no pre-existing crawldb

  • [NUTCH-1782] - NodeWalker to return current node

  • [NUTCH-1787] - update and complete API doc overview page

  • [NUTCH-1794] - IndexingFilterChecker to optionally dumpText

  • [NUTCH-1799] - ANT Eclipse task discovers all plugin jars automatically

New features

  • [NUTCH-207] - Bandwidth target for fetcher rather than a thread count

  • [NUTCH-1327] - QueryStringNormalizer

  • [NUTCH-1590] - [SECURITY] Frame injection vulnerability in published Javadoc