Node.js Web Scraper: Node Osmosis

n6xb, 9 years ago

Osmosis is a Node.js module for parsing HTML/XML and scraping web content.

Features

  • Fast: uses libxml C bindings
  • Lightweight: no dependencies like jQuery, cheerio, or jsdom
  • Clean: promise-based interface, no more nested callbacks
  • Flexible: supports both CSS and XPath selectors
  • Predictable: same input, same output, same order
  • Detailed logging for every step
  • Precise and natural IO flow, with no setTimeout or process.nextTick
  • Easy debugging with built-in stack size and memory usage reporting
  • Memory leak free

Example: scrape all craigslist listings

var osmosis = require('osmosis');

osmosis
.get('www.craigslist.org/about/sites')
.find('h1 + div a')
.set('location')
.follow('@href')
.find('header + div + div li > a')
.set('category')
.follow('@href')
.find('p > a', '.totallink + a.button.next:first')
.follow('@href')
.set({
    'title':        'section > h2',
    'description':  '#postingbody',
    'subcategory':  'div.breadbox > span[4]',
    'date':         'time@datetime',
    'latitude':     '#map@data-latitude',
    'longitude':    '#map@data-longitude',
    'images[]':     'img@src'
})
.data(function(listing) {
    // do something with listing data
})
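
The `.data` callback in the example leaves the handling of each listing open. As one possible sketch, the values extracted by the selectors can be assumed to arrive as strings, so a small helper could tidy them before they are stored. The function name `normalizeListing` is hypothetical and not part of Osmosis; the field names follow the `.set({...})` call in the example. It is also an assumption that the `'images[]'` selector ends up under either an `images` or an `images[]` key, so the helper accepts both.

```javascript
// Hypothetical helper, not part of Osmosis: tidy one scraped listing.
// Assumes every extracted value is a string (or missing).
function normalizeListing(listing) {
  // Convert a numeric string to a number, or null when it cannot be parsed.
  const toNumber = (value) => {
    const n = Number.parseFloat(value);
    return Number.isFinite(n) ? n : null;
  };

  // Convert a date string to a Date, or null when it is missing or invalid.
  const toDate = (value) => {
    if (!value) return null;
    const d = new Date(value);
    return Number.isNaN(d.getTime()) ? null : d;
  };

  // Assumption: the image list may be stored under either key.
  const rawImages = listing.images ?? listing['images[]'] ?? [];
  const images = Array.isArray(rawImages) ? rawImages : [rawImages];

  return {
    location: listing.location ?? null,
    category: listing.category ?? null,
    subcategory: listing.subcategory ?? null,
    title: String(listing.title ?? '').trim(),
    description: String(listing.description ?? '').trim(),
    date: toDate(listing.date),
    latitude: toNumber(listing.latitude),
    longitude: toNumber(listing.longitude),
    images: images,
  };
}
```

Inside the callback this would be called as `normalizeListing(listing)`, and the result passed on to whatever storage or output step the scraper uses.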

Project homepage: http://www.open-open.com/lib/view/home/1428322356791