Grab: An Open-Source Python Web Scraping Framework

jopen, 6 years ago

Grab is an open-source Python web scraping framework. It provides many practical methods for scraping websites and processing the scraped content:

  • Automatic cookies (session) support
  • HTTP and SOCKS proxy with and without authorization
  • Keep-Alive support
  • IDN support
  • Tools to work with web forms
  • Easy multipart file uploading
  • Flexible customization of HTTP requests
  • Automatic charset detection
  • Powerful API for extracting information from HTML documents with XPath queries
  • Asynchronous API to make thousands of simultaneous queries. This part of the library is called Spider, and it has too many features to list in this README.
  • Python 3 ready

Grab Example

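from grab import Grab
import logging

# Log Grab's network activity at debug level
logging.basicConfig(level=logging.DEBUG)

# Open the login page, fill in the form, and submit it
g = Grab()
g.go('https://github.com/login')
g.set_input('login', '***')
g.set_input('password', '***')
g.submit()
# Save the response to disk for inspection
g.doc.save('/tmp/x.html')

# Fail if the sign-out icon is missing, i.e. the login did not succeed
g.doc('//span[contains(@class, "octicon-sign-out")]').assert_exists()
home_url = g.doc('//a[contains(@class, "header-nav-link name")]/@href').text()
repo_url = home_url + '?tab=repositories'

# Open the repositories tab and print each repository name with its absolute URL
g.go(repo_url)
for elem in g.doc.select('//h3[@class="repo-list-name"]/a'):
    print('%s: %s' % (elem.text(),
                      g.make_url_absolute(elem.attr('href'))))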
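The final loop of the example selects links with an XPath query and turns each relative href into an absolute URL. The sketch below illustrates that same idea using only the Python standard library, not Grab itself, so it runs without network access or a GitHub account. The sample markup, the `BASE_URL` value, and the `list_repos` helper are all made up for illustration, and `xml.etree.ElementTree` supports only a subset of XPath, so this is a rough stand-in rather than a picture of how Grab works internally.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup standing in for a fetched page.
SAMPLE_HTML = """
<div>
  <h3 class="repo-list-name"><a href="/user/alpha">alpha</a></h3>
  <h3 class="repo-list-name"><a href="/user/beta">beta</a></h3>
  <h3 class="other"><a href="/user/ignored">ignored</a></h3>
</div>
"""

# Assumed URL of the page the markup came from.
BASE_URL = "https://github.com/user?tab=repositories"


def list_repos(markup, base_url):
    """Return (link text, absolute URL) pairs for each repository link."""
    root = ET.fromstring(markup)
    # Same shape as the XPath in the Grab example: <a> elements directly
    # under an <h3> whose class attribute equals "repo-list-name".
    links = root.findall(".//h3[@class='repo-list-name']/a")
    # urljoin resolves each relative href against the page URL.
    return [(a.text, urljoin(base_url, a.get("href"))) for a in links]


for name, url in list_repos(SAMPLE_HTML, BASE_URL):
    print("%s: %s" % (name, url))
```

Real-world HTML is often not well-formed XML, which is one reason to use a library such as Grab for this instead of a strict XML parser.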

Project homepage: http://www.open-open.com/lib/view/home/1440858338263