Open-Source Python Web Crawler Framework: Grab

jopen, 11 years ago

Grab is an open-source Python web crawler framework. It provides many practical methods for scraping websites and processing the scraped content:

Automatic cookies (session) support
HTTP and SOCKS proxy with and without authorization
Keep-Alive support
IDN support
Tools to work with web forms
Easy multipart file uploading
Flexible customization of HTTP requests
Automatic charset detection
Powerful API for extracting information from HTML documents with XPath queries
Asynchronous API to make thousands of simultaneous queries. This part of the library is called Spider, and it is too big to even list its features in this README.
Python 3 ready
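
To make the first feature concrete, here is a minimal sketch of what automatic cookie (session) handling means in practice. It does not use Grab at all: it relies only on the Python standard library, and the local test server, its /login and /whoami paths, and the cookie value are all hypothetical stand-ins. Grab's own implementation may differ.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical server: sets a cookie on /login, echoes it back elsewhere."""

    def do_GET(self):
        self.send_response(200)
        if self.path == "/login":
            body = b"logged in"
            self.send_header("Set-Cookie", "session=abc123; Path=/")
        else:
            # Echo whatever Cookie header the client sent.
            body = self.headers.get("Cookie", "no cookie").encode()
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_with_session():
    # Port 0 lets the OS pick a free port for the throwaway server.
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port

    # The cookie jar plays the role of the "session": it stores cookies
    # from responses and attaches them to later requests automatically.
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # ignore any system proxy settings
        urllib.request.HTTPCookieProcessor(jar),
    )
    try:
        opener.open(base + "/login").read()
        return opener.open(base + "/whoami").read().decode()
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_with_session())
```

The second request carries the cookie without any explicit header handling, which is the behaviour the feature list describes for a Grab object reused across requests.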

Grab Example

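from grab import Grab
import logging

# Log Grab's activity at debug level
logging.basicConfig(level=logging.DEBUG)

# Open the login page, fill in the form, and submit it
g = Grab()
g.go('https://github.com/login')
g.set_input('login', '***')
g.set_input('password', '***')
g.submit()

# Save the response for inspection
g.doc.save('/tmp/x.html')

# Fail unless the sign-out icon is present, i.e. the login succeeded
g.doc('//span[contains(@class, "octicon-sign-out")]').assert_exists()

# Build the URL of the user's repositories tab
home_url = g.doc('//a[contains(@class, "header-nav-link name")]/@href').text()
repo_url = home_url + '?tab=repositories'

# Print each repository's name and absolute URL
g.go(repo_url)
for elem in g.doc.select('//h3[@class="repo-list-name"]/a'):
    print('%s: %s' % (elem.text(),
                      g.make_url_absolute(elem.attr('href'))))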
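
The last step of the example (select links with an XPath query, then turn each relative href into an absolute URL) can be tried offline. The sketch below is an approximation that uses only the standard library instead of Grab: xml.etree.ElementTree supports just a subset of XPath and needs well-formed markup, and the sample page, repository names, and list_repos helper are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed stand-in for a fetched repository listing page.
SAMPLE_PAGE = """
<html><body>
  <h3 class="repo-list-name"><a href="/user/alpha">alpha</a></h3>
  <h3 class="repo-list-name"><a href="/user/beta">beta</a></h3>
  <h3 class="other"><a href="/ignored">ignored</a></h3>
</body></html>
""".strip()


def list_repos(page, base_url):
    """Return (name, absolute_url) pairs for links inside matching <h3> tags."""
    root = ET.fromstring(page)
    # Same idea as the XPath in the example, in ElementTree's limited syntax.
    links = root.findall(".//h3[@class='repo-list-name']/a")
    # urljoin resolves each href against the URL the page was fetched from.
    return [(a.text.strip(), urljoin(base_url, a.get("href"))) for a in links]


REPOS = list_repos(SAMPLE_PAGE, "https://github.com/login")
for name, url in REPOS:
    print("%s: %s" % (name, url))
```

Real pages are rarely well-formed XML, which is one reason to use a framework with a tolerant HTML parser rather than this approach.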

Project homepage: http://www.open-open.com/lib/view/home/1440858338263
