A Simple PHP Web Crawler: Goutte

jopen, 9 years ago

Goutte is a screen scraping and web crawling library for PHP.

Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses.

Requirements

Goutte depends on PHP 5.4+ and Guzzle 4+.

Tip

If you need support for PHP 5.3 or Guzzle 3, use Goutte 1.0.6.

Installation

Add fabpot/goutte as a require dependency in your composer.json file:

php composer.phar require fabpot/goutte:~2.0

Tip

You can also download the Goutte.phar file:

require_once '/path/to/goutte.phar'; 

Usage

Create a Goutte Client instance (which extends Symfony\Component\BrowserKit\Client):

use Goutte\Client;

$client = new Client();

Make requests with the request() method:

// Go to the symfony.com website
$crawler = $client->request('GET', 'http://www.symfony.com/blog/');

The method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).

Fine-tune cURL options:

$client->getClient()->setDefaultOption('config/curl/'.CURLOPT_TIMEOUT, 60); 

Click on links:

// Click on the "Security Advisories" link
$link = $crawler->selectLink('Security Advisories')->link();
$crawler = $client->click($link);
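To make the link lookup concrete, here is a minimal sketch of the first half of that step: finding a link by its visible text and reading its href. It uses PHP's built-in DOM extension rather than Goutte, so it runs without any dependencies or network access. The function name findLinkHref, the sample markup, and the simple substring match are assumptions made for illustration; they are not Goutte's implementation.

```php
<?php
// Sketch only: plain DOM extension, not Goutte or Symfony DomCrawler.

/**
 * Return the href of the first <a> whose text contains $linkText,
 * or null when no such link exists.
 */
function findLinkHref($html, $linkText)
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Assumption: a substring match on the link text is close enough here.
        if (strpos($anchor->textContent, $linkText) !== false) {
            return $anchor->getAttribute('href');
        }
    }

    return null;
}

// Hypothetical sample markup, invented for this illustration.
$sampleHtml = '<html><body>'
    . '<a href="/blog/">Blog</a>'
    . '<a href="/blog/category/security-advisories">Security Advisories</a>'
    . '</body></html>';

print findLinkHref($sampleHtml, 'Security Advisories') . "\n";
```

With Goutte, click() goes one step further and requests the URL the link points to, returning a new Crawler for that page.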

Extract data:

// Get the latest post in this category and display the titles
$crawler->filter('h2.post > a')->each(function ($node) {
    print $node->text()."\n";
});
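To see which elements a selector such as h2.post > a picks out, the following sketch applies an equivalent XPath query to an inline HTML string. It deliberately uses PHP's built-in DOMDocument and DOMXPath classes instead of Goutte, so it can be run offline. The function name extractPostTitles and the sample markup are hypothetical, chosen only for this illustration.

```php
<?php
// Sketch only: plain DOM extension, not Goutte or Symfony DomCrawler.

/**
 * Return the text of every <a> that is a direct child of an
 * <h2> element carrying the class "post".
 */
function extractPostTitles($html)
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // XPath counterpart of the CSS selector "h2.post > a".
    $query = "//h2[contains(concat(' ', normalize-space(@class), ' '), ' post ')]/a";

    $titles = array();
    foreach ($xpath->query($query) as $node) {
        $titles[] = trim($node->textContent);
    }

    return $titles;
}

// Hypothetical sample markup, invented for this illustration.
$sampleHtml = '<html><body>'
    . '<h2 class="post"><a href="/blog/a">First post</a></h2>'
    . '<h2 class="post featured"><a href="/blog/b">Second post</a></h2>'
    . '<h2 class="other"><a href="/blog/c">Not a post</a></h2>'
    . '</body></html>';

foreach (extractPostTitles($sampleHtml) as $title) {
    print $title . "\n";
}
```

The third heading is skipped because its class list does not include post, which mirrors what the .post part of the CSS selector expresses.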

Submit forms:

$crawler = $client->request('GET', 'http://github.com/');
$crawler = $client->click($crawler->selectLink('Sign in')->link());
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, array('login' => 'fabpot', 'password' => 'xxxxxx'));
$crawler->filter('.flash-error')->each(function ($node) {
    print $node->text()."\n";
});
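When a form is submitted this way, the values passed to submit() override the form's existing field values, and the remaining fields (hidden ones included) are sent with their defaults. The sketch below illustrates that merge using only PHP's built-in DOM extension. It is a simplified illustration, not Goutte's code: it only looks at input elements, and the function name collectFormValues, the sample markup, and the authenticity_token field are hypothetical.

```php
<?php
// Sketch only: plain DOM extension, not Goutte or Symfony DomCrawler.

/**
 * Read the default values of all named <input> elements and
 * overlay the caller-supplied values on top of them.
 */
function collectFormValues($html, array $overrides)
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $values = array();
    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name === '') {
            continue; // Inputs without a name are not submitted.
        }
        $values[$name] = $input->getAttribute('value');
    }

    // Caller-supplied values win over the defaults found in the markup.
    return array_replace($values, $overrides);
}

// Hypothetical sample markup, invented for this illustration.
$sampleHtml = '<html><body><form action="/session" method="post">'
    . '<input type="hidden" name="authenticity_token" value="abc123">'
    . '<input type="text" name="login" value="">'
    . '<input type="password" name="password" value="">'
    . '<input type="submit" value="Sign in">'
    . '</form></body></html>';

print_r(collectFormValues($sampleHtml, array('login' => 'fabpot', 'password' => 'xxxxxx')));
```

This is why only login and password need to be supplied in the example: fields such as a hidden token travel along with whatever value the page gave them.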

Project home page: http://www.open-open.com/lib/view/home/1413877792059