Goutte: a PHP web-scraping library

jopen, 10 years ago

Goutte is a PHP library for scraping data from websites. It provides an elegant API that makes it easy to select specific elements from remote pages.

Require the Goutte phar file to use Goutte in a script:

require_once '/path/to/goutte.phar'; 

Create a Goutte Client instance (which extends Symfony\Component\BrowserKit\Client):

use Goutte\Client;

$client = new Client();

Make requests with the request() method:

$crawler = $client->request('GET', 'http://www.symfony-project.org/'); 

The method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).

Click a link:

$link = $crawler->selectLink('Plugins')->link();
$crawler = $client->click($link);

Submit a form:

$form = $crawler->selectButton('sign in')->form();
$crawler = $client->submit($form, array(
    'signin[username]' => 'fabien',
    'signin[password]' => 'xxxxxx',
));

Extract data:

$nodes = $crawler->filter('.error_list');
if ($nodes->count()) {
    die(sprintf("Authentication error: %s\n", $nodes->text()));
}

printf("Nb tasks: %d\n", $crawler->filter('#nb_tasks')->text());
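
The extraction step can also be tried offline, without Goutte or a network connection. The following is only a rough sketch that swaps in PHP's built-in DOMDocument and DOMXPath in place of the Crawler, rewriting the CSS selectors `.error_list` and `#nb_tasks` as XPath expressions. The function name `extractTaskCount` and the sample markup are assumptions made for illustration; they do not come from Goutte or from the page being scraped.

```php
<?php
declare(strict_types=1);

// Hypothetical helper (not part of Goutte): mirrors the error check and
// task-count extraction above using only PHP's bundled DOM extension.
function extractTaskCount(string $html): int
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of printing them,
    // since real-world HTML is rarely perfectly well-formed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // XPath counterpart of the CSS selector ".error_list".
    $errors = $xpath->query(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' error_list ')]"
    );
    if ($errors->length > 0) {
        throw new RuntimeException(
            'Authentication error: ' . trim($errors->item(0)->textContent)
        );
    }

    // XPath counterpart of the CSS selector "#nb_tasks".
    $node = $xpath->query("//*[@id='nb_tasks']")->item(0);
    if ($node === null) {
        throw new RuntimeException('Element #nb_tasks not found');
    }

    return (int) trim($node->textContent);
}

// Assumed sample markup standing in for a fetched page.
$page = '<html><body><p>Tasks: <span id="nb_tasks">12</span></p></body></html>';
printf("Nb tasks: %d\n", extractTaskCount($page));
```

Throwing an exception instead of calling die() is a deliberate change in this sketch: it lets a calling script decide how to handle a failed login rather than ending the process.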

Project homepage: http://www.open-open.com/lib/view/home/1388458699125