A next-generation open-source crawler framework based on simple scripts - Creeper

fjlvjie, 5 years ago
<p style="text-align: center;"><img src="https://simg.open-open.com/show/b81cc15d8a320ed618ed5f2aae21e7b6.png"></p>
<h2>About</h2>
<p>Creeper is a <em>next-generation</em> crawler that fetches web pages according to a Creeper script. It is a cross-platform, embeddable crawler that you can use in a news app, a subscription program, etc.</p>
<p>Warning: this project is still in stage-1 development; please do not use it in a production environment.</p>
<h2>Get Started</h2>
<p>Installation</p>
<pre>$ go get github.com/wspl/creeper</pre>
<p>Hello World!</p>
<p>Create hacker_news.crs:</p>
<pre>page(@page=1) = "https://news.ycombinator.com/news?p={@page}"

news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href</pre>
<p>Then, create main.go:</p>
<pre>package main

import "github.com/wspl/creeper"

func main() {
    // Load the crawl script.
    c := creeper.Open("./hacker_news.crs")
    // Visit each item of the news[] node and print its fields.
    c.Array("news").Each(func(c *creeper.Creeper) {
        println("title: ", c.String("title"))
        println("site: ", c.String("site"))
        println("link: ", c.String("link"))
        println("===")
    })
}</pre>
<p>Build and run. The console will print something like:</p>
<pre>title:  Samsung chief Lee arrested as S.Korean corruption probe deepens
site:  reuters.com
link:  http://www.reuters.com/article/us-southkorea-politics-samsung-group-idUSKBN15V2RD
===
title:  ReactOS 0.4.4 Released
site:  reactos.org
link:  https://reactos.org/project-news/reactos-044-released
===
title:  FeFETs: How this new memory stacks up against existing non-volatile memory
site:  semiengineering.com
link:  http://semiengineering.com/what-are-fefets/</pre>
<h2>Script Spec</h2>
<h3>Town</h3>
<p>Town is a lambda-like expression for storing an (im)mutable string. 
Most of the time, it is used to store a URL.</p>
<pre>page(@page=1, ext) = "https://news.ycombinator.com/news?p={@page}&ext={ext}"</pre>
<p>When you need a town, use it as if you were calling a function:</p>
<pre>news[]: page(ext="Hello World!") -> $("tr.athing")</pre>
<p>Hey, you might have noticed that the @page parameter is not passed in this call. Yeah, it is a special parameter.</p>
<p>In a town definition, an expression like name="something" means that the parameter name has the default value "something".</p>
<p>Incidentally, @page is a parameter that automatically increases when the current page has no more content.</p>
<h3>Node</h3>
<p>Nodes form a tree structure that represents the data structure you are going to crawl.</p>
<pre>news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href</pre>
<p>Like YAML, nodes express their hierarchy through indentation.</p>
<p>Node Name</p>
<p>Every node has a name. title is a field name and represents general string data. news[] is an array name and represents a parent structure with multiple sub-items.</p>
<p>Page</p>
<p>Page indicates where to fetch the field data from. It can be a town expression or a field reference.</p>
<p>Field reference is an advanced usage of Node; you can find the details in <a href="/misc/goto?guid=4959737794635644266" rel="nofollow,noindex">./eh.crs</a>.</p>
<p>If a node has both a page and a fun, the page goes on the left of -> and the fun goes on the right of ->. 
That is, page -> fun</p>
<p>Fun</p>
<p>Fun represents the data-processing step.</p>
<p>All supported funs are listed below:</p>
<table>
 <thead>
  <tr>
   <th>Name</th>
   <th>Parameters</th>
   <th>Description</th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td>$</td>
   <td>(selector: string)</td>
   <td>CSS selector</td>
  </tr>
  <tr>
   <td>html</td>
   <td> </td>
   <td>inner HTML</td>
  </tr>
  <tr>
   <td>text</td>
   <td> </td>
   <td>inner text</td>
  </tr>
  <tr>
   <td>outerHTML</td>
   <td> </td>
   <td>outer HTML</td>
  </tr>
  <tr>
   <td>attr</td>
   <td>(attr: string)</td>
   <td>attribute value</td>
  </tr>
  <tr>
   <td>style</td>
   <td> </td>
   <td>style attribute value</td>
  </tr>
  <tr>
   <td>href</td>
   <td> </td>
   <td>href attribute value</td>
  </tr>
  <tr>
   <td>src</td>
   <td> </td>
   <td>src attribute value</td>
  </tr>
  <tr>
   <td>calc</td>
   <td>(prec: int)</td>
   <td>calculate an arithmetic expression</td>
  </tr>
  <tr>
   <td>match</td>
   <td>(regexp: string)</td>
   <td>match the first sub-string via a regular expression</td>
  </tr>
  <tr>
   <td>expand</td>
   <td>(regexp: string, target: string)</td>
   <td>expand matched strings into the target string</td>
  </tr>
 </tbody>
</table>
<h2>Author</h2>
<p>Plutonist</p>
<p><a href="/misc/goto?guid=4959737794729556554" rel="nofollow,noindex">impl.moe</a> · GitHub <a href="/misc/goto?guid=4959737794813216466" rel="nofollow,noindex">@wspl</a></p>
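<p>Returning to the Town section of the Script Spec: the following Go sketch illustrates how a town template with default parameter values might be filled in. It is not Creeper's implementation. The helper expandTown, its signature, and the use of maps for defaults and arguments are all assumptions made for illustration, and the sketch does not model URL encoding or the automatic increase of @page.</p>

```go
package main

import (
	"fmt"
	"strings"
)

// expandTown is a hypothetical helper (not part of the Creeper API).
// It replaces every {name} placeholder in tmpl with the caller-supplied
// value from args, or with the declared default when no value is given.
func expandTown(tmpl string, defaults, args map[string]string) string {
	pairs := make([]string, 0, 2*len(defaults))
	for name, def := range defaults {
		val, ok := args[name]
		if !ok {
			val = def // fall back to the default from the town definition
		}
		pairs = append(pairs, "{"+name+"}", val)
	}
	// NewReplacer substitutes in a single pass, so a value that happens
	// to contain a placeholder is not expanded a second time.
	return strings.NewReplacer(pairs...).Replace(tmpl)
}

func main() {
	tmpl := "https://news.ycombinator.com/news?p={@page}&ext={ext}"
	// Mirrors page(@page=1, ext); the empty default for ext is an assumption.
	defaults := map[string]string{"@page": "1", "ext": ""}

	// Roughly analogous to calling page(ext="hello") in a script.
	fmt.Println(expandTown(tmpl, defaults, map[string]string{"ext": "hello"}))
}
```

<p>In this sketch, @page is treated like any other parameter with a default of 1; passing a different value for it in args would select another page.</p>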