IPProxyTool: An Open-Source Proxy Framework for Web Crawlers

epimetheus, 7 years ago
   <p style="text-align:start">IPProxyTool uses scrapy spiders to crawl proxy-listing websites and collect a large number of free proxy IPs. It filters out the usable ones and stores them in a database for later use.</p>
   <h2>Runtime environment</h2>
   <p style="text-align:start">python 2.7.12</p>
   <h3 style="text-align: start;">Dependencies</h3>
   <ul>
    <li>scrapy</li>
    <li>BeautifulSoup</li>
    <li>requests</li>
    <li>mysql-connector-python</li>
    <li>web.py</li>
    <li>scrapydo</li>
    <li>lxml</li>
   </ul>
   <h3 style="text-align: start;">MySQL configuration</h3>
   <ul>
    <li>Install MySQL and start it</li>
    <li>Install mysql-connector-python (<a href="/misc/goto?guid=4959737167257814816" style="box-sizing: border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">installation guide</a>)</li>
    <li>Edit the database settings in config.py</li>
   </ul>
   <pre style="text-align:start"><code>database_config = {
    'host': 'localhost',
    'port': 3306,
    'user': 'root',
    'password': '123456',
}
</code></pre>
   <h2 style="text-align:start">Download and usage</h2>
   <p style="text-align:start">Clone the project locally</p>
   <pre style="text-align:start"><code>$ git clone https://github.com/awolfly9/IPProxyTool.git</code></pre>
   <p style="text-align:start">Enter the project directory</p>
   <pre style="text-align:start"><code>$ cd IPProxyTool</code></pre>
   <p style="text-align:start">Run the proxy crawling, validation, and server scripts separately</p>
   <pre style="text-align:start"><code>$ python runspider.py</code></pre>
   <pre style="text-align:start"><code>$ python runvalidator.py</code></pre>
   <pre style="text-align:start"><code>$ python runserver.py</code></pre>
   <h2 style="text-align:start">Project description</h2>
   <p>Crawling proxy websites</p>
   <p style="text-align:start">All code for crawling proxy websites lives in <a href="/misc/goto?guid=4959737167341230598" style="box-sizing: border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">proxy</a></p>
   <p>Adding support for other proxy websites</p>
   <p style="text-align:start">1. Create a new script in the proxy directory with a class that inherits from BaseSpider<br>
 2. Set name, urls, and headers<br>
 3. Override the parse_page method to extract the proxy data<br>
 4. Store the data in the database; for examples, see <a 
href="/misc/goto?guid=4959737167416302141" style="box-sizing: border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">ip181</a> and <a href="/misc/goto?guid=4959737167506104245" style="box-sizing: border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">kuaidaili</a><br>
 5. For particularly complex proxy websites, see <a href="/misc/goto?guid=4959737167586577773" style="box-sizing: border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">peuland</a></p>
   <p>Edit runspider.py to import the new spider module and add it to the crawl queue</p>
   <p style="text-align:start">Run the runspider.py script to start crawling proxy websites</p>
   <pre style="text-align:start"><code>$ python runspider.py</code></pre>
   <p>Checking whether proxy IPs are valid</p>
   <p style="text-align:start">Current validation method: each scraped proxy IP is set as the proxy for a scrapy request to the target website. If the target website returns a successful response within a reasonable time, the proxy IP is considered valid. If no successful response arrives within that time, the proxy IP is considered invalid.<br>
 Each target website has its own script; all proxy validation code lives in <a href="/misc/goto?guid=4959737167666763031" style="box-sizing: border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">validator</a></p>
   <p>Adding validation for other websites</p>
   <p style="text-align:start">1. Create a new script in the validator directory with a class that inherits from Validator<br>
 2. Set name, timeout, urls, and headers<br>
 3. Then call the init method<br>
 4. For particularly complex validation, see <a href="/misc/goto?guid=4959737167756078255" style="box-sizing: border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">assetstore</a></p>
   <p>Edit runvalidator.py to import the new validator module and add it to the validation queue</p>
   <p style="text-align:start">Run the runvalidator.py script to start validating proxy IPs</p>
   <pre style="text-align:start"><code>$ python runvalidator.py</code></pre>
   <h3 style="text-align:start">Proxy IP data server</h3>
   <p style="text-align:start">In config.py, set the server port with data_port (default 8000), then start the server</p>
   <pre style="text-align:start"><code>$ python runserver.py</code></pre>
   <p style="text-align:start">The server provides the following endpoints</p>
   <p>Select</p>
   <p style="text-align:start"><a href="/misc/goto?guid=4959737167833208235" style="box-sizing: 
border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">http://127.0.0.1:8000/select?name=douban</a></p>
   <p style="text-align:start">Parameters</p>
   <table style="-webkit-text-stroke-width:0px; border-collapse:collapse; border-spacing:0px; box-sizing:border-box; color:rgb(51, 51, 51); display:block; font-family:-apple-system,blinkmacsystemfont,segoe ui,helvetica,arial,sans-serif,apple color emoji,segoe ui emoji,segoe ui symbol; font-size:16px; font-style:normal; font-variant-caps:normal; font-variant-ligatures:normal; font-weight:normal; letter-spacing:normal; margin-bottom:16px; margin-top:0px; orphans:2; overflow:auto; text-align:start; text-indent:0px; text-transform:none; white-space:normal; widows:2; width:888px; word-spacing:0px">
    <thead>
     <tr>
      <th>Name</th>
      <th>Type</th>
      <th>Description</th>
     </tr>
    </thead>
    <tbody>
     <tr>
      <td>name</td>
      <td>str</td>
      <td>Database name</td>
     </tr>
    </tbody>
   </table>
   <p>Delete</p>
   <p style="text-align:start"><a href="http://127.0.0.1:8000/delete?name=free_ipproxy&ip=27.197.144.181" style="box-sizing: border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">http://127.0.0.1:8000/delete?name=free_ipproxy&ip=27.197.144.181</a></p>
   <p style="text-align:start">Parameters</p>
   <table style="-webkit-text-stroke-width:0px; border-collapse:collapse; border-spacing:0px; box-sizing:border-box; color:rgb(51, 51, 51); display:block; font-family:-apple-system,blinkmacsystemfont,segoe ui,helvetica,arial,sans-serif,apple color emoji,segoe ui emoji,segoe ui symbol; font-size:16px; font-style:normal; font-variant-caps:normal; font-variant-ligatures:normal; font-weight:normal; letter-spacing:normal; margin-bottom:16px; margin-top:0px; orphans:2; overflow:auto; text-align:start; text-indent:0px; text-transform:none; white-space:normal; widows:2; width:888px; word-spacing:0px">
    <thead>
     <tr>
      <th>Name</th>
      <th>Type</th>
      <th>Description</th>
     </tr>
    </thead>
    <tbody>
     <tr>
      <td>name</td>
      <td>str</td>
      <td>Database name</td>
     </tr>
     <tr>
      <td>ip</td>
      <td>str</td>
      <td>IP to delete</td>
     </tr>
    </tbody>
   </table>
   <p>Insert</p>
   <p style="text-align:start"><a href="http://127.0.0.1:8000/insert?name=douban&ip=555.22.22.55&port=335&country=%E4%B8%AD%E5%9B%BD&anonymity=1&https=yes&speed=5&source=100" style="box-sizing: border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">http://127.0.0.1:8000/insert?name=douban&ip=555.22.22.55&port=335&country=%E4%B8%AD%E5%9B%BD&anonymity=1&https=yes&speed=5&source=100</a></p>
   <p style="text-align:start">Parameters</p>
   <table style="-webkit-text-stroke-width:0px; border-collapse:collapse; border-spacing:0px; box-sizing:border-box; color:rgb(51, 51, 51); display:block; font-family:-apple-system,blinkmacsystemfont,segoe ui,helvetica,arial,sans-serif,apple color emoji,segoe ui emoji,segoe ui symbol; font-size:16px; font-style:normal; font-variant-caps:normal; font-variant-ligatures:normal; font-weight:normal; letter-spacing:normal; margin-bottom:16px; margin-top:0px; orphans:2; overflow:auto; text-align:start; text-indent:0px; text-transform:none; white-space:normal; widows:2; width:888px; word-spacing:0px">
    <thead>
     <tr>
      <th>Name</th>
      <th>Type</th>
      <th>Description</th>
      <th>Required</th>
     </tr>
    </thead>
    <tbody>
     <tr>
      <td>name</td>
      <td>str</td>
      <td>Database name</td>
      <td>Yes</td>
     </tr>
     <tr>
      <td>ip</td>
      <td>str</td>
      <td>IP address</td>
      <td>Yes</td>
     </tr>
     <tr>
      <td>port</td>
      <td>str</td>
      <td>Port</td>
      <td>Yes</td>
     </tr>
     <tr>
      <td>country</td>
      <td>str</td>
      <td>Country</td>
      <td>No</td>
     </tr>
     <tr>
      <td>anonymity</td>
      <td>int</td>
      <td>1: elite (high anonymity), 2: anonymous, 3: transparent</td>
      <td>No</td>
     </tr>
     <tr>
      <td>https</td>
      <td>str</td>
      <td>yes: https, no: http</td>
      <td>No</td>
     </tr>
     <tr>
      <td>speed</td>
      <td>float</td>
      <td>Access speed</td>
      <td>No</td>
     </tr>
     <tr>
      <td>source</td>
      <td>str</td>
      <td>IP source</td>
      <td>No</td>
     </tr>
    </tbody>
   </table>
   <h2 style="text-align:start">TODO</h2>
   <ul>
    <li>Add more filter conditions to the server's select endpoint</li>
    <li>Add https support</li>
    <li>Add detection of proxy IP anonymity level</li>
    <li>Crawl more free proxy websites</li>
    <li>Support distributed deployment</li>
   </ul>
   <h2 style="text-align:start">References</h2>
   <ul>
    <li><a href="/misc/goto?guid=4959737168081228665" style="box-sizing: border-box; background-color: transparent; color: rgb(64, 120, 192); text-decoration: none;">IPProxyPool</a></li>
   </ul>
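   <p style="text-align:start">To illustrate the select, delete, and insert endpoints of the data server described earlier, here is a minimal sketch of how a client might build the request URLs. The helper names (build_url, select_url, delete_url, insert_url) and the BASE_URL constant are hypothetical and not part of IPProxyTool; the sketch assumes the server runs locally on the default data_port of 8000. It only constructs URLs and makes no assumption about the response format.</p>

```python
# Hypothetical helpers (not part of IPProxyTool) that build request URLs
# for the select / delete / insert endpoints of the data server.
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7, the version the project targets

BASE_URL = 'http://127.0.0.1:8000'  # assumes the default data_port of 8000


def build_url(action, params):
    """Return BASE_URL/action?query with all values URL-encoded."""
    # Sort the pairs by key so the resulting URL is deterministic.
    query = urlencode(sorted(params.items()))
    return '{0}/{1}?{2}'.format(BASE_URL, action, query)


def select_url(name):
    return build_url('select', {'name': name})


def delete_url(name, ip):
    return build_url('delete', {'name': name, 'ip': ip})


def insert_url(name, ip, port, **optional):
    """name, ip and port are required; country, anonymity, https,
    speed and source are optional according to the parameter table."""
    params = {'name': name, 'ip': ip, 'port': port}
    params.update(optional)
    return build_url('insert', params)


if __name__ == '__main__':
    print(select_url('douban'))
    print(delete_url('free_ipproxy', '27.197.144.181'))
    print(insert_url('douban', '1.2.3.4', '8080', anonymity=1, https='yes'))
```

   <p style="text-align:start">A resulting URL could then be fetched with any HTTP client, for example requests.get(url) from the requests package already listed in the dependencies. Sorting the parameters is a design choice of this sketch to keep the output stable; the documentation above does not suggest that the server depends on parameter order.</p>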