Exceptions and timeouts in highly robust Python crawlers

zoohvan, 9 years ago
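<p>Crawlers are the kind of program where surprises are the norm: there is no guarantee that every request returns a stable, uniform result. To run for months without stopping, a crawler has to handle bad data, timeouts, and deadlocks. These notes come out of long-term maintenance of a GitHub project, the <a href="/misc/goto?guid=4959730797686206281" rel="nofollow,noindex">anti-anti-crawler open-source library</a>; stars are welcome if you want more of the same.</p>
<h2>Contents:</h2>
<ul>
 <li>1: Basic try/except exception handling</li>
 <li>2: Timeouts for request functions</li>
 <li>3: Timeouts for selenium+chrome | phantomjs</li>
 <li>4: Deadlock or timeout handling for your own functions</li>
 <li>5: Deadlock or timeout handling for your own threads</li>
 <li>6: Designing a program that restarts itself</li>
</ul>
<h2>1: Basic try/except exception handling</h2>
<p>The point of try/except is not only to catch exceptions but, more importantly, to let the program move past them. Most crawler exceptions disappear when the request is simply retried, so when one occurs the cheapest fix is usually to put the failed task back into the task queue.</p>
<p>Also, a statement wrapped in try cannot bring down the whole program when it fails. Trust me, you do not want a job that was meant to run all weekend to stop in the middle of the night.</p>
<pre><code class="language-python">import time

try:
    pass  # the statement that may fail goes here
except Exception as e:
    # keep the failed URL so it can be retried on the next run
    print(e)
finally:
    # runs whether or not an exception was handled
    print(time.ctime())
</code></pre>
<h2>2: Timeouts for request functions</h2>
<h3>2.1: Plain requests:</h3>
<p>2.1.1 Single request:</p>
<pre><code class="language-python">import requests
requests.get(url, timeout=60)  # timeout is in seconds
</code></pre>
<p>2.1.2 Session-based request:</p>
<pre><code class="language-python">import requesocks
session = requesocks.session()
response = session.get(URL, headers=headers, timeout=10)
</code></pre>
<h2>3: Timeouts for selenium+chrome | phantomjs</h2>
<h3>3.1: Timeout settings for selenium+chrome</h3>
<p>Official documentation: http://selenium-python.readthedocs.io/waits.html</p>
<p>Explicit wait: wait for a specific condition to occur before the code continues.</p>
<pre><code class="language-python">from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(driver, 10).until(  # change the wait time (seconds) here
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()
</code></pre>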
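<p>To make the mechanism concrete: an explicit wait is essentially a polling loop with a deadline. The sketch below shows that idea in plain Python. It is not Selenium's implementation; <code>wait_until</code>, <code>WaitTimeout</code>, and the parameter names are hypothetical, chosen only for illustration.</p>

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition does not become truthy before the deadline."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds
    have passed. (Hypothetical helper, not part of Selenium.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %s s" % timeout)
        time.sleep(poll_interval)
```

<p>The same helper can guard any slow check in a crawler, for example waiting for a file to appear or for a queue to become non-empty, as long as the check itself returns quickly.</p>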
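<p>Implicit wait: tells WebDriver to poll the DOM for a certain amount of time when trying to find an element or elements that are not immediately available. The default is 0. Once set, the implicit wait applies for the lifetime of the WebDriver object.</p>
<pre><code class="language-python">from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.implicitly_wait(10)  # seconds
driver.get("http://somedomain/url_that_delays_loading")
# find_element_by_id was removed in Selenium 4; find_element(By.ID, ...) is the equivalent call
myDynamicElement = driver.find_element(By.ID, "myDynamicElement")
</code></pre>
<h3>3.2: Timeout settings for phantomjs</h3>
<p>Here PhantomJS is used without Selenium, so the script has to be written in JavaScript. The key setting is <code>page.settings.resourceTimeout = 5000;</code> (wait up to 5 seconds), shown in context in the full script:</p>
<pre><code class="language-python">var system = require('system');
var args = system.args;
var url = args[1];
var page = require('webpage').create();
page.settings.resourceTimeout = 5000; // wait 5 seconds
page.onResourceTimeout = function(e) {
  console.log(e.errorCode);   // print the error code
  console.log(e.errorString); // print the error message
  console.log(e.url);         // print the URL that timed out
  phantom.exit(1);
};
page.open(url, function(status) {
  if (status === 'success') {
    var html = page.evaluate(function() {
      return document.documentElement.outerHTML;
    });
    console.log(html);
  }
  phantom.exit();
});
// Usage: phantomjs xx.js http://bbs.pcbaby.com.cn/topic-2149414.html
</code></pre>
<h2>4: Deadlock or timeout handling for your own functions</h2>
<p>This one is very important!</p>
<p>Python executes statements in order, so if the next statement can hang (a while(1) loop, say), how do you force it to time out? If the call has no timeout setting of its own, you have to handle it yourself with signals (import signal).</p>
<pre><code class="language-python"># coding: utf-8
import time
import signal


def test(i):
    time.sleep(0.999)  # simulate a call that may run past the limit
    print("%d within time" % i)
    return i


def fuc_time(time_out, i):
    # Timeout control for a function call: replace test() below with the
    # function that may hang for unknown reasons.
    # signal.SIGALRM is only available on Unix.
    def handler(signum, frame):
        raise AssertionError

    try:
        signal.signal(signal.SIGALRM, handler)
        signal.alarm(time_out)  # time_out is the limit in whole seconds
        temp = test(i)  # returns the data normally if it finishes in time
        return temp
    except AssertionError:
        print("%d timeout" % i)  # report the timeout
    finally:
        signal.alarm(0)  # cancel any alarm that is still pending


if __name__ == '__main__':
    for i in range(1, 10):
        fuc_time(1, i)
</code></pre>
<h2>5: Deadlock or timeout handling for your own threads</h2>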
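<p>In one project, selenium+phantomjs was not a good fit (the required functionality was hard to implement that way), so native PhantomJS was the only option. But PhantomJS itself can also hang in extreme cases, for example on an error that occurs before the timeout setting takes effect.</p>
<p>The best approach is then to have Python start a separate thread (process) that calls native PhantomJS, and to put the timeout on that thread or process.</p>
<p>Here the ping command serves as a stand-in for testing:</p>
<pre><code class="language-python">import subprocess
from threading import Timer
import time

kill = lambda process: process.kill()

cmd = ["ping", "www.google.com"]
ping = subprocess.Popen(
    cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

my_timer = Timer(5, kill, [ping])  # set the time limit (seconds) and the process to kill
try:
    my_timer.start()  # start the countdown
    stdout, stderr = ping.communicate()  # collect the output
    # print(stderr)
    print(time.ctime())
finally:
    print(time.ctime())
    my_timer.cancel()
</code></pre>
<h2>6: Designing a program that restarts itself</h2>
<p>Suppose the program has reported errors several times under some condition. Restarting it once that condition is met solves most problems. This treats the symptom rather than the cause, of course, but if restarting is harmless (for example, when the program just reads tasks from a queue), a self-restart is one of the cheapest options.</p>
<pre><code class="language-python">import time
import sys
import os


def restart_program():
    # replace the current process with a fresh run of the same script
    python = sys.executable
    os.execl(python, python, *sys.argv)


if __name__ == "__main__":
    print('start...')
    print("The program will end in 3 seconds...")
    time.sleep(3)
    restart_program()
</code></pre>
<p>Source: http://python.jobbole.com/87357/</p>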
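<p>The self-restart idea from section 6 can be tied to an error counter. The following is a minimal sketch, assuming a simple loop over tasks: <code>run_with_restart</code>, <code>handle</code>, and the threshold of 3 are hypothetical names and values, and only <code>restart_program</code> follows the article's code.</p>

```python
import os
import sys


def restart_program():
    # Same technique as in section 6: replace the current process
    # with a fresh run of the same script.
    os.execl(sys.executable, sys.executable, *sys.argv)


def run_with_restart(tasks, handle, max_errors=3, restart=restart_program):
    """Process tasks; restart after max_errors consecutive failures.

    `handle` is whatever function processes one task (hypothetical).
    `restart` can be swapped out, which makes the loop testable.
    """
    errors = 0
    for task in tasks:
        try:
            handle(task)
            errors = 0  # a success resets the counter
        except Exception as exc:
            errors += 1
            print("task %r failed: %s" % (task, exc))
            if errors >= max_errors:
                restart()
                return  # only reached when restart() is a stub
```

<p>Counting consecutive failures rather than total failures keeps an occasional bad page from triggering a restart, while a run of errors (a dead proxy, a stuck session) still does.</p>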