Web Scraping

Contributed by 764877509 on 2013-12-19

Author: Administrator   Created: 2012-06-06 01:30:27   Modified: 2013-08-01 02:46:58   Length: 6107 characters


1 Simulating an AJAX submission

Over the past few days I started working on a web crawler, and while parsing pages I ran into a problem with paginated data. When pagination is a plain GET URL it is easy to follow, but when each page is loaded through an AJAX POST it is more awkward.

Approach: since the request is a POST, my first idea was to send the values as ordinary POST parameters.

Java code

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.HttpStatus;
    import org.apache.commons.httpclient.NameValuePair;
    import org.apache.commons.httpclient.methods.PostMethod;

    // getHttpClient() and SEARCH_URL are defined elsewhere in the class.
    public String doHttpSend(String keyWord, String searchType, int pageNum) throws Exception
    {
        PostMethod method = null;
        try
        {
            HttpClient client = getHttpClient();
            method = new PostMethod(SEARCH_URL);

            method.addRequestHeader("connection", "keep-alive");

            // These are sent as a form-encoded body: keyWord=...&page=...
            NameValuePair[] params = new NameValuePair[]{
                new NameValuePair("keyWord", keyWord),
                new NameValuePair("page", String.valueOf(pageNum))
            };
            method.addParameters(params);
            int statusCode = client.executeMethod(method);

            if (statusCode != HttpStatus.SC_OK)
            {
                return null;
            }

            // Read the response body once and reuse it.
            String body = method.getResponseBodyAsString();
            System.out.println(body);
            return body;
        }
        finally
        {
            if (null != method)
            {
                method.releaseConnection();
            }
        }
    }

With this implementation the response was always "System.NotSupportedException".

At first I suspected the headers were wrong, but comparing the two requests with a traffic-inspection tool showed that the headers were essentially identical. The difference was the format of the parameters: the code above sends "param1=value1&param2=value2", whereas the AJAX call on the page submits a JSON string.

So I changed how the parameters are built. NameValuePair cannot be used here. Instead, first build the JSON string by hand (note the quotes around the keyWord value, which a JSON string value requires):

    String param = "{\"keyWord\":\"" + keyWord + "\",\"page\":" + pageNum + "}";

Then set it as the request body:

    method.setRequestBody(param);

2 Notes on HttpClient

The snippets in this section use the HttpClient 4.x API (DefaultHttpClient), not the 3.x API used in section 1.

(1) When an HttpClient instance is no longer needed, shut it down through its connection manager.

    httpclient.getConnectionManager().shutdown();

(2) For a server that requires authentication (here an HTTPS server on port 443), register a user name and password with the credentials provider. The httpclient variable must be a DefaultHttpClient.

    httpclient.getCredentialsProvider()
            .setCredentials(new AuthScope("localhost", 443),
                    new UsernamePasswordCredentials("username", "password"));

(4) Sending a file with HttpClient

    HttpClient httpclient = new DefaultHttpClient();
    HttpPost httppost = new HttpPost("http://www.apache.org");
    File file = new File(args[0]);
    InputStreamEntity reqEntity = new InputStreamEntity(
            new FileInputStream(file), -1);
    reqEntity.setContentType("binary/octet-stream");
    reqEntity.setChunked(true);
    // It may be more appropriate to use FileEntity class in this particular
    // instance but we are using a more generic InputStreamEntity to demonstrate
    // the capability to stream out data from any arbitrary source
    //
    // FileEntity entity = new FileEntity(file, "binary/octet-stream");
    httppost.setEntity(reqEntity);
    System.out.println("executing request " + httppost.getRequestLine());
    HttpResponse response = httpclient.execute(httppost);

(5) Reading cookie information

    HttpClient httpclient = new DefaultHttpClient();
    // Create a local cookie store
    CookieStore cookieStore = new BasicCookieStore();
    // Create a local HTTP context
    HttpContext localContext = new BasicHttpContext();
    // Bind the cookie store to the local context
    localContext.setAttribute(ClientContext.COOKIE_STORE, cookieStore);
    // Set the request URL
    HttpGet httpget = new HttpGet("http://www.google.com/");
    // Execute the request with the local context
    HttpResponse response = httpclient.execute(httpget, localContext);
    // Get the response entity
    HttpEntity entity = response.getEntity();
    System.out.println(response.getStatusLine());
    if (entity != null) {
        System.out.println("Response content length: " + entity.getContentLength());
    }
    // List the cookies collected in the store
    List<Cookie> cookies = cookieStore.getCookies();
    for (int i = 0; i < cookies.size(); i++) {
        System.out.println("Local cookie: " + cookies.get(i));
    }
    // Read the response headers
    Header[] headers = response.getAllHeaders();
    for (int i = 0; i < headers.length; i++) {
        // The loop body is truncated in the source; printing each header is an assumption.
        System.out.println(headers[i]);
    }

(The heading and first lines of the next example are truncated in the source. It submits form parameters through an HttpPost named httpost.)

    List<NameValuePair> nvps = new ArrayList<NameValuePair>();
    nvps.add(new BasicNameValuePair("IDToken1", "username"));
    nvps.add(new BasicNameValuePair("IDToken2", "password"));
    httpost.setEntity(new UrlEncodedFormEntity(nvps, HTTP.UTF_8));
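
Returning to the JSON request body from section 1: concatenating the keyword directly into the string breaks as soon as the keyword contains a quote or a backslash. The following is a minimal sketch of one way to build and send such a body using only the JDK, so that it can be run without HttpClient on the classpath. The class name JsonPostSketch and the methods escapeJson, buildSearchBody and postJson are hypothetical names introduced for illustration, and the Content-Type value is an assumption about what the target server expects; the field names keyWord and page come from the original request.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of HttpClient.
final class JsonPostSketch {

    // Escape the characters that must be escaped inside a JSON string value.
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the four-digit hex escape form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Build the body with the same two fields as the request in section 1.
    static String buildSearchBody(String keyWord, int pageNum) {
        return "{\"keyWord\":\"" + escapeJson(keyWord) + "\",\"page\":" + pageNum + "}";
    }

    // Send the JSON body with a plain JDK connection and return the HTTP status code.
    // The Content-Type header value is an assumption about the server.
    static int postJson(String url, String body) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/json; charset=UTF-8");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

The string returned by buildSearchBody could equally be passed to method.setRequestBody(param) in the HttpClient code from section 1; only the way the body is built changes.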
