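For the underlying principles and approach, see my other hands-on scraping posts:
py3 + urllib + bs4 + anti-scraping workarounds, scraping Douban girl pictures in 20-odd lines of code: http://www.cnblogs.com/UncleYong/p/6892688.html
py3 + requests + json + xlwt, scraping Lagou job listings: http://www.cnblogs.com/UncleYong/p/6960044.html
py3 + urllib + re, easily scraping the winning numbers of the last 100 Shuangseqiu (Double Color Ball) lottery draws: http://www.cnblogs.com/UncleYong/p/6958242.html
The implementation is as follows: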
import urllib.request, re

# Fetch the page source
def page(pg):
    url = 'https://www.pengfu.com/index_%s.html' % pg
    # The page is UTF-8 encoded, so decode it into a Unicode string
    html = urllib.request.urlopen(url).read().decode('utf8')
    # print(html)
    return html

# Extract the titles
def title(html):
    reg = re.compile(r'(.*?)')  # the r prefix makes this a raw string, so backslashes are not treated as escapes
    item = re.findall(reg, html)
    # print(item)
    return item

# Extract the image URLs
def content(html):
    # html = page(1)
    reg = r'>>>>:' + m, n) download(n, m)
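The flow above (build the page URL, fetch and decode the HTML, pull out titles and image URLs with regular expressions, then print and download each pair) can be sketched roughly as follows. This is a minimal sketch, not the original code: the sample markup, both regex patterns, and the helper names `page_url`, `fetch`, `titles`, `image_urls`, and `safe_name` are assumptions made for illustration, and the real page structure on pengfu.com may well differ. Only the URL pattern, the UTF-8 decoding, and the `'>>>>:'` print format come from the post.

```python
import re
import urllib.request

# Hypothetical markup for illustration only; the real page may look different.
SAMPLE_HTML = '''
<h1 class="title"><a href="/content_1.html">First joke</a></h1>
<img src="https://example.com/a.jpg" alt="First joke">
<h1 class="title"><a href="/content_2.html">Second joke</a></h1>
<img src="https://example.com/b.gif" alt="Second joke">
'''

# Assumed patterns that match the sample markup above.
TITLE_RE = re.compile(r'<h1 class="title"><a href=".*?">(.*?)</a></h1>')
IMAGE_RE = re.compile(r'<img src="(.*?)" alt=".*?">')


def page_url(pg):
    """Build the listing URL for page number pg (pattern taken from the post)."""
    return 'https://www.pengfu.com/index_%s.html' % pg


def fetch(url):
    """Download a page and decode it as UTF-8."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode('utf8')


def titles(html):
    """Return every title captured by TITLE_RE."""
    return TITLE_RE.findall(html)


def image_urls(html):
    """Return every image URL captured by IMAGE_RE."""
    return IMAGE_RE.findall(html)


def safe_name(title, url):
    """Derive a file name from the title, keeping the image's extension."""
    ext = url.rsplit('.', 1)[-1]
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title)
    return '%s.%s' % (cleaned, ext)


def download(url, title):
    """Save the image at url under a name derived from its title."""
    urllib.request.urlretrieve(url, safe_name(title, url))


# Pair each title with its image and print them in the post's format.
pairs = list(zip(titles(SAMPLE_HTML), image_urls(SAMPLE_HTML)))
for name, url in pairs:
    print('>>>>:' + name, url)
```

To run it against the live site, replace `SAMPLE_HTML` with `fetch(page_url(1))`, adjust the two patterns to the page's actual markup, and call `download(url, name)` inside the loop.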