西湖梦: Debugging a Python web-scraping problem on Ubuntu

Thursday, September 11, 2014

I needed to scrape one block of content from a web page. In the HTML source, that block has a recognizable shape:
<div class="example english">
<div class="pic-show"><img alt="" src="http://oimagec1.ydstatic.com/image?product=dict-treasury&id=7737883349263777458&w=280&h=170" title=""/>
</div>
<div class="content">
<p class="sen">We are the ones obsessed by measurement. The world just pours it out.</p>
<p class="trans">人们总醉心于算计，而世界却倾其所有，慷慨给予。</p>
</div>
</div>
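My development machine runs Python 2.7.8 on Windows 8; the test machine runs Ubuntu with Python 2.7.3.
I fetch pages with urllib2 and parse them with bs4, and I believed both libraries were the same on the two machines.
The problem: the development machine extracted the block correctly, while the test machine returned None.
The code was identical on both machines.
Here it is: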
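import urllib2  # Python 2 only; this import was missing from the snippet
from bs4 import BeautifulSoup

url = 'http://xue.youdao.com/w'
# Download the raw HTML of the page
html_doc = urllib2.urlopen(url).read()
#print html_doc
soup = BeautifulSoup(html_doc)
# This lookup returned None on the test machine (bs4 4.0.2)
print soup.find("div", class_="example english")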
I eventually traced the problem to bs4, which on the test machine was installed as
beautifulsoup4-4.0.2.egg-info
The fix is simple. Change the lookup to
print soup.find("div","example english")
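or
print soup.find("div", attrs={"class": "example english"})
Either form works.
At first I did not know why. After some digging it turned out to be a version difference.
The bs4 version on the test machine was
beautifulsoup4-4.0.2.egg-info
and on the development machine it was
beautifulsoup4-4.3.2-py2.7.egg-info
Finding the version cost me a lot of time. I had assumed there was a function that would report it, but after Googling through many articles I found nothing, so I ended up inspecting the installed package's metadata directory instead.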
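In hindsight, a lookup along the following lines might have shortened the search. This is only a minimal sketch: it assumes the package exposes a `__version__` attribute, which is a common convention that bs4 follows, and `module_version` is a hypothetical helper name rather than part of any library.

```python
import importlib


def module_version(name):
    """Return the __version__ string of an importable module, or None.

    None is returned both when the module cannot be imported and when it
    does not define __version__.
    """
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


# Report the installed bs4 version, or None if bs4 is not available.
print(module_version("bs4"))
```

Running the same line on both machines would make a mismatch visible without digging through the install directory.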
The official documentation explains:

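Searching for tags by CSS class name is very useful, but class, the keyword that identifies the CSS class, is a reserved word in Python, so using class as an argument causes a syntax error. As of Beautiful Soup 4.1.1, you can search for tags with a given CSS class through the class_ argument: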

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

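The class_ argument also accepts the different kinds of filter: a string, a regular expression, a function, or True.

So the form I used on the development machine is only supported from that version onward.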
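A script that must run against both old and new installs could compare the installed version with that threshold before choosing a lookup style. The sketch below is an illustration only: `parse_version`, `supports_class_keyword`, and `CLASS_KEYWORD_SINCE` are hypothetical names, and the threshold is simply the number given in the passage quoted above, so treat it as an assumption.

```python
import re

# Threshold taken from the documentation passage quoted in this post.
# It is an assumption; adjust it if your reference names another release.
CLASS_KEYWORD_SINCE = (4, 1, 1)


def parse_version(text):
    """Turn a dotted version string such as '4.3.2' into a tuple of ints."""
    parts = []
    for piece in text.split("."):
        # Keep only the leading digits of each component; use 0 if none.
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def supports_class_keyword(version_text):
    """Return True if this version is at or past the assumed threshold."""
    return parse_version(version_text) >= CLASS_KEYWORD_SINCE


# The two versions from this post: the test machine and the dev machine.
for version in ("4.0.2", "4.3.2"):
    print(version, supports_class_keyword(version))
```

In practice the attrs dictionary form is the simpler choice, since it worked on both machines and needs no version check at all.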

A tag's class attribute is multi-valued. When searching by CSS class name, you can match against any single one of the tag's CSS classes:

css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>]

css_soup.find_all("p", class_="body")
# [<p class="body strikeout"></p>]

You can also search for the exact string value of the class attribute:

css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]

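When matching the exact value of class, the search finds nothing if the CSS class names are given in a different order than in the actual attribute. Searching for "strikeout body", for example, does not match the tag above.

In older versions of Beautiful Soup that lack the class_ shortcut, pass a dictionary through attrs instead: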

soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
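The class-matching rules quoted above can be summarized with a small stand-alone model. It is a simplified sketch of the documented behaviour for plain string filters only, not Beautiful Soup's actual implementation, and `class_matches` is a hypothetical name.

```python
def class_matches(attr_value, wanted):
    """Model the documented rule for a string filter on the class attribute.

    A tag matches when `wanted` equals one of its individual class names,
    or when `wanted` equals the whole attribute string exactly.
    """
    tokens = attr_value.split()
    return wanted in tokens or wanted == attr_value


attr = "body strikeout"
for query in ("strikeout", "body", "body strikeout", "strikeout body"):
    print(query, class_matches(attr, query))
```

The last query fails because the string comparison is order-sensitive, which mirrors the caveat in the documentation.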
As an aside: read the documentation carefully, and work on your English.
