This post summarizes how to install and use the BeautifulSoup library (with a supplement on XPath).

Getting Started with BeautifulSoup

Installation is easy: open a command prompt as administrator and run pip install beautifulsoup4.

Basic use of BeautifulSoup takes only two lines of code.

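# Import the BeautifulSoup class from the bs4 package
from bs4 import BeautifulSoup

# html_doc is the page source as a string, e.g. the text of a page fetched with requests
# 'html.parser' selects Python's built-in HTML parser
soup = BeautifulSoup(html_doc, 'html.parser')

# Show the HTML source with indentation
print(soup.prettify())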
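To try the two lines above without fetching a page, here is a minimal sketch that parses an inline string. The html_doc sample is an assumption: it is adapted from the outputs quoted in this post, not copied from a real site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, adapted from the outputs quoted in this post
html_doc = """<html><head><title>This is a title</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">You can learn Python by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title)       # the first <title> tag
print(soup.prettify())  # the whole tree, indented
```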

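Basic Elements of the BeautifulSoup Library

An HTML document, its tag tree, and a BeautifulSoup object are equivalent: each one corresponds to the other two.

A BeautifulSoup object represents the entire content of one HTML/XML document.

BeautifulSoup supports four parsers. html.parser ships with Python; the other three (lxml's HTML parser, lxml's XML parser, and html5lib) must be installed separately.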
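Since only html.parser is always available, one possible way to pick a parser is sketched below. The helper name make_soup is hypothetical; the sketch assumes you prefer lxml when it is installed and accept the built-in parser otherwise.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup):
    """Hypothetical helper: parse with lxml if installed, else html.parser."""
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # lxml is not installed; the built-in parser needs no extra package
        return BeautifulSoup(markup, 'html.parser')

soup = make_soup('<p>hello</p>')
print(soup.p.string)
```

Different parsers can build slightly different trees from malformed HTML, so it is safer to name the parser explicitly than to rely on a default.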

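Basic Elements of the BeautifulSoup Class

1. Tag

Any tag that exists in the HTML can be accessed as soup.<Tag>. When the document contains several tags with the same name, soup.<Tag> returns the first one.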

>>> soup.title
<title>This is a title</title>
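The "first one wins" rule can be checked with a short sketch. The markup is a made-up fragment; find_all is used here only to show the contrast with attribute access.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two <a> tags
html_doc = '<p><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a></p>'
soup = BeautifulSoup(html_doc, 'html.parser')

first = soup.a                   # attribute access stops at the first <a>
all_links = soup.find_all('a')   # find_all collects every <a>

print(first['id'])
print([a['id'] for a in all_links])
print(soup.table)                # a tag that is absent gives None
```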

2. Tag name

Every <Tag> has a name, available as <Tag>.name. It is a string.

>>> soup.a.name
'a'
>>> soup.a.parent.name # the innermost tag wrapping the first <a> is a <p>
'p'

3. Tag attrs (attributes)

A <Tag> can have zero or more attributes, stored as a dict.

>>> tag = soup.a
>>> tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
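A few shortcuts for working with the attribute dict are sketched below. The second class value "external" is a hypothetical addition, included only to show that class is returned as a list.

```python
from bs4 import BeautifulSoup

# "external" is a made-up second class value for illustration
html_doc = '<a class="py1 external" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.a

href = tag['href']        # shorthand for tag.attrs['href']
classes = tag['class']    # multi-valued attribute: a list of strings
title = tag.get('title')  # missing attribute: None instead of a KeyError

print(href, classes, title)
```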

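4. Tag NavigableString

The .string attribute returns a NavigableString and can reach across several nesting levels: for <p><a>xxxx</a></p>, soup.p.string returns xxxx directly. This works only when each tag along the way has a single child.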

>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>

>>> soup.p.string
'The demo python introduces several python courses.'
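The sketch below, on a made-up fragment, shows both the case where .string reaches through nesting and the case where it gives None because the tag has several children. get_text() is shown as one way to collect all the text in that second case.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical fragment: one nested-only tag, one tag with mixed children
soup = BeautifulSoup('<p><b>only child</b></p><div>text <i>and</i> more</div>', 'html.parser')

single = soup.p.string      # <p> has one child <b>, which holds one string
mixed = soup.div.string     # <div> has three children, so .string is None
text = soup.div.get_text()  # get_text() joins every string under the tag

print(repr(single), mixed, repr(text))
```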

5. Tag Comment

Comment is a special kind of NavigableString that holds the text of an HTML comment.
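Because .string strips the comment markers, a comment and ordinary text look the same when printed. A minimal sketch of telling them apart by type, using a made-up fragment:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical fragment: one comment, one ordinary string
soup = BeautifulSoup('<b><!--This is a comment--></b><p>This is not a comment</p>', 'html.parser')

for tag in (soup.b, soup.p):
    kind = 'comment' if isinstance(tag.string, Comment) else 'text'
    print(kind, repr(str(tag.string)))
```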

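Traversing HTML Content with bs4

There are three ways to traverse the tree: downward, upward, and sideways (between siblings).

1. Downward traversal of the tag tree

The BeautifulSoup object is the root node of the tag tree.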

>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)
5

Note: .contents returns a list.

Iteration

# Iterate over direct children
for child in soup.body.children:
    print(child)

# Iterate over all descendants
for child in soup.body.descendants:
    print(child)
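To make the difference between children and descendants concrete, here is a small sketch on a made-up fragment. It keeps only Tag nodes, since both iterators also yield strings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment with one nested tag
soup = BeautifulSoup('<body><p><b>bold</b> text</p><p>plain</p></body>', 'html.parser')

# .children yields only the direct children of <body>
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]
# .descendants walks the whole subtree, depth first
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(child_tags)
print(descendant_tags)
```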

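2. Upward traversal of the tag tree

Example: walking through all ancestors of a tag eventually reaches soup itself, which is not an ordinary tag, so that case has to be handled separately.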
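A minimal sketch of the upward walk, on a made-up fragment. It relies on the fact that the BeautifulSoup object reports the special name '[document]' and has no parent of its own.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <a> nested three levels deep
soup = BeautifulSoup('<html><body><p><a>link</a></p></body></html>', 'html.parser')

# .parents yields every ancestor, ending with the BeautifulSoup object
names = [parent.name for parent in soup.a.parents]
print(names)

# The root has no parent, so going one step further gives None
print(soup.parent)
```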

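3. Sideways traversal of the tag tree

Note: sibling traversal only happens among nodes that share the same parent, and a sibling can be a plain string (NavigableString) rather than a tag.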

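Iteration

# Iterate over the siblings that follow the first <a>
for sibling in soup.a.next_siblings:
    print(sibling)

# Iterate over the siblings that precede the first <a>
for sibling in soup.a.previous_siblings:
    print(sibling)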
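The sketch below, on a made-up fragment, shows that siblings include plain strings and that next_siblings and previous_siblings (plural) are the iterators, while the singular forms return one node.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: text and tags mixed under one <p>
soup = BeautifulSoup('<p>intro <a>first</a> and <a>second</a>.</p>', 'html.parser')

after = [str(s) for s in soup.a.next_siblings]       # everything after the first <a>
before = [str(s) for s in soup.a.previous_siblings]  # everything before it

print(after)
print(before)
print(repr(soup.a.next_sibling))  # singular form: just the next node
```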

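Summary

Downward traversal uses .contents, .children and .descendants; upward traversal uses .parent and .parents; sideways traversal uses .next_sibling, .previous_sibling, .next_siblings and .previous_siblings.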

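Formatting HTML Output with bs4

Can HTML content be displayed in a more readable way?

The prettify() method in bs4

It returns the parsed HTML as an indented string, with each tag and each string on its own line.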

# Works on the whole document and on a single tag
soup.prettify()
soup.a.prettify()
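A short sketch of what prettify() does to a one-line fragment. The markup is made up; the exact indentation width may differ between bs4 versions, so the sketch only looks at the order of the lines.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line fragment
soup = BeautifulSoup('<p><b>hi</b></p>', 'html.parser')

pretty = soup.prettify()  # a str with one tag or string per line
print(pretty)

# Strip the indentation to see the line-by-line structure
lines = [line.strip() for line in pretty.splitlines() if line.strip()]
print(lines)
```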