Beautiful Soup Chinese Documentation

Updated: 2023-07-12 16:32:21

from BeautifulSoup import BeautifulSoup          # For processing HTML
from BeautifulSoup import BeautifulStoneSoup     # For processing XML
import BeautifulSoup                             # To get everything

The code below demonstrates Beautiful Soup's basic features. You can copy and paste it into a Python file and run it yourself.

from BeautifulSoup import BeautifulSoup
import re

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="condpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()

# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="condpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

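Some ways to navigate the soup:

soup.contents[0].name
# u'html'

soup.contents[0].contents[0].name
# u'head'

head = soup.contents[0].contents[0]
head.parent.name
# u'html'

head.next
# <title>Page title</title>

head.nextSibling.name
# u'body'

head.nextSibling.contents[0]
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

head.nextSibling.contents[0].nextSibling
# <p id="condpara" align="blah">This is paragraph <b>two</b>.</p>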

Here are some ways to search the soup for particular tags, or for tags with particular attributes:

titleTag = soup.html.head.title
titleTag
# <title>Page title</title>

titleTag.string
# u'Page title'

len(soup('p'))
# 2

soup.findAll('p', align="center")
# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]

soup.find('p', align="center")
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

soup('p', align="center")[0]['id']
# u'firstpara'

# Match <p> tags whose align attribute starts with "b"
soup.find('p', align=re.compile('^b.*'))['id']
# u'condpara'

soup.find('p').b.string
# u'one'

soup('p')[1].b.string
# u'two'
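The pattern `'^b.*'` in the search above is tested against each tag's attribute value. As a minimal standalone sketch of that test, using only the standard `re` module and not Beautiful Soup itself, the snippet below applies the same pattern to the two `align` values from the sample document. The `paragraphs` list and the use of `search` are assumptions made for illustration.

```python
import re

# Pattern from the search above: any value that starts with "b".
pattern = re.compile('^b.*')

# (id, align) pairs copied by hand from the two <p> tags in the sample document.
paragraphs = [('firstpara', 'center'), ('condpara', 'blah')]

# Keep the ids whose align value matches the pattern.
matching_ids = [pid for pid, align in paragraphs if pattern.search(align)]
print(matching_ids)  # prints ['condpara']
```

Only "blah" starts with "b", which is why the search returns the second paragraph.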

Modifying the soup is also easy:

titleTag['id'] = 'theTitle'
titleTag.contents[0].replaceWith("New title")
soup.html.head
# <head><title id="theTitle">New title</title></head>

soup.p.extract()
soup.prettify()
# <html>
#  <head>
# ...

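In other words, that document is not valid HTML, but it is not too bad either. Here is a much worse one. Among other problems, its <FORM> tag starts outside the <TABLE> tag and ends inside it. (HTML like this is not uncommon, even on the websites of large companies.)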

from BeautifulSoup import BeautifulSoup
html = """
<html>
<form>
<table>
<td><input name="input1">Row 1 cell 1
<tr><td>Row 2 cell 1
</form>
<td>Row 2 cell 2<br>This</br> sure is a long cell
</body>
</html>"""

Beautiful Soup can handle this document too:

print BeautifulSoup(html).prettify()
# <html>
#  <form>
#   <table>
#    <td>
#     <input name="input1" />
#     Row 1 cell 1
#    </td>
#    <tr>
#     <td>
#      Row 2 cell 1
#     </td>
#    </tr>
#   </table>
#  </form>
#  <td>
#   Row 2 cell 2
#   <br />
#   This
#   sure is a long cell
#  </td>
# </html>

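The last cell of the table ends up outside the <TABLE> tag: Beautiful Soup decided to close the <TABLE> tag when it closed the <FORM> tag. The author of the document probably meant the <FORM> tag to extend to the end of the table, but Beautiful Soup has no way of knowing that. Even in a case as bad as this, Beautiful Soup still parses the invalid document and gives you access to all of its data.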
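The effect described above, where closing <FORM> also closes the still-open <TABLE>, can be pictured with a stack of open tags. The following is a rough sketch only: it uses Python 3's standard `html.parser` module rather than Beautiful Soup, and the `ImplicitCloser` class is a hypothetical name invented for this illustration. It does not reproduce Beautiful Soup's actual nesting rules.

```python
from html.parser import HTMLParser

class ImplicitCloser(HTMLParser):
    """Toy tree builder: an end tag also closes every tag opened after its match."""

    def __init__(self):
        super().__init__()
        self.stack = []    # names of the tags that are currently open
        self.events = []   # log of what was opened and closed, in order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.events.append('open ' + tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag with no matching start tag: ignore it
        # Pop open tags until the matching start tag has been closed.
        while self.stack:
            closed = self.stack.pop()
            self.events.append('close ' + closed)
            if closed == tag:
                break

parser = ImplicitCloser()
parser.feed('<form><table><td>cell</form><td>outside')
parser.close()
print(parser.events)
# prints ['open form', 'open table', 'open td', 'close td', 'close table',
#         'close form', 'open td']
```

The final `<td>` is opened after the table has already been closed, which mirrors how the last cell lands outside the table in the output above.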

Parsing XML

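The BeautifulSoup class, like a web browser, is full of heuristics that try to guess what the author of an HTML document intended. XML has no fixed set of tags, so those heuristics do not apply, and BeautifulSoup therefore does not handle XML very well.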

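Use the BeautifulStoneSoup class to parse XML documents. It is a general-purpose class with no knowledge of any particular XML dialect and only very simple rules about tag nesting. Here is an example: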

from BeautifulSoup import BeautifulStoneSoup
xml = "<doc><tag1>Contents 1<tag2>Contents 2<tag1>Contents 3"
soup = BeautifulStoneSoup(xml)
print soup.prettify()
# <doc>
#  <tag1>
#   Contents 1
#   <tag2>
#    Contents 2
#   </tag2>
#  </tag1>
#  <tag1>
#   Contents 3
#  </tag1>
# </doc>

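A major shortcoming of BeautifulStoneSoup is that it does not know which tags are self-closing. HTML has a fixed set of self-closing tags, but in XML it depends on the corresponding DTD. You can specify the self-closing tags by passing their names to the BeautifulStoneSoup constructor as the selfClosingTags argument: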

from BeautifulSoup import BeautifulStoneSoup
xml = "<tag>Text 1<selfclosing>Text 2"

# Without a hint, the tag is assumed to contain the text that follows it
print BeautifulStoneSoup(xml).prettify()
# <tag>
#  Text 1
#  <selfclosing>
#   Text 2
#  </selfclosing>
# </tag>

# With the hint, the tag is closed immediately
print BeautifulStoneSoup(xml, selfClosingTags=['selfclosing']).prettify()
# <tag>
#  Text 1
#  <selfclosing />
#  Text 2
# </tag>
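To see why the list of self-closing names changes the result, it may help to track nesting depth by hand. The sketch below is a toy written for this explanation, not Beautiful Soup's parser: the `nest` function, its `self_closing` parameter and the `marker` tag name are all hypothetical. It assumes simple markup without attributes.

```python
import re

def nest(markup, self_closing=()):
    """Toy nester: return (depth, token) pairs for the tags and text in markup.

    Hypothetical helper for illustration only; it is not part of Beautiful Soup.
    """
    result, depth = [], 0
    for token in re.findall(r'<[^>]+>|[^<]+', markup):
        if token.startswith('</'):
            depth -= 1                      # an explicit end tag closes one level
            result.append((depth, token))
        elif token.startswith('<'):
            name = token[1:-1]
            if name in self_closing:
                # Declared self-closing: emit it and stay at the same depth.
                result.append((depth, '<%s />' % name))
            else:
                result.append((depth, token))
                depth += 1                  # following content nests inside it
        else:
            result.append((depth, token))   # text sits at the current depth
    return result

xml = "<tag>Text 1<marker>Text 2"
default = nest(xml)
declared = nest(xml, self_closing=['marker'])
print(default)   # "Text 2" ends up at depth 2, inside <marker>
print(declared)  # "Text 2" stays at depth 1, next to <marker />
```

Without the declaration, everything after the tag is treated as its content; with it, the following text remains a sibling, which is the same difference the two prettified outputs above show.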
