
Incrementally Reading Large XML Files in Python

Updated: 2023-06-14 06:29:34

<zip>60632</zip>

<x_coordinate>1159494.68618856</x_coordinate>

<y_coordinate>1873313.83503384</y_coordinate>

<police_district>9</police_district>

<community_area>58</community_area>

<location latitude="41.808090232127896"

longitude="-87.69053684711305" />

</row>

<creation_date>2012-11-18T00:00:00</creation_date>

<status>Completed</status>

<completion_date>2012-11-18T00:00:00</completion_date>

<service_request_number>12-01906695</service_request_number>

<type_of_service_request>Pot Hole in Street</type_of_service_request>

<current_activity>Final Outcome</current_activity>

<most_recent_action>CDOT Street Cut ... Outcome</most_recent_action>

<street_address>3510 W NORTH AVE</street_address>

<x_coordinate>1152732.14127696</x_coordinate>

<y_coordinate>1910409.38979075</y_coordinate>

<police_district>14</police_district>

<community_area>23</community_area>

<longitude>-87.71435952353961</longitude>

<location latitude="41.91002084292946"

longitude="-87.71435952353961" />

</row>

</response>

</pre>

Suppose you want to write a script that ranks ZIP codes by the number of pothole reports. You could do it like this:

<pre>
from xml.etree.ElementTree import parse
from collections import Counter

potholes_by_zip = Counter()

# Load and parse the whole document at once
doc = parse('potholes.xml')
for pothole in doc.iterfind('row/row'):
    potholes_by_zip[pothole.findtext('zip')] += 1

# Print ZIP codes from most to fewest reports
for zipcode, num in potholes_by_zip.most_common():
    print(zipcode, num)

</pre>

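The only problem with this script is that it loads the entire XML file into memory before parsing it. On my machine, it needs about 450MB of memory to run. With the following code, the program needs only a small change: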

<pre>
from collections import Counter

potholes_by_zip = Counter()

# Iterate over matching elements one at a time instead of building the full tree
data = parse_and_remove('potholes.xml', 'row/row')
for pothole in data:
    potholes_by_zip[pothole.findtext('zip')] += 1

for zipcode, num in potholes_by_zip.most_common():
    print(zipcode, num)

</pre>

The result: this version of the code needs only 7MB of memory to run, a substantial saving.

Discussion

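This recipe relies on two core features of the ElementTree module. First, the iterparse() function allows incremental processing of XML documents. To use it, you supply the filename along with an event list containing one or more of the following: start, end, start-ns, and end-ns. The iterator created by iterparse() produces tuples of the form (event, elem), where event is one of the listed events and elem is the corresponding XML element. For example: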

<pre>
>>> from xml.etree.ElementTree import iterparse
>>> data = iterparse('potholes.xml', ('start', 'end'))
>>> next(data)
('start', <Element 'response' at 0x100771d60>)
>>> next(data)
('start', <Element 'row' at 0x100771e68>)
>>> next(data)
('start', <Element 'row' at 0x100771fc8>)
>>> next(data)
('start', <Element 'creation_date' at 0x100771f18>)
>>> next(data)
('end', <Element 'creation_date' at 0x100771f18>)
>>> next(data)
('start', <Element 'status' at 0x1006a7f18>)
>>> next(data)
('end', <Element 'status' at 0x1006a7f18>)

</pre>

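A start event is generated when an element is first created but has not yet been populated with any other data (such as child elements). An end event is generated when an element has been completed. Although not shown in this recipe, start-ns and end-ns events are used to handle XML namespace declarations.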
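To make the event types concrete, here is a minimal sketch that runs iterparse() over a small in-memory document. The document, its `geo` prefix, and the `http://example.com/geo` URI are made up for illustration and are not part of the pothole data.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical miniature document (not taken from the pothole data set).
xml_bytes = (b'<response xmlns:geo="http://example.com/geo">'
             b'<row><zip>60632</zip></row>'
             b'</response>')

# start/end: every element is reported twice, once when it opens
# and once when it has been completely read.
events = [(event, elem.tag)
          for event, elem in iterparse(io.BytesIO(xml_bytes), ('start', 'end'))]
for item in events:
    print(item)

# start-ns/end-ns: the second tuple item is not an element here.
# For start-ns it is a (prefix, uri) pair; for end-ns it is None.
ns_events = list(iterparse(io.BytesIO(xml_bytes), ('start-ns', 'end-ns')))
print(ns_events)
```

The nesting of start and end pairs mirrors the nesting of the elements, which is what makes it possible to track the current position in the document with a stack.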

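In this recipe, the start and end events are used to manage stacks of elements and tags. The stacks represent the hierarchical structure of the document as it is being parsed, and they are also used to determine whether an element matches the path passed to the parse_and_remove() function. If it matches, a yield statement returns the element to the caller.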
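The definition of parse_and_remove() is not reproduced in this part of the text, so the following is only a sketch of a function matching the description above: two stacks driven by start and end events, a path comparison, and a yield followed by removal from the parent. The parameter name `source` and the sample document are assumptions made for illustration.

```python
import io
from collections import Counter
from xml.etree.ElementTree import iterparse

def parse_and_remove(source, path):
    """Yield elements matching a slash-separated path (relative to the
    root), removing each one from its parent after it has been consumed.

    Sketch only: assumes the path has at least two components, so that
    a matched element always has a parent on the element stack.
    """
    path_parts = path.split('/')
    doc = iterparse(source, ('start', 'end'))
    next(doc)  # skip the root's start event; the path is relative to the root

    tag_stack = []
    elem_stack = []
    for event, elem in doc:
        if event == 'start':
            tag_stack.append(elem.tag)
            elem_stack.append(elem)
        elif event == 'end':
            if tag_stack == path_parts:
                yield elem
                # Detach the finished element from its parent
                elem_stack[-2].remove(elem)
            try:
                tag_stack.pop()
                elem_stack.pop()
            except IndexError:
                pass  # the root's end event arrives with empty stacks

# Made-up sample with the same response/row/row layout as the pothole file.
sample = b"""<response><row>
<row><zip>60632</zip></row>
<row><zip>60647</zip></row>
<row><zip>60632</zip></row>
</row></response>"""

potholes_by_zip = Counter()
for pothole in parse_and_remove(io.BytesIO(sample), 'row/row'):
    potholes_by_zip[pothole.findtext('zip')] += 1

result = potholes_by_zip.most_common()
print(result)
```

iterparse() accepts either a filename or an open file object, so the same function can be called with a path to a large file on disk.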

The statement that follows the yield is the core ElementTree feature that lets the program use so little memory:

<pre>
elem_stack[-2].remove(elem)
</pre>

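This statement removes the previously yielded element from its parent. Assuming that nothing else still holds a reference to the element, it is destroyed and its memory is reclaimed.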

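The end result of iteratively parsing and removing nodes is an efficient incremental sweep over the document. At no point is the complete document tree ever constructed. Even so, the XML data can still be processed in the straightforward manner shown above.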
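One way to check the memory difference on your own machine is to compare peak allocations with the standard tracemalloc module. The sketch below is not the measurement behind the 450MB and 7MB figures quoted earlier; it uses a small synthetic file, and the helper names (write_sample, count_full, count_incremental, peak_bytes) are hypothetical. tracemalloc only tracks allocations made through Python's allocator, so the numbers are indicative rather than a full process footprint.

```python
import os
import tempfile
import tracemalloc
from collections import Counter
from xml.etree.ElementTree import iterparse, parse

def write_sample(path, rows=20000):
    # Synthetic stand-in for the pothole file: response/row/row/zip
    with open(path, 'w') as f:
        f.write('<response><row>')
        for i in range(rows):
            f.write('<row><zip>%05d</zip><status>Completed</status></row>'
                    % (60600 + i % 50))
        f.write('</row></response>')

def count_full(path):
    # Builds the entire tree before counting
    counts = Counter()
    for row in parse(path).iterfind('row/row'):
        counts[row.findtext('zip')] += 1
    return counts

def count_incremental(path):
    # Counts each response/row/row element as it completes, then drops it
    counts = Counter()
    stack = []
    for event, elem in iterparse(path, ('start', 'end')):
        if event == 'start':
            stack.append(elem)
        else:
            stack.pop()
            if len(stack) == 2:  # two ancestors left: response and outer row
                counts[elem.findtext('zip')] += 1
                stack[-1].remove(elem)
    return counts

def peak_bytes(func, path):
    # Run func(path) under tracemalloc and report its peak traced memory
    tracemalloc.start()
    try:
        result = func(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.xml')
    write_sample(path)
    full_counts, full_peak = peak_bytes(count_full, path)
    incr_counts, incr_peak = peak_bytes(count_incremental, path)

print('full parse peak bytes: ', full_peak)
print('incremental peak bytes:', incr_peak)
```

Both functions should produce the same counts; only the peak memory differs, and the gap grows with the size of the input file.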

Published: 2023-06-14 06:29:34

Article link: https://www.wtabcd.cn/fanwen/fan/82/950515.html
