Building a WeChat Official Account Crawler in Python by Driving the WeChat Client with pywinauto

Updated: 2023-07-23 16:17:31


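This project uses pywinauto to control the WeChat PC client on Windows (Windows 10) in order to crawl articles from official accounts. The code is split into a server and a client. The server receives the articles crawled by the client and stores them in a database; it also supports simple search and export. The client does the actual crawling by driving WeChat through pywinauto.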


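1. Project Address

2. Introduction to pywinauto

pywinauto is a Python tool for controlling Windows GUI programs. See its documentation for details.

3. The WechatAutomator Class

The code that automates WeChat is wrapped in the class WechatAutomator; see the project source for the complete code. The main methods are introduced briefly below.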

3.1 init_window

This method initializes the class. Its code is:

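from pywinauto.application import Application

def init_window(self, exe_path=r"C:\Program Files (x86)\Tencent\",
                turn_page_interval=3,
                click_url_interval=1,
                win_width=1000,
                win_height=600):
    # Attach to the running WeChat process through the uia backend.
    app = Application(backend="uia").connect(path=exe_path)
    self.main_win = app.window(title=u"微信", class_name="WeChatMainWndForPC")
    self.main_win.set_focus()
    self.app = app
    # Boxes that end at or above this y coordinate are treated as not visible (see 3.4).
    self.visible_top = 70
    self.turn_page_interval = turn_page_interval
    self.click_url_interval = click_url_interval
    self.browser = None
    self.win_width = win_width
    self.win_height = win_height
    # A second connection with the non-uia backend is needed to move the window;
    # this works around a bug in pywinauto's uia backend.
    self.app2 = Application().connect(path=exe_path)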

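Let us look at the parameters first:

exe_path
Path to the WeChat executable.

turn_page_interval
Time to wait after each page turn while crawling; the default is 3 s.

click_url_interval
Time to wait between URLs while crawling a single page; the default is 1 s.

win_width
Width of the window.

win_height
Height of the window. On a high-resolution display this can be set larger, so that one page holds more articles and fewer page turns are needed. Note: the window must be fully visible, so win_height must not exceed the actual vertical resolution of the screen!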


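The main job of this function is to build the Application object through which pywinauto controls WeChat, using the uia backend. It then sets the window size and moves the window to the top-left corner of the screen. According to a Stack Overflow post, pywinauto 0.6.8 has a bug that means a window can only be moved through the win32 backend, so the code creates self.app2 and calls move_window() through it to move the window to the top-left corner.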

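3.2 crawl_gongzhonghao

This function crawls the articles of one official account. Its basic control flow is:

1. Search for the official account by name in the search box and click it.
2. On the current page, click every link and download its content.
3. Press PAGE_DOWN to go to the next page.
4. Decide whether to keep crawling.

The first step is implemented by the locate_user function and the second by the process_page function; both are described later. The logic that decides whether to keep crawling is:

If more than max_pages pages have been turned, stop.
If a URL is encountered that has been crawled before, all earlier articles have already been crawled, so stop.
If latest_date is not None and an article was published before it, stop.

So on the first crawl we usually set max_pages to a large value (for example 100) and rely on latest_date to crawl back to a given date. Later crawls use a smaller max_pages (for example the default of 6); as long as the account publishes no more than 6 pages of updates between two crawls, no article is missed. See main.py for the details: it sends the crawled articles to the server over HTTP, and before each crawl it queries the server for the articles already crawled and stores them in the list states, where states[i]["url"] holds the URL of the i-th article.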

def crawl_gongzhonghao(self, account_name, articles, states, detail,
                       max_pages=6, latest_date=None, no_item_retry=3):
    logger.debug(account_name)
    if not self.locate_user(account_name):
        return False
    visited_urls = set()
    # Scroll back to the top before starting.
    self.turn_page_up(min(20, max_pages * 2))
    pagedown_retry = 0
    last_visited_titles = []
    for page in range(0, max_pages):
        items = []
        last_visited_titles = self.process_page(account_name, items, last_visited_titles,
                                                states, visited_urls, detail)
        articles.extend(items)

        # Stop after no_item_retry consecutive pages without any new item.
        if len(items) == 0:
            pagedown_retry += 1
            if pagedown_retry >= no_item_retry:
                s = "break because of retry {}".format(pagedown_retry)
                logger.debug(s)
                WechatAutomator.add_to_detail(s, detail)
                break
        else:
            pagedown_retry = 0

        # Stop once the last article on this page is older than latest_date.
        if len(items) > 0 and latest_date is not None:
            html = items[-1][-1]
            pub_date = _pubdate(html)
            if pub_date and pub_date < latest_date:
                s = "stop because {} < {}".format(pub_date, latest_date)
                logger.debug(s)
                WechatAutomator.add_to_detail(s, detail)
                break

        # Stop once a URL is found that the server already has.
        url_exist = False
        for item in items:
            if WechatAutomator.url_in_states(item[0], states):
                s = "stop because url exist {}".format(item[0])
                logger.debug(s)
                WechatAutomator.add_to_detail(s, detail)
                url_exist = True
                break
        if url_exist:
            break

        # Turn to the next page.
        self.click_right()
        self.main_win.type_keys("{PGDN}")
        time.sleep(self.turn_page_interval)

    # Scroll back up when done.
    self.turn_page_up(page * 2)
    return True
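
The three stop conditions described above can also be read as one small pure function. The sketch below is an illustration, not code from the project: the name stop_reason, its parameters, and the use of a set of known URLs in place of the states list are all assumptions made for the example.

```python
from datetime import date


def stop_reason(pages_done, max_pages, page_urls, known_urls,
                last_pub_date=None, latest_date=None):
    """Return a short reason string if crawling should stop, else None.

    Hypothetical helper: it mirrors the three conditions in the text,
    checked in the same order as the loop in crawl_gongzhonghao.
    """
    # Condition 1: the page budget is used up.
    if pages_done >= max_pages:
        return "max_pages reached"
    # Condition 3 in the text: the page's last article is older than latest_date.
    if latest_date is not None and last_pub_date is not None \
            and last_pub_date < latest_date:
        return "older than latest_date"
    # Condition 2 in the text: a URL on this page was crawled before.
    for url in page_urls:
        if url in known_urls:
            return "url already crawled"
    return None
```

With this shape, a first crawl would pass a large max_pages together with a latest_date, while a routine crawl would rely mostly on the known URLs to stop early.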

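3.3 locate_user

The control flow of locate_user is:

1. Find the search box in the top-left corner and click it to give it focus.
2. Press Ctrl+A to select any text that may already be there (left over from an earlier bug?) and delete it with Backspace.
3. Type the name of the official account.
4. In the list that pops up, click the account name to open the account.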

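def locate_user(self, user, retry=5):
    if not self.main_win:
        raise RuntimeError("you should call init_window first")

    # Focus the search box and clear whatever text it holds.
    search_btn = self.main_win.child_window(title="搜索", control_type="Edit")
    self.click_center(search_btn)
    self.main_win.type_keys("^a")
    self.main_win.type_keys("{BACKSPACE}")
    self.main_win.type_keys(user)

    # The result list appears with a delay, so poll for it once per second.
    for i in range(retry):
        time.sleep(1)
        try:
            search_list = self.main_win.child_window(title="搜索结果")
            match_result = search_list.child_window(title=user, control_type="ListItem")
            self.click_center(match_result)
            return True
        except Exception:
            # Not found yet; try again on the next iteration.
            pass
    return False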

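The code mainly relies on the child_window function to locate controls; its usage is not covered here. To work out how to locate an element, you can use an inspection tool or the print_control_identifiers function; see the documentation for details.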
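
The polling loop in locate_user follows a general "sleep, try, retry" pattern, because the search results take a moment to appear. A minimal standalone sketch of that pattern is shown here; the helper name retry_until_found and its parameters are hypothetical and not part of the project or of pywinauto.

```python
import time


def retry_until_found(action, attempts=5, delay=1.0, sleep=time.sleep):
    """Call action() up to `attempts` times, sleeping `delay` seconds before each try.

    Returns True as soon as action() completes without raising,
    and False if every attempt raised an exception.
    """
    for _ in range(attempts):
        sleep(delay)
        try:
            action()
            return True
        except Exception:
            # The control is probably not on screen yet; try again.
            continue
    return False
```

Catching Exception rather than using a bare except keeps Ctrl+C (KeyboardInterrupt) working while the crawler waits, which is convenient for a long-running GUI automation script.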

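3.4 process_page

This function contains the core crawling code. It processes the content of the current page, and its control flow is:

1. Build the tree for the current page.
2. Walk the tree with the recursive_get function and find the element that corresponds to each article.
3. Iterate over the articles. For each one:
   - If the article's title appeared on the previous page, skip it.
   - Get the article's coordinates.
   - If the article is not visible (rect.top >= win_rect.bottom or rect.bottom <= self.visible_top), skip it.
   - Compute the coordinates to click.
   - Click the article to open it in a new window.
   - In the new window, click the 【复制链接】 (Copy Link) button.
   - Read the URL from the clipboard.
   - Download the article content from the URL and parse the publish date.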
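
The visibility test in the steps above is easy to check in isolation. The following sketch is not the project's code: the Rect tuple and the is_visible helper are stand-ins introduced for illustration, and it assumes the window sits at the top-left corner of the screen so that visible_top can be compared directly with screen coordinates.

```python
from collections import namedtuple

# Stand-in for a rectangle with the usual left/top/right/bottom fields.
Rect = namedtuple("Rect", "left top right bottom")


def is_visible(rect, win_rect, visible_top=70):
    """Return False if the box starts at or below the window bottom,
    or ends at or above visible_top; otherwise return True."""
    return not (rect.top >= win_rect.bottom or rect.bottom <= visible_top)


win = Rect(0, 0, 1000, 600)
print(is_visible(Rect(300, 100, 900, 200), win))  # fully inside the window
print(is_visible(Rect(300, 600, 900, 700), win))  # starts at the window bottom
```

Note that a box counts as visible even when only part of it is on screen, which is why the click position still needs separate care.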

The logic is fairly simple, but a few parts are tricky:


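How WeChat implements paging

Paging in the WeChat client differs from paging in a browser: the content accumulates. For example, if the first page shows 3 articles, paging down once may give 6 articles and paging down again may give 9. At that point all 9 articles are in the tree, but only the last 3 have coordinates (top and bottom) that fall inside the visible area.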
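
Because the tree keeps growing, each page has to be filtered against what was already handled. The sketch below illustrates that idea only; new_articles is a hypothetical helper, it works on plain title strings instead of UI elements, and it ignores the visibility check that the real process_page also applies.

```python
def new_articles(tree_titles, last_visited_titles):
    """Split the titles currently in the tree into those not seen before.

    Returns (fresh, seen): `fresh` keeps the on-screen order of the unseen
    titles, and `seen` is the list to pass in on the next page.
    """
    already = set(last_visited_titles)
    fresh = [title for title in tree_titles if title not in already]
    return fresh, list(tree_titles)


# Page 1 shows three articles; after one page-down the tree holds five.
fresh, seen = new_articles(["a", "b", "c"], [])
fresh2, seen2 = new_articles(["a", "b", "c", "d", "e"], seen)
print(fresh2)  # only the titles that were not in the tree before
```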

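Whether an article can be clicked

The box (image) for an article may be only partly visible, and its top may even be very close to the bottom of the screen, in which case it may not be clickable. The original post illustrates this with a screenshot.

Similarly, the black header at the top (which does not scroll and covers the content beneath it) also takes up some space. The original post illustrates this with a screenshot as well.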

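Where to click

The box may be very narrow (bottom - top is small), and it may sit very close to the top or the bottom of the window. Hence the following code: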
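
The project's own listing for this step is not reproduced here. As a rough idea of what such a calculation could look like, the sketch below clips the box to the visible band and clicks in the middle of what remains. It is an assumption-laden illustration, not the author's code: the function name click_y and the 10-pixel minimum height are invented for the example, and the window is assumed to sit at the top-left corner of the screen.

```python
def click_y(rect_top, rect_bottom, win_bottom, visible_top=70, min_height=10):
    """Return a y coordinate to click inside the visible part of a box.

    The box is clipped to the band between visible_top and win_bottom.
    If fewer than min_height pixels remain visible, return None so the
    caller can skip the article and retry after the next page turn.
    """
    top = max(rect_top, visible_top)
    bottom = min(rect_bottom, win_bottom)
    if bottom - top < min_height:
        return None
    # Click the vertical center of the visible part.
    return (top + bottom) // 2
```

Clicking the center of the clipped region, rather than the center of the whole box, avoids landing on the header or below the window edge when the box is only partly on screen.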
