Data Collection Homework #3
Assignment ①
2. Output: print the URL of each downloaded image to the console, save the downloaded images in the images subfolder, and include a screenshot.
(1) Code:
# -*- coding = utf-8 -*-
# @Time:2020/10/13 18:36
# @Author:CaoLanying
# @File:the_third_project1.py
# @Software:PyCharm
from bs4 import BeautifulSoup
from bs4 import UnicodeDammit
import urllib.request
import threading
import os
# Image crawler
def imageSpider(start_url):
    global threads
    global count  # image counter
    try:
        urls = []
        req = urllib.request.Request(start_url, headers=headers)
        data = urllib.request.urlopen(req)
        data = data.read()
        dammit = UnicodeDammit(data, ["utf-8", "gbk"])  # guess the page encoding
        data = dammit.unicode_markup
        soup = BeautifulSoup(data, "html.parser")
        images = soup.select("img")  # grab all <img> tags
        for image in images:
            try:
                src = image["src"]
                url = urllib.request.urljoin(start_url, src)
                if url not in urls:
                    urls.append(url)  # remember the URL so each image is downloaded only once
                    # print(url)  # URL of the crawled image
                    count = count + 1
                    T = threading.Thread(target=download, args=(url, count))  # one thread per image
                    T.setDaemon(False)  # non-daemon thread
                    T.start()
                    threads.append(T)  # keep the thread so it can be joined later
            except Exception as err:
                print(err)
    except Exception as err:
        print(err)
def download(url, count):
    try:
        if url[len(url)-4] == ".":
            ext = url[len(url)-4:]  # ext keeps the dot plus extension, e.g. ".jpg"
        else:
            ext = ""
        req = urllib.request.Request(url, headers=headers)
        data = urllib.request.urlopen(req, timeout=100)
        data = data.read()
        fobj = open("D:\\pythonProject\\Wulin'cour\\images\\" + str(count) + ext, "wb")
        fobj.write(data)
        fobj.close()
        print("downloaded " + str(count) + ext)
    except Exception as err:
        print(err)
#start_url="/weather/101280601.shtml"
#start_url="www.sziit.edu"
start_url="xcb.fzu.edu/#"
headers = {"Ur-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre)Gecko/2008072421 Minefield/3.0.2pre"}
count=0
threads=[]
imageSpider(start_url)
for t in threads:
    t.join()
print("The End")
(2) Result screenshot:
(3) Reflections:
1. Problem encountered: the images failed to save because the output path was not written correctly.
2. Solution: switched to an absolute path: fobj = open("D:\\pythonProject\\Wulin'cour\\images\\" + str(count) + ext, "wb")
Assignment ②
Requirement: reproduce Assignment ① using the Scrapy framework.
Output: same as Assignment ①.
(1) Steps and code:
1. Write the spider
import scrapy
from ..items import GetimagItem

class WeatherSpider(scrapy.Spider):
    name = 'weather'
    allowed_domains = ['']
    start_urls = ['/']

    def parse(self, response):
        img_url_list = response.xpath('//img/@src')
        for url in img_url_list.extract():
            item = GetimagItem()
            item["url"] = url
            print(url)
            yield item
        print("ok")
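The spider imports GetimagItem from items.py, which is not shown in this write-up. Judging from the single url field used above, it presumably looks roughly like this (a reconstruction, not the original file):
import scrapy

class GetimagItem(scrapy.Item):
    url = scrapy.Field()  # URL of one image found on the page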
2. Write the pipeline
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import urllib.request

class GetimagPipeline(object):
    count = 0  # number of times process_item has been called

    def process_item(self, item, spider):
        try:
            url = item["url"]  # the image URL
            if url[len(url) - 4] == ".":
                ext = url[len(url) - 4:]  # ext keeps the dot plus extension, e.g. ".jpg"
            else:
                ext = ""
            req = urllib.request.Request(url)
            data = urllib.request.urlopen(req, timeout=100)
            data = data.read()
            self.count += 1  # advance the image counter
            fobj = open("D:\\pythonProject\\Wulin'cour\\images2\\" + str(self.count) + ext, "wb")  # open the output file
            fobj.write(data)  # write the image data
            fobj.close()  # close the file
            print("downloaded " + str(self.count) + ext)
        except Exception as err:
            print(err)
        return item
3. Configure settings
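The settings change is only shown as a screenshot in the original post. Based on the pipeline class above and the settings snippet quoted in the appendix at the end of this post, the relevant lines presumably look something like the sketch below; the project name getimag is my guess.
# settings.py (sketch)
ITEM_PIPELINES = {
    'getimag.pipelines.GetimagPipeline': 300,  # register the pipeline; 300 is an arbitrary order value
}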
(2) Result screenshot:
(3) Reflections:
I learned that Scrapy is a fast, high-level Python web-crawling framework for scraping websites and extracting structured data from their pages. The first time I used it I was completely lost and had no idea which module my code was supposed to go in; after working through the examples in the textbook I started to get a feel for it. Once I understood the workflow and how the components fit together, programming with it became very convenient.
Why use Scrapy at all? Because it makes it easier to build large-scale crawling projects, it handles requests asynchronously and therefore quickly, and it can automatically throttle the crawl speed with its auto-adjustment mechanism. In one sentence: Scrapy is powerful!
Assignment ③
Requirement: use the Scrapy framework to crawl stock information.
Output:
No. | Code   | Name  | Latest | Change % | Change | Volume  | Turnover | Amplitude | High | Low   | Open | Prev. close
1   | 688093 | N世华 | 28.47  | 62.22%   | 10.92  | 26.13万 | 7.6亿    | 22.34     | 32.0 | 28.08 | 30.2 | 17.55
...
(1) Code:
Overall project structure:
1. Write items
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class StocksItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    i = scrapy.Field()
    f12 = scrapy.Field()
    f14 = scrapy.Field()
    f2 = scrapy.Field()
    f3 = scrapy.Field()
    f4 = scrapy.Field()
    f5 = scrapy.Field()
    f6 = scrapy.Field()
    f7 = scrapy.Field()
    pass
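The cryptic field names mirror the keys returned by the quote API that the spider below requests. Judging from the header row written by the pipeline, they map roughly as follows (my own summary, not part of the original post):
# Meaning of the API field names, inferred from the pipeline's header row
FIELD_MEANINGS = {
    "i": "row number",
    "f12": "stock code",
    "f14": "stock name",
    "f2": "latest price",
    "f3": "change percentage",
    "f4": "change amount",
    "f5": "volume",
    "f6": "turnover",
    "f7": "amplitude",
}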
2. Write the spider
import scrapy
import json
from ..items import StocksItem

class MystockSpider(scrapy.Spider):
    name = 'mystock'
    start_urls = ["75./api/qt/clist/get?cb=jQuery112406817237975028352_1601466960670&pn=1&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&fid=f3&fs=m:0+t:6,m:0+t:13,m:0+t:80,m:1+t:2,m:1+t:23&"]
    # start_urls = ["/center/gridlist.html#hs_a_board"]

    def parse(self, response):
        # body_as_unicode() is called so that unicode-encoded data can be handled
        count = 0
        result = response.body_as_unicode()
        # the jQuery JSONP wrapper has to be stripped; the outermost ");" must go too, otherwise json.loads keeps failing (this took me ages to figure out)
        result = result.replace('''jQuery112406817237975028352_1601466960670(''', "").replace(');', '')
        result = json.loads(result)
        for f in result['data']['diff']:
            count += 1
            item = StocksItem()
            item["i"] = str(count)
            item["f12"] = f['f12']
            item["f14"] = f['f14']
            item["f2"] = f['f2']
            item["f3"] = f['f3']
            item["f4"] = f['f4']
            item["f5"] = f['f5']
            item["f6"] = f['f6']
            item["f7"] = f['f7']
            yield item
        print("ok")
3. Write the pipeline
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from openpyxl import Workbook

class StocksPipeline(object):
    wb = Workbook()
    ws = wb.active  # activate the worksheet
    ws.append(["序号", "代码", "名称", "最新价(元)", "涨跌幅", "跌涨额(元)", "成交量", "成交额(元)", "涨幅"])  # header row

    def process_item(self, item, spider):
        line = [item['i'], item['f12'], item['f14'], item['f2'], item['f3'], item['f4'], item['f5'], item['f6'], item['f7']]  # collect every field of the item
        self.ws.append(line)  # append the row to the worksheet
        self.wb.save(r'C:\Users\Administrator\Desktop\stock.xlsx')  # save the xlsx file
        return item
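One thing worth noting about this design: the workbook is rewritten to disk on every single item, which gets wasteful for large result sets. Scrapy pipelines also expose open_spider/close_spider hooks, so a variant (a sketch of an alternative, not what the original assignment used) could save only once at the end:
from openpyxl import Workbook

class StocksExcelPipeline(object):
    def open_spider(self, spider):
        # create the workbook and header row once, when the spider starts
        self.wb = Workbook()
        self.ws = self.wb.active
        self.ws.append(["No.", "Code", "Name", "Latest", "Change %", "Change", "Volume", "Turnover", "Amplitude"])

    def process_item(self, item, spider):
        self.ws.append([item['i'], item['f12'], item['f14'], item['f2'], item['f3'],
                        item['f4'], item['f5'], item['f6'], item['f7']])
        return item

    def close_spider(self, spider):
        # write the file a single time, after the crawl has finished
        self.wb.save('stock.xlsx')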
4. Configure settings
(2) Result screenshot:
(3) Reflections:
With the experience from the second assignment, the third one was not that hard. The teacher had mentioned that the output should look a bit nicer, so I looked into how to store the data in an Excel spreadsheet and found it both convenient and practical. This mainly relies on Scrapy's pipeline.py and the open-source Python library OpenPyXL. I will try other storage methods later.
1. Problems encountered:
My roommate said his results simply would not show up on the screen and he struggled with it for a long time. I did not pay attention to the issue at first, but since I might run into it later, I am noting it down here.
2. Solution:
Set this option to True.
Appendix: how to write Excel files from Scrapy:
About pipelines
A pipeline is a Scrapy component: after data is scraped by a spider, it is handed to the pipeline for processing. A pipeline usually consists of several "stages", and each item passes through them in order; if an item fails one of the stages, it is dropped.
Pipelines are typically used to:
1. Clean HTML data (e.g. strip useless tags)
2. Validate scraped data (e.g. check that certain fields are present)
3. Check for duplicates (filter out duplicate data; see the sketch below)
4. Store scraped data in a database
Here we only use the last kind of functionality, except that we save to an xlsx file instead of a database.
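The "dropped" behaviour mentioned above is implemented in Scrapy by raising DropItem from a pipeline stage. A minimal duplicate-filtering stage might look like this (illustrative only; the url field is assumed from the image pipeline earlier in this post):
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # drop any item whose URL has already passed through this stage
        if item['url'] in self.seen_urls:
            raise DropItem("duplicate item found: %s" % item['url'])
        self.seen_urls.add(item['url'])
        return item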
About OpenPyXL
OpenPyXL is a Python library for reading and writing Excel 2007 xlsx/xlsm files. Without further ado, a quick example:
from openpyxl import Workbook
wb = Workbook()  # instantiate the Workbook class
ws = wb.active  # activate the worksheet
ws['A1'] = 42  # write a value into cell A1
ws.append(['科比', '1997年', '后卫', '赛季报销'])  # append one row of data
wb.save('/home/alexkh/nba.xlsx')  # save the file
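To check what was written, the file can be read back with openpyxl's load_workbook (a small usage sketch, assuming the nba.xlsx path from the example above):
from openpyxl import load_workbook

wb = load_workbook('/home/alexkh/nba.xlsx')  # open the existing workbook
ws = wb.active
for row in ws.iter_rows(values_only=True):  # iterate over rows as tuples of cell values
    print(row)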
Saving Scrapy output to Excel
Saving Scrapy data to Excel is handled in pipeline.py. The code looks like this:
# coding=utf-8
from openpyxl import Workbook

class TuniuPipeline(object):  # stage one
    wb = Workbook()
    ws = wb.active
    ws.append(['新闻标题', '新闻链接', '来源网站', '发布时间', '相似新闻', '是否含有网站名'])  # header row

    def process_item(self, item, spider):  # what this stage actually does
        line = [item['title'], item['link'], item['source'], item['pub_date'], item['similar'], item['in_title']]  # collect every field of the item
        self.ws.append(line)  # append the row to the xlsx worksheet
        self.wb.save('/home/alexkh/tuniu.xlsx')  # save the xlsx file
        return item
For pipeline.py to take effect, you also need to add the following to settings.py:
ITEM_PIPELINES = {
    'tuniunews.pipelines.TuniuPipeline': 200,  # 200 sets the order of this stage
}
References
1. The Item Pipeline section of the Scrapy documentation
2. The official OpenPyXL documentation