Web Scraping: Data Parsing (bs4, XPath) - 白红宇's Personal Blog


Published: 2022-03-30 05:03:20 | Views: 43 | Category: Blog posts


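The data-scraping workflow

　　Specify the URL

　　Send the request with the requests module

　　Get the data from the response

　　Parse the data (regex, bs4, or XPath)

　　Persist the results to storage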
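The five steps above can be sketched end to end as follows. This is only a minimal illustration, not the author's code: `fetch` is a hypothetical stand-in that returns a hard-coded page so the sketch runs without network access (in a real scraper it would be `requests.get(url, headers=headers).text`), the URL is made up, and the parsing step uses a regular expression, the first of the three parsing methods listed.

```python
import os
import re
import tempfile


def fetch(url):
    # Hypothetical stand-in for requests.get(url).text; returns a fixed page.
    return ('<ul><li><a href="/a/1.html">First</a></li>'
            '<li><a href="/a/2.html">Second</a></li></ul>')


def parse(page_text):
    # Regex parsing: capture the href and the link text of every <a> tag.
    return re.findall(r'<a href="([^"]+)">([^<]+)</a>', page_text)


def save(rows, path):
    # Persistent storage: one "text:href" line per link.
    with open(path, 'w', encoding='utf-8') as fp:
        for href, text in rows:
            fp.write(f'{text}:{href}\n')


url = 'http://example.com/list.html'   # step 1: specify the URL (made up)
page_text = fetch(url)                  # steps 2 and 3: request, read response
rows = parse(page_text)                 # step 4: parse
out_path = os.path.join(tempfile.mkdtemp(), 'links.txt')
save(rows, out_path)                    # step 5: persist
```

Keeping the fetch, parse, and save steps in separate functions makes it easy to swap the regex parser for bs4 or XPath later without touching the rest.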

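I. bs4 (BeautifulSoup)

1. Installation

    pip install bs4
    pip install lxml

2. How parsing works

　　1. Load the page source to be parsed into a BeautifulSoup object

　　2. Call the object's methods or attributes to locate the relevant tags in the source

　　3. Extract the text inside the located tags, or their attribute values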

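3. Basic usage

Workflow:
    - Import: from bs4 import BeautifulSoup
    - Usage: convert an HTML document into a BeautifulSoup object, then use the object's methods or attributes to find the nodes you need
        (1) From a local file:
            - soup = BeautifulSoup(open('local_file'), 'lxml')
        (2) From network data:
            - soup = BeautifulSoup(page_text, 'lxml')   page_text is a str or bytes
        (3) Printing the soup object shows the content of the HTML document

Basics:
    (1) Find by tag name
        - soup.a   returns only the first matching tag
    (2) Get attributes
        - soup.a.attrs  returns all of the tag's attributes and values as a dict
        - soup.a.attrs['href']   gets the href attribute
        - soup.a['href']   shorthand for the line above
    (3) Get content
        - soup.a.string
        - soup.a.text
        - soup.a.get_text()
       [Note] If the tag has more than one child node (for example text mixed with nested tags), string returns None, while the other two still return the text content
    (4) find: returns the first matching tag
        - soup.find('a')
        - soup.find('a', title="xxx")
        - soup.find('a', alt="xxx")
        - soup.find('a', class_="xxx")
        - soup.find('a', id="xxx")
    (5) find_all: returns all matching tags
        - soup.find_all('a')
        - soup.find_all(['a','b'])   all a tags and all b tags
        - soup.find_all('a', limit=2)   only the first two
    (6) Select content with CSS selectors
        - select: soup.select('#feng')
        - Common selectors: tag selector (a), class selector (.), id selector (#), hierarchy selectors
            - Hierarchy selectors:
                div .dudu #lala .meme .xixi   descendants at any depth
                div > p > a > .lala           direct children only
       [Note] select always returns a list; use an index to pick out a specific element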
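The calls listed above can be tried on a small in-memory page. The snippet below is a sketch under stated assumptions: the HTML, its URLs, and its class and id values are made up for illustration, and it uses Python's built-in `'html.parser'` instead of `'lxml'` so that only bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up page used only for this illustration.
html = """
<div class="tang">
  <ul>
    <li><a href="http://example.com/1" title="qing">first poem</a></li>
    <li><a href="http://example.com/2" id="feng">second poem</a></li>
    <li><b>bold name</b></li>
  </ul>
</div>
<p class="song">Song dynasty <span>this is span</span></p>
"""

soup = BeautifulSoup(html, 'html.parser')

first_a = soup.a                           # first <a> in the document
first_href = soup.a['href']                # same as soup.a.attrs['href']
p_string = soup.p.string                   # None: <p> holds text plus a <span>
p_text = soup.p.get_text()                 # text of <p> and its descendants
feng = soup.find('a', id='feng')           # first <a> with id="feng"
a_and_b = soup.find_all(['a', 'b'])        # every <a> and every <b>
one_a = soup.find_all('a', limit=1)        # stop after the first match
links = soup.select('.tang > ul > li > a') # CSS selector, always a list
by_id = soup.select('#feng')[0]            # index into the list
```

Note that `select` returns a list even when an id selector can match only one element, which is why the last line indexes with `[0]`.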

　　Task: use bs4 to scrape every chapter of the novel Romance of the Three Kingdoms from the shicimingju.com site and save it to local disk: http://www.shicimingju.com/book/sanguoyanyi.html

    import requests
    from bs4 import BeautifulSoup

    url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
    }

    # Fetch the table-of-contents page and collect the chapter links
    page_text = requests.get(url=url, headers=headers).text
    soup = BeautifulSoup(page_text, 'lxml')
    a_list = soup.select('.book-mulu > ul > li > a')

    with open('sanguo.txt', 'w', encoding='utf-8') as fp:
        for a in a_list:
            title = a.string
            detail_url = 'http://www.shicimingju.com' + a['href']
            # Fetch the chapter page and extract its body text
            detail_page_text = requests.get(url=detail_url, headers=headers).text
            detail_soup = BeautifulSoup(detail_page_text, 'lxml')
            content = detail_soup.find('div', class_='chapter_content').text
            fp.write(title + '\n' + content)
            print(title, 'downloaded')
    print('over')

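II. XPath parsing

1. Installation

    pip install lxml

2. How parsing works

　　Get the page source

　　Instantiate an etree object and load the page source into it

　　Call the object's xpath method to locate the target tags

　　Note: the xpath method must be given an XPath expression to locate tags and capture content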

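3. Basic usage

　　Common XPath expressions

Attribute matching:
    # Find div tags whose class attribute is "song"
    //div[@class="song"]
Hierarchy and index:
    # Under the div with class "tang", take its direct child ul, the second li child of that ul, and the a tag directly under that li
    //div[@class="tang"]/ul/li[2]/a
Logical operators:
    # Find a tags whose href is empty and whose class is "du"
    //a[@href="" and @class="du"]
Fuzzy matching:
    //div[contains(@class, "ng")]
    //div[starts-with(@class, "ta")]
Getting text:
    # /text() returns the text directly inside a tag
    # //text() returns the text inside a tag and inside all of its descendants
    //div[@class="song"]/p[1]/text()
    //div[@class="tang"]//text()
Getting attributes:
    //div[@class="tang"]//li[2]/a/@href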
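To see what these expressions return, they can be run with lxml against a small page. The page below is a made-up miniature written for this sketch (its URLs and attribute values are assumptions, not the author's test page); the expressions are the ones listed above.

```python
from lxml import etree

# Made-up page; class names follow the expressions, the rest is hypothetical.
html = '''
<html><body>
<div class="song">
  <p>李清照</p>
  <p>王安石</p>
  <a href="" class="du">总为浮云能蔽日</a>
</div>
<div class="tang">
  <ul>
    <li><a href="http://example.com/1">清明时节雨纷纷</a></li>
    <li><a href="http://example.com/2">秦时明月汉时关</a></li>
    <li><b>杜小月</b></li>
  </ul>
</div>
</body></html>
'''

tree = etree.HTML(html)

song_divs = tree.xpath('//div[@class="song"]')                # attribute match
second_link = tree.xpath('//div[@class="tang"]/ul/li[2]/a')   # index is 1-based
empty_du = tree.xpath('//a[@href="" and @class="du"]')        # logical "and"
fuzzy = tree.xpath('//div[contains(@class, "ng")]')           # "song" and "tang"
starts = tree.xpath('//div[starts-with(@class, "ta")]')       # only "tang"
first_p = tree.xpath('//div[@class="song"]/p[1]/text()')      # direct text
all_text = tree.xpath('//div[@class="tang"]//text()')         # all nested text
href = tree.xpath('//div[@class="tang"]//li[2]/a/@href')      # attribute value

# //text() also returns whitespace-only nodes between tags; drop them.
clean_text = [t.strip() for t in all_text if t.strip()]
```

Every `xpath` call returns a list, even when only one node matches, so single results are usually taken with `[0]`.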

　　Test page data (the HTML tags were lost in extraction; only the text content remains)

    Page title: 测试bs4

    百里守约

    李清照
    王安石
    苏轼
    柳宗元
    this is span 宋朝是最强大的王朝，不是军队的强大，而是经济很强大，国民都很有钱
    总为浮云能蔽日,长安不见使人愁

    清明时节雨纷纷,路上行人欲断魂,借问酒家何处有,牧童遥指杏花村
    秦时明月汉时关,万里长征人未还,但使龙城飞将在,不教胡马度阴山
    岐王宅里寻常见,崔九堂前几度闻,正是江南好风景,落花时节又逢君
    杜甫
    杜牧
    杜小月
    度蜜月
    凤凰台上凤凰游,凤去台空江自流,吴宫花草埋幽径,晋代衣冠成古丘

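　　Usage in code

1. Install: pip install lxml
2. Import: from lxml import etree
3. Convert the HTML or XML document into an etree object, then call the object's methods to find the nodes you need
　　3.1 Local file: tree = etree.parse(filename)
                tree.xpath("xpath expression")
　　3.2 Network data: tree = etree.HTML(page source string)
                tree.xpath("xpath expression")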
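Both ways of building the tree can be compared side by side. One caveat worth illustrating: `etree.parse` uses an XML parser by default, so for HTML files that are not well-formed XML it helps to pass `etree.HTMLParser()` as the second argument. The sketch below assumes a made-up page with unclosed `<li>` tags and writes it to a temporary file to stand in for a local HTML file.

```python
import os
import tempfile

from lxml import etree

# Made-up page with unclosed <li> tags, which strict XML parsing would reject.
page = '<html><body><ul><li>one<li>two</ul></body></html>'

# Write the page to a temporary file to play the role of a local HTML file.
path = os.path.join(tempfile.mkdtemp(), 'page.html')
with open(path, 'w', encoding='utf-8') as fp:
    fp.write(page)

# Local file: pass an HTML parser so the lenient HTML rules are used.
local_tree = etree.parse(path, etree.HTMLParser())
local_items = local_tree.xpath('//li/text()')

# Network data: etree.HTML takes the page source string directly.
remote_tree = etree.HTML(page)
remote_items = remote_tree.xpath('//li/text()')
```

Both trees expose the same `xpath` method, so the expressions do not change depending on where the page came from.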

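Install an XPath plugin in the browser to verify XPath expressions: expressions can be executed directly in the plugin

　　Drag the XPath plugin into Chrome's Extensions page (under More tools) to install it

　　Toggle the plugin on and off with Ctrl + Shift + X

Examples:

　　1. Parse second-hand housing listings from 58.com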

    import requests
    from lxml import etree

    url = 'https://bj.58.com/shahe/ershoufang/?utm_source=market&spm=u-2d2yxv86y3v43nkddh1.BDPCPZ_BT&PGTID=0d30000c-0047-e4e6-f587-683307ca570e&ClickID=1'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
    }

    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # Each li element holds one listing
    li_list = tree.xpath('//ul[@class="house-list-wrap"]/li')

    with open('58.csv', 'w', encoding='utf-8') as fp:
        for li in li_list:
            # The leading "./" makes the expression relative to the current li
            title = li.xpath('./div[2]/h2/a/text()')[0]
            price = ''.join(li.xpath('./div[3]//text()'))
            fp.write(title + ':' + price + '\n')
    print('over')

　　2. Parse image data: http://pic.netbian.com/4kmeinv/

    import os
    import urllib.request

    import requests
    from lxml import etree

    url = 'http://pic.netbian.com/4kmeinv/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
    }

    response = requests.get(url=url, headers=headers)
    # response.encoding = 'utf-8'

    if not os.path.exists('./imgs'):
        os.mkdir('./imgs')

    page_text = response.text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//div[@class="slist"]/ul/li')
    for li in li_list:
        img_name = li.xpath('./a/b/text()')[0]
        # Repair the garbled Chinese: the text was decoded as ISO-8859-1 but the page is GBK
        img_name = img_name.encode('iso-8859-1').decode('gbk')
        img_url = 'http://pic.netbian.com' + li.xpath('./a/img/@src')[0]
        img_path = './imgs/' + img_name + '.jpg'
        urllib.request.urlretrieve(url=img_url, filename=img_path)
        print(img_path, 'downloaded')
    print('over!!!')
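The `encode('iso-8859-1').decode('gbk')` step in the code above undoes a wrong decoding. The standard-library sketch below reproduces the problem without any network access; the sample string is made up. It works because ISO-8859-1 maps every byte value to exactly one character, so re-encoding the garbled text recovers the original bytes unchanged.

```python
# Made-up sample standing in for an image title on a GBK-encoded page.
original = '高清壁纸'

raw_bytes = original.encode('gbk')           # the bytes the server sends
garbled = raw_bytes.decode('iso-8859-1')     # wrong decoding gives mojibake

# The repair: turn the mojibake back into the original bytes,
# then decode those bytes with the page's real encoding.
repaired = garbled.encode('iso-8859-1').decode('gbk')
```

An alternative, assuming the whole page is GBK, is to set `response.encoding = 'gbk'` before reading `response.text`, so that every extracted string is decoded correctly in the first place.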

　　3. Download images from jandan.net: http://jandan.net/ooxx (the image URLs are Base64-encoded as an anti-scraping measure)

    import base64
    import urllib.request

    import requests
    from lxml import etree

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
    }
    url = 'http://jandan.net/ooxx'

    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # Each span holds one Base64-encoded image URL
    img_hash_list = tree.xpath('//span[@class="img-hash"]/text()')
    for img_hash in img_hash_list:
        # The decoded value is protocol-relative, so prepend the scheme
        img_url = 'http:' + base64.b64decode(img_hash).decode()
        img_name = img_url.split('/')[-1]
        urllib.request.urlretrieve(url=img_url, filename=img_name)
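The decoding step in the code above can be checked in isolation with the standard library. In this sketch the image address is made up, and the "hash" is produced locally with `b64encode` to stand in for the text of an `img-hash` span.

```python
import base64

# Made-up protocol-relative address standing in for a real image URL.
sample_address = '//img.example.com/large/sample.jpg'

# What the page would carry in the span: the Base64 form of the address.
img_hash = base64.b64encode(sample_address.encode()).decode()

# The scraper's side: decode, add the scheme, take the file name.
img_url = 'http:' + base64.b64decode(img_hash).decode()
img_name = img_url.split('/')[-1]
```

Base64 is a reversible encoding rather than encryption, which is why no key is needed to recover the address.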

　　4. Scrape the resume templates from 站长素材 (sc.chinaz.com)

    import random

    import requests
    from lxml import etree

    headers = {
        # Close the connection as soon as the request succeeds (frees resources in the connection pool promptly)
        'Connection': 'close',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
    }
    url = 'http://sc.chinaz.com/jianli/free_%d.html'

    for page in range(1, 4):
        # The first page has no page number in its URL
        if page == 1:
            new_url = 'http://sc.chinaz.com/jianli/free.html'
        else:
            new_url = url % page

        response = requests.get(url=new_url, headers=headers)
        response.encoding = 'utf-8'
        page_text = response.text
        tree = etree.HTML(page_text)
        div_list = tree.xpath('//div[@id="container"]/div')
        for div in div_list:
            detail_url = div.xpath('./a/@href')[0]
            name = div.xpath('./a/img/@alt')[0]
            detail_page = requests.get(url=detail_url, headers=headers).text
            detail_tree = etree.HTML(detail_page)
            download_list = detail_tree.xpath('//div[@class="clearfix mt20 downlist"]/ul/li/a/@href')
            # Pick one of the download mirrors at random
            download_url = random.choice(download_list)
            data = requests.get(url=download_url, headers=headers).content
            fileName = name + '.rar'
            with open(fileName, 'wb') as fp:
                fp.write(data)
                print(fileName, 'downloaded')
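The pagination rule in the code above (page 1 has its own URL, later pages follow a numbered template) can be isolated into a small helper. This is a sketch only; `page_url` is a hypothetical helper name, and the two URLs are the ones used in the example.

```python
URL_TEMPLATE = 'http://sc.chinaz.com/jianli/free_%d.html'
FIRST_PAGE_URL = 'http://sc.chinaz.com/jianli/free.html'


def page_url(page):
    # Hypothetical helper: page 1 has no number, later pages fill the template.
    if page == 1:
        return FIRST_PAGE_URL
    return URL_TEMPLATE % page


page_urls = [page_url(page) for page in range(1, 4)]
```

Building the full list first makes it easy to print and check the URLs before any request is sent.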

　　5. Parse all the city names

    import requests
    from lxml import etree

    headers = {
        # Close the connection as soon as the request succeeds (frees resources in the connection pool promptly)
        'Connection': 'close',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
    }
    url = 'https://www.aqistudy.cn/historydata/'

    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # The | operator merges the results of two XPath expressions into one list
    li_list = tree.xpath('//div[@class="bottom"]/ul/li |  //div[@class="bottom"]/ul/div[2]/li')
    for li in li_list:
        city_name = li.xpath('./a/text()')[0]
        print(city_name)
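The `|` operator used in the expression above is the XPath union: it collects the nodes matched by either side in a single call. The sketch below shows it on a made-up page whose structure and class names (`hot`, `all`) are assumptions chosen for illustration, not the layout of the real site.

```python
from lxml import etree

# Made-up page with two separate city lists.
html = '''
<div class="hot"><ul>
  <li><a href="#">Beijing</a></li>
  <li><a href="#">Shanghai</a></li>
</ul></div>
<div class="all"><ul>
  <li><a href="#">Shanghai</a></li>
  <li><a href="#">Guangzhou</a></li>
</ul></div>
'''

tree = etree.HTML(html)

# One call returns the li elements from both lists.
li_list = tree.xpath('//div[@class="hot"]/ul/li | //div[@class="all"]/ul/li')
city_names = [li.xpath('./a/text()')[0] for li in li_list]

# The union removes duplicate nodes, not duplicate text, so a city that
# appears in both lists is returned twice; a set removes the repeats.
unique_names = set(city_names)
```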

Reposted from: https://www.cnblogs.com/chenxi67/p/10446115.html
