Scraping the latest article times from a normal university's website - 白红宇's personal blog


Published: 2022-03-04 11:48:29 | Category: Technical articles


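import requests
import time
from bs4 import BeautifulSoup

# Request headers, shared by the list-page and detail-page requests
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Connection': 'keep-alive',
    'Host': 'em.scnu.edu.cn',
    # Conditional headers copied from the browser; if the page is unchanged,
    # the server may answer 304 with an empty body
    'If-Modified-Since': 'Mon, 21 Jun 2021 04:19:30 GMT',
    'If-None-Match': '"3d7490a15466d71:0"',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': ''
}


def get_data():
    # URL of the page to scrape (the part before '=' is missing in the source);
    # the keyword '论文' ("paper") is substituted into it
    url = '={}/'.format('论文')
    # Pass the headers to requests.get and return the raw response body
    res = requests.get(url=url, headers=headers).content
    print(res)
    return res


def input_data(res):
    # Instantiate a BeautifulSoup object and load the page source into it
    soup = BeautifulSoup(res, 'lxml')
    # Parse out the time and title of each article
    p_list = soup.select('.linkBox3 > p')
    print(p_list)
    # The with block closes the output file even if a request fails
    with open('huananshifan.txt', 'w', encoding='utf-8') as fp:
        for p in p_list:
            title = p.a.string
            # This reads the time shown on the page, but I am not sure .string is
            # the right accessor here; the time may need something else.
            # Named pub_time so it does not shadow the time module.
            pub_time = p.span.string
            detail_url = p.a['href']
            detail_page_text = requests.get(url=detail_url, headers=headers).content
            # Parse the data on the detail page
            detail_soup = BeautifulSoup(detail_page_text, 'lxml')
            div_tag = detail_soup.find('div', class_='article')
            # Extract the article body
            content = div_tag.text
            fp.write(title + pub_time + ':' + content + '\n')
            print(title + pub_time, 'scraped successfully!')


if __name__ == '__main__':
    res = get_data()
    input_data(res)
    time.sleep(2)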
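
The comment in input_data leaves open whether .string is the right way to read the time. In BeautifulSoup, .string is None when a tag contains more than one child, so get_text(strip=True) is usually the safer accessor. Once the text is extracted, finding the latest article time comes down to parsing the strings. The sketch below is only an illustration under assumptions: the YYYY-MM-DD format, the helper names parse_time and latest_article, and the sample rows are all hypothetical and not taken from the site.

```python
from datetime import datetime

# Assumed date format; adjust it to whatever the list page actually shows.
TIME_FORMAT = '%Y-%m-%d'


def parse_time(raw):
    """Turn the text of a time <span> into a datetime, or None if it does not parse."""
    if raw is None:  # .string yields None for tags with several children
        return None
    # Strip whitespace and surrounding brackets such as "[2021-06-21]".
    cleaned = raw.strip().strip('[]()（）')
    try:
        return datetime.strptime(cleaned, TIME_FORMAT)
    except ValueError:
        return None


def latest_article(rows):
    """Return the (title, datetime) pair with the most recent time, or None."""
    parsed = [(title, parse_time(raw)) for title, raw in rows]
    dated = [(title, when) for title, when in parsed if when is not None]
    if not dated:
        return None
    return max(dated, key=lambda pair: pair[1])


# Hypothetical rows, shaped like the (title, time) pairs collected in input_data.
sample_rows = [
    ('Notice A', ' 2021-06-18 '),
    ('Notice B', '[2021-06-21]'),
    ('Notice C', None),
]
print(latest_article(sample_rows))
```

Rows whose time is missing or does not match the assumed format are skipped rather than raising, so one malformed entry does not stop the comparison.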

Reposted from: https://blog.csdn.net/xxy_yinji/article/details/119004044
