With a good pool of proxy IPs you can handle many networking tasks; large-scale web scraping, in particular, usually depends on rotating proxy IPs. In this post, IP海 walks through a tutorial for scraping the content of a news site.
IP海 uses the UC news site as the example:
The site has no elaborate anti-scraping measures, so we can fetch and parse the pages directly.
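Note that the script below fetches pages directly and does not itself route traffic through a proxy. If you want to plug one in, urllib supports this through ProxyHandler. A minimal sketch, assuming a hypothetical proxy at 127.0.0.1:8888 (substitute the host:port your provider gives you):

from urllib import request

# Hypothetical proxy address; replace with your own host:port.
proxy = request.ProxyHandler({'http': 'http://127.0.0.1:8888',
                              'https': 'http://127.0.0.1:8888'})
opener = request.build_opener(proxy)
request.install_opener(opener)  # every later request.urlopen() call now goes through the proxy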
from bs4 import BeautifulSoup
from urllib import request

def download(title, url):
    req = request.Request(url)
    response = request.urlopen(req)
    response = response.read().decode('utf-8')
    soup = BeautifulSoup(response, 'lxml')
    # The article body lives in a div with class "sm-article-content".
    tag = soup.find('div', class_='sm-article-content')
    if tag is None:
        return 0
    # Strip characters that are illegal in Windows file names.
    title = title.replace(':', '')
    title = title.replace('"', '')
    title = title.replace('|', '')
    title = title.replace('/', '')
    title = title.replace('\\', '')
    title = title.replace('*', '')
    title = title.replace('<', '')
    title = title.replace('>', '')
    title = title.replace('?', '')
    with open('D:\\code\\python\\spider_news\\UC_news\\society\\' + title + '.txt',
              'w', encoding='utf-8') as file_object:
        file_object.write('\n')
        file_object.write(title)
        file_object.write('\n')
        file_object.write('Article URL: ')
        file_object.write(url)
        file_object.write('\n')
        file_object.write(tag.get_text())
    # print('crawling...')
if __name__ == '__main__':
    # NOTE: the listing URL never changes, so all seven passes fetch the
    # same page; the original presumably meant to page through the section.
    for i in range(0, 7):
        url = 'https://news.uc.cn/c_shehui/'
        # headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.91 Safari/537.36",
        #            "cookie": "sn=3957284397500558579; _uc_pramas=%7B%22fr%22%3A%22pc%22%7D"}
        # res = request.Request(url, headers=headers)
        res = request.urlopen(url)
        req = res.read().decode('utf-8')
        soup = BeautifulSoup(req, 'lxml')
        # print(soup.prettify())
        # Each headline on the listing page sits in a div with class "txt-area-title".
        tag = soup.find_all('div', class_='txt-area-title')
        for x in tag:
            news_url = 'https://news.uc.cn' + x.a.get('href')
            print(x.a.string, news_url)
            download(x.a.string, news_url)
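The chain of replace() calls in download() removes the characters Windows forbids in file names one at a time. A more compact equivalent using the standard re module (a sketch, not part of the original script):

import re

def safe_filename(title):
    # Strip every Windows-forbidden file-name character in one pass.
    return re.sub(r'[:"|/\\*<>?]', '', title)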
With that, the scrape of the site's news data is done; check the run output to confirm the data came back successfully.
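If UC later starts rejecting bare requests, the commented-out headers in the script point at the usual remedy: send a browser-like User-Agent. A minimal sketch of the same listing request with headers enabled (the User-Agent string is only an example):

from urllib import request

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/61.0.3163.91 Safari/537.36'}
req = request.Request('https://news.uc.cn/c_shehui/', headers=headers)
html = request.urlopen(req).read().decode('utf-8')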