Scraping the Maoyan Movie Top 100 (Latest Version) • Front-End Tech Sharing

Scrape the details of the Top 100 films on the Maoyan movie site and save them to an Excel spreadsheet.

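Scraping the Maoyan Top 100 has several pitfalls. First, the page source you get back may not be the data you want. Second, after repeated requests the site stops returning correct data. The code below is an up-to-date version written to avoid both.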
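
One way to notice the first pitfall early is to check the returned source before parsing it. The following is only a minimal sketch and not part of the original script: it assumes a genuine board page always contains the `board-index` class that the regex further down relies on, and the helper name `looks_like_board_page` is made up for illustration.

```python
def looks_like_board_page(html):
    """Return True if the source appears to contain ranking entries.

    Assumption: a real Top 100 page contains the 'board-index' class,
    while a verification or error page does not.
    """
    if not html:
        return False
    return 'board-index' in html
```

Calling this on the page source before parsing gives a clear signal that the Cookie may need refreshing, instead of a silently empty result.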

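This script mainly uses four Python libraries: requests (sending requests), openpyxl (writing Excel files), re (regular expressions), and time (pausing between requests). The first two are third-party and must be installed separately if you do not already have them.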
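
To see whether the two third-party libraries are already installed, something like the following could be run first. It is a small sketch using the standard library's `importlib.util.find_spec`; the function name `missing_packages` is hypothetical.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == '__main__':
    missing = missing_packages(['requests', 'openpyxl'])
    if missing:
        # pip is assumed to be the installer in use
        print('Install first: pip install ' + ' '.join(missing))
    else:
        print('All third-party libraries are available')
```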

'''
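Target site: https://www.maoyan.com/board/4
Analysis: each page lists 10 films, so the Top 100 spans 10 pages. Each following page adds a parameter to the URL: ?offset=10, 20, 30...
    So a for loop that runs ten times, requesting one page per iteration, is enough.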
'''
import requests
import re
import time
import openpyxl
'''
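Pitfall: without the Cookie and User-Agent headers, the correct data is not returned.
    After many requests the Cookie value has to be replaced, otherwise the correct data is not returned either.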
'''
'''
1. Define a function that fetches one page of data
'''
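def get_page(url):
    # Set the request headers; the site returns wrong data without User-Agent and Cookie
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36',
        'referer': 'https://passport.meituan.com/',
        'Cookie': 'paste your own cookie here'
    }
    # Send the request; the timeout keeps a stalled connection from hanging the script
    response = requests.get(url, headers=headers, timeout=10)
    # Return the page source on success, otherwise None
    if response.status_code == 200:
        return response.text
    return None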

'''
2. Define the parsing function, which uses a regular expression to extract the fields we want
'''
def parse_one_page(html):
    # Raw string so that \d reaches the regex engine unchanged; re.S lets . match newlines
    pattern = re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>.*?integer">(.*?)</i><i class="fraction">(.*?)</i>.*?</dd>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        # print(item[0])
        yield {
            'index':item[0],
            'images':item[1],
            'name':item[2],
            'actor':item[3].strip(),
            'time':item[4],
            'score':item[5]+item[6]
        }
    # print(items)

'''
3. Write the data to Excel with openpyxl
'''
def write_excel(item):
    # Create a workbook
    wb = openpyxl.Workbook()
    # Use the active worksheet
    sheet = wb.active
    sheet.title = 'Maoyan Top 100'
    titles = ['Rank', 'Poster', 'Title', 'Starring', 'Release date', 'Score']
    # Write the header row first
    for col_index, title in enumerate(titles):
        sheet.cell(1, col_index + 1, title)
    # Append the 100 scraped rows
    for text in item:
        sheet.append(text)
    # Save the workbook
    wb.save("maoyan_top100.xlsx")

# Module-level list that collects all 100 scraped rows
itemarr = []

'''
4. Define the main function, which requests one page and collects its rows
'''
def main(offset):
    url = 'https://www.maoyan.com/board/4?offset=' + str(offset)
    html = get_page(url)
    # get_page returns None on a non-200 response; skip the page instead of crashing in re.findall
    if html is None:
        return
    for item in parse_one_page(html):
        itemarr.append(list(item.values()))

if __name__ == '__main__':
    # Loop ten times, one page per iteration
    for i in range(10):
        main(offset=i*10)
        time.sleep(1)
    # After the loop, write all collected rows to the spreadsheet
    write_excel(itemarr)
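
Because the site may stop returning real data, it can help to try the regular expression offline first. The sketch below reuses the pattern from `parse_one_page` against a hand-written snippet. The sample markup is an assumption that only mirrors the class names the pattern looks for, not a copy of the real page, and `parse_sample` is a hypothetical helper.

```python
import re

# Same pattern as in parse_one_page, written as raw strings
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a.*?>(.*?)</a>'
    r'.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i><i class="fraction">(.*?)</i>.*?</dd>',
    re.S,
)

# Hand-written sample, NOT real Maoyan markup
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="https://example.com/poster.jpg" class="board-img" />
  <p class="name"><a href="/films/1" title="Example Film">Example Film</a></p>
  <p class="star">
      Starring: A, B
  </p>
  <p class="releasetime">Release date: 1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''


def parse_sample(html):
    """Return one dict per <dd> entry, with the same keys as parse_one_page."""
    rows = []
    for index, image, name, star, release, integer, fraction in PATTERN.findall(html):
        rows.append({
            'index': index,
            'images': image,
            'name': name,
            'actor': star.strip(),
            'time': release,
            'score': integer + fraction,
        })
    return rows


print(parse_sample(SAMPLE_HTML))
```

If the pattern matches the sample but finds nothing in the live response, the response is most likely not a genuine board page.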

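That is the complete, up-to-date code for scraping the Maoyan Top 100. The main problem is the cookie: a missing or stale cookie makes the site return the wrong data, so if you are not getting the data you expect, check that first.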
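
Since the cookie has to be swapped from time to time, one option is to keep it out of the source file. The following is a rough sketch under that assumption: `build_headers` is a hypothetical helper, and `MAOYAN_COOKIE` is an environment variable name invented for this example.

```python
import os

USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
)


def build_headers(cookie=None):
    """Build the request headers, reading the Cookie from the environment by default.

    MAOYAN_COOKIE is a made-up variable name used only in this sketch.
    """
    if cookie is None:
        cookie = os.environ.get('MAOYAN_COOKIE', '')
    if not cookie:
        raise ValueError('No cookie set: copy one from your browser into MAOYAN_COOKIE')
    return {
        'User-Agent': USER_AGENT,
        'referer': 'https://passport.meituan.com/',
        'Cookie': cookie,
    }
```

With this, replacing a stale cookie means updating the environment variable rather than editing the script.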
