ChatGPT Code Interpreter in Practice: Parsing Data Scraped from Dianping (大众点评) - 六虎

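Preface

Our escape-room startup project 《封灵街》 has shut down, but it did at least earn a 4.8 rating on Dianping. I wanted to download all of its reviews as a keepsake, in case Dianping deletes them someday.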

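Getting the Dianping data

232 reviews is not a huge number, but it is not small either. Downloading them by hand one at a time is unrealistic, and it goes against my instincts as an engineer. So I opened Charles, set it as my phone's proxy, and started capturing traffic.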

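1. Trying to capture the app's traffic

No luck. It appears that, as an anti-scraping measure, the Dianping app does not send its network requests over HTTP.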

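2. Trying to capture the mini program's traffic

So I fell back to the next best option and checked whether the mini program's requests could be captured. I shared the shop's mini program to File Transfer Assistant and then opened it in WeChat.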

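2.1. Getting the URL and trying to modify the parameters

Ha, it really is sent in plaintext. So easy.

Next, I exported the request to see whether modifying its parameters would let me issue the remaining requests in bulk.

I pasted it into Postman and changed the offset parameter.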

The response came back as an error, which indicates that the parameters are signed.
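To see why editing a single parameter is enough to get a request rejected, here is a toy sketch of parameter signing. It is not Dianping's actual scheme, which is unknown here: the secret, the shopId parameter, and the HMAC-SHA256 construction are all assumptions made purely for illustration.

```python
import hashlib
import hmac

SECRET = b"hypothetical-client-secret"  # assumption: a key only the real client knows


def sign(params: dict) -> str:
    """Sign the sorted query parameters with HMAC-SHA256."""
    canonical = "&".join(f"{k}={params[k]}" for k in sorted(params))
    return hmac.new(SECRET, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def server_accepts(params: dict, signature: str) -> bool:
    """The server recomputes the signature and compares it with the one it received."""
    return hmac.compare_digest(sign(params), signature)


original = {"shopId": "123", "offset": 0}  # shopId is a made-up parameter
signature = sign(original)

# Editing offset in Postman while reusing the old signature amounts to this:
tampered = dict(original, offset=10)

print(server_accepts(original, signature))  # True
print(server_accepts(tampered, signature))  # False
```

Without the key, a client cannot produce a valid signature for the new offset, which would explain the error response.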

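I had no better option for now. I did the math: at 10 reviews per page, 232 reviews fit in 24 pages. So I decided to turn the pages by hand and let Charles save the response of each request.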
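The page count is a ceiling division. A quick sketch of the arithmetic follows; the variable names are illustrative, and the offset values assume that offset advances by the page size, which the captured requests do not confirm.

```python
import math

total_reviews = 232  # number of reviews to archive
page_size = 10       # reviews returned per request

# Round up, because the last page is only partially filled.
pages = math.ceil(total_reviews / page_size)

# Assumed offset for each page: 0, 10, 20, ...
offsets = [page * page_size for page in range(pages)]

print(pages)        # 24
print(offsets[-1])  # 230
```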

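2.2. Fetching manually and saving

Fortunately, the whole process took only about 2 minutes. I then selected all the requests and used Save All to save them into one folder.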
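Once the responses are on disk, a quick sanity check can confirm that no page was missed. This is a minimal sketch, assuming each saved response is a .json file with the reviewInfo / reviewListInfo / reviewList layout that the script later in this post reads; the function name is hypothetical.

```python
import json
from pathlib import Path


def load_reviews(json_dir: str) -> dict:
    """Return all reviews found in json_dir, keyed by reviewId.

    Keying by reviewId means a page saved twice does not inflate the count.
    """
    reviews = {}
    for path in sorted(Path(json_dir).glob("*.json")):
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
        for review in data["reviewInfo"]["reviewListInfo"]["reviewList"]:
            reviews[review["reviewId"]] = review
    return reviews
```

If every page was captured, len(load_reviews("/path/to/your/json/directory")) should come out to 232.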

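With the raw data saved, I next wanted to store it in a database and upload the images to OSS, so that the image URLs would not expire later. That is when I thought of the ChatGPT Code Interpreter, which I had only just gained access to, and decided to see whether it could simplify this workflow for me.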

Processing the saved data with the ChatGPT Code Interpreter

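Enabling Code Interpreter

Because I had only just received the notice that Code Interpreter was available to me, I first had to turn the feature on in Settings.

After that, Code Interpreter can be selected when starting a new chat.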

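Letting ChatGPT write the code for me

Once the Code Interpreter model is selected, a plus sign appears on the left side of the ChatGPT input box, and files can be uploaded through it. After uploading the file, I gave it one instruction: "Extract the key JSON information from the file and download the original images, i.e. big_url." ChatGPT then started working on its own. The process is a bit like the popular AutoGPT: the AI carries out the workflow automatically, one step at a time. For the detailed process, see my video: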

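My first ChatGPT Code Interpreter demo: parsing Dianping scraping results

Finally, here is a tutorial document that I asked ChatGPT to generate from the work above. It covers parsing the JSON files and saving the data to MongoDB and Google Cloud Storage: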

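Prerequisites

Before you start, you need to install the following Python libraries:

pymongo: for connecting to the MongoDB database.
requests: for downloading images from the web.
google-cloud-storage: for connecting to Google Cloud Storage and uploading images.

You can install them with the following command:

pip install pymongo requests google-cloud-storage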

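You also need to configure Google Cloud credentials so that the Google Cloud Storage Python client library can authenticate. For more information on setting the GOOGLE_APPLICATION_CREDENTIALS environment variable, see the Google Cloud SDK documentation.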
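Before running anything, it can save time to confirm that the variable is actually set. The sketch below uses only the standard library; the helper name is hypothetical, and it only checks that the variable points at an existing file, not that the key inside is valid. Application Default Credentials can also come from other sources, so a missing variable is not always an error.

```python
import os


def credentials_problem(environ=os.environ):
    """Return a description of what looks wrong, or None if the variable looks usable."""
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        return f"GOOGLE_APPLICATION_CREDENTIALS points to a missing file: {path}"
    return None


problem = credentials_problem()
print(problem or "credentials file found")
```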

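Implementation steps

The Python script we want to generate performs the following steps:

Connect to the MongoDB database.
Connect to Google Cloud Storage.
Iterate over the directory containing the JSON files.
For each JSON file, load the JSON data.
For each review, check whether it has associated images.
If it does, download those images and upload them to Google Cloud Storage.
Save the public URL returned by Google Cloud Storage to the new_image_url field in the reviewPics array.
Save the updated review data to the MongoDB database.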
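The middle steps boil down to walking a nested structure. The sketch below isolates that part with no network or database access, assuming the same field names that the generated script uses (reviewList, reviewId, reviewPics, bigurl); the function name and the sample data are made up.

```python
def iter_review_images(data: dict):
    """Yield (reviewId, image URL) for every picture attached to a review."""
    for review in data["reviewInfo"]["reviewListInfo"]["reviewList"]:
        # Reviews without pictures may lack the key entirely.
        for pic in review.get("reviewPics") or []:
            url = pic.get("bigurl")
            if url:
                yield review["reviewId"], url


sample = {
    "reviewInfo": {"reviewListInfo": {"reviewList": [
        {"reviewId": 1, "reviewPics": [{"bigurl": "https://example.com/a.jpg?x=1"}]},
        {"reviewId": 2},  # a review without pictures
    ]}}
}

print(list(iter_review_images(sample)))
# [(1, 'https://example.com/a.jpg?x=1')]
```

Keeping the traversal separate from the download and upload calls makes it easy to test against one saved response before touching any external service.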

Python script

Here is the generated Python script:

import os
import json

import requests
from pymongo import MongoClient
from google.cloud import storage

# Connect to your MongoDB instance
client = MongoClient('mongodb://localhost:27017/')
db = client['your_db_name']  # Replace with your database name
collection = db['your_collection_name']  # Replace with your collection name

# Root directory for storing downloaded images
img_root_dir = "/path/to/your/image/directory"  # Replace with your directory path
# Directory containing the saved JSON files
json_dir = "/path/to/your/json/directory"  # Replace with your directory path

# Initialize a Google Cloud Storage client and look up the bucket
gcs_client = storage.Client()
bucket_name = 'your_bucket_name'  # Replace with your bucket name
bucket = gcs_client.get_bucket(bucket_name)

# Loop over every JSON file in the directory
for json_name in os.listdir(json_dir):
    if not json_name.endswith('.json'):
        continue
    print(f"Processing file: {json_name}")
    # Load the JSON file (UTF-8, since the reviews contain Chinese text)
    with open(os.path.join(json_dir, json_name), 'r', encoding='utf-8') as file:
        data = json.load(file)

    for review in data['reviewInfo']['reviewListInfo']['reviewList']:
        # Not every review has pictures, so fall back to an empty list
        for pic in review.get('reviewPics') or []:
            if 'bigurl' not in pic:
                continue
            # Create a directory for the review if it doesn't exist
            review_dir = os.path.join(img_root_dir, str(review['reviewId']))
            os.makedirs(review_dir, exist_ok=True)
            # Extract the original file name from the URL (query string removed)
            img_name = os.path.basename(pic['bigurl'].split("?")[0])
            # Download the image and fail loudly on HTTP errors
            response = requests.get(pic['bigurl'], timeout=30)
            response.raise_for_status()
            img_path = os.path.join(review_dir, img_name)
            with open(img_path, 'wb') as f:
                f.write(response.content)
            print(f"Saved image: {img_path}")
            # Upload the image to Google Cloud Storage, prefixed with the review ID
            # so identical file names from different reviews don't collide
            blob = bucket.blob(f"{review['reviewId']}/{img_name}")
            blob.upload_from_filename(img_path)
            # Note: public_url only resolves if the object is publicly readable
            print(f"Uploaded image: {blob.public_url}")
            # Store the new URL next to the original bigurl
            pic['new_image_url'] = blob.public_url

        # Save every review, with or without pictures, to MongoDB
        collection.update_one({'reviewId': review['reviewId']}, {"$set": review}, upsert=True)
        print(f"Updated MongoDB record with new review data for reviewId: {review['reviewId']}")

print("Completed processing all files.")

Note that you need to replace 'your_db_name', 'your_collection_name', "/path/to/your/image/directory", "/path/to/your/json/directory", and 'your_bucket_name' with your actual values.
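One detail of the script worth isolating is how the image file name is derived: the query string is cut off first, then the last path segment is kept. Here is a sketch of that step on a made-up URL; the helper name is hypothetical.

```python
import os


def image_filename(url: str) -> str:
    """Drop the query string, then keep the last path segment."""
    return os.path.basename(url.split("?")[0])


print(image_filename("https://example.com/pics/abc123.jpg?w=800&h=600"))
# abc123.jpg
```

Two different reviews could in principle yield the same file name, so storing each image under its review ID is a cheap way to avoid collisions.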

Verifying the results

Images uploaded to Google Cloud Storage

Data uploaded to MongoDB Atlas

That is how we used the ChatGPT Code Interpreter to generate a Python script that handles extracting, saving, and uploading the data. I hope this tutorial is helpful!
