Ctrip Popular Scenic Area Review Scraping#
Introduction#
Recently, I took part in a competition that required scraping reviews and information about popular scenic areas in the provincial capitals of Yunnan, Guizhou, and Sichuan. I looked at various projects online and tried most of them, but they were too cumbersome to use: some required hunting down request parameters one by one, which didn't fit my needs, so I decided to write a scraper myself. First, let's take a look at the results.
The scraped data is saved to Excel
Scraping in progress
After a while of scraping, I successfully collected reviews of popular scenic areas from all cities in the three provinces of Yunnan, Guizhou, and Sichuan, totaling 280,000 data points. Not easy! 😭😭😭
Now, let me share the process of this scraping. 🚀🚀🚀
Note: the code shown here is not complete; for the full code, see aglorice/CtripSpider: a Ctrip review scraper that uses a thread pool to scrape popular scenic area reviews, simple and easy to use, with one-click scraping of all popular scenic areas in any province (github.com).
1. Analyze the Page#
First, go to the Ctrip Guide · Scenic Areas page and hover the cursor over Domestic (including Hong Kong, Macao, and Taiwan) to get the cities of almost every province. This is the source of our city data.
Open the console and you can quickly locate the relevant elements, as shown below.
With that, let's write the code. Here I use BeautifulSoup to parse the page.
def get_areas(self) -> list:
    """Fetch every province's city list (names and scenic-area URLs) from the Ctrip home page."""
    city_list = []
    try:
        res = self.sees.get(
            url=GET_HOME,
            headers={
                "User-Agent": get_fake_user_agent("pc")
            },
            proxies=my_get_proxy(),
            timeout=TIME_OUT
        )
    except Exception as e:
        self.console.print(f"[red]Failed to get city scenic area information, {e}, you can check your network or proxy.", style="bold red")
        exit()
    res_shop = BeautifulSoup(res.text, "lxml")
    # Each province is a "city-selector-tab-main-city" block: a title plus a list of city links
    areas = res_shop.find_all("div", attrs={"class": "city-selector-tab-main-city"})
    for area in areas:
        area_title = area.find("div", attrs={"class": "city-selector-tab-main-city-title"}).string
        if area_title is None:
            continue
        area_items = area.find_all("div", attrs={"class": "city-selector-tab-main-city-list"})
        area_items_list = [{"name": item.string, "url": item["href"]} for item in area_items[0].find_all("a")]
        city_list.append({
            "name": area_title,
            "city": area_items_list
        })
    return city_list
With this method, the names and urls of all cities in the specified provinces are saved to city.json. The reason for saving them first is mainly convenience for customization: you can freely add or remove the cities you want to scrape as needed. The scraping results are as follows:
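To make the structure concrete, here is a minimal sketch of what city.json ends up looking like and how it can be written out; the province and city values below are only illustrative placeholders, and the real content is whatever get_areas returns:

import json

# Minimal sketch (not the project's exact code); city_list stands for the return value of get_areas()
city_list = [
    {
        "name": "Yunnan",  # province title, illustrative value only
        "city": [
            {"name": "Kunming", "url": "https://you.ctrip.com/place/..."},  # illustrative entry
        ],
    },
]
with open("city.json", "w", encoding="utf-8") as f:
    json.dump(city_list, f, ensure_ascii=False, indent=4)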
Next, we open the urls of these cities. As shown below, the homepage displays the popular scenic areas and attractions:
The preliminary work is done; now let's scrape the reviews for the corresponding scenic areas.
2. Scenic Area Review Scraping#
Open the reviews of any scenic area and check the requests in the console. As shown:
First, let's analyze the parameters. After multiple attempts, we can identify which ones are dynamic. The first is _fxpcqlniredt, which can be found quickly by checking the cookies.
The second is x-traceID. By reverse-engineering the JS, I located the code that generates it, as shown:
Once we know how it is generated, things are simple; let's write the code.
def generate_scene_comments_params(self) -> dict:
    """
    Generate params for requesting scenic area reviews
    :return:
    """
    random_number = random.randint(100000, 999999)
    return {
        # _fxpcqlniredt is simply the GUID cookie set by Ctrip
        "_fxpcqlniredt": self.sees.cookies.get("GUID"),
        # x-traceID is GUID + microsecond timestamp + a random 6-digit number
        "x-traceID": self.sees.cookies.get("GUID") + "-" + str(int(time.time() * 1000000)) + "-" + str(
            random_number)
    }
At this point, we are almost done. The only thing left is the poild parameter. It actually appears on every scenic area page, inside a script tag, so it can be extracted; however, that means one extra request per scenic area, which is too time-consuming. It would be much better if we could request the review data directly without visiting the scenic area page at all. So let's change our approach: open Ctrip's h5 (mobile) page, and we find that its review interface is different from the pc one, as shown below. On the mobile side, the poild parameter is not needed at all.
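To make the switch concrete, here is a rough sketch of what a request against the h5 review interface could look like. The endpoint constant GET_COMMENTS_URL and the body fields are placeholders for illustration; the actual values live in the project's config:

import random
import time

import requests

from fake_user_agent import get_fake_user_agent  # the ua helper described in the next section

# Hypothetical placeholder; the real h5 review endpoint is kept in the project's config
GET_COMMENTS_URL = "https://m.ctrip.com/restapi/..."


def fetch_comment_page(sess: requests.Session, poi_id: int, page: int) -> dict:
    guid = sess.cookies.get("GUID")
    # The same two dynamic params built by generate_scene_comments_params() above
    params = {
        "_fxpcqlniredt": guid,
        "x-traceID": f"{guid}-{int(time.time() * 1000000)}-{random.randint(100000, 999999)}",
    }
    # Hypothetical body: which scenic area (poiId) and which page of reviews to fetch
    payload = {"arg": {"poiId": poi_id, "pageIndex": page, "pageSize": 10}}
    res = sess.post(
        GET_COMMENTS_URL,
        params=params,
        json=payload,
        headers={"User-Agent": get_fake_user_agent("mobile")},  # mobile ua for the h5 interface
        timeout=10,
    )
    return res.json()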
With that, the core of the scraper is essentially finished; what remains is handling the various issues that come up during scraping, the most important being Ctrip's anti-scraping measures. Since I used a thread pool to speed things up, requests went out quite fast, so I used random ua strings and a proxy pool, along with various fault-tolerance mechanisms, to keep the scraping stable. Below is what happens when the interface is hit too frequently:
3. Solutions to Ctrip's Anti-Scraping Measures#
The first anti-scraping countermeasure is a random ua. I originally used fake-useragent, but after switching to the h5 interface the ua has to be a mobile one, which that library does not support, so I hand-rolled a generator; it is simple but practical.
# -*- coding = utf-8 -*-
# @Time :2023/7/13 21:32
# @Author :Xiao Yue
# @Email :[email protected]
# @PROJECT_NAME :scenic_spots_comment
# @File : fake_user_agent.py
from fake_useragent import UserAgent
import random

from config import IS_FAKE_USER_AGENT


def get_fake_user_agent(ua: str, default=True) -> str:
    match ua:
        case "mobile":
            if IS_FAKE_USER_AGENT and default:
                ua = get_mobile_user_agent()
                return ua
            else:
                return "Mozilla/5.0 (Linux; Android 8.0.0; SM-G955U Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36 Edg/114.0.0.0"
        case "pc":
            if IS_FAKE_USER_AGENT and default:
                ua = UserAgent()
                return ua.random
            else:
                return "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Mobile Safari/537.36 Edg/103.0.1264.49"


def get_mobile_user_agent() -> str:
    platforms = [
        'iPhone; CPU iPhone OS 14_6 like Mac OS X',
        'Linux; Android 11.0.0; Pixel 5 Build/RD1A.201105.003',
        'Linux; Android 8.0.0; Pixel 5 Build/RD1A.201105.003',
        'iPad; CPU OS 14_6 like Mac OS X',
        'iPad; CPU OS 15_6 like Mac OS X',
        'Linux; U; Android 9; en-us; SM-G960U Build/PPR1.180610.011',  # Samsung Galaxy S9
        'Linux; U; Android 10; en-us; SM-G975U Build/QP1A.190711.020',  # Samsung Galaxy S10
        'Linux; U; Android 11; en-us; SM-G998U Build/RP1A.200720.012',  # Samsung Galaxy S21 Ultra
        'Linux; U; Android 9; en-us; Mi A3 Build/PKQ1.180904.001',  # Xiaomi Mi A3
        'Linux; U; Android 10; en-us; Mi 10T Pro Build/QKQ1.200419.002',  # Xiaomi Mi 10T Pro
        'Linux; U; Android 11; en-us; LG-MG870 Build/RQ1A.210205.004',  # LG Velvet
        'Linux; U; Android 11; en-us; ASUS_I003D Build/RKQ1.200826.002',  # Asus ROG Phone 3
        'Linux; U; Android 10; en-us; CLT-L29 Build/10.0.1.161',  # Huawei P30 Pro
    ]
    browsers = [
        'Chrome',
        'Firefox',
        'Safari',
        'Opera',
        'Edge',
        'UCBrowser',
        'SamsungBrowser'
    ]
    platform = random.choice(platforms)
    browser = random.choice(browsers)
    match browser:
        case 'Chrome':
            version = random.randint(70, 90)
            return f'Mozilla/5.0 ({platform}) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/{version}.0.{random.randint(1000, 9999)}.{random.randint(10, 99)} Mobile Safari/537.36'
        case 'Firefox':
            version = random.randint(60, 80)
            return f'Mozilla/5.0 ({platform}; rv:{version}.0) Gecko/20100101 Firefox/{version}.0'
        case 'Safari':
            version = random.randint(10, 14)
            return f'Mozilla/5.0 ({platform}) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/{version}.0 Safari/605.1.15'
        case 'Opera':
            version = random.randint(60, 80)
            return f'Mozilla/5.0 ({platform}) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/{version}.0.{random.randint(1000, 9999)}.{random.randint(10, 99)} Mobile Safari/537.36 OPR/{version}.0'
        case 'Edge':
            version = random.randint(80, 90)
            return f'Mozilla/5.0 ({platform}) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/{version}.0.{random.randint(1000, 9999)}.{random.randint(10, 99)} Mobile Safari/537.36 Edg/{version}.0'
        case 'UCBrowser':
            version = random.randint(12, 15)
            return f'Mozilla/5.0 ({platform}) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 UBrowser/{version}.1.2.49 Mobile Safari/537.36'
        case 'SamsungBrowser':
            version = random.randint(10, 14)
            return f'Mozilla/5.0 ({platform}) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/{version}.0 Chrome/63.0.3239.26 Mobile Safari/537.36'
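For completeness, a quick usage sketch: get_areas above uses the pc branch, while the h5 review requests use the mobile one.

headers_mobile = {"User-Agent": get_fake_user_agent("mobile")}  # random mobile ua for the h5 interface
headers_pc = {"User-Agent": get_fake_user_agent("pc")}          # random desktop ua via fake-useragent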
What remains is pairing the thread pool with a proxy pool; for the proxies I used the open-source project jhao104/proxy_pool: Python Scraper Proxy IP Pool (github.com).
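As a rough illustration of how the proxy pool plugs in, here is a sketch of a my_get_proxy-style helper (the name matches the call used in get_areas above, but this body assumes a default local proxy_pool deployment on port 5010) together with a stand-in thread-pool driver; neither is the project's exact code:

from concurrent.futures import ThreadPoolExecutor

import requests

PROXY_POOL_API = "http://127.0.0.1:5010/get/"  # assumes proxy_pool is running locally with its defaults


def my_get_proxy() -> dict | None:
    """Fetch a random proxy from the local proxy_pool service, formatted for requests."""
    try:
        proxy = requests.get(PROXY_POOL_API, timeout=5).json().get("proxy")
    except Exception:
        return None  # fall back to a direct connection if the pool is unreachable
    if not proxy:
        return None
    return {"http": f"http://{proxy}", "https": f"http://{proxy}"}


def scrape_scenic_area(area: dict):
    # Stand-in worker: in the real project this walks every review page of one scenic area
    print(area["name"], my_get_proxy())


def scrape_all(scenic_areas: list):
    # Thread pool: each scenic area is handled by its own worker thread
    with ThreadPoolExecutor(max_workers=8) as pool:
        pool.map(scrape_scenic_area, scenic_areas)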
With this, we are almost done. 👀👀👀
4. Conclusion#
From this Ctrip scraping project, the main lesson I can summarize is this: when you run into a problem, it pays to broaden your thinking and try more than one approach.