Aglorice


Ctrip Popular Scenic Area Review Crawling

Introduction

Recently, I took part in a competition that required scraping reviews and information about popular scenic areas in several provincial capitals of Yunnan, Guizhou, and Sichuan. I tried most of the existing projects I found online, but they were too cumbersome to operate; some required looking up request parameters one by one. None of them met my needs, so I decided to write my own. First, let's take a look at the results.

The scraped data is saved to Excel:

image

Scraping in progress

image

After scraping for a while, I had collected reviews of popular scenic areas from all cities in the three provinces of Yunnan, Guizhou, and Sichuan, 280,000 records in total. Not easy! 😭😭😭

Now, let me walk through how the scraper works. 🚀🚀🚀

Note: The code shared in this post is not complete. For the complete code, see aglorice/CtripSpider (github.com): a Ctrip review scraper that uses a thread pool to scrape reviews of popular scenic areas. It is simple to use and can scrape all popular scenic areas in any province with one click.

1. Analyze the Page

First, go to Ctrip's Guide / Scenic Area page and hover the cursor over Domestic (including Hong Kong, Macao, and Taiwan). The menu that appears lists the cities of almost every province, and this is the source of our city data.

image

image

Open the browser's developer console and locate the corresponding element, as shown below.

image

With this, let's write the code. Here, I use BeautifulSoup to parse the page.
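The original code for this step is not reproduced here, so what follows is only a minimal sketch of the idea, not the project's implementation. It assumes a hypothetical menu markup (`div.province`, `span.province-name`, and the `example.com` links are all made up for illustration); the real Ctrip page uses different tags and class names, which you would read off the console as described above. The function names are also hypothetical.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real Ctrip menu differs.
SAMPLE_HTML = """
<div class="province">
  <span class="province-name">Yunnan</span>
  <a href="https://example.com/place/kunming.html">Kunming</a>
  <a href="https://example.com/place/dali.html">Dali</a>
</div>
<div class="province">
  <span class="province-name">Hainan</span>
  <a href="https://example.com/place/sanya.html">Sanya</a>
</div>
"""


def extract_cities(html, wanted_provinces):
    """Return a list of {"province", "city", "url"} dicts for the wanted provinces."""
    soup = BeautifulSoup(html, "html.parser")
    cities = []
    for block in soup.find_all("div", class_="province"):
        name_tag = block.find("span", class_="province-name")
        if name_tag is None:
            continue
        province = name_tag.get_text(strip=True)
        if province not in wanted_provinces:
            continue
        # Every link inside the province block is treated as one city.
        for link in block.find_all("a", href=True):
            cities.append({
                "province": province,
                "city": link.get_text(strip=True),
                "url": link["href"],
            })
    return cities


def save_cities(cities, path="city.json"):
    """Write the city list to disk so it can be edited by hand before scraping."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cities, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    print(json.dumps(extract_cities(SAMPLE_HTML, {"Yunnan"}), indent=2))
```

Calling `save_cities(extract_cities(html, {"Yunnan", "Guizhou", "Sichuan"}))` on the real page source would then produce the city.json file.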

With this method, the names and URLs of all cities in the specified provinces are saved to city.json. They are saved to a file first mainly to make customization easier: you can freely add or remove cities depending on what you want to scrape. The result looks like this:

image

Next, we open these URLs. As shown below, each city's homepage displays its popular scenic areas or attractions:

image

The preliminary work is done; now let's scrape the reviews for the corresponding scenic areas.

2. Scenic Area Review Scraping

image

Open the reviews of any scenic area and inspect the requests in the console, as shown:

image

First, let's analyze the request parameters. After several attempts, we can identify which ones are dynamic. The first is _fxpcqlniredt, whose value can be found quickly by checking the cookies.

image

The second is x-traceID. By reverse-engineering the page's JS, I found the code that generates it, as shown:

image

Knowing how it is generated makes it simple; let's write the code.
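The screenshot with the actual generation logic is not reproduced here, so the sketch below rests on an assumption: that the trace ID is a concatenation of the client ID (the same value sent as _fxpcqlniredt), a millisecond timestamp, and a random number. Check this against the JS you find yourself; the function names are hypothetical.

```python
import random
import time


def make_trace_id(client_id, now_ms=None, rng=random):
    """Build '<client_id>-<ms timestamp>-<random number>'.

    ASSUMPTION: this format is a guess for illustration, not confirmed
    by the original post.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    suffix = rng.randint(0, 9_999_999)
    return f"{client_id}-{now_ms}-{suffix}"


def build_dynamic_params(client_id):
    """Collect the two dynamic request parameters discussed above."""
    return {
        "_fxpcqlniredt": client_id,
        "x-traceID": make_trace_id(client_id),
    }


if __name__ == "__main__":
    # "my-client-id" is a placeholder; the real value comes from the cookies.
    print(build_dynamic_params("my-client-id"))
```

Generating a fresh trace ID per request, rather than reusing one, is the design choice this sketch makes.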

At this point, we are almost done. The only remaining problem is the poiId parameter. It is present on every scenic area page, inside a script tag, but reading it costs an extra request per scenic area, which is too time-consuming. It would be much better to request the review data directly without visiting the scenic area page first. So, let's change our approach: on Ctrip's h5 (mobile) site, the comment retrieval interface is different from the PC version, as shown:

image

On the mobile side, the poiId parameter is not needed. With that, the core of the scraper is finished; what remains is handling the problems that come up during scraping, the most important being Ctrip's anti-scraping measures. Because I used a thread pool to speed things up, the request rate was quite high. To keep scraping stable, I used a random ua and a proxy pool, along with various fault tolerance mechanisms. Below is what the interface returns when it is accessed too frequently:

image
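The thread pool and fault tolerance mentioned above can be sketched roughly as follows. This is not the project's code: `with_retries` and `scrape_all` are hypothetical names, and `task` stands in for whatever function fetches the reviews of one scenic area. Failed items are collected instead of aborting the whole run.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def with_retries(task, arg, attempts=3, delay=0.5):
    """Call task(arg); on an exception, pause and retry up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return task(arg)
        except Exception:
            if attempt == attempts:
                raise
            # Wait a little longer after each failure before retrying.
            time.sleep(delay * attempt)


def scrape_all(items, task, max_workers=8, attempts=3, delay=0.5):
    """Run task over items in a thread pool and return (results, failures)."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(with_retries, task, item, attempts, delay): item
            for item in items
        }
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:
                # Keep going; record the failure so it can be retried later.
                failures[item] = exc
    return results, failures
```

A smaller `max_workers` or a longer `delay` trades speed for a lower chance of triggering the rate limit shown above.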

3. Solutions to Ctrip's Anti-Scraping Measures

The first countermeasure is to use a random ua. Previously, I used fake-useragent, but after switching to the h5 interface, the ua has to be that of a mobile device. That library did not support this, so I wrote a small one by hand, which is simple but practical.
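A hand-rolled mobile ua picker can be as small as the sketch below. The list of strings is illustrative and not taken from the project; any real list would need to be kept reasonably current. `random_mobile_headers` is a hypothetical name.

```python
import random

# Illustrative mobile user-agent strings (assumed examples, not the project's list).
MOBILE_USER_AGENTS = [
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_5 like Mac OS X) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.5 Mobile/15E148 Safari/604.1",
    "Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/114.0.0.0 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 12; SM-G991B) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/113.0.0.0 Mobile Safari/537.36",
]


def random_mobile_headers(rng=random):
    """Return request headers with a randomly chosen mobile User-Agent."""
    return {"User-Agent": rng.choice(MOBILE_USER_AGENTS)}


if __name__ == "__main__":
    print(random_mobile_headers())
```

Calling `random_mobile_headers()` once per request, rather than once per run, keeps consecutive requests from sharing the same ua.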

The remaining task is to use a proxy pool. Here I used the open-source project jhao104/proxy_pool: Python Scraper Proxy IP Pool (github.com).
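As a rough sketch of how a scraper can talk to proxy_pool: the project exposes a small HTTP API with `/get/` and `/delete/` endpoints. The address below assumes a local deployment on port 5010; adjust it to your own setup. The helper names and the injectable `fetch` argument are choices made for this sketch, not part of proxy_pool or of the original project.

```python
import json
import urllib.parse
import urllib.request

# Assumed local deployment address; change it to match your proxy_pool instance.
POOL_URL = "http://127.0.0.1:5010"


def _http_get(url):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def get_proxy(fetch=_http_get):
    """Ask the pool for one proxy; return 'ip:port', or None if none is available."""
    data = json.loads(fetch(f"{POOL_URL}/get/"))
    return data.get("proxy")


def delete_proxy(proxy, fetch=_http_get):
    """Tell the pool to drop a proxy that has stopped working."""
    query = urllib.parse.urlencode({"proxy": proxy})
    fetch(f"{POOL_URL}/delete/?{query}")


def as_proxies(proxy):
    """Build the scheme-to-proxy mapping that HTTP clients such as requests accept."""
    return {"http": f"http://{proxy}", "https": f"http://{proxy}"}
```

A typical loop takes a proxy with `get_proxy()`, passes `as_proxies(proxy)` to the HTTP client, and calls `delete_proxy(proxy)` when a request through it fails.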

With this, we are almost done. 👀👀👀

4. Conclusion

Scraping Ctrip taught me one lesson in particular: when you run into a problem, it helps to broaden your thinking and try more than one solution.

Project address: aglorice/CtripSpider (github.com), a Ctrip review scraper that uses a thread pool to scrape reviews of popular scenic areas. It is simple to use and can scrape all popular scenic areas in any province with one click.
