Ctrip Popular Scenic Area Review Scraping#
Introduction#
Recently, I took part in a competition that required scraping reviews and information about popular scenic areas in the provincial capitals of Yunnan, Guizhou, and Sichuan. I looked at various projects online and tried most of them, but the workflows were too cumbersome: some required hunting down request parameters one by one, and none met my needs, so I decided to write my own. First, let's take a look at the results.
The scraped data is saved to Excel
Scraping in progress
After scraping for a while, I had collected reviews of popular scenic areas from every city in the three provinces of Yunnan, Guizhou, and Sichuan, 280,000 records in total. Not easy! 😭😭😭
Now, let me share the process of this scraping. 🚀🚀🚀
Note: the code shared here is not complete. For the full code, see aglorice/CtripSpider: a Ctrip review scraper that uses a thread pool to scrape popular scenic area reviews, simple and easy to use, with one-click scraping of all popular scenic areas in any province (github.com).
1. Analyze the Page#
First, go to Ctrip's Guide · Scenic Areas page and hover over Domestic (including Hong Kong, Macao, and Taiwan): the dropdown lists the cities of nearly every province. This is the source of our city data.
Open the browser console and you can quickly locate the city elements, as shown below.
With this, let's write the code. Here, I use BeautifulSoup to parse the page.
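Below is a minimal sketch of this step. The landing-page URL and the CSS selector are illustrative assumptions, so copy the real ones from the element you located in the console; the full version is in the repository.

```python
import json

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
html = requests.get("https://you.ctrip.com/place/", headers=headers, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

cities = {}
# Hypothetical selector: each city is an <a> tag inside the "Domestic"
# dropdown. Replace it with the selector you found in the console.
for a in soup.select("ul.city-list a"):
    cities[a.get_text(strip=True)] = "https://you.ctrip.com" + a["href"]

# Save first so the city list can be edited by hand before scraping.
with open("city.json", "w", encoding="utf-8") as f:
    json.dump(cities, f, ensure_ascii=False, indent=2)
```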
This step saves the names and urls of all cities in the specified provinces to city.json. Saving them first is mainly for convenience: you can freely add or remove cities from the file depending on what you want to scrape. The results look like this:
Next, open these urls. As shown below, the homepage displays the popular scenic areas and attractions:
The preliminary work is done; now let's scrape the reviews for the corresponding scenic areas.
2. Scenic Area Review Scraping#
Open the reviews of any scenic area and check the requests in the console. As shown:
First, let's analyze the parameters. After several attempts, we can identify which ones are dynamic. The first is _fxpcqlniredt, whose value can be found quickly by checking the cookies.
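A small sketch of grabbing that value. The cookie name ("GUID") is an assumption based on inspecting the requests; confirm it in your own console.

```python
import requests

session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
# Any ordinary page view makes Ctrip set its cookies on the session.
session.get("https://you.ctrip.com/", timeout=10)
# Assumption: _fxpcqlniredt carries the value of the GUID cookie.
guid = session.cookies.get("GUID", "")
print(guid)
```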
The second is x-traceID. By reversing the JS, I directly found the relevant code. As shown:
Once we know how it is generated, writing the code is simple.
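Here is a sketch assuming the pattern visible in the reversed JS: the cookie value, a millisecond timestamp, and a random number joined by hyphens. Verify the exact pieces against the code you find in the debugger.

```python
import random
import time

def make_trace_id(guid: str) -> str:
    """Build an x-traceID from the _fxpcqlniredt cookie value."""
    return f"{guid}-{int(time.time() * 1000)}-{random.randint(100000, 9999999)}"
```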
At this point, we are almost done. The only remaining piece is the poiId, which exists on every scenic area page, inside a script tag. Fetching it, however, adds an extra request per scenic area, which is too time-consuming. It would be much better if we could request the review data without entering the scenic area page at all. So let's change our approach: open Ctrip's H5 (mobile) page, and we find that its review interface is different from the PC version, as shown:
The mobile interface does not need the poiId parameter at all. With that, the core work is done; what remains is handling the various issues that come up during scraping, the most important being Ctrip's anti-scraping measures. Since I used a thread pool for speed, the request rate was quite high. To deal with this, I used random UAs and a proxy pool, along with various fault-tolerance mechanisms, to keep the scraping stable. Below is what happens when the interface is hit too frequently:
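To make the switch concrete, here is a hedged sketch of calling the H5 review interface. The endpoint path and the payload fields (resourceId, pageIndex, sortType, head) are what the Network panel showed me at the time and should be treated as assumptions; re-check them in your own console before relying on them.

```python
import random
import time

import requests

H5_COMMENT_API = "https://m.ctrip.com/restapi/soa2/13444/json/getCommentCollapseList"

def fetch_comments(session: requests.Session, guid: str, spot_id: int, page: int) -> dict:
    """Fetch one page of reviews for a scenic area through the H5 interface."""
    trace_id = f"{guid}-{int(time.time() * 1000)}-{random.randint(100000, 9999999)}"
    params = {"_fxpcqlniredt": guid, "x-traceID": trace_id}
    payload = {
        # resourceId is the scenic-area id visible in its url on the city
        # page, so no extra request to the detail page is needed.
        "arg": {"resourceId": spot_id, "pageIndex": page, "pageSize": 10, "sortType": 3},
        "head": {"cid": guid, "syscode": "09", "extension": []},
    }
    resp = session.post(H5_COMMENT_API, params=params, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()

# Usage with a thread pool, the speed-up mentioned above:
# from concurrent.futures import ThreadPoolExecutor
# with ThreadPoolExecutor(max_workers=8) as pool:
#     pages = pool.map(lambda p: fetch_comments(session, guid, spot_id, p), range(1, 51))
```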
3. Solutions to Ctrip's Anti-Scraping Measures#
The first counter-measure is a random UA. I previously used fake-useragent, but after switching to the H5 interface the UA has to be a mobile one, which that library does not cover, so I hand-rolled one myself: simple but practical.
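Something along these lines; the specific UA strings are ordinary real-world mobile User-Agents and can be swapped for any others you like.

```python
import random

MOBILE_UAS = [
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Mobile/15E148 Safari/604.1",
    "Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/114.0.0.0 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 12; SM-G9910) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/110.0.0.0 Mobile Safari/537.36",
]

def random_mobile_ua() -> str:
    """Pick a random mobile User-Agent for each request."""
    return random.choice(MOBILE_UAS)
```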
The remaining piece is the proxy pool; here I used the open-source project jhao104/proxy_pool: a proxy IP pool for Python scrapers (github.com).
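With a proxy_pool instance running locally (default port 5010; adjust if you configured it differently), fetching and retiring proxies looks like this. The /get/ and /delete/ endpoints come from the project's README.

```python
import requests

POOL = "http://127.0.0.1:5010"

def get_proxy():
    """Ask proxy_pool for a random proxy, e.g. '1.2.3.4:8080'."""
    data = requests.get(f"{POOL}/get/", timeout=5).json()
    return data.get("proxy")

def delete_proxy(proxy):
    """Remove a dead proxy from the pool."""
    requests.get(f"{POOL}/delete/", params={"proxy": proxy}, timeout=5)

proxy = get_proxy()
if proxy:
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    # Pass proxies=proxies to every request; call delete_proxy(proxy)
    # once a proxy fails repeatedly, then fetch a fresh one.
    print("using proxy:", proxy)
```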
With this, we are almost done. 👀👀👀
4. Conclusion#
Looking back on this Ctrip scraping project, the main lesson is: when you run into a problem, broaden your thinking and try more angles. Here, switching from the PC interface to the H5 one sidestepped the poiId problem entirely.