This is how Scraping / Crawling of Single Page Applications (SPAs) works – quick and easy.
Challenges of Single Page Applications
Additionally: if the developers of the single page application are smart, they may even have configured the server to time out after your initial request, based on your IP address. Crawling single page applications is hell!
ScrapingBee – easily scrape / crawl Single Page Applications with Python
In this post I’m only going to introduce one way to scrape SPAs in Python: the easiest and fastest I’ve found. My specific use cases are explained in the next section. I discovered a small tool called ScrapingBee. The company behind ScrapingBee is based in Paris, France, and provides a simple API for its services. Its technology is flexible enough to crawl single page applications, and it integrates easily with Python.
With this small code snippet I get all the info I need for my case from a website or multiple subpages.
The tool sends an API request via its client, and that request carries further specifications for how to crawl the target URL. For the SPA that I want to crawl, I specify the following through the “params” object:
- Then please wait a second in case the scripts need something (js_scenario + wait)
- I would like to have a json response (json_response)
- And via extract_rules I define, by CSS selector or XPath, what information I need from the page (I usually don’t need all of the content).
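A minimal sketch of such a request, assuming the official `scrapingbee` Python package is installed and you have an API key. The selectors and field names (`headline`, `links`) are hypothetical placeholders, not the rules I actually used; adjust them to the page you are crawling.

```python
def build_params():
    """Build the request options described in the list above."""
    return {
        # Wait one second (1000 ms) so the page's scripts can finish loading.
        "js_scenario": {"instructions": [{"wait": 1000}]},
        # Ask for a JSON response instead of raw HTML.
        "json_response": True,
        # Only extract what is needed; these selectors are hypothetical.
        "extract_rules": {
            "headline": "h1",
            "links": {"selector": "a", "type": "list", "output": "@href"},
        },
    }


def scrape(url, api_key, params):
    """Send one request through the ScrapingBee client and return the parsed JSON."""
    # Imported here so the parameter-building part runs without the package.
    from scrapingbee import ScrapingBeeClient  # pip install scrapingbee

    client = ScrapingBeeClient(api_key=api_key)
    response = client.get(url, params=params)
    response.raise_for_status()
    return response.json()


# Usage (needs a real API key and network access):
# data = scrape("https://www.remind.me/", "YOUR-API-KEY", build_params())
# print(data)
```

Keeping the parameters in their own function makes it easy to reuse the same rules for several subpages.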
With the response I can do what I want: just print it to the terminal, or write the whole thing into a CSV at the end so I can analyze it further. I could even search the response for links and feed them into another ScrapingBee call.
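Both follow-up steps can be sketched with the standard library alone. The `extracted` dictionary below is made-up sample data standing in for an extracted response, and the helper names `to_csv` and `same_site_links` are my own, not part of ScrapingBee.

```python
import csv
import io
from urllib.parse import urljoin, urlparse


def to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text; keys not in fieldnames are dropped."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def same_site_links(links, base_url):
    """Resolve relative links and keep each same-host URL once, in order."""
    host = urlparse(base_url).netloc
    result = []
    for link in links:
        absolute = urljoin(base_url, link)
        if urlparse(absolute).netloc == host and absolute not in result:
            result.append(absolute)
    return result


# Hypothetical sample data standing in for an extracted response.
base = "https://example.com/"
extracted = {
    "headline": "Example headline",
    "links": ["/about", "https://example.com/pricing",
              "https://other.example.org/", "/about"],
}

csv_text = to_csv([{"url": base, "headline": extracted["headline"]}],
                  ["url", "headline"])
next_urls = same_site_links(extracted["links"], base)
```

To write to disk instead, pass a file opened with `open("result.csv", "w", newline="")` to `csv.DictWriter`. The URLs in `next_urls` are the candidates for the next round of calls.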
As a response, I only get what I really need from the page.
My Scraping Use Cases
Admittedly, my business model is not website scraping! I do occasionally need scraping scripts for an SEO audit, or to build a database that I can then exploit in Power BI. But I use it in commercial projects, so I have no problem with the 50€ monthly subscription.
In the past I have also used Beautiful Soup, or Power Query itself, to get data from websites. But with single page applications these tools usually reach their limits. I would have to additionally configure headless Chrome instances or Selenium on my machine, which in my view causes too much overhead for my purposes. It’s not like I would set up a server for daily data acquisition; I just need it every now and then. So the easiest way for me so far is ScrapingBee.
Another plus of the API architecture is that I can also test my requests in advance via Postman before switching to the code editor!
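To try a request in Postman first, you need the plain HTTP form of the call. The sketch below builds such a GET URL, assuming ScrapingBee's documented HTTP endpoint and that structured options are passed as JSON strings in the query; the function name and the example rule are hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumed HTTP endpoint of the ScrapingBee API.
API_ENDPOINT = "https://app.scrapingbee.com/api/v1/"


def build_request_url(api_key, target_url, extract_rules, wait_ms=1000):
    """Return a GET URL that can be pasted into Postman or a browser."""
    query = {
        "api_key": api_key,
        "url": target_url,
        "json_response": "true",
        # Structured options travel as JSON-encoded strings.
        "js_scenario": json.dumps({"instructions": [{"wait": wait_ms}]}),
        "extract_rules": json.dumps(extract_rules),
    }
    return API_ENDPOINT + "?" + urlencode(query)


request_url = build_request_url(
    "YOUR-API-KEY", "https://www.remind.me/", {"headline": "h1"}
)
```

Once the response looks right in Postman, the same options can be moved into the Python client call.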
I used the website https://www.remind.me/ for demonstration purposes because I just had it at hand and the site is a very classic SPA. Sorry if I caused extra traffic with my 5 calls!