August 2, 2022

Industrialising Scrapers


At Stockly, we started scraping e-commerce websites to evaluate retailers' stock-outs. This was at the very beginning of the company, and it helped us understand our market as well as the need for what we wanted to build.

Scrapers, how?

We want to create the first global inventory for e-retailers to share stock. In order to do that, we need to know the stock of all the partner suppliers we integrate. Although scrapers are not the only integration method we provide, they allowed us to move fast in the beginning and integrate new suppliers in no time.

Since practically all e-commerce websites follow the same architecture, here's how our scrapers get the product inventory information:

  • Go to the product listing page(s) of the website
  • For each product listed, go to its specific product page
  • Retrieve the product information (sketched in code right after this list)
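
To make this concrete, here is a minimal sketch of that flow in Node.js with cheerio; the URL, CSS classes and field names below are made up for the example, not our actual configuration.

```js
// Minimal sketch of the generic scrape flow (Node 18+ for the global fetch).
const cheerio = require('cheerio');

async function scrapeListing(listingUrl) {
  // 1. Go to the product listing page
  const $ = cheerio.load(await (await fetch(listingUrl)).text());

  // 2. For each product listed, collect the URL of its specific product page
  const productUrls = $('a.product-card')
    .map((_, el) => new URL($(el).attr('href'), listingUrl).href)
    .get();

  // 3. Retrieve the information on every product page
  const products = [];
  for (const url of productUrls) {
    const $$ = cheerio.load(await (await fetch(url)).text());
    products.push({
      name: $$('h1.product-name').text().trim(),
      brand: $$('.product-brand').text().trim(),
      price: $$('.product-price').text().trim(),
    });
  }
  return products;
}
```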

So a new retailer to scrape could be defined by just an initial URL and a list of CSS selectors (product page URL, product price, product name, product brand, etc.). We would then build the scraper given this config, and voilà! Retailer integrated.
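
Such a config could look roughly like this (the retailer URL, field names and selectors are illustrative):

```js
// Hypothetical retailer config: an initial URL plus a list of CSS selectors.
const retailerConfig = {
  initialUrls: ['https://www.example-retailer.com/sneakers'],
  selectors: {
    productPageUrl: 'a.product-card', // link to each product page on the listing
    productName: 'h1.product-name',
    productBrand: '.product-brand',
    productPrice: '.product-price',
  },
};
```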

Not so fast.

What if tomorrow we also needed to scrape other types of websites? And what about the few websites that don't actually follow this architecture? This is why we decided to work on a flexible architecture that would allow us both to scrape standard e-commerce websites easily and to adapt to any kind of website we needed.

A Pipeline Architecture

Inspired by the native Node.js classes, our scrapers are built and run through a pipeline.

A pipeline is basically a structure in which we can build pipes to do stuff. Given a configuration, the pipeline will build the pipes and launch the first pipe with an initial value.

As for the pipes, they are the objects in which we define operands (the objects that actually "do stuff") and how those operands should be run (maximum operations at a time, number of retries…). To sum up (with a stripped-down sketch in code after the list):

  • Pipeline: Builds the pipes and launches them
  • Pipes: Contain operands and define how they should be run
  • Operands: Perform operations
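
Here is what those three layers could look like, heavily simplified; the real classes also handle concurrency limits, error reporting and more, so take this as a sketch rather than our actual implementation.

```js
// Simplified sketch of the three layers.

class Operand {
  // An operand performs one operation on an input value.
  constructor(run) {
    this.run = run; // async (value) => result
  }
}

class Pipe {
  // A pipe holds operands and decides how they are run
  // (here: sequentially, with a naive retry policy).
  constructor(operands, { retries = 0 } = {}) {
    this.operands = operands;
    this.retries = retries;
  }

  async process(value) {
    for (const operand of this.operands) {
      for (let attempt = 0; ; attempt++) {
        try {
          value = await operand.run(value);
          break;
        } catch (err) {
          if (attempt >= this.retries) throw err;
        }
      }
    }
    return value;
  }
}

class Pipeline {
  // The pipeline builds the pipes from a config and launches
  // the first one with an initial value.
  constructor(config) {
    this.pipes = config.pipes.map((p) => new Pipe(p.operands, p.options));
  }

  async run(initialValue) {
    let value = initialValue;
    for (const pipe of this.pipes) {
      value = await pipe.process(value);
    }
    return value;
  }
}
```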

Scraping can now be done by launching a pipeline with a specific config in which we define the pipes to build and the operands to run in them. Since scraping is quite specific, we have defined dedicated pipes to add to the pipeline when scraping. So our scrape config always adds these pipes (a config sketch follows the list):

  • Pre-scrape pipes: For example, when testing, a filtering pipe is launched to filter the URLs to scrape and keep only one.
  • Scrape pipes: This is where all the magic happens. They contain the pre-product pipe (retrieving product page URLs from the product listing) and the product pipe (retrieving product information).
  • Post-scrape pipes: Consist of an aggregating pipe to aggregate all the data received and some optional pipes (e.g. an archive pipe).
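
Put together, and reusing the Pipeline and Operand sketch above, a scrape config could look roughly like this; the operand functions are stubs standing in for the real scraping logic.

```js
// Illustrative scrape config; the operand functions below are placeholders.
const isTest = true; // e.g. only keep one URL when testing

const extractProductUrls = async (listingUrls) => ['https://www.example-retailer.com/p/1']; // pre-product
const extractProductInfo = async (productUrls) => productUrls.map((url) => ({ url }));      // product
const aggregateProducts = async (products) => ({ count: products.length, products });
const archiveResults = async (result) => result; // optional pipe

const scrapeConfig = {
  pipes: [
    // Pre-scrape pipes: when testing, keep only one of the URLs to scrape.
    { operands: [new Operand(async (urls) => (isTest ? urls.slice(0, 1) : urls))] },
    // Scrape pipes: pre-product pipe then product pipe.
    {
      operands: [new Operand(extractProductUrls), new Operand(extractProductInfo)],
      options: { retries: 2 },
    },
    // Post-scrape pipes: aggregate all the data, plus optional pipes (e.g. archive).
    { operands: [new Operand(aggregateProducts), new Operand(archiveResults)] },
  ],
};

// new Pipeline(scrapeConfig).run(['https://www.example-retailer.com/sneakers']);
```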

We now have a basic config to initiate a pipeline that scrapes products on an e-commerce website. But let's dig a little more into the scrape pipes.

In the scrape pipes, we add fields to be extracted and returned at the end of the pipeline. Each field must be defined with a name, an extractor (how to extract the field), a cleaner (to clean the raw data received) and a formatter (to format the cleaned data). As we often deal with the same types of data, we have written general cleaners and formatters (for numbers, strings, JSON…). Our more specific cleaners and formatters then extend the general ones (for instance, a price cleaner does what a number cleaner does and more).
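
A field definition could then look something like this; the names, selectors and regexes are made up for the example, and the real cleaners and formatters handle many more cases.

```js
// Illustrative field definition: name + extractor + cleaner + formatter.
class NumberCleaner {
  clean(raw) {
    // keep digits and separators, normalise a decimal comma
    return raw.replace(/[^\d.,]/g, '').replace(',', '.');
  }
}

class PriceCleaner extends NumberCleaner {
  clean(raw) {
    // does what the number cleaner does, and more: drops currency codes first
    return super.clean(raw.replace(/EUR|USD/g, ''));
  }
}

const priceField = {
  name: 'price',
  extractor: ($) => $('.product-price').first().text(),         // how to extract the field
  cleaner: (raw) => new PriceCleaner().clean(raw),               // clean the raw data
  formatter: (cleaned) => Math.round(parseFloat(cleaned) * 100), // format, e.g. price in cents
};
```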

We have defined default fields to extract for each type of product we scrape on websites. For example, when scraping shoe products on a sneaker retailer we get: brand, name of product, product ID, colour, gender, price, size system, available sizes.

In the end, a scraper can be defined by initial values (URLs of the product listing pages) and extractors for all the fields defined.
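
For a sneaker retailer, the whole definition could therefore boil down to something like this (a sketch; the selectors are made up and each extractor receives the product page loaded with cheerio):

```js
// Hypothetical scraper definition: initial listing URLs + one extractor per default field.
const sneakerScraper = {
  initialUrls: ['https://www.example-retailer.com/sneakers'],
  fields: {
    brand: ($) => $('.product-brand').text().trim(),
    name: ($) => $('h1.product-name').text().trim(),
    productId: ($) => $('[data-product-id]').attr('data-product-id'),
    colour: ($) => $('.product-colour').text().trim(),
    gender: ($) => $('.product-gender').text().trim(),
    price: ($) => $('.product-price').first().text(),
    sizeSystem: ($) => $('.size-selector').attr('data-size-system'),
    availableSizes: ($) =>
      $('.size-selector .size:not(.sold-out)')
        .map((_, el) => $(el).text().trim())
        .get(),
  },
};
```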

New scrapers written in no time

When writing a new scraper, we first check the website to scrape, asking ourselves: "Does the website follow the classic behaviour?"

If so, we parse the pages with cheerio: we just need to look up the HTML markup of the data we want and write the corresponding selectors in the different field extractors.
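
Concretely, applying the field extractors to a product page could look like this (a sketch assuming the scraper definition above and Node 18+ for the global fetch):

```js
// Running a scraper's field extractors against a product page parsed with cheerio.
const cheerio = require('cheerio');

async function extractProduct(scraper, productUrl) {
  const html = await (await fetch(productUrl)).text();
  const $ = cheerio.load(html);

  const product = {};
  for (const [field, extractor] of Object.entries(scraper.fields)) {
    product[field] = extractor($);
  }
  return product;
}

// extractProduct(sneakerScraper, 'https://www.example-retailer.com/p/some-shoe');
```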

If not, our architecture allows us to easily customise the way we scrape: we can add a new pipe, a new operand or a new field to be extracted, or define new cleaners and formatters if necessary.
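
For instance, for a hypothetical retailer that exposes its catalogue through a JSON API instead of HTML listing pages, a custom operand could be slotted into the scrape pipes without touching the rest of the pipeline (the API shape and field names below are invented):

```js
// Hypothetical custom operand for a catalogue served as JSON; it would replace
// the pre-product + product operands in the scrape pipes.
const fetchCatalogueFromApi = async (apiUrls) => {
  const products = [];
  for (const url of apiUrls) {
    const page = await (await fetch(url)).json(); // Node 18+ global fetch
    products.push(
      ...page.items.map((item) => ({
        name: item.title,
        brand: item.brand,
        price: item.price,
      }))
    );
  }
  return products;
};

// In the scrape pipes: { operands: [new Operand(fetchCatalogueFromApi)] }
```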

This pipeline architecture has allowed us to move fast: we can integrate a new supplier within an hour and focus on the next possible integrations.