Data Processing, Output & Validation
The last step of every web scraping project is final data processing.
This usually involves data validation, cleanup and storage.
Since scrapers deal with data from unknown sources, data processing can be
a surprisingly complex challenge. For long-term scraping, data validation is
crucial for scraper maintenance: tools that ensure result quality
can prevent scrapers from silently breaking or performing sub-optimally.
Web scraping datasets can vary greatly, from small predictable structures to large,
complex data graphs.
Most commonly, the CSV and JSON formats are used.
CSV is great for flat datasets with a consistent structure.
It can be imported directly into spreadsheet software (Excel, Google Sheets, etc.)
and doesn't require compression as the format is already very compact.
Here's a short example scraper that stores data to CSV:
import httpx
import csv
from parsel import Selector

urls = [
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/product/2",
    "https://web-scraping.dev/product/3",
    "https://web-scraping.dev/product/4",
    "https://web-scraping.dev/product/5",
]

# newline="" prevents blank rows when writing CSV on Windows
with open("products.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["url", "name", "price"])  # header row
    for url in urls:
        resp = httpx.get(url)
        sel = Selector(resp.text)
        price = sel.css(".product-price::text").get()
        name = sel.css(".product-title::text").get()
        writer.writerow([str(resp.url), name, price])
# the same scraper using the Scrapfly SDK
import csv
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient("YOUR SCRAPFLY KEY")

urls = [
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/product/2",
    "https://web-scraping.dev/product/3",
    "https://web-scraping.dev/product/4",
    "https://web-scraping.dev/product/5",
]

# newline="" prevents blank rows when writing CSV on Windows
with open("products.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["url", "name", "price"])  # header row
    for url in urls:
        result = scrapfly.scrape(ScrapeConfig(url))
        price = result.selector.css(".product-price::text").get()
        name = result.selector.css(".product-title::text").get()
        writer.writerow([result.context["url"], name, price])
Some things to note when working with CSV:
- The separator character (default: ,) has to be escaped in values
- CSV is a flat structure, so nested datasets have to be flattened
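Python's built-in csv module escapes separator characters automatically by quoting values, but flattening nested data is up to the scraper. Here's a minimal sketch, assuming a hypothetical nested product item:

import csv

# a nested item as it might come out of a scraper (illustrative structure)
item = {
    "name": "Box of Chocolate Candy",
    "price": "9.99",
    "details": {"brand": "ChocoDelight, Inc.", "weight": "500g"},
}

# flatten nested keys into dotted column names before writing
flat = {
    "name": item["name"],
    "price": item["price"],
    "details.brand": item["details"]["brand"],
    "details.weight": item["details"]["weight"],
}

with open("flat.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(flat))
    writer.writeheader()
    writer.writerow(flat)  # the comma in "ChocoDelight, Inc." is quoted automatically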
JSON is great for complex structures as it allows easy nesting and key-to-value
structuring. However, JSON datasets can take up a lot of space and require compression
for best storage efficiency.
Here's a short example scraper that stores data in JSON:
import httpx
import json
from parsel import Selector

urls = [
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/product/2",
    "https://web-scraping.dev/product/3",
    "https://web-scraping.dev/product/4",
    "https://web-scraping.dev/product/5",
]

# collect all results first, then write them out as a single JSON document
results = []
for url in urls:
    resp = httpx.get(url)
    sel = Selector(resp.text)
    price = sel.css(".product-price::text").get()
    name = sel.css(".product-title::text").get()
    results.append({
        "url": str(resp.url),
        "name": name,
        "price": price,
    })

with open("results.json", "w") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)
# the same scraper using the Scrapfly SDK
import json
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient("YOUR SCRAPFLY API KEY")

urls = [
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/product/2",
    "https://web-scraping.dev/product/3",
    "https://web-scraping.dev/product/4",
    "https://web-scraping.dev/product/5",
]

# collect all results first, then write them out as a single JSON document
results = []
for url in urls:
    result = scrapfly.scrape(ScrapeConfig(url))
    price = result.selector.css(".product-price::text").get()
    name = result.selector.css(".product-title::text").get()
    results.append({
        "url": result.context["url"],
        "name": name,
        "price": price,
    })

with open("results.json", "w") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)
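Since JSON output like this can get large, it can also be compressed on write. Here's a minimal sketch using Python's built-in gzip module; the results list is a placeholder standing in for the data scraped above:

import gzip
import json

# placeholder for the results collected by the scraper above
results = [{"url": "https://web-scraping.dev/product/1", "name": "example", "price": "9.99"}]

# open the compressed file in text mode so json.dump can write to it directly
with gzip.open("results.json.gz", "wt", encoding="utf-8") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)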
Some things to note when working with JSON:
- The quote (") character has to be escaped inside values
- Unicode support is often not enabled by default
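Python's json module covers both of these: quote characters are escaped automatically, and non-ASCII characters are turned into \uXXXX escapes unless ensure_ascii=False is passed, which is why the examples above use it. A quick illustration with a made-up item:

import json

item = {"name": 'Box of "Special" Chocolatés', "price": "9.99"}

# quotes are escaped and non-ASCII characters become \u escapes by default
print(json.dumps(item))

# ensure_ascii=False keeps unicode characters readable in the output
print(json.dumps(item, ensure_ascii=False))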
JSONL (JSON Lines) is a particularly popular
JSON variant in web scraping datasets,
where each line of the file is a standalone JSON object. This structure allows results
to be streamed one item at a time, which makes scrapers easier to work with.
Here's an example of a simple JSON Lines scraper:
import httpx
import json
from parsel import Selector

urls = [
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/product/2",
    "https://web-scraping.dev/product/3",
    "https://web-scraping.dev/product/4",
    "https://web-scraping.dev/product/5",
]

for url in urls:
    resp = httpx.get(url)
    sel = Selector(resp.text)
    price = sel.css(".product-price::text").get()
    name = sel.css(".product-title::text").get()
    item = {
        "url": str(resp.url),
        "name": name,
        "price": price,
    }
    # append each result as its own JSON line as soon as it's scraped
    with open("results.jsonl", "a") as f:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
# the same scraper using the Scrapfly SDK
import json
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient("YOUR SCRAPFLY KEY")

urls = [
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/product/2",
    "https://web-scraping.dev/product/3",
    "https://web-scraping.dev/product/4",
    "https://web-scraping.dev/product/5",
]

for url in urls:
    result = scrapfly.scrape(ScrapeConfig(url))
    price = result.selector.css(".product-price::text").get()
    name = result.selector.css(".product-title::text").get()
    item = {
        "url": result.context["url"],
        "name": name,
        "price": price,
    }
    # append each result as its own JSON line as soon as it's scraped
    with open("results.jsonl", "a") as f:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
Above, we open the output file on each iteration and append a new line.
This streams results out as they are scraped, which makes large
data flows much easier to handle.
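Reading a JSONL file back works the same way: each line is parsed on its own, so even very large result files can be processed one record at a time. A minimal sketch:

import json

# stream results one record at a time instead of loading the whole file
with open("results.jsonl") as f:
    for line in f:
        item = json.loads(line)
        print(item["name"], item["price"])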
Spreadsheets
Spreadsheets are a natural fit for web scraping as they are designed to handle
dynamic data, can be streamed to (by appending rows) and are easy to work with.
CSV output is already compatible with spreadsheet software, while services like
Google Sheets add extra features such as online access, version control and collaboration.
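For example, rows could be appended to a Google Sheet using the third-party gspread library; the spreadsheet name and credentials file below are assumptions for illustration:

import gspread

# authenticate with a Google service account (credentials file path is an assumption)
client = gspread.service_account(filename="credentials.json")

# open an existing spreadsheet by name and use its first worksheet
sheet = client.open("scraped-products").sheet1

# append a scraped result as a new row, mirroring the CSV structure used earlier
sheet.append_row(["https://web-scraping.dev/product/1", "Box of Chocolate Candy", "9.99"])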
Data Processing
Most datapoints found on the web come in free-form text. Dates and prices, for example, are expressed as text
and need to be converted to matching data types before storage.
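Here's a minimal sketch of such conversions using only the standard library; the raw price and date strings are illustrative:

import re
from datetime import datetime

raw_price = "Price: $1,299.99 USD"
raw_date = "Published on May 5, 2023"

# pull the numeric part out of the price text and convert it to a float
price = float(re.search(r"[\d,]+\.?\d*", raw_price).group().replace(",", ""))

# pull the date portion out of the text and parse it into a datetime object
date = datetime.strptime(re.search(r"\w+ \d{1,2}, \d{4}", raw_date).group(), "%B %d, %Y")

print(price)        # 1299.99
print(date.date())  # 2023-05-05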
Data Validation
When scraping at scale, data validation is vital for consistent results, as
real web pages change unpredictably and often.
There are multiple ways to approach validation, but the most important one is
tracking results and matching them against schemas and regular expression patterns,
which can catch the vast majority of failures.
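For example, each scraped item could be checked against a schema with the third-party pydantic library before it is stored; the Product model below is an assumption about what the dataset should look like:

from pydantic import BaseModel, ValidationError

class Product(BaseModel):
    url: str
    name: str
    price: float  # scraped "9.99" strings are coerced to floats automatically

item = {"url": "https://web-scraping.dev/product/1", "name": "Box of Chocolate Candy", "price": "9.99"}

try:
    product = Product(**item)
except ValidationError as error:
    # a failed validation is a strong signal that the page layout changed
    print("invalid item:", error)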
Next - Blocking
We've covered most of the subjects web scraper developers come across when
developing web scraping programs. However, by far the biggest barrier in scraping is
scraper blocking, so next let's take a look at what it is and how to avoid it.