Data Processing, Output & Validation
The last step of every web scraping project is final data processing.
This usually involves data validation, cleanup and storage.
Since scrapers deal with data from unknown sources, data processing can be
a surprisingly complex challenge. For long-term scraping, data validation is
crucial for scraper maintenance: tools that ensure result quality
can prevent scrapers from silently breaking or performing sub-optimally.
Web scraping datasets can vary greatly, from small predictable structures to large,
complex data graphs.
Most commonly, the CSV and JSON formats are used.
CSV is great for flat datasets with a consistent structure.
It can be imported directly into spreadsheet software (Excel, Google Sheets, etc.)
and doesn't require compression as the format is already very compact.
Here's a short example scraper that stores data to CSV:
import httpx
import csv
from parsel import Selector

urls = [
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/product/2",
    "https://web-scraping.dev/product/3",
    "https://web-scraping.dev/product/4",
    "https://web-scraping.dev/product/5",
]

# newline="" prevents blank rows when writing CSV on Windows
with open("products.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["url", "name", "price"])  # header row
    for url in urls:
        resp = httpx.get(url)
        sel = Selector(resp.text)
        price = sel.css(".product-price::text").get()
        name = sel.css(".product-title::text").get()
        writer.writerow([str(resp.url), name, price])
# the same scraper using the Scrapfly SDK
import csv
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient("YOUR SCRAPFLY KEY")

urls = [
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/product/2",
    "https://web-scraping.dev/product/3",
    "https://web-scraping.dev/product/4",
    "https://web-scraping.dev/product/5",
]

# newline="" prevents blank rows when writing CSV on Windows
with open("products.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["url", "name", "price"])  # header row
    for url in urls:
        result = scrapfly.scrape(ScrapeConfig(url))
        price = result.selector.css(".product-price::text").get()
        name = result.selector.css(".product-title::text").get()
        writer.writerow([result.context["url"], name, price])
Some things to note when working with CSV:
- The separator character (default: ,) has to be escaped in values
- CSV is a flat structure, so nested datasets have to be flattened
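Python's built-in csv module escapes separator characters automatically by quoting values, but flattening nested data is up to the scraper. Here's a minimal sketch, assuming a hypothetical nested product item:

import csv

# a nested item as it might come out of a scraper (illustrative structure)
item = {
    "name": "Box of Chocolate Candy",
    "price": "9.99",
    "details": {"brand": "ChocoDelight, Inc.", "weight": "500g"},
}

# flatten nested keys into dotted column names before writing
flat = {
    "name": item["name"],
    "price": item["price"],
    "details.brand": item["details"]["brand"],
    "details.weight": item["details"]["weight"],
}

with open("flat.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(flat))
    writer.writeheader()
    writer.writerow(flat)  # the comma in "ChocoDelight, Inc." is quoted automatically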
JSON is great for complex structures as it allows easy nesting and key-to-value
structuring. However, JSON datasets can take up a lot of space and require compression
for best storage efficiency.
Here's a short example scraper that stores data in JSON:
import httpx
import json
from parsel import Selector

urls = [
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/product/2",
    "https://web-scraping.dev/product/3",
    "https://web-scraping.dev/product/4",
    "https://web-scraping.dev/product/5",
]

# collect all results first, then write them out as a single JSON document
results = []
for url in urls:
    resp = httpx.get(url)
    sel = Selector(resp.text)
    price = sel.css(".product-price::text").get()
    name = sel.css(".product-title::text").get()
    results.append({
        "url": str(resp.url),
        "name": name,
        "price": price,
    })

with open("results.json", "w") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)
# the same scraper using the Scrapfly SDK
import json
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient("YOUR SCRAPFLY API KEY")

urls = [
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/product/2",
    "https://web-scraping.dev/product/3",
    "https://web-scraping.dev/product/4",
    "https://web-scraping.dev/product/5",
]

# collect all results first, then write them out as a single JSON document
results = []
for url in urls:
    result = scrapfly.scrape(ScrapeConfig(url))
    price = result.selector.css(".product-price::text").get()
    name = result.selector.css(".product-title::text").get()
    results.append({
        "url": result.context["url"],
        "name": name,
        "price": price,
    })

with open("results.json", "w") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)
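Since JSON output like this can get large, it can also be compressed on write. Here's a minimal sketch using Python's built-in gzip module; the results list is a placeholder standing in for the data scraped above:

import gzip
import json

# placeholder for the results collected by the scraper above
results = [{"url": "https://web-scraping.dev/product/1", "name": "example", "price": "9.99"}]

# open the compressed file in text mode so json.dump can write to it directly
with gzip.open("results.json.gz", "wt", encoding="utf-8") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)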
Some things to note when working with JSON:
- The quote (") character has to be escaped inside values
- Unicode support is often not enabled by default
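Python's json module covers both of these: quote characters are escaped automatically, and non-ASCII characters are turned into \uXXXX escapes unless ensure_ascii=False is passed, which is why the examples above use it. A quick illustration with a made-up item:

import json

item = {"name": 'Box of "Special" Chocolatés', "price": "9.99"}

# quotes are escaped and non-ASCII characters become \u escapes by default
print(json.dumps(item))

# ensure_ascii=False keeps unicode characters readable in the output
print(json.dumps(item, ensure_ascii=False))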
JSONL (JSON Lines) is a particularly popular
JSON variant in web scraping datasets,
where each line of the file is a standalone JSON object. This structure allows results
to be streamed one item at a time, which makes scrapers easier to work with.
Here's an example of a simple JSON Lines scraper:
import httpx
import json
from parsel import Selector

urls = [
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/product/2",
    "https://web-scraping.dev/product/3",
    "https://web-scraping.dev/product/4",
    "https://web-scraping.dev/product/5",
]

for url in urls:
    resp = httpx.get(url)
    sel = Selector(resp.text)
    price = sel.css(".product-price::text").get()
    name = sel.css(".product-title::text").get()
    item = {
        "url": str(resp.url),
        "name": name,
        "price": price,
    }
    # append each result as its own JSON line as soon as it's scraped
    with open("results.jsonl", "a") as f:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
# the same scraper using the Scrapfly SDK
import json
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient("YOUR SCRAPFLY KEY")

urls = [
    "https://web-scraping.dev/product/1",
    "https://web-scraping.dev/product/2",
    "https://web-scraping.dev/product/3",
    "https://web-scraping.dev/product/4",
    "https://web-scraping.dev/product/5",
]

for url in urls:
    result = scrapfly.scrape(ScrapeConfig(url))
    price = result.selector.css(".product-price::text").get()
    name = result.selector.css(".product-title::text").get()
    item = {
        "url": result.context["url"],
        "name": name,
        "price": price,
    }
    # append each result as its own JSON line as soon as it's scraped
    with open("results.jsonl", "a") as f:
        f.write(json.dumps(item, ensure_ascii=False) + "\n")
Above, we open the output file on each iteration and append a new line.
This streams results out as they are scraped, which makes large
data flows much easier to handle.
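Reading a JSONL file back works the same way: each line is parsed on its own, so even very large result files can be processed one record at a time. A minimal sketch:

import json

# stream results one record at a time instead of loading the whole file
with open("results.jsonl") as f:
    for line in f:
        item = json.loads(line)
        print(item["name"], item["price"])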
Spreadsheets
Spreadsheets are a natural fit for web scraping as they are designed to handle
dynamic data, can be streamed to (by appending rows) and are easy to work with.
CSV output is already compatible with spreadsheet software, while services like
Google Sheets add extra features such as online access, version control and collaboration.
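For example, rows could be appended to a Google Sheet using the third-party gspread library; the spreadsheet name and credentials file below are assumptions for illustration:

import gspread

# authenticate with a Google service account (credentials file path is an assumption)
client = gspread.service_account(filename="credentials.json")

# open an existing spreadsheet by name and use its first worksheet
sheet = client.open("scraped-products").sheet1

# append a scraped result as a new row, mirroring the CSV structure used earlier
sheet.append_row(["https://web-scraping.dev/product/1", "Box of Chocolate Candy", "9.99"])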
Data Processing
Most datapoints found on the web come in free-form text. Dates and prices, for example, are expressed as text
and need to be converted to matching data types before storage.
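Here's a minimal sketch of such conversions using only the standard library; the raw price and date strings are illustrative:

import re
from datetime import datetime

raw_price = "Price: $1,299.99 USD"
raw_date = "Published on May 5, 2023"

# pull the numeric part out of the price text and convert it to a float
price = float(re.search(r"[\d,]+\.?\d*", raw_price).group().replace(",", ""))

# pull the date portion out of the text and parse it into a datetime object
date = datetime.strptime(re.search(r"\w+ \d{1,2}, \d{4}", raw_date).group(), "%B %d, %Y")

print(price)        # 1299.99
print(date.date())  # 2023-05-05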
Data Validation
When scraping at scale, data validation is vital for consistent results, as
real web pages change unpredictably and often.
There are multiple ways to approach validation, but the most important one is
tracking results and matching them against schemas and regular expression patterns,
which can catch the vast majority of failures.
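For example, each scraped item could be checked against a schema with the third-party pydantic library before it is stored; the Product model below is an assumption about what the dataset should look like:

from pydantic import BaseModel, ValidationError

class Product(BaseModel):
    url: str
    name: str
    price: float  # scraped "9.99" strings are coerced to floats automatically

item = {"url": "https://web-scraping.dev/product/1", "name": "Box of Chocolate Candy", "price": "9.99"}

try:
    product = Product(**item)
except ValidationError as error:
    # a failed validation is a strong signal that the page layout changed
    print("invalid item:", error)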
Next - Blocking
We've covered most of the subjects web scraper developers come across when
developing web scraping programs. However, by far the biggest barrier in scraping is
scraper blocking, so next let's take a look at what it is and how to avoid it.