Self-Healing Web Scrapers: Building Anti-Fragile Data Pipelines
We built 130+ production scrapers for Lynk that extract product data from major e-commerce sites. The challenge? Websites change constantly. Our solution? Self-healing scrapers that adapt automatically.
The Problem
Traditional scrapers break when:
- CSS selectors change
- Page structure is redesigned
- New anti-bot measures are added
- A/B tests alter the DOM
With 130+ scrapers, manual maintenance is impossible.
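The failure mode is easy to reproduce. In this illustrative snippet (the markup and class names are invented for the example), a selector pinned to yesterday's class name silently returns nothing after a redesign:

```python
from bs4 import BeautifulSoup

# Yesterday's markup: the selector works
old_html = '<h1 class="product-title">Wireless Mouse</h1>'
# Today's redesign renames the class: the same selector finds nothing
new_html = '<h1 class="pdp-heading">Wireless Mouse</h1>'

def extract_title(html):
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one("h1.product-title")
    return node.text if node else None

extract_title(old_html)  # "Wireless Mouse"
extract_title(new_html)  # None — the scraper is now broken
```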
Architecture Overview
[Scraper Fleet] → [Change Detection] → [Auto-Adaptation] → [AWS SQS] → [Vector DB]
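The stages above can be tied together in one orchestration function. This is a hypothetical sketch rather than production code: the stand-in scraper class and the injected helpers exist only to show the control flow.

```python
class DummyScraper:
    """Stand-in scraper: adapt() would normally re-discover selectors."""
    def __init__(self):
        self.adapted = False

    def adapt(self, html):
        self.adapted = True

    def extract(self, html):
        return {"title": html.strip("<>")}

def run_pipeline(url, scraper, stored_fp, fetch, fingerprint, publish):
    # Fetch → detect structural change → adapt → extract → publish
    html = fetch(url)
    if fingerprint(html) != stored_fp:
        scraper.adapt(html)          # DOM changed: self-heal before extracting
    data = scraper.extract(html)
    publish(data)                    # e.g. an SQS send
    return data

# Wiring it up with trivial stubs:
sent = []
result = run_pipeline(
    "https://example.com/p/1",
    DummyScraper(),
    stored_fp="old-fp",
    fetch=lambda url: "<h1>",
    fingerprint=lambda html: "new-fp",   # differs from stored → triggers adapt()
    publish=sent.append,
)
```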
Scraper Categories
We organize scrapers by product category:
```python
CATEGORIES = {
    "electronics": ["amazon", "bestbuy", "newegg"],
    "fashion": ["nike", "adidas", "zara"],
    "home": ["wayfair", "ikea", "homedepot"],
}
```

This allows category-specific extraction rules.
Selector Versioning
Each scraper maintains multiple selector versions:
```python
class ExtractionError(Exception):
    """Raised when every selector version fails."""

class ProductScraper:
    SELECTORS = {
        "v1": {
            "title": "h1.product-title",
            "price": "span.price-now",
            "image": "img.product-image",
        },
        "v2": {
            "title": "h1[data-product-title]",
            "price": "div.pricing span.current",
            "image": "figure.gallery img",
        },
    }

    def extract(self, html):
        # Try the newest selector version first, then fall back to older ones
        for version in reversed(list(self.SELECTORS)):
            try:
                data = self._extract_with_version(html, version)
                if self._validate(data):
                    return data
            except Exception:
                continue
        raise ExtractionError("All selector versions failed")
```

DOM Change Detection
We hash critical page elements and monitor for changes:
```python
import hashlib

from bs4 import BeautifulSoup

def get_dom_fingerprint(html: str) -> str:
    soup = BeautifulSoup(html, 'html.parser')
    # Capture the page's structural signature, not its content
    structure = {
        "tag_counts": {tag.name: len(soup.find_all(tag.name))
                       for tag in soup.find_all()},
        "class_list": [tag.get('class') for tag in soup.find_all(class_=True)],
        "ids": [tag.get('id') for tag in soup.find_all(id=True)],
    }
    # MD5 is fine here: we need a cheap change marker, not a secure hash
    return hashlib.md5(str(structure).encode()).hexdigest()

def detect_changes(url: str, stored_fingerprint: str) -> bool:
    current_html = fetch_page(url)
    current_fingerprint = get_dom_fingerprint(current_html)
    if current_fingerprint != stored_fingerprint:
        logger.warning(f"DOM changed for {url}")
        return True
    return False
```

Automated Selector Discovery
When selectors fail, we automatically search for alternatives:
```python
def find_product_title(soup):
    # Try common selector patterns in priority order
    candidates = [
        soup.select_one("h1[class*='title']"),
        soup.select_one("h1[class*='product']"),
        soup.select_one("h1[id*='title']"),
        soup.find("h1", {"itemprop": "name"}),
    ]
    # Score by heuristics: a real product title is rarely under 10 characters
    for candidate in candidates:
        if candidate and len(candidate.text.strip()) > 10:
            return candidate.text.strip()
    # Fallback: ML-based detection
    return ml_predict_title(soup)
```

Anti-Detection Strategies
1. Rotating User Agents
```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...",
    # ... 50+ more
]

def get_random_headers():
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    }
```

2. Residential Proxies
```python
from selenium import webdriver

PROXY_POOL = load_proxies("proxies.txt")

def create_driver(proxy=None):
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument(f"--proxy-server={proxy or get_random_proxy()}")
    return webdriver.Chrome(options=options)
```

3. Request Timing
```python
import time
import random

def human_like_delay():
    # Random delay between 1 and 5 seconds
    delay = random.uniform(1.0, 5.0)
    time.sleep(delay)

def scrape_with_rate_limit(urls, requests_per_minute=10):
    delay = 60 / requests_per_minute
    for url in urls:
        scrape_page(url)
        # Add jitter so requests don't land on a fixed cadence
        time.sleep(delay + random.uniform(0, 1))
```

Data Pipeline Integration
All scraped data flows through AWS SQS:
```python
import json
from datetime import datetime

import boto3

sqs = boto3.client('sqs')
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/account/queue"

def send_to_queue(product_data):
    message = {
        "source": "scraper",
        "category": product_data["category"],
        "data": product_data,
        "timestamp": datetime.utcnow().isoformat(),
    }
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(message),
        MessageAttributes={
            "Category": {
                "StringValue": product_data["category"],
                "DataType": "String",
            }
        },
    )
```

Monitoring & Alerts
Daily Health Checks
```python
import schedule

def health_check():
    failed_scrapers = []
    for scraper in SCRAPER_REGISTRY:
        try:
            test_data = scraper.scrape_single_page()
            if not validate_data(test_data):
                failed_scrapers.append(scraper.name)
        except Exception as e:
            logger.error(f"{scraper.name} failed: {e}")
            failed_scrapers.append(scraper.name)
    if failed_scrapers:
        send_alert(f"Scrapers failed: {', '.join(failed_scrapers)}")

# A worker loop elsewhere calls schedule.run_pending() to fire this
schedule.every().day.at("02:00").do(health_check)
```

Slack Notifications
```python
import os

import requests

def send_alert(message):
    webhook_url = os.getenv("SLACK_WEBHOOK")
    requests.post(webhook_url, json={
        "text": f"🚨 Scraper Alert: {message}",
        "channel": "#scraper-alerts",
    })
```

Graceful Degradation
When scrapers fail, fall back to API alternatives:
```python
def get_product_data(product_url):
    # Try scraping first
    try:
        return scrape_product(product_url)
    except ScraperError:
        logger.warning("Scraper failed, trying API")
    # Fall back to the official API if available
    try:
        product_id = extract_id(product_url)
        return fetch_from_api(product_id)
    except APIError:
        logger.error("Both scraper and API failed")
        return None
```

Results
- 130+ active scrapers across 50+ websites
- 24-hour self-healing - detects and adapts to changes daily
- Millions of products indexed
- 95%+ uptime across all scrapers
Best Practices
- Version your selectors - Don't rely on a single set
- Monitor proactively - Daily health checks catch issues early
- Use multiple extraction strategies - CSS, XPath, ML-based
- Respect robots.txt - Be a good citizen
- Cache aggressively - Don't hammer servers
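The robots.txt practice can be enforced with the standard library's `urllib.robotparser`. A minimal sketch, with a made-up user-agent name and rule set; in production you'd download each site's `/robots.txt` once and cache it:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    # Parse a robots.txt body and check whether this URL may be fetched
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Hypothetical rules and user agent, for illustration only
rules = """\
User-agent: *
Disallow: /checkout/
"""
is_allowed(rules, "LynkScraper", "https://shop.example.com/product/42")  # allowed
is_allowed(rules, "LynkScraper", "https://shop.example.com/checkout/")   # blocked
```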
Conclusion
Self-healing scrapers aren't about perfect code - they're about building systems that adapt to inevitable changes. With proper monitoring and fallbacks, you can maintain large scraper fleets with minimal intervention.
Questions about web scraping at scale? Let's chat.
