Web Scraping

LeBonCoin Scraper: Anti-Scraping Bypass & Data Extraction

A Python web scraper that extracts seller contact details from Leboncoin.fr through ScrapFly's anti-scraping bypass, with a 100% phone extraction success rate in production tests (10 of 10 contacts).

Project Overview

The Challenge

Leboncoin.fr implements sophisticated anti-scraping protection that blocks automated requests and bot traffic, making traditional scraping approaches ineffective.

Phone numbers are only revealed after JavaScript execution and user interaction, requiring full browser rendering capabilities rather than simple HTML parsing.

Access to local listings requires French IP addresses due to geolocation restrictions, creating barriers for international scraping operations.

Frequent HTML structure changes and dynamic CSS classes require robust extraction strategies to maintain reliability over time.

Extraction success rate has to be balanced against API costs for the scraper to scale, which makes cost per contact the key metric to optimize.

Developed a production-grade web scraper for Leboncoin.fr, France's largest classified ads platform, achieving 100% phone extraction success rate in production tests.

Implemented sophisticated anti-scraping bypass using ScrapFly's residential proxy network with French geolocation and JavaScript rendering capabilities to reveal dynamically-loaded phone numbers.

Designed a robust 3-tier fallback extraction strategy combining HTML selectors, phone link detection, and regex patterns to handle frequent HTML structure changes and ensure data capture reliability.

Optimized for cost efficiency with configurable limits and environment-based configuration, reaching a cost of $0.33 per 10 contacts while maintaining high success rates and scaling from light to heavy usage.

Technical Architecture


ScrapFly Integration Layer: Handles API authentication, anti-scraping protection bypass, and residential proxy configuration with French geolocation

Two-Phase Scraping Pipeline: Phase 1 collects ad URLs from search results, Phase 2 extracts detailed information including phone numbers

Multi-Method Phone Extraction: Implements 3-tier fallback strategy using HTML selectors, phone links, and regex patterns with French format validation

Data Processing Pipeline: Parses HTML with BeautifulSoup, extracts prices robustly (including prices that use Unicode space characters as separators), and writes structured JSON output
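The price handling mentioned above can be illustrated with a minimal sketch. It assumes the input is the text of a single price element, that prices are whole euros, and that the separators are the space-like characters listed in the code; the function name parse_price is hypothetical and not taken from the project.

```python
import re
from typing import Optional

# Space-like characters that French-formatted prices may use as thousands
# separators: regular space, no-break space, narrow no-break space, thin space.
# (Assumed set; extend it if other separators show up in real pages.)
_SPACE_CHARS = " \u00a0\u202f\u2009"
_STRIP_SPACES = {ord(ch): None for ch in _SPACE_CHARS}


def parse_price(text: str) -> Optional[int]:
    """Return the price in whole euros, or None if the text has no digits."""
    cleaned = text.translate(_STRIP_SPACES)  # "1\u202f250\u00a0€" -> "1250€"
    match = re.search(r"[0-9]+", cleaned)
    return int(match.group()) if match else None
```

Removing every space before matching keeps the regex trivial, at the cost of merging unrelated numbers if the input contains more than the price itself.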

Key Challenges & Solutions

Anti-Scraping Detection

Implemented ScrapFly's ASP bypass with residential proxy rotation, realistic browser headers, and an auto-retry mechanism to avoid detection and blocking.

JavaScript-Rendered Content

Enabled full JavaScript execution with a 3-second rendering wait, DOM readiness checks, and auto-scroll to trigger lazy-loaded phone numbers.
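As a rough illustration, the rendering settings above, together with the ASP and French residential proxy options described earlier, might map onto ScrapFly's HTTP API as in the sketch below. The parameter names follow ScrapFly's public API as generally documented and should be checked against the current reference; the helper names build_scrape_params and build_scrape_url are hypothetical, and the sketch only builds the request URL without sending it.

```python
from urllib.parse import urlencode

SCRAPFLY_ENDPOINT = "https://api.scrapfly.io/scrape"


def build_scrape_params(api_key: str, target_url: str) -> dict:
    """Collect the query parameters for one scrape request (hypothetical helper)."""
    return {
        "key": api_key,
        "url": target_url,
        "asp": "true",                            # anti-scraping protection bypass
        "proxy_pool": "public_residential_pool",  # residential proxies
        "country": "fr",                          # French exit IPs
        "render_js": "true",                      # run the page in a real browser
        "rendering_wait": 3000,                   # wait 3 s after load (milliseconds)
        "auto_scroll": "true",                    # scroll to trigger lazy loading
    }


def build_scrape_url(api_key: str, target_url: str) -> str:
    """Return the full request URL with all parameters percent-encoded."""
    return SCRAPFLY_ENDPOINT + "?" + urlencode(build_scrape_params(api_key, target_url))
```

Keeping the parameters in one dictionary makes it easy to switch rendering off for pages that do not need it, which matters because rendered, ASP-protected requests are typically billed at a higher rate.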

Dynamic HTML Selectors

Built 3-tier fallback extraction using data-qa-id attributes, tel: links, and regex patterns to handle frequent HTML structure changes.
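The fallback order described above can be sketched as follows. This is not the project's code: it uses the standard-library html.parser in place of BeautifulSoup so that it runs without dependencies, the data-qa-id value "adview_contact_phone" is a hypothetical placeholder, and the regex is one plausible pattern for French numbers.

```python
import re
from html.parser import HTMLParser
from typing import Optional

# French number: +33 or 0, one digit 1-9, then four pairs of digits,
# with optional spaces, dots, or hyphens between the groups.
PHONE_RE = re.compile(r"(?<![0-9])(?:\+33|0)\s*[1-9](?:[\s.-]*[0-9]{2}){4}(?![0-9])")
_VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


class _PhoneCollector(HTMLParser):
    """Collects the raw candidates for tier 1 and tier 2 in a single pass."""

    def __init__(self, qa_id: str) -> None:
        super().__init__()
        self.qa_id = qa_id
        self.qa_text = []    # tier 1: text inside the data-qa-id element
        self.tel_links = []  # tier 2: numbers taken from tel: links
        self._depth = 0      # greater than 0 while inside the tier 1 element

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        href = attr.get("href") or ""
        if tag == "a" and href.startswith("tel:"):
            self.tel_links.append(href[4:])
        if tag in _VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif attr.get("data-qa-id") == self.qa_id:
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth and tag not in _VOID_TAGS:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.qa_text.append(data)


def extract_phone(html: str, qa_id: str = "adview_contact_phone") -> Optional[str]:
    """Try each tier in order and return the first number, separators removed."""
    collector = _PhoneCollector(qa_id)
    collector.feed(html)
    candidates = [
        " ".join(collector.qa_text),    # tier 1: dedicated element
        " ".join(collector.tel_links),  # tier 2: tel: links
        re.sub(r"<[^>]+>", " ", html),  # tier 3: regex over the whole page
    ]
    for candidate in candidates:
        match = PHONE_RE.search(candidate)
        if match:
            return re.sub(r"[\s.-]", "", match.group())
    return None
```

Ordering the tiers from most to least specific means the page-wide regex, which is the most likely to pick up an unrelated number, only runs when the structured sources yield nothing.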

Cost Optimization

Implemented configurable limits, environment-based tuning, and an efficient two-phase approach to minimize API costs while maintaining quality.
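A minimal sketch of what environment-based limits could look like is shown below. The variable names SCRAPFLY_API_KEY, MAX_ADS, and RENDERING_WAIT_MS and the ScraperSettings class are assumptions made for illustration, not names confirmed by the project. With python-dotenv, calling load_dotenv() before load_settings() would populate os.environ from a .env file.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ScraperSettings:
    api_key: str
    max_ads: int            # upper bound on ads scraped per run (caps API spend)
    rendering_wait_ms: int  # JavaScript rendering wait, in milliseconds


def load_settings(env: Optional[Mapping[str, str]] = None) -> ScraperSettings:
    """Read settings from the environment, falling back to assumed defaults."""
    env = os.environ if env is None else env
    api_key = env.get("SCRAPFLY_API_KEY", "")
    if not api_key:
        raise ValueError("SCRAPFLY_API_KEY is not set")
    return ScraperSettings(
        api_key=api_key,
        max_ads=int(env.get("MAX_ADS", "10")),
        rendering_wait_ms=int(env.get("RENDERING_WAIT_MS", "3000")),
    )
```

Accepting the environment as a parameter keeps the function testable without touching the real process environment.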

Data Validation

Created a comprehensive validation system with regex patterns, length checks, prefix validation, and duplicate detection for data quality.
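The checks listed above might be combined along these lines. This is a sketch under assumptions: numbers are normalized to the 10-digit national form, a valid number starts with 0 followed by a digit from 1 to 9, and the function names are hypothetical.

```python
import re
from typing import Iterable, List

# One pattern covers both the length check (exactly 10 digits) and the
# prefix check (leading 0, then a digit from 1 to 9).
_NATIONAL_RE = re.compile(r"0[1-9][0-9]{8}")


def normalize_phone(raw: str) -> str:
    """Drop separators and rewrite +33 / 0033 prefixes to the national 0 form."""
    digits = re.sub(r"[^0-9+]", "", raw)
    if digits.startswith("+33"):
        return "0" + digits[3:]
    if digits.startswith("0033"):
        return "0" + digits[4:]
    return digits


def is_valid_french_phone(raw: str) -> bool:
    """True if the normalized number has the expected length and prefix."""
    return _NATIONAL_RE.fullmatch(normalize_phone(raw)) is not None


def dedupe_valid(phones: Iterable[str]) -> List[str]:
    """Keep valid numbers only, removing duplicates while preserving order."""
    seen = set()
    result = []
    for raw in phones:
        phone = normalize_phone(raw)
        if _NATIONAL_RE.fullmatch(phone) and phone not in seen:
            seen.add(phone)
            result.append(phone)
    return result
```

Normalizing before comparing is what makes duplicate detection reliable: "06 12 34 56 78" and "+33612345678" are the same contact and collapse to one entry.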

Impact & Results

Achieved 100% phone extraction success rate in production tests (10/10 contacts)

Optimized performance to 60 seconds execution time for 10 contacts

Reduced cost to $0.33 per 10 contacts through efficient API usage

Enabled scalable operations from $4.50/month (light) to $30/month (heavy usage)

Open-sourced with MIT license, comprehensive documentation, and validation reports
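For a sense of scale, the cost and timing figures above can be turned into per-contact estimates. The sketch below is arithmetic on the reported numbers only. It assumes cost and runtime scale linearly with the number of contacts and that the monthly figures are pure usage at the reported rate, neither of which the results state.

```python
COST_CENTS_PER_10_CONTACTS = 33  # $0.33 per 10 contacts, as reported
SECONDS_PER_10_CONTACTS = 60     # 60 seconds per 10 contacts, as reported


def contacts_for_budget(budget_cents: int) -> int:
    """Whole contacts a budget covers, assuming linear cost (integer cents)."""
    return budget_cents * 10 // COST_CENTS_PER_10_CONTACTS


def runtime_minutes(contacts: int) -> float:
    """Sequential runtime in minutes, assuming 6 seconds per contact holds."""
    return contacts * SECONDS_PER_10_CONTACTS / 10 / 60


light = contacts_for_budget(450)   # $4.50/month
heavy = contacts_for_budget(3000)  # $30/month
```

Under those assumptions, light usage corresponds to roughly 136 contacts per month and heavy usage to roughly 909, at about 3.3 cents and 6 seconds per contact.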

Key Features

Anti-scraping protection bypass with ScrapFly ASP
Residential French proxies for geolocation compliance
JavaScript rendering for dynamic content extraction
Multi-method phone extraction with 3-tier fallback
Robust price parsing with Unicode space handling
Environment-based configuration (.env support)
Comprehensive error handling and logging
Production-validated with detailed test reports

Technologies Used

Python
ScrapFly SDK
BeautifulSoup4
Regex
python-dotenv
JSON


Project Gallery

Technical Architecture: Two-phase scraping pipeline with anti-scraping bypass

Web scraping system with residential proxies and JavaScript rendering

Project Details

Client

Personal Project

Timeline

2025

Role

Solo Developer