A 2026 web scraping stack is no longer a single library and a list of proxies. It's an architecture: rotating residential at the front, headless browsers in the middle, an unblocker API for the worst sites, and a queue that knows when to escalate. Get the layers right and the cost-per-million-successful-requests math collapses; get them wrong and you'll burn five times the budget on retries you didn't need to make. Here's how we build scraping stacks for our own jobs and for the clients we audit.
The four layers of a modern scraper
| Layer | Purpose | Typical cost | Tools |
|---|---|---|---|
| 1. Transport (proxies) | Disguise origin, distribute load | $0.50 – $8 / GB | RapidProxy, SwiftProxy, Decodo |
| 2. Client (HTTP/headless) | Replay realistic traffic | CPU + RAM | Scrapy, Playwright, Puppeteer |
| 3. Anti-bot bypass | Solve TLS, JS, CAPTCHAs | $1 – $5 / 1k requests | Web Unblocker, curl-impersonate |
| 4. Orchestration | Queue, retry, deduplicate | CPU + ops time | Airflow, Temporal, custom |
Layer 1: picking the right proxy mix
The cheapest workable configuration sends each target through the lowest proxy tier that still gets through. We default to a three-tier router:
- Datacenter v6 or v4 for unprotected targets: RSS feeds, sitemaps, open APIs, government data. Proxy-Cheap at $0.15/IP IPv6 is the structurally cheapest option.
- Rotating residential for protected targets: anything behind Cloudflare, DataDome, PerimeterX, Akamai. RapidProxy with code ATBKU256W, SwiftProxy with IWBRT6TBA, or Decodo with PCMAG10.
- ISP / mobile for sticky-session work: accounts, sneakers, social. MarsProxies ISP, Proxidize mobile.
The router moves a request up the ladder when the response code suggests blocking (typically 403, 429, 503, or a Cloudflare interstitial). Most teams start with rotating residential and never bother with the cheap tier. That's fine for prototypes; it's expensive in production.
Tier-up rule: if your block rate on rotating residential exceeds 8% on a target, switch that target to an unblocker API. The compute and engineering time to debug your way past 8% is almost always more expensive than $5/1k unblocker requests.
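The escalation and tier-up rules can be sketched in a few lines. This is a minimal illustration, not our production router: the tier names, the `looks_blocked` heuristic and the function names are all hypothetical, and the ladder order assumes ISP/mobile sits between residential and the unblocker.

```python
# Hypothetical tier names, cheapest first, with the unblocker API as the top rung.
TIERS = ["datacenter", "residential", "isp", "unblocker"]
BLOCK_STATUSES = {403, 429, 503}

def looks_blocked(status: int, body: str = "") -> bool:
    """Treat the listed status codes, or a Cloudflare interstitial, as a block."""
    # "Just a moment..." is the usual title of Cloudflare's challenge page;
    # matching on it is a heuristic, not a guarantee.
    return status in BLOCK_STATUSES or "Just a moment..." in body

def next_tier(current: str) -> str:
    """Move one rung up the ladder, staying on the top rung once reached."""
    i = TIERS.index(current)
    return TIERS[min(i + 1, len(TIERS) - 1)]

def should_switch_to_unblocker(blocked: int, total: int, threshold: float = 0.08) -> bool:
    """Tier-up rule: per-target block rate on residential above the threshold."""
    return total > 0 and blocked / total > threshold
```

Feed `should_switch_to_unblocker` per-target counters rather than fleet-wide ones, so a single hard target cannot hide inside a healthy aggregate.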
Layer 2: clients and frameworks
Scrapy
Scrapy is still the default for high-throughput pure-HTTP crawling at hundreds to thousands of requests per second. Pair it with scrapy-playwright when you need JavaScript on a subset of pages; everything else keeps plain-Scrapy speed. The proxy plug-in scrapy-rotating-proxies handles per-request rotation cleanly.
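As a rough sketch, the settings below wire both plug-ins into one project. The setting names are the ones the two packages document. The proxy URLs, the `SETTINGS` dict and the `request_meta` helper are placeholders for illustration. As far as we know, browser-rendered requests take their proxy from Playwright's launch or context options rather than from the rotating middleware, so treat the rotation as covering the plain-HTTP path.

```python
# Sketch of Scrapy project settings, held in a dict for illustration.
SETTINGS = {
    # scrapy-playwright: route downloads through its handler, on the asyncio reactor.
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    # scrapy-rotating-proxies: proxy pool plus rotation and ban-detection middlewares.
    "ROTATING_PROXY_LIST": [
        "http://user:pass@gate.proxyhost.com:8080",  # placeholder gateway
    ],
    "DOWNLOADER_MIDDLEWARES": {
        "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
        "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
    },
}

def request_meta(needs_js: bool) -> dict:
    """Per-request meta: only pages that need JavaScript pay for a browser."""
    return {"playwright": True} if needs_js else {}
```

In a spider this would be used as `scrapy.Request(url, meta=request_meta(needs_js=True))` for the JavaScript-heavy subset, with an empty meta everywhere else.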
Playwright
Playwright is our recommendation for any new scraper that needs a browser engine. It's faster than Selenium, has first-class Python and Node bindings, and has community stealth plugins (playwright-stealth) that handle most of the trivial fingerprinting tells. Setting a proxy at browser launch (or per browser context) works cleanly:
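```python
from playwright.async_api import async_playwright

async def launch_with_proxy():
    playwright = await async_playwright().start()
    # Placeholder gateway and credentials; use your provider's values.
    return await playwright.chromium.launch(proxy={
        "server": "http://gate.proxyhost.com:8080",
        "username": "user",
        "password": "pass",
    })
```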
Puppeteer
Puppeteer-extra with the stealth plugin is the older Node ecosystem option. Slightly less polished than Playwright but still widely deployed. If you're working with an existing Puppeteer codebase there's no strong reason to migrate.
curl-impersonate
For TLS-fingerprint-sensitive targets that don't actually need JavaScript, curl-impersonate is a drop-in replacement for cURL that mimics Chrome's, Firefox's or Safari's TLS handshake byte-for-byte. Combined with rotating residential, it bypasses many JA3/JA4 checks without spinning up a headless browser. Massively cheaper at scale.
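From Python, one way to get the same effect is the curl_cffi package, which wraps curl-impersonate. The sketch below assumes curl_cffi is installed and that your release accepts the generic `"chrome"` target (older releases may need a pinned one such as `"chrome110"`). The proxy URL and both helper names are placeholders.

```python
def build_request_kwargs(proxy_url: str, browser: str = "chrome") -> dict:
    """Keyword arguments for a browser-impersonating request through one proxy."""
    return {
        "impersonate": browser,
        "proxies": {"http": proxy_url, "https": proxy_url},
        "timeout": 30,
    }

def fetch(url: str, proxy_url: str):
    # Imported lazily so the helper above can be used without curl_cffi installed.
    from curl_cffi import requests
    return requests.get(url, **build_request_kwargs(proxy_url))
```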
Layer 3: anti-bot bypass
The defence vendors you'll meet most often: Cloudflare (Turnstile, Bot Fight Mode), DataDome, PerimeterX (now Human Security), Akamai Bot Manager, Imperva, Kasada and Reblaze. Each enforces a different mix of TLS fingerprint checks, JavaScript challenges, behavioural heuristics and CAPTCHA rendering. Beating them in 2026 typically means one of three approaches:
- DIY: rotating residential + curl-impersonate + behavioural pacing. Cheapest but engineering-intensive.
- Headless + stealth: Playwright with stealth + clean residential. Works for most Cloudflare and Imperva configurations.
- Web Unblocker API: hand the request to an endpoint that solves the puzzle for you. Bright Data Web Unlocker, Oxylabs Unblocker, Decodo Site Unblocker and Novada Unblocker all sell this as a managed service.
We compare the unblockers head-to-head in our unblocker comparison. The short version: pay for an unblocker on your hardest 10–20% of targets, DIY the rest.
Need an unblocker on the hardest sites?
Decodo Site Unblocker handles Cloudflare, DataDome and PerimeterX from one endpoint. Reader code PCMAG10 trims 10%.
Layer 4: orchestration and retries
The orchestrator is where most scrapers quietly fail. Three rules we've learned the hard way:
- Retry with escalation, not retry with the same proxy. A 429 from a residential IP doesn't get better when you hit it again. Move the request to ISP or unblocker on retry.
- Deduplicate after success, not before. Pre-deduplication breaks when target pages change cache headers; post-success dedup is robust.
- Persist state outside the worker. Redis or Postgres for queue state. Workers are cattle, queues are pets.
For small jobs, a Python script and a Redis queue is enough. For anything serious, Temporal or Airflow gives you durable retries, exponential back-off and a UI when something goes wrong at 3am.
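The first two rules fit in one small function. This is a sketch under assumptions: `fetch` is a hypothetical callable you supply that returns `(status, body)` for a URL on a given tier, the tier names are illustrative, and `seen_hashes` is an in-memory set standing in for the Redis or Postgres state the third rule calls for.

```python
import hashlib
from typing import Callable, Optional

LADDER = ["residential", "isp", "unblocker"]  # hypothetical tier names
BLOCKED = {403, 429, 503}

def fetch_with_escalation(
    url: str,
    fetch: Callable[[str, str], tuple[int, bytes]],
    seen_hashes: set,
) -> Optional[bytes]:
    """Try each tier once, moving up on a block; dedup only after a success."""
    for tier in LADDER:
        status, body = fetch(url, tier)
        if status in BLOCKED:
            continue  # same URL, next tier up: never retry on the tier that blocked
        if status != 200:
            return None  # non-block failure (e.g. 404): escalation will not help
        digest = hashlib.sha256(body).hexdigest()
        if digest in seen_hashes:
            return None  # duplicate content, detected after the successful fetch
        seen_hashes.add(digest)
        return body
    return None  # blocked on every tier
```

Hashing the response body rather than the URL is one way to dedup after success; it catches the same page arriving under different URLs, at the price of downloading it first.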
Cost math, with real numbers
Take a target list of 10 million product pages, 30% of them behind Cloudflare. A naive setup using rotating residential at $1/GB with average page weight 200 KB:
10M pages × 200 KB = 2 TB
2 TB × $1/GB = $2,000
Effective cost with 70% success rate = $2,857
A tiered setup using IPv6 datacenter at $0.15/IP for the 70% unprotected pages, residential for the 25% protected, and unblocker at $4/1k for the 5% hardest:
7M pages × IPv6 (effectively free at scale) ≈ $50 in IPs
2.5M pages × residential 200 KB = 500 GB × $0.70 = $350
500k pages × unblocker $4/1k = $2,000
Total = $2,400 with 95%+ success
Same job, similar dollars, dramatically better data quality. The trick is the routing logic, not the proxy network you bought.
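To rerun this arithmetic with your own numbers, a small calculator along these lines is enough. The function names are ours. The defaults reproduce the figures above, with decimal units (1 GB = 1,000,000 KB) and the $50 IPv6 spend treated as a flat assumption.

```python
KB_PER_GB = 1_000_000  # decimal units, matching 10M pages x 200 KB = 2 TB

def bandwidth_cost(pages: int, page_kb: float, usd_per_gb: float) -> float:
    """Cost of transferring `pages` pages of `page_kb` KB each at a per-GB price."""
    return pages * page_kb / KB_PER_GB * usd_per_gb

def naive_cost(pages=10_000_000, page_kb=200, usd_per_gb=1.00, success_rate=0.70) -> float:
    """Everything over residential; failed requests are still paid for and retried."""
    return bandwidth_cost(pages, page_kb, usd_per_gb) / success_rate

def tiered_cost(pages=10_000_000, page_kb=200) -> float:
    """70% datacenter, 25% residential at $0.70/GB, 5% unblocker at $4 per 1k."""
    datacenter = 50.0  # assumed flat IPv6 IP spend for the unprotected pages
    residential = bandwidth_cost(int(pages * 0.25), page_kb, 0.70)
    unblocker = pages * 0.05 / 1000 * 4.00
    return datacenter + residential + unblocker
```

With the defaults, `naive_cost()` comes out at roughly $2,857 and `tiered_cost()` at $2,400.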
Avoiding the common mistakes
- Never scrape with default cURL or default Python requests. Their TLS fingerprints are flagged as scraper fingerprints by every defence vendor. Use curl-impersonate, requests-go, or a real browser.
- Don't pin one IP per session by accident. If your scraper opens a TCP connection per worker and the worker reuses it for an hour, you have a sticky session whether you wanted one or not. Disable HTTP keep-alive on rotating workloads.
- Honour robots.txt on every target. Not for legal reasons (it isn't binding) but because aggressive scraping accelerates target hardening, which makes future runs more expensive.
- Pace per domain, not per worker. Hammering a single hostname at 1,000 RPS will get you blocked regardless of how clean the proxies are.
- Log per-target success rate, not aggregate. Aggregates hide the one target that's silently driving 80% of your retries.
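Per-domain pacing can be sketched as a shared object that hands each request a wait time keyed by hostname. `DomainPacer` is a hypothetical name, and this in-process version assumes all workers live in one process; a multi-machine fleet would need the same bookkeeping in shared storage.

```python
import time
from urllib.parse import urlsplit

class DomainPacer:
    """Enforce a minimum interval between requests to the same hostname."""

    def __init__(self, max_rps_per_domain: float, clock=time.monotonic):
        self.min_interval = 1.0 / max_rps_per_domain
        self.clock = clock
        self.next_allowed = {}  # hostname -> earliest time the next request may go

    def delay_for(self, url: str) -> float:
        """Reserve the next slot for this URL's host and return seconds to wait."""
        host = urlsplit(url).hostname or ""
        now = self.clock()
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.min_interval
        return start - now
```

A worker would call `time.sleep(pacer.delay_for(url))` before each fetch, so adding workers raises throughput across domains without raising the rate against any single one.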
Provider shortlist for scraping
- Best price-to-performance: RapidProxy with ATBKU256W.
- Cheapest at scale: SwiftProxy with IWBRT6TBA, 711Proxy with 81EF19, NSOCKS with 2KUSS5.
- Best beginner UI: Decodo with PCMAG10.
- Enterprise / compliance: Bright Data with RESIGB50, Oxylabs.
- Built-in unblocker: Novada, Thordata with 5U5OU857.
Hosting the scraper
Pair the proxy network with a clean, low-latency VPS. vpsrated.com/proxy tracks the cleanest options for proxy traffic, eurohosting.org covers EU compute close to most residential pools, and russiavps.site appears in our shortlist when scraping benefits from RU egress. The community directories at 5-proxy.com and proxytrust.site both maintain scraping-specific filters that surface niche networks before they hit the major review sites.
Final verdict
A 2026 web scraping stack is a tiered system, not a tool. Datacenter for the easy 70%, residential for the protected 25%, unblocker for the hardest 5%. Pair the transport with curl-impersonate or Playwright, orchestrate with a queue that escalates on failure, and your cost-per-million collapses. Build the routing logic once and you'll save more on bandwidth than any single coupon ever did. The comparison engine filtered by use case is the easiest place to start your shortlist, and our 2026 residential ranking is the next stop.
Frequently asked questions
What's the most cost-effective scraping stack in 2026?
For most readers: a $20–$50/month residential plan from SwiftProxy or 711Proxy, paired with a small VPS for orchestration and Playwright with stealth plugins for hardened targets. That covers 80% of scraping use cases at well under $100/month — far below what packaged scraping SaaS charges. Add a managed Web Unblocker on top of that only for the hardest 5% of targets.
How do I avoid getting my scraper banned?
Five tactics, in order of impact: rotate residential IPs aggressively, randomise headers and user-agents, use realistic request pacing (0.5–2 req/sec per IP), persist cookies across multi-step flows, and respect retry-after headers. Skip any of these and your block rate climbs sharply. Our fingerprint guide covers the deeper signals beyond IP rotation.
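Respecting Retry-After is the easiest of the five to get subtly wrong, because the header can carry either a number of seconds or an HTTP date. A minimal standard-library parser, with a function name of our choosing, might look like this:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

def retry_after_seconds(value: Optional[str], now: Optional[datetime] = None) -> float:
    """Seconds to wait per a Retry-After header; 0.0 if absent or unparseable."""
    if not value:
        return 0.0
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return 0.0
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

Returning 0.0 for a missing or malformed header leaves the decision to your normal back-off schedule instead of stalling the worker.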
Should I use Scrapy, Playwright or a custom client?
Scrapy for high-volume, simple-HTML scraping. Playwright when JavaScript needs to execute (modern e-commerce, social, anything Cloudflare-fronted). A custom client (httpx + tls-client) for the middle ground. Almost no production team uses just one — they pick per target and pipe the results through a single parser layer downstream.
How does Web Unblocker pricing compare to a raw proxy?
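Counting request fees alone, a $0.001/raw-request proxy at a 50% raw success rate costs $0.002 per successful page, so it still undercuts a $0.005/successful-request unblocker; on fees alone the two only meet at about 20% raw success. Once you add browser compute, retries and debugging time, the crossover arrives much earlier. As a rule of thumb, above 70% raw success the proxy is cheaper, and below 50% the unblocker pulls ahead. Our unblocker comparison goes into the per-target math.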
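The fee-only part of that comparison is simple enough to check yourself. The sketch below counts request fees and nothing else: it leaves out the compute, retry and engineering overhead that push the practical crossover higher. The function names are ours.

```python
def cost_per_success(price_per_raw_request: float, success_rate: float) -> float:
    """Fee-only cost of one successful page when failed requests are still billed."""
    return price_per_raw_request / success_rate

def break_even_success_rate(price_per_raw_request: float, price_per_success: float) -> float:
    """Raw success rate at which a proxy and a pay-per-success unblocker cost the same."""
    return price_per_raw_request / price_per_success
```

For example, `cost_per_success(0.001, 0.5)` is $0.002 per page, and `break_even_success_rate(0.001, 0.005)` is 0.2, i.e. a 20% raw success rate.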
Is it legal to scrape behind login pages?
Generally not without authorisation. Public data scraping has been clarified post-hiQ v. LinkedIn, but logged-in scraping bypasses an access control and exposes you to CFAA / Computer Misuse Act risk. Our ethics and legality guide covers the legal landscape in detail.
Where can I find scraping-friendly hosting?
The independent hosting reviews at vpsrated.com and eurohosting.org rank providers on uplink stability, IP reputation and abuse-handling responsiveness. russiavps.site covers RU/CIS hosts whose IP ranges aren't pre-blacklisted by Western anti-bot systems. 5-proxy.com and proxytrust.site publish provider trust scores that update weekly.