Scraping at scale: The numbers that quietly decide whether your crawler succeeds

Most scraping projects fail for mundane reasons that can be measured in milliseconds and megabytes, not for lack of clever code. Across collectors built and run for retail, travel, classifieds and search, the same data points keep predicting outcomes.

If you want predictable throughput and low block rates, you need to treat scraping as a performance and network engineering problem backed by hard numbers, not hunches.

The physics of distance and why it matters

Network distance is not abstract. Light in fiber travels at roughly two thirds the speed of light, which works out to about 5 microseconds per kilometer. Put a crawler 5,000 kilometers from an origin and you add around 25 milliseconds one way, 50 milliseconds round trip, before any server processing.

A typical TLS 1.3 session over TCP involves about two round trips before first byte. On a distant target, that is easily 100 milliseconds burned before you even request HTML. If a page requires multiple origin fetches for JSON or images, the penalty multiplies. This is why placing crawlers near targets consistently lifts throughput and reduces timeout-induced retries.
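
That handshake budget is easy to verify. Below is a minimal sketch, using only the Python standard library, that times the TCP and TLS handshakes separately so the round-trip cost is visible on its own:

  import socket, ssl, time

  def handshake_times(host: str, port: int = 443) -> dict:
      # TCP three-way handshake: roughly one round trip to the target.
      t0 = time.monotonic()
      raw = socket.create_connection((host, port), timeout=10)
      tcp_s = time.monotonic() - t0

      # TLS handshake: roughly one more round trip on TLS 1.3, two on TLS 1.2.
      ctx = ssl.create_default_context()
      t1 = time.monotonic()
      tls = ctx.wrap_socket(raw, server_hostname=host)
      tls_s = time.monotonic() - t1
      version = tls.version()
      tls.close()

      return {"tcp_s": tcp_s, "tls_s": tls_s, "tls_version": version}

Run against a distant origin, both figures come out close to one network round trip each, which is exactly the cost the paragraph above describes.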

Page weight and request fan-out

The median web page now weighs about 2 MB and triggers roughly 70 network requests. JavaScript alone often accounts for around 450 KB. Dynamic catalogs and infinite scrolls amplify that by chaining API calls after the initial HTML.

Every extra request expands the surface area for blocks, latency and failures. If your collector is blind to third-party calls pulled in by the page, it will miss content and misestimate costs. Measuring request count, domains touched and total bytes per page up front is the easiest win you can bank.
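
One way to collect those numbers is to instrument a rendered page load. Here is a sketch using Playwright for Python, assuming that library is available; the content-length sum is a lower bound, since not every response carries the header:

  from urllib.parse import urlsplit
  from playwright.sync_api import sync_playwright

  def page_profile(url: str) -> dict:
      requests, domains, total_bytes = 0, set(), 0

      def on_response(response):
          nonlocal requests, total_bytes
          requests += 1
          domains.add(urlsplit(response.url).netloc)
          total_bytes += int(response.headers.get("content-length", 0))

      with sync_playwright() as p:
          browser = p.chromium.launch(headless=True)
          page = browser.new_page()
          page.on("response", on_response)
          page.goto(url, wait_until="networkidle")
          browser.close()

      return {"requests": requests, "domains": len(domains), "bytes": total_bytes}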

  • Baseline the median and 95th percentile of requests per page before scaling
  • Cap concurrent requests per target domain to avoid bursty block patterns (see the sketch after this list)
  • Cache stable assets and reuse sessions where allowed
  • Audit third-party calls the page initiates, not just the origin
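
For the per-domain cap, a minimal asyncio sketch, assuming the rest of the collector supplies an async fetch coroutine:

  import asyncio
  from collections import defaultdict
  from urllib.parse import urlsplit

  class DomainLimiter:
      """Caps in-flight requests per target domain to avoid bursty patterns."""

      def __init__(self, per_domain: int = 4):
          self._semaphores = defaultdict(lambda: asyncio.Semaphore(per_domain))

      async def fetch(self, url, fetch_fn):
          # One semaphore per domain; requests to other domains are unaffected.
          async with self._semaphores[urlsplit(url).netloc]:
              return await fetch_fn(url)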

Why origin location and IP reputation matter

Close to half of all web traffic is automated. That means many sites run active bot management that scores behavior, fingerprints, and IP reputation. If your collector originates from an address range associated with data center automation or known abuse, your error budget will evaporate, no matter how polite your crawler is.

Location is just as important as reputation. Content, prices, and availability often shift by region. Geo-fenced catalogs and ad stacks render different responses for the same URL depending on where the request appears to come from. When you need EU visibility, a Dutch egress point is a pragmatic default because it sits physically and legally within the region while offering excellent connectivity. If that is your use case, a single, well-placed Netherlands proxy can collapse latency, align geo-targeting, and materially improve success rates.
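
In most HTTP clients that is a small configuration change. A sketch with the requests library; the proxy endpoint and credentials are placeholders for whatever your provider issues:

  import requests

  # Placeholder endpoint for an NL-based egress point; substitute your provider's.
  NL_PROXY = "http://user:pass@nl.proxy.example:8000"

  session = requests.Session()
  session.proxies = {"http": NL_PROXY, "https": NL_PROXY}

  # The same URL can now return the EU-facing catalog, prices and availability.
  response = session.get("https://www.example.com/catalog", timeout=30)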

Headless, headers and the cost of looking real

JavaScript-heavy sites push significant logic client side. Rendering with a modern headless browser increases realism but adds CPU time and memory. Plan for that cost. Keep profiles consistent: user agent families, viewport sizes, timezone and locale, media codecs, WebGL and canvas fingerprints. Rotate too much and you look synthetic. Rotate too little and you look farmed.
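
A sketch of a pinned profile with Playwright; the specific user agent, viewport, locale and timezone values are illustrative, not prescriptive:

  from playwright.sync_api import sync_playwright

  with sync_playwright() as p:
      browser = p.chromium.launch(headless=True)
      # Keep one coherent profile: UA family, viewport, locale and timezone
      # should all tell the same story, and change rarely.
      context = browser.new_context(
          user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                     "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
          viewport={"width": 1366, "height": 768},
          locale="nl-NL",
          timezone_id="Europe/Amsterdam",
      )
      page = context.new_page()
      page.goto("https://www.example.com/")
      browser.close()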

At the HTTP layer, HTTP/2 multiplexes requests and reduces connection churn. HTTP/3 trims handshake overhead and avoids head-of-line blocking. Both help stability, but they do not erase block triggers. The basics still decide outcomes: session reuse, cookie discipline, cache control and idempotent retries with jitter.
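
Those basics are cheap to build in. A sketch of idempotent GET retries with exponential backoff and full jitter over a reused client; httpx with its HTTP/2 extra is assumed here, but any client with connection pooling works:

  import random
  import time
  import httpx

  RETRIABLE = {429, 500, 502, 503, 504}

  def get_with_retries(client: httpx.Client, url: str, attempts: int = 4):
      for attempt in range(attempts):
          try:
              response = client.get(url)
              if response.status_code not in RETRIABLE:
                  return response
          except httpx.TransportError:
              pass  # network-level failures are retried like 5xx responses
          # Exponential backoff with full jitter to avoid synchronized retries.
          time.sleep(random.uniform(0, 2 ** attempt))
      return None

  # One client = one connection pool; HTTP/2 multiplexes requests over it.
  client = httpx.Client(http2=True, timeout=30)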

Measure outcomes, not intent

Success rate is the primary KPI, but not the only one. Track response code mix, median and tail latency, bytes transferred, and the sequence of events that leads to a block. Time to first byte is especially revealing because it exposes distance, handshake cost and server pacing in one number.
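
A sketch of the kind of per-target summary worth logging, assuming each completed request has already been recorded with its status, latency, time to first byte and bytes transferred:

  from collections import Counter
  from statistics import median, quantiles

  def summarize(records):
      # records: iterable of dicts with "status", "latency_s", "ttfb_s", "bytes".
      records = list(records)
      latencies = [r["latency_s"] for r in records]
      return {
          "status_mix": Counter(r["status"] for r in records),
          "latency_p50_s": median(latencies),
          "latency_p95_s": quantiles(latencies, n=20)[-1],  # 95th percentile
          "ttfb_p50_s": median(r["ttfb_s"] for r in records),
          "total_bytes": sum(r["bytes"] for r in records),
      }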

Set explicit budgets per page: target requests, total bytes, maximum render time and retriable error thresholds. When a property exceeds budget, downshift concurrency and preserve success rate instead of chasing raw volume. It is better to run reliably at a lower clip than to trigger adaptive defenses you will struggle to unwind.
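
Budgets work best as explicit data rather than tribal knowledge. A sketch of a per-property budget check; the default numbers are illustrative, not recommendations:

  from dataclasses import dataclass

  @dataclass
  class PageBudget:
      max_requests: int = 120
      max_bytes: int = 4_000_000
      max_render_s: float = 15.0
      max_retriable_errors: int = 3

      def exceeded(self, requests: int, total_bytes: int,
                   render_s: float, retriable_errors: int) -> bool:
          return (
              requests > self.max_requests
              or total_bytes > self.max_bytes
              or render_s > self.max_render_s
              or retriable_errors > self.max_retriable_errors
          )

  # When exceeded(...) is True, downshift concurrency for that property
  # instead of retrying harder.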

Compliance is an availability feature

Respecting robots.txt and documented rate limits is not just ethics. It keeps you out of escalations that turn soft blocks into hard bans. Identify and honor opt-out signals. Use crawl windows that match business hours only when you have a reason. Silence is a signal too; long strings of 4xx and 5xx responses are a request to slow down.
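
The robots.txt side can be wired in with the standard library alone; the user agent string and URLs below are placeholders:

  from urllib.robotparser import RobotFileParser

  USER_AGENT = "example-collector"  # placeholder identifier for your crawler

  parser = RobotFileParser("https://www.example.com/robots.txt")
  parser.read()

  allowed = parser.can_fetch(USER_AGENT, "https://www.example.com/catalog")
  delay = parser.crawl_delay(USER_AGENT)  # None when the site sets no Crawl-delay
  # Fetch only when `allowed` is True, pacing requests by at least `delay` seconds.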

In the end, scraping at scale is an engineering exercise grounded in a few stubborn facts. Distance adds latency. Heavy pages multiply risk. Reputation and locality change what you are allowed to see. Measure those forces up front, design to their constraints, and your collectors will be both faster and quieter.