ScrapeAny Team
TLS Fingerprinting: How Anti-Bot Systems Detect Your Scraper

Your Scraper Is Identified Before It Sends a Single Request

Most developers think bot detection starts with HTTP headers or JavaScript challenges. It doesn't. It starts at the TLS handshake — before your scraper even sends a URL path, a User-Agent, or a cookie.

When your client connects to a server over HTTPS, it sends a ClientHello message as the first step of the TLS handshake. This message contains a rich set of parameters: supported cipher suites, TLS extensions, elliptic curves, signature algorithms, and protocol versions. Together, these parameters form a fingerprint that uniquely identifies your client.

Every HTTP library, every browser version, and every programming language runtime produces a distinct ClientHello. Anti-bot systems exploit this fact ruthlessly.

Anatomy of a TLS Fingerprint

The ClientHello message includes several fields that anti-bot systems analyze:

  • Supported cipher suites — the list of encryption algorithms your client supports, in the order it prefers them
  • TLS extensions — capabilities like Server Name Indication (SNI), Application-Layer Protocol Negotiation (ALPN), and supported groups
  • Elliptic curves and point formats — the specific cryptographic curves your client supports
  • Signature algorithms — which signing methods your client accepts
  • TLS version — TLS 1.2, 1.3, or both

A real Chrome browser on Windows produces a very specific combination of these values. Python's requests library (which uses OpenSSL under the hood) produces a completely different combination. The differences are obvious to anyone — or any system — that knows what to look for.
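You can see one slice of this yourself with nothing but the standard library. The cipher list a Python `ssl` context offers is whatever the linked OpenSSL build provides, and that ordered list is one of the exact fields anti-bot systems hash. A minimal sketch:

```python
import ssl

# The default context's cipher list comes straight from the linked OpenSSL
# build. This ordered list is one of the ClientHello fields that gets
# fingerprinted -- and it will not match what Chrome sends.
ctx = ssl.create_default_context()
for cipher in ctx.get_ciphers()[:5]:
    print(cipher["name"], cipher["protocol"])
```

Run the same snippet on two machines with different OpenSSL builds and you may get different lists, which is why library fingerprints are stable enough to catalog but distinct from every browser.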

The JA3 Era (2017-2021)

In 2017, researchers at Salesforce published JA3, the first widely-adopted TLS fingerprinting method. JA3 works by hashing five fields from the ClientHello into a single MD5 hash:

  1. TLS version
  2. Cipher suites
  3. Extensions
  4. Elliptic curves
  5. Elliptic curve point formats

The resulting 32-character hash gives every client a deterministic identifier. For example:

  • Chrome 120 on Windows might produce a0e9f5d64349fb13958b57b...
  • Python requests 2.31 might produce eb1d94daa7e0344597e756a...
  • curl 8.4 might produce 456523fc94726331a8d05d2...

Anti-bot systems maintain databases of known JA3 hashes. If your hash matches a known scraping library, you're blocked immediately — before the HTTP request is even processed.
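The JA3 computation itself is simple enough to sketch in a few lines. Each field is rendered as a dash-joined list of decimal values, the five fields are comma-joined, and the result is MD5-hashed. The field values below are illustrative, not a capture from a real browser:

```python
import hashlib

def ja3_hash(version, ciphers, extensions, curves, point_formats):
    # Each field is a dash-joined list of decimal values;
    # the five fields are then comma-joined and MD5-hashed.
    fields = [str(version)] + [
        "-".join(str(v) for v in vals)
        for vals in (ciphers, extensions, curves, point_formats)
    ]
    ja3_string = ",".join(fields)
    return hashlib.md5(ja3_string.encode()).hexdigest()

# Illustrative values only (not a captured browser ClientHello)
print(ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0]))
```

Because the hash is deterministic, two clients with identical ClientHello fields always collide on the same 32-character identifier, which is exactly what makes database lookups against known scrapers so cheap.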

Why JA3 Worked So Well

JA3 was effective because most scraping tools had no way to fully customize their TLS behavior. Python's requests library uses whatever the underlying OpenSSL build provides. You could tweak the cipher list through the ssl module, but you couldn't control which extensions were sent, or in what order, without patching the TLS library itself.

This made JA3 a highly reliable signal: if the hash matched a non-browser client, you could block with confidence.

JA3's Limitations

By 2021, the scraping community had adapted. Tools like curl_cffi and tls_client emerged, allowing developers to impersonate browser TLS fingerprints. Since JA3 was a static hash, all you needed to do was reproduce the exact same ClientHello fields, and you'd produce the same hash.

from curl_cffi import requests

# This produces a JA3 hash identical to Chrome
response = requests.get(
    "https://protected-site.com",
    impersonate="chrome"
)

JA3 also aged poorly as TLS itself evolved. Modern browsers insert randomized GREASE values into the ClientHello, and Chrome now shuffles its extension order on every connection, so an order-sensitive hash is no longer stable even for a single, legitimate browser.

The JA4 Transition (2023-2025)

JA4, published by FoxIO in 2023, addressed JA3's shortcomings with a fundamentally different approach. Instead of a single hash, JA4 produces a structured fingerprint with multiple components:

  • JA4 — the core TLS fingerprint, using a readable format instead of an opaque hash
  • JA4S — server-side TLS fingerprint (the ServerHello response)
  • JA4H — HTTP client fingerprint (header order, values)
  • JA4X — X.509 certificate fingerprint
  • JA4T — TCP fingerprint
  • JA4SSH — SSH fingerprint

This multi-signal approach makes evasion dramatically harder. Even if you perfectly spoof the JA4 TLS fingerprint, your JA4H (HTTP) fingerprint might betray you. Your TCP behavior might be inconsistent with the browser you're claiming to be. The combination of signals creates a composite identity that's much harder to fake than a single hash.

What Makes JA4 Harder to Beat

JA4 sorts the cipher suite and extension lists before hashing them, which means simply reordering your parameters (a common JA3 evasion technique) no longer works. It also uses a human-readable prefix format that encodes:

  • Transport protocol (t = TCP, q = QUIC)
  • TLS version
  • SNI presence (d = domain, i = IP)
  • Number of cipher suites
  • Number of extensions
  • First ALPN value

This prefix alone narrows the fingerprint space significantly, even before the full hash comparison.
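To make the format concrete, here is a simplified sketch of just that readable prefix. The real specification also appends truncated hashes of the sorted cipher and extension lists, and uses the first and last characters of the ALPN value; this toy omits those details:

```python
def ja4_prefix(transport, tls_version, has_sni, n_ciphers, n_extensions, alpn):
    """Build a simplified JA4-style prefix: transport protocol, TLS version,
    SNI flag, two-digit cipher/extension counts, and the ALPN value."""
    sni = "d" if has_sni else "i"  # d = SNI names a domain, i = connection to an IP
    return f"{transport}{tls_version}{sni}{n_ciphers:02d}{n_extensions:02d}{alpn}"

# A Chrome-like hello over TCP: TLS 1.3, SNI present, 15 ciphers, 16 extensions, ALPN h2
print(ja4_prefix("t", "13", True, 15, 16, "h2"))  # → t13d1516h2
```

Even this prefix alone tells a defender a lot: a client claiming to be Chrome but sending the wrong cipher or extension count is flagged before any hash comparison happens.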

Multi-Signal Fingerprinting: The Modern Reality

Modern anti-bot systems in 2026 don't rely on TLS fingerprinting alone. They combine multiple signals into a composite detection model:

Layer 1: Network-Level

  • TLS fingerprint (JA4 family)
  • TCP/IP fingerprint (window size, TTL, MSS)
  • IP reputation and ASN classification (data center vs. residential)

Layer 2: Protocol-Level

  • HTTP/2 settings and frame ordering (SETTINGS frame values, priority trees)
  • Header order and capitalization
  • Cookie handling behavior

Layer 3: Application-Level

  • JavaScript execution environment (browser APIs, timing)
  • Canvas and WebGL fingerprinting
  • Navigator and screen properties
  • Event handling patterns (mouse, keyboard, touch)

Layer 4: Behavioral

  • Request timing and cadence
  • Navigation patterns (do you load CSS, images, fonts?)
  • Session continuity and state management

Anti-bot systems feed these signals into machine learning models that classify requests on a probability spectrum rather than a binary allow/block decision. A request might have a perfect TLS fingerprint but suspicious timing patterns, resulting in a CAPTCHA challenge rather than an outright block.
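Conceptually, the scoring stage looks like a weighted combination of per-layer risk signals mapped onto an action tier. This is a toy illustration of the idea, not any vendor's actual model; the signal names, weights, and thresholds are all invented:

```python
def classify_request(signals, weights=None):
    """Toy composite scorer: each signal is a 0..1 risk value; the weighted
    mean is mapped to an action tier instead of a binary allow/block."""
    weights = weights or {k: 1.0 for k in signals}
    total = sum(weights[k] * v for k, v in signals.items())
    score = total / sum(weights.values())
    if score < 0.3:
        return "allow"
    if score < 0.7:
        return "challenge"  # e.g. serve a CAPTCHA
    return "block"

# Perfect TLS fingerprint, but bot-like timing -> challenged, not blocked
print(classify_request({"tls": 0.0, "http2": 0.1, "behavior": 0.9}))  # → challenge
```

The point of the tiered output is that no single clean signal earns a pass: a spoofed TLS fingerprint just lowers one term in the sum.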

What Actually Works for Evasion

Given the multi-layered detection landscape, here's what works — and what doesn't — in 2026:

Residential Proxies + Real Browser Engine

The most reliable approach combines residential proxy IPs (which have clean reputations and correct ASN classifications) with a real browser engine (Chromium via Playwright or Puppeteer). This ensures consistency across all fingerprinting layers:

  • The TLS fingerprint matches a real browser
  • The HTTP/2 behavior matches a real browser
  • JavaScript challenges are executed natively
  • The IP comes from a residential ISP

The downside: it's resource-intensive. Each browser instance consumes 100-300MB of RAM, and residential proxy bandwidth is expensive.

curl-impersonate and TLS Spoofing Libraries

For sites that primarily rely on TLS fingerprinting without heavy JavaScript challenges, libraries like curl_cffi (which is built on curl-impersonate) offer a lightweight alternative. They reproduce browser TLS fingerprints at a fraction of the resource cost of a full browser.

from curl_cffi import requests

session = requests.Session(impersonate="chrome")

# TLS fingerprint matches Chrome
# But no JS execution capability
response = session.get("https://target-site.com")

This works well against basic to moderate protection. It fails against systems that require JavaScript execution or behavioral analysis.

Distributed Architecture

Spreading requests across many IP addresses with randomized timing patterns helps avoid behavioral detection. Key principles:

  • Randomize request intervals — avoid fixed delays
  • Rotate sessions — don't reuse the same proxy-fingerprint combination for too long
  • Mimic natural patterns — real users don't request 100 pages per minute with perfectly even spacing
  • Load secondary resources — real browsers fetch CSS, images, and fonts. A scraper that only fetches HTML stands out.
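The timing side of these principles can be sketched in a few lines. This is a minimal illustration; the distribution and its parameters are arbitrary choices, not tuned values:

```python
import random
import time

def humanized_sleep(base=3.0, jitter=1.2, floor=0.5):
    """Sleep for a Gaussian-jittered interval instead of a fixed delay,
    so request spacing doesn't form a perfectly even comb."""
    delay = max(floor, random.gauss(base, jitter))
    time.sleep(delay)
    return delay

# Example: pace a batch of (hypothetical) page fetches
# for url in urls:
#     fetch(url)
#     humanized_sleep()
```

A fixed `time.sleep(2)` between requests produces exactly the kind of metronomic cadence behavioral models are trained to spot; sampling from a distribution is the cheapest possible countermeasure.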

Defense Effectiveness Matrix

Here's how different protection mechanisms perform against common evasion techniques:

Protection Layer       | Basic Scraper | TLS Spoofing | Headless Browser | Managed Service
JA3/JA4 fingerprinting | Blocks        | Bypasses     | Bypasses         | Bypasses
JavaScript challenges  | Blocks        | Blocks       | Bypasses         | Bypasses
HTTP/2 fingerprinting  | Blocks        | Partial      | Bypasses         | Bypasses
Behavioral analysis    | Blocks        | Blocks       | Partial          | Bypasses
ML composite scoring   | Blocks        | Partial      | Partial          | Bypasses

The pattern is clear: the more detection layers a site uses, the more sophistication you need on the evasion side. Each additional layer requires a different kind of expertise and infrastructure.

The Arms Race Continues

TLS fingerprinting is just one front in the ongoing arms race between scrapers and anti-bot systems. The trend is unmistakable: detection is becoming multi-dimensional, ML-driven, and increasingly difficult to circumvent with simple technical tricks.

For individual developers scraping a few sites, understanding TLS fingerprinting and using tools like curl_cffi is often enough. For organizations that need reliable, large-scale data collection across dozens of protected sites, the complexity of maintaining evasion across all detection layers becomes a full-time engineering challenge.

If you're dealing with advanced anti-bot systems and need data reliably without building a detection evasion team in-house, talk to us. TLS fingerprinting, behavioral modeling, and multi-layer bypass are problems we solve every day — so you don't have to.
