How to scrape Google AI Overviews in 2026.

Python + mobile proxies. The first scraping guide that actually addresses Google's 2026 AI search reality — what AI Overviews are, why they're harder to scrape than regular SERPs, and the IP-quality requirements nobody talks about.

tl;dr

AI Overviews are the SGE-style answer blocks Google now serves above traditional SERPs. Datacenter IPs rarely see them, and residential IPs see them inconsistently. Mobile carrier IPs trigger AI Overviews most reliably. Use Playwright + Pool Gateway mobile proxies, parse the .gNBE / overview data block, and store the source citations. That's the entire pipeline. Below: the working code.

background

What are Google AI Overviews?

Google AI Overviews — formerly the Search Generative Experience (SGE) — are the AI-generated answer blocks Google now serves above the traditional 10 blue links for many query types. A typical AI Overview includes:

  • A multi-paragraph LLM-generated answer synthesized from the top sources
  • 3-7 source citations as inline carousel cards (each with title, domain, favicon, snippet)
  • "People also ask" expanded blocks below the answer
  • Optional product carousels, video embeds, or location packs depending on query intent

For SEOs, marketers, and competitive intel teams, AI Overviews are a new ranking surface. Being cited as a source in an AI Overview now drives substantial traffic. Tracking which competitors get cited — and for what queries — is the new SERP tracking. The problem: there's no official API.

the catch

Why scraping AI Overviews is harder than regular SERPs

Three structural reasons traditional SERP scrapers don't reliably surface AI Overviews:

  1. Conditional rendering. Google decides per-query whether to serve an AI Overview. The decision is influenced by query intent, freshness, location, device class, and IP trust. The same query from a datacenter IP often gets no overview; from a mobile IP, it does.
  2. JavaScript-only payload. AI Overviews load via async client-side JS. Plain requests + BeautifulSoup scrapers see nothing. You need a real headless browser.
  3. Aggressive bot detection. Google's 2026 anti-bot stack (briefly: TLS JA4 fingerprinting + behavioral analysis + IP reputation) actively suppresses AI Overviews on flagged sessions. Even if you bypass the detection, you may get a stripped response with no overview.

step 1

Setup: Python + Playwright + Mobile Proxies

We use Playwright (headless Chromium) routed through Pool Gateway mobile proxies. Mobile carrier IPs trigger AI Overviews substantially more reliably than residential or datacenter IPs, apparently because Google treats mobile users as the primary surface for AI search.

bash
pip install playwright
playwright install chromium

# Get your proxy creds at client.proxies.sx
export PSX_USERNAME="psx_YOUR_ID-mbl-us"
export PSX_PASSWORD="YOUR_PROXY_PASSWORD"
step 2

Basic AI Overview scraper

python
import asyncio
import os
from urllib.parse import quote_plus
from playwright.async_api import async_playwright

PSX_USERNAME = os.environ['PSX_USERNAME']
PSX_PASSWORD = os.environ['PSX_PASSWORD']

async def scrape_ai_overview(query: str) -> dict:
    """Scrape a Google search and extract the AI Overview block if present."""
    url = f"https://www.google.com/search?q={quote_plus(query)}&hl=en"

    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={
                "server": "http://gw.proxies.sx:7000",
                "username": PSX_USERNAME,
                "password": PSX_PASSWORD,
            },
        )
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
                       "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1",
            viewport={"width": 390, "height": 844},
            device_scale_factor=3,
            is_mobile=True,
            has_touch=True,
        )
        page = await context.new_page()
        await page.goto(url, wait_until="networkidle", timeout=30000)

        # Wait briefly for AI Overview JS to populate
        await page.wait_for_timeout(2500)

        # AI Overview container — Google rotates class names; we rely on data-attrs and structure
        overview = await page.evaluate("""
() => {
  const blocks = document.querySelectorAll('[data-attrid*="overview"], [aria-label*="AI Overview" i]');
  if (!blocks.length) return null;
  const root = blocks[0];
  return {
    text: root.innerText.slice(0, 4000),
    sources: [...root.querySelectorAll('a[href*="://"]')]
              .map(a => ({ href: a.href, title: a.textContent.trim().slice(0, 200) }))
              .filter(s => s.title),
    raw_html_length: root.innerHTML.length,
  };
}
        """)

        await browser.close()
        return {
            "query": query,
            "ai_overview": overview,
            "scraped_via": PSX_USERNAME,
        }

if __name__ == "__main__":
    result = asyncio.run(scrape_ai_overview("how does CGNAT work"))
    print(result)

Notes on the code: we set a mobile User-Agent and viewport because AI Overviews surface more reliably for the mobile experience. The proxy creds use our Pool Gateway with mbl-us for US T-Mobile carrier IPs. Build your own username string with the Username Builder.
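The tl;dr mentions storing the source citations, so here is a minimal sketch of one way to persist each result as a JSON Lines row. It assumes the dict shape returned by scrape_ai_overview() above; the helper name append_result, the row fields, and the output file name are illustrative choices, not part of any library.

```python
import json
import time
from pathlib import Path


def append_result(result: dict, path: Path) -> dict:
    """Append one scrape result to a JSON Lines file and return the stored row.

    Assumes `result` looks like the return value of scrape_ai_overview():
    {"query": str, "ai_overview": dict or None, "scraped_via": str}.
    """
    overview = result.get("ai_overview") or {}
    row = {
        "query": result["query"],
        "scraped_at": int(time.time()),
        # Record misses too, so you can compute an overview hit rate later.
        "has_overview": result.get("ai_overview") is not None,
        "text": overview.get("text", ""),
        # Keep only the citation fields extracted in the browser.
        "sources": [
            {"href": s["href"], "title": s["title"]}
            for s in overview.get("sources", [])
        ],
    }
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row, ensure_ascii=False) + "\n")
    return row
```

Typical use would be append_result(result, Path("overviews.jsonl")) right after each scrape_ai_overview() call. One row per query keeps the file easy to tail and to load into a dataframe later.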

step 3

Parsing the AI Overview block

Google rotates class names aggressively to defeat scrapers, but as of April 2026 there are three tiers of selectors to know about, ranked from most to least stable:

  1. Most stable: data-attrid attributes and ARIA labels (e.g. [data-attrid*="overview"], [aria-label*="AI Overview"])
  2. Currently stable (April 2026): the obfuscated class names Kevs9, Y3BBE, li.jydCyd, Nn35F — these have held since early 2026 but expect them to rotate without notice
  3. Always-rotating: generic class hashes — never trust these

Defensive parsing strategy: use selector #1 as the primary, fall back to #2 if structure changes. The structure typically nests:

text
AI Overview container ([data-attrid*="overview"] OR .Kevs9)
├── Heading element ("AI Overview" or generative-style label)
├── Multi-paragraph answer body (.Y3BBE region)
├── Source carousel (li.jydCyd cards)
│   ├── Card 1 → <a href="…" title="…"> with citation number
│   ├── Card 2 → …
│   └── Card N
├── "Show more" / expand button (.Nn35F)
└── Optional: People Also Ask, related searches
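The primary-then-fallback strategy described above can be sketched as follows. This is an illustration under assumptions: the tier names and the extract_overview helper are made up for this example, and the tier 2 class name is the April 2026 value quoted in this article, which may already have rotated. The only Playwright call used is page.evaluate(expression, arg), which passes a JSON-serializable argument into the page function.

```python
# Selector tiers, most stable first. Tier names are arbitrary labels.
SELECTOR_TIERS = [
    {"name": "data-attrs",
     "selector": '[data-attrid*="overview"], [aria-label*="AI Overview" i]'},
    # April 2026 class name from this article; expect it to rotate.
    {"name": "class-names", "selector": ".Kevs9"},
]

# Runs in the browser: try each tier in order and report which one matched,
# so a rotation shows up in your logs as a change in matched_tier.
OVERVIEW_JS = """
(tiers) => {
  for (const tier of tiers) {
    const root = document.querySelector(tier.selector);
    if (!root) continue;
    return {
      matched_tier: tier.name,
      text: root.innerText.slice(0, 4000),
      sources: [...root.querySelectorAll('a[href*="://"]')]
        .map(a => ({ href: a.href, title: a.textContent.trim().slice(0, 200) }))
        .filter(s => s.title),
    };
  }
  return null;
}
"""


async def extract_overview(page, tiers=SELECTOR_TIERS):
    """Return the overview dict from the first matching tier, or None."""
    return await page.evaluate(OVERVIEW_JS, tiers)
```

In the step 2 scraper, overview = await extract_overview(page) could stand in for the inline page.evaluate call. Logging matched_tier over time gives an early warning: when results start arriving via "class-names", the more stable selectors have stopped matching.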
2026 gotcha

Google now serves AI Overviews in two modes: inline (full content rendered in the initial response) and page_token deferred (a token that lets you fetch the full expanded content via a separate request). The page_token expires within ~1 minute of issue, so if you're building a queue-based scraper, expand immediately or store the token-fetch URL for instant follow-up. Don't batch-process expansion later.
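For a queue-based scraper, the expiry constraint can be modelled as a deadline per token. The sketch below is hypothetical throughout: DeferredOverview, next_to_expand, and the 15-second safety margin are assumptions for illustration, and the 60-second TTL is simply the "~1 minute" figure above. It deliberately leaves out the expansion request itself, since its URL format is not covered here.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

TOKEN_TTL_SECONDS = 60.0      # "~1 minute" per the note above; approximate
SAFETY_MARGIN_SECONDS = 15.0  # assumption: headroom for the follow-up request


@dataclass
class DeferredOverview:
    """A deferred AI Overview waiting for its page_token to be expanded."""
    query: str
    page_token: str
    issued_at: float = field(default_factory=time.monotonic)

    def seconds_left(self, now: Optional[float] = None) -> float:
        now = time.monotonic() if now is None else now
        return TOKEN_TTL_SECONDS - (now - self.issued_at)

    def is_usable(self, now: Optional[float] = None) -> bool:
        return self.seconds_left(now) > SAFETY_MARGIN_SECONDS


def next_to_expand(pending, now: Optional[float] = None):
    """Pick the usable token closest to expiry, ignoring stale ones."""
    usable = [d for d in pending if d.is_usable(now)]
    return min(usable, key=lambda d: d.seconds_left(now), default=None)
```

Expanding the token nearest its deadline first keeps the number of wasted tokens low. Anything next_to_expand skips is better re-scraped from scratch than retried.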

Note that the new Google AI Mode (a separate product from AI Overviews: Google's standalone AI search experience, launched in 2025) requires a different scraping approach. We'll cover that in a follow-up article. AI Overviews remain the higher-volume target because they appear above traditional SERPs for billions of queries.

step 4

Rate limiting and best practices

Google's rate limits on SERP scraping have tightened significantly in 2026. Best practices to stay functional at scale:

  • Rotate IPs aggressively. Use -rot-auto5 in the username for a new IP every 5 minutes. Pool Gateway handles this server-side — no client-side complexity.
  • Stick to mobile carriers. Datacenter IPs trigger CAPTCHAs within 5-10 queries. Real T-Mobile/Verizon/Vodafone IPs survive 50-200 queries before any friction.
  • Vary city + carrier. Token combinations like -city-newyork-carrier-tmobile and -city-losangeles-carrier-verizon spread the fingerprint surface.
  • Realistic delays. Wait 4-12 seconds between queries, with jitter. Faster patterns read as obviously automated.
  • Persist cookies per IP. Real users have history. Use Playwright contexts with per-IP storage to add light cookie state on long-running scrapers.
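Two of the practices above, username tokens and jittered delays, can be sketched in a few lines. The token strings (-city-…, -carrier-…, -rot-auto5) come from this article, but the order in which they are appended is an assumption, so check it against the Username Builder. The helper names build_username, jittered_delay, and run_paced are made up for this example.

```python
import asyncio
import random


def build_username(base: str, city: str = "", carrier: str = "",
                   rotate_minutes: int = 0) -> str:
    """Append optional targeting/rotation tokens to a base proxy username.

    Token order is an assumption made for this sketch.
    """
    parts = [base]
    if city:
        parts.append(f"city-{city}")
    if carrier:
        parts.append(f"carrier-{carrier}")
    if rotate_minutes:
        parts.append(f"rot-auto{rotate_minutes}")
    return "-".join(parts)


def jittered_delay(low: float = 4.0, high: float = 12.0, rng=random) -> float:
    """Uniform random delay in seconds, matching the 4-12 s guidance."""
    return rng.uniform(low, high)


async def run_paced(queries, scrape, low: float = 4.0, high: float = 12.0):
    """Run `scrape(query)` sequentially with a random pause between queries."""
    results = []
    for i, query in enumerate(queries):
        if i:  # no pause before the first query
            await asyncio.sleep(jittered_delay(low, high))
        results.append(await scrape(query))
    return results
```

With the step 2 scraper, asyncio.run(run_paced(["query one", "query two"], scrape_ai_overview)) would space requests 4-12 seconds apart on a single proxy identity.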
why this matters

Real use cases

  • SEO competitive intel. Track which domains Google cites in AI Overviews for your money keywords. The cited sources get the lion's share of traffic now — being cited matters more than ranking #1 organically.
  • Brand monitoring. Run daily AI Overview scrapes for your brand name + variants. Detect when Google starts citing competitor content as the authoritative answer for your branded queries.
  • Content gap analysis. Compare AI Overviews across geos. Different countries get different cited sources — a market opportunity map.
  • LLM training data. AI Overviews are themselves LLM outputs. Scraped at volume, they form a useful corpus of what Google's Gemini considers the "canonical" answer to common queries.
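For the competitive-intel and brand-monitoring cases, the core computation is counting how often each domain is cited across a set of queries. A minimal sketch, assuming results shaped like the output of scrape_ai_overview() above: the function names are illustrative, and skipping google.com hosts is an assumption that links to Google's own pages inside the block are not citations.

```python
from collections import Counter
from urllib.parse import urlparse


def cited_domains(result: dict) -> set:
    """Return the set of domains cited in one result's AI Overview."""
    overview = result.get("ai_overview") or {}
    domains = set()
    for source in overview.get("sources", []):
        host = (urlparse(source["href"]).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        # Assumption: links to Google's own pages are not citations.
        if not host or host == "google.com" or host.endswith(".google.com"):
            continue
        domains.add(host)
    return domains


def citation_share(results) -> Counter:
    """Count, per domain, the number of queries that cited it at least once."""
    counts = Counter()
    for result in results:
        counts.update(cited_domains(result))
    return counts
```

citation_share(results).most_common(10) then lists the ten most-cited domains for a keyword set. Running it per geo and comparing the counters is one way to approach the content gap analysis described above.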

US case law is generally favorable to scraping publicly accessible web data. In hiQ v. LinkedIn, the Ninth Circuit held that scraping public pages likely does not violate the CFAA, and the Supreme Court's decision in Van Buren v. United States narrowed the statute's "exceeds authorized access" clause. AI Overviews are publicly served to anyone running a browser, which puts them squarely in "public web" territory. Google's Terms of Service prohibit automated access; a TOS violation is a contract matter, not a criminal one. Practical guidance: don't scrape behind login walls, don't republish verbatim, attribute sources, and respect robots.txt where it makes sense for your use case.

FAQ

Why mobile IPs specifically? Won't residential work?
Residential IPs work for some queries, but AI Overviews appear less consistently. Google's personalization layer treats mobile carrier IPs as the primary search surface, so AI Overviews surface more reliably there. Run the same query on residential vs mobile and you'll see the difference.

How often does Google rotate the AI Overview HTML structure?
Class names change every 4-8 weeks. data-attrid attributes are more stable but also rotate. Build your selectors defensively: prefer ARIA labels and data attributes over class names, and have fallback parsing logic for when structure changes.

Can I run this without a paid proxy service?
Technically yes with free residential proxies, but they're unreliable and Google blocks them within a handful of queries. For any production use case, paid mobile proxies are the only path that scales.

What about SerpAPI or third-party SERP APIs?
They've improved meaningfully: SerpAPI publishes a 68% AI Overview detection rate as of early 2026, with ~2-second avg response time. The dedicated /search?engine=google_ai_overview endpoint returns structured JSON. The catch: even at 68%, they miss roughly 1 in 3 queries, the cost adds up at volume, and you have no control over which IPs are used. For high-volume tracking or geo-specific work, building your own scraper with mobile proxies remains the fuller-coverage option.

What's "Google AI Mode" and is it the same as AI Overviews?
Different products. AI Overviews are inline blocks above traditional SERPs (the topic of this guide). "Google AI Mode" is Google's standalone AI search experience, launched in 2025 as a separate UI accessed via a tab toggle. AI Mode is more conversational; AI Overviews are stitched into normal search. Both are scrapable but require different selectors. AI Overviews remain higher volume.

How do I scale this past hundreds of queries per day?
Run multiple Playwright workers in parallel with rotating proxy strings (different country/carrier/city combinations in the username DSL). Pool Gateway handles concurrency up to 100 simultaneous connections per account. For higher volume, contact us about the enterprise tier.
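A minimal sketch of the parallel-worker pattern, using an asyncio.Semaphore to cap concurrency. The scrape_many name and the default cap of 10 are assumptions for illustration; whatever cap you choose has to stay below the 100-connection account limit mentioned above. The scrape argument is any coroutine function that takes a query, such as scrape_ai_overview from step 2.

```python
import asyncio


async def scrape_many(queries, scrape, max_concurrency: int = 10):
    """Run `scrape(query)` for every query with at most N in flight at once.

    Results come back in the same order as `queries`. A failed query yields
    an error row instead of cancelling the whole batch.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(query):
        async with semaphore:
            try:
                return await scrape(query)
            except Exception as exc:  # keep the batch alive on one bad query
                return {"query": query, "ai_overview": None, "error": repr(exc)}

    return await asyncio.gather(*(worker(q) for q in queries))
```

Each worker here launches its own browser through scrape, so memory, not the proxy gateway, is likely to be the first limit you hit. Start with a low cap and raise it gradually.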
next step
Get mobile proxies, run the script, see real AI Overview data.