What are Google AI Overviews?
Google AI Overviews — formerly the Search Generative Experience (SGE) — are the AI-generated answer blocks Google now serves above the traditional 10 blue links for many query types. A typical AI Overview includes:
- A multi-paragraph LLM-generated answer synthesized from the top sources
- 3-7 source citations as inline carousel cards (each with title, domain, favicon, snippet)
- "People also ask" expanded blocks below the answer
- Optional product carousels, video embeds, or location packs depending on query intent
For SEOs, marketers, and competitive intel teams, AI Overviews are a new ranking surface. Being cited as a source in an AI Overview now drives substantial traffic. Tracking which competitors get cited — and for what queries — is the new SERP tracking. The problem: there's no official API.
Why scraping AI Overviews is harder than regular SERPs
Three structural reasons traditional SERP scrapers don't reliably surface AI Overviews:
1. Conditional rendering. Google decides per-query whether to serve an AI Overview. The decision is influenced by query intent, freshness, location, device class, and IP trust. The same query from a datacenter IP often gets no overview; from a mobile IP, it does.
2. JavaScript-only payload. AI Overviews load via async client-side JS. Plain requests + BeautifulSoup scrapers see nothing. You need a real headless browser.
3. Aggressive bot detection. Google's 2026 anti-bot stack (briefly: TLS JA4 fingerprinting + behavioral analysis + IP reputation) actively suppresses AI Overviews on flagged sessions. Even if you bypass the detection, you may get a stripped response with no overview.
Setup: Python + Playwright + Mobile Proxies
We use Playwright (headless Chromium) routed through Pool Gateway mobile proxies. In our experience, mobile carrier IPs trigger AI Overviews substantially more reliably than residential or datacenter IPs, likely because Google treats mobile as the primary surface for AI search.
pip install playwright
playwright install chromium
# Get your proxy creds at client.proxies.sx
export PSX_USERNAME="psx_YOUR_ID-mbl-us"
export PSX_PASSWORD="YOUR_PROXY_PASSWORD"
Basic AI Overview scraper
import asyncio
import os
from urllib.parse import quote_plus

from playwright.async_api import async_playwright

PSX_USERNAME = os.environ['PSX_USERNAME']
PSX_PASSWORD = os.environ['PSX_PASSWORD']


async def scrape_ai_overview(query: str) -> dict:
    """Scrape a Google search and extract the AI Overview block if present."""
    url = f"https://www.google.com/search?q={quote_plus(query)}&hl=en"
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={
                "server": "http://gw.proxies.sx:7000",
                "username": PSX_USERNAME,
                "password": PSX_PASSWORD,
            },
        )
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
                       "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1",
            viewport={"width": 390, "height": 844},
            device_scale_factor=3,
            is_mobile=True,
            has_touch=True,
        )
        page = await context.new_page()
        await page.goto(url, wait_until="networkidle", timeout=30000)
        # Wait briefly for AI Overview JS to populate
        await page.wait_for_timeout(2500)
        # AI Overview container: Google rotates class names, so we rely on data-attrs and structure
        overview = await page.evaluate("""
            () => {
                const blocks = document.querySelectorAll('[data-attrid*="overview"], [aria-label*="AI Overview" i]');
                if (!blocks.length) return null;
                const root = blocks[0];
                return {
                    text: root.innerText.slice(0, 4000),
                    sources: [...root.querySelectorAll('a[href*="://"]')]
                        .map(a => ({ href: a.href, title: a.textContent.trim().slice(0, 200) }))
                        .filter(s => s.title),
                    raw_html_length: root.innerHTML.length,
                };
            }
        """)
        await browser.close()
    return {
        "query": query,
        "ai_overview": overview,
        "scraped_via": PSX_USERNAME,
    }


if __name__ == "__main__":
    result = asyncio.run(scrape_ai_overview("how does CGNAT work"))
    print(result)

Notes on the code: we set a mobile User-Agent and viewport because AI Overviews surface more reliably for the mobile experience. The proxy creds use our Pool Gateway with mbl-us for US T-Mobile carrier IPs. Build your own username string with the Username Builder.
Parsing the AI Overview block
Google rotates class names aggressively to defeat scrapers, but as of April 2026 there are three layers of selector you can rely on, ranked by stability:
- Most stable: data-attrid attributes and ARIA labels (e.g. [data-attrid*="overview"], [aria-label*="AI Overview"])
- Currently stable (April 2026): the obfuscated class names Kevs9, Y3BBE, li.jydCyd, Nn35F — these have held since early 2026 but expect them to rotate without notice
- Always-rotating: generic class hashes — never trust these
Defensive parsing strategy: use the most stable tier (data attributes and ARIA labels) as the primary selector, and fall back to the obfuscated class names if the structure changes. The structure typically nests:
AI Overview container ([data-attrid*="overview"] OR .Kevs9)
├── Heading element ("AI Overview" or generative-style label)
├── Multi-paragraph answer body (.Y3BBE region)
├── Source carousel (li.jydCyd cards)
│ ├── Card 1 → <a href="…" title="…"> with citation number
│ ├── Card 2 → …
│ └── Card N
├── "Show more" / expand button (.Nn35F)
└── Optional: People Also Ask, related searches
Google now serves AI Overviews in two modes: inline (full content rendered in the initial response) and page_token deferred (a token that lets you fetch the full expanded content via a separate request). The page_token expires within ~1 minute of issue, so if you're building a queue-based scraper, expand immediately or store the token-fetch URL for instant follow-up. Don't batch-process expansion later.
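The tiered fallback from the defensive parsing strategy can be sketched as a small helper that tries each selector tier in order of stability. This is a minimal sketch, not a definitive implementation: SELECTOR_TIERS and pick_overview_selector are hypothetical names introduced here, the class name is the one quoted in this article and may already have rotated, and count_matches is any async callable you supply that reports how many elements a selector matches.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple

# Selector tiers, most stable first. The obfuscated class name is the one
# quoted in this article (April 2026) and is expected to rotate.
SELECTOR_TIERS = [
    ("data-attrs", '[data-attrid*="overview"], [aria-label*="AI Overview" i]'),
    ("obfuscated-class", ".Kevs9"),
]


async def pick_overview_selector(
    count_matches: Callable[[str], Awaitable[int]],
) -> Optional[Tuple[str, str]]:
    """Return (tier_name, selector) for the first tier that matches, else None."""
    for tier_name, selector in SELECTOR_TIERS:
        if await count_matches(selector) > 0:
            return tier_name, selector
    return None  # no tier matched: no overview served, or the markup changed
```

With a Playwright page, the probe could be `lambda sel: page.evaluate("sel => document.querySelectorAll(sel).length", sel)`. Logging which tier matched on each run gives early warning that the primary selector has stopped working.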
Note that the new Google AI Mode (a separate product from AI Overviews — Google's standalone AI search experience launched late 2025) requires a different scraping approach. We'll cover that in a follow-up article. AI Overviews remain the higher-volume target because they appear above traditional SERPs for billions of queries.
Rate limiting and best practices
Google's rate limits on SERP scraping have tightened significantly in 2026. Best practices to stay functional at scale:
- Rotate IPs aggressively. Use -rot-auto5 in the username for a new IP every 5 minutes. Pool Gateway handles this server-side, so there is no client-side complexity.
- Stick to mobile carriers. Datacenter IPs trigger CAPTCHAs within 5-10 queries. Real T-Mobile/Verizon/Vodafone IPs survive 50-200 queries before any friction.
- Vary city + carrier. Token combinations like -city-newyork-carrier-tmobile and -city-losangeles-carrier-verizon spread the fingerprint surface.
- Realistic delays. Wait 4-12 seconds between queries, with jitter. Faster or perfectly regular patterns look automated.
- Persist cookies per IP. Real users have history. Use Playwright contexts with per-IP storage to add light cookie state on long-running scrapers.
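The pacing and token advice above can be sketched with two small helpers. Both build_username and jittered_delay are hypothetical names introduced for illustration, and the way tokens are joined onto the base username is an assumption: confirm the exact order and format with the Username Builder before relying on it.

```python
import random
from typing import Optional


def build_username(base: str, *tokens: str) -> str:
    """Append routing tokens to a base proxy username.

    Assumption: tokens are joined with '-' after the base username.
    """
    return "-".join([base, *tokens])


def jittered_delay(
    low: float = 4.0,
    high: float = 12.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Pick a random pause in seconds between low and high."""
    source = rng if rng is not None else random
    return source.uniform(low, high)
```

In an async scraping loop, `await asyncio.sleep(jittered_delay())` between queries gives the 4-12 second spread, and alternating between usernames built with different city and carrier tokens varies the exit IPs.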
Real use cases
- SEO competitive intel. Track which domains Google cites in AI Overviews for your money keywords. The cited sources get the lion's share of traffic now, so being cited matters more than ranking #1 organically.
- Brand monitoring. Run daily AI Overview scrapes for your brand name + variants. Detect when Google starts citing competitor content as the authoritative answer for your branded queries.
- Content gap analysis. Compare AI Overviews across geos. Different countries get different cited sources, which gives you a market opportunity map.
- LLM training data. AI Overviews are themselves LLM outputs. Scraped at volume, they form a useful corpus of what Google's Gemini considers the "canonical" answer to common queries.
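For the competitive intel case, a minimal sketch of tallying cited domains across a batch of scrape results might look like the following. It assumes each result has the shape returned by scrape_ai_overview (an "ai_overview" key that is either None or a dict with a "sources" list of {"href", "title"} entries); cited_domains is a hypothetical helper name.

```python
from collections import Counter
from urllib.parse import urlparse


def cited_domains(results: list) -> Counter:
    """Count how many queries cite each domain in their AI Overview."""
    counts = Counter()
    for result in results:
        overview = result.get("ai_overview")
        if not overview:
            continue  # no AI Overview was served for this query
        domains = set()  # count each domain at most once per query
        for source in overview.get("sources", []):
            host = urlparse(source["href"]).hostname
            if not host:
                continue
            if host.startswith("www."):
                host = host[4:]
            domains.add(host)
        counts.update(domains)
    return counts
```

Calling `cited_domains(results).most_common(10)` then lists the domains cited across the most queries. Links inside the overview block that point back to Google itself would be counted too, so you may want to filter those out.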
Legal considerations
US courts have generally read the CFAA narrowly where publicly accessible data is concerned: Van Buren v. United States limited what counts as exceeding authorized access, and in hiQ v. LinkedIn the Ninth Circuit held that scraping publicly accessible pages likely does not violate the statute. AI Overviews are publicly served to anyone running a browser, which puts them squarely in "public web" territory. Google's Terms of Service prohibit automated access; a TOS violation is a matter of contract law, not criminal law. Practical guidance: don't scrape behind login walls, don't republish verbatim, attribute sources, and respect robots.txt where it makes sense for your use case.