
Scrape ChatGPT Search results in 2026.

Track citations, AI-generated summaries, and source rankings in ChatGPT Search. Python + Playwright + mobile proxies. The patterns that survive Cloudflare in 2026.

tl;dr

Hit chatgpt.com/search?q=... via Playwright through Pool Gateway mobile proxies. Wait for the streaming AI response to finish, parse the citation cards from the right-side rail, store source URLs + brand mentions. Rotate IPs every 5-10 queries. Mobile carrier IPs survive Cloudflare; datacenter IPs don't.

why this matters

Why track ChatGPT Search at all?

ChatGPT Search is now the second-largest AI search surface (after Google AI Overviews). When a brand gets cited as a source in a ChatGPT Search response, that's the new top-of-funnel — users see the brand inside the AI answer and click through. Marketing teams that tracked Google SERPs religiously now need to track ChatGPT Search the same way.

One complication in 2026: OpenAI launched ChatGPT Atlas in late 2025, its own agent-capable browser, available on macOS (Windows / iOS / Android coming). Atlas has an "agent mode" that lets ChatGPT browse and complete tasks directly. As a result, more queries route through Atlas's in-browser search rather than chatgpt.com/search, which fragments the visibility surface. For tracking purposes you still target chatgpt.com/search (the largest surface), but be aware that Atlas-specific behaviors are emerging.

step 1

Setup

bash
pip install playwright httpx
playwright install chromium

export PSX_USERNAME="psx_YOUR_ID-mbl-us-rot-auto10"
export PSX_PASSWORD="YOUR_PROXY_PASSWORD"

step 2

Basic ChatGPT Search scraper

python
import asyncio
import os
from urllib.parse import quote_plus
from playwright.async_api import async_playwright

async def scrape_chatgpt_search(query: str) -> dict:
    url = f"https://chatgpt.com/search?q={quote_plus(query)}"

    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={
                "server": "http://gw.proxies.sx:7000",
                "username": os.environ['PSX_USERNAME'],
                "password": os.environ['PSX_PASSWORD'],
            },
        )
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
                       "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        )
        page = await context.new_page()
        await page.goto(url, wait_until="networkidle", timeout=45000)

        # Wait for the assistant turn to render. This confirms the response
        # container exists; it does not by itself prove streaming has ended.
        await page.wait_for_selector('[data-testid="conversation-turn-2"]', timeout=30000)
        await page.wait_for_timeout(3500)  # fixed pause so the citation rail can populate

        result = await page.evaluate("""
() => {
  // Main answer text
  const answerEl = document.querySelector('[data-message-author-role="assistant"]');
  const answer = answerEl ? answerEl.innerText : null;

  // Citation cards (right-side rail in ChatGPT Search)
  const citations = [...document.querySelectorAll('a[href*="://"][target="_blank"]')]
    .filter(a => a.closest('[class*="citation"], [class*="source"]'))
    .map(a => ({
      url: a.href,
      title: a.textContent.trim().slice(0, 200),
      domain: new URL(a.href).hostname,
    }));

  return { answer: answer ? answer.slice(0, 6000) : null, citations };
}
        """)

        await browser.close()
        return { "query": query, **result }

if __name__ == "__main__":
    print(asyncio.run(scrape_chatgpt_search("best mobile proxy provider 2026")))

step 3

Parsing the response

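ChatGPT Search changes its DOM structure roughly monthly. A selector strategy that holds up:

  • Use data-message-author-role="assistant" for the answer body, which has stayed stable across UI revisions
  • Filter a[target="_blank"] for citations; class names rotate but the target attribute holds. The class-substring filter in the step 2 scraper (citation, source) is the part most likely to need updating
  • The right-side citation rail is conditionally rendered, so wait for it explicitly with page.wait_for_selector rather than relying on a fixed pause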
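
Because a[target="_blank"] is a broad selector, it can help to clean the scraped list on the Python side before storing it. The sketch below is one way to do that, not the only one: it drops links that point back at the ChatGPT UI, strips utm_* tracking parameters, and dedupes by URL while keeping DOM order. The function name clean_citations, the INTERNAL_HOSTS set, and the assumption that outbound links may carry utm_* parameters are all illustrative choices, not something the page guarantees.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: hosts treated as internal UI links rather than cited sources.
INTERNAL_HOSTS = {"chatgpt.com", "openai.com"}


def clean_citations(citations: list[dict]) -> list[dict]:
    """Drop internal links, strip utm_* params, dedupe by URL, keep order."""
    seen: set[str] = set()
    cleaned: list[dict] = []
    for c in citations:
        parts = urlsplit(c.get("url", ""))
        host = (parts.hostname or "").lower()
        # Skip javascript:, mailto:, and anything without a real host.
        if parts.scheme not in ("http", "https") or not host:
            continue
        if host in INTERNAL_HOSTS or host.endswith(".openai.com"):
            continue
        # Remove tracking parameters so the same page is not counted twice.
        kept = [
            (k, v)
            for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if not k.lower().startswith("utm_")
        ]
        url = urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
        if url in seen:
            continue
        seen.add(url)
        cleaned.append({
            "url": url,
            "title": (c.get("title") or "").strip(),
            "domain": host.removeprefix("www."),
            "rank": len(cleaned) + 1,  # 1-based position in DOM order
        })
    return cleaned
```

Call it on the scraper output, for example clean_citations(result["citations"]). The rank field is only the order in which links appeared in the DOM, which may or may not match how ChatGPT weights its sources.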

anti-bot

Cloudflare + OpenAI bot detection

OpenAI fronts chatgpt.com with Cloudflare. Cloudflare in 2026 fingerprints TLS (JA4), profiles browsing behavior, and aggressively blocks datacenter ranges. Practical defenses:

  • Mobile carrier IPs only. Datacenter IPs hit a Cloudflare challenge within 1-3 queries. Real mobile IPs survive 50-100+ queries.
  • Realistic user agent. Default Playwright UA is flagged. Use a current Chrome 131+ UA matching macOS or Windows.
  • IP rotation in username. Use -rot-auto10 in your proxy username to get a fresh IP every 10 minutes. Build the right string with the Username Builder.
  • Random delays between queries. Wait 6-15 seconds with jitter; faster or evenly spaced request patterns read as automated.
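
The pacing advice above can be sketched as a small driver loop. This is a minimal example under stated assumptions: jittered_delay and run_paced are hypothetical helper names, the 6-15 second bounds come from the list above, and scrape is any async function with the same shape as scrape_chatgpt_search from step 2. IP rotation is not handled here; it is assumed to come from the -rot-auto10 proxy username.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional


def jittered_delay(
    low: float = 6.0,
    high: float = 15.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Pick a delay in seconds, uniformly between low and high."""
    return (rng or random).uniform(low, high)


async def run_paced(
    queries: list[str],
    scrape: Callable[[str], Awaitable[dict]],
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
    rng: Optional[random.Random] = None,
) -> list[dict]:
    """Run queries one at a time with a random pause between them."""
    results: list[dict] = []
    for i, query in enumerate(queries):
        if i > 0:
            # No pause before the first query, a jittered one before the rest.
            await sleep(jittered_delay(rng=rng))
        try:
            results.append(await scrape(query))
        except Exception as exc:
            # Record the failure and keep the batch going.
            results.append({"query": query, "error": repr(exc)})
    return results
```

Usage would look like asyncio.run(run_paced(queries, scrape_chatgpt_search)). The sleep parameter is injectable so the loop can be exercised in tests without real waiting.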

why this matters

Use cases

  • Brand monitoring — daily check whether ChatGPT cites your domain for relevant queries
  • Competitive research — see which competitors get cited and for what topics
  • Content gap analysis — query topics where you should rank but don't
  • SEO + GEO (Generative Engine Optimization) measurement
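
For the brand-monitoring and competitive-research cases, one scraped result can be reduced to a small report. The sketch below assumes the result shape returned by the step 2 scraper (query, answer, and citations with a domain field); brand_report, its field names, and the example domain are hypothetical and only meant to show the idea.

```python
def _norm(host: str) -> str:
    """Lowercase a hostname and drop a leading www."""
    return (host or "").lower().removeprefix("www.")


def _matches(host: str, domain: str) -> bool:
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def brand_report(result: dict, domain: str, brand_terms: list[str]) -> dict:
    """Summarize whether `domain` is cited and which brand terms appear."""
    domain = _norm(domain)
    answer = (result.get("answer") or "").lower()
    hosts = [_norm(c.get("domain", "")) for c in result.get("citations", [])]

    # 1-based position of the first citation that belongs to the domain.
    first_position = next(
        (i for i, h in enumerate(hosts, start=1) if _matches(h, domain)),
        None,
    )
    return {
        "query": result.get("query"),
        "cited": first_position is not None,
        "first_citation_position": first_position,
        # Plain case-insensitive substring match on the answer text.
        "mentioned_terms": [t for t in brand_terms if t.lower() in answer],
        # Every other cited domain, useful for competitive research.
        "other_domains": sorted({h for h in hosts if h and not _matches(h, domain)}),
    }
```

Running this daily over a fixed query list and storing the reports gives a simple time series of citation presence. Position here is DOM order, which is an approximation of prominence rather than a confirmed ranking.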

FAQ

Will OpenAI sue me for scraping?
Scraping publicly accessible web data has generally held up under US precedent (Van Buren v. United States, hiQ v. LinkedIn). OpenAI's ToS prohibits automated access, which is a contract-law matter rather than a criminal one. Don't scrape behind a login, don't republish answers verbatim, attribute your sources, and you're in normal scraping territory.
Can I use this with the OpenAI API instead?
The OpenAI API exposes web search as a tool for the model, but the results are consumed by the model rather than returned as the ranked, queryable source list that ChatGPT Search users see. API output can also differ from the consumer UI. To track what users see in ChatGPT Search, you have to scrape the user-facing UI.
How many queries per day can I run?
With proper IP rotation (mobile carrier IPs, -rot-auto10 in the username, 6-15s delays), 200-500 queries per day per account is sustainable. For more, run multiple Playwright workers with different proxy strings.
What about Bing Copilot or Perplexity?
Same pattern, different DOM. Bing Copilot is easier (less aggressive bot detection). Perplexity has its own quirks — see /blog/perplexity-api-alternatives-self-hosted-2026 for that one.
Does ChatGPT Atlas change the scraping picture?
Atlas is OpenAI's own browser, launched late 2025 on macOS. It has an agent mode that lets ChatGPT browse on the user's behalf. For tracking purposes, you still scrape chatgpt.com/search — that's where the bulk of brand-visibility queries land. Atlas-specific scraping is its own topic; we'll cover it as adoption grows.
Does Bright Data have a ChatGPT Scraper API?
Yes — they ship a managed ChatGPT Scraper API that returns structured search results, citations, and answers. Useful if you want to skip the Playwright plumbing entirely. Tradeoff: less control, per-query pricing instead of per-GB. For high-volume tracking, the DIY mobile-proxy approach is cheaper.
Get mobile proxies, run the scraper, see what ChatGPT cites for your money queries.