Perplexity Sonar API pricing — current state
Honest framing first: Perplexity has expanded substantially since launch. As of April 2026 they ship four Sonar SKUs plus separate Reasoning, Deep Research, Search, and Agentic Research APIs.
| Model | Input · per 1M tokens | Output · per 1M tokens | Best for |
|---|---|---|---|
| Sonar Small Online | $0.20 | $0.20 | High-volume cheap queries |
| Sonar Large Online | $1.00 | $1.00 | Balanced quality/cost |
| Sonar Huge Online | $5.00 | $5.00 | High-quality answers |
| Sonar Pro | $3.00 | $15.00 | Flagship · best reasoning |
On top of the token rates, Sonar / Sonar Pro / Sonar Reasoning Pro add a per-request fee for search context that scales with search depth. Good news for 2026: citation tokens are no longer billed on standard Sonar / Sonar Pro (they used to count against your token budget, a meaningful saving for citation-heavy workloads).
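For context on what that per-request fee attaches to, here's a minimal sketch of a Sonar call with an explicit search-context depth. It assumes the OpenAI-compatible chat completions endpoint, the `sonar-pro` model id, and the `web_search_options.search_context_size` field; verify those names against the current API reference before relying on them.

```python
# Hedged sketch: endpoint, model id, and field names are assumptions based on
# Perplexity's OpenAI-compatible API -- verify against the current docs.
import os
import httpx

resp = httpx.post(
    "https://api.perplexity.ai/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['PERPLEXITY_API_KEY']}"},
    json={
        "model": "sonar-pro",
        "messages": [{"role": "user", "content": "how does CGNAT work"}],
        # "low" / "medium" / "high" -- deeper search context costs more per request
        "web_search_options": {"search_context_size": "low"},
    },
    timeout=60,
)
data = resp.json()
print(data["choices"][0]["message"]["content"])
print(data.get("citations", []))  # citation URLs come back alongside the answer
```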
Beyond the Sonar models, Perplexity now exposes:
- Search API: raw web results without LLM synthesis at $5 per 1,000 requests flat. Good if you want to plug Perplexity's search into your own LLM pipeline.
- Agentic Research API: third-party model access (OpenAI, Anthropic, Google, xAI) at provider direct rates, with web-search tool calls billed at $0.005/invocation and URL fetches at $0.0005/invocation (rough fee math below).
- Sonar Reasoning Pro: for chain-of-thought research workloads.
- Sonar Deep Research: longer-horizon multi-step research tasks.
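To make the per-call fees concrete, here's a rough back-of-the-envelope comparison using only the list prices above. The 10K query volume and the 3-searches-plus-5-fetches mix per query are illustrative assumptions, not benchmarks.

```python
# Illustrative fee math from the list prices above. The query volume and the
# per-query tool-call mix are assumptions made for the sake of the comparison.
QUERIES = 10_000

search_api_fees = QUERIES * (5 / 1_000)                 # $5 per 1,000 requests, flat
agentic_tool_fees = QUERIES * (3 * 0.005 + 5 * 0.0005)  # web-search calls + URL fetches

print(f"Search API requests:     ${search_api_fees:,.2f}")    # $50.00 (bring your own LLM)
print(f"Agentic tool fees alone: ${agentic_tool_fees:,.2f}")  # $175.00, plus provider token costs
```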
Why self-host anyway?
With pricing this concrete, when does building your own pipeline still make sense? Four structural reasons:
- Volume math. Sonar Small at $0.40 per 1M tokens (input + output combined) is already cheap. But at 100K+ queries/month with reasoning models or large search context, costs add up. A custom pipeline (Pool Gateway proxies + Claude Sonnet/Haiku for the LLM step) can land in the $0.005-0.025 range per query; see the rough break-even sketch after this list.
- Model choice. Sonar wraps Perplexity's opinionated stack. The Agentic Research API gives you third-party model access, but with their search-tool fees layered on. Self-hosting lets you swap LLM providers per query type freely.
- Source control. Sonar decides which sources to consider. For domain-specific research (legal, medical, technical), you may want to constrain search to particular sites or sources; that's easy with a custom scraper, harder via Sonar.
- Vendor independence. Your prompts, fetch logic, and evaluation harness stay yours. If Perplexity changes pricing or model behavior, you're not locked in.
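Here's the break-even sketch referenced above. The 100K monthly volume is an illustrative assumption; the per-query figures are the estimates used in the cost table later in this post.

```python
# Monthly cost sketch at an assumed 100K queries/month. Per-query figures
# (~$0.022 Sonar Pro, $0.005-0.025 DIY) are the estimates used in this post.
monthly_queries = 100_000

sonar_pro = monthly_queries * 0.022
diy_low, diy_high = monthly_queries * 0.005, monthly_queries * 0.025

print(f"Sonar Pro: ~${sonar_pro:,.0f}/mo")                 # ~$2,200/mo
print(f"DIY range: ~${diy_low:,.0f}-${diy_high:,.0f}/mo")  # ~$500-$2,500/mo
```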
Architecture
```
User query
│
▼
┌─────────────────────────────────────────┐
│ 1. Search step │
│ Scrape Google / Bing / etc. for │
│ top N URLs related to the query │
│ (via Pool Gateway mobile proxies) │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ 2. Content fetch │
│ For each URL, fetch the page │
│ (also via mobile proxies — many │
│ sites block datacenter scrapers) │
│ Extract main content with Readability│
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ 3. LLM summarize │
│ Send query + sources to Claude/GPT │
│ Get back: summary + per-citation │
│ inline references │
└─────────────────────────────────────────┘
│
▼
Final response (matches Perplexity output shape)
```

Implementation
```bash
pip install httpx playwright readability-lxml anthropic
playwright install chromium
```

Step 1: search via Google SERP scraper
```python
import asyncio, os
from urllib.parse import quote_plus

from playwright.async_api import async_playwright

PSX_USER = os.environ['PSX_USERNAME']
PSX_PASS = os.environ['PSX_PASSWORD']


async def search_google(query: str, top_n: int = 5) -> list[str]:
    """Get top N URLs for a query."""
    url = f"https://www.google.com/search?q={quote_plus(query)}&hl=en&num=20"
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy={"server": "http://gw.proxies.sx:7000", "username": PSX_USER, "password": PSX_PASS},
        )
        page = await (await browser.new_context()).new_page()
        await page.goto(url, wait_until="networkidle", timeout=30000)
        await page.wait_for_timeout(1500)
        urls = await page.evaluate("""
            () => [...document.querySelectorAll('h3')]
                .map(h => h.closest('a'))
                .filter(a => a && a.href.startsWith('http'))
                .map(a => a.href)
                .filter(u => !u.includes('google.com'))
        """)
        await browser.close()
        return list(dict.fromkeys(urls))[:top_n]  # dedup, slice
```

Step 2: fetch and clean content
```python
import httpx
from readability import Document


async def fetch_clean(url: str) -> dict:
    """Fetch URL through mobile proxy, extract main content."""
    proxy_url = f"http://{PSX_USER}:{PSX_PASS}@gw.proxies.sx:7000"
    # httpx >= 0.28 only accepts the singular `proxy=` argument (the old `proxies=` was removed)
    async with httpx.AsyncClient(proxy=proxy_url, timeout=30, follow_redirects=True) as client:
        try:
            r = await client.get(url, headers={"User-Agent": "Mozilla/5.0 ..."})
            doc = Document(r.text)
            return {
                "url": url,
                "title": doc.title(),
                "content": doc.summary(html_partial=True)[:8000],  # cap context
            }
        except Exception as e:
            return {"url": url, "error": str(e)}
```

Step 3: summarize with Claude
```python
from anthropic import AsyncAnthropic

anthropic = AsyncAnthropic()  # async client so the API call doesn't block the event loop


async def ai_search(query: str, top_n: int = 5) -> dict:
    urls = await search_google(query, top_n)
    sources = await asyncio.gather(*[fetch_clean(u) for u in urls])
    sources = [s for s in sources if "content" in s]  # drop failed fetches

    # Build a prompt with numbered citations
    citations_block = "\n\n".join(
        f"[{i+1}] {s['title']}\nURL: {s['url']}\n{s['content']}"
        for i, s in enumerate(sources)
    )

    msg = await anthropic.messages.create(
        model="claude-sonnet-4-5",  # swap in Haiku for the high-volume tier
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"""Answer the user's question using the provided sources. Cite sources inline as [1], [2], etc.

QUESTION: {query}

SOURCES:
{citations_block}

ANSWER (with inline citations):"""
        }]
    )
    return {
        "query": query,
        "answer": msg.content[0].text,
        "sources": [{"index": i + 1, "url": s["url"], "title": s["title"]} for i, s in enumerate(sources)],
    }


# Use it
result = asyncio.run(ai_search("how does CGNAT work"))
print(result)
```

Cost comparison vs Sonar Pro
Approximate per-query cost for a typical research query (5 source pages, ~5K tokens of context, ~500 tokens out):
| Stack | Per-query cost | Notes |
|---|---|---|
| Sonar Small Online | ~$0.0011 | Cheapest Sonar SKU · weakest answers |
| Sonar Large Online | ~$0.0055 | Decent quality |
| Sonar Pro | ~$0.022 | Plus per-request fee for search context |
| DIY (Pool Gateway + Haiku) | ~$0.003-0.008 | Mobile proxies + Anthropic Claude Haiku |
| DIY (Pool Gateway + Sonnet) | ~$0.012-0.025 | Mobile proxies + Claude Sonnet |
DIY math: ~5MB of mobile-proxy traffic per query (SERP scrape + 5 fetches) at shared-tier per-GB pricing is <1¢. The LLM call dominates the cost. Pick Haiku for high-volume cost-sensitive workloads, Sonnet for balanced quality.
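For transparency, here's roughly how those DIY per-query figures decompose. The proxy per-GB rate and the Claude per-token prices below are placeholder assumptions; substitute your actual contracted rates.

```python
# Per-query DIY cost decomposition. PROXY_PER_GB and the per-MTok prices are
# placeholder assumptions -- check your proxy plan and current Anthropic pricing.
PROXY_PER_GB = 0.50                   # assumed shared-tier mobile proxy rate, $/GB
HAIKU_IN, HAIKU_OUT = 0.80, 4.00      # assumed $/1M tokens
SONNET_IN, SONNET_OUT = 3.00, 15.00   # assumed $/1M tokens

TRAFFIC_GB = 5 / 1024                 # ~5MB per query: SERP scrape + 5 page fetches
TOKENS_IN, TOKENS_OUT = 5_000, 500    # query profile from the table above

def per_query(price_in: float, price_out: float) -> float:
    llm = TOKENS_IN / 1e6 * price_in + TOKENS_OUT / 1e6 * price_out
    return TRAFFIC_GB * PROXY_PER_GB + llm

print(f"DIY + Haiku:  ~${per_query(HAIKU_IN, HAIKU_OUT):.4f}/query")    # ~$0.008
print(f"DIY + Sonnet: ~${per_query(SONNET_IN, SONNET_OUT):.4f}/query")  # ~$0.025
```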
The verdict: Sonar Small is cheap but quality-limited. Sonar Pro's pricing is close to a DIY-Sonnet stack while being more constrained. For prototypes, Sonar wins on speed-of-setup. For production at moderate-to-high volume, DIY wins on cost + flexibility.
Tradeoffs
Self-hosting is great when you need flexibility and cost control. It's worse when you need:
- Latency under 2 seconds. Perplexity runs aggressive caching and indexing. A custom scrape-then-summarize pipeline typically takes 5-15 seconds end-to-end, depending on source count and LLM.
- Zero-ops onboarding. You maintain the scraper, handle SERP markup changes, and monitor proxy health. Perplexity does all of this for you.
- Realtime news. Perplexity has tuned freshness signals. Your custom pipeline depends on what Google has indexed.
For most B2B AI search use cases — research dashboards, internal knowledge tools, monitoring agents — self-hosting wins on cost and flexibility. For consumer-facing latency-sensitive apps, stick with the commercial API.