Claude Code has hidden config (your agent is expensive/slow)
News
5 min read
May 29, 2026

The basic setup in the Claude Code docs doesn't show every tuning option. If your agent runs on that basic config, it is probably slower and more expensive than it needs to be.

OpenClaw Team · Engineering & Product Team

The OpenClaw Team is made up of engineers, designers, and AI specialists dedicated to building the best conversational agent platform for Brazilian businesses. We combine expertise…


You have a SaaS.

Your SaaS has an AI agent built on Claude (Anthropic).

You launched the agent (using Claude Code).

You followed the official docs (from Anthropic):

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Hello"}
    ],
)

The agent works (the answers are correct).

BUT:

You notice a problem:

  • The agent is SLOW (10+ seconds for a simple answer)
  • The agent is EXPENSIVE (the monthly bill is high)
  • The agent uses A LOT of tokens (why?)

You think:

"Is Claude slow?

Is Claude expensive?

Should I switch to GPT?"

BUT:

News (May 2026):

"Claude Code has HIDDEN SETTINGS that the documentation doesn't surface.

Anthropic's official docs do NOT show every setting.

Agents on default settings = SLOW + EXPENSIVE.

Tuned agents = FAST + CHEAP."

You see the news.

You think:

"WTF? Claude Code has hidden settings?

Is my agent running on bad defaults?

How did I not know?"

The answer:

Your agent probably IS running on a poor config.

The official quickstart (Anthropic) shows the BASIC setup.

It does NOT walk you through the TUNING options (performance, cost optimization); those are covered in separate guides.

Running the agent on the basic setup = INEFFICIENT.

You pay EXTRA (too many tokens, unnecessary requests).

You suffer LATENCY (slow agent, annoyed customers).


The problem (default Claude Code config = slow + expensive)

Official Anthropic docs (basic, not optimized)

THE DOCS SHOW:

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "What is 2+2?"}
    ],
)

PROBLEM:

  1. max_tokens=1024 (a loose cap for a simple question)

    • For a simple question (2+2), Claude needs ~5 tokens
    • max_tokens is a ceiling, not a charge: you are billed for the tokens actually generated, but any response is allowed to run up to 1024 tokens
    • Monthly worst case: 1000 requests * 1024 tokens = ~1M output tokens (EXPENSIVE)
  2. No caching (not shown in the basic example)

    • Every request processed from scratch
    • Same system prompt repeated N times
    • No reuse of cached context
    • Monthly: 1000 requests * 100 tokens (system prompt) = 100k tokens (WASTEFUL)
  3. No streaming (not shown in the basic example)

    • Full response buffered in memory
    • No incremental responses (slow perceived latency)
    • Memory waste (large responses)
    • Customer waits longer (bad UX)
  4. No batching (not shown in the basic example)

    • Each request is independent
    • No bulk processing optimization
    • API overhead per request
    • Slower processing (no parallelization)
  5. No budget constraints (not shown in the basic example)

    • No application-level rate limiting
    • No cost caps
    • Runaway costs if the agent goes haywire
    • Bill surprise (month-end shock)

RESULT:

Your agent is SLOW (no streaming, no batching)
Your agent is EXPENSIVE (loose token caps, no caching)
Your bill is HIGH (no budget constraints)
Your customer is UNHAPPY (slow responses, unreliable)

A competitor with an optimized Claude Code config (fast + cheap)

COMPETITOR USES:

from anthropic import Anthropic

client = Anthropic(
    timeout=30.0,   # Fail fast instead of waiting on the SDK's much longer default timeout
    max_retries=2,  # Retry transient errors (2 is also the SDK default)
)

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=100,  # Task-specific cap instead of 1024
    stream=True,     # Stream tokens as they are generated
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant",
            "cache_control": {"type": "ephemeral"},  # Mark the system prompt as cacheable
        }
    ],
    messages=[
        {"role": "user", "content": "What is 2+2?"}
    ],
)

# Stream the response (don't buffer)
for event in response:
    if event.type == "content_block_delta":
        print(event.delta.text, end="", flush=True)

BENEFIT:

  1. max_tokens=100 (task-specific cap)

    • For 2+2, Claude uses ~5 tokens either way; the cap limits how far any response can run
    • Monthly worst case: 1000 requests * 100 tokens = 100k tokens (vs ~1M with a 1024 cap)
    • SAVINGS: up to ~90% on worst-case output spend
  2. Cache system prompt (cache_control)

    • System prompt cached (first request: 100 tokens at full price)
    • Subsequent requests: ~90% cheaper for the system prompt
    • Monthly: 1 * 100 tokens + 999 * 10 tokens = ~10k tokens (vs 100k)
    • SAVINGS: ~90k tokens/month
  3. Streaming (stream=True)

    • Response streamed incrementally (no buffering)
    • Customer sees a partial response immediately (better UX)
    • Lower memory footprint
    • Perceived latency: ~50% lower (estimate)
  4. Batching (send multiple requests in parallel)

    • Send 10 requests at once (vs one-by-one)
    • API processes them in parallel
    • Network overhead amortized
    • Throughput: up to 10x higher (API rate limits permitting)
  5. Budget constraints (rate limiting)

    • Max tokens/day (e.g., 10M tokens)
    • Alerts when approaching the limit
    • Kill switch if exceeded
    • Prevents runaway costs

RESULT:

Competitor's agent is FAST (streaming, batching)
Competitor's agent is CHEAP (tight token caps, caching)
Competitor's bill is LOW (budget constraints, no waste)
Competitor's customer is HAPPY (fast responses, reliable)

COMPARISON (illustrative figures):

Your agent: 1M tokens/month, 10s latency, $50/month
Competitor: 50k tokens/month, 2s latency, $2.50/month
Difference: 20x more expensive, 5x slower

The solution (optimizing Claude Code config = fast + cheap)

Hidden config 1: max_tokens (token waste)

PROBLEM:

Quickstart value: max_tokens=1024 (very generous; the parameter is required, so this is a choice, not a true default)

Example: Customer asks "What is your name?"
Claude needs: ~10 tokens
Your cap allows: 1024 tokens (1014 tokens of headroom the task never needs)

Monthly impact (worst case, every response runs to the cap):
1000 requests/month * 1024 tokens = 1.024M tokens
1000 requests/month * 10 tokens (optimal) = 10k tokens
Exposure: up to 1.014M tokens/month

At Claude pricing (~$0.01 per 1k tokens):
Worst case: $10.24/month
Optimal: $0.10/month
Avoidable cost: up to $10.14/month (~99%)

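SOLUTION:

Set max_tokens to the smallest value that fits the task:

  • Simple FAQ: max_tokens=50
  • Moderate response: max_tokens=200
  • Complex analysis: max_tokens=500
  • Avoid: a blanket max_tokens=1024 for every request

If a response comes back with stop_reason "max_tokens", the answer was cut off and the cap is too tight for that task.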

IMPACT:

Monthly worst case: 1000 requests * 100 tokens (average cap) = 100k tokens
Cost: at most $1/month (vs up to $10.24 with a 1024 cap)
Savings: ~90% reduction in worst-case output spend
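The tiering above can be sketched as a small lookup plus a worst-case cost bound. This is a minimal illustration, not an Anthropic API: the tier names, `pick_max_tokens`, and `worst_case_output_cost` are hypothetical, and the price of $0.01 per 1k tokens is the article's rough figure.

```python
# Hypothetical task tiers; the caps mirror the values suggested above.
MAX_TOKENS_BY_TASK = {"faq": 50, "moderate": 200, "complex": 500}


def pick_max_tokens(task_type: str, fallback: int = 200) -> int:
    """Return the output cap for a task tier, using a middle value for unknown tiers."""
    return MAX_TOKENS_BY_TASK.get(task_type, fallback)


def worst_case_output_cost(requests: int, max_tokens: int, usd_per_1k: float = 0.01) -> float:
    """Upper bound on output spend if every response ran all the way to the cap."""
    return requests * max_tokens * usd_per_1k / 1000
```

The bound is deliberately pessimistic: actual spend depends on the tokens generated, so the cap controls exposure rather than the typical bill.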

Hidden config 2: caching (repeated work)

PROBLEM:

Default: No caching (every request re-processes everything)

Example: System prompt

Every request includes: "You are a helpful customer support agent. Your job is to..." (100 tokens)

Monthly: 1000 requests * 100 tokens = 100k tokens (system prompt alone!)

SOLUTION:

Enable prompt caching:

from anthropic import Anthropic

client = Anthropic()

system_prompt = "You are a helpful customer support agent..."

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=100,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"}  # ENABLE CACHE
        }
    ],
    messages=[...]
)

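IMPACT:

First request: 100 tokens at full price (system prompt)
Subsequent 999 requests: ~10 tokens each in cost terms (cached, ~90% cheaper)
Total monthly: 100 + (999 * 10) = ~10k tokens
Waste reduction: 90% (100k → 10k)
Savings: $0.90/month per 1000 requests

For a large SaaS (100k requests/month):
Savings: $90/month just from caching
Annual: $1080/year

Caveat: caching only applies above a minimum prompt length (on the order of 1,000 tokens for Sonnet-class models), and ephemeral cache entries expire after a few minutes without a hit. Treat the 100-token figure as shorthand; the savings show up with longer prompts that are reused frequently.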
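The caching arithmetic can be checked with a short calculation. This is a simplified model under stated assumptions: one full-price request followed by reads at a 10% rate, ignoring any cache-write surcharge, cache expiry, and minimum prompt length. The function names are hypothetical.

```python
def cached_prompt_tokens(requests: int, prompt_tokens: int, read_rate: float = 0.10) -> float:
    """Billable-equivalent prompt tokens: first request at full price, the rest at read_rate."""
    if requests <= 0:
        return 0.0
    return prompt_tokens + (requests - 1) * prompt_tokens * read_rate


def caching_savings_ratio(requests: int, prompt_tokens: int, read_rate: float = 0.10) -> float:
    """Fraction of prompt-token cost avoided compared with no caching at all."""
    uncached = requests * prompt_tokens
    if uncached <= 0:
        return 0.0
    return 1 - cached_prompt_tokens(requests, prompt_tokens, read_rate) / uncached
```

With 1000 requests and a 100-token prompt this lands close to the 10k-token, 90% figures quoted in the text.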

Hidden config 3: streaming (latency)

PROBLEM:

Default: No streaming (buffer entire response)

Example: Customer asks question

Without streaming:

  1. Claude processes (5 seconds)
  2. Full response buffered in memory (2 seconds)
  3. Response sent to customer (1 second)
  4. Customer receives (8 seconds total, bad UX)

SOLUTION:

Enable streaming:

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=100,
    stream=True,  # ENABLE STREAMING
    messages=[...]
)

# Stream response incrementally
for event in response:
    if event.type == 'content_block_delta':
        print(event.delta.text, end='', flush=True)  # Print as received

IMPACT:

With streaming:

  1. Claude processes (5 seconds)
  2. First tokens streamed immediately (0.5 seconds)
  3. Subsequent tokens streamed (0.5 seconds)
  4. Customer receives (5.5 seconds total, better UX)

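Perceived latency: 5.5s (streaming) vs 8s (buffering)
Improvement: ~31% faster
Customer experience: MUCH better (sees a partial response immediately)

Streaming does not make generation itself faster; it removes the wait before the first text appears.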
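A toy model makes the distinction between time to first text and total time concrete. It assumes every chunk takes the same time to generate and ignores network overhead; `simulate_delivery` is a hypothetical helper, not part of any SDK.

```python
def simulate_delivery(num_chunks: int, seconds_per_chunk: float, streaming: bool):
    """Return (seconds until the user first sees text, seconds until the response is complete)."""
    total = num_chunks * seconds_per_chunk
    # With streaming, the first chunk is shown as soon as it exists;
    # with buffering, nothing is shown until the whole response is done.
    first_visible = seconds_per_chunk if streaming else total
    return first_visible, total
```

For 10 chunks at 0.5 seconds each, both modes finish at the same moment, but the streamed version shows text after the first chunk while the buffered version shows nothing until the end.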

Hidden config 4: batching (throughput)

PROBLEM:

Default: Process requests sequentially (slow throughput)

Example: 100 customer requests

Sequential:

  1. Send request 1 (wait for response: 5 sec)
  2. Send request 2 (wait for response: 5 sec)
  3. Send request 3 (wait for response: 5 sec)
  ...
  100. Send request 100 (wait for response: 5 sec)

Total time: 100 * 5 = 500 seconds (about 8 minutes)

SOLUTION:

Batch requests (send multiple at once):

import concurrent.futures

from anthropic import Anthropic

client = Anthropic()

requests = [...]  # List of 100 requests

def process_request(req):
    # One blocking API call per request
    return client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        messages=req['messages'],
    )

# Process requests in parallel (batched), at most 10 in flight
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(process_request, requests))

IMPACT:

Parallel (10 workers):

  1. Send requests 1-10 (in parallel)
  2. Wait for first to complete (5 sec)
  3. Send requests 11-20 (in parallel, others still processing)
  4. Continue...

Total time: ~50 seconds (vs 500 seconds sequential)
Speedup: ~10x faster
Throughput: ~10x more requests handled in the same wall-clock time (API rate limits permitting)
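The speedup estimate can be reproduced with a small helper, alongside a generic runner that works with any callable. This is a sketch under idealized assumptions (identical request latency, no rate limiting); `estimated_wall_time` and `run_parallel` are hypothetical names.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def estimated_wall_time(n_requests: int, workers: int, seconds_per_request: float) -> float:
    """Idealized wall-clock time: requests complete in waves of `workers` at a time."""
    return math.ceil(n_requests / workers) * seconds_per_request


def run_parallel(call, payloads, workers: int = 10):
    """Apply `call` to each payload on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(call, payloads))
```

Passing a stub such as `lambda p: p * 2` in place of the real API call is a cheap way to test the surrounding plumbing before spending tokens.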

Hidden config 5: budget constraints (runaway costs)

PROBLEM:

Default: No rate limiting (runaway costs possible)

Example: an agent bug sends 1M requests in 1 hour

Without constraints:

  1. Bug triggers (malicious user? code error?)
  2. Agent sends 1M requests
  3. Each request: up to 1024 tokens
  4. Total: 1M * 1024 = ~1B tokens
  5. Cost: 1B * $0.01/1k = ~$10,000 (one hour!)
  6. You receive the bill: $10,000 surprise
  7. Budget exhausted: the SaaS credit card is declined

SOLUTION:

Implement budget constraints:

from datetime import datetime, timedelta

class BudgetExceededError(Exception):
    """Raised when a request would push usage past the daily token limit."""

class BudgetController:
    def __init__(self, daily_limit=10_000_000):  # 10M tokens/day
        self.limit = daily_limit
        self.used = 0
        self.reset_time = datetime.now() + timedelta(days=1)

    def check_budget(self, tokens_needed):
        # Start a fresh window once the previous day has elapsed
        if datetime.now() > self.reset_time:
            self.used = 0
            self.reset_time = datetime.now() + timedelta(days=1)

        # Refuse the request before any tokens are spent
        if self.used + tokens_needed > self.limit:
            raise BudgetExceededError("Daily token limit exceeded")

        self.used += tokens_needed

    def process_request(self, request):
        # Rough estimate: character count of the first message plus headroom for the reply
        estimated_tokens = len(request['messages'][0]['content']) + 100

        self.check_budget(estimated_tokens)

        # Process request (client is assumed to be an Anthropic() instance created elsewhere)
        response = client.messages.create(...)
        return response

IMPACT:

With budget constraints:

  1. Bug triggers (1M requests attempted)
  2. About 9,765 requests get through (9,765 * 1024 ≈ 10M tokens)
  3. Budget check: 10M daily limit reached
  4. Further requests denied: "Daily token limit exceeded"
  5. Bug contained (no more requests that day)
  6. Cost: ~$100 (vs $10,000 without constraints)

Savings: ~$9,900 (disaster prevented)
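A quick back-of-envelope check shows how a daily cap bounds the damage in a runaway scenario. The helpers are hypothetical, and the price of $0.01 per 1k tokens is the article's rough figure rather than a quoted rate.

```python
def requests_until_cap(daily_limit: int, tokens_per_request: int) -> int:
    """How many whole requests fit under the daily token limit."""
    return daily_limit // tokens_per_request


def max_daily_spend(daily_limit: int, usd_per_1k: float = 0.01) -> float:
    """Largest possible daily bill if the full token limit is consumed."""
    return daily_limit * usd_per_1k / 1000
```

Calling these with your own limit and per-request size before picking a cap shows the worst-case bill you are signing up for on a bad day.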

3 signs your agent is running a BAD Claude Code config

Sign 1: The bill is high (wasteful tokens)

SYMPTOM:

You process: 1000 requests/month
You pay: $50+/month
Competitor processes: 1000 requests/month
Competitor pays: $2/month

PROBLEM:

You: 1000 * 1024 = 1.024M tokens (worst case with max_tokens=1024)
Competitor: 1000 * 100 = 100k tokens (worst case with a task-specific max_tokens)

Ratio: ~10x more exposure on output tokens (same volume)

SOLUTION:

Tighten max_tokens (smallest value that fits the task)
Enable caching (system prompt cached)
Monitor token usage (find waste)

Sign 2: Latency is high (no streaming)

SYMPTOM:

Customer asks: "What time is it?"
Customer waits: 8 seconds (buffering)
Customer thinks: "This agent is slow"
Customer leaves: Churn

Competitor:
Customer asks: "What time is it?"
Customer sees: Partial response in 1 second (streaming)
Customer thinks: "This agent is fast"
Customer stays: No churn

PROBLEM:

You: No streaming (buffer entire response)
Competitor: Streaming (send partial response immediately)

SOLUTION:

Enable streaming (stream=True)
Perceived latency: ~50% lower (estimate)
Customer experience: Better

Sign 3: Throughput is low (sequential processing)

SYMPTOM:

You: Process 10 requests sequentially

  • Request 1: 5 sec
  • Request 2: 5 sec
  • Request 3: 5 sec
  • ...
  • Request 10: 5 sec

Total: 50 seconds

Competitor: Process 10 requests in parallel (batching)

  • Requests 1-10: Send all at once
  • Wait for longest: 5 sec

Total: 5 seconds

Ratio: 10x faster (same requests)

SOLUTION:

Implement batching (ThreadPoolExecutor, async, etc.)
Process multiple requests in parallel
Throughput: up to 10x higher (same infrastructure)

How to optimize Claude Code config (step by step)

Step 1: Audit current config (find waste)

ACTION:

  1. Log all API calls

    • model used
    • max_tokens set
    • actual tokens used
    • response time
    • cost
  2. Analyze patterns

    • Average tokens per request (if >500, probably wasteful)
    • Response time distribution (if avg > 5 sec, probably no streaming)
    • Monthly cost (if > $10, probably suboptimal)
  3. Find opportunities

    • Requests that use <10% of max_tokens (reduce max_tokens)
    • Repeated system prompts (add caching)
    • Slow responses (add streaming)
    • Sequential requests (add batching)

RESULT: Identified optimization opportunities
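The audit can start as a few lines over whatever call log you already keep. The sketch below assumes a hypothetical `CallLog` record with the fields listed in the step above, and reuses the 10% and 5-second thresholds from the text; it expects at least one log entry.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CallLog:
    max_tokens: int      # cap sent with the request
    output_tokens: int   # tokens actually generated
    latency_s: float     # end-to-end response time in seconds


def audit(logs, low_use_ratio: float = 0.10, slow_s: float = 5.0) -> dict:
    """Summarize a non-empty list of CallLog entries and count likely optimization targets."""
    return {
        "calls": len(logs),
        "avg_output_tokens": mean([log.output_tokens for log in logs]),
        "avg_latency_s": mean([log.latency_s for log in logs]),
        # Calls that used under 10% of their cap: candidates for a tighter max_tokens
        "low_use_calls": sum(1 for log in logs if log.output_tokens < low_use_ratio * log.max_tokens),
        # Calls slower than the threshold: candidates for streaming
        "slow_calls": sum(1 for log in logs if log.latency_s > slow_s),
    }
```

A high `low_use_calls` share suggests the cap can come down; a high `slow_calls` share points at streaming or parallelism.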

Step 2: Implement optimizations (max_tokens, caching, streaming)

ACTION:

  1. Reduce max_tokens (task-specific)

    • Simple: 50 tokens
    • Moderate: 200 tokens
    • Complex: 500 tokens
    • Avoid: a blanket 1024 for every request
  2. Add caching (system prompt, context)

    • system=[{"type": "text", "text": ..., "cache_control": {"type": "ephemeral"}}]
    • Cache repeated contexts
    • 90% cheaper for cached tokens
  3. Enable streaming (incremental response)

    • stream=True
    • Process tokens as they arrive
    • Better perceived latency
  4. Add batching (parallel requests)

    • ThreadPoolExecutor / asyncio
    • Process multiple requests in parallel
    • 10x throughput (same workers)
  5. Add budget constraints (prevent runaway)

    • Daily token limit (e.g., 10M)
    • Rate limiting
    • Alert on budget exceeded

RESULT: Optimized Claude Code config
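For the asyncio route mentioned in the batching item, a semaphore keeps the number of in-flight calls bounded. This is a minimal sketch: `run_bounded` and `fake_call` are hypothetical, and `fake_call` stands in for a real async API call so the example runs without network access.

```python
import asyncio


async def run_bounded(call, payloads, limit: int = 10):
    """Await call(p) for every payload with at most `limit` in flight; results keep input order."""
    sem = asyncio.Semaphore(limit)

    async def one(payload):
        async with sem:
            return await call(payload)

    return await asyncio.gather(*(one(p) for p in payloads))


async def fake_call(payload: str) -> str:
    """Hypothetical stand-in for an async API call."""
    await asyncio.sleep(0)
    return payload.upper()


def demo():
    """Run three fake requests with at most two in flight at once."""
    return asyncio.run(run_bounded(fake_call, ["a", "b", "c"], limit=2))
```

Keeping `limit` well under your account's rate limits avoids trading sequential slowness for a burst of throttled requests.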

Step 3: Monitor + measure (cost reduction)

ACTION:

  1. Track metrics

    • Daily tokens used
    • Daily cost
    • Average latency
    • Throughput (requests/sec)
    • Budget remaining
  2. Compare before/after

    • Before: 1M tokens/month, $50/month, 8s latency
    • After: 100k tokens/month, $5/month, 2s latency
    • Improvement: 90% cheaper, 4x faster
  3. Alert on anomalies

    • Daily tokens spike (bug? runaway?)
    • Budget near limit (alert user)
    • Latency increased (degradation?)

RESULT: Optimized, monitored, controlled
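The anomaly alerts can begin as a single function evaluated once a day. The thresholds below (a spike at 3x the recent average, a warning at 80% of the budget) and the name `check_alerts` are illustrative assumptions, not recommendations from any vendor.

```python
def check_alerts(daily_tokens: int, daily_limit: int, history: list,
                 spike_factor: float = 3.0, near_limit: float = 0.8) -> list:
    """Return alert names for today's usage given recent daily token totals."""
    alerts = []
    if history:
        baseline = sum(history) / len(history)
        # Usage far above the recent average may indicate a bug or runaway loop
        if baseline > 0 and daily_tokens > spike_factor * baseline:
            alerts.append("token_spike")
    # Warn before the hard budget limit is actually hit
    if daily_tokens >= near_limit * daily_limit:
        alerts.append("budget_near_limit")
    return alerts
```

Routing the returned names to email or chat is left to whatever alerting you already run.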

Conclusion: hidden Claude Code config = your agent is expensive/slow

What you need to know:

  1. Official Anthropic quickstart (basic, not optimized)

    • max_tokens=1024 (loose cap, ~10x the worst-case output of a 100-token cap)
    • No caching (system prompt repeated every request)
    • No streaming (full response buffered)
    • No batching (sequential processing, slow)
    • No budget constraints (runaway costs possible)
  2. Hidden config (not shown in the basic example)

    • max_tokens (optimize to task needs)
    • cache_control (cache system prompt, contexts)
    • stream (enable streaming for faster perceived latency)
    • batching (parallel requests, 10x throughput)
    • budget limits (prevent runaway costs)
  3. Impact of optimization

    • Cost: 90% reduction (1M tokens → 100k tokens/month)
    • Latency: 50% reduction (8s → 4s perceived)
    • Throughput: 10x increase (sequential → parallel)
    • Budget: Controlled (no runaway costs)
  4. 3 signs your agent is running a BAD config

    • The bill is high (wasteful tokens, no optimization)
    • Latency is high (no streaming, everything buffered)
    • Throughput is low (sequential processing, no batching)
  5. How to optimize (3 steps)

    • Step 1: Audit current config (find waste)
    • Step 2: Implement optimizations (max_tokens, caching, streaming, batching)
    • Step 3: Monitor + measure (cost reduction, improvement validation)

At OpenClaw, we help AI agent startups:

  • AUDIT Claude Code config (find waste)
  • OPTIMIZE max_tokens (task-specific)
  • IMPLEMENT caching (system prompt cached)
  • ENABLE streaming (faster perceived latency)
  • ADD batching (10x throughput)
  • SET UP budget constraints (prevent runaway costs)
  • MONITOR metrics (cost, latency, throughput)

Result: Your agent is OPTIMIZED (cheap) + FAST (streaming) + SCALABLE (batching) + CONTROLLED (budget).

Is your agent running a BAD config (expensive, slow, uncontrolled)?

Or is your agent OPTIMIZED (cheap, fast, controlled)?

Optimize your Claude Code config now →

