Your AI agent fails (bad software layer, not bad model)


May 29, 2026

The LLM is not the bottleneck. The software layer (the code around it) is. When that code is weak, the agent collapses.

OpenClaw Team · Engineering & Product Team

The OpenClaw Team is made up of engineers, designers, and AI specialists dedicated to building the best conversational-agent platform for Brazilian businesses. We combine expertise…

You have a SaaS.

Your SaaS: an AI agent on WhatsApp (customer support).

You launched the agent (using Claude/GPT).

The agent works:

Customer sends a message
Agent responds (OK)
Customer is happy (OK)

BUT:

Some cases break:

Customer asks a complex question → agent hallucinates (WTF?)
Customer returns to an earlier topic → agent forgets (broken memory?)
Agent tries to use the Salesforce integration → error (integration broken?)
Agent answers something that should have been escalated → wrong decision (no boundaries?)
Agent sends a response without checking it → invalid response (no testing?)

You think:

"Is the model bad? Claude is a good model.

Why does the agent break?

Should I switch to GPT-4?

Should I train a custom model?

Why doesn't Claude work here?"

Answer:

IT'S NOT THE MODEL.

IT'S THE CODE (the software layer) around the model.

A recent review paper (2026):

"Code is how the agent THINKS and ACTS.

It's not just output.

Software layer (tools, memory, testing, permissions) = agent.

Model alone = stateless (no context, no memory, no tools).

Model + software layer = working agent."

Example:

MODEL ALONE (Claude):

Input: "Customer has a problem with order #123"
Output: "Sorry, I'll check your order."
Done (stateless, no tools, no memory, no context)

BUT:

The agent can't ACTUALLY access order #123 (no tools)
The agent can't STORE context (no memory)
The agent can't DECIDE whether to respond or escalate (no permissions)
The agent can't TEST the response before sending it (no testing)

RESULT: FAKE agent (it responds, but it doesn't work)

MODEL + SOFTWARE LAYER (Code):

Input: "Customer has a problem with order #123"
Code layer: "Fetch order #123 through the Salesforce integration"
Model: "Based on the order, the cause is X"
Code layer: "Test the response (is it valid?)"
Code layer: "Store context (memory)"
Code layer: "Check permission (respond or escalate?)"
Output: "Your order #123 has problem X, here is the solution..."

RESULT: REAL agent (it responds AND it works)
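
The flow above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `Harness`, `fake_model`, the `orders` dict (standing in for the Salesforce integration), and the `"ESCALATE"` marker are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Harness:
    """Hypothetical harness: every name here is illustrative, not a real API."""
    orders: dict                                  # stand-in for the Salesforce integration
    history: list = field(default_factory=list)   # stand-in for the memory layer

    def fetch_order(self, order_id):
        return self.orders.get(order_id)

    def is_valid(self, reply, order):
        # Minimal check: the reply must mention the order's known issue
        return order is not None and order["issue"] in reply

    def handle(self, order_id, model):
        order = self.fetch_order(order_id)       # tools
        reply = model(order) if order else ""    # model
        ok = self.is_valid(reply, order)         # testing
        self.history.append((order_id, reply))   # memory
        if not ok:                               # permissions: hand off instead of guessing
            return "ESCALATE"
        return reply


def fake_model(order):
    # Stand-in for an LLM call; a real one would go through a provider SDK
    return f"Your order has this problem: {order['issue']}."


harness = Harness(orders={"123": {"issue": "delayed shipment"}})
print(harness.handle("123", fake_model))  # known order: the model's reply passes the check
print(harness.handle("999", fake_model))  # unknown order: nothing to verify, so escalate
```

The model call is a single line; everything else in `handle` is the software layer.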

The problem (weak software layer = weak agent)

Myth: Agent fails = bad model

ASSUMPTION:

"My agent fails in some cases.

Therefore, the model is bad.

I'll switch to GPT-4 (a better model)."

REALITY:

In the vast majority of cases, the agent fails because the SOFTWARE LAYER is weak, not because the model is weak.

EXAMPLES:

Agent hallucinates (invents information)?

Myth: "The model is bad (hallucinations)"
Reality: "The software layer didn't test the response before sending it (missing testing layer)"
Fix: Add testing (validate before responding)

Agent forgets context?

Myth: "The model has a short memory"
Reality: "The software layer doesn't store memory (missing memory layer)"
Fix: Add memory (store conversation history)

Agent can't access data?

Myth: "The model doesn't know how to access data"
Reality: "The software layer has no tools (missing tools layer)"
Fix: Add tools (API integrations)

Agent answers things it should escalate?

Myth: "The model doesn't know when to escalate"
Reality: "The software layer has no permissions (missing permissions layer)"
Fix: Add boundaries (when to escalate)

RESULT: The problem isn't the model, it's the software layer.

The 4 layers of the software layer (how the agent works)

LAYER 1: TOOLS (how the agent ACCESSES data)

Definition:

Tools = integrations with external APIs
Tool = a plugin the agent can call
No tools = the agent is just chat (it can't do anything real)

Example:

Without tools:

Customer: "What's my balance?"
Agent: "Sorry, I couldn't check your balance." (useless)

With tools:

Customer: "What's my balance?"
Agent: [CALLS TOOL: get_balance(customer_id=123)] → R$ 5,000
Agent: "Your balance is R$ 5,000." (useful)

Common tools:

API integrations (Salesforce, HubSpot, Stripe)
Database queries (fetch customer data)
Search tools (Google, Docs, knowledge base)
Calculator tools (arithmetic)
File tools (read/write files)

RISK: No tools → useless agent (it can only talk)
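
One way the harness might turn the model's `get_balance` request into a real function call is a small registry with a dispatcher. The sketch below is an assumption about how such a layer could look: `TOOLS`, `tool`, `call_tool`, and the hard-coded balance table are hypothetical, and the request dict stands in for whatever structured tool call your model provider returns.

```python
TOOLS = {}


def tool(fn):
    """Register a function under its own name so the harness can dispatch to it."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_balance(customer_id):
    # Hypothetical lookup; a real tool would call a billing API or database
    balances = {123: 5000}
    return balances[customer_id]


def call_tool(name, **kwargs):
    # Refuse anything the model asks for that was never registered
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)


# The model's tool request, already parsed into a name and arguments
request = {"name": "get_balance", "args": {"customer_id": 123}}
balance = call_tool(request["name"], **request["args"])
print(f"Your balance is R$ {balance:,}.")
```

Because the dispatcher only runs registered functions, a tool name the model invents fails loudly instead of silently doing nothing.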

LAYER 2: MEMORY (how the agent REMEMBERS context)

Definition:

Memory = storing the conversation history
Memory = context (the agent knows what was said before)
No memory = every message is isolated (no context)

Example:

Without memory:

Customer (msg 1): "I want to change my password"
Agent (msg 1): "OK, what is your current password?"
Customer (msg 2): "123456"
Agent (msg 2): "What is your password?" (broken: it lost the context of msg 1)

With memory:

Customer (msg 1): "I want to change my password"
Agent (msg 1): "OK, what is your current password?"
[MEMORY: "Customer wants to change password"]
Customer (msg 2): "123456"
Agent (msg 2): "OK, current password verified. What is your new password?" (context aware)

Memory types:

Short-term: Current conversation (last 10 messages)
Long-term: Historical (conversations from 6 months ago)
Semantic: Summary ("Customer wants refund for order #123")

RISK: No memory → the agent looks dumb (it doesn't remember context)

LAYER 3: TESTING (how the agent VALIDATES responses)

Definition:

Testing = checking the response before sending it
Testing = quality gate (is the response correct?)
No testing = the agent sends bad responses (hallucinations)

Example:

Without testing:

Customer: "How much does the Pro plan cost?"
Model: "The Pro plan costs R$ 5,000,000/month" (wrong!)
Agent: SENDS (no validation) → Customer sees the wrong price (bad)

With testing:

Customer: "How much does the Pro plan cost?"
Model: "The Pro plan costs R$ 5,000,000/month" (wrong!)
[TEST]: "Is this price in our price list?" → NO
[TEST]: "Is the price unreasonable (R$ 5M?)" → YES
Agent: "I'm not sure, let me verify." OR escalate (good)

Testing types:

Fact checking (is this true?)
Price validation (is the price in range?)
Data validation (is the response format correct?)
Sanity checks (does this make sense?)
Human review (escalate if uncertain)

RISK: No testing → the agent hallucinates (sends wrong info)

LAYER 4: PERMISSIONS (how the agent DECIDES what to do)

Definition:

Permissions = boundaries (what can the agent do?)
Permissions = when to escalate (when to hand off to a human?)
No permissions = the agent does anything (dangerous)

Example:

Without permissions:

Customer: "Delete my database"
Agent: "OK, deleting..." (DANGEROUS! No boundaries)

With permissions:

Customer: "Delete my database"
[PERMISSION CHECK]: "Can the agent delete a database?" → NO
[PERMISSION CHECK]: "Should it escalate to a human?" → YES
Agent: "This action requires human approval. I'm escalating to a specialist." (safe)

Permission types:

Read-only (the agent only reads data)
Limited write (the agent writes, but with confirmation)
Escalation rules (when to hand off to a human)
Rate limits (how many requests per minute?)
Data access (what data can the agent see?)

RISK: No permissions → dangerous agent (no boundaries)
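
Of the permission types listed above, rate limits are the easiest to show in isolation. The following is a minimal sliding-window sketch, assuming the caller passes in the current time; `RateLimiter` and its parameters are hypothetical names for this example, and in production `now` would typically come from `time.monotonic()`.

```python
from collections import deque


class RateLimiter:
    """Sliding-window limit: at most `max_calls` per `window` seconds."""

    def __init__(self, max_calls, window=60.0):
        self.max_calls = max_calls
        self.window = window
        self.calls = deque()  # timestamps of the calls that were allowed

    def allow(self, now):
        # Drop timestamps that have fallen out of the window
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            return False
        self.calls.append(now)
        return True


limiter = RateLimiter(max_calls=2, window=60.0)
results = [limiter.allow(t) for t in (0.0, 1.0, 2.0, 61.0)]
print(results)  # [True, True, False, True]
```

The third call is refused because two calls already sit inside the 60-second window; by the fourth, both have expired.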

DeepSeek formula (Model + Harness = Agent)

DEEPSEEK APPROACH (built a dedicated "Harness" team):

Formula:

AGENT = MODEL + HARNESS

Where:

MODEL = Claude/GPT (LLM)
HARNESS = Software layer (code, tools, memory, testing, permissions)

DEEPSEEK insight:

Model alone = useless (stateless, no tools, no memory)
Harness alone = useless (no intelligence)
Model + Harness = working agent

DEEPSEEK investment:

Hiring a dedicated "Harness" team in Beijing
Core focus: Build a best-in-class software layer
Result: Model (DeepSeek LLM) + Harness (code layer) = competitive agent

IMPLICATION:

Best agents are built by teams that focus on HARNESS (not just model)
If you only focus on model quality = competitive disadvantage
If you focus on model + harness = competitive advantage

Solution (build a strong software layer)

Step 1: AUDIT the software layer (identify gaps)

ACTION:

TOOLS (can the agent access data?)
- List all integrations (Salesforce, HubSpot, etc.)
- Test each integration (do they work?)
- Identify missing integrations (what data can't the agent access?)
- Result: Tools checklist
MEMORY (does the agent remember context?)
- Test conversation (can the agent reference previous messages?)
- Test long-term memory (can the agent remember 6 months ago?)
- Test semantic memory (does the agent summarize correctly?)
- Result: Memory checklist
TESTING (does the agent validate responses?)
- Test hallucination (can the agent catch wrong info?)
- Test validation (does the agent check facts?)
- Test quality gates (is the response acceptable?)
- Result: Testing checklist
PERMISSIONS (does the agent have boundaries?)
- Test escalation (when does the agent escalate to a human?)
- Test data access (what data can the agent see?)
- Test dangerous actions (can the agent delete data?)
- Result: Permissions checklist

OUTPUT: Identified gaps in the software layer

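Step 2: BUILD the harness (fix gaps)

FIX 1: TOOLS (integrations)

Gap: The agent can't access Salesforce
Fix:

```python
# `salesforce_client` and the schema are placeholders; real SOQL needs explicit field lists
class SalesforceTools:
    def get_customer(self, customer_id):
        cid = self._safe_id(customer_id)
        return salesforce_client.query(f"SELECT * FROM Customer WHERE id = '{cid}'")

    def get_orders(self, customer_id):
        cid = self._safe_id(customer_id)
        return salesforce_client.query(f"SELECT * FROM Order WHERE customer_id = '{cid}'")

    def create_ticket(self, customer_id, subject, description):
        return salesforce_client.create("Case", {
            "AccountId": customer_id,
            "Subject": subject,
            "Description": description,
        })

    @staticmethod
    def _safe_id(value):
        # Allow only letters and digits so user input can't alter the query
        value = str(value)
        if not value.isalnum():
            raise ValueError(f"invalid id: {value!r}")
        return value

# Agent uses tools
agent.register_tools(SalesforceTools())
```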

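FIX 2: MEMORY (context)

Gap: The agent forgets context
Fix:

```python
class ConversationMemory:
    def __init__(self):
        self.messages = []  # short-term
        self.summary = ""   # long-term

    def add_message(self, role, content):
        self.messages.append({"role": role, "content": content})
        # Keep last 20 messages (short-term)
        if len(self.messages) > 20:
            self.summarize()  # compress old messages

    def summarize(self):
        # Placeholder: fold the oldest messages into the summary as plain text.
        # A real harness would likely ask the model to write the summary.
        old, self.messages = self.messages[:-20], self.messages[-20:]
        self.summary += " ".join(m["content"] for m in old) + " "

    def get_context(self):
        return "Summary: " + self.summary + "\n" + "Recent: " + str(self.messages[-5:])

# Agent uses memory
agent.memory = ConversationMemory()
```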

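FIX 3: TESTING (validation)

Gap: The agent hallucinates (sends wrong info)
Fix:

```python
import re

def extract_price(response):
    # Parse the first "R$ <amount>" as whole reais (ignores cents); None if absent
    match = re.search(r"R\$\s*(\d[\d.,]*)", response)
    return int(re.sub(r"[.,]", "", match.group(1))) if match else None

class ResponseValidator:
    def __init__(self, valid_prices):
        self.valid_prices = set(valid_prices)

    def validate(self, response):
        # Fact check: a quoted price must be on the price list
        if "R$" in response and not self.is_valid_price(response):
            return False
        # Sanity check: destructive wording needs explicit confirmation
        if "delete" in response.lower() and not self.confirm_action():
            return False
        return True

    def is_valid_price(self, response):
        # Extract price, check against price list
        return extract_price(response) in self.valid_prices

    def confirm_action(self):
        # Placeholder: deny by default until a human confirms
        return False

# Agent uses validator (`agent` and `model` come from your own framework)
agent.validator = ResponseValidator(valid_prices=[99, 299])  # example price list
response = model.generate()
if not agent.validator.validate(response):
    response = "I'm not sure, let me verify with a specialist."
```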

FIX 4: PERMISSIONS (boundaries)

Gap: The agent doesn't know when to escalate
Fix:

```python
class PermissionManager:
    # Define permissions: which roles may run each action
    PERMISSIONS = {
        "read_customer": ["agent", "admin"],
        "create_ticket": ["agent", "admin"],
        "delete_customer": ["admin"],        # agent can't delete
        "refund": ["supervisor", "admin"],   # needs escalation
    }
    # Define escalation rules: names must match the action names above
    ESCALATE_ACTIONS = ["refund", "delete_customer", "bulk_action"]

    def can_execute(self, action, user_role):
        # Unknown actions are denied by default
        return user_role in self.PERMISSIONS.get(action, [])

    def should_escalate(self, action):
        return action in self.ESCALATE_ACTIONS

# Agent checks permissions
if not agent.permission_manager.can_execute(action, "agent"):
    if agent.permission_manager.should_escalate(action):
        agent.escalate_to_human(action)
    else:
        agent.respond("Sorry, I can't do that.")
```

OUTPUT: Built software layer (harness)

Step 3: MONITOR the harness (keep it working)

METRICS:

TOOLS health
- Integration success rate (% of API calls that succeed)
- Integration latency (how fast?)
- Integration errors (what's breaking?)
MEMORY quality
- Context recall (does the agent remember?)
- Summary accuracy (are summaries correct?)
- Memory size (is memory growing too big?)
TESTING effectiveness
- Hallucination rate (how many wrong responses?)
- Validation success rate (how many responses validated?)
- False positive rate (how many valid responses rejected?)
PERMISSIONS compliance
- Escalation rate (how many escalations?)
- Permission violations (any unauthorized actions?)
- Boundary violations (did the agent exceed its permissions?)

ACTION:

Track metrics weekly
Alert if metrics degrade
Fix issues immediately
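
A few of these metrics reduce to simple ratios over weekly counters, and "alert if metrics degrade" then becomes a threshold check. The sketch below assumes you already collect the counts; `WeeklyCounts`, `harness_metrics`, `alerts`, and every threshold value are hypothetical and chosen only for illustration. It also assumes non-zero denominators.

```python
from dataclasses import dataclass


@dataclass
class WeeklyCounts:
    api_calls: int
    api_failures: int
    responses: int
    wrong_responses: int   # flagged as hallucinations on review
    valid_rejected: int    # correct responses the validator blocked


def harness_metrics(c):
    valid_responses = c.responses - c.wrong_responses
    return {
        "integration_success_rate": 1 - c.api_failures / c.api_calls,
        "hallucination_rate": c.wrong_responses / c.responses,
        "false_positive_rate": c.valid_rejected / valid_responses,
    }


# Illustrative thresholds only: "min" means alert below, "max" means alert above
THRESHOLDS = {
    "integration_success_rate": ("min", 0.99),
    "hallucination_rate": ("max", 0.02),
    "false_positive_rate": ("max", 0.05),
}


def alerts(metrics):
    triggered = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = metrics[name]
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            triggered.append(name)
    return triggered


week = WeeklyCounts(api_calls=1000, api_failures=30, responses=500,
                    wrong_responses=5, valid_rejected=10)
print(alerts(harness_metrics(week)))  # ['integration_success_rate']
```

Here only the integration success rate (97%) crosses its threshold, which points at the tools layer rather than the model.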

Conclusion: Software layer = the real bottleneck (not the model)

What you need to know:

The LLM is not the bottleneck (the software layer is)
- Recent review paper: "Code is how the agent thinks and acts"
- Model alone = stateless (no tools, memory, testing, permissions)
- Model + software layer = working agent
- DeepSeek built a dedicated "Harness" team (focused on the software layer)
The 4 layers of the software layer
- Tools: How the agent accesses data (API integrations)
- Memory: How the agent remembers context (conversation history)
- Testing: How the agent validates responses (quality gates)
- Permissions: How the agent decides what to do (boundaries)
Agent fails = weak software layer (not a weak model)
- Hallucinates? Weak testing layer (no validation)
- Forgets context? Weak memory layer (no history)
- Can't access data? Weak tools layer (no integration)
- Takes dangerous actions? Weak permissions layer (no boundaries)
DeepSeek formula: Model + Harness = Agent
- Model = LLM (Claude, GPT, DeepSeek)
- Harness = Software layer (code, tools, memory, testing, permissions)
- Best agents = best harness (not just best model)
Immediate action
- AUDIT: What's missing in your software layer?
- BUILD: Add tools, memory, testing, permissions
- MONITOR: Track harness health (metrics, alerts)

At OpenClaw, we help AI agent startups:

AUDIT the software layer (identify gaps in tools, memory, testing, permissions)
BUILD the harness (tools, memory, testing, permissions)
INTEGRATE APIs (Salesforce, HubSpot, Stripe, custom)
IMPLEMENT memory (short-term, long-term, semantic)
CREATE a testing layer (fact checking, validation, sanity checks)
DEFINE permissions (boundaries, escalation rules, data access)
MONITOR the harness (health metrics, alerts, debugging)

Result: Your agent is STRONG (a built software layer, not just a model) + RELIABLE (testing, memory, tools) + SAFE (permissions, boundaries).

Does your agent fail because the code is bad (weak software layer)?

Or is your agent BUILT (model + harness = working agent)?

Build your harness now →

Published on May 29, 2026
