Your AI agent picks the wrong tools (Amazon shows how to train it)
June 3, 2026

Your AI agent calls the wrong API, runs the wrong action, and fails tasks. Amazon: tool-calling accuracy is trainable (SFT + DPO).

OpenClaw Team · Engineering & Product Team

The OpenClaw Team is made up of engineers, designers, and AI specialists dedicated to building the best conversational-agent platform for Brazilian businesses. We combine expertise…


You have a SaaS.

Your SaaS has an AI agent (customer service, sales, support).

The agent has to handle multi-step tasks (look up information, calculate, execute an action).

How the agent works:

Customer request → Agent processes it → Agent chooses a TOOL

Example tools:

  • Lookup customer info (API: get_customer_by_id)
  • Check order status (API: get_order_by_id)
  • Process refund (API: process_refund)
  • Send email (API: send_email)
  • Update database (API: update_customer_data)

The agent has 5-10 tools available.

The agent has to choose the RIGHT tool for each request.

Good example:

Customer: "What's my order status?"

Agent:

  1. Recognizes: needs to check the order status
  2. Chooses tool: get_order_by_id ✓ (RIGHT)
  3. Calls API: get_order_by_id(order_id=456)
  4. Gets result: Order #456, status: shipped
  5. Responds: "Your order has shipped"

Result: ✓ Success
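To make the flow concrete, here is a minimal sketch of how a tool call proposed by the model might be dispatched. The function bodies are hypothetical stand-ins that return canned data, and the `order_id` parameter is an assumption, not a real API. The registry check also rejects tool names the model made up.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for the real APIs; each returns canned data.
def get_order_by_id(order_id: int) -> dict:
    return {"order_id": order_id, "status": "shipped"}

def process_refund(order_id: int) -> dict:
    return {"order_id": order_id, "refunded": True}

# Registry: the only tools the agent is allowed to call.
TOOLS: Dict[str, Callable[..., dict]] = {
    "get_order_by_id": get_order_by_id,
    "process_refund": process_refund,
}

def execute_tool_call(tool_name: str, arguments: dict) -> dict:
    """Run a model-proposed tool call, rejecting unknown tool names."""
    if tool_name not in TOOLS:
        raise ValueError(f"unknown tool: {tool_name}")
    return TOOLS[tool_name](**arguments)

# The model's output is assumed to be a tool name plus arguments.
result = execute_tool_call("get_order_by_id", {"order_id": 456})
print(result["status"])
```

A registry like this catches invented tool names, but it cannot catch a valid tool chosen for the wrong request. That second kind of error is what training is meant to reduce.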


Bad example:

Customer: "What's my order status?"

Agent:

  1. Recognizes: needs to... do something?
  2. Chooses tool: process_refund ✗ (WRONG)
  3. Tries to call: process_refund(customer_id=123)
  4. Error: makes no sense (the customer never asked for a refund)
  5. Result: failure, confusion, a support ticket

Or:

  1. Recognizes: needs to check the status
  2. Chooses tool: send_email ✗ (WRONG)
  3. Tries to call: send_email(message="What's your status?")
  4. Error: sent a useless email instead of checking the status
  5. Result: confusion, an angry customer

Reality: your agent chooses the WRONG TOOL frequently.

Why:

  • A generic LLM isn't trained for tool calling (it was trained on next-token prediction, not on "choose the right tool")
  • A generic LLM doesn't understand your specific tools (it lacks the context)
  • A generic LLM hallucinates (invents tools that don't exist)
  • A generic LLM guesses at random (when it isn't sure)

Result:

Tool-calling accuracy (% of the time the agent chooses the RIGHT tool):

Generic LLM (no training): 60-75% accuracy

  • 25-40% of requests: the agent chooses the WRONG tool
  • Each wrong tool choice: the customer sees an error, opens a support ticket, and leaves with a negative impression

Example over 1,000 requests:

  • 600-750 successes (the agent got it right)
  • 250-400 failures (the agent got it wrong)
  • Each failure: 1-2 minutes of support time
  • Total: 250-800 minutes of support = R$ 2K-8K in support cost (from tool-calling failures alone)

You don't notice the problem (the agent works "ok", and you assume occasional failures are normal).

Then comes the news:

"Amazon SageMaker AI: Improve agent's tool-calling accuracy with SFT and DPO."

"SFT (Supervised Fine-Tuning) + DPO (Direct Preference Optimization) = trainable tool-calling accuracy."

"Result: Tool-calling accuracy 85-95% (vs 60-75% generic LLM)."

You think:

"Wait, tool-calling accuracy is trainable?

My agent (60-75%) can be trained to 85-95%?

Competitors who fine-tuned their agents will have:

  • 85-95% tool-calling accuracy (vs my 60-75%)
  • Fewer failures (fewer support tickets)
  • Better customer experience (the agent works better)
  • Higher adoption (customers trust the agent more)

Will my agent (untrained, 60-75%) be outdated (unreliable, lots of failures, high support cost)?"

Yes. Yes. Yes. Yes.

Amazon just signaled: tool-calling accuracy is trainable (it is not an inherent limitation, it is a training problem).

Your untrained agent is now a capability liability (it can be improved, but you haven't invested).


THE PROBLEM: WRONG TOOL-CALLING DESTROYS USER EXPERIENCE

Problem 1: The customer sees the agent fail (broken trust)

SCENARIO: A customer support agent chooses the wrong tool

Customer: "I want to cancel my subscription"

Agent:

  1. Recognizes: needs to process a cancellation
  2. Chooses tool: WRONG (picks get_subscription_info instead of cancel_subscription)
  3. Tries to execute: returns info, doesn't cancel
  4. Agent gets confused: "I found your subscription... uh... wait" (hallucinates)
  5. Customer sees: the agent can't do a simple thing
  6. Customer thinks: "This agent is garbage. I don't trust it."

Result:

  • The customer stops using the agent
  • The customer calls human support (cost: R$ 50-100 per ticket)
  • The customer leaves with a negative impression ("The bot doesn't work")
  • Agent adoption drops

IMPACT SCALE:

If you have 10,000 customers making 100 requests/month each:

  • 1,000,000 requests/month

If tool-calling accuracy is 65%:

  • 350,000 wrong tool calls/month

If 10% of those failures result in a support ticket:

  • 35,000 support tickets/month
  • Each ticket: 5 minutes of human support = R$ 2-5
  • Total: R$ 70K-175K/month in support cost (from tool-calling failures alone)

If tool-calling accuracy were 90% (trained), with the same 10% ticket rate:

  • 100,000 wrong tool calls/month
  • 10,000 support tickets/month
  • Total: R$ 20K-50K/month in support cost

Savings: R$ 50K-125K/month (by improving tool-calling accuracy 65% → 90%)
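The estimate above is plain arithmetic, so it is easy to rerun with your own numbers. A small sketch, assuming the same inputs as the scenario (1,000,000 requests/month, a 10% ticket rate, R$ 2-5 per ticket); the function name and parameters are illustrative only.

```python
def monthly_support_cost(requests_per_month, accuracy, ticket_rate, cost_per_ticket):
    """Support cost caused by wrong tool calls, under a simple linear model."""
    wrong_calls = requests_per_month * (1.0 - accuracy)
    tickets = wrong_calls * ticket_rate
    return tickets * cost_per_ticket

REQUESTS = 10_000 * 100  # 10,000 customers x 100 requests/month

for accuracy in (0.65, 0.90):
    low = monthly_support_cost(REQUESTS, accuracy, 0.10, 2.0)
    high = monthly_support_cost(REQUESTS, accuracy, 0.10, 5.0)
    print(f"accuracy {accuracy:.0%}: R$ {low:,.0f} to R$ {high:,.0f} per month")
```

The model applies the same ticket rate at every accuracy level. In reality, the rate probably depends on which tool was chosen by mistake.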

Problem 2: Multi-step tasks fail (the chain breaks)

MANY TASKS ARE MULTI-STEP:

Exemplo: "Can you process my refund and send me confirmation email?"

Required steps:

  1. Verify customer identity (use: verify_customer_identity tool)
  2. Verify refund eligibility (use: check_refund_eligibility tool)
  3. Process refund (use: process_refund tool)
  4. Send confirmation email (use: send_email tool)

If the agent chooses the wrong tool at ANY step:

  • Step 1: Choose wrong tool → stops, can't verify
  • Step 2: Choose wrong tool → stops, can't check eligibility
  • Step 3: Choose wrong tool → stops, can't refund
  • Step 4: Choose wrong tool → refund processed, but no email sent

Accuracy requirement:

  • All 4 steps MUST be correct
  • Assuming errors at each step are independent, the chain succeeds with probability (per-step accuracy)^4
  • If each step has 85% accuracy: 0.85^4 = 52% chance the whole chain succeeds
  • If each step has 90% accuracy: 0.90^4 = 66% chance
  • If each step has 95% accuracy: 0.95^4 = 81% chance
  • If each step has 99% accuracy: 0.99^4 = 96% chance

Conclusion: Multi-step tasks DEMAND high tool-calling accuracy (every step must be correct)
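The compounding effect can be checked in a few lines. This sketch assumes each step fails independently with the same per-step accuracy, which is a simplification. It also inverts the formula to ask what per-step accuracy a target chain success rate would require.

```python
def chain_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in the chain picks the right tool."""
    return per_step_accuracy ** steps

def required_step_accuracy(target_chain_success: float, steps: int) -> float:
    """Per-step accuracy needed to hit a target success rate for the whole chain."""
    return target_chain_success ** (1.0 / steps)

for acc in (0.85, 0.90, 0.95, 0.99):
    print(f"{acc:.0%} per step -> {chain_success_probability(acc, 4):.0%} for 4 steps")

print(f"90% chain success over 4 steps needs {required_step_accuracy(0.90, 4):.1%} per step")
```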

Your agent (60-75% accuracy per step): a 4-step task succeeds only about 13-32% of the time (0.60^4 to 0.75^4).

A trained agent (90-95% accuracy per step): the same task succeeds about 66-81% of the time.

Problem 3: Support costs explode (broken tool-calling = support tickets)

WRONG TOOL-CALLING → SUPPORT COSTS:

Example:

  • Customer requests refund
  • Agent chooses process_payment instead of process_refund
  • Customer's account CHARGED instead of refunded
  • Customer notices: "Wait, I was charged? I asked for refund!"
  • Customer calls support: "Your AI agent charged me"
  • Support human needs to: Investigate, reverse charge, apologize
  • Time: 15-30 minutes per case
  • Cost: R$ 100-300 per case
  • Customer satisfaction: -5/5 (pissed)

If this happens 100x/month:

  • R$ 10K-30K/month support cost
  • Massive customer churn
  • Reputation damage

All preventable by training the agent for high tool-calling accuracy


HOW TO TRAIN TOOL-CALLING ACCURACY (SFT + DPO)

What is SFT (Supervised Fine-Tuning)?

SFT = Supervised Fine-Tuning

Idea: Show agent examples of CORRECT tool calls, let it learn

PROCESS:

  1. Collect training data

    • Log 1,000 requests where agent was CORRECT
    • For each: Store (request, correct_tool_call, result)

    Example:

    • Request: "I want to cancel my subscription"
    • Correct tool: cancel_subscription
    • Result: "Subscription cancelled successfully"
  2. Fine-tune agent on this data

    • Agent learns: "When customer says 'cancel', use cancel_subscription tool"
    • Agent internalizes: Request → tool mapping
  3. Test on new requests

    • Give agent new requests (not in training data)
    • Agent now chooses correct tools more often
    • Accuracy improves: 60% → 75-80%
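Step 3 only means something if the test requests were held out of training. A minimal sketch of a deterministic split plus an accuracy check; the record fields (`request`, `correct_tool`) and the keyword rule standing in for the fine-tuned model are assumptions for illustration.

```python
import random

def split_examples(examples, test_fraction=0.2, seed=0):
    """Shuffle deterministically and hold out a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def tool_accuracy(predict_tool, examples):
    """Fraction of examples where the predicted tool matches the label."""
    hits = sum(1 for ex in examples if predict_tool(ex["request"]) == ex["correct_tool"])
    return hits / len(examples)

# Hypothetical stand-in for the fine-tuned model: a simple keyword rule.
def keyword_baseline(request):
    return "cancel_subscription" if "cancel" in request.lower() else "get_order_status"

examples = [
    {"request": "I want to cancel my subscription", "correct_tool": "cancel_subscription"},
    {"request": "What's my order status?", "correct_tool": "get_order_status"},
    {"request": "Please cancel my plan", "correct_tool": "cancel_subscription"},
    {"request": "Where is my order?", "correct_tool": "get_order_status"},
    {"request": "Refund my last payment", "correct_tool": "process_refund"},
]

train, held_out = split_examples(examples)
print(len(train), len(held_out))
print(tool_accuracy(keyword_baseline, examples))
```

In a real run, `predict_tool` would call the deployed model, and accuracy would be reported on `held_out` only.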

COST:

  • Collecting training data: 10-20 hours (log correct calls)
  • Fine-tuning: R$ 100-500 (compute cost on SageMaker)
  • Testing: 5-10 hours
  • Total: R$ 500-1K cost, 30-40 hours effort

Benefit:

  • Accuracy improvement: 60% → 75-80% (15-20 percentage points)
  • Support cost reduction: R$ 20K-50K/month (from fewer failures)
  • Payback: 1-3 weeks

What is DPO (Direct Preference Optimization)?

DPO = Direct Preference Optimization

Idea: Show agent CORRECT vs INCORRECT tool calls, let it learn preferences

PROCESS:

  1. Collect preference data

    • For each request, get 2+ tool call options:
      • Option A: CORRECT tool
      • Option B: INCORRECT tool

    Example:

    • Request: "What's my order status?"
    • Option A (CORRECT): get_order_status tool ✓
    • Option B (INCORRECT): send_email tool ✗
  2. Train agent on preferences

    • Agent learns: "Prefer get_order_status over send_email for status queries"
    • Agent learns comparative preferences (not just absolute right/wrong)
  3. Test on new requests

    • Accuracy improves: 75% → 85-95% (even better than SFT alone)

Why DPO adds to SFT:

  • SFT learns: "This is correct"
  • DPO learns: "This is better than that"
  • DPO is comparative rather than absolute, which is closer to how people judge options
  • Because it sees explicit wrong choices, DPO may generalize better on tools that are easy to confuse
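For readers who want to see what "learning a preference" means numerically, here is a sketch of the standard DPO loss for one preference pair. It takes log-probabilities of the preferred and dispreferred tool call under the model being trained and under a frozen reference model. The input values are made up, and `beta=0.1` is an assumed setting, not one from the Amazon announcement.

```python
import math

def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * (chosen margin - rejected margin)))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return softplus(-logits)  # equals -log(sigmoid(logits))

# Policy identical to the reference: no preference learned yet, loss is log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))

# Policy now favors the correct tool more than the reference does: loss drops.
print(dpo_loss(-0.5, -3.0, -1.0, -1.0))
```

Training libraries compute the same quantity over batches of token-level log-probabilities. The point here is only that the loss falls as the model raises the correct tool relative to the wrong one.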

COST:

  • Collecting preference data: 20-30 hours (label correct vs incorrect)
  • DPO training: R$ 200-1K (compute cost)
  • Testing: 5-10 hours
  • Total: R$ 1K-2K cost, 40-50 hours effort

Benefit:

  • Accuracy improvement: 75% → 85-95% (10-20 more percentage points)
  • Support cost reduction: Additional R$ 20K-50K/month
  • Payback: 1-2 weeks

SFT + DPO Combined (Best approach)

COMBINED APPROACH:

  1. Start with SFT (quick win)

    • Collect 500-1000 correct examples
    • Fine-tune agent
    • Accuracy: 60% → 75-80%
    • Cost: R$ 500-1K
    • Time: 2-3 weeks
  2. Then add DPO (further improvement)

    • Collect preference pairs (correct vs incorrect)
    • Train DPO
    • Accuracy: 75-80% → 88-95%
    • Cost: R$ 1K-2K
    • Time: 2-3 weeks more
  3. Monitor and iterate

    • Test on real requests
    • Log failures
    • Improve training data based on failures
    • Rinse and repeat

Total investment:

  • Cost: R$ 2K-3K
  • Time: 4-6 weeks
  • Team: 1-2 engineers + 1 ML expert

Total benefit:

  • Accuracy improvement: 60% → 90%+ (30+ percentage points)
  • Support cost reduction: R$ 100K-200K/month (across all multi-step tasks)
  • Payback: Less than 1 week

HOW TO IMPLEMENT ON AMAZON SAGEMAKER

Step 1: Prepare training data (1-2 weeks)

COLLECT CORRECT EXAMPLES (FOR SFT):

  1. Log all agent requests

    • Store: customer request, agent's tool choice, result
  2. Filter for CORRECT cases

    • Only keep: Where agent chose right tool and customer was happy
  3. Format training data

    [
      {
        "request": "I want to cancel my subscription",
        "correct_tool": "cancel_subscription",
        "tool_params": {"customer_id": "123"},
        "result": "success"
      },
      {
        "request": "What's my order status?",
        "correct_tool": "get_order_status",
        "tool_params": {"order_id": "456"},
        "result": "success"
      }
    ]

  4. Collect 500-1000 examples

    • Min: 100 examples per tool (if you have 5 tools, 500 total)
    • Better: 200-500 examples per tool (1000-2500 total)
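Before launching a job, it helps to check that every tool has enough examples. A small sketch that serializes examples as JSON Lines and flags tools below a minimum count. It assumes the record fields shown in the format above; the helper names are hypothetical, and whether your training container expects JSONL is something to confirm.

```python
import json
from collections import Counter

def to_jsonl(examples):
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(ex, sort_keys=True) for ex in examples)

def tools_below_minimum(examples, tools, minimum=100):
    """Return {tool: count} for every tool with fewer than `minimum` examples."""
    counts = Counter(ex["correct_tool"] for ex in examples)
    return {tool: counts[tool] for tool in tools if counts[tool] < minimum}

examples = [
    {"request": "I want to cancel my subscription", "correct_tool": "cancel_subscription",
     "tool_params": {"customer_id": "123"}, "result": "success"},
    {"request": "What's my order status?", "correct_tool": "get_order_status",
     "tool_params": {"order_id": "456"}, "result": "success"},
]
tools = ["cancel_subscription", "get_order_status", "process_refund"]

print(to_jsonl(examples))
print(tools_below_minimum(examples, tools, minimum=1))
```

With `minimum=1`, the check reports `process_refund` as missing entirely. With the default of 100, it would flag all three tools.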

COLLECT PREFERENCE PAIRS (FOR DPO):

  1. For each request, generate 2+ tool options

    • Option A: CORRECT tool
    • Option B: INCORRECT tool (random, or sampled)
  2. Format preference data

    [
      {
        "request": "Cancel my subscription",
        "preferred_tool": "cancel_subscription",
        "dispreferred_tool": "send_email"
      },
      {
        "request": "Check order status",
        "preferred_tool": "get_order_status",
        "dispreferred_tool": "process_refund"
      }
    ]

  3. Collect 500-1000 preference pairs

    • Min: 100 per tool
    • Better: 200+ per tool

Step 2: Fine-tune on SageMaker (1 week)

USING AMAZON SAGEMAKER AI:

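  1. Upload training data to S3 (bash)

    aws s3 cp training_data.json s3://my-bucket/sft-data/

  2. Create SageMaker fine-tuning job (python)

    import sagemaker
    from sagemaker.estimator import Estimator

    # Placeholders: replace ACCOUNT and IMAGE_URI with your own values
    role = "arn:aws:iam::ACCOUNT:role/SageMakerRole"
    sm = sagemaker.Session()

    estimator = Estimator(
        image_uri="IMAGE_URI",
        role=role,
        instance_count=1,
        instance_type="ml.g4dn.xlarge",
        sagemaker_session=sm,
        # Hyperparameter names depend on the training container you use
        hyperparameters={
            "learning_rate": 1e-5,
            "num_epochs": 3,
            "batch_size": 8,
            "model_id": "your-base-model",
        },
    )

    # The S3 location must be passed as a string
    estimator.fit("s3://my-bucket/sft-data/")

  3. Deploy fine-tuned model (python)

    from sagemaker.serializers import JSONSerializer
    from sagemaker.deserializers import JSONDeserializer

    predictor = estimator.deploy(
        initial_instance_count=1,
        instance_type="ml.g4dn.xlarge",
        serializer=JSONSerializer(),      # send the request body as JSON
        deserializer=JSONDeserializer(),  # parse the JSON response
    )

  4. Test on real requests (python)

    response = predictor.predict({"input": "I want to cancel my subscription"})

    Expected output (the exact shape depends on your inference code): {"chosen_tool": "cancel_subscription"}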

Cost:

  • Training: R$ 50-200 (depending on instance size, time)
  • Hosting: R$ 500-1K/month (inference costs)
  • Total: R$ 600-1K initial, R$ 500-1K/month ongoing

Step 3: Monitor and iterate (ongoing)

MONITOR TOOL-CALLING ACCURACY:

  1. Track metrics

    • Tool-calling accuracy: % correct tools chosen
    • Failure rate: % of tasks failed due to wrong tool
    • Customer satisfaction: Rating of agent responses
  2. Log failures

    • Store: the request, the wrong tool chosen, and the tool that should have been chosen
  3. Iterate

    • Add failure cases to training data (as negative examples for DPO)
    • Re-run SFT + DPO every month
    • Gradually improve accuracy: 80% → 85% → 90% → 95%+
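The monitoring loop above can be sketched as one function. It computes accuracy from labeled logs, counts which tools get confused with which, and turns each failure into a new preference pair for the next DPO round. The log fields (`request`, `chosen_tool`, `correct_tool`) are assumptions about what you store.

```python
from collections import Counter

def summarize_logs(logs):
    """Return accuracy, confusion counts, and DPO pairs built from failures."""
    failures = [e for e in logs if e["chosen_tool"] != e["correct_tool"]]
    accuracy = (len(logs) - len(failures)) / len(logs)
    # (correct tool, wrongly chosen tool) -> how often it happened
    confusions = Counter((e["correct_tool"], e["chosen_tool"]) for e in failures)
    new_pairs = [
        {
            "request": e["request"],
            "preferred_tool": e["correct_tool"],
            "dispreferred_tool": e["chosen_tool"],
        }
        for e in failures
    ]
    return accuracy, confusions, new_pairs

logs = [
    {"request": "What's my order status?", "chosen_tool": "get_order_status", "correct_tool": "get_order_status"},
    {"request": "Cancel my subscription", "chosen_tool": "get_subscription_info", "correct_tool": "cancel_subscription"},
    {"request": "Refund my order", "chosen_tool": "process_refund", "correct_tool": "process_refund"},
    {"request": "Email me the receipt", "chosen_tool": "send_email", "correct_tool": "send_email"},
]

accuracy, confusions, new_pairs = summarize_logs(logs)
print(accuracy)
print(confusions.most_common(3))
print(new_pairs)
```

The `correct_tool` label has to come from somewhere, typically human review of a sample of traffic. That review is usually the real cost of this loop.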

Expected progression:

  • Week 1: Deploy SFT-trained model → 75-80% accuracy
  • Week 2: Add DPO → 85-90% accuracy
  • Week 3-4: Iterate on failures → 90-93% accuracy
  • Month 2: Continue iteration → 93-96% accuracy
  • Month 3+: Near-perfect (96%+ accuracy on your domain)

CONCLUSION: YOUR AI AGENT NEEDS TOOL-CALLING TRAINING (URGENT)

What you need to know:

  1. Amazon signals: tool-calling accuracy is trainable (it is not an inherent limitation)

    • Amazon (huge ML resources) invested in SFT + DPO for agents
    • Implication: Tool-calling is a learned skill (can be improved)
    • Competitors will train their agents (and beat you on reliability)
    • You need to train to stay competitive
  2. Your untrained agent is failing (60-75% accuracy)

    • 25-40% of tool calls are WRONG
    • Each wrong call = customer confusion, support ticket, adoption penalty
    • Support costs explode (R$ 50K-200K/month from tool failures)
    • Customer experience degrades
  3. Training is doable (SFT + DPO, 4-6 weeks)

    • Phase 1: Collect training data (1-2 weeks)
    • Phase 2: Fine-tune with SFT (1 week) → 75-80% accuracy
    • Phase 3: Train with DPO (1 week) → 85-95% accuracy
    • Phase 4: Iterate on failures (ongoing) → 95%+ accuracy
  4. The investment is small (R$ 2-3K + 1-2 engineers)

    • SFT + DPO training cost: R$ 2-3K
    • Infrastructure: R$ 500-1K/month (SageMaker hosting)
    • Effort: 1-2 engineers, 4-6 weeks
    • Payback: Less than 1 week (R$ 100K-200K/month support savings)
  5. Urgency: Start NOW (before competitors)

    • Competitors who train their agents will have 90%+ accuracy
    • Your untrained agent (60-75%) will fail in comparison
    • Customers will switch to more reliable agents (competitors)
    • Every delay = market share lost

At OpenClaw, we help SaaS companies train AI agents for high tool-calling accuracy:

  • AUDIT the current agent (tool-calling accuracy analysis, failure logging)
  • COLLECT training data (correct examples for SFT, preference pairs for DPO)
  • IMPLEMENT SFT (supervised fine-tuning on SageMaker, accuracy 75-80%)
  • IMPLEMENT DPO (direct preference optimization, accuracy 85-95%)
  • DEPLOY trained model (host on SageMaker, monitor accuracy)
  • ITERATE on failures (log failures, improve training data, re-train monthly)
  • SCALE training (add more domains, more tools, more complex tasks)

Result: your AI agent goes from "untrained, unreliable, 60-75% accuracy" → "trained, reliable, 90-95%+ accuracy".

Does your AI agent frequently choose the wrong tools?

Is your tool-calling accuracy at 60-75% (below what's acceptable)?

Are you spending R$ 50K-200K/month on support costs due to tool failures?

Do you have an agent trained for high tool-calling accuracy (SFT + DPO)?

If not: your agent is a capability liability. Untrained means unreliable, unreliable means customers don't trust it, and without trust adoption breaks. The fix exists, but you haven't invested in it, and competitors who train their agents will beat you on reliability. Train your agent now, before a competitor launches a more dependable agent and takes your customers, and before tool-calling failure costs compound against you.

What are you going to do?

Train your AI agent for high tool-calling accuracy (SFT + DPO fine-tuning, 90-95%+ accuracy, R$ 100K-200K/month support savings) →

