Your AI agent picks the wrong tools (Amazon shows how to train it)
Your AI agent calls the wrong API, runs the wrong action, and fails tasks. Amazon's message: tool-calling accuracy is trainable (SFT + DPO).
OpenClaw Team · Engineering & Product Team
You have a SaaS product.
Your SaaS has an AI agent (customer service, sales, support).
The agent needs to handle multi-step tasks (look up information, calculate, execute an action).
How the agent works:
Customer request → Agent processes it → Agent chooses a TOOL
Example tools:
- Look up customer info (API: get_customer_by_id)
- Check order status (API: get_order_by_id)
- Process refund (API: process_refund)
- Send email (API: send_email)
- Update database (API: update_customer_data)
The agent has 5-10 tools available.
The agent has to choose the RIGHT tool for each request.
GOOD example:
Customer: "What's my order status?"
Agent:
- Recognizes: needs to check the order status
- Chooses tool: get_order_by_id ✓ (RIGHT)
- Calls API: get_order_by_id(order_id=456)
- Gets result: Order #456, status: shipped
- Replies: "Your order has shipped"
Result: ✓ Success
BAD example:
Customer: "What's my order status?"
Agent:
- Recognizes: needs to... do something?
- Chooses tool: process_refund ✗ (WRONG)
- Tries to call: process_refund(customer_id=123)
- Error: makes no sense (the customer did not ask for a refund)
- Result: failure, confusion, a support ticket
Or:
- Recognizes: needs to check the status
- Chooses tool: send_email ✗ (WRONG)
- Tries to call: send_email(message="What's your status?")
- Error: sent a useless email instead of checking the status
- Result: confusion and an annoyed customer
Reality: your agent chooses the WRONG TOOL frequently.
Why:
- A general-purpose LLM is not tuned for your tool-calling task (its training centers on next-token prediction, not on "choose the right tool for this request")
- A general-purpose LLM does not know your specific tools (it lacks the context)
- A general-purpose LLM hallucinates (invents tools that do not exist)
- A general-purpose LLM guesses (when it is unsure)
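One cheap guard against the hallucinated-tool case is to validate the model's choice against a registry before anything runs. The following is a minimal sketch, not a definitive implementation: the two tool functions are hypothetical stand-ins for the APIs named above, and `dispatch` and `UnknownToolError` are names invented for this illustration.

```python
from typing import Any, Callable, Dict


# Hypothetical stand-ins for two of the tool APIs named in the article.
def get_order_by_id(order_id: int) -> Dict[str, Any]:
    return {"order_id": order_id, "status": "shipped"}


def process_refund(order_id: int) -> Dict[str, Any]:
    return {"order_id": order_id, "refunded": True}


# The registry is the single source of truth for which tools exist.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "get_order_by_id": get_order_by_id,
    "process_refund": process_refund,
}


class UnknownToolError(ValueError):
    """Raised when the model names a tool that is not registered."""


def dispatch(tool_name: str, **params: Any) -> Dict[str, Any]:
    # Reject hallucinated tool names before anything is executed.
    if tool_name not in TOOLS:
        raise UnknownToolError(f"unknown tool: {tool_name}")
    return TOOLS[tool_name](**params)


print(dispatch("get_order_by_id", order_id=456))
```

This only catches tools that do not exist. It does nothing about a real tool chosen for the wrong request, which is the problem training is meant to address.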
Result:
Tool-calling accuracy (% of times the agent chooses the RIGHT tool):
General-purpose LLM (no task-specific training): 60-75% accuracy
- 25-40% of requests: the agent chooses the WRONG tool
- Each wrong tool choice: the customer sees an error, a support ticket is opened, and you collect negative feedback
Example with 1,000 requests:
- 600-750 successes (the agent got it right)
- 250-400 failures (the agent got it wrong)
- Each failure: 1-2 minutes of support time
- Total: 250-800 minutes of support = R$ 2K-8K in support cost (from tool-calling failures alone)
You are not noticing the problem, because the agent works "ok" and you assume occasional failures are normal.
Then the news arrives:
"Amazon SageMaker AI: Improve agent's tool-calling accuracy with SFT and DPO."
"SFT (Supervised Fine-Tuning) + DPO (Direct Preference Optimization) = trainable tool-calling accuracy."
"Result: Tool-calling accuracy 85-95% (vs 60-75% generic LLM)."
You think:
"Wait, tool-calling accuracy is trainable?
Can my agent (60-75%) be trained to 85-95%?
Will competitors who fine-tuned their agents have:
- 85-95% tool-calling accuracy (vs my 60-75%)
- Fewer failures (fewer support tickets)
- A better customer experience (the agent works better)
- Higher adoption (customers trust the agent more)
Will my agent (untrained, 60-75%) look outdated (unreliable, lots of failures, high support cost)?"
Yes. Yes. Yes. Yes.
Amazon just signaled: tool-calling accuracy is trainable (it is a training problem, not an inherent limitation).
Your untrained agent is now a capability liability: it can be improved, but you have not invested in it.
THE PROBLEM: WRONG TOOL CALLS DESTROY THE USER EXPERIENCE
Problem 1: The customer watches the agent fail (broken trust)
SCENARIO: A customer support agent chooses the wrong tool
Customer: "I want to cancel my subscription"
Agent:
- Recognizes: needs to process a cancellation
- Chooses tool: WRONG (picks get_subscription_info instead of cancel_subscription)
- Tries to execute: returns info, does not cancel
- Agent gets confused: "I found your subscription... uh... wait" (hallucinates)
- Customer sees: the agent cannot do a simple thing
- Customer thinks: "This agent is garbage. I don't trust it."
Result:
- The customer stops using the agent
- The customer calls human support (cost: R$ 50-100 per ticket)
- The customer leaves negative feedback ("The bot doesn't work")
- Agent adoption drops
IMPACT AT SCALE:
If you have 10,000 customers making 100 requests/month each: 1,000,000 requests/month
If tool-calling accuracy is 65%: 350,000 wrong tool calls/month
If 10% of those failures turn into a support ticket: 35,000 support tickets/month. At 5 minutes of human support per ticket (R$ 2-5 each), that is R$ 70K-175K/month in support cost (from tool-calling failures alone)
If tool-calling accuracy were 90% (trained): 100,000 wrong tool calls/month, 10,000 support tickets/month, R$ 20K-50K/month in support cost
Savings: R$ 50K-125K/month (from improving tool-calling accuracy from 65% to 90%)
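The scale estimate is simple enough to check in a few lines. This is a rough sketch under the assumptions stated in the example (10% of wrong calls become tickets, each ticket costs R$ 2-5); the function name and its parameters are made up for illustration, and integer percentages are used to keep the arithmetic exact.

```python
def monthly_support_cost(requests: int, accuracy_pct: int,
                         ticket_rate_pct: int, cost_per_ticket: int) -> int:
    """Support cost (R$) caused by wrong tool calls in one month."""
    wrong_calls = requests * (100 - accuracy_pct) // 100
    tickets = wrong_calls * ticket_rate_pct // 100
    return tickets * cost_per_ticket


REQUESTS = 10_000 * 100  # 10,000 customers x 100 requests/month

for accuracy in (65, 90):
    low = monthly_support_cost(REQUESTS, accuracy, 10, 2)
    high = monthly_support_cost(REQUESTS, accuracy, 10, 5)
    print(f"{accuracy}% accuracy: R$ {low:,} to R$ {high:,} per month")
# 65% accuracy: R$ 70,000 to R$ 175,000 per month
# 90% accuracy: R$ 20,000 to R$ 50,000 per month
```

Under these assumptions, moving from 65% to 90% saves roughly R$ 50K-125K/month. Plug in your own ticket rate and per-ticket cost, since those two inputs dominate the result.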
Problem 2: Multi-step tasks fail (the chain breaks)
MANY TASKS ARE MULTI-STEP:
Exemplo: "Can you process my refund and send me confirmation email?"
Required steps:
- Verify customer identity (use: verify_customer_identity tool)
- Verify refund eligibility (use: check_refund_eligibility tool)
- Process refund (use: process_refund tool)
- Send confirmation email (use: send_email tool)
If the agent chooses the wrong tool at ANY step:
- Step 1: Choose wrong tool → stops, can't verify
- Step 2: Choose wrong tool → stops, can't check eligibility
- Step 3: Choose wrong tool → stops, can't refund
- Step 4: Choose wrong tool → refund processed, but no email sent
Accuracy requirement:
- All 4 steps MUST be correct
- If each step has 85% accuracy: 0.85^4 = 52% chance whole chain succeeds
- If each step has 90% accuracy: 0.90^4 = 66% chance
- If each step has 95% accuracy: 0.95^4 = 81% chance
- If each step has 99% accuracy: 0.99^4 = 96% chance
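Conclusion: multi-step tasks DEMAND high tool-calling accuracy, because every step must be correct (the figures above assume each step succeeds or fails independently).
Your agent (60-75% per-step accuracy): a 4-step chain succeeds only 13-32% of the time.
A trained agent (90-95% per-step accuracy): the same chain succeeds 66-81% of the time.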
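The compounding effect can be reproduced directly. The sketch below assumes every step has the same accuracy and that steps fail independently, which is a simplification: in a real agent, errors are often correlated.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain picks the right tool,
    assuming independent steps with identical accuracy."""
    return per_step_accuracy ** steps


for accuracy in (0.60, 0.75, 0.85, 0.90, 0.95, 0.99):
    print(f"{accuracy:.0%} per step -> {chain_success(accuracy, 4):.0%} for a 4-step task")
# 60% per step -> 13% for a 4-step task
# 75% per step -> 32% for a 4-step task
# 85% per step -> 52% for a 4-step task
# 90% per step -> 66% for a 4-step task
# 95% per step -> 81% for a 4-step task
# 99% per step -> 96% for a 4-step task
```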
Problem 3: Support costs explode (broken tool-calling = support tickets)
WRONG TOOL CALLS → SUPPORT COSTS:
Example:
- Customer requests refund
- Agent chooses process_payment instead of process_refund
- Customer's account CHARGED instead of refunded
- Customer notices: "Wait, I was charged? I asked for refund!"
- Customer calls support: "Your AI agent charged me"
- Support human needs to: Investigate, reverse charge, apologize
- Time: 15-30 minutes per case
- Cost: R$ 100-300 per case
- Customer satisfaction: rock bottom (the customer is furious)
If this happens 100 times a month: R$ 10K-30K/month in support cost, plus customer churn and reputation damage
All preventable by training the agent for high tool-calling accuracy
HOW TO TRAIN TOOL-CALLING ACCURACY (SFT + DPO)
What is SFT (Supervised Fine-Tuning)?
SFT = Supervised Fine-Tuning
Idea: Show agent examples of CORRECT tool calls, let it learn
PROCESS:
1. Collect training data
   - Log 1,000 requests where the agent was CORRECT
   - For each: store (request, correct_tool_call, result)
   Example:
   - Request: "I want to cancel my subscription"
   - Correct tool: cancel_subscription
   - Result: "Subscription cancelled successfully"
2. Fine-tune the agent on this data
   - The agent learns: "When the customer says 'cancel', use the cancel_subscription tool"
   - The agent internalizes the request → tool mapping
3. Test on new requests
   - Give the agent new requests (not in the training data)
   - The agent now chooses correct tools more often
   - Accuracy improves: 60% → 75-80%
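Step 1 boils down to turning logged correct calls into prompt/completion pairs. Here is a minimal sketch of that conversion; the prompt template and the `prompt`/`completion` field names are assumptions chosen for illustration, and the exact format your fine-tuning container expects may differ.

```python
import json
from typing import Dict, Iterable, List


def to_sft_example(request: str, tool: str, params: Dict[str, str]) -> Dict[str, str]:
    """Turn one logged correct call into a prompt/completion pair."""
    completion = json.dumps({"tool": tool, "params": params}, sort_keys=True)
    return {"prompt": f"Customer request: {request}\nTool call:", "completion": completion}


def to_jsonl(examples: Iterable[Dict[str, str]]) -> str:
    """Serialize examples as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(example, ensure_ascii=False) for example in examples)


# Hypothetical log entries: (request, correct tool, parameters).
logged_calls = [
    ("I want to cancel my subscription", "cancel_subscription", {"customer_id": "123"}),
    ("What's my order status?", "get_order_by_id", {"order_id": "456"}),
]

examples: List[Dict[str, str]] = [to_sft_example(*call) for call in logged_calls]
print(to_jsonl(examples))
```

Keeping the completion as strict JSON makes the model's output easy to parse and validate at inference time.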
COST:
- Collecting training data: 10-20 hours (log correct calls)
- Fine-tuning: R$ 100-500 (compute cost on SageMaker)
- Testing: 5-10 hours
- Total: R$ 500-1K cost, 30-40 hours effort
Benefit:
- Accuracy improvement: 60% → 75-80% (15-20 percentage points)
- Support cost reduction: R$ 20K-50K/month (from fewer failures)
- Payback: 1-3 weeks
What is DPO (Direct Preference Optimization)?
DPO = Direct Preference Optimization
Idea: Show agent CORRECT vs INCORRECT tool calls, let it learn preferences
PROCESS:
1. Collect preference data
   - For each request, get 2+ tool call options:
     - Option A: CORRECT tool
     - Option B: INCORRECT tool
   Example:
   - Request: "What's my order status?"
   - Option A (CORRECT): get_order_by_id tool ✓
   - Option B (INCORRECT): send_email tool ✗
2. Train the agent on preferences
   - The agent learns: "Prefer get_order_by_id over send_email for status queries"
   - The agent learns comparative preferences (not just absolute right/wrong)
3. Test on new requests
   - Accuracy improves: 75% → 85-95% (better than SFT alone)
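Why add DPO on top of SFT:
- SFT learns: "This is correct"
- DPO learns: "This is better than that"
- DPO trains on explicit negatives, so the model is pushed away from the wrong tools it actually tends to pick
- The two are complementary: DPO is usually applied to a model that has already been through SFT, and the comparative signal can help it generalize to requests SFT did not cover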
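To make "this is better than that" concrete, here is the standard DPO loss for a single preference pair, written in plain Python. It is only a worked illustration: the log-probabilities are made-up numbers, `beta=0.1` is an arbitrary choice, and a real training run would compute these values from the policy model and a frozen reference model over whole batches.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    where pi_* and ref_* are log-probabilities of the full tool call.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: no preference learned yet, loss = log(2).
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))

# Policy now favors the correct tool more than the reference does: loss drops.
print(dpo_loss(-0.5, -3.0, -1.0, -2.0))
```

The loss falls as the policy raises the probability of the preferred tool call relative to the rejected one, measured against the reference model.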
COST:
- Collecting preference data: 20-30 hours (label correct vs incorrect)
- DPO training: R$ 200-1K (compute cost)
- Testing: 5-10 hours
- Total: R$ 1K-2K cost, 40-50 hours effort
Benefit:
- Accuracy improvement: 75% → 85-95% (another 10-20 percentage points)
- Support cost reduction: Additional R$ 20K-50K/month
- Payback: 1-2 weeks
SFT + DPO Combined (Best approach)
COMBINED APPROACH:
1. Start with SFT (quick win)
   - Collect 500-1000 correct examples
   - Fine-tune the agent
   - Accuracy: 60% → 75-80%
   - Cost: R$ 500-1K
   - Time: 2-3 weeks
2. Then add DPO (further improvement)
   - Collect preference pairs (correct vs incorrect)
   - Train with DPO
   - Accuracy: 75-80% → 88-95%
   - Cost: R$ 1K-2K
   - Time: 2-3 more weeks
3. Monitor and iterate
   - Test on real requests
   - Log failures
   - Improve the training data based on failures
   - Rinse and repeat
Total investment:
- Cost: R$ 2K-3K
- Time: 4-6 weeks
- Team: 1-2 engineers + 1 ML expert
Total benefit:
- Accuracy improvement: 60% → 90%+ (30+ percentage points)
- Support cost reduction: R$ 40K-100K/month (the SFT and DPO estimates above combined)
- Payback: less than 1 week
HOW TO IMPLEMENT ON AMAZON SAGEMAKER
Step 1: Prepare training data (1-2 weeks)
COLLECT CORRECT EXAMPLES (FOR SFT):
1. Log all agent requests
   - Store: customer request, agent's tool choice, result
2. Filter for CORRECT cases
   - Only keep cases where the agent chose the right tool and the customer was happy
3. Format the training data
    [
      {
        "request": "I want to cancel my subscription",
        "correct_tool": "cancel_subscription",
        "tool_params": {"customer_id": "123"},
        "result": "success"
      },
      {
        "request": "What's my order status?",
        "correct_tool": "get_order_by_id",
        "tool_params": {"order_id": "456"},
        "result": "success"
      }
    ]
4. Collect 500-1000 examples
   - Minimum: 100 examples per tool (if you have 5 tools, 500 total)
   - Better: 200-500 examples per tool (1000-2500 total)
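Before paying for a training run, it is worth checking that every tool clears the per-tool minimum. This is a small sketch that assumes examples use the `correct_tool` field from the format above; `undercovered_tools` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict, List

MIN_EXAMPLES_PER_TOOL = 100  # the minimum suggested above


def undercovered_tools(examples: List[Dict[str, str]], tools: List[str],
                       minimum: int = MIN_EXAMPLES_PER_TOOL) -> Dict[str, int]:
    """Return {tool: example_count} for every tool below the minimum."""
    counts = Counter(example["correct_tool"] for example in examples)
    return {tool: counts[tool] for tool in tools if counts[tool] < minimum}


# Toy dataset: plenty of cancellations, too few refunds, no emails at all.
dataset = (
    [{"correct_tool": "cancel_subscription"}] * 120
    + [{"correct_tool": "process_refund"}] * 40
)
print(undercovered_tools(dataset, ["cancel_subscription", "process_refund", "send_email"]))
# {'process_refund': 40, 'send_email': 0}
```

Tools that show up here are the ones the fine-tuned model is most likely to keep getting wrong, so collect more examples for them first.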
COLLECT PREFERENCE PAIRS (FOR DPO):
1. For each request, generate 2+ tool options
   - Option A: CORRECT tool
   - Option B: INCORRECT tool (random, or sampled from the model's own mistakes)
2. Format the preference data
    [
      {
        "request": "Cancel my subscription",
        "preferred_tool": "cancel_subscription",
        "dispreferred_tool": "send_email"
      },
      {
        "request": "Check order status",
        "preferred_tool": "get_order_by_id",
        "dispreferred_tool": "process_refund"
      }
    ]
3. Collect 500-1000 preference pairs
   - Minimum: 100 per tool
   - Better: 200+ per tool
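Real failures make better negatives than random ones, because they are the mistakes the model actually makes. The sketch below builds preference pairs in the format above from a failure log. The log schema (`request`, `chosen_tool`, `correct_tool`) is an assumption for this illustration, as is the function name.

```python
from typing import Dict, List


def build_preference_pairs(failure_log: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Turn logged wrong tool choices into deduplicated DPO preference pairs."""
    seen = set()
    pairs = []
    for entry in failure_log:
        preferred = entry["correct_tool"]
        dispreferred = entry["chosen_tool"]
        if preferred == dispreferred:
            continue  # the agent was right, so there is no negative to learn from
        key = (entry["request"], preferred, dispreferred)
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        pairs.append({
            "request": entry["request"],
            "preferred_tool": preferred,
            "dispreferred_tool": dispreferred,
        })
    return pairs


log = [
    {"request": "Cancel my subscription", "chosen_tool": "send_email",
     "correct_tool": "cancel_subscription"},
    {"request": "Cancel my subscription", "chosen_tool": "send_email",
     "correct_tool": "cancel_subscription"},
    {"request": "Check order status", "chosen_tool": "get_order_by_id",
     "correct_tool": "get_order_by_id"},
]
print(build_preference_pairs(log))
```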
Step 2: Fine-tune on SageMaker (1 week)
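USING AMAZON SAGEMAKER AI:
1. Upload the training data to S3
    aws s3 cp training_data.json s3://my-bucket/sft-data/
2. Create a SageMaker fine-tuning job
    import sagemaker
    from sagemaker.estimator import Estimator
    from sagemaker.serializers import JSONSerializer
    from sagemaker.deserializers import JSONDeserializer

    role = "arn:aws:iam::ACCOUNT:role/SageMakerRole"  # replace ACCOUNT with your AWS account ID
    sm = sagemaker.Session()

    estimator = Estimator(
        image_uri="IMAGE_URI",  # placeholder: URI of your training container
        role=role,
        instance_count=1,
        instance_type="ml.g4dn.xlarge",
        sagemaker_session=sm,
        # Hyperparameter names depend on the training container; these are illustrative
        hyperparameters={
            "learning_rate": 1e-5,
            "num_epochs": 3,
            "batch_size": 8,
            "model_id": "your-base-model",
        },
    )

    # The S3 location must be passed as a string
    estimator.fit("s3://my-bucket/sft-data/")
3. Deploy the fine-tuned model
    predictor = estimator.deploy(
        initial_instance_count=1,
        instance_type="ml.g4dn.xlarge",
        serializer=JSONSerializer(),      # send the request dict as JSON
        deserializer=JSONDeserializer(),  # parse the JSON response
    )
4. Test on real requests
    response = predictor.predict({"input": "I want to cancel my subscription"})
    # Expected output (the exact format depends on your inference container):
    # {"chosen_tool": "cancel_subscription"}
Cost:
- Training: R$ 50-200 (depending on instance size and training time)
- Hosting: R$ 500-1K/month (inference costs)
- Total: R$ 550-1.2K in the first month, R$ 500-1K/month ongoing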
Step 3: Monitor and iterate (ongoing)
MONITOR TOOL-CALLING ACCURACY:
1. Track metrics
   - Tool-calling accuracy: % of correct tools chosen
   - Failure rate: % of tasks that failed due to a wrong tool
   - Customer satisfaction: rating of agent responses
2. Log failures
   - Store: the request, the wrong tool chosen, and the tool that should have been chosen
3. Iterate
   - Add failure cases to the training data (as negative examples for DPO)
   - Re-run SFT + DPO every month
   - Gradually improve accuracy: 80% → 85% → 90% → 95%+
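The first metric, plus a breakdown of which tools get confused with which, can be computed straight from the logs. This is a minimal sketch that assumes each log record carries a `chosen_tool` and a human-verified `correct_tool`; both field names and both function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def tool_calling_accuracy(log: List[Dict[str, str]]) -> float:
    """Fraction of logged calls where the agent chose the correct tool."""
    if not log:
        return 0.0
    correct = sum(1 for record in log if record["chosen_tool"] == record["correct_tool"])
    return correct / len(log)


def top_confusions(log: List[Dict[str, str]], n: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """Most frequent (correct_tool, chosen_tool) mistakes, most common first."""
    mistakes = Counter(
        (record["correct_tool"], record["chosen_tool"])
        for record in log
        if record["chosen_tool"] != record["correct_tool"]
    )
    return mistakes.most_common(n)


log = [
    {"chosen_tool": "get_order_by_id", "correct_tool": "get_order_by_id"},
    {"chosen_tool": "process_refund", "correct_tool": "process_refund"},
    {"chosen_tool": "send_email", "correct_tool": "get_order_by_id"},
    {"chosen_tool": "cancel_subscription", "correct_tool": "cancel_subscription"},
]
print(f"accuracy: {tool_calling_accuracy(log):.0%}")  # accuracy: 75%
print(top_confusions(log))
```

The confusion list tells you where to focus the next batch of preference pairs: the most frequent wrong choices are the negatives worth training against.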
Expected progression (targets counted from the first deployment; actual results depend on data quality and volume):
- Week 1: Deploy the SFT-trained model → 75-80% accuracy
- Week 2: Add DPO → 85-90% accuracy
- Week 3-4: Iterate on failures → 90-93% accuracy
- Month 2: Continue iterating → 93-96% accuracy
- Month 3+: Aim for 96%+ accuracy on your domain
CONCLUSION: YOUR AI AGENT NEEDS TOOL-CALLING TRAINING (URGENT)
What you need to know:
1. Amazon's signal: tool-calling accuracy is trainable (not an inherent limitation)
   - Amazon (with huge ML resources) invested in SFT + DPO for agents
   - Implication: tool-calling is a learned skill (it can be improved)
   - Competitors will train their agents (and beat you on reliability)
   - You need to train to stay competitive
2. Your untrained agent is failing (60-75% accuracy)
   - 25-40% of tool calls are WRONG
   - Each wrong call = customer confusion, a support ticket, an adoption penalty
   - Support costs explode (on the order of R$ 50K-200K/month at the scale modeled above)
   - The customer experience degrades
3. Training is doable (SFT + DPO, 4-6 weeks)
   - Phase 1: Collect training data (1-2 weeks)
   - Phase 2: Fine-tune with SFT (1 week) → 75-80% accuracy
   - Phase 3: Train with DPO (1 week) → 85-95% accuracy
   - Phase 4: Iterate on failures (ongoing) → 95%+ accuracy
4. The investment is small (R$ 2-3K + 1-2 engineers)
   - SFT + DPO training cost: R$ 2-3K
   - Infrastructure: R$ 500-1K/month (SageMaker hosting)
   - Effort: 1-2 engineers, 4-6 weeks
   - Payback: less than 1 week (R$ 40K-100K/month in estimated support savings)
5. Urgency: start NOW (before competitors do)
   - Competitors who train their agents will have 90%+ accuracy
   - Your untrained agent (60-75%) will look bad in comparison
   - Customers will switch to more reliable agents (your competitors')
   - Every delay costs market share
At OpenClaw, we help SaaS companies train AI agents for high tool-calling accuracy:
- AUDIT the current agent (tool-calling accuracy analysis, failure logging)
- COLLECT training data (correct examples for SFT, preference pairs for DPO)
- IMPLEMENT SFT (supervised fine-tuning on SageMaker, accuracy 75-80%)
- IMPLEMENT DPO (direct preference optimization, accuracy 85-95%)
- DEPLOY the trained model (host on SageMaker, monitor accuracy)
- ITERATE on failures (log failures, improve training data, re-train monthly)
- SCALE training (add more domains, more tools, more complex tasks)
Result: your AI agent goes from "untrained, unreliable, 60-75% accuracy" → "trained, reliable, 90-95%+ accuracy".
Does your AI agent frequently choose the wrong tools?
Is your tool-calling accuracy 60-75% (below acceptable)?
Are you spending R$ 50K-200K/month on support costs caused by tool failures?
Do you have an agent trained for high tool-calling accuracy (SFT + DPO)?
If not: your agent is a capability liability. Untrained means unreliable, unreliable means customers do not trust it, and without trust adoption stalls. The fix exists, and competitors who train their agents will beat you on reliability. Train your agent now, before a competitor launches a more dependable agent and takes market share, and before the cost of tool-calling failures compounds against you.
What are you going to do?
Published on June 3, 2026