Your AI agent picks the wrong tools (Amazon shows how to train it)

June 3, 2026

Your AI agent calls the wrong API, executes the wrong action, fails tasks. Amazon: tool-calling accuracy is trainable (SFT + DPO).

OpenClaw Team · Engineering & Product Team

The OpenClaw Team is made up of engineers, designers, and AI specialists dedicated to building the best conversational agent platform for Brazilian businesses. We combine expertise…

You run a SaaS product.

Your SaaS has an AI agent (customer service, sales, support).

The agent has to handle multi-step tasks (look up information, calculate, execute an action).

How the agent works:

Customer request → agent processes it → agent chooses a TOOL

Example tools:

Look up customer info (API: get_customer_by_id)
Check order status (API: get_order_by_id)
Process refund (API: process_refund)
Send email (API: send_email)
Update database (API: update_customer_data)

The agent has 5-10 tools available.

It has to choose the RIGHT tool for each request.

Good example:

Customer: "What's my order status?"

Agent:

Recognizes: needs to check order status
Chooses tool: get_order_by_id ✓ (RIGHT)
Calls API: get_order_by_id(customer_id=123)
Gets result: Order #456, status: shipped
Responds: "Your order has shipped"

Result: ✓ Success
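
The flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind any real agent: the two tool functions are hypothetical stand-ins that return canned data, and `TOOLS` and `dispatch` are names invented for this example. The point is that the agent's choice is only a name, and the surrounding code decides whether that name is allowed to run.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-ins for the real APIs named in the article.
def get_order_by_id(customer_id: int) -> Dict[str, Any]:
    return {"order_id": 456, "status": "shipped"}

def process_refund(customer_id: int) -> Dict[str, Any]:
    return {"refunded": True}

# Registry of callable tools; the agent may only pick names from here.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "get_order_by_id": get_order_by_id,
    "process_refund": process_refund,
}

def dispatch(tool_name: str, **params: Any) -> Dict[str, Any]:
    """Run the tool the agent chose, rejecting names outside the registry."""
    if tool_name not in TOOLS:
        raise KeyError(f"unknown tool: {tool_name}")
    return TOOLS[tool_name](**params)

# The agent decided on get_order_by_id for "What's my order status?"
result = dispatch("get_order_by_id", customer_id=123)
print(f"Your order is {result['status']}")
```

A registry like this catches tool names the model invents, but it cannot catch a valid tool chosen for the wrong request. That second kind of error is what the rest of the article is about.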

Bad example:

Customer: "What's my order status?"

Agent:

Recognizes: needs to... do something?
Chooses tool: process_refund ✗ (WRONG)
Tries to call: process_refund(customer_id=123)
Error: makes no sense (the customer never asked for a refund)
Result: failure, confusion, a support ticket

Or:

Recognizes: needs to check status
Chooses tool: send_email ✗ (WRONG)
Tries to call: send_email(message="What's your status?")
Error: sent a useless email instead of checking the status
Result: confusion, an angry customer

Reality: your agent chooses the WRONG TOOL frequently.

Why:

A generic LLM isn't trained for tool calling (it was trained on next-token prediction, not on "choose the right tool")
A generic LLM doesn't understand your specific tools (it lacks the context)
A generic LLM hallucinates (invents tools that don't exist)
A generic LLM picks at random (when it isn't sure)

Result:

Tool-calling accuracy (% of the time the agent chooses the RIGHT tool):

Generic LLM (no training): 60-75% accuracy

25-40% of requests: the agent chooses the WRONG tool
Each wrong tool choice: the customer sees an error, opens a support ticket, leaves a negative rating

Example over 1,000 requests:

600-750 successes (agent got it right)
250-400 failures (agent got it wrong)
Each failure: 1-2 minutes of support time
Total: 250-800 minutes of support = R$ 2K-8K in support cost (from tool-calling failures alone)
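
To know where your own agent sits in that range, you need a labelled sample of requests. The sketch below assumes you already have pairs of (expected tool, tool the agent chose); the function name and the sample log are made up for illustration.

```python
from typing import Iterable, Tuple

def tool_calling_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (expected_tool, chosen_tool) pairs where the agent was right."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one labelled request")
    correct = sum(1 for expected, chosen in pairs if expected == chosen)
    return correct / len(pairs)

# Illustrative log: three correct choices, one wrong one.
labelled_log = [
    ("get_order_by_id", "get_order_by_id"),
    ("get_order_by_id", "process_refund"),
    ("process_refund", "process_refund"),
    ("send_email", "send_email"),
]
print(tool_calling_accuracy(labelled_log))  # 0.75
```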

You don't notice the problem (the agent works "ok", and you assume occasional failures are normal).

Then comes the news:

"Amazon SageMaker AI: Improve agent's tool-calling accuracy with SFT and DPO."

"SFT (Supervised Fine-Tuning) + DPO (Direct Preference Optimization) = trainable tool-calling accuracy."

"Result: Tool-calling accuracy 85-95% (vs 60-75% generic LLM)."

You think:

"Wait, tool-calling accuracy is trainable?

Can my agent (60-75%) be trained up to 85-95%?

Competitors who fine-tune their agents will have:

85-95% tool-calling accuracy (vs my 60-75%)
Fewer failures (fewer support tickets)
Better customer experience (the agent works better)
Higher adoption (customers trust the agent more)

Will my agent (untrained, 60-75%) be outdated (unreliable, lots of failures, high support cost)?"

Yes. Yes. Yes.

Amazon just signaled: tool-calling accuracy is trainable (it is a training problem, not an inherent limitation).

Your untrained agent is now a capability liability (it can be improved, but you haven't invested in it).

THE PROBLEM: WRONG TOOL-CALLING DESTROYS USER EXPERIENCE

Problem 1: The customer watches the agent fail (broken trust)

SCENARIO: A customer support agent chooses the wrong tool

Customer: "I want to cancel my subscription"

Agent:

Recognizes: needs to process a cancellation
Chooses tool: WRONG (picks get_subscription_info instead of cancel_subscription)
Tries to execute: returns info, doesn't cancel
Agent gets confused: "I found your subscription... uh... wait" (hallucinates)
Customer sees: the agent can't do a simple thing
Customer thinks: "This agent is garbage. I don't trust it."

Result:

Customer stops using the agent
Customer calls human support (cost: R$ 50-100 per ticket)
Customer leaves a negative rating ("The bot doesn't work")
Agent adoption drops

IMPACT AT SCALE:

If you have 10,000 customers making 100 requests/month each: 1,000,000 requests/month

If tool-calling accuracy is 65%: 350,000 wrong tool calls/month

If 10% of those failures turn into a support ticket: 35,000 support tickets/month. At 5 minutes of human support per ticket (R$ 2-5 each), that is R$ 70K-175K/month in support cost (from tool-calling failures alone)

If tool-calling accuracy were 90% (trained): 100,000 wrong tool calls/month, so 10,000 support tickets/month at the same 10% rate, or R$ 20K-50K/month in support cost

Savings: R$ 50K-125K/month (by improving tool-calling accuracy from 65% to 90%)
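
The arithmetic is easy to rerun with your own numbers. The sketch below is a back-of-the-envelope model, assuming the same 10% ticket rate and R$ 2-5 per ticket at both accuracy levels; `monthly_failure_cost` is a name made up for this example, and integer percentages are used to keep the results exact.

```python
def monthly_failure_cost(requests: int, accuracy_pct: int,
                         ticket_rate_pct: int, cost_per_ticket: int) -> dict:
    """Estimate wrong tool calls, tickets, and ticket cost (R$) for one month."""
    wrong_calls = requests * (100 - accuracy_pct) // 100
    tickets = wrong_calls * ticket_rate_pct // 100
    return {
        "wrong_calls": wrong_calls,
        "tickets": tickets,
        "cost": tickets * cost_per_ticket,
    }

REQUESTS = 10_000 * 100  # 10,000 customers x 100 requests/month

for accuracy in (65, 90):
    low = monthly_failure_cost(REQUESTS, accuracy, 10, 2)
    high = monthly_failure_cost(REQUESTS, accuracy, 10, 5)
    print(accuracy, low["wrong_calls"], low["tickets"], low["cost"], high["cost"])
# 65 350000 35000 70000 175000
# 90 100000 10000 20000 50000
```

Under these assumptions, going from 65% to 90% cuts tickets from 35,000 to 10,000 per month. The ticket rate and cost per ticket are the inputs to challenge first, since the result scales linearly with both.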

Problem 2: Multi-step tasks fail (the chain breaks)

MANY TASKS ARE MULTI-STEP:

Exemplo: "Can you process my refund and send me confirmation email?"

Required steps:

Verify customer identity (use: verify_customer_identity tool)
Verify refund eligibility (use: check_refund_eligibility tool)
Process refund (use: process_refund tool)
Send confirmation email (use: send_email tool)

If the agent chooses the wrong tool at ANY step:

Step 1: Choose wrong tool → stops, can't verify
Step 2: Choose wrong tool → stops, can't check eligibility
Step 3: Choose wrong tool → stops, can't refund
Step 4: Choose wrong tool → refund processed, but no email sent

Accuracy requirement:

All 4 steps MUST be correct
If each step has 85% accuracy: 0.85^4 = 52% chance whole chain succeeds
If each step has 90% accuracy: 0.90^4 = 66% chance
If each step has 95% accuracy: 0.95^4 = 81% chance
If each step has 99% accuracy: 0.99^4 = 96% chance

Conclusion: Multi-step tasks DEMAND high tool-calling accuracy (each step must be correct)
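
The compounding effect can be checked directly. This sketch assumes each step succeeds independently with the same accuracy, which is a simplification (real steps are often correlated); both function names are invented for the example.

```python
def chain_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in the chain picks the right tool."""
    return per_step_accuracy ** steps

def required_step_accuracy(target_chain_rate: float, steps: int) -> float:
    """Per-step accuracy needed to hit a target success rate for the chain."""
    return target_chain_rate ** (1 / steps)

for acc in (0.85, 0.90, 0.95, 0.99):
    print(f"{acc:.0%} per step -> {chain_success_rate(acc, 4):.0%} for 4 steps")
# 85% per step -> 52% for 4 steps
# 90% per step -> 66% for 4 steps
# 95% per step -> 81% for 4 steps
# 99% per step -> 96% for 4 steps
```

Running the inverse, `required_step_accuracy(0.90, 4)` comes out at roughly 0.974: to finish 90% of 4-step tasks, each individual tool choice has to be right about 97% of the time.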

Your agent (60-75% accuracy per step): a 4-step task succeeds only about 13-32% of the time (0.60^4 to 0.75^4)
Trained agent (90-95% accuracy per step): multi-step tasks succeed most of the time

Problem 3: Support costs explode (broken tool calling = support tickets)

WRONG TOOL-CALLING → SUPPORT COSTS:

Example:

Customer requests refund
Agent chooses process_payment instead of process_refund
Customer's account CHARGED instead of refunded
Customer notices: "Wait, I was charged? I asked for refund!"
Customer calls support: "Your AI agent charged me"
Support human needs to: Investigate, reverse charge, apologize
Time: 15-30 minutes per case
Cost: R$ 100-300 per case
Customer satisfaction: -5/5 (pissed)

If this happens 100x/month: = R$ 10K-30K/month support cost = Massive customer churn = Reputation damage

All preventable by training the agent for high tool-calling accuracy

HOW TO TRAIN TOOL-CALLING ACCURACY (SFT + DPO)

What is SFT (Supervised Fine-Tuning)?

SFT = Supervised Fine-Tuning

Idea: Show agent examples of CORRECT tool calls, let it learn

PROCESS:

Collect training data
- Log 1,000 requests where agent was CORRECT
- For each: Store (request, correct_tool_call, result)
Example:
- Request: "I want to cancel my subscription"
- Correct tool: cancel_subscription
- Result: "Subscription cancelled successfully"
Fine-tune agent on this data
- Agent learns: "When customer says 'cancel', use cancel_subscription tool"
- Agent internalizes: request → tool mapping
Test on new requests
- Give agent new requests (not in training data)
- Agent now chooses correct tools more often
- Accuracy improves: 60% → 75-80%
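
In practice, the logged triples have to be turned into whatever record format your fine-tuning job expects. The sketch below assumes a simple prompt/completion layout written as JSON Lines, with the tool call serialized as JSON in the completion; that layout, and the names `to_sft_record` and `to_jsonl`, are illustrative assumptions, so check the format your training container actually requires.

```python
import json
from typing import Any, Dict, List

def to_sft_record(request: str, tool: str, params: Dict[str, Any]) -> Dict[str, str]:
    """Turn one logged correct call into a prompt/completion training pair."""
    completion = json.dumps({"tool": tool, "params": params}, sort_keys=True)
    return {"prompt": request, "completion": completion}

def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(record) for record in records)

# One logged request where the agent chose correctly (illustrative data).
logged = [
    ("I want to cancel my subscription", "cancel_subscription", {"customer_id": "123"}),
]
records = [to_sft_record(*row) for row in logged]
print(to_jsonl(records))
```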

COST:

Collecting training data: 10-20 hours (log correct calls)
Fine-tuning: R$ 100-500 (compute cost on SageMaker)
Testing: 5-10 hours
Total: R$ 500-1K cost, 30-40 hours effort

Benefit:

Accuracy improvement: 60% → 75-80% (15-20 percentage points)
Support cost reduction: R$ 20K-50K/month (from fewer failures)
Payback: 1-3 weeks

What is DPO (Direct Preference Optimization)?

DPO = Direct Preference Optimization

Idea: Show agent CORRECT vs INCORRECT tool calls, let it learn preferences

PROCESS:

Collect preference data
- For each request, get 2+ tool call options:
  - Option A: CORRECT tool
  - Option B: INCORRECT tool
Example:
- Request: "What's my order status?"
- Option A (CORRECT): get_order_status tool ✓
- Option B (INCORRECT): send_email tool ✗
Train agent on preferences
- Agent learns: "Prefer get_order_status over send_email for status queries"
- Agent learns comparative preferences (not just absolute right/wrong)
Test on new requests
- Accuracy improves: 75% → 85-95% (even better than SFT alone)

Why DPO can improve on SFT alone:

SFT learns: "This is correct"
DPO learns: "This is better than that"
DPO trains on comparisons rather than absolute labels, which is closer to how people judge options
DPO tends to be more robust (it can generalize better)
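
The "this is better than that" signal corresponds to the standard DPO objective, which rewards the model for raising the likelihood of the preferred response relative to a frozen reference model more than it raises the rejected one. The sketch below computes that loss for a single pair from plain log-probabilities. It is a teaching example, not a training loop: the log-probability values are made up, and `beta=0.1` is a commonly used default rather than a value taken from the article.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written to avoid overflow for large |margin|
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: no preference learned yet.
untrained = dpo_loss(-4.0, -4.0, -4.0, -4.0)
# Policy now favors the correct tool call and disfavors the wrong one.
improved = dpo_loss(-2.0, -6.0, -4.0, -4.0)
print(untrained > improved)  # True
```

When the policy matches the reference, the loss is log 2. It drops as the policy separates the preferred tool call from the rejected one.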

COST:

Collecting preference data: 20-30 hours (label correct vs incorrect)
DPO training: R$ 200-1K (compute cost)
Testing: 5-10 hours
Total: R$ 1K-2K cost, 40-50 hours effort

Benefit:

Accuracy improvement: 75% → 85-95% (a further 10-20 percentage points)
Support cost reduction: Additional R$ 20K-50K/month
Payback: 1-2 weeks

SFT + DPO Combined (Best approach)

COMBINED APPROACH:

Start with SFT (quick win)
- Collect 500-1000 correct examples
- Fine-tune agent
- Accuracy: 60% → 75-80%
- Cost: R$ 500-1K
- Time: 2-3 weeks
Then add DPO (further improvement)
- Collect preference pairs (correct vs incorrect)
- Train DPO
- Accuracy: 75-80% → 88-95%
- Cost: R$ 1K-2K
- Time: 2-3 weeks more
Monitor and iterate
- Test on real requests
- Log failures
- Improve training data based on failures
- Rinse and repeat

Total investment:

Cost: R$ 2K-3K
Time: 4-6 weeks
Team: 1-2 engineers + 1 ML expert

Total benefit:

Accuracy improvement: 60% → 90%+ (30+ percentage points)
Support cost reduction: R$ 100K-200K/month (across all multi-step tasks)
Payback: Less than 1 week

HOW TO IMPLEMENT ON AMAZON SAGEMAKER

Step 1: Prepare training data (1-2 weeks)

COLLECT CORRECT EXAMPLES (FOR SFT):

Log all agent requests
- Store: customer request, agent's tool choice, result
Filter for CORRECT cases
- Only keep: Where agent chose right tool and customer was happy
Format training data

[
  {
    "request": "I want to cancel my subscription",
    "correct_tool": "cancel_subscription",
    "tool_params": {"customer_id": "123"},
    "result": "success"
  },
  {
    "request": "What's my order status?",
    "correct_tool": "get_order_status",
    "tool_params": {"order_id": "456"},
    "result": "success"
  }
]
Collect 500-1000 examples
- Min: 100 examples per tool (if you have 5 tools, 500 total)
- Better: 200-500 examples per tool (1000-2500 total)
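
Before launching a fine-tuning job, it helps to check whether every tool actually meets the per-tool minimum. A minimal sketch, assuming examples are dicts with the `correct_tool` field from the format above; `tools_below_minimum` is a name invented for this example.

```python
from collections import Counter
from typing import Dict, Iterable, List

def tools_below_minimum(examples: Iterable[Dict], tools: List[str],
                        minimum: int = 100) -> Dict[str, int]:
    """Return each tool that has fewer than `minimum` examples, with its count."""
    counts = Counter(example["correct_tool"] for example in examples)
    return {tool: counts[tool] for tool in tools if counts[tool] < minimum}

# Illustrative dataset: one tool is well covered, one is thin, one is missing.
examples = (
    [{"correct_tool": "cancel_subscription"}] * 120
    + [{"correct_tool": "get_order_status"}] * 40
)
tools = ["cancel_subscription", "get_order_status", "send_email"]
print(tools_below_minimum(examples, tools))
# {'get_order_status': 40, 'send_email': 0}
```

Tools that never appear in the logs are the ones to watch: the model has no examples to learn from for them, so they may need hand-written requests.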

COLLECT PREFERENCE PAIRS (FOR DPO):

For each request, generate 2+ tool options
- Option A: CORRECT tool
- Option B: INCORRECT tool (random, or sampled)
Format preference data

[
  {
    "request": "Cancel my subscription",
    "preferred_tool": "cancel_subscription",
    "dispreferred_tool": "send_email"
  },
  {
    "request": "Check order status",
    "preferred_tool": "get_order_status",
    "dispreferred_tool": "process_refund"
  }
]
Collect 500-1000 preference pairs
- Min: 100 per tool
- Better: 200+ per tool
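
One low-effort way to get started on preference pairs is to reuse the SFT examples and sample a wrong tool for each. The sketch below does that with a seeded random choice; `build_preference_pairs` is a hypothetical helper, and randomly sampled negatives are an assumption for illustration. Pairs built from tools the agent really confuses are likely more informative.

```python
import random
from typing import Dict, List

def build_preference_pairs(examples: List[Dict[str, str]], tools: List[str],
                           seed: int = 0) -> List[Dict[str, str]]:
    """Pair each correct example with a randomly sampled incorrect tool."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    pairs = []
    for example in examples:
        wrong_tools = [t for t in tools if t != example["correct_tool"]]
        if not wrong_tools:
            continue  # nothing to contrast against
        pairs.append({
            "request": example["request"],
            "preferred_tool": example["correct_tool"],
            "dispreferred_tool": rng.choice(wrong_tools),
        })
    return pairs

sft_examples = [
    {"request": "Cancel my subscription", "correct_tool": "cancel_subscription"},
    {"request": "Check order status", "correct_tool": "get_order_status"},
]
all_tools = ["cancel_subscription", "get_order_status", "send_email", "process_refund"]
for pair in build_preference_pairs(sft_examples, all_tools):
    print(pair)
```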

Step 2: Fine-tune on SageMaker (1 week)

USING AMAZON SAGEMAKER AI:

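Upload training data to S3 (bash):

    aws s3 cp training_data.json s3://my-bucket/sft-data/

Create SageMaker fine-tuning job (python):

    import sagemaker
    from sagemaker.estimator import Estimator
    from sagemaker.serializers import JSONSerializer
    from sagemaker.deserializers import JSONDeserializer

    # Replace ACCOUNT with your AWS account ID
    role = "arn:aws:iam::ACCOUNT:role/SageMakerRole"
    sm = sagemaker.Session()

    # IMAGE_URI is a placeholder for your training container image;
    # the hyperparameter names depend on what that container accepts
    estimator = Estimator(
        image_uri="IMAGE_URI",
        role=role,
        instance_count=1,
        instance_type="ml.g4dn.xlarge",
        sagemaker_session=sm,
        hyperparameters={
            "learning_rate": 1e-5,
            "num_epochs": 3,
            "batch_size": 8,
            "model_id": "your-base-model",
        },
    )

    # The S3 URI must be passed as a string
    estimator.fit("s3://my-bucket/sft-data/")

Deploy fine-tuned model (python):

    # JSON serializer/deserializer so predict() can take and return dicts
    predictor = estimator.deploy(
        initial_instance_count=1,
        instance_type="ml.g4dn.xlarge",
        serializer=JSONSerializer(),
        deserializer=JSONDeserializer(),
    )

Test on real requests (python):

    response = predictor.predict({"input": "I want to cancel my subscription"})

Expected output: {"chosen_tool": "cancel_subscription"}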
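
Even a fine-tuned model can return malformed output or a tool name that does not exist, so it is worth validating the response before acting on it. A minimal sketch, assuming the endpoint returns a JSON string shaped like the expected output above; `parse_chosen_tool` and the `KNOWN_TOOLS` set are illustrative names.

```python
import json
from typing import Optional, Set

KNOWN_TOOLS: Set[str] = {"cancel_subscription", "get_order_status", "send_email"}

def parse_chosen_tool(raw: str, known_tools: Set[str]) -> Optional[str]:
    """Return the chosen tool name, or None if the output is unusable."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(payload, dict):
        return None
    tool = payload.get("chosen_tool")
    # Reject missing, non-string, or hallucinated tool names.
    if isinstance(tool, str) and tool in known_tools:
        return tool
    return None

print(parse_chosen_tool('{"chosen_tool": "cancel_subscription"}', KNOWN_TOOLS))
# cancel_subscription
print(parse_chosen_tool('{"chosen_tool": "delete_everything"}', KNOWN_TOOLS))
# None
```

A `None` result is a natural point to fall back to a clarifying question or a human handoff instead of executing anything.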

Cost:

Training: R$ 50-200 (depending on instance size, time)
Hosting: R$ 500-1K/month (inference costs)
Total: R$ 550-1.2K for the first month (training plus hosting), R$ 500-1K/month ongoing

Step 3: Monitor and iterate (ongoing)

MONITOR TOOL-CALLING ACCURACY:

Track metrics
- Tool-calling accuracy: % correct tools chosen
- Failure rate: % of tasks failed due to wrong tool
- Customer satisfaction: Rating of agent responses
Log failures
- Store: Request, wrong tool chosen, correct tool should have been
Iterate
- Add failure cases to training data (as negative examples for DPO)
- Re-run SFT + DPO every month
- Gradually improve accuracy: 80% → 85% → 90% → 95%+
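
The failure log can serve two purposes: showing which tools get confused with which, and producing new DPO pairs. The sketch below assumes each log entry is a dict with `request`, `expected_tool`, and `chosen_tool` fields; those field names and both helper functions are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

def confusion_counts(log: List[Dict[str, str]]) -> Counter:
    """Count (expected, chosen) combinations where the agent was wrong."""
    return Counter(
        (entry["expected_tool"], entry["chosen_tool"])
        for entry in log
        if entry["expected_tool"] != entry["chosen_tool"]
    )

def failures_to_pairs(log: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Turn each logged failure into a preference pair for the next DPO run."""
    return [
        {
            "request": entry["request"],
            "preferred_tool": entry["expected_tool"],
            "dispreferred_tool": entry["chosen_tool"],
        }
        for entry in log
        if entry["expected_tool"] != entry["chosen_tool"]
    ]

log = [
    {"request": "Cancel my plan", "expected_tool": "cancel_subscription",
     "chosen_tool": "get_subscription_info"},
    {"request": "Where is my order?", "expected_tool": "get_order_status",
     "chosen_tool": "get_order_status"},
    {"request": "Stop my subscription", "expected_tool": "cancel_subscription",
     "chosen_tool": "get_subscription_info"},
]
print(confusion_counts(log).most_common(1))
# [(('cancel_subscription', 'get_subscription_info'), 2)]
print(len(failures_to_pairs(log)))  # 2
```

The most frequent confusions are a reasonable place to focus the next round of data collection.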

Expected progression:

Week 1: Deploy SFT-trained model → 75-80% accuracy
Week 2: Add DPO → 85-90% accuracy
Week 3-4: Iterate on failures → 90-93% accuracy
Month 2: Continue iteration → 93-96% accuracy
Month 3+: Near-perfect (96%+ accuracy on your domain)

CONCLUSION: YOUR AI AGENT NEEDS TOOL-CALLING TRAINING (URGENT)

What you need to know:

Amazon's signal: tool-calling accuracy is trainable (not an inherent limitation)
- Amazon (huge ML resources) invested in SFT + DPO for agents
- Implication: Tool-calling is a learned skill (can be improved)
- Competitors will train their agents (and beat you on reliability)
- You need to train to stay competitive
Your untrained agent is failing (60-75% accuracy)
- 25-40% of tool calls are WRONG
- Each wrong call = customer confusion, support ticket, adoption penalty
- Support costs explode (R$ 50K-200K/month from tool failures)
- Customer experience degrades
Training is doable (SFT + DPO, 4-6 weeks)
- Phase 1: Collect training data (1-2 weeks)
- Phase 2: Fine-tune with SFT (1 week) → 75-80% accuracy
- Phase 3: Train with DPO (1 week) → 85-95% accuracy
- Phase 4: Iterate on failures (ongoing) → 95%+ accuracy
The investment is small (R$ 2-3K + 1-2 engineers)
- SFT + DPO training cost: R$ 2-3K
- Infrastructure: R$ 500-1K/month (SageMaker hosting)
- Effort: 1-2 engineers, 4-6 weeks
- Payback: Less than 1 week (R$ 100K-200K/month support savings)
Urgency: Start NOW (before competitors)
- Competitors who train their agents will have 90%+ accuracy
- Your untrained agent (60-75%) will fail in comparison
- Customers will switch to more reliable agents (competitors)
- If you delay, you lose market share

At OpenClaw, we help SaaS companies train their AI agents for high tool-calling accuracy:

AUDIT the current agent (tool-calling accuracy analysis, failure logging)
COLLECT training data (correct examples for SFT, preference pairs for DPO)
IMPLEMENT SFT (supervised fine-tuning on SageMaker, accuracy 75-80%)
IMPLEMENT DPO (direct preference optimization, accuracy 85-95%)
DEPLOY trained model (host on SageMaker, monitor accuracy)
ITERATE on failures (log failures, improve training data, re-train monthly)
SCALE training (add more domains, more tools, more complex tasks)

Result: Your AI agent goes from "untrained, unreliable, 60-75% accuracy" to "trained, reliable, 90-95%+ accuracy".

Does your AI agent frequently choose the wrong tools?

Is its tool-calling accuracy 60-75% (below what's acceptable)?

Are you spending R$ 50K-200K/month on support costs due to tool failures?

Do you have an agent trained for high tool-calling accuracy (SFT + DPO)?

If not: your agent is a capability liability. Untrained means unreliable, unreliable means customers don't trust it, and adoption breaks. The fix exists, but you haven't invested in it, while competitors train their agents and beat you. Train your agent now, before a competitor launches a reliable trained agent and takes market share, before you lose customers to more dependable competitors, and before tool-calling failure costs compound against you.

What are you going to do?

Train your AI agent for high tool-calling accuracy (SFT + DPO fine-tuning, 90-95%+ accuracy, R$ 100K-200K/month support savings) →
