Seu agente IA é text-only-incomplete (Qwen multimodal agents viáveis)
Qwen3.7-Plus: visual perception + GUI operation + coding (single loop). Seu agente: text-only (não vê, não clica). Multimodal virou commodity.
Equipe OpenClaw · Time de Engenharia & Produto
A Equipe OpenClaw é formada por engenheiros, designers e especialistas em IA dedicados a construir a melhor plataforma de agentes conversacionais para negócios brasileiros. Combinamos expertise…
Seu agente IA é text-only-incomplete (Qwen multimodal agents viáveis)
Você é founder/CEO de SaaS.
Seu SaaS: agente IA (automação, vendas, suporte, atendimento).
Seu agente funciona:
- Customer envia mensagem (text input)
- Agente processa texto (LLM processes text)
- Agente retorna resposta (text output)
- Customer lê resposta (text only)
Sua postura sobre multimodal:
- Vision capability: None (agente não vê imagens)
- GUI automation: None (agente não consegue clicar, arrastar, digitar em sistemas)
- Code execution: None (agente não consegue rodar código)
- Screen interaction: None (agente não consegue ver/navegar telas)
- File handling: None (agente não consegue ler documentos, PDFs, planilhas)
- Assumption: "Text is sufficient (customers only need chat)"
Você pensa:
- "Vision é complicated (requires computer vision, hard to implement)"
- "GUI automation é risky (pode quebrar sistemas)"
- "Code execution é dangerous (security risk)"
- "Multimodal é future (not now)"
- "Text agente é OK (good enough)"
Ai vem notícia:
Qwen3.7-Plus: Alibaba's multimodal agent combines visual perception, GUI operation, and coding in single agent loop.
Reality: Multimodal agents that can SEE screenshots, CLICK buttons, and EXECUTE code are NOW viable (proven in demos).
Message: Multimodal capability is no longer technically impossible—it's now a competitive requirement.
Implication: Your text-only agente = incomplete (customers expect visual + action agents).
O problema (seu agente é text-only-incomplete)
Qwen3.7-Plus proves multimodal agents are viable now
What the Qwen announcement signals:
Before (2024-2025):
Multimodal agents: Theoretical, complex, proprietary GUI automation: Hard (requires computer vision + API integration) Code execution: Dangerous (security risk, not worth it) Market perception: "Multimodal agents are future tech"
After (2026, now - Qwen3.7-Plus):
Multimodal agents: Production-ready, working demos, open competition GUI automation: Proven (Qwen demo: app development with 10K lines of code) Code execution: Proven (embedded in single agent loop, works at scale) Market perception: "Multimodal agents are NOW available"
What this means:
- Vision capability barrier is BROKEN (Qwen proves it works)
- GUI automation is PROVEN (not theoretical anymore)
- Code execution is SAFE (Qwen handles it safely)
- Market expectation SHIFTS (multimodal becomes expected, not optional)
- Your text-only agente = INCOMPLETE (customers expect visual + action)
Multimodal agents are coming (and they're better)
What Qwen3.7-Plus agent can do:
Demo: "Build a vocabulary learning app"
Qwen agent:
- SEES requirement (reads text prompt)
- PLANS approach (decides architecture, components)
- CODES solution (generates 10K+ lines of code)
- EXECUTES code (runs locally, tests functionality)
- ITERATES (sees errors, fixes them)
- DELIVERS (working app, not just description)
Timeline: 11 hours, 1,000 agent calls Result: Fully functional vocabulary learning app
Comparison to text-only agent:
- Text-only: "Here's the code (copy-paste it yourself)"
- Multimodal: "Here's the working app (already running)"
Why multimodal > text-only:
Text-only workflow:
- Customer: "I want to automate data entry"
- Agente: "Copy this formula into Excel" (description)
- Customer: "Where do I paste it?" (confusion)
- Customer: "It doesn't work" (error)
- Customer: "Never mind, do it manually" (gives up)
Multimodal workflow:
- Customer: "I want to automate data entry"
- Agente: SEES customer's Excel sheet
- Agente: CLICKS correct cells (automates input)
- Agente: EXECUTES formula (works immediately)
- Customer: "Done! Why is this so easy?" (satisfied)
Difference: Multimodal agente DOES the work, text-only only DESCRIBES
Multimodal is the market demand (not nice-to-have)
Customer expectation evolution:
2024: "Your agente supports chat? Nice!" ↓ 2025: "Your agente can help me with tasks? Good." ↓ 2026 (now): "Your agente can SEE my screen and AUTOMATE actions? Amazing!" ↓ 2026+: "Your agente can HANDLE COMPLETE WORKFLOWS (plan → execute → deliver)?"
Trend: From "chat" → "assistance" → "automation" → "autonomous execution"
Enterprise automation use cases (where multimodal matters):
-
Data entry automation
- Agente sees spreadsheet → clicks cells → enters data
- Text-only: "Here's the formula" (customer has to implement)
- Multimodal: "Done" (agente did it)
-
Document processing
- Agente sees PDF → extracts data → populates form
- Text-only: "Here's the data (copy-paste yourself)"
- Multimodal: "Form filled (ready to submit)"
-
System integration
- Agente sees ERP screen → navigates → updates records
- Text-only: "Here are the steps (follow them)"
- Multimodal: "Updated (no manual steps needed)"
-
Web automation
- Agente sees website → fills forms → submits
- Text-only: "Here's the URL (click yourself)"
- Multimodal: "Submitted (done automatically)"
-
Customer support escalation
- Agente sees customer screen (screenshot) → understands issue
- Text-only: "Customer is complaining (read text description)"
- Multimodal: "I see the error (visual context, faster resolution)"
Common thread: AUTOMATION (not just description) Benefit: Time savings (agent does work, not customer) Result: Higher ROI (business value delivered, not just information)
Your text-only agente is missing the action layer
Current state (your text-only agente):
Agent capabilities:
- Perception: Reads text (can't see images, screenshots, PDFs)
- Understanding: Processes text semantically
- Generation: Outputs text response
- Action: NONE (can't click, type, execute, change systems)
Result: Agent is PURE CHATBOT
- Answers questions (good)
- Suggests solutions (good)
- Actually implements (BAD - can't do it)
Value delivered: Information (agent tells you what to do) Value NOT delivered: Automation (agent doesn't DO anything)
Future state (multimodal agente like Qwen):
Agent capabilities:
- Perception: Sees images, screenshots, documents, code
- Understanding: Analyzes visual + text context
- Planning: Decides action sequence (steps to solve problem)
- Execution: Clicks buttons, types text, runs code, updates systems
- Iteration: Sees results, adjusts approach if needed
Result: Agent is AUTONOMOUS WORKER
- Understands problem (visual context)
- Plans solution (reasoning)
- Executes workflow (action)
- Handles errors (iteration)
- Delivers results (completion)
Value delivered: Automation (agent DOES the work) Value NOT delivered: Text description (no need for manual steps)
Gap widening:
In 6-12 months:
- Competitor introduces multimodal agente
- Customers see: "Competitor's agente automated my data entry in 2 hours"
- Customers see: "Your agente told me how to do it (still 8 hours of work)"
- Customers choose: Competitor (multimodal, time-saving)
- You lose: Deal, revenue, customer
The multimodal crisis (why this matters now)
Enterprise customers are asking: can your agente automate GUI tasks?
Enterprise procurement shift:
Old question (2024): "Can your agente answer questions?" New question (2026): "Can your agente automate our workflows (screens, clicks, actions)?"
Shift reason: Qwen demo proves multimodal is viable Result: Customers now expect automation, not just information
Decision driver change:
Before (2025): "Agente information quality matters most" Now (2026): "Agente automation capability matters most"
Example:
- Agente A: Excellent text responses, no automation
- Agente B: Good text responses, full GUI automation
Customer choice: Agente B ("Automates our work") Reason: Automation ROI > information quality
Qwen demo is proof (multimodal works at scale)
What the demo proved:
Scenario: "Build a vocabulary learning app"
Qwen agent results:
- 10,000+ lines of code (substantial, production-quality)
- 1,000 agent calls (iterative refinement, handling errors)
- 11 hours (reasonable timeline for app development)
- Autonomous (no human intervention needed)
- Multimodal: (visual understanding + code execution + system interaction)
Implication: If Qwen can autonomous-code 10K lines, it can DEFINITELY automate GUI tasks (much simpler)
Competitors will integrate multimodal (become default)
First-mover advantage in multimodal:
Competitor A (multimodal-enabled):
- Ships vision + GUI automation in Q3 2026
- Becomes known as "automation-first agente"
- Gains market reputation ("automates customer workflows")
- Locks in customers ("Already automated our processes")
Competitor B (you, text-only):
- Waiting to see if multimodal is necessary
- Ships multimodal in Q4 2026 (late)
- Perceived as "follower" ("Copy competitor A")
- Loses deals ("Competitor A automates, yours doesn't")
Result: First-mover advantage in multimodal = market leadership You (late mover) = perceived as inferior
Your roadmap (4 steps to multimodal agente)
Step 1: Understand multimodal requirements (what customers need)
Phase 1: Interview customers (Week 1)
Ask customers:
- "What % of your work is repetitive data entry/clicks?"
- "Would automation of those tasks save you time?"
- "How much time per week does repetitive work take?"
- "Would you pay more for agente that automates (vs. just tells you)?"
- "What systems do you want agente to automate? (Excel, CRM, website, etc.)"
Result: You quantify automation demand Expected: "70% of work is repetitive", "Automation would save 10+ hours/week"
Phase 2: Define automation scope (Week 1-2)
Automation capabilities (MVP scope):
-
Vision (can agente see?)
- Screenshot capture (see what customer sees)
- OCR (read text from screenshots)
- Element recognition (identify buttons, fields, links)
-
GUI interaction (can agente click/type?)
- Click buttons/links (interact with UI)
- Type text (fill forms, search fields)
- Scroll/navigate (move through screens)
- Submit forms (complete actions)
-
Code execution (can agente run code?)
- Execute Python/JS scripts (automate logic)
- Integrate with APIs (connect systems)
- Handle errors (retry, fallback)
-
Workflow automation (can agente chain actions?)
- See problem → Plan solution → Execute steps → Verify result
- Handle multi-step processes (login → navigate → fill → submit)
- Iterate on errors (detect failure, retry with adjustment)
MVP scope: Vision + GUI interaction (no code execution yet) Target: Data entry, form filling, system navigation
Step 2: Integrate vision capability (add image understanding)
Phase 1: Choose vision model (Week 2)
Vision options:
-
GPT-4o (OpenAI)
- Excellent image understanding
- Cost: $10/1M tokens (moderate)
- Integration: Via API (easy)
- Good for: Screenshot analysis, element recognition
-
Claude 3 Vision (Anthropic)
- Strong multimodal, context understanding
- Cost: $3/1M tokens (cheap)
- Integration: Via API (easy)
- Good for: Complex visual reasoning, workflow planning
-
Qwen3.7-Plus (Alibaba)
- Full multimodal (visual + GUI + code)
- Cost: Competitive (exact pricing TBD)
- Integration: Via API (needs Alibaba account)
- Good for: Full automation pipeline
-
LLaVA (Open-source)
- Free, open-source
- Cost: Zero (run locally)
- Integration: Self-hosted (complex)
- Good for: Privacy (data stays on-premise)
Recommendation: Start with Claude 3 Vision (cheap, easy, good) Timeline: 1 week to integrate
Phase 2: Implement screenshot capture (Week 2-3)
Implementation:
-
Capture mechanism:
- User shares screenshot (upload to agente)
- Or agente captures automatically (if access to customer's device)
- Or customer's system sends screenshots on-demand
-
Send to vision model:
- Convert screenshot to image format
- Send to Claude/GPT-4o API
- Request analysis ("What do you see?")
-
Parse vision response:
- Extract elements (buttons, fields, text)
- Understand layout (where is submit button?)
- Identify current state (form is empty, form is filled)
-
Plan next action:
- Based on vision analysis → decide next click/type
- Update customer ("I see the problem, here's the fix")
- Ready for GUI automation (next phase)
Example workflow:
- Customer: "My Excel is broken, can you fix it?"
- Customer: Uploads screenshot of Excel
- Agente: "I see column A has errors, I'll fix it"
- Agente: Ready to click/fill (next phase)
Step 3: Integrate GUI automation (add action capability)
Phase 1: Choose automation tool (Week 3)
GUI automation options:
-
Selenium (web automation)
- Automate website clicks, form filling
- Cost: Free, open-source
- Good for: Web-based apps
-
Playwright (modern web automation)
- Better than Selenium, faster
- Cost: Free, open-source
- Good for: Modern web apps, screenshot capture
-
PyAutoGUI (screen automation)
- Click anywhere on screen, type text
- Cost: Free, open-source
- Good for: Desktop apps, legacy systems
-
UiPath (enterprise automation)
- Professional RPA (Robotic Process Automation)
- Cost: Expensive ($5K+/year)
- Good for: Enterprise workflows, complex processes
Recommendation: Start with Playwright (web apps) + PyAutoGUI (desktop) Timeline: 2-3 weeks to integrate
Phase 2: Implement action execution (Week 3-4)
Implementation:
-
Action planning:
- Vision model analyzes screenshot
- LLM decides next action ("Click button X", "Type 'John'", "Submit")
- Translate to automation commands
-
Action execution:
- Playwright: Click web button, fill form field, submit form
- PyAutoGUI: Move mouse, click screen, type text
- Wait for result (screenshot changes)
-
Verification:
- Capture new screenshot (after action)
- Vision model analyzes: "Did action succeed?"
- If yes: Continue to next action
- If no: Retry or report error
-
Iteration:
- Keep looping until goal achieved
- Handle errors (element not found, timeout)
- Report progress to customer
Example workflow:
- Customer: "Fill my sales form with this data"
- Agente (vision): Sees empty form
- Agente (planning): Decides to fill Name field → Fill Email field → Submit
- Agente (action): Clicks Name field, types "John Doe"
- Agente (verification): Sees Name field is filled
- Agente (action): Clicks Email field, types "john@example.com"
- Agente (verification): Sees Email field is filled
- Agente (action): Clicks Submit button
- Agente (verification): Form submitted, success
- Agente (report): "Form filled and submitted"
Step 4: Market multimodal capability (competitive advantage)
Phase 1: Update messaging (Week 5)
Old messaging: "Agente IA para automação (via chat)" (Implies text-only, limited)
New messaging: "Agente IA que VÊAUTOMATE workflows (vision + GUI automation)" (Emphasizes multimodal, automation)
Or: "Agente IA que não só FALA, mas também CLICA (automação visual de processos)" (Emotional, highlights multimodal action)
Competitive positioning: "Unlike text-only competitors, our agente SEES your screen and AUTOMATES your tasks. You don't just get answers—you get work DONE."
Phase 2: Customer launch campaign (Week 5-6)
Launch to existing customers:
- Email: "Your agente can now automate clicks and form-filling."
- In-app notification: "New: Vision + GUI automation (early access)."
- Demo video: "Watch your agente automate data entry."
- Feature announcement: "Multimodal automation is here."
Result: Existing customers upgrade, see automation benefit Expected: 20-30% of customers enable automation features
Phase 3: Sales differentiation (Week 6+)
Sales pitch update:
Old: "Our agente supports automation" New: "Our agente VISUALLY understands your workflows and AUTOMATES them. No more manual clicking. Your agente does it."
Differentiators:
- Vision: "We see what you see (understand context)"
- Action: "We automate (don't just describe)"
- Iteration: "We handle errors (don't give up on first failure)"
- ROI: "You save 10+ hours/week (quantifiable value)"
Result: Multimodal becomes market differentiator Expected: New deals win on automation capability
Timeline (urgency)
Now (June 2026): Multimodal agents are proven viable
Current state:
- Qwen3.7-Plus released (proves multimodal works at scale)
- Market realizes multimodal is now possible (cost-effective)
- Competitors start planning multimodal integration
- Window for first-mover advantage opening
Q3 2026: Competitors integrate multimodal
Expected:
- Competitor A launches vision + GUI automation
- Becomes known as "automation-first agente"
- Market starts comparing on automation capability
- Customers demand multimodal ("Why don't you automate?")
Q4 2026: Multimodal becomes expected
Expected:
- Multiple competitors offer multimodal
- Text-only agentes perceived as inferior
- Multimodal is now table-stakes
- Late adopters catch up (but reputation damage done)
Conclusão: seu agente é text-only-incomplete (aja agora)
Qwen3.7-Plus proves that multimodal agents (visual + action) are NOW viable.
Message: Multimodal automation is no longer technically impossible—it's now a competitive requirement.
Seu agente (text-only, sem vision/action):
- Vision capability: None (can't see screenshots)
- GUI automation: None (can't click buttons)
- Code execution: None (can't run scripts)
- Automation delivery: None (customer has to do work manually)
- Customer expectation: "Why can't you just automate it?" (frustrated)
- Market positioning: Incomplete ("Text-only agente")
Your exposure:
- Competitors are planning multimodal integration (first-mover advantage)
- Qwen demo proves multimodal is viable (not theoretical)
- Customers demand automation (data entry, form-filling, system integration)
- In 6 months: Multimodal will be expected (not optional)
- Your text-only agente = disqualified ("Doesn't automate")
Your timeline:
This week: Accept that multimodal is now mandatory (not optional)
Next 1-2 weeks: Interview customers (validate automation demand)
Next 2-3 weeks: Integrate vision model (Claude 3 Vision for screenshots)
Next 2-3 weeks: Add GUI automation (Playwright for web, PyAutoGUI for desktop)
Next 1-2 weeks: Test MVP automation (data entry, form-filling workflows)
Next 1-2 weeks: Market multimodal capability (customer launch)
Result: Your agente is multimodal (vision + action, competitive advantage, customer expectation met).
Your alternative:
Ignore multimodal demand (keep text-only agente).
Wait for competitors to add multimodal (they will).
Wait for customers to prefer competitors ("They automate, yours doesn't").
Wait for market to shift ("Multimodal is expected").
You lose deals.
Your agente becomes obsolete.
At OpenClaw, ajudamos SaaS agentes implementar multimodal:
- DESIGN automation workflows (vision → planning → execution)
- INTEGRATE vision models (Claude, GPT-4o, screenshot analysis)
- ADD GUI automation (Playwright, PyAutoGUI, element interaction)
- IMPLEMENT error handling (iteration, retry, fallback)
- LAUNCH multimodal capability (market as automation differentiator)
- OPTIMIZE for ROI (measure time-savings, customer value)
Result: Seu agente é multimodal (customers don't just get answers—they get work DONE, competitors play catch-up, market leadership).
Qwen prova que multimodal é viável?
Seu agente é text-only (sem vision, sem automation)?
Clientes demandam automação (não só chat)?
Você quer agente que customers preferem (que AUTOMATIZA, não só descreve)?
Se não sabe por onde começar:
Publicado em 6 de junho de 2026