AWS Adds Multimodal Evaluators to Strands Evals

Sangmin Woo · May 21, 2026

AWS has announced four multimodal evaluators for Strands Evals that use multimodal large language models as judges to assess the outputs of image-to-text tasks. The evaluators (Overall Quality, Correctness, Faithfulness, and Instruction Following) score model responses directly against the source images, which lets them catch visual hallucinations and image-grounded factual errors that text-only evaluation cannot detect. The release comes as Gartner predicts that 80% of enterprise software will be multimodal by 2030, up from under 10% today.

TL;DR

AWS released four MLLM-as-a-Judge evaluators for image-to-text tasks in the Strands Evals SDK, scoring outputs on Overall Quality, Correctness, Faithfulness, and Instruction Following.
The evaluators send images directly to multimodal judge models alongside queries and responses, and return scores with reasoning for debugging and CI integration.
The framework supports both reference-based and reference-free evaluation modes, with custom rubric support for domain-specific criteria.
Judge model selection on Amazon Bedrock lets operators balance accuracy, cost, and latency for their use case.
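To make the flow above concrete, here is a minimal Python sketch of the judge pattern: an image, a query, and a response go to a judge, which returns a score with reasoning, and an optional reference answer switches between reference-free and reference-based modes. This is not the Strands Evals API. `ImageTextCase`, `JudgeResult`, `build_judge_prompt`, `evaluate`, and `stub_judge` are hypothetical names, and the 0-to-1 score scale is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ImageTextCase:
    """One evaluation case (hypothetical shape, not the SDK's)."""
    image_bytes: bytes
    query: str
    response: str
    reference: Optional[str] = None  # None means reference-free mode


@dataclass
class JudgeResult:
    score: float      # assumed 0..1 scale
    reasoning: str    # the judge's explanation, useful for debugging


# A judge is any callable that takes the image and a text prompt.
Judge = Callable[[bytes, str], JudgeResult]


def build_judge_prompt(case: ImageTextCase, criterion: str) -> str:
    """Assemble the text part of the judge request."""
    lines = [
        f"Criterion: {criterion}",
        f"Query: {case.query}",
        f"Response: {case.response}",
    ]
    # The reference is included only in reference-based mode.
    if case.reference is not None:
        lines.append(f"Reference answer: {case.reference}")
    lines.append(
        "Score the response from 0 to 1 against the attached image "
        "and explain why."
    )
    return "\n".join(lines)


def evaluate(case: ImageTextCase, criterion: str, judge: Judge) -> JudgeResult:
    """Send the image and prompt to the judge and return its verdict."""
    return judge(case.image_bytes, build_judge_prompt(case, criterion))


def stub_judge(image_bytes: bytes, prompt: str) -> JudgeResult:
    """Placeholder standing in for a real multimodal model call."""
    mode = "reference-based" if "Reference answer:" in prompt else "reference-free"
    return JudgeResult(score=1.0, reasoning=f"stub judge, {mode} prompt")
```

In a real setup the stub would be replaced by a call to a multimodal judge model. The point of the sketch is only that the image travels with the prompt rather than being summarized into text first.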

Why It Matters

Text-only evaluators cannot verify whether model outputs are grounded in visual content, creating a critical gap for document understanding, visual search, and chart analysis tasks. Automated multimodal evaluation closes the gap between expensive human review and unreliable text-only proxies, enabling teams to catch hallucinations and factual errors at scale. As enterprise software rapidly shifts toward multimodal capabilities, reliable evaluation infrastructure becomes essential for production deployment.

Business Impact

For operators building visual understanding systems, automated multimodal evaluation reduces the cost and latency of quality assurance while improving confidence in model outputs. Teams can integrate these evaluators directly into CI pipelines to catch regressions early, and choose judge models that fit their accuracy and cost constraints. This is particularly valuable for document processing, invoice extraction, and UI understanding applications where hallucinations carry direct business risk.
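As an illustration of the CI integration described above, the following Python sketch fails a build when the mean score for an evaluator drops below a threshold. It assumes scores have already been collected into a plain dictionary. The evaluator keys, the threshold values, and the `gate` and `main` functions are hypothetical and not part of Strands Evals.

```python
import sys
from statistics import mean

# Hypothetical minimum mean score per evaluator, on an assumed 0..1 scale.
THRESHOLDS = {"correctness": 0.8, "faithfulness": 0.9}


def gate(scores: dict[str, list[float]], thresholds: dict[str, float]) -> list[str]:
    """Return one failure message per evaluator that misses its threshold."""
    failures = []
    for name, minimum in thresholds.items():
        values = scores.get(name, [])
        if not values:
            # Missing scores are treated as a failure, not a silent pass.
            failures.append(f"{name}: no scores recorded")
            continue
        avg = mean(values)
        if avg < minimum:
            failures.append(f"{name}: mean {avg:.2f} below threshold {minimum:.2f}")
    return failures


def main(scores: dict[str, list[float]]) -> int:
    """Print failures to stderr and return a process exit code."""
    failures = gate(scores, THRESHOLDS)
    for line in failures:
        print(line, file=sys.stderr)
    return 1 if failures else 0
```

A CI job could call `sys.exit(main(scores))` after the evaluation run, so that a regression in any gated evaluator blocks the merge.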

Key Implications

Multimodal evaluation is becoming table stakes for image-to-text applications, as teams shift from manual review to automated assessment integrated into development workflows.
Judge model selection on Bedrock introduces a new optimization dimension for teams, requiring tradeoff analysis among model capability, inference cost, and latency.
Custom rubric support enables domain-specific evaluation criteria, allowing teams to encode business logic and compliance requirements into automated assessment.
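The judge-selection tradeoff can be framed as a small constrained choice: among candidate judges that meet an accuracy floor and a latency ceiling, pick the cheapest. The Python sketch below shows one way to express that. `JudgeProfile`, `pick_judge`, and every field name are hypothetical, and any numbers fed into it would have to come from a team's own validation runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JudgeProfile:
    """Measured characteristics of one candidate judge (hypothetical shape)."""
    name: str
    agreement: float      # agreement with human labels on your own set, 0..1
    cost_per_1k: float    # cost per 1,000 evaluations, in your billing currency
    p95_latency_s: float  # 95th-percentile latency in seconds


def pick_judge(
    candidates: list[JudgeProfile],
    min_agreement: float,
    max_latency_s: float,
) -> Optional[JudgeProfile]:
    """Return the cheapest candidate meeting both constraints, or None."""
    eligible = [
        c for c in candidates
        if c.agreement >= min_agreement and c.p95_latency_s <= max_latency_s
    ]
    return min(eligible, key=lambda c: c.cost_per_1k, default=None)
```

Treating accuracy and latency as hard constraints and cost as the objective is one design choice among several. A team for which accuracy dominates could instead maximize agreement under a cost cap.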

What to Watch

Monitor adoption patterns to see which judge models teams select and whether cost or accuracy dominates decision-making. Watch for community-contributed rubrics and domain-specific evaluators that extend the framework beyond the four baseline evaluators. Track whether multimodal evaluation becomes a standard requirement in model development workflows and CI pipelines across AWS customers.


