VFF - The signal in the noise

AWS Adds Multimodal Evaluators to Strands Evals


AWS has announced four multimodal evaluators for Strands Evals that use multimodal large language models as judges to assess the outputs of image-to-text tasks. The four evaluators (Overall Quality, Correctness, Faithfulness, and Instruction Following) score model responses directly against the source image, closing a gap in text-only evaluation, which cannot detect visual hallucinations or factual errors that are only apparent from the image. The need is growing: Gartner predicts that 80% of enterprise software will be multimodal by 2030, up from under 10% in 2024.

  • AWS released four MLLM-as-a-Judge evaluators for image-to-text tasks in the Strands Evals SDK, scoring outputs on Overall Quality, Correctness, Faithfulness, and Instruction Following
  • Evaluators send images directly to multimodal judge models alongside queries and responses, returning scores with reasoning for debugging and CI integration
  • Framework supports both reference-based and reference-free evaluation modes, with custom rubric support for domain-specific criteria
  • Judge model selection on Amazon Bedrock allows operators to balance accuracy, cost, and latency for their use case
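The flow described in the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the Strands Evals API: `ImageToTextCase`, `JudgeResult`, `build_judge_payload`, `evaluate`, and `stub_judge` are all hypothetical names, and the stub stands in for a real call to a multimodal judge model. It shows the image travelling with the query and response, the optional reference switching between reference-based and reference-free modes, and a custom rubric being passed in.

```python
import base64
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ImageToTextCase:
    """One image-to-text sample to be judged (hypothetical structure)."""
    image_bytes: bytes
    query: str
    response: str
    reference: Optional[str] = None  # None means reference-free evaluation


@dataclass
class JudgeResult:
    score: float      # normalized to the range 0.0-1.0 in this sketch
    reasoning: str    # judge's explanation, useful for debugging


def build_judge_payload(case: ImageToTextCase, rubric: str) -> dict:
    """Bundle the image, query, response, and rubric for the judge model."""
    parts = [
        f"Rubric: {rubric}",
        f"Query: {case.query}",
        f"Response: {case.response}",
    ]
    if case.reference is not None:
        parts.append(f"Reference answer: {case.reference}")
    parts.append("Return a score between 0 and 1 and a short reasoning.")
    return {
        "mode": "reference-based" if case.reference is not None else "reference-free",
        "text": "\n".join(parts),
        # The image itself is sent, not a text description of it.
        "image_b64": base64.b64encode(case.image_bytes).decode("ascii"),
    }


def evaluate(
    case: ImageToTextCase,
    rubric: str,
    judge: Callable[[dict], JudgeResult],
) -> JudgeResult:
    """Run one evaluator: build the payload, call the judge, validate the score."""
    result = judge(build_judge_payload(case, rubric))
    if not 0.0 <= result.score <= 1.0:
        raise ValueError(f"judge returned out-of-range score: {result.score}")
    return result


def stub_judge(payload: dict) -> JudgeResult:
    """Placeholder judge; a real one would call a multimodal model."""
    return JudgeResult(score=0.5, reasoning=f"stub verdict ({payload['mode']})")
```

Passing the judge in as a callable keeps the evaluator logic independent of which model sits behind it, which is the property that makes swapping judge models practical.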

Text-only evaluators cannot verify whether model outputs are grounded in visual content, creating a critical gap for document understanding, visual search, and chart analysis tasks. Automated multimodal evaluation closes the gap between expensive human review and unreliable text-only proxies, enabling teams to catch hallucinations and factual errors at scale. As enterprise software rapidly shifts toward multimodal capabilities, reliable evaluation infrastructure becomes essential for production deployment.

For operators building visual understanding systems, automated multimodal evaluation reduces the cost and latency of quality assurance while improving confidence in model outputs. Teams can integrate these evaluators directly into CI pipelines to catch regressions early, and choose judge models that fit their accuracy and cost constraints. This is particularly valuable for document processing, invoice extraction, and UI understanding applications where hallucinations carry direct business risk.
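One way to wire evaluator scores into a CI pipeline is a simple threshold gate. The sketch below is an assumption about how a team might do this, not something the announcement prescribes: `ScoredCase`, `failing_cases`, `ci_exit_code`, and the threshold values are hypothetical, and scores are assumed to be normalized to 0-1.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScoredCase:
    """One evaluator verdict for one test case (hypothetical structure)."""
    case_id: str
    evaluator: str   # e.g. "Correctness" or "Faithfulness"
    score: float     # assumed normalized to 0.0-1.0
    reasoning: str


def failing_cases(
    results: List[ScoredCase],
    thresholds: Dict[str, float],
    default_threshold: float = 0.7,
) -> List[ScoredCase]:
    """Return every verdict whose score is below its evaluator's threshold."""
    return [
        r for r in results
        if r.score < thresholds.get(r.evaluator, default_threshold)
    ]


def ci_exit_code(results: List[ScoredCase], thresholds: Dict[str, float]) -> int:
    """Print failures with the judge's reasoning; return 1 if any case failed."""
    failures = failing_cases(results, thresholds)
    for f in failures:
        print(f"FAIL {f.case_id} [{f.evaluator}] score={f.score:.2f}: {f.reasoning}")
    return 1 if failures else 0
```

A pipeline step could call `sys.exit(ci_exit_code(results, thresholds))` so that a regression in, say, Faithfulness blocks the merge, with the judge's reasoning printed in the build log.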

  • Multimodal evaluation is becoming table stakes for image-to-text applications, shifting from manual review to automated assessment integrated into development workflows
  • Judge model selection on Bedrock introduces a new optimization dimension for teams, requiring tradeoff analysis between model capability, inference cost, and latency
  • Custom rubric support enables domain-specific evaluation criteria, allowing teams to encode business logic and compliance requirements into automated assessment
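The judge-model tradeoff mentioned above can be framed as a constrained choice: take the most accurate judge that fits the cost and latency budget. The sketch below is illustrative only. `JudgeCandidate` and `pick_judge` are hypothetical names, and the fields assume a team has already measured each candidate's agreement with human labels, cost, and latency on its own calibration set.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class JudgeCandidate:
    """Measured profile of one candidate judge model (hypothetical structure)."""
    name: str
    agreement: float          # agreement with human labels on a calibration set, 0-1
    cost_per_1k_evals: float  # in the team's currency of choice
    p95_latency_s: float      # 95th percentile latency per evaluation, in seconds


def pick_judge(
    candidates: Sequence[JudgeCandidate],
    max_cost_per_1k: float,
    max_p95_latency_s: float,
) -> Optional[JudgeCandidate]:
    """Return the highest-agreement judge within budget, or None if none fit.

    Ties on agreement are broken in favor of the cheaper model.
    """
    eligible = [
        c for c in candidates
        if c.cost_per_1k_evals <= max_cost_per_1k
        and c.p95_latency_s <= max_p95_latency_s
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda c: (c.agreement, -c.cost_per_1k_evals))
```

Treating cost and latency as hard constraints and accuracy as the objective is one design choice among several; a team where cost dominates might instead minimize cost subject to a minimum agreement level.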

Monitor adoption patterns to see which judge models teams select and whether cost or accuracy dominates decision-making. Watch for community-contributed rubrics and domain-specific evaluators that extend the framework beyond the four baseline evaluators. Track whether multimodal evaluation becomes a standard requirement in model development workflows and CI pipelines across AWS customers.


Related stories

Google Brings Personalized Image Generation to Free Gemini Users

Google is making personalized AI image generation available to eligible free Gemini users in the U.S. The feature allows the chatbot to create images based on user interests and data from connected Google apps. This expands access to a capability previously limited to paid subscribers.

by Lauren Forristal · TechCrunch AI

Multimodal AI turns aerial imagery into searchable data

AWS and Vexcel, an aerial imagery provider operating across 45+ countries, developed a multimodal AI system that converts billions of aerial images into natural-language-searchable data without requiring per-feature model training. The system uses embedding models, LLM captioning, and vector search to index imagery once and query it with plain English. Amazon Nova Multimodal Embeddings delivered the highest F1 scores in their evaluation, and the work evolved into Vexcel Intelligence, a commercial searchable imagery product.

by Gilbert V Lepadatu · AWS Machine Learning Blog

Google DeepMind's Gemma 4 Now Available on AWS Bedrock

Google DeepMind's Gemma 4 model family is now available on Amazon Bedrock, offering three instruction-tuned variants ranging from 2.3B to 30.7B parameters. The models support reasoning, function calling, and multimodal input while running on AWS infrastructure with data protection guarantees. Organizations can access open-weight models through a managed service without hosting infrastructure themselves.

by Aris Tsakpinis · AWS Machine Learning Blog

PixelRAG bypasses text parsing, cuts RAG costs 10x

Researchers from UC Berkeley, Princeton, EPFL, and Databricks introduced PixelRAG, a retrieval system that bypasses traditional text parsing by rendering web pages as screenshots and indexing them directly for vision-language models. Tested on 30 million Wikipedia screenshot tiles, PixelRAG improved accuracy by up to 18.1% over text-based RAG systems and reduced token costs by 10x. The approach addresses fundamental information loss in conventional HTML-to-text conversion pipelines.

VentureBeat AI