VFF - The signal in the noise
Multimodal AI turns aerial imagery into searchable data

AWS and Vexcel, an aerial imagery provider operating across 45+ countries, developed a multimodal AI system that makes billions of aerial images searchable in natural language without training a separate model for each feature of interest. The system combines embedding models, LLM captioning, and vector search so that imagery is indexed once and then queried in plain English. Amazon Nova Multimodal Embeddings delivered the highest F1 scores in the team's evaluation, and the work evolved into Vexcel Intelligence, a commercial searchable imagery product.

  • Vexcel and AWS built a semantic search system for aerial imagery using multimodal embeddings and vector search on Amazon Bedrock and OpenSearch Serverless
  • The system eliminates per-feature model training, allowing natural-language queries across millions of images without manual tile-by-tile inspection
  • Amazon Nova Multimodal Embeddings achieved the highest F1 scores across the benchmark queries in the team's evaluation
  • The architecture handles multi-view oblique imagery from Vexcel's fleet operating across 45+ countries and territories
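
The index-once, query-many pattern described above can be illustrated with a minimal sketch. It is not the Vexcel or AWS implementation: the `TileIndex` class, the tile IDs, and the three-dimensional vectors are hypothetical stand-ins, and an in-memory list takes the place of a managed vector store. In a real system the vectors would come from a multimodal embedding model, with tiles and text queries embedded into the same space. The `f1_score` helper shows how retrieved tiles might be scored against a hand-labeled relevant set.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TileIndex:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self._items = []  # list of (tile_id, vector)

    def add(self, tile_id, vector):
        # Indexing happens once per tile, independent of future queries.
        self._items.append((tile_id, list(vector)))

    def search(self, query_vector, k=3):
        # Brute-force nearest neighbours by cosine similarity.
        scored = [(cosine(query_vector, vec), tid) for tid, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(tid, score) for score, tid in scored[:k]]


def f1_score(retrieved, relevant):
    """F1 of a retrieved set against a ground-truth relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Toy vectors (assumed for illustration, not real embeddings).
index = TileIndex()
index.add("tile_pool_a", [0.9, 0.1, 0.0])
index.add("tile_pool_b", [0.8, 0.2, 0.1])
index.add("tile_solar", [0.1, 0.9, 0.1])
index.add("tile_road", [0.0, 0.1, 0.9])

# Stand-in for the embedding of the text query "swimming pool".
query_vec = [1.0, 0.0, 0.0]
hits = index.search(query_vec, k=2)
hit_ids = [tile_id for tile_id, _ in hits]
score = f1_score(hit_ids, {"tile_pool_a", "tile_pool_b"})
```

A new question only requires embedding a new query string and calling `search` again; the index itself is untouched, which is what removes the per-feature training step.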

Geospatial data is critical for insurance, real estate, government, infrastructure, and agriculture, but converting billions of pixels into actionable intelligence has required either manual inspection or a custom-trained model for each new query. This work demonstrates a scalable alternative built on commodity multimodal AI and vector databases, cutting the time from question to answer from weeks to seconds.

Organizations relying on aerial imagery can now answer ad-hoc geospatial questions without engineering overhead or labeled training data. The approach reduces operational friction for use cases like locating swimming pools, identifying road networks, counting solar panels, or detecting specific building features, making geospatial intelligence accessible to non-technical stakeholders.

  • Multimodal embeddings plus vector search can replace bespoke computer vision pipelines for geospatial queries, lowering barriers to entry for imagery-based analysis
  • LLM captioning of aerial imagery may add cost without proportional search quality gains, requiring careful evaluation of fusion strategies
  • The architecture is generalizable across industries and geographies, as demonstrated by Vexcel's 45+ country coverage and the emergence of Vexcel Intelligence as a commercial product
  • Index-once, query-many approaches reduce time-to-insight for exploratory geospatial analysis and enable rapid iteration on new use cases
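
One way to evaluate whether captions earn their cost is to compare rankings with and without them. The sketch below assumes a simple weighted-sum late fusion of two per-tile similarity scores; the source does not say which fusion strategies were tested, so `fuse_scores`, the `alpha` weight, and the sample scores are all hypothetical.

```python
def fuse_scores(image_scores, caption_scores, alpha=0.7):
    """Weighted late fusion of two similarity scores per tile.

    alpha = 1.0 uses image embeddings only; alpha = 0.0 uses captions only.
    Tiles missing from one source contribute 0.0 for that source.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    tile_ids = set(image_scores) | set(caption_scores)
    fused = {
        tid: alpha * image_scores.get(tid, 0.0)
        + (1.0 - alpha) * caption_scores.get(tid, 0.0)
        for tid in tile_ids
    }
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Assumed similarity scores for one query (illustrative values only).
image_scores = {"tile_a": 0.9, "tile_b": 0.6, "tile_c": 0.2}
caption_scores = {"tile_a": 0.4, "tile_b": 0.9}

embedding_only = [tid for tid, _ in fuse_scores(image_scores, caption_scores, alpha=1.0)]
fused_ranking = [tid for tid, _ in fuse_scores(image_scores, caption_scores, alpha=0.5)]
```

Running the same benchmark queries at `alpha=1.0` and at a fused setting, then comparing retrieval metrics such as F1, gives a direct read on whether the captioning step changes results enough to justify its cost.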

Monitor adoption of similar multimodal embedding approaches in other geospatial and imagery-heavy domains. Track whether vector search becomes the default for large-scale imagery retrieval and whether LLM captioning proves cost-effective as embedding model quality improves. Watch for competitive offerings and whether this pattern extends to other sensor modalities beyond aerial imagery.
