Multimodal AI at Work: Document Processing Has Scaled, Video and Vision Still Piloting

6 min read
Key Takeaways
  • Document AI is in production: GPT-4o with OCR pre-processing reaches 98% field-level accuracy on invoices; open-source Qwen2.5 VL 72B leads DocVQA at 96.4%, above all proprietary models.
  • Audio transcription is at production scale: Azure’s gpt-4o-transcribe-diarize processes 10 minutes of audio in ~15 seconds across 100+ languages, with Klarna reporting 40% cost-per-transaction reduction.
  • Visual reasoning for high-stakes workflows is not ready: GPT-4V scored 61% on a radiology challenge but 8.3% on a controlled clinical evaluation — the difference reflects benchmark design, not real-world capability.
  • On MMMU-Pro — which removes text-only-answerable questions — model scores drop to 16.8%–26.9%, exposing fragile visual reasoning that standard leaderboard performance masks.

Key Claim: The modality-by-modality pattern is clear: document AI and audio transcription have crossed the production threshold; video understanding and open-ended visual reasoning have not — and the gap between radiology benchmark scores and controlled clinical accuracy illustrates why.

When Morgan Stanley deployed a GPT-4–based assistant to its wealth management division, the share of relevant research content that advisors could find in a query jumped from 20% to 80%. Ninety-eight percent of advisor teams now use the tool actively. The system, however, works primarily on text — summarising reports, answering questions about research documents — not on images, charts, or audio recordings of client calls. That distinction is not a footnote. It is the central fact of enterprise multimodal AI in 2026: the text-adjacent applications scaled first; the genuinely cross-modal ones are still catching up.

The frontier models — GPT-4o, Gemini 2.0/2.5, Claude 3.7 Sonnet — all process image, audio, and video inputs natively. Vendor benchmark sheets show impressive numbers. A closer read of the data, however, reveals that performance varies sharply by modality, by document type, and by the distance between the test set and production conditions. Enterprises trying to operationalise multimodal AI are not blocked by a lack of raw capability; they are blocked by reliability floors that remain too low for high-stakes workflows where human review cannot be removed.

Document Understanding: The Clearest Production Win

Of all multimodal applications, document understanding has made the clearest transition from demo to deployment. The core use case — replacing rule-based OCR pipelines that require template engineering for each document layout — is genuinely in production at hundreds of companies processing invoices, contracts, and financial statements.

Benchmark data from a 2025 enterprise evaluation of five document processing services shows GPT-4o paired with a third-party OCR pre-processing layer achieving 98.0% field-level accuracy on invoices, compared to 90.5% for GPT-4o on direct image input, 93.0% for Azure Document Intelligence, and 78.0% for AWS Textract. (BusinessWareTech, 2025) On the DocVQA benchmark, which tests visual question answering over document images, Claude 3.5 Sonnet scores 95.2%, GPT-4o scores 92.8%, and Alibaba’s open-source Qwen2.5 VL 72B leads at 96.4%. (llm-stats.com DocVQA leaderboard)

The accuracy story has a latency asterisk. GPT-4o processes one invoice image in 16.9 seconds; AWS Textract takes 2.9 seconds. (BusinessWareTech, 2025) For batch-processing workflows that run overnight, this is irrelevant. For real-time AP automation where a supplier is waiting for payment confirmation, it is a hard constraint. The same evaluation found line-item extraction accuracy falls sharply for all LLM-based approaches — GPT-4o drops to 63% on direct image input for line items — while specialised document AI services retain 82–87%. The headline field-extraction number and the line-item number are measuring different problems.
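The distinction between those two numbers is worth making concrete. A minimal sketch of the two metrics — with hypothetical helper names and toy data, not any vendor's actual scoring code — shows why they diverge: field-level accuracy credits each header field independently, while line-item accuracy requires every cell of an item to match.

```python
# Hypothetical scoring helpers for an internal document-extraction eval.
# Field-level accuracy counts each header field (vendor, total, date, ...)
# independently; line-item accuracy requires an exact match of the whole item.

def field_accuracy(predicted: dict, expected: dict) -> float:
    """Fraction of expected header fields extracted exactly."""
    if not expected:
        return 1.0
    hits = sum(1 for k, v in expected.items() if predicted.get(k) == v)
    return hits / len(expected)

def line_item_accuracy(predicted: list[dict], expected: list[dict]) -> float:
    """Fraction of expected line items matched exactly, in order."""
    if not expected:
        return 1.0
    hits = sum(1 for p, e in zip(predicted, expected) if p == e)
    return hits / len(expected)

pred = {"vendor": "Acme", "total": "120.00", "date": "2025-03-01"}
gold = {"vendor": "Acme", "total": "120.00", "date": "2025-03-02"}
print(field_accuracy(pred, gold))  # 2 of 3 fields match
```

A single wrong cell sinks an entire line item under the stricter metric, which is one reason line-item numbers sit so far below headline field-extraction scores.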

A separate arXiv study comparing native vision input with text-conversion pipelines confirmed that native image processing consistently outperforms pipeline-based approaches: Gemini 2.5 Pro scored 87.46% on scanned receipts natively vs 47.00% via Docling text conversion, and 96.50% vs 85.14% on clean digital invoices. The implication for deployment decisions: pre-processing a document through a text-extraction layer before sending it to a language model is not a neutral choice; on degraded inputs it can nearly halve effective accuracy.

Audio and Transcription: Volume Has Arrived

The audio modality tells a cleaner production story than vision. Speech-to-text with speaker diarisation has crossed the threshold where contact centres, compliance teams, and clinical documentation workflows are running it at production volumes with limited human review.

Azure’s gpt-4o-transcribe-diarize converts 10 minutes of audio in approximately 15 seconds across 100+ languages and dialects, positioning it directly for real-time compliance monitoring and post-call analytics. (Microsoft Azure AI Foundry Blog) Klarna’s deployment — which used OpenAI models to handle the equivalent of 700 full-time agents’ customer service workload — demonstrated what volume-scale voice AI looks like in practice: 96% of Klarna employees now use AI daily, and customer service cost per transaction dropped 40% from Q1 2023 to the point the company reported results. (OpenAI: Klarna case study)

The caveat is that call centre deployments have largely kept humans in the loop for resolution, using AI for transcription, sentiment tagging, and draft response suggestions rather than fully autonomous dialogue. The claim of 48% efficiency gains cited by some vendors lacks independent audit trails. Enterprises should treat vendor-originated efficiency figures as directional, not contractual.

Visual Reasoning: Where the Gap Is Most Visible

The most instructive case for understanding the demo-to-production problem is radiology. In one diagnostic challenge, GPT-4V achieved 61% accuracy on a broad 936-case set, outscoring a pool of physician respondents at 49%. (PMC: Advancing Radiology with GPT-4) That result generated substantial press coverage. But in a controlled clinical study of 206 imaging cases — a more rigorous evaluation — the same model achieved 8.3% diagnostic accuracy without clinical context, rising to 29.1% when clinical notes were provided. (PMC: GPT-4V in Radiologic Image Interpretation, 2025) In that study, the model fabricated imaging findings in 258 instances, and only 39% of the findings it described were actually visible in the provided images.

The discrepancy is not a contradiction — it reflects how benchmark design and test-set composition shape reported numbers. The 61% figure came from a challenge where question framing narrowed the task; the 8.3% figure came from open-ended image interpretation without scaffolding. No LLM is currently FDA-approved for clinical imaging. Radiology AI in production uses specialised, trained models — not general-purpose multimodal LLMs.

This is the version of the reliability gap that Scale AI’s M-HalDetect research — the first comprehensive multimodal hallucination detection dataset — exposed: a model can appear to work while barely using the visual input at all, producing similar answers even when the supplied image changes. Galileo’s evaluation research catalogued four distinct failure modes: object hallucination (inventing non-existent elements), attribute hallucination (correct object, wrong properties), relational hallucination (spatial and logical errors), and fabricated descriptions. These are especially dangerous because, as the researchers note, “fluent output often gets treated as trustworthy output” — errors cascade through downstream workflow steps before anyone notices.

On the MMMU benchmark — 11,500 multimodal questions across 183 academic subfields — Claude 3.7 Sonnet (Thinking) reaches 76.4%, narrowly above the lower bound of human expert performance at 76.2%. (vals.ai MMMU benchmark) On MMMU-Pro, a harder variant that removes text-only-answerable questions and embeds questions inside images, model scores drop to a range of 16.8%–26.9% across leading models — exposing that high standard-benchmark scores can mask fragile visual reasoning. (Artificial Analysis: MMMU-Pro leaderboard)

Video Understanding: Capable but Cost-Constrained

Video understanding is the modality with the widest gap between capability demos and production deployments. Gemini 2.0 Flash represents video at 258 tokens per second of footage when sampled at one frame per second. At list API pricing, one hour of video at that frame rate consumes approximately 928,800 input tokens — under $0.15 at current Gemini 2.0 Flash pricing — making the per-minute cost more tractable than it was 18 months ago. (Google AI for Developers: Gemini API pricing) But video understanding in production involves more than API cost: it involves frame selection, context window management for long-form content, and validation pipelines that add latency and operational overhead.
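The arithmetic behind those figures is easy to reproduce. The sketch below uses the 258 tokens-per-second rate cited above; the per-token price is an assumed placeholder and should be swapped for the current published rate.

```python
# Back-of-envelope input-cost check for long-form video.
TOKENS_PER_SECOND = 258        # video tokens at 1 frame per second (cited above)
PRICE_PER_M_INPUT = 0.10       # assumed USD per 1M input tokens -- verify current pricing

def video_input_cost(seconds: float) -> tuple[int, float]:
    """Return (input tokens, estimated USD cost) for a clip of given length."""
    tokens = int(seconds * TOKENS_PER_SECOND)
    return tokens, tokens / 1_000_000 * PRICE_PER_M_INPUT

tokens, cost = video_input_cost(3600)  # one hour of footage
print(tokens, round(cost, 4))  # 928800 tokens, roughly $0.09 at the assumed price
```

At that assumed rate the one-hour figure lands comfortably under the $0.15 ceiling quoted above; the point is that raw token cost is no longer the binding constraint.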

The latency bottleneck research from Apple’s FastVLM work (CVPR 2025) identifies vision encoder processing, not LLM decoding, as the primary constraint for high-resolution multimodal inference. At high resolution, the encoder creates more visual tokens, increasing pre-fill time before the model generates a single output token. Token compression — discarding redundant visual tokens before LLM input — is emerging as the primary technique for managing this bottleneck.
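The idea behind token compression can be illustrated in a few lines. This is a deliberately simplified stand-in — real systems such as FastVLM-style encoders fold pruning into the architecture itself, and the saliency scores here are assumed inputs, not a real scoring method.

```python
# Minimal illustration of visual-token pruning: score each encoder token,
# keep only the top-scoring fraction (preserving order), and hand the
# shorter sequence to the LLM to cut pre-fill time.

def prune_visual_tokens(tokens: list, scores: list[float], keep_ratio: float = 0.25) -> list:
    """Keep the highest-scoring fraction of visual tokens, preserving order."""
    k = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(ranked)]

toks = ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7"]
sal = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6]
print(prune_visual_tokens(toks, sal))  # keeps 2 of 8 tokens -> ['t1', 't3']
```

Because pre-fill time scales with input length, dropping three quarters of the visual tokens before decoding directly attacks the encoder-side bottleneck the FastVLM work identifies — at the cost of whatever information the discarded tokens carried.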

Production video understanding deployments in 2025–2026 are largely in media and content moderation (automated tagging, safety filtering) rather than high-stakes analytical workflows. The broader pilot-to-production pattern holds here: only 23% of organisations report scaling AI agents in production despite 62% experimenting with them. (MIT: State of AI in Business 2025)

What to Watch

The MMMU-Pro gap is the leading indicator for production readiness. When a model scores well on MMMU but falls 40–50 points on MMMU-Pro, that spread reveals how much performance is driven by text-only answerable questions. Enterprises evaluating multimodal models should run their own domain-specific evals rather than relying on leaderboard numbers from standard benchmarks.
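One way to operationalise that advice is a spread check in your own eval harness: score the model on the full set and on the subset that cannot be answered from text alone, then flag a large gap. Everything below is a hypothetical sketch — `spread_check`, the case schema, and the toy model are illustrative names, not any benchmark's actual tooling.

```python
# Sketch of an MMMU-Pro-style spread check over a domain-specific eval set.

def score(model_answer, cases) -> float:
    """Fraction of cases answered correctly."""
    return sum(model_answer(c["question"], c.get("image")) == c["answer"]
               for c in cases) / len(cases)

def spread_check(model_answer, cases, threshold: float = 0.30):
    """Return (full score, vision-required score, fragile?)."""
    vision_only = [c for c in cases if c["needs_image"]]
    full = score(model_answer, cases)
    hard = score(model_answer, vision_only)
    return full, hard, (full - hard) > threshold

# Toy model that only ever uses the question text: it aces text-answerable
# items, fails the vision-required ones, and the check flags it as fragile.
lookup = {"q1": "a", "q2": "b"}
def text_only_model(question, image):
    return lookup.get(question, "?")

cases = [
    {"question": "q1", "answer": "a", "needs_image": False},
    {"question": "q2", "answer": "b", "needs_image": False},
    {"question": "q3", "image": "x.png", "answer": "c", "needs_image": True},
    {"question": "q4", "image": "y.png", "answer": "d", "needs_image": True},
]
print(spread_check(text_only_model, cases))  # (0.5, 0.0, True)
```

The labelling step — deciding which of your cases genuinely require the image — is the expensive part, but it is exactly the distinction MMMU-Pro formalises.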

MIT’s finding that 95% of enterprise generative AI pilots broadly fail to deliver measurable P&L impact (Fortune / MIT Report, 2025) applies acutely to multimodal pilots, where the failure mode is rarely model capability and more often a mismatch between evaluation conditions and production realities: document variance, lighting conditions, scan quality, and accented speech.

Document-native architectures are outperforming pipeline workarounds. The consistent finding across invoice benchmarks is that sending images directly to a multimodal model outperforms converting them to text first, except on line-item extraction where specialised document AI still leads. Architectural decisions made now about OCR pipeline design will affect accuracy floors for years.

Open-source models are competitive on document tasks. Qwen2.5 VL 72B’s 96.4% DocVQA score — higher than any proprietary model on the benchmark — signals that the cost-performance trade-off for document AI has materially shifted. For enterprises with on-premise data requirements or cost sensitivity at volume, open-source multimodal deployment is no longer a compromise.

This article was produced with AI assistance and reviewed by the editorial team.

Arjun Mehta, AI infrastructure and semiconductors correspondent at Next Waves Insight

About Arjun Mehta

Arjun Mehta covers AI compute infrastructure, semiconductor supply chains, and the hardware economics driving the next wave of AI. He has a background in electrical engineering and spent five years in process integration at a leading semiconductor foundry before moving into technology analysis. He tracks arXiv pre-prints, IEEE publications, and foundry filings to surface developments before they reach the mainstream press.
