
[Example document: a slide from Palantir's Q1 2025 investor presentation, titled "US commercial continues to accelerate in Q1 2025 alongside AIP revolution," showing Y/Y and Q/Q growth callouts for US Commercial Revenue, US Commercial Customer Count, US Commercial Remaining Deal Value, US Commercial Total Contract Value, and US Commercial Deals Closed of $1M or Greater, with the footnote "We define a customer as an organization from which we have recognized revenue during the trailing twelve months period."]
This is possibly one of the simplest documents out there, and we haven't even gotten to complex or technical documents yet.
Note/aside: you might still need to convert a document to text or a structured format; that's essential for syncing information into structured databases or data lakes. In those cases, OCR works (with its quirks), but in my experience, passing the original document to an LLM is better (we do this with Morphik workflows; learn more here).
The Traditional Approach: A House of Cards
When we started building RAG at Morphik, we did what everyone does: assembled the standard document processing pipeline. You know the one: it starts with OCR and ends with tears. Here's what that "state-of-the-art" pipeline actually looks like (a minimal code sketch follows the list):
- Run OCR on your PDFs (hope it reads the numbers correctly)
- Deploy layout detection models (hope they identify table boundaries)
- Reconstruct reading order (hope it follows the visual flow)
- Caption figures with specialized models (hope they capture the nuance)
- Chunk text “intelligently” (hope you don’t split related info)
- Generate embeddings with BGE-M3 or similar (hope meaning survives)
- Store in your vector database (hope you can find it later)
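To make the fragility concrete, here is roughly what a minimal version of that pipeline looks like in code. This is an illustrative sketch, not our production code; the library choices (pdf2image, pytesseract, sentence-transformers with BGE-M3) are common defaults I'm assuming here, and it already skips the layout-detection and figure-captioning steps entirely:

```python
# A minimal "traditional" pipeline: OCR -> chunk -> embed. Every arrow is a
# place where information silently falls on the floor.
from pdf2image import convert_from_path      # render PDF pages to images
import pytesseract                           # OCR engine wrapper
from sentence_transformers import SentenceTransformer

def naive_pipeline(pdf_path: str, chunk_size: int = 1000):
    pages = convert_from_path(pdf_path)
    # Step 1: OCR (tables and charts come out as word soup, if at all)
    text = "\n".join(pytesseract.image_to_string(page) for page in pages)
    # Step 2: "intelligent" chunking (a fixed window that happily splits a
    # table header from its rows)
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Step 3: embed each chunk (whatever meaning survived steps 1 and 2)
    model = SentenceTransformer("BAAI/bge-m3")
    return chunks, model.encode(chunks, normalize_embeddings=True)
```

Each step is individually reasonable; the failure mode is the composition. OCR errors propagate into chunking, chunking errors propagate into embeddings, and no later step can recover what an earlier one dropped.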
The Lightbulb Moment: What If We Just… Looked at the Page?
What if we skipped the parsing pipeline entirely, treated each page as an image, and let a vision model see the page the way a human does? That is the idea behind ColPali, and it is what the rest of this post builds on.
Under the Hood: How Visual Document Retrieval Actually Works
Here's where it gets technically interesting. The ColPali model doesn't just "look" at documents; it understands them in a fundamentally different way than traditional approaches. The process is beautifully straightforward.

First, we treat each document page as an image, essentially taking a high-resolution screenshot. This image gets divided into patches, like laying a grid over the page. Each patch might contain a few words, part of a chart, or a table cell. A Vision Transformer (specifically SigLIP-So400m) processes these patches, but here's the clever part: instead of trying to extract text, it creates rich embeddings that capture both textual and visual elements in context. These embeddings are then refined by a language model (PaliGemma-3B) that has been trained to understand document structure.

When you search for "Q3 revenue trends," the magic happens through what's called "late interaction." The model doesn't just look for those exact words; it finds patches containing the text "Q3" and the word "revenue," but also the relevant parts of charts showing upward trends, table cells with quarterly figures, and even color-coded elements that indicate performance. It's like having a human expert who can instantly scan every page and understand not just what's written, but what's shown. Arnav wrote an interesting overview of ColPali here, for those who want to scratch the technical itch.
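To make "late interaction" concrete, here is a minimal sketch of the ColBERT-style MaxSim scoring that ColPali uses, assuming you already have per-token query embeddings and per-patch page embeddings from the model:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """Late-interaction (MaxSim) score between one query and one page.

    query_vecs: (num_query_tokens, dim), L2-normalized token embeddings
    page_vecs:  (num_patches, dim), L2-normalized patch embeddings
    """
    sims = query_vecs @ page_vecs.T        # (tokens, patches) similarity matrix
    # Each query token keeps only its best-matching patch, then we sum:
    # "Q3" can match a table cell while "revenue" matches an axis label.
    return float(sims.max(axis=1).sum())

def rank_pages(query_vecs: np.ndarray, pages: list[np.ndarray]) -> list[int]:
    """Rank candidate pages by MaxSim score, best first."""
    scores = [maxsim_score(query_vecs, p) for p in pages]
    return sorted(range(len(pages)), key=scores.__getitem__, reverse=True)
```

The key property is that query tokens and page patches are embedded independently and only interact at scoring time, which is what makes indexing pages offline possible.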
Real-World Showdown: Morphik vs. The Alternatives
When we built Morphik, we implemented ColPali and quickly discovered that productizing it was far more complex than the research suggested. No vector database directly supported our desired similarity function at the time, and every provider's optimizations targeted single vectors rather than multi-vectors, which made queries painfully slow. We added optimizations like binary quantization with Hamming distance in place of dot products, among other performance improvements, and each piece had hidden complexity (a sketch of the quantization trick closes out this section). But the real test came when we compared our approach to existing solutions.

Systematic Evaluation: The Numbers Don't Lie

To validate these observations beyond anecdotal evidence, we worked with TLDC (The LLM Data Company) to build an open-source financial document benchmark with 45 challenging questions across NVIDIA 10-Qs, Palantir investor presentations, and JPMorgan reports. TLDC has a complex multistep process for generating these evaluations, and their objective was frankly to humiliate us: they wanted questions that would expose the limitations of any document retrieval system. The evaluation harness is designed to be easy to use, so anyone can test their RAG system against these same questions and see how they stack up.

The results were striking: while other end-to-end providers peaked at around 67% accuracy, and even a carefully optimized custom LangChain pipeline with semantic chunking and OpenAI's text-embedding-large model achieved 72%, Morphik delivered 95.56% accuracy on the same evaluation set. For comparison, OpenAI's file search tool, which represents a solid baseline, managed only 13.33% accuracy on these challenging financial document questions.

On the ViDoRe benchmark, the first evaluation specifically designed for visual document retrieval, our approach achieves 81.3% nDCG@5 compared to 67.0% for traditional parsing methods. But as our own evaluation shows, benchmarks only tell part of the story. The real difference is in understanding documents the way they were meant to be understood. Want to test your own system? The entire evaluation framework is ready to use: just clone the repo, add your RAG implementation, and see how it performs on the same challenging financial document questions.
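Back to the binary-quantization trick mentioned at the top of this section. Here is a minimal sketch of the idea (our production version has more moving parts): keep only the sign of each embedding dimension, pack the signs into bits, and compare with XOR-plus-popcount instead of floating-point dot products:

```python
import numpy as np

def binarize(vecs: np.ndarray) -> np.ndarray:
    """Keep only the sign bit per dimension; a 128-dim float32 vector
    shrinks from 512 bytes to 16 bytes."""
    return np.packbits(vecs > 0, axis=-1)

# Precomputed popcounts for all 256 byte values.
POPCOUNT = np.array([bin(b).count("1") for b in range(256)], dtype=np.uint16)

def hamming(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Hamming distances between one packed query vector and many packed
    candidates: XOR flags the differing bits, the table counts them."""
    return POPCOUNT[np.bitwise_xor(query, candidates)].sum(axis=-1)
```

For normalized embeddings, Hamming distance over sign bits tracks angular distance closely enough to shortlist candidates cheaply; the short list can then be re-ranked with exact dot products to recover most of the lost precision.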
The Speed Problem (And How We Solved It)
Ok, so we got to really good accuracy, but I'll be honest: our first implementation was slow. Visual understanding is computationally intensive, and ColPali's original multi-vector approach meant searching through millions of patch embeddings at scale. At 3-4 seconds per query for retrieval, it was impressive but not production-ready for customers querying thousands of documents and demanding fast answers.

The breakthrough came from the MUVERA paper. Instead of searching through all patch embeddings independently, MUVERA reduces multi-vector search to single-vector similarity using fixed-dimensional encodings. Think of it as creating a "summary fingerprint" that preserves the rich patch interactions while being orders of magnitude faster to search. We paired this with Turbopuffer, a vector database built for exactly this use case. The results transformed our system: query latency went from 3-4s to 30ms. The math works out beautifully: we can now search millions of documents faster than traditional systems can parse a single PDF.
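For intuition, here is a toy sketch of MUVERA's fixed-dimensional encoding, stripped down to a single repetition and ignoring the paper's refinements (empty-cluster filling, final projections); the hyperplane count and dimensions below are illustrative, not what we run in production:

```python
import numpy as np

def buckets(vecs: np.ndarray, planes: np.ndarray) -> np.ndarray:
    """SimHash bucket ids: the sign pattern of projections onto k random
    hyperplanes, read as a k-bit integer."""
    bits = (vecs @ planes.T) > 0
    return bits @ (1 << np.arange(planes.shape[0]))

def fde(vecs: np.ndarray, planes: np.ndarray, reduce) -> np.ndarray:
    """Fixed-dimensional encoding: aggregate a variable-size set of vectors
    into one flat vector, one slot per bucket."""
    num_buckets, dim = 2 ** planes.shape[0], vecs.shape[1]
    out = np.zeros((num_buckets, dim))
    ids = buckets(vecs, planes)
    for b in np.unique(ids):
        out[b] = reduce(vecs[ids == b], axis=0)
    return out.ravel()

rng = np.random.default_rng(0)
planes = rng.standard_normal((4, 128))       # k=4 hyperplanes -> 16 buckets

query_vecs = rng.standard_normal((12, 128))  # stand-ins for real embeddings
page_vecs = rng.standard_normal((900, 128))

# Queries sum per bucket, documents average per bucket; a single dot product
# between the two encodings then approximates the multi-vector MaxSim score.
score = fde(query_vecs, planes, np.sum) @ fde(page_vecs, planes, np.mean)
```

Because each page is now one fixed-length vector, it can go straight into a standard single-vector ANN index (Turbopuffer, in our case), and exact MaxSim can still re-rank the top handful of hits.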
What This Means for You
Forget wrestling with document parsing libraries. Forget maintaining separate pipelines for text extraction, table detection, and OCR. Forget losing critical information every time a document doesn't fit your parsing assumptions. With visual document retrieval, you just send us your documents (PDFs, images, even photos of whiteboards) and search with natural language; a sketch of the workflow follows the list. It works particularly well for:
- Financial documents: Where charts and tables tell the real story
- Technical manuals: Where diagrams are worth thousands of words
- Invoices and receipts: Where layout and structure carry meaning
- Research papers: Where figures contain the actual findings
- Medical records: Where visual layouts indicate relationships
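In practice, the developer-facing surface is small. As a sketch of the workflow, with illustrative names only (this is not the actual Morphik SDK; check the docs for the real client):

```python
# Hypothetical client, for illustration only -- the module, class, and method
# names below are assumptions, not the real Morphik API.
from visual_rag_client import Client

client = Client(api_key="...")

# Ingestion: pages are stored as images plus patch embeddings; no OCR,
# no layout detection, no chunking configuration.
client.ingest("palantir_q1_2025_investor_presentation.pdf")

# Retrieval: plain natural language, matched against text and visuals alike.
hits = client.search("How did US commercial remaining deal value change Y/Y?")
for hit in hits:
    print(hit.page_number, hit.score)
```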