SOTA results on Databricks OfficeQA
Our data engine outperforms frontier models in grounded reasoning.

Nik Bozhinov
30 June 2026

We are excited to share that we achieved state-of-the-art performance on Databricks’ OfficeQA PRO Full Corpus benchmark.

Against the officially published answers, we reached 58.65% correctness, surpassing Fable 5’s 57.90%. Furthermore, our engine identified better answers for 15 of the 133 questions, which brings the revised score to 69.92%. We've shared these findings with the Databricks research team to help improve the benchmark.


We combined Parsewise’s proprietary extraction technology with Gemini 3 Flash for parsing and Gemini 3.5 Flash for reasoning. The results show that, with a purpose-built architecture, smaller models can outperform frontier models.

The challenge for enterprise use cases

At Parsewise, we have solved many complex enterprise use cases. Customers' operational data rarely has a single source of truth. It is spread across heterogeneous sources in which the same data point can appear in multiple documents with different values and is often revised over time.

Think of a company uploading years of financial tables across Excel workbooks and PDF quarterly reports. Answering questions on that data is no longer a matter of finding a single number. The system needs to understand where each number came from, whether it was revised, and which sources to trust. This is what we have set out to provide since day one: answers our customers can not only trust, but verify for themselves.

This is why, when OfficeQA was originally announced, we were very excited that a benchmark had managed to capture the real complexity of enterprise workflows.

Sample US Treasury page

OfficeQA comprises close to 89,000 pages of US Treasury Bulletins spanning the past 90 years. Many of the bulletins contain scanned pages whose dense financial tables are often barely legible, which makes processing them even more difficult. The benchmark questions themselves require deep financial reasoning over information often spread across multiple bulletins, and using revised figures is actively encouraged unless the question specifies otherwise.


If you would like to see an example of a question similar to the ones from the benchmark, you can find our live demo here.

Results
We surpassed the previous best results, GPT-5.5 (52.63%) and Fable 5 (57.90%), on the hardest variant of the benchmark (full corpus). Against the officially published answers, Parsewise answered 78 of 133 questions correctly, reaching 58.65% correctness.

Beyond that, our models identified 15 questions where revised bulletins provide updated data points, leading to better, more up-to-date answers. Counting those as correct brings our accuracy to 69.92% (93 of 133) and cuts the number of remaining errors from 55 to 40, a reduction of roughly 27%.
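For readers who want to check the arithmetic, here is a minimal sketch that recomputes the two scores from the counts reported above. It assumes that all 15 revised questions were scored as incorrect against the published answers, which is what the move from 58.65% to 69.92% implies. The constant and function names are ours, for illustration only.

```python
TOTAL_QUESTIONS = 133    # size of the PRO set
CORRECT_PUBLISHED = 78   # correct against the officially published answers
REVISED_ANSWERS = 15     # questions where revised bulletins give a better answer


def accuracy(correct: int, total: int) -> float:
    """Return accuracy as a percentage."""
    return 100.0 * correct / total


published = accuracy(CORRECT_PUBLISHED, TOTAL_QUESTIONS)
revised = accuracy(CORRECT_PUBLISHED + REVISED_ANSWERS, TOTAL_QUESTIONS)

# Relative reduction in the number of questions still counted as wrong.
# Assumption: every revised question was previously scored as an error.
errors_before = TOTAL_QUESTIONS - CORRECT_PUBLISHED
errors_after = errors_before - REVISED_ANSWERS
error_reduction = 100.0 * (errors_before - errors_after) / errors_before

print(f"published: {published:.2f}%")
print(f"revised: {revised:.2f}%")
print(f"error reduction: {error_reduction:.1f}%")
```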


Parsewise’s built-in traceability, combined with our Navi deep research agent, makes reviewing these differences much easier for human analysts (see example). We are working with the Databricks team to verify the revisions and will update this post once that happens. We are looking forward to collaborating on future benchmarks.


What made this possible
Below is a simple overview of the Parsewise data engine that enabled these results. You can find our deep dive on the topic here.

First, as with Databricks, preprocessing documents and extracting their textual content is essential. When a user uploads a document to Parsewise, its content is extracted once and made available to all agents. This gives us a consistent reasoning foundation across the full corpus.
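The "extract once, reuse everywhere" idea can be sketched as a small content-addressed cache. This is an illustration under our own assumptions, not the Parsewise implementation: `ExtractionCache` and `toy_extract` are hypothetical names, and the stand-in extractor simply decodes bytes instead of parsing a scanned page.

```python
import hashlib
from typing import Callable


class ExtractionCache:
    """Runs the extractor at most once per distinct document."""

    def __init__(self, extract: Callable[[bytes], str]) -> None:
        self._extract = extract
        self._store: dict[str, str] = {}
        self.extract_calls = 0  # counts real extractions, for demonstration

    def get_text(self, document: bytes) -> str:
        # Key the cache on the document's content hash.
        key = hashlib.sha256(document).hexdigest()
        if key not in self._store:
            self.extract_calls += 1
            self._store[key] = self._extract(document)
        return self._store[key]


def toy_extract(document: bytes) -> str:
    """Stand-in for a real parser (assumption): decodes UTF-8 bytes."""
    return document.decode("utf-8")


cache = ExtractionCache(toy_extract)
doc = b"Treasury Bulletin, sample page"
first = cache.get_text(doc)   # triggers the extraction
second = cache.get_text(doc)  # served from the cache
print(cache.extract_calls)
```

Every agent that calls `get_text` on the same document sees identical text, which is the property the paragraph above relies on.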


Sample US Treasury page with parsed data

Second, each agent on Parsewise performs deep search over all data. This differs from agentic search, where the agent is given all the data and tries to find textual terms within it. It also differs from RAG, which relies on retrieving only the most similar items from the corpus. We can guarantee that every agent has looked through every page efficiently, ensuring that revisions and inconsistencies aren’t missed.
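To make the contrast with RAG concrete, here is a toy sketch. It is not how Parsewise agents work internally: the word-overlap score, the regular expression, the bulletin labels, and the figures are all invented for illustration. It only shows why a top-k cutoff can drop a page that carries a revision, while a pass over every page cannot.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    bulletin: str  # hypothetical bulletin label
    text: str


def overlap_score(query: str, page: Page) -> int:
    """Toy similarity: number of query words that also occur on the page."""
    query_words = set(query.lower().split())
    page_words = set(re.findall(r"[a-z]+", page.text.lower()))
    return len(query_words & page_words)


def top_k_retrieve(query: str, pages: list[Page], k: int) -> list[Page]:
    """RAG-style retrieval: keep only the k highest-scoring pages."""
    return sorted(pages, key=lambda p: overlap_score(query, p), reverse=True)[:k]


def exhaustive_scan(pattern: str, pages: list[Page]) -> list[tuple[str, str]]:
    """Exhaustive pass: run the same extractor over every page."""
    hits = []
    for page in pages:
        for match in re.finditer(pattern, page.text):
            hits.append((page.bulletin, match.group(1)))
    return hits


# Invented pages and values; these are NOT real Treasury figures.
PAGES = [
    Page("1953-08", "Federal budget net receipts July 1953: 1,000"),
    Page("1953-09", "Net receipts, July 1953 (revised): 1,010"),
    Page("1954-02", "Budget receipts and expenditures summary table"),
]
QUERY = "federal budget net receipts july"
PATTERN = r"[Nn]et receipts,? July 1953[^:]*: ([\d,]+)"

retrieved = top_k_retrieve(QUERY, PAGES, k=1)
all_hits = exhaustive_scan(PATTERN, PAGES)
print([p.bulletin for p in retrieved])
print(all_hits)
```

With `k=1`, retrieval returns only the first bulletin, so the revised figure never reaches the model. The exhaustive pass surfaces both values and leaves the choice between them to a later reasoning step.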

One of three source pages for Federal Budget Net Receipts in July 1953.

On top of all agent results sits Navi, our chat assistant for deep research. It is built to understand the context of the project, read extracted document text, create agents to answer specific questions, process agent results, and guide users through complexity.

Navi response to the sample question.

Finally, all of this comes together to let analysts find even the hardest-to-spot data revisions. In OfficeQA, we managed to spot changes published more than 10 years later! Navi then uses all of these value extractions, together with their context, to reason about which ones are relevant to the individual question before proceeding to the final calculations.
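One simple way to picture the revision step is a rule that, for a given metric and period, prefers the most recently published value unless the question asks for the original figure. The sketch below is a deliberately simplified assumption on our part (Navi reasons over context rather than applying a fixed rule), and the `Extraction` fields, sources, and values are invented.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Extraction:
    metric: str      # what was measured
    period: str      # period the figure refers to
    value: float
    published: date  # publication date of the source bulletin
    source: str      # hypothetical page reference, kept for traceability


def resolve(
    extractions: list[Extraction],
    metric: str,
    period: str,
    prefer_revised: bool = True,
) -> Optional[Extraction]:
    """Pick the latest (or, if requested, the earliest) published value."""
    candidates = [
        e for e in extractions if e.metric == metric and e.period == period
    ]
    if not candidates:
        return None
    pick = max if prefer_revised else min
    return pick(candidates, key=lambda e: e.published)


# Invented values; the last one models a revision published over a decade later.
EXTRACTIONS = [
    Extraction("net receipts", "1953-07", 1000.0, date(1953, 8, 15), "bulletin A, p. 3"),
    Extraction("net receipts", "1953-07", 1010.0, date(1953, 9, 15), "bulletin B, p. 3"),
    Extraction("net receipts", "1953-07", 1012.0, date(1964, 2, 15), "bulletin C, p. 7"),
]

latest = resolve(EXTRACTIONS, "net receipts", "1953-07")
original = resolve(EXTRACTIONS, "net receipts", "1953-07", prefer_revised=False)
print(latest.value, latest.source)
print(original.value, original.source)
```

Keeping the `source` on every extraction is what lets an analyst trace the chosen value back to its page and see the alternatives that were rejected.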


Conclusion

We are excited to keep pushing the frontier of quality and efficiency for real-world use cases and to enable the best possible interactions between human experts and AI models.

Thank you to Databricks for curating the benchmark, SuperAnnotate for doing the hard work of human verification, and USAFacts for building the data trove.



Appendix



The actual benchmark is split into two sets. The PRO set contains the 133 hardest questions, each of which at least one state-of-the-art model got wrong at the time of benchmark publication. The extended FULL set adds 113 easier questions.

Additionally, the benchmark can be run in one of two settings:

  • Oracle pages: the model is told which pages are relevant for the current question. Usually, this will be between one and three pages spread across a few bulletins, though some questions may require more.

  • Full corpus: the model is not told which pages are relevant and has to identify them itself across the full set of 89,000 pages.


For the purposes of our evaluation, we focused on the PRO set in the full corpus context. In our experience, this variation of the benchmark better reflects the messy reality of real-world use cases where users often do not know which documents have the correct answer.

The original paper also published a supplementary metric: how well agents performed when they were allowed a small relative error of 0.1%, 1%, or 5%. These results are reported against both the raw PDFs and the PDFs processed by Databricks Parse to extract the text.
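As a rough illustration of what scoring at a relative error threshold means, the sketch below counts a numeric answer as correct when it lies within a given fraction of the reference value. The benchmark's exact scoring rule is not described in this post, so treat the comparison, the function names, and the sample numbers as assumptions.

```python
def within_tolerance(predicted: float, expected: float, rel_tol: float) -> bool:
    """True if predicted is within rel_tol (as a fraction) of expected."""
    if expected == 0:
        # Relative error is undefined at zero; require an exact match.
        return predicted == 0
    return abs(predicted - expected) <= rel_tol * abs(expected)


def accuracy_at(pairs: list[tuple[float, float]], rel_tol: float) -> float:
    """Share of (predicted, expected) pairs within the tolerance."""
    hits = sum(within_tolerance(p, e, rel_tol) for p, e in pairs)
    return hits / len(pairs)


# Invented (predicted, expected) pairs, not benchmark data.
PAIRS = [(100.05, 100.0), (100.5, 100.0), (103.0, 100.0), (110.0, 100.0)]

# Thresholds of 0.1%, 1%, and 5% expressed as fractions.
scores = {tol: accuracy_at(PAIRS, tol) for tol in (0.001, 0.01, 0.05)}
print(scores)
```

Looser thresholds can only keep or raise the score, which is why results at 5% are never below those at 0.1%.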

We could not find comparable results for GPT-5.5 and Fable 5. However, we are happy to report that we outperform GPT-5.4, Opus 4.6, and Gemini 3.1 Pro at all of the reported error thresholds.

Performance of our engine at different error thresholds.

Where we can go next

While we are excited about the results, there is room for improvement.


At the moment, all of our benchmark results are based on Gemini 3 Flash for text extraction and Gemini 3.5 Flash for deep reasoning in Navi. These models have been instrumental in achieving these results. As you can see below, Gemini 3 Flash delivered state-of-the-art performance on our text extraction suite, even when compared to larger models.

Gemini 3 Flash’s performance on our internal evaluation suite.

The main benefit of the Gemini Flash models is their cost-efficiency and speed, but we are looking forward to seeing what performance we may obtain from larger models.

Another avenue we would like to explore is improving our web search capabilities even further. For enterprise customers, we use the Enterprise Grounding within Google, which ensures high-quality data and prevents context leakage. For the benchmark, this excluded some websites from our search results that we believe could have helped us achieve even better results.