Building Document Processing In-House
What It Takes to Build and Operate
Gergely Csegzi
16 January 2026

We have seen enterprises decide to build their own unstructured-document solutions in-house. Some of them became customers a few months later, once the challenges were well understood.
In this guide, we give an overview of the techniques involved and highlight the challenges, so that you can make an informed decision about how to solve your document-based use cases.
We will cover:
High-level steps of a successful pipeline
Extraction complexities: Supporting different file types, OCR, LLM engineering (retries, fallbacks, rate limits, providers)
RAG
Business complexities: Defining targets, Combining and structuring results, Continuous improvement
Validation
Maintenance
High-level steps of a successful pipeline
These are the most common steps we see:
Exploration: Which documents are available and where
Ingestion: Connecting to data stores, whether that is common office software (SharePoint, Google Drive, etc.), vertical-specific software (Datasite, Intralinks, etc.), or automated pipelines over object storage (AWS S3, Azure Blob Storage, etc.)
Cleaning: Deduplication, removing broken or unsupported files
Extraction target configuration: Business users define the required information for their use case
Extraction: Getting textual structured data from multimodal unstructured inputs
Storage: Versioned storage of extracted values
Resolving & structuring data: Extraction results usually require resolving (combining, selecting, deriving) to be useful and to be in the shape required for the workflow
Validation: Users need to be able to validate entire workflow results as well as individual extraction results
Export & integration: The validated data needs to flow into other systems with a specific interface contract
Continuous improvements & business rule evolution: Over time, we see that the steps above are changed and repeated
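The steps above can be sketched as a minimal pipeline skeleton. Everything here is an illustrative stand-in, not a prescribed design: the `Document` shape, the stage functions, and the schema are all simplified placeholders for real connectors, extraction models, and review tooling.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    source: str                              # e.g. a SharePoint or S3 URI
    raw: bytes = b""
    extracted: dict = field(default_factory=dict)
    validated: bool = False

# Minimal stand-ins for each stage; real implementations would talk to
# the source system, an OCR service, an LLM provider, etc.
def ingest(doc):
    doc.raw = b"%PDF- ..."                   # fetch bytes from the data store
    return doc

def clean(doc):
    return len(doc.raw) > 0                  # drop empty or broken files

def extract(doc, schema):
    # Pretend extraction: return a value for every field in the target schema.
    return {name: "<extracted>" for name in schema["properties"]}

def resolve(values):
    return values                            # combine / select / derive in real use

def run_pipeline(doc, schema):
    doc = ingest(doc)
    if not clean(doc):
        return doc
    doc.extracted = resolve(extract(doc, schema))
    doc.validated = True                     # placeholder for human review
    return doc                               # export/integration would follow

schema = {"properties": {"invoice_number": {}, "total": {}}}
result = run_pipeline(Document(source="s3://bucket/inv-001.pdf"), schema)
print(sorted(result.extracted))              # ['invoice_number', 'total']
```

The point of the sketch is the ordering and the data flow between stages, not any individual implementation.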
Extraction complexities
Depending on the use case, the most common extraction challenges that we see are supporting various file types, integrating OCR solutions, or building LLM infrastructure.
Supporting many file types is problematic because their internal representations differ significantly, so a single unified pipeline for all extractions and downstream use is hard to build; on top of that, OCR solutions and LLM API endpoints each support only specific file types. The most common formats we see are PDFs and the Office suite (docx, xlsx, pptx). These share a similar-shaped challenge: most of their content is well-structured digital data that is easy to extract, but they also carry information in images, annotations, and complex layouts that traditional OCR tools miss and that visual LLMs can handle.
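One common mitigation is a thin routing layer that dispatches each file to the parsing path its format supports. A minimal sketch, with illustrative route names rather than real parser integrations:

```python
from pathlib import Path

# Hypothetical routing table: digital-first formats go through native parsers
# (with a vision-LLM fallback for embedded images/layouts), scans go to
# OCR or a vision LLM, and anything else is flagged as unsupported.
ROUTES = {
    ".pdf": "pdf_parser_then_vision_llm",
    ".docx": "office_parser",
    ".xlsx": "office_parser",
    ".pptx": "office_parser",
    ".png": "ocr_or_vision_llm",
    ".jpg": "ocr_or_vision_llm",
}

def route(filename: str) -> str:
    """Pick an extraction path based on the file extension."""
    return ROUTES.get(Path(filename).suffix.lower(), "unsupported")

print(route("contract.PDF"))   # pdf_parser_then_vision_llm
print(route("scan.jpg"))       # ocr_or_vision_llm
print(route("notes.md"))       # unsupported
```

In practice, routing on extension alone is not enough (files can be mislabeled), so content sniffing is usually layered on top.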
Building the right infrastructure for LLMs is challenging for the following reasons:
Retries: Providers are not 100% reliable
Fallbacks: Sometimes there are model quirks or providers experiencing outages or high load, so we commonly see the need to fallback to different models or different providers (e.g., some Gemini models can get stuck infinitely repeating the same characters)
Rate limits: Most providers will have specific rate limits depending on usage
Provider differences: All of the APIs are slightly different, with intricacies around the structure of both inputs and outputs (whether that is parameters, input formats, resolutions, streaming settings, reasoning settings, structured formats, etc.)
Model lifecycle management: Evaluating and integrating the latest models (and retiring old ones)
Eval suites: New models often come with trade-offs, which requires a robust eval suite to understand them and guide downstream decisions
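A sketch of how retries and provider fallbacks are often combined; `call_model`, the exception type, and the model names are placeholders for a real provider SDK and its error classes:

```python
import time

PROVIDERS = ["primary-model", "fallback-model", "other-provider-model"]

class ProviderError(Exception):
    """Stand-in for a provider's transient error (429, 5xx, timeout)."""

def call_model(model: str, prompt: str) -> str:
    # Simulated provider call: the primary is rate limited, the rest succeed.
    if model == "primary-model":
        raise ProviderError("rate limited")
    return f"answer from {model}"

def call_with_fallback(prompt: str, retries: int = 2) -> str:
    # Retry each model a few times, then fall through to the next provider.
    for model in PROVIDERS:
        for attempt in range(retries):
            try:
                return call_model(model, prompt)
            except ProviderError:
                time.sleep(0)  # real code would back off, e.g. 2 ** attempt
    raise RuntimeError("all providers exhausted")

print(call_with_fallback("Extract the invoice total"))
# answer from fallback-model
```

Real implementations additionally track per-provider rate-limit budgets and normalize the differing request/response shapes mentioned above behind one internal interface.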
RAG
Retrieval augmented generation is a very common technique that we see people rely on, with its own additional set of complexities (chunking, embedding, retrieval strategies). Generally, RAG is used to power search for chatbots, and it is less used for building processing pipelines, so we will not go into too much detail.
When to use: While semantic similarity (from embeddings) is inherently less deterministic than exhaustive extraction, it can be a reasonable fit for large-scale use cases where the cost of missing information is low.
A combination that we have seen work well is using exhaustive extraction pipelines to provide structured metadata. The metadata is used for searching or filtering, which narrows down the results enough such that the context window of the LLM providing the final answer is not overwhelmed.
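A minimal sketch of that pattern, with made-up metadata fields: filter on structured metadata first, and only the surviving documents' text would be placed in the answering LLM's context window.

```python
# Structured metadata produced by an exhaustive extraction pass
# (fields and values are illustrative).
documents = [
    {"id": 1, "doc_type": "lease", "year": 2023, "text": "..."},
    {"id": 2, "doc_type": "invoice", "year": 2024, "text": "..."},
    {"id": 3, "doc_type": "lease", "year": 2024, "text": "..."},
]

def filter_by_metadata(docs, **criteria):
    """Keep only documents whose metadata matches every criterion."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

candidates = filter_by_metadata(documents, doc_type="lease", year=2024)
print([d["id"] for d in candidates])   # [3]
# Only candidates' text is then passed to the LLM for the final answer.
```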
Extraction target configuration, Resolving results & Continuous improvement
Let’s move into the business realm: the reason the entire processing pipeline is being built in the first place.
One of the biggest challenges that we have seen is defining the exact shape and goal of the workflow and keeping it up to date with the business rules evolving over time.
What compounds this difficulty is that the business experts need to communicate this with the IT experts on an ongoing basis.
The initial target definition is usually a simplified view of the business rules that gets mapped into a JSON schema that can then be passed onto an LLM.
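As an illustration, a simplified invoice-extraction target might map to a JSON schema like this (all field names are hypothetical):

```python
import json

# A hypothetical target definition for invoice extraction, expressed as the
# JSON schema that would be passed to the LLM's structured-output interface.
invoice_target = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string", "format": "date"},
        "total_amount": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["invoice_number", "total_amount"],
}

print(json.dumps(invoice_target, indent=2))
```

Note how much the schema flattens: it says nothing about which value wins when two documents disagree, which is exactly the resolution problem discussed next.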
For use cases that go beyond simple single-page extractions (e.g., getting invoice details from a single page), the extraction results require resolving. What this means is that we need to consider what happens when there are no results, missing results, multiple results, and whether a single one needs to be selected, or they need to be combined. Consider a mortgage application where the incomes of the applicants will be reflected in different documents depending on the source of income, time, and the applicant. These values need to be taken together, where careful consideration is required about which values are duplicative, which are additive, and which are potentially inconsistent. This resolution usually cuts deep into the business rules (whether captured or tacit knowledge) and is much harder to capture as part of a JSON schema.
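A sketch of that resolution logic for the mortgage example, with illustrative fields: duplicative values (the same salary appearing in both a payslip and a tax return) collapse to one, different income types sum per applicant (additive), and inconsistent duplicates are surfaced rather than silently merged.

```python
# Extraction results from multiple documents (illustrative data).
extractions = [
    {"applicant": "A", "source": "payslip",    "type": "salary", "amount": 5000},
    {"applicant": "A", "source": "tax_return", "type": "salary", "amount": 5000},
    {"applicant": "A", "source": "statement",  "type": "rental", "amount": 1200},
    {"applicant": "B", "source": "payslip",    "type": "salary", "amount": 4000},
]

def resolve_income(rows):
    # Collapse duplicative values per (applicant, income type), flagging
    # disagreements, then sum the distinct income types per applicant.
    per_type = {}
    for r in rows:
        key = (r["applicant"], r["type"])
        if key in per_type and per_type[key] != r["amount"]:
            raise ValueError(f"inconsistent values for {key}")
        per_type[key] = r["amount"]
    totals = {}
    for (applicant, _), amount in per_type.items():
        totals[applicant] = totals.get(applicant, 0) + amount
    return totals

print(resolve_income(extractions))   # {'A': 6200, 'B': 4000}
```

Even this toy version encodes business decisions (what counts as a duplicate, what to do on conflict) that rarely fit inside the extraction schema itself.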
As the workflow requirements crystallize and evolve, the entire cycle of underlying extraction definitions and downstream workflow/combination logic requires refinement. Tracking the logic versions, allowing business users to experiment, and keeping an ongoing communication channel open with IT adds ongoing maintenance costs.
Validation
The results need to be auditable all the way from the final workflow outputs down to the individual extraction values. While models and software providers have been improving rapidly, there are still a number of different sources of error that require human oversight:
OCR/LLM errors due to non-determinism or messy inputs that lead to incorrect values with downstream impact
Incomplete or incorrect definitions: business rules, including tacit knowledge, are hard to capture; this can lead to
False negatives: values missed because the LLM did not perceive them as within the scope of the instructions
False positives: values that should have been ignored according to "common sense"
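One way to support that auditability is to attach provenance to every extracted value, so a reviewer can trace a workflow output back to the document, page, and model run that produced it. A minimal sketch with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractedValue:
    field: str        # which target field this value fills
    value: str
    source_doc: str   # document the value was extracted from
    page: int         # page (or region) within the document
    model: str        # which model/version produced it
    reviewed: bool = False

def audit_trail(values):
    """Render a human-readable trace line per extracted value."""
    return [f"{v.field}={v.value!r} <- {v.source_doc} p.{v.page} ({v.model})"
            for v in values]

vals = [ExtractedValue("total_amount", "1,250.00", "inv-001.pdf", 1, "model-x")]
print(audit_trail(vals)[0])
# total_amount='1,250.00' <- inv-001.pdf p.1 (model-x)
```

Keeping this record per value (rather than per document) is what makes it possible to validate individual extractions, not just whole workflow results.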
Maintenance
There are a few levels of maintenance that come with building such a solution in-house:
Infrastructure: engineers will need to ensure upkeep, be available for support, and deal with cloud provider platform and hosting changes
Product: support for bug fixing, feature requests, product FAQs, and internal training
LLM engineering: see above
Configuration for business rules: see above
Parsewise
These challenges are the reason why Parsewise exists. We can help with the end-to-end process and empower business users to do the refinement work independently. This leads to significantly faster results and business impact.
© Parsewise Inc. 2025. All rights reserved.