Building Document Processing In-House
What It Takes to Build and Operate
Gergely Csegzi
16 January 2026

We have seen enterprises decide to build their own unstructured-document solutions in-house. Some of them became customers a few months later, once the challenges were well understood.
In this guide, we give an overview of the techniques involved and highlight the challenges, so that you can make an informed decision about how to solve your document-based use cases.
We will cover:
High-level steps of a successful pipeline
Extraction complexities: Supporting different file types, OCR, LLM engineering (retries, fallbacks, rate limits, providers)
RAG
Business complexities: Defining targets, Combining and structuring results, Continuous improvement
Validation
Maintenance
High-level steps of a successful pipeline
These are the most common steps we see:
Exploration: Which documents are available and where
Ingestion: Connecting to data stores, whether that is common office software (SharePoint, Google Drive, etc.), vertical-specific software (Datasite, Intralinks, etc.), or automated pipelines over object storage (AWS S3, Azure Blob Storage, etc.)
Cleaning: Deduplication, removing broken or unsupported files
Extraction target configuration: Business users define the required information for their use case
Extraction: Getting textual structured data from multimodal unstructured inputs
Storage: Versioned storage of extracted values
Resolving & structuring data: Extraction results usually require resolving (combining, selecting, deriving) to be useful and to be in the shape required for the workflow
Validation: Users need to be able to validate entire workflow results as well as individual extraction results
Export & integration: The validated data needs to flow into other systems with a specific interface contract
Continuous improvements & business rule evolution: Over time, we see that the steps above are changed and repeated
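The steps above can be sketched as a minimal pipeline skeleton. Everything here is an illustrative stand-in, not a prescribed design: the `Document` shape, the stage functions, and the schema are all simplified placeholders for real connectors, extraction models, and review tooling.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    source: str                              # e.g. a SharePoint or S3 URI
    raw: bytes = b""
    extracted: dict = field(default_factory=dict)
    validated: bool = False

# Minimal stand-ins for each stage; real implementations would talk to
# the source system, an OCR service, an LLM provider, etc.
def ingest(doc):
    doc.raw = b"%PDF- ..."                   # fetch bytes from the data store
    return doc

def clean(doc):
    return len(doc.raw) > 0                  # drop empty or broken files

def extract(doc, schema):
    # Pretend extraction: return a value for every field in the target schema.
    return {name: "<extracted>" for name in schema["properties"]}

def resolve(values):
    return values                            # combine / select / derive in real use

def run_pipeline(doc, schema):
    doc = ingest(doc)
    if not clean(doc):
        return doc
    doc.extracted = resolve(extract(doc, schema))
    doc.validated = True                     # placeholder for human review
    return doc                               # export/integration would follow

schema = {"properties": {"invoice_number": {}, "total": {}}}
result = run_pipeline(Document(source="s3://bucket/inv-001.pdf"), schema)
print(sorted(result.extracted))              # ['invoice_number', 'total']
```

The point of the sketch is the ordering and the data flow between stages, not any individual implementation.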
Extraction complexities
Depending on the use case, the most common extraction challenges that we see are supporting various file types, integrating OCR solutions, or building LLM infrastructure.
Supporting many file types is problematic because their internal representations differ significantly, so a single unified pipeline for all extractions and downstream use is hard to build; on top of that, OCR solutions and LLM API endpoints each support only specific file types. The most common formats we see are PDFs and the Office suite (docx, xlsx, pptx). These share a similar-shaped challenge: most of their content is well-structured digital data that is easy to extract, but they also carry information in images, annotations, and complex layouts that traditional OCR tools miss and that visual LLMs can handle.
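One common mitigation is a thin routing layer that dispatches each file to the parsing path its format supports. A minimal sketch, with illustrative route names rather than real parser integrations:

```python
from pathlib import Path

# Hypothetical routing table: digital-first formats go through native parsers
# (with a vision-LLM fallback for embedded images/layouts), scans go to
# OCR or a vision LLM, and anything else is flagged as unsupported.
ROUTES = {
    ".pdf": "pdf_parser_then_vision_llm",
    ".docx": "office_parser",
    ".xlsx": "office_parser",
    ".pptx": "office_parser",
    ".png": "ocr_or_vision_llm",
    ".jpg": "ocr_or_vision_llm",
}

def route(filename: str) -> str:
    """Pick an extraction path based on the file extension."""
    return ROUTES.get(Path(filename).suffix.lower(), "unsupported")

print(route("contract.PDF"))   # pdf_parser_then_vision_llm
print(route("scan.jpg"))       # ocr_or_vision_llm
print(route("notes.md"))       # unsupported
```

In practice, routing on extension alone is not enough (files can be mislabeled), so content sniffing is usually layered on top.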
Building the right infrastructure for LLMs is challenging for the following reasons:
Retries: Providers are not 100% reliable
Fallbacks: Sometimes there are model quirks or providers experiencing outages or high load, so we commonly see the need to fallback to different models or different providers (e.g., some Gemini models can get stuck infinitely repeating the same characters)
Rate limits: Most providers will have specific rate limits depending on usage
Provider differences: All of the APIs are slightly different, with intricacies around the structure of both inputs and outputs (whether that is parameters, input formats, resolutions, streaming settings, reasoning settings, structured formats, etc.)
Model lifecycle management: Evaluating and integrating the latest models (and retiring old ones)
Eval suites: New models often come with trade-offs, which requires a robust eval suite to understand them and guide downstream decisions
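A sketch of how retries and provider fallbacks are often combined; `call_model`, the exception type, and the model names are placeholders for a real provider SDK and its error classes:

```python
import time

PROVIDERS = ["primary-model", "fallback-model", "other-provider-model"]

class ProviderError(Exception):
    """Stand-in for a provider's transient error (429, 5xx, timeout)."""

def call_model(model: str, prompt: str) -> str:
    # Simulated provider call: the primary is rate limited, the rest succeed.
    if model == "primary-model":
        raise ProviderError("rate limited")
    return f"answer from {model}"

def call_with_fallback(prompt: str, retries: int = 2) -> str:
    # Retry each model a few times, then fall through to the next provider.
    for model in PROVIDERS:
        for attempt in range(retries):
            try:
                return call_model(model, prompt)
            except ProviderError:
                time.sleep(0)  # real code would back off, e.g. 2 ** attempt
    raise RuntimeError("all providers exhausted")

print(call_with_fallback("Extract the invoice total"))
# answer from fallback-model
```

Real implementations additionally track per-provider rate-limit budgets and normalize the differing request/response shapes mentioned above behind one internal interface.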
RAG
Retrieval augmented generation is a very common technique that we see people rely on, with its own additional set of complexities (chunking, embedding, retrieval strategies). Generally, RAG is used to power search for chatbots, and it is less used for building processing pipelines, so we will not go into too much detail.
When to use: While semantic similarity (from embeddings) is inherently less deterministic than exhaustive extraction, it can be a reasonable fit for large-scale use cases where the cost of missing information is low.
A combination that we have seen work well is using exhaustive extraction pipelines to provide structured metadata. The metadata is used for searching or filtering, which narrows down the results enough such that the context window of the LLM providing the final answer is not overwhelmed.
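A minimal sketch of that pattern, with made-up metadata fields: filter on structured metadata first, and only the surviving documents' text would be placed in the answering LLM's context window.

```python
# Structured metadata produced by an exhaustive extraction pass
# (fields and values are illustrative).
documents = [
    {"id": 1, "doc_type": "lease", "year": 2023, "text": "..."},
    {"id": 2, "doc_type": "invoice", "year": 2024, "text": "..."},
    {"id": 3, "doc_type": "lease", "year": 2024, "text": "..."},
]

def filter_by_metadata(docs, **criteria):
    """Keep only documents whose metadata matches every criterion."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

candidates = filter_by_metadata(documents, doc_type="lease", year=2024)
print([d["id"] for d in candidates])   # [3]
# Only candidates' text is then passed to the LLM for the final answer.
```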
Extraction target configuration, Resolving results & Continuous improvement
Let’s move into the business realm: the reason the entire processing pipeline is being built in the first place.
One of the biggest challenges that we have seen is defining the exact shape and goal of the workflow and keeping it up to date with the business rules evolving over time.
What compounds this difficulty is that the business experts need to communicate this with the IT experts on an ongoing basis.
The initial target definition is usually a simplified view of the business rules that gets mapped into a JSON schema that can then be passed onto an LLM.
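As an illustration, a simplified invoice-extraction target might map to a JSON schema like this (all field names are hypothetical):

```python
import json

# A hypothetical target definition for invoice extraction, expressed as the
# JSON schema that would be passed to the LLM's structured-output interface.
invoice_target = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string", "format": "date"},
        "total_amount": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["invoice_number", "total_amount"],
}

print(json.dumps(invoice_target, indent=2))
```

Note how much the schema flattens: it says nothing about which value wins when two documents disagree, which is exactly the resolution problem discussed next.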
For use cases that go beyond simple single-page extractions (e.g., getting invoice details from a single page), the extraction results require resolving. What this means is that we need to consider what happens when there are no results, missing results, multiple results, and whether a single one needs to be selected, or they need to be combined. Consider a mortgage application where the incomes of the applicants will be reflected in different documents depending on the source of income, time, and the applicant. These values need to be taken together, where careful consideration is required about which values are duplicative, which are additive, and which are potentially inconsistent. This resolution usually cuts deep into the business rules (whether captured or tacit knowledge) and is much harder to capture as part of a JSON schema.
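A sketch of that resolution logic for the mortgage example, with illustrative fields: duplicative values (the same salary appearing in both a payslip and a tax return) collapse to one, different income types sum per applicant (additive), and inconsistent duplicates are surfaced rather than silently merged.

```python
# Extraction results from multiple documents (illustrative data).
extractions = [
    {"applicant": "A", "source": "payslip",    "type": "salary", "amount": 5000},
    {"applicant": "A", "source": "tax_return", "type": "salary", "amount": 5000},
    {"applicant": "A", "source": "statement",  "type": "rental", "amount": 1200},
    {"applicant": "B", "source": "payslip",    "type": "salary", "amount": 4000},
]

def resolve_income(rows):
    # Collapse duplicative values per (applicant, income type), flagging
    # disagreements, then sum the distinct income types per applicant.
    per_type = {}
    for r in rows:
        key = (r["applicant"], r["type"])
        if key in per_type and per_type[key] != r["amount"]:
            raise ValueError(f"inconsistent values for {key}")
        per_type[key] = r["amount"]
    totals = {}
    for (applicant, _), amount in per_type.items():
        totals[applicant] = totals.get(applicant, 0) + amount
    return totals

print(resolve_income(extractions))   # {'A': 6200, 'B': 4000}
```

Even this toy version encodes business decisions (what counts as a duplicate, what to do on conflict) that rarely fit inside the extraction schema itself.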
As the workflow requirements crystallize and evolve, the entire cycle of underlying extraction definitions and downstream workflow/combination logic requires refinement. Tracking the logic versions, allowing business users to experiment, and keeping an ongoing communication channel open with IT adds ongoing maintenance costs.
Validation
The results need to be auditable all the way from the final workflow outputs down to the individual extraction values. While models and software providers have been improving rapidly, there are still a number of different sources of error that require human oversight:
OCR/LLM errors due to non-determinism or messy inputs that lead to incorrect values with downstream impact
Incomplete or incorrect definitions: business rules, including tacit knowledge, are hard to capture; this can lead to
False negatives: values missed because the LLM did not perceive them as within the scope of the instructions
False positives: values that should have been ignored according to "common sense"
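One way to support that auditability is to attach provenance to every extracted value, so a reviewer can trace a workflow output back to the document, page, and model run that produced it. A minimal sketch with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractedValue:
    field: str        # which target field this value fills
    value: str
    source_doc: str   # document the value was extracted from
    page: int         # page (or region) within the document
    model: str        # which model/version produced it
    reviewed: bool = False

def audit_trail(values):
    """Render a human-readable trace line per extracted value."""
    return [f"{v.field}={v.value!r} <- {v.source_doc} p.{v.page} ({v.model})"
            for v in values]

vals = [ExtractedValue("total_amount", "1,250.00", "inv-001.pdf", 1, "model-x")]
print(audit_trail(vals)[0])
# total_amount='1,250.00' <- inv-001.pdf p.1 (model-x)
```

Keeping this record per value (rather than per document) is what makes it possible to validate individual extractions, not just whole workflow results.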
Maintenance
There are a few levels of maintenance that come with building such a solution in-house:
Infrastructure: engineers will need to ensure upkeep, be available for support, and deal with cloud provider platform and hosting changes
Product: support for bug fixing, feature requests, product FAQs, and internal training
LLM engineering: see above
Configuration for business rules: see above
Parsewise
These challenges are the reason why Parsewise exists. We can help with the end-to-end process and empower business users to do the refinement work independently. This leads to significantly faster results and business impact.
© Parsewise Inc. 2025. All rights reserved.