Most document extraction pipelines are three jobs stitched together: OCR to read the text from an image, a parser to organize the layout, then an LLM call to pull out specific fields. NuExtract3 tries to collapse all three into a single model.
NuExtract3 is a 4-billion-parameter vision-language model (VLM) - a model that processes both images and text at the same time - built specifically for pulling structured data out of documents. It handles OCR (reading text out of images rather than from machine-readable text files), Markdown conversion, and structured extraction in one pass. Feed it a scanned invoice or a photographed form, and get back clean Markdown or JSON with the relevant fields separated out.
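To make "JSON with the relevant fields separated out" concrete, here is a minimal sketch of checking an extraction result against the fields you expect. The field names and the sample response are hypothetical, chosen only for illustration. They do not reflect NuExtract3's actual template or output format.

```python
import json

# Hypothetical invoice fields for illustration; not NuExtract3's real schema.
EXPECTED_FIELDS = {"invoice_number", "date", "total"}

def parse_extraction(raw: str) -> dict:
    """Parse a JSON response and fail loudly if expected fields are missing."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = EXPECTED_FIELDS - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data

# Made-up response text standing in for whatever the model returns.
sample = '{"invoice_number": "INV-001", "date": "2024-01-31", "total": 125.5}'
record = parse_extraction(sample)
print(record["invoice_number"], record["total"])
```

A check like this is worth keeping between the model and any downstream system, whichever extraction tool produces the JSON.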
The Self-Hosting Argument
The model is released as open weights, meaning you can download the model parameters and run everything locally. At 4 billion parameters, the weights take roughly 8 GB at 16-bit precision (less if quantized), so the model fits on a single mid-range GPU - an NVIDIA RTX 3080 or similar consumer card - without a dedicated server cluster.
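As a rough check on that sizing claim, weight memory is the parameter count times the bytes per parameter. The sketch below assumes exactly 4 billion parameters and a few common precisions. It ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
# Back-of-envelope estimate of weight memory only.
# Assumptions (not taken from the model's documentation): exactly 4e9
# parameters, and the byte widths listed below for each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gib(num_params: float, precision: str) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gib(4e9, precision):.2f} GiB")
```

At 16-bit precision this comes to about 7.5 GiB, which is why a 10 GB consumer card is plausible but leaves limited headroom for large images or long outputs.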
For document extraction specifically, that matters. Invoice processing, contract review, patient intake forms, customer records - these are exactly the workflows where companies are reluctant to route data through third-party APIs. Running NuExtract3 on your own infrastructure keeps sensitive documents off external servers.
Commercial alternatives like Amazon Textract or Azure Document Intelligence are managed services with per-page pricing. They work well, but using them means your data leaves your environment. For teams with compliance constraints, or simply a preference for controlling their own stack, a 4B model you can run on a $500 GPU changes the economics.
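One way to put a number on "changes the economics" is a break-even page count. The sketch below is illustrative only: the per-page price is a made-up placeholder, not a quote from any vendor, and electricity, maintenance, and engineering time are left out.

```python
# Break-even sketch: the page volume at which a one-time GPU purchase
# equals cumulative per-page API fees.
# The per-page price used below is a hypothetical placeholder.

def break_even_pages(hardware_cost_usd: float, price_per_page_usd: float) -> float:
    """Pages at which cumulative per-page fees equal the hardware cost."""
    if price_per_page_usd <= 0:
        raise ValueError("price_per_page_usd must be positive")
    return hardware_cost_usd / price_per_page_usd

gpu_cost = 500.0          # the $500 figure from the text
placeholder_price = 0.01  # hypothetical: one cent per page
pages = break_even_pages(gpu_cost, placeholder_price)
print(f"Break-even at about {pages:,.0f} pages")
```

Substitute the current published rate for whichever service you are comparing against. The point is the shape of the calculation, not the specific result.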
What 4B Parameters Gets You
NuExtract3 is not a general-purpose reasoning model. It won't write your emails or summarize a quarterly report. It's a narrow specialist: structured extraction from document images.
For that specific job, a purpose-built 4B model running locally can match or beat a general-purpose large model that bills by the API call and processes your client data off-site. The tradeoff is setup time and hardware cost upfront versus ongoing API spend and data sovereignty concerns.
The vision-language capability is the practical differentiator versus text-only extraction tools. Scanned archives, photographed receipts, and handwritten forms are all fair game - not just clean PDFs with readable text layers.