In this test, the open-source framework unstructured is used to evaluate text extraction from structured documents. The goal is to assess how suitable unstructured is for practical use in AI-based information systems – especially with respect to text extraction, semantic preparation (chunking/tokenization) and subsequent embedding generation for vector-based retrieval systems.

Here is an example of a PDF file that was used for analysis with unstructured.

pm-partnerschaft-stackit (Download)

To run the unstructured library, the official Docker image is used. It contains all required dependencies (e.g. Tesseract, Poppler, Python libraries) and allows immediate use without a local Python installation.
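As a sketch, starting the container and mounting a local folder with PDFs can look roughly like this. The image name is an assumption (public unstructured images have been published under quay.io/unstructured-io/unstructured); adjust the tag to whatever your registry provides.

```python
# Sketch: build the docker command used to start the unstructured container
# interactively. The image name below is an assumption and may differ.
IMAGE = "quay.io/unstructured-io/unstructured:latest"

def build_docker_cmd(workdir: str) -> list[str]:
    """Return a docker command that mounts `workdir` and opens a shell."""
    return [
        "docker", "run", "--rm", "-it",
        "-v", f"{workdir}:/data",  # make local PDFs visible inside the container
        IMAGE, "bash",
    ]

if __name__ == "__main__":
    # Only construct the command here; actually running it requires Docker.
    print(" ".join(build_docker_cmd(".")))
```

From the shell inside the container, the library and all OCR dependencies (Tesseract, Poppler) are immediately available.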

As part of an initial test, the unstructured tool was run interactively in the Docker container to evaluate the quality of text recognition and structural analysis of a sample PDF. The goal was to examine how reliably the framework recognizes content such as headings, paragraphs and running text and assigns them semantic types like Title or NarrativeText. The analysis showed that unstructured was able to reconstruct the document’s logical structure largely correctly. The test provides the basis for further steps such as semantic chunking and preparing the data for embedding-based retrieval systems.

In the next step, I automated the text extraction using a Python script. The code first automatically detects the document’s language via “langdetect”, then applies the appropriate OCR configuration and performs the structured analysis via unstructured – including classification and output of the detected content.
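A minimal sketch of that pipeline is shown below. langdetect returns ISO 639-1 codes ("de", "en"), while Tesseract expects ISO 639-3 language packs ("deu", "eng"), so a small mapping sits in between. The partition_pdf arguments follow recent unstructured releases and may differ in older versions; the exact script used here is not reproduced.

```python
# Sketch: detect the document language, then run unstructured's structured
# analysis with a matching OCR configuration.

# Map ISO 639-1 codes (as returned by langdetect) to Tesseract language packs.
TESSERACT_LANGS = {"de": "deu", "en": "eng", "fr": "fra"}

def to_tesseract_lang(code: str) -> str:
    """Translate a langdetect code into a Tesseract language, default English."""
    return TESSERACT_LANGS.get(code, "eng")

def analyze_pdf(path: str):
    # Third-party imports kept local so the sketch stays importable without them.
    from langdetect import detect                         # pip install langdetect
    from unstructured.partition.pdf import partition_pdf  # pip install "unstructured[pdf]"

    # First pass: quick extraction just to sample enough text for detection.
    sample = " ".join(el.text for el in partition_pdf(filename=path)[:10])
    lang = to_tesseract_lang(detect(sample))

    # Second pass: full layout analysis with OCR in the detected language.
    elements = partition_pdf(filename=path, strategy="hi_res", languages=[lang])
    for el in elements:
        print(type(el).__name__, "->", el.text[:80])
    return elements
```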

In the extracted example, it is clear how unstructured differentiates and classifies the various text blocks. The semantic typing is plausible in most cases: headings such as “OPITZ CONSULTING UND STACKIT WERDEN CLOUD-PARTNER” are correctly recognized as Title, running-text paragraphs as NarrativeText, and layout elements such as page numbers or footers as generic text.
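This classified output can then be post-processed by type. The sketch below groups elements by their semantic type, using the "type" and "text" fields that unstructured's JSON output provides per element; the element texts here are illustrative stand-ins, not the actual extraction result.

```python
from collections import defaultdict

# Illustrative elements in the shape of unstructured's JSON output
# ("type" and "text" per element); the texts are stand-ins.
elements = [
    {"type": "Title", "text": "OPITZ CONSULTING UND STACKIT WERDEN CLOUD-PARTNER"},
    {"type": "NarrativeText", "text": "Both companies announced a partnership ..."},
    {"type": "Text", "text": "Seite 1 von 2"},  # layout element, e.g. a footer
]

def group_by_type(elements: list[dict]) -> dict[str, list[str]]:
    """Bucket element texts by their semantic type."""
    grouped = defaultdict(list)
    for el in elements:
        grouped[el["type"]].append(el["text"])
    return dict(grouped)

grouped = group_by_type(elements)
print(grouped["Title"])
```

Filtering on types like Title or NarrativeText in this way is the natural entry point for later chunking, since layout noise (page numbers, footers) can simply be dropped.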

Overall, the analysis shows that unstructured already provides a solid foundation for structuring documents with a simple layout – but only when using the hi_res strategy, since the results of the default processing are significantly less accurate and more fragmented.

PDF documents often contain more than meets the eye. Especially in scanned or automatically processed PDFs, so-called “hidden text layers” can be embedded – invisible text overlaid on the visible image of the page. These arise, for example, through OCR software such as Adobe Acrobat, which stores the recognized characters as machine-readable text without making them visible to the reader.

Such hidden texts can be read and processed by PDF analysis tools like unstructured – even if they do not appear in common PDF viewers. To check whether a PDF contains such embedded texts, the tool pdftotext from the poppler-utils package is suitable. With a simple command, it is possible to extract all machine-readable text and make it visible in a .txt file.
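A small sketch of that check, wrapping the pdftotext call: writing the embedded text layer next to the PDF as a .txt file. Only the command is constructed and tested here; executing it requires poppler-utils to be installed.

```python
# Sketch: dump any embedded (possibly hidden) text layer with pdftotext
# from poppler-utils.
import subprocess
from pathlib import Path

def pdftotext_cmd(pdf: str) -> list[str]:
    """Build `pdftotext -layout input.pdf input.txt` for a given PDF."""
    out = str(Path(pdf).with_suffix(".txt"))
    return ["pdftotext", "-layout", pdf, out]  # -layout keeps the page layout

def extract_text_layer(pdf: str) -> None:
    """Run pdftotext; the resulting .txt shows all machine-readable text."""
    subprocess.run(pdftotext_cmd(pdf), check=True)
```

If the resulting .txt file already contains the full text, the PDF carries an embedded text layer and no OCR pass is strictly necessary.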

This is particularly helpful when unexpected or seemingly “incorrect” contents appear in the analysis output – because they do not always stem from OCR errors, but sometimes from invisible text information in the original document.

At this point, I had originally intended to continue with chunking, but in the course of the analysis I found that unstructured’s chunking functionality is very rudimentary. For semantically clean segmentation, I therefore prefer a library such as NLTK and will not pursue unstructured further for now.
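As a sketch of the direction this would take, the following packs whole sentences into fixed-size chunks so that no chunk cuts through a sentence. The regex splitter is a stdlib stand-in; in practice NLTK’s sent_tokenize (e.g. with language="german") would replace it, as it handles abbreviations far more robustly.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive stand-in for nltk.tokenize.sent_tokenize: split after ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    chunks, current = [], ""
    for sent in split_sentences(text):
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)   # current chunk is full, start a new one
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Chunks produced this way can then be fed directly into an embedding model for vector-based retrieval.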