Introduction
Unstructured.io is an open-source framework for the structured preparation of unstructured documents such as PDFs, Word files, HTML pages, or emails. Its goal is to extract semantically usable content from these heterogeneous formats—such as headings, paragraphs, tables, or lists—and convert it into a unified, machine-readable format. The main use case lies in preparing text data for downstream AI processing, particularly for retrieval-augmented generation (RAG) systems.
Typical applications include document analysis, knowledge management, and preparing inputs for embedding models. Processing happens in several steps; the following four form the core of the Unstructured.io pipeline and are executed in every regular use of the library.
Partitioning: Splitting the document into logical content elements such as paragraphs, headings, or tables.
Cleaning: Removing irrelevant components such as headers, footers, or watermarks.
Extracting: Extracting the raw text content along with accompanying metadata.
Staging: Formatting this information into a consistent structured format (e.g. JSON or Markdown).
In addition, there are optional extensions:
Chunking: Longer content is split into smaller text units that are better suited for embedding models. This step is not strictly part of the core library but is necessary for many AI applications.
Embedding: The resulting text chunks are sent to external models (e.g. OpenAI, HuggingFace, SentenceTransformers) to generate vector representations. This step occurs outside of Unstructured and must be implemented by the user or supplemented via existing example pipelines.
Thus, Unstructured.io handles document preprocessing up to structured text output. For a complete RAG pipeline—including embedding, vector database, retrieval logic, and language model—additional tools are required. Unstructured therefore represents the initial, preparatory stage of such an architecture.
Example PDF Documents for Experimenting
https://github.com/Unstructured-IO/unstructured/tree/main/example-docs/pdf
Supported Data Types
| Category | File Types |
|---|---|
| Apple | .cwk, .mcw |
| CSV | .csv |
| Data Interchange | .dif* |
| dBase | .dbf |
| Email | .eml, .msg, .p7s |
| EPUB | .epub |
| HTML | .htm, .html |
| Image | .bmp, .heic, .jpeg, .jpg, .png, .prn, .tiff |
| Markdown | .md |
| OpenOffice | .odt |
| Org Mode | .org |
| Other | .eth, .pbd, .sdp |
| Plain text | .txt |
| PowerPoint | .pot, .ppt, .pptm, .pptx |
| reStructured Text | .rst |
| Rich Text | .rtf |
| Spreadsheet | .et, .fods, .mw, .xls, .xlsx |
| StarOffice | .sxg |
| TSV | .tsv |
| Word processing | .abw, .doc, .docm, .docx, .dot, .dotm, .hwp, .zabw |
| XML | .xml |
Required External Libraries
| Library | Description |
|---|---|
| libmagic-dev | Detects the exact type of a file (e.g. whether it is a PDF, a JPG, etc.). |
| poppler-utils and tesseract-ocr | Required for processing images and PDFs, especially for text recognition (OCR). |
| tesseract-lang | Provides the necessary language packages for text recognition in additional languages. |
| libreoffice | Used for processing and reading documents in Microsoft Office formats (Word, PowerPoint, etc.). |
| pandoc | Converts various document formats and is used here for .epub, .odt, and .rtf files. |
Data Pipeline
The unstructured.io workflow is a multi-stage data pipeline that systematically transforms raw documents into AI-ready information. The process begins with Partitioning, where a document is split into its logical elements such as titles, paragraphs, and tables. The subsequent Cleaning step removes disruptive content like headers or footers from these elements. Next comes Extracting, where the pure text content and key metadata are extracted from the elements. The Staging stage then structures these extracted data for the following phases. In Chunking, longer, contiguous texts are divided into smaller, model-optimized sections. The final step, Embedding, converts these text chunks into numerical vectors that can be processed and searched by AI systems such as RAG applications.
[ Partitioning ] > [ Cleaning ] > [ Extracting ] > [ Staging ] > [ Chunking ] > [ Embedding ]
Core Pipeline Steps (Mandatory Steps)
[ Partitioning ] > [ Cleaning ] > [ Extracting ] > [ Staging ]
Partitioning
The unstructured library offers partitioning functions to break raw documents into structured building blocks such as titles, body text, or list items. This allows users to specifically select only the content they need for their task, for example only the body text for training a summarization model.
| Document Type | Partitioning Function | Strategies | Table Support | Options |
|---|---|---|---|---|
| CSV files (.csv) | partition_csv | N/A | Yes | None |
| E-mails (.eml) | partition_email | N/A | No | Encoding; Include Headers; Max Partition; Process Attachments |
| E-mails (.msg) | partition_msg | N/A | No | Encoding; Max Partition; Process Attachments |
| EPUBs (.epub) | partition_epub | N/A | Yes | Include Page Breaks |
| Excel documents (.xlsx/.xls) | partition_xlsx | N/A | Yes | None |
| HTML pages (.html/.htm) | partition_html | N/A | No | Encoding; Include Page Breaks |
| Images (.png/.jpg/.jpeg/.tiff/.bmp/.heic) | partition_image | “auto”, “hi_res”, “ocr_only” | Yes | Encoding; Include Page Breaks; Infer Table Structure; OCR Languages; Strategy |
| Markdown (.md) | partition_md | N/A | Yes | Include Page Breaks |
| Org Mode (.org) | partition_org | N/A | Yes | Include Page Breaks |
| OpenOffice documents (.odt) | partition_odt | N/A | Yes | None |
| PDFs (.pdf) | partition_pdf | “auto”, “fast”, “hi_res”, “ocr_only” | Yes | Encoding; Include Page Breaks; Infer Table Structure; Max Partition; OCR Languages; Strategy |
| Plain text (.txt/.text/.log) | partition_text | N/A | No | Encoding; Max Partition; Paragraph Grouper |
| PowerPoints (.ppt) | partition_ppt | N/A | Yes | Include Page Breaks |
| PowerPoints (.pptx) | partition_pptx | N/A | Yes | Include Page Breaks |
| reStructured Text (.rst) | partition_rst | N/A | Yes | Include Page Breaks |
| Rich Text Files (.rtf) | partition_rtf | N/A | Yes | Include Page Breaks |
| TSV files (.tsv) | partition_tsv | N/A | Yes | None |
| Word documents (.doc) | partition_doc | N/A | Yes | Include Page Breaks |
| Word documents (.docx) | partition_docx | N/A | Yes | Include Page Breaks |
| XML documents (.xml) | partition_xml | N/A | No | Encoding; Max Partition; XML Keep Tags |
| Code files (.js/.py/.java/etc.) | partition_text | N/A | No | Encoding; Max Partition; Paragraph Grouper |
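Selecting only the content needed for a task, as described above, can be sketched without the library itself. Assuming elements arrive in the JSON shape that partition functions emit (a `type`, a `text`, and a `metadata` object), picking out only the body text for a summarization model is a simple filter; the element values below are made up for demonstration:

```python
# Illustrative sketch: sample elements in the JSON shape that
# partition functions emit (type / text / metadata).
elements = [
    {"type": "Title", "text": "Quarterly Report", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Revenue grew by 12 percent.", "metadata": {"page_number": 1}},
    {"type": "Footer", "text": "Page 1 of 10", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Costs remained stable.", "metadata": {"page_number": 2}},
]

# Keep only the body text, e.g. as input for a summarization model.
body_text = [el["text"] for el in elements if el["type"] == "NarrativeText"]
```

The same pattern works for any element type, such as collecting all `Table` elements for structured extraction.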
Example *.PDF file for testing the unstructured.io library:
pm-partnerschaft-stackit
The partition function has analyzed the above PDF document. Instead of viewing the document only as a long, unformatted text sequence, it interpreted the visual structure and layout of the page to identify individual logical building blocks (Document Elements), e.g. “Title” and “NarrativeText”.
Explanation of each element type from the PDF, e.g. Title:
| Tag (Key) | JSON Example Value | Explanation |
|---|---|---|
| type | “Title” | Specifies the classified type of the element. unstructured identifies various types such as Title, NarrativeText, ListItem etc., based on the document layout. |
| element_id | “42a9c358d559f0be7034cffd155ae0b4” | A unique, auto-generated ID for each individual element. This is useful for referencing elements. |
| text | “OPITZ CONSULTING UND STACKIT…” | The raw text content extracted from the recognized element. |
| metadata | {…} | An object serving as a container for all additional data (metadata) that describe the element. All following tags are part of this metadata object. |
| metadata.detection_class_prob | 0.6857117414474487 | The “detection class probability” is a confidence score between 0 and 1. It indicates how certain the model is that the classification under type (here: “Title”) is correct; a value of 1 means the highest possible certainty. Choosing a more thorough strategy, e.g. “hi_res”, tends to produce higher confidence scores. Strategies: - “fast”: Quick but less accurate. Often results in lower confidence scores. - “hi_res”: Uses complex models for visual analysis of the document. This is slower but recognizes layouts, titles, and paragraphs much more reliably, generally leading to higher confidence scores. - “ocr_only”: Forces OCR on all pages, even if digital text is present. Useful for “broken” PDFs, but for layout detection “hi_res” is superior. |
| metadata.coordinates | {…} | An object containing all information about the positioning of the element on the page. |
| metadata.coordinates.points | [[197.0, 771.66], [197.0, 903.39], …] | A list of [X, Y] coordinate pairs that define a bounding box around the element. This describes the exact position and size of the element on the page. |
| metadata.coordinates.system | “PixelSpace” | Specifies the coordinate system used. “PixelSpace” means that the values in points, layout_width, and layout_height are measured in pixels. |
| metadata.coordinates.layout_width | 1654 | The total width of the document page in the unit of the specified system (here: pixels). |
| metadata.coordinates.layout_height | 2339 | The total height of the document page in the unit of the specified system (here: pixels). |
| metadata.last_modified | “2025-05-31T08:21:16” | The timestamp of the last modification of the source file, if available. |
| metadata.filetype | “application/pdf” | The MIME type of the source file, indicating the file format (e.g. PDF, DOCX, HTML). |
| metadata.languages | [“deu”, “eng”] | A list of languages detected in the text of the element, given as three-letter ISO 639-2 codes (e.g. “deu” for German, “eng” for English). |
| metadata.page_number | 1 | The page number in the original document where this element was found. |
| metadata.filename | “test_oc.pdf” | The filename of the processed source file. |
| metadata.parent_id | (not in this example, but relevant) | This key would appear here if the element were a child element. Its value would be the element_id of the parent element (e.g. the ID of the title for a following paragraph). |
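The element_id/parent_id pair described in the table lets you reattach paragraphs to their headings. A stdlib-only sketch, assuming elements in the JSON shape shown above (the IDs are shortened, made-up values for readability):

```python
# Sample elements with shortened, made-up IDs; a NarrativeText
# element references its heading via metadata.parent_id.
elements = [
    {"type": "Title", "element_id": "t1", "text": "Introduction",
     "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "element_id": "n1", "text": "First paragraph.",
     "metadata": {"page_number": 1, "parent_id": "t1"}},
    {"type": "NarrativeText", "element_id": "n2", "text": "Second paragraph.",
     "metadata": {"page_number": 2, "parent_id": "t1"}},
]

# Index elements by ID, then group each element under its parent title.
by_id = {el["element_id"]: el for el in elements}
sections = {}
for el in elements:
    parent = el["metadata"].get("parent_id")
    if parent in by_id:
        sections.setdefault(by_id[parent]["text"], []).append(el["text"])
```

Note that the section spans multiple pages here: the parent/child relation is independent of page_number.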
The following table lists the main document element types, such as Header, Image, or NarrativeText. The complete list of element types is available here: Link
| Element Type | Description |
|---|---|
| Address | A text element for capturing physical addresses. |
| CodeSnippet | A text element for capturing code snippets. |
| EmailAddress | A text element for capturing email addresses. |
| FigureCaption | An element for capturing text belonging to image captions. |
| Footer | An element for capturing document footers. |
| FormKeysValues | An element for capturing key-value pairs in a form. |
| Formula | An element containing formulas in a file. |
| Header | An element for capturing document headers. |
| Image | A text element for capturing image metadata. |
| ListItem | ListItem is a NarrativeText element that is part of a list. |
| NarrativeText | NarrativeText is an element consisting of multiple well-formed sentences. This excludes elements such as titles, headers, footers, and figure captions. |
| PageBreak | An element for capturing page breaks. |
| PageNumber | An element for capturing page numbers. |
| Table | An element for capturing tables. |
| Title | A text element for capturing titles. |
| UncategorizedText | Base element for capturing free text from files. Applies to extracted text not associated with bounding boxes. |
Cleaning
The cleaning function is used to remove unwanted content from documents to prepare the data for downstream tasks such as processing by language models (LLMs). The goal is to obtain “clean” and more relevant data.
The following table provides an overview of the different cleaning functions. Source: Link
| Name | Explanation |
|---|---|
| bytes_string_to_string | Converts a bytes string into a normal string. You can imagine it like this: A computer does not store an emoji like 😊 as an image but as a sequence of bytes—a so-called “byte string.” For 😊 the code is, for example, b'\xf0\x9f\x98\x8a'. When unstructured processes an HTML file, the parser may encounter such a special character. Instead of correctly reading the character and inserting 😊 into the text, sometimes a kind of “description of the byte code” is output as normal text. The result is then a string that looks like "b'\xf0\x9f\x98\x8a'". This is no longer a real byte string but just normal text that looks like one. The function bytes_string_to_string acts as a repair tool for this problem: it recognizes the b'…' pattern in text and converts it back into the original correct character. |
| clean | The clean function cleans a text fragment by combining several specific cleaning actions into a single call. You can control which cleanups are performed via simple flags (True or False). The available options are: - bullets=True: Removes bullet characters (e.g. ● or *) at the beginning of the text. - extra_whitespace=True: Removes redundant whitespace, for example multiple spaces between words. - dashes=True: Cleans various types of hyphens and dashes. - trailing_punctuation=True: Removes punctuation at the end of the text. - lowercase=True: Converts all text to lowercase. Example: clean("● An excellent point!", bullets=True, lowercase=True) |
| clean_bullets | Removes leading bullet characters from the beginning of a text. |
| clean_dashes | Replaces various types of dashes (e.g. em dash, en dash) with a standard hyphen. |
| clean_non_ascii_chars | Removes all non-ASCII characters from a text. ASCII includes: - English alphabet letters (A-Z, a-z) - Numbers (0-9) - Basic punctuation and special characters like ! ? @ $ & All other characters are considered non-ASCII. This includes: - German umlauts (ä, ö, ü, ß) - Symbols like €, ®, ©, ● - Emojis like 👍 or 😊 For German texts, clean_non_ascii_chars should not be enabled. |
| clean_ordered_bullets | Removes ordered list markers like “1.”, “a.)” or “i)” from the beginning of a text. |
| clean_postfix | Checks the end of a string and, if it matches a defined pattern (usually a regular expression), removes that part. - pattern: The pattern (e.g. r"(END|STOP)") to search for and remove at the end of the text. - ignore_case=True: Ignores case when matching (so END also matches end). (Default is False) - strip=True: Removes any remaining whitespace at the end after deleting the matched pattern. (Default is True) |
| clean_prefix | Removes a specified prefix from a text if it is present. |
| clean_trailing_punctuation | Removes punctuation at the end of a text but leaves punctuation within the text intact. |
| group_broken_paragraphs | Joins lines of text that have been separated by line breaks but actually belong to the same paragraph. Very useful for texts extracted from PDFs. In other words, this function “repairs” paragraphs that have been split by line breaks (\n) for visual or formatting reasons. |
| remove_punctuation | Removes all punctuation characters (e.g. , . ; ! ?) from a text. |
| replace_unicode_quotes | Replaces outdated or problematic Unicode codes for quotation marks with modern, typographically correct “smart quotes.” Sometimes texts copied from older systems or programs like Microsoft Word use special control characters or codes (e.g. \x91, \x93) instead of standard quotation marks. These can cause display issues or disrupt further automated processing. The function replace_unicode_quotes acts as a repair tool to find these specific, outdated codes and convert them into modern smart quotes. |
| translate_text | The translate_text function uses professional translation models (Helsinki NLP) to translate text between many different languages such as Russian, Chinese, German, and more. Parameters: - text: The text to be translated. - source_lang: The language code (e.g. de for German) of the original text. If not specified, the function attempts to detect the language automatically. - target_lang: The language code of the target language. If not specified, the default target is English (en). |
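What a call like clean(text, bullets=True, extra_whitespace=True, lowercase=True) does can be approximated with the standard library. The function below is an illustrative reimplementation of the flag behavior described above, not the library's own code:

```python
import re

def clean_sketch(text, bullets=False, extra_whitespace=False,
                 trailing_punctuation=False, lowercase=False):
    """Approximates unstructured's `clean` flags with stdlib regexes."""
    if bullets:
        # Strip leading bullet characters such as ● or *.
        text = re.sub(r"^\s*[●•*-]+\s*", "", text)
    if extra_whitespace:
        # Collapse runs of whitespace into single spaces.
        text = re.sub(r"\s+", " ", text).strip()
    if trailing_punctuation:
        # Remove punctuation at the very end of the text.
        text = text.rstrip(".,;:!?")
    if lowercase:
        text = text.lower()
    return text

result = clean_sketch("●  An excellent   point!",
                      bullets=True, extra_whitespace=True,
                      trailing_punctuation=True, lowercase=True)
```

Each flag maps to one small, composable transformation, which is exactly why a single combined `clean` call is convenient.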
Extracting
The Extracting step isolates specific information from already cleaned text elements.
For example, if a text contains the sentence “Please contact support@example.com for further information.”, an extraction function like extract_email_address does not process the entire sentence but pulls out only the desired data point, namely support@example.com.
| Name | Explanation |
|---|---|
| extract_datetimetz | Extracts date, time, and timezone from the “Received” fields of an .eml file (email). |
| extract_email_address | Finds and extracts one or more email addresses from a text. |
| extract_ip_address | Extracts IP addresses from a text. |
| extract_ip_address_name | Extracts the names associated with each IP address in the “Received” fields of an .eml file. |
| extract_mapi_id | Extracts the “MAPI ID” from the “Received” fields of an .eml file. |
| extract_ordered_bullets | Extracts text from ordered list markers (e.g. “1.”, “a.)”). Example: extract_ordered_bullets("1.1 This is a very important point") Output: ("1", "1", None) extract_ordered_bullets("a.1 This is a very important point") Output: ("a", "1", None) |
| extract_text_after | Extracts the text that follows a specified pattern or word. Example: text = "SPEAKER 1: Look at me, I'm flying!" extract_text_after(text, r"SPEAKER \d{1}:") Output: "Look at me, I'm flying!" |
| extract_text_before | Extracts the text that precedes a specified pattern or word. |
| extract_us_phone_number | Extracts a phone number in US format from a text segment. |
| group_broken_paragraphs | Joins lines of text that have been separated by line breaks but actually belong to the same paragraph. Very useful for PDFs. This function originates from Cleaning. The only difference is the context or intention with which the function is called. |
| remove_punctuation | Removes all punctuation characters (e.g. , . ; ! ?) from a text. This function originates from Cleaning. The only difference is the context or intention with which the function is called. |
| replace_unicode_quotes | Replaces outdated or problematic Unicode codes for quotation marks with modern smart quotes. This function originates from Cleaning. The only difference is the context or intention with which the function is called. |
| translate_text | Uses professional translation models (Helsinki NLP) to translate text between many languages. Parameters: - text: The text to be translated. - source_lang: The code of the original language (e.g. de for German). If omitted, the function attempts to detect the language. - target_lang: The code of the target language. Defaults to English (en). This function originates from Cleaning. The only difference is the context or intention with which the function is called. |
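The extraction idea can be sketched with plain regular expressions. The patterns below are simplified stand-ins for what functions like extract_email_address and extract_text_after return; the library's own patterns are more thorough:

```python
import re

text = "Please contact support@example.com for further information."

# Simplified email pattern, standing in for extract_email_address.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)

# Text after a pattern, mirroring the SPEAKER example above.
transcript = "SPEAKER 1: Look at me, I'm flying!"
match = re.search(r"SPEAKER \d{1}:", transcript)
after = transcript[match.end():].strip()
```

The point of the extraction step is precisely this reduction: a whole sentence goes in, a single data point comes out.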
Staging
Staging functions in the unstructured package prepare extracted document elements for downstream processing steps. They take a list of structured elements as input, such as Title or NarrativeText. The output is a format-specific dictionary—a structured collection of data in the form of key-value pairs. Each piece of information, such as a text fragment or a metadata entry, is assigned a unique identifier. These identifiers, called keys, can be “text,” “metadata,” or “type.” The goal is to prepare the data so that it can be directly processed by the intended target system.
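The format-specific dictionary described above can be pictured with the standard library: each element becomes a key-value structure with keys such as "type", "text", and "metadata", which is then serialized for the target system. A minimal sketch using json (the element values are made up for demonstration):

```python
import json

# An element as the key-value structure described above.
element = {
    "type": "NarrativeText",
    "text": "OPITZ CONSULTING and STACKIT are partners.",
    "metadata": {"page_number": 1, "filetype": "application/pdf"},
}

# Serialize for a JSON-based target system.
payload = json.dumps([element], ensure_ascii=False)

# Round-trip back into Python objects, e.g. on the consuming side.
restored = json.loads(payload)
```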
Originally, specialized conversion functions were available for different use cases:
- A basic conversion like convert_to_csv for translating into tabular formats
- A conversion for machine learning and NLP platforms like stage_for_transformers for preparing training data
- A conversion for vector databases like stage_for_weaviate for semantic indexing
These functions are now deprecated. Further development focuses on the concept of destination connectors, e.g. for Kafka, Weaviate, or MongoDB. This enables automated transfer of data to external platforms after extraction.
| Connector | Description |
|---|---|
| Astra DB | A cloud-native NoSQL Database-as-a-Service (DBaaS) based on Apache Cassandra. It is optimized for high scalability and performance and includes vector search capabilities, making it useful for AI applications. |
| Azure | Refers to Azure Blob Storage, a Microsoft object storage service for storing large amounts of unstructured data in the cloud. |
| Azure AI Search | A cloud-based search service from Microsoft that provides APIs and tools to integrate advanced search features (including vector and semantic search) into applications. |
| Box | A cloud-based content management and collaboration platform. As a destination, you can transfer data and documents into a secure Box environment. |
| Chroma | An open-source vector database designed for AI and LLM applications, optimized for storing and searching vector embeddings. |
| Couchbase | A distributed NoSQL database optimized for interactive applications. It combines a fast key-value store with a flexible JSON document model and SQL-like queries. |
| Databricks Volumes | A feature within Databricks that enables access to, storage of, and management of non-tabular data (such as images, PDFs, text files) in cloud storage as if it were a local file system. |
| Delta Tables in Amazon S3 | Allows storing data in Delta Lake format directly on Amazon S3. Delta Lake is an open-source storage framework that offers ACID transactions, time travel (data versioning), and scalability for data lakes. |
| Delta Tables in Databricks | Stores data in optimized Delta Lake format within the Databricks platform. This is the native and most performant method for using Delta Tables in Databricks. |
| Dropbox | A cloud storage service that allows users to store and share files online. As a destination connector, it writes files and data into a user’s or organization’s Dropbox folder structure. |
| DuckDB | A columnar, in-process analytical database system (OLAP). It is extremely fast and designed to run directly within an application without a separate server. |
| Elasticsearch | A highly scalable open-source search and analytics engine. It is widely used for full-text search, log analysis, security information, and business analytics. |
| Google Cloud Storage | Google’s object storage service (similar to Amazon S3). It is used for storing and retrieving any amount of data in Google Cloud. |
| IBM watsonx.data | An open data lakehouse service from IBM that enables managing and analyzing data from data warehouses and data lakes with a single query engine, optimized for AI workloads. |
| Kafka | Apache Kafka is a distributed open-source event streaming platform. As a destination, you can write data streams (events) into Kafka topics for real-time consumption by other applications. |
| KDB.AI | A high-performance vector database developed for real-time AI applications such as similarity search, personalization, and retrieval-augmented generation (RAG). |
| LanceDB | An embeddable vector database for AI applications that runs serverless and is optimized for multimodal data (text, images, etc.) and fast, efficient vector search. |
| Local | Refers to storing data on the local filesystem of the machine where the process is running. |
| Milvus | An open-source vector database designed for managing and searching massive volumes of vector embeddings with high performance for similarity search. |
| MongoDB | A leading document-oriented NoSQL database. It stores data in flexible, JSON-like documents, making it popular among developers for modern applications. |
| MotherDuck | A serverless cloud analytics service built on DuckDB. It combines DuckDB’s local speed with the cloud’s scalability and sharing capabilities. |
| Neo4j | A leading graph database. Instead of storing data in tables, it stores data as nodes and relationships, which is ideal for analyzing complex connections. |
| OneDrive | Microsoft’s cloud storage service. As a destination connector, it writes files and data directly to a user’s or organization’s OneDrive cloud. |
| OpenSearch | An AWS-forked open-source search and analytics framework derived from Elasticsearch. It is used for similar use cases such as log analysis and full-text search. |
| Pinecone | A managed, cloud-based vector database that makes it easy for developers to integrate high-performance vector search into AI applications without managing infrastructure. |
| PostgreSQL | A powerful, object-relational open-source database system. It is known for its reliability, robustness, and extensive SQL-standard feature set. |
| Qdrant | An open-source vector database and search engine designed for production environments, providing a simple API for storing and querying vectors. |
| Redis | An extremely fast in-memory database that functions as a key-value store. Commonly used as a cache, message broker, or for real-time applications. |
| S3 | Amazon Simple Storage Service (S3) is a highly scalable object storage service from AWS. It is a de-facto standard for cloud data storage. |
| SFTP | SSH File Transfer Protocol is a secure file transfer protocol. As a destination connector, it allows secure uploads of data to a remote server. |
| SingleStore | A distributed, relational SQL database system known for its high-speed data ingestion, transactions, and queries, supporting both transactional and analytical workloads. |
| Snowflake | A cloud-based data platform offered as Data Warehouse-as-a-Service. It is known for its ability to scale storage and compute independently. |
| SQLite | A serverless, self-contained transactional SQL database engine that is embedded directly within an application. It is the world’s most widely used database system, especially in mobile apps and browsers. |
| Vectara | An end-to-end platform for developers to build GenAI applications focused on retrieval-augmented generation (RAG), minimizing hallucinations through precise hybrid search. |
| Weaviate | An open-source vector database that stores both data objects and their vector representations, enabling a combination of vector search with structured filters. |
The selection of supported platforms is continuously being expanded. If a desired target environment is not included, it can be suggested in the community Slack.
Advanced Pipeline Steps (Optional)
[ Chunking ] > [ Embedding ]
Chunking
Unstructured uses metadata and document elements recognized by partitioning functions to transform elements into more useful “chunks.”
Chunking Strategy
Currently, Unstructured offers two chunking strategies: “basic” and “by_title.”
basic
The “basic” strategy is the simplest and one of the most commonly used methods in the unstructured library. Its main task is to combine sequential document elements to create chunks that are as large as possible without exceeding a set maximum character count:
- max_characters: a hard limit that must not be exceeded
- new_after_n_chars: a soft limit. Once a chunk reaches this length, no further elements are added to it and a new chunk is started, even if the hard limit has not been reached; a single element may still exceed this threshold as long as the hard limit is respected.
- overlap: if a single element is too large and must be split, this parameter specifies how many characters overlap between the end of one chunk and the beginning of the next. This helps maintain context across chunk boundaries.
- overlap_all: when set to True, overlap is applied not only to oversized, split elements but between all consecutive chunks.
Special case tables: Tables (Table elements) are always treated as standalone chunks and never combined with other elements. If a table itself is too large, it is also split.
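The interplay of hard and soft limits can be sketched in a few lines. This is an illustrative reimplementation of the idea, not the library's algorithm, and it operates on plain strings rather than document elements:

```python
def basic_chunks(texts, max_characters=60, new_after_n_chars=45):
    """Combine sequential texts into chunks: never exceed the hard
    limit, and stop adding to a chunk once the soft limit is reached."""
    chunks, current = [], ""
    for text in texts:
        candidate = (current + " " + text).strip()
        # Start a new chunk if the hard limit would be exceeded,
        # or if the current chunk already passed the soft limit.
        if current and (len(candidate) > max_characters
                        or len(current) >= new_after_n_chars):
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

texts = ["First sentence.", "Second sentence.", "Third sentence.",
         "Fourth sentence."]
chunks = basic_chunks(texts, max_characters=40, new_after_n_chars=30)
```

With a hard limit of 40, the four sentences collapse into two chunks of two sentences each; no chunk exceeds the hard limit.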
The “basic” strategy reaches its limits on highly structured documents such as reports, scientific papers, or manuals.
by_title
The by_title strategy inherits all basic strategy behaviors (such as respecting max_characters) but adds three crucial new rules:
The strategy identifies Title elements (headings) as the start of a new section. When the algorithm encounters a heading, it performs the following actions:
- The current chunk is immediately closed.
- A new chunk is started beginning with this Title element.
This occurs even if the heading text would have fit perfectly into the previous chunk.
By default, the strategy does not treat page breaks as hard boundaries. This means a section can span multiple pages without being split into a new chunk. This behavior is controlled by the parameter multipage_sections=True (default).
Sometimes very short texts, such as individual list entries, are mistakenly identified as Title elements. This can lead to a flood of tiny, unwanted chunks. To address this, there is the combine_text_under_n_chars parameter:
- This parameter allows multiple consecutive small sections to be combined into a single chunk to best fill the chunking window (max_characters).
- By default, combine_text_under_n_chars is the same as max_characters. This ensures that small sections are efficiently grouped.
- A value of 0 means every tiny, title-identified section is defined as a new chunk.
The by_title strategy is the better choice for documents with a clear hierarchical structure such as reports, research articles, manuals, or contracts.
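The extra rule of by_title, closing the current chunk whenever a Title element appears, can be added to the same idea. Again an illustrative sketch on simplified element dicts, not the library's own code:

```python
def by_title_chunks(elements, max_characters=80):
    """Combine elements into chunks, but always start a new chunk
    at a Title element, even if the text would still have fit."""
    chunks, current = [], ""
    for el in elements:
        text = el["text"]
        too_long = current and len(current) + 1 + len(text) > max_characters
        if current and (el["type"] == "Title" or too_long):
            chunks.append(current)
            current = text
        else:
            current = (current + " " + text).strip()
    if current:
        chunks.append(current)
    return chunks

elements = [
    {"type": "Title", "text": "Section 1"},
    {"type": "NarrativeText", "text": "Body of section one."},
    {"type": "Title", "text": "Section 2"},
    {"type": "NarrativeText", "text": "Body of section two."},
]
chunks = by_title_chunks(elements)
```

Even though all four texts would fit into one 80-character chunk, the second Title forces a chunk boundary, so each section becomes its own chunk.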
Embedding
After extracting content from documents using the open-source library unstructured, a common downstream step is converting the extracted text elements into vector embeddings. These vectors form the basis for semantic search functions and retrieval-augmented generation (RAG).
The core unstructured library itself does not include native functions for calling embedding providers. Instead, there are two primary methods for generating embeddings for unstructured outputs.
Method 1: The Unstructured ecosystem provides extended functionality in the form of the Unstructured Ingest command-line interface (CLI) and the associated Python library. These components are designed for creating complete end-to-end processing pipelines and offer built-in support for connecting to embedding providers.
In this approach, embedding generation is seamlessly integrated into the data ingestion process (“Ingest Pipeline”). This enables full automation of the workflow from raw document to final vector embedding. Configuration details are available in the official documentation.
Method 2: An alternative method is manually enriching the JSON files produced by unstructured. This approach offers high flexibility, especially in choosing the embedding model, and is suitable for scenarios without a complete ingest pipeline.
The process typically follows a fixed schema:
- Input: A JSON file generated by the unstructured library serves as the input source.
- Reading: The content of the JSON file is loaded into memory as a structured object.
- Embedding Generation: A third-party library, such as sentence-transformers, is used to generate an embedding for the value of the text field of each element in the JSON file. The sentence-transformers/all-MiniLM-L6-v2 model from Hugging Face is a common example.
- Data Enrichment: The generated embedding is added as a new field alongside the corresponding text field in the JSON object.
- Saving: The modified JSON object with the added embeddings is written back to the original file or a new file.
This approach provides full control over the process and can be integrated into existing workflows based on exchanging JSON files.
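The five steps above can be sketched with the standard library. The embed function below is a deterministic toy stand-in for a real model such as sentence-transformers/all-MiniLM-L6-v2; in practice you would replace it with a call to an embedding library:

```python
import hashlib
import json

def embed(text, dim=4):
    """Toy stand-in for a real embedding model: derives a fixed-size
    vector from a hash of the text. Replace with a real model call."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]

# 1) Input: JSON produced by unstructured (inlined here for the sketch).
raw = '[{"type": "NarrativeText", "text": "Costs remained stable."}]'

# 2) Reading: load the JSON into memory as structured objects.
elements = json.loads(raw)

# 3) + 4) Embedding generation and enrichment: add a vector
#         next to each element's text field.
for el in elements:
    el["embeddings"] = embed(el["text"])

# 5) Saving: serialize the enriched elements again (here to a string;
#    in practice back to the original file or a new file).
enriched = json.dumps(elements)
```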
