Introduction

Unstructured.io is an open-source framework for the structured preparation of unstructured documents such as PDFs, Word files, HTML pages, or emails. Its goal is to extract semantically usable content from these heterogeneous formats, such as headings, paragraphs, tables, or lists, and convert it into a unified, machine-readable format. The main use case is preparing text data for downstream AI processing, particularly for retrieval-augmented generation (RAG) systems.

Typical applications are document analysis, knowledge management, and preparing inputs for embedding models. Processing happens in several stages; the following four steps form the core of the Unstructured.io pipeline and are executed in every regular use of the library.

  • Partitioning
    Splitting the document into logical content elements such as paragraphs, headings, or tables.

  • Cleaning
    Removing irrelevant components such as headers, footers, or watermarks.

  • Extracting
    Extracting the raw text content along with accompanying metadata.

  • Staging
    Formatting this information into a consistent structured format (e.g. JSON or Markdown).

In addition, there are optional extensions:

  • Chunking: Longer content is split into smaller text units that are better suited for embedding models. This step is not strictly part of the core library but is necessary for many AI applications.

  • Embedding: The resulting text chunks are sent to external models (e.g. OpenAI, HuggingFace, SentenceTransformers) to generate vector representations. This step occurs outside of Unstructured and must be implemented by the user or supplemented via existing example pipelines.

Thus, Unstructured.io handles document preprocessing up to structured text output. For a complete RAG pipeline—including embedding, vector database, retrieval logic, and language model—additional tools are required. Unstructured therefore represents the initial, preparatory stage of such an architecture.

Example PDF Documents for Experimenting

https://github.com/Unstructured-IO/unstructured/tree/main/example-docs/pdf

Supported Data Types

Category           | File Types
Apple              | .cwk, .mcw
CSV                | .csv
Data Interchange   | .dif*
dBase              | .dbf
E-mail             | .eml, .msg, .p7s
EPUB               | .epub
HTML               | .htm, .html
Image              | .bmp, .heic, .jpeg, .jpg, .png, .prn, .tiff
Markdown           | .md
OpenOffice         | .odt
Org Mode           | .org
Other              | .eth, .pbd, .sdp
PDF                | .pdf
Plain text         | .txt
PowerPoint         | .pot, .ppt, .pptm, .pptx
reStructured Text  | .rst
Rich Text          | .rtf
Spreadsheet        | .et, .fods, .mw, .xls, .xlsx
StarOffice         | .sxg
TSV                | .tsv
Word processing    | .abw, .doc, .docm, .docx, .dot, .dotm, .hwp, .zabw
XML                | .xml

Required External Libraries

Library                          | Description
libmagic-dev                     | Detects the exact type of a file (e.g. whether it is a PDF, a JPG, etc.).
poppler-utils and tesseract-ocr  | Required for processing images and PDFs, especially for text recognition (OCR).
tesseract-lang                   | Provides the necessary language packages for text recognition in additional languages.
libreoffice                      | Used for processing and reading documents in Microsoft Office formats (Word, PowerPoint, etc.).
pandoc                           | Converts various document formats; used here for .epub, .odt, and .rtf files.

Data Pipeline

The unstructured.io workflow is a multi-stage data pipeline that systematically transforms raw documents into AI-ready information. The process begins with Partitioning, where a document is split into its logical elements such as titles, paragraphs, and tables. The subsequent Cleaning step removes disruptive content like headers or footers from these elements. Next comes Extracting, where the pure text content and key metadata are extracted from the elements. The Staging stage then structures these extracted data for the following phases. In Chunking, longer, contiguous texts are divided into smaller, model-optimized sections. The final step, Embedding, converts these text chunks into numerical vectors that can be processed and searched by AI systems such as RAG applications.

[ Partitioning ] > [ Cleaning ] > [ Extracting ] > [ Staging ] > [ Chunking ] > [ Embedding ]

Core Pipeline Steps (Mandatory Steps)

[ Partitioning ] > [ Cleaning ] > [ Extracting ] > [ Staging ]

Partitioning

The unstructured library offers partitioning functions to break raw documents into structured building blocks such as titles, body text, or list items. This allows users to specifically select only the content they need for their task, for example only the body text for training a summarization model.

Document Type                               | Partitioning Function | Strategies                            | Table Support | Options
CSV files (.csv)                            | partition_csv         | N/A                                   | Yes           | None
E-mails (.eml)                              | partition_email       | N/A                                   | No            | Encoding; Include Headers; Max Partition; Process Attachments
E-mails (.msg)                              | partition_msg         | N/A                                   | No            | Encoding; Max Partition; Process Attachments
EPUBs (.epub)                               | partition_epub        | N/A                                   | Yes           | Include Page Breaks
Excel documents (.xlsx/.xls)                | partition_xlsx        | N/A                                   | Yes           | None
HTML pages (.html/.htm)                     | partition_html        | N/A                                   | No            | Encoding; Include Page Breaks
Images (.png/.jpg/.jpeg/.tiff/.bmp/.heic)   | partition_image       | "auto", "hi_res", "ocr_only"          | Yes           | Encoding; Include Page Breaks; Infer Table Structure; OCR Languages; Strategy
Markdown (.md)                              | partition_md          | N/A                                   | Yes           | Include Page Breaks
Org Mode (.org)                             | partition_org         | N/A                                   | Yes           | Include Page Breaks
OpenOffice documents (.odt)                 | partition_odt         | N/A                                   | Yes           | None
PDFs (.pdf)                                 | partition_pdf         | "auto", "fast", "hi_res", "ocr_only"  | Yes           | Encoding; Include Page Breaks; Infer Table Structure; Max Partition; OCR Languages; Strategy
Plain text (.txt/.text/.log)                | partition_text        | N/A                                   | No            | Encoding; Max Partition; Paragraph Grouper
PowerPoints (.ppt)                          | partition_ppt         | N/A                                   | Yes           | Include Page Breaks
PowerPoints (.pptx)                         | partition_pptx        | N/A                                   | Yes           | Include Page Breaks
reStructured Text (.rst)                    | partition_rst         | N/A                                   | Yes           | Include Page Breaks
Rich Text Files (.rtf)                      | partition_rtf         | N/A                                   | Yes           | Include Page Breaks
TSV files (.tsv)                            | partition_tsv         | N/A                                   | Yes           | None
Word documents (.doc)                       | partition_doc         | N/A                                   | Yes           | Include Page Breaks
Word documents (.docx)                      | partition_docx        | N/A                                   | Yes           | Include Page Breaks
XML documents (.xml)                        | partition_xml         | N/A                                   | No            | Encoding; Max Partition; XML Keep Tags
Code files (.js/.py/.java/etc.)             | partition_text        | N/A                                   | No            | Encoding; Max Partition; Paragraph Grouper

Example PDF file for testing the unstructured.io library:

pm-partnerschaft-stackit

The partition function has analyzed the above PDF document. Instead of viewing the document only as a long, unformatted text sequence, it interpreted the visual structure and layout of the page to identify individual logical building blocks (Document Elements), e.g. “Title” and “NarrativeText”.

Explanation of each element type from the PDF, e.g. Title:

  • type ("Title")
    Specifies the classified type of the element. unstructured identifies various types such as Title, NarrativeText, or ListItem based on the document layout.

  • element_id ("42a9c358d559f0be7034cffd155ae0b4")
    A unique, auto-generated ID for each individual element, useful for referencing elements.

  • text ("OPITZ CONSULTING UND STACKIT…")
    The raw text content extracted from the recognized element.

  • metadata ({…})
    An object serving as a container for all additional data (metadata) that describe the element. All following keys are part of this metadata object.

  • metadata.detection_class_prob (0.6857117414474487)
    The "detection class probability" is a confidence score between 0 and 1. It indicates how certain the model is that the classification under type (here: "Title") is correct; 1 means the highest possible certainty. Higher scores can generally be achieved by choosing a more thorough strategy:

      • "fast": Quick but less accurate. Often results in lower confidence scores.
      • "hi_res": Uses complex models for visual analysis of the document. This is slower but recognizes layouts, titles, and paragraphs much more reliably, generally leading to higher confidence scores.
      • "ocr_only": Forces OCR on all pages, even if digital text is present. Useful for "broken" PDFs, but for layout detection hi_res is superior.

  • metadata.coordinates ({…})
    An object containing all information about the positioning of the element on the page.

  • metadata.coordinates.points ([[197.0, 771.66], [197.0, 903.39], …])
    A list of [X, Y] coordinate pairs that define a bounding box around the element, describing its exact position and size on the page.

  • metadata.coordinates.system ("PixelSpace")
    Specifies the coordinate system used. "PixelSpace" means that the values in points, layout_width, and layout_height are measured in pixels.

  • metadata.coordinates.layout_width (1654)
    The total width of the document page in the unit of the specified system (here: pixels).

  • metadata.coordinates.layout_height (2339)
    The total height of the document page in the unit of the specified system (here: pixels).

  • metadata.last_modified ("2025-05-31T08:21:16")
    The timestamp of the last modification of the source file, if available.

  • metadata.filetype ("application/pdf")
    The MIME type of the source file, indicating the file format (e.g. PDF, DOCX, HTML).

  • metadata.languages (["deu", "eng"])
    A list of languages detected in the text of the element, given as three-letter ISO 639-2 codes (e.g. "deu" for German, "eng" for English).

  • metadata.page_number (1)
    The page number in the original document where this element was found.

  • metadata.filename ("test_oc.pdf")
    The filename of the processed source file.

  • metadata.parent_id (not in this example, but relevant)
    This key would appear if the element were a child element; its value would be the element_id of the parent element (e.g. the ID of the title for a following paragraph).

This is the list of the main document elements such as Text, Header, Image, etc. The complete list of elements is available here: Link

Element Type      | Description
Address           | A text element for capturing physical addresses.
CodeSnippet       | A text element for capturing code snippets.
EmailAddress      | A text element for capturing email addresses.
FigureCaption     | An element for capturing text belonging to image captions.
Footer            | An element for capturing document footers.
FormKeysValues    | An element for capturing key-value pairs in a form.
Formula           | An element containing formulas in a file.
Header            | An element for capturing document headers.
Image             | A text element for capturing image metadata.
ListItem          | A NarrativeText element that is part of a list.
NarrativeText     | An element consisting of multiple well-formed sentences. This excludes elements such as titles, headers, footers, and figure captions.
PageBreak         | An element for capturing page breaks.
PageNumber        | An element for capturing page numbers.
Table             | An element for capturing tables.
Title             | A text element for capturing titles.
UncategorizedText | Base element for capturing free text from files. Applies to extracted text not associated with bounding boxes.

Cleaning

The cleaning function is used to remove unwanted content from documents to prepare the data for downstream tasks such as processing by language models (LLMs). The goal is to obtain “clean” and more relevant data.

The following table provides an overview of the different cleaning functions. Source: Link

  • bytes_string_to_string
    Converts a bytes string into a normal string. A computer does not store an emoji like 😊 as an image but as a sequence of bytes, a so-called "byte string"; for 😊 the code is b'\xf0\x9f\x98\x8a'. When unstructured processes an HTML file, the parser may encounter such a special character and, instead of inserting 😊 into the text, output a description of the byte code as normal text. The result is a string that merely looks like a byte string: "b'\xf0\x9f\x98\x8a'". bytes_string_to_string acts as a repair tool for this problem: it recognizes the b'…' pattern in text and converts it back into the original, correct character.

  • clean
    Cleans a text fragment by combining several specific cleaning actions into a single call. Which cleanups are performed is controlled via simple flags (True or False):

      • bullets=True: Removes bullet characters (e.g. ● or *) at the beginning of the text.
      • extra_whitespace=True: Removes redundant whitespace, for example multiple spaces between words.
      • dashes=True: Cleans various types of hyphens and dashes.
      • trailing_punctuation=True: Removes punctuation at the end of the text.
      • lowercase=True: Converts all text to lowercase.

    Example: clean("● An excellent point!", bullets=True, lowercase=True)

  • clean_bullets
    Removes leading bullet characters from the beginning of a text.

  • clean_dashes
    Replaces various types of dashes (e.g. em dash, en dash) with a standard hyphen.

  • clean_non_ascii_chars
    Removes all non-ASCII characters from a text. ASCII includes the English alphabet (A-Z, a-z), digits (0-9), and basic punctuation and special characters like ! ? @ $ &. All other characters are considered non-ASCII, including German umlauts (ä, ö, ü, ß), symbols like €, ®, ©, ●, and emojis like 👍 or 😊. For German texts, clean_non_ascii_chars should therefore not be enabled.

  • clean_ordered_bullets
    Removes ordered list markers like "1.", "a.)" or "i)" from the beginning of a text.

  • clean_postfix
    Checks the end of a string and, if it matches a defined pattern (usually a regular expression), removes that part.

      • pattern: The pattern (e.g. r"(END|STOP)") to search for and remove at the end of the text.
      • ignore_case=True: Ignores case when matching (so END also matches end). Default is False.
      • strip=True: Removes any remaining whitespace at the end after deleting the matched pattern. Default is True.

  • clean_prefix
    Removes a specified prefix from a text if it is present.

  • clean_trailing_punctuation
    Removes punctuation at the end of a text but leaves punctuation within the text intact.

  • group_broken_paragraphs
    Joins lines of text that have been separated by line breaks but actually belong to the same paragraph. Very useful for texts extracted from PDFs: this function "repairs" paragraphs that were split by line breaks (\n) for visual or formatting reasons.

  • remove_punctuation
    Removes all punctuation characters (e.g. , . ; ! ?) from a text.

  • replace_unicode_quotes
    Replaces outdated or problematic Unicode codes for quotation marks with modern, typographically correct "smart quotes." Texts copied from older systems or programs like Microsoft Word sometimes use special control characters or codes (e.g. \x91, \x93) instead of standard quotation marks, which can cause display issues or disrupt further automated processing. replace_unicode_quotes finds these specific, outdated codes and converts them into modern smart quotes.

  • translate_text
    Uses professional translation models (Helsinki NLP) to translate text between many different languages such as Russian, Chinese, German, and more. Parameters:

      • text: The text to be translated.
      • source_lang: The language code (e.g. de for German) of the original text. If not specified, the function attempts to detect the language automatically.
      • target_lang: The language code of the target language. If not specified, the default target is English (en).

Extraction

The Extracting step isolates specific information from already cleaned text elements.

For example, if a text contains the sentence: Please contact support@example.com for further information. Instead of processing the entire sentence, an extraction function like extract_email_address would pull out only the desired data point, namely support@example.com.

  • extract_datetimetz
    Extracts date, time, and timezone from the "Received" fields of an .eml file (email).

  • extract_email_address
    Finds and extracts one or more email addresses from a text.

  • extract_ip_address
    Extracts IP addresses from a text.

  • extract_ip_address_name
    Extracts the names associated with each IP address in the "Received" fields of an .eml file.

  • extract_mapi_id
    Extracts the "MAPI ID" from the "Received" fields of an .eml file.

  • extract_ordered_bullets
    Extracts the markers of ordered lists (e.g. "1.", "a.)") from a text.
    Example: extract_ordered_bullets("1.1 This is a very important point") returns ("1", "1", None); extract_ordered_bullets("a.1 This is a very important point") returns ("a", "1", None).

  • extract_text_after
    Extracts the text that follows a specified pattern or word.
    Example: extract_text_after("SPEAKER 1: Look at me, I'm flying!", r"SPEAKER \d{1}:") returns "Look at me, I'm flying!".

  • extract_text_before
    Extracts the text that precedes a specified pattern or word.

  • extract_us_phone_number
    Extracts a phone number in US format from a text segment.

  • group_broken_paragraphs
    Joins lines of text that have been separated by line breaks but actually belong to the same paragraph. Very useful for PDFs. This function originates from Cleaning; the only difference is the context or intention with which it is called.

  • remove_punctuation
    Removes all punctuation characters (e.g. , . ; ! ?) from a text. This function originates from Cleaning; the only difference is the context or intention with which it is called.

  • replace_unicode_quotes
    Replaces outdated or problematic Unicode codes for quotation marks with modern smart quotes. This function originates from Cleaning; the only difference is the context or intention with which it is called.

  • translate_text
    Uses professional translation models (Helsinki NLP) to translate text between many languages (parameters as described under Cleaning). This function originates from Cleaning; the only difference is the context or intention with which it is called.

Staging

Staging functions in the unstructured package prepare extracted document elements for downstream processing steps. They take a list of structured elements as input, such as Title or NarrativeText. The output is a format-specific dictionary—a structured collection of data in the form of key-value pairs. Each piece of information, such as a text fragment or a metadata entry, is assigned a unique identifier. These identifiers, called keys, can be “text,” “metadata,” or “type.” The goal is to prepare the data so that it can be directly processed by the intended target system.

Originally, specialized conversion functions were available for different use cases:

  • A basic conversion like convert_to_csv for translating into tabular formats
  • A conversion for machine learning and NLP platforms like stage_for_transformers for preparing training data
  • A conversion for vector databases like stage_for_weaviate for semantic indexing

These functions are now deprecated. Further development focuses on the concept of destination connectors, e.g. for Kafka, Weaviate, or MongoDB. This enables automated transfer of data to external platforms after extraction.

  • Astra DB: A cloud-native NoSQL Database-as-a-Service (DBaaS) based on Apache Cassandra. It is optimized for high scalability and performance and includes vector search capabilities, making it useful for AI applications.
  • Azure: Refers to Azure Blob Storage, a Microsoft object storage service for storing large amounts of unstructured data in the cloud.
  • Azure AI Search: A cloud-based search service from Microsoft that provides APIs and tools to integrate advanced search features (including vector and semantic search) into applications.
  • Box: A cloud-based content management and collaboration platform. As a destination, you can transfer data and documents into a secure Box environment.
  • Chroma: An open-source vector database designed for AI and LLM applications, optimized for storing and searching vector embeddings.
  • Couchbase: A distributed NoSQL database optimized for interactive applications. It combines a fast key-value store with a flexible JSON document model and SQL-like queries.
  • Databricks Volumes: A feature within Databricks that enables access to, storage of, and management of non-tabular data (such as images, PDFs, text files) in cloud storage as if it were a local file system.
  • Delta Tables in Amazon S3: Allows storing data in Delta Lake format directly on Amazon S3. Delta Lake is an open-source storage framework that offers ACID transactions, time travel (data versioning), and scalability for data lakes.
  • Delta Tables in Databricks: Stores data in optimized Delta Lake format within the Databricks platform. This is the native and most performant method for using Delta Tables in Databricks.
  • Dropbox: A cloud storage service that allows users to store and share files online. As a destination connector, it writes files and data into a user's or organization's Dropbox folder structure.
  • DuckDB: A columnar, in-process analytical database system (OLAP). It is extremely fast and designed to run directly within an application without a separate server.
  • Elasticsearch: A highly scalable open-source search and analytics engine. It is widely used for full-text search, log analysis, security information, and business analytics.
  • Google Cloud Storage: Google's object storage service (similar to Amazon S3). It is used for storing and retrieving any amount of data in Google Cloud.
  • IBM watsonx.data: An open data lakehouse service from IBM that enables managing and analyzing data from data warehouses and data lakes with a single query engine, optimized for AI workloads.
  • Kafka: Apache Kafka is a distributed open-source event streaming platform. As a destination, you can write data streams (events) into Kafka topics for real-time consumption by other applications.
  • KDB.AI: A high-performance vector database developed for real-time AI applications such as similarity search, personalization, and retrieval-augmented generation (RAG).
  • LanceDB: An embeddable vector database for AI applications that runs serverless and is optimized for multimodal data (text, images, etc.) and fast, efficient vector search.
  • Local: Refers to storing data on the local filesystem of the machine where the process is running.
  • Milvus: An open-source vector database designed for managing and searching massive volumes of vector embeddings with high performance for similarity search.
  • MongoDB: A leading document-oriented NoSQL database. It stores data in flexible, JSON-like documents, making it popular among developers for modern applications.
  • MotherDuck: A serverless cloud analytics service built on DuckDB. It combines DuckDB's local speed with the cloud's scalability and sharing capabilities.
  • Neo4j: A leading graph database. Instead of storing data in tables, it stores data as nodes and relationships, which is ideal for analyzing complex connections.
  • OneDrive: Microsoft's cloud storage service. As a destination connector, it writes files and data directly to a user's or organization's OneDrive cloud.
  • OpenSearch: An open-source search and analytics suite forked from Elasticsearch and driven by AWS. It is used for similar use cases such as log analysis and full-text search.
  • Pinecone: A managed, cloud-based vector database that makes it easy for developers to integrate high-performance vector search into AI applications without managing infrastructure.
  • PostgreSQL: A powerful, object-relational open-source database system. It is known for its reliability, robustness, and extensive SQL-standard feature set.
  • Qdrant: An open-source vector database and search engine designed for production environments, providing a simple API for storing and querying vectors.
  • Redis: An extremely fast in-memory database that functions as a key-value store. Commonly used as a cache, message broker, or for real-time applications.
  • S3: Amazon Simple Storage Service (S3) is a highly scalable object storage service from AWS. It is a de-facto standard for cloud data storage.
  • SFTP: SSH File Transfer Protocol is a secure file transfer protocol. As a destination connector, it allows secure uploads of data to a remote server.
  • SingleStore: A distributed, relational SQL database system known for its high-speed data ingestion, transactions, and queries, supporting both transactional and analytical workloads.
  • Snowflake: A cloud-based data platform offered as Data Warehouse-as-a-Service. It is known for its ability to scale storage and compute independently.
  • SQLite: A serverless, self-contained transactional SQL database engine that is embedded directly within an application. It is the world's most widely used database engine, especially in mobile apps and browsers.
  • Vectara: An end-to-end platform for developers to build GenAI applications focused on retrieval-augmented generation (RAG), minimizing hallucinations through precise hybrid search.
  • Weaviate: An open-source vector database that stores both data objects and their vector representations, enabling a combination of vector search with structured filters.

The selection of supported platforms is continuously being expanded. If a desired target environment is not included, it can be suggested in the community Slack.

Advanced Pipeline Steps (Optional)

[ Chunking ] > [ Embedding ]

Chunking

Unstructured uses the document elements and metadata recognized by the partitioning functions to transform those elements into more useful "chunks."

Chunking Strategy

Currently, Unstructured offers two chunking strategies: "basic" and "by_title."

basic

The "basic" strategy is the simplest and one of the most commonly used methods in the unstructured library. Its main task is to combine sequential document elements into chunks that are as large as possible without exceeding a set maximum character count. Its behavior is controlled by the following parameters:

  • max_characters: a hard limit that must not be exceeded.
  • new_after_n_chars: a soft limit. Once a chunk reaches this length, it accepts no further elements and a new chunk is started, even though the hard limit has not yet been reached.
  • overlap: if a single element is too large and must be split, this parameter specifies how many characters overlap between the end of one chunk and the beginning of the next. This helps maintain context across chunk boundaries.
  • overlap_all: when set to True, overlap is applied not only to oversized, split elements but between all consecutive chunks.

Special case tables: Tables (Table elements) are always treated as standalone chunks and never combined with other elements. If a table itself is too large, it is also split.

The "basic" strategy reaches its limits with highly structured documents such as reports, scientific papers, or manuals.

by_title

The by_title strategy inherits all basic strategy behaviors (such as respecting max_characters) but adds three crucial new rules:

  • The strategy identifies Title elements (headings) as the start of a new section. When the algorithm encounters a heading, it performs the following actions:

    • The current chunk is immediately closed.
    • A new chunk is started beginning with this Title element.

    This occurs even if the heading text would have fit perfectly into the previous chunk.

  • By default, the strategy does not treat page breaks as hard boundaries. This means a section can span multiple pages without being split into a new chunk. This behavior is controlled by the parameter multipage_sections=True (default).

  • Sometimes very short texts, such as individual list entries, are mistakenly identified as Title elements. This can lead to a flood of tiny, unwanted chunks. To address this, there is the combine_text_under_n_chars parameter:

    • This parameter allows multiple consecutive small sections to be combined into a single chunk to best fill the chunking window (max_characters).
    • By default, combine_text_under_n_chars is the same as max_characters. This ensures that small sections are efficiently grouped.
    • A value of 0 means every tiny, title-identified section is defined as a new chunk.

The by_title strategy is the better choice for documents with a clear hierarchical structure such as reports, research articles, manuals, or contracts.

Embedding

After extracting content from documents using the open-source library unstructured, a common downstream step is converting the extracted text elements into vector embeddings. These vectors form the basis for semantic search functions and retrieval-augmented generation (RAG).

The core unstructured library itself does not include native functions for calling embedding providers. Instead, there are two primary methods for generating embeddings for unstructured outputs.

Method 1: The Unstructured ecosystem provides extended functionality in the form of the Unstructured Ingest command-line interface (CLI) and the associated Python library. These components are designed for creating complete end-to-end processing pipelines and offer built-in support for connecting to embedding providers.

In this approach, embedding generation is seamlessly integrated into the data ingestion process (“Ingest Pipeline”). This enables full automation of the workflow from raw document to final vector embedding. Configuration details are available in the official documentation.

Method 2: An alternative method is manually enriching the JSON files produced by unstructured. This approach offers high flexibility, especially in choosing the embedding model, and is suitable for scenarios without a complete ingest pipeline.

The process typically follows a fixed schema:

  • Input: A JSON file generated by the unstructured library serves as the input source.
  • Reading: The content of the JSON file is loaded into memory as a structured object.
  • Embedding Generation: A third-party library, such as sentence-transformers, is used to generate an embedding for the value of the text field of each element in the JSON file. The sentence-transformers/all-MiniLM-L6-v2 model from Hugging Face is a common example.
  • Data Enrichment: The generated embedding is added as a new field alongside the corresponding text field in the JSON object.
  • Saving: The modified JSON object with the added embeddings is written back to the original file or a new file.

This approach provides full control over the process and can be integrated into existing workflows based on exchanging JSON files.