Retrieval Augmented Generation, or RAG for short, combines the power of language models with a company’s specific knowledge. The approach makes it possible to incorporate internal documents and data into responses in a targeted way without losing control over one’s own information. As a result, RAG is increasingly seen as a key technology for deploying language models securely and with data sovereignty. In practice, however, it quickly becomes apparent that simple vector search in combination with an LLM is not sufficient to achieve truly consistent and high-quality results. To fully exploit the potential of RAG, additional methods and optimizations are necessary.

To improve the quality of a RAG application, it is not enough to address only a single point in the process. Optimizations can take place in various phases, starting with data preparation, through how documents are chunked and embedded, to the selection and evaluation of relevant information for answer generation. A clear distinction of these categories helps to deploy the appropriate measures in a targeted way and systematically address weaknesses. Building on this, different approaches can be used to specifically optimize RAG systems.

Categories for Optimizations in the RAG Process

Document Preparation
Before documents enter a RAG system, they must be reliably prepared. This includes extracting text from various formats such as PDF, Word, PowerPoint or email. Equally important is removing irrelevant content such as headers and footers, navigation elements or recurring boilerplate text. In this phase, character set normalization and language standardization also take place. Additionally, documents can be enriched with metadata such as author, date or document type. These preliminary steps ensure that subsequent stages work with clean and structured data, which forms the basis for high quality in the entire RAG process.

Chunking
After documents have been extracted and cleaned, they are divided into smaller sections known as chunks. The way chunking is performed has a decisive impact on how effectively the knowledge can later be used for embeddings and retrieval. Chunks can be formed based on character lengths, paragraphs or headings. In more advanced scenarios, chunking is done semantically so that sentences or paragraphs are not artificially split. A certain overlap of chunks can be useful to avoid losing transitions between sections. Tabular content or lists are also treated separately in this phase to preserve their structure.

Embedding
In the embedding phase, the prepared chunks are converted into vectors. These vectors represent the semantic meaning of the texts and make them searchable in a vector database. For general use cases, models trained on broad, non-specific datasets are employed. They capture language in many contexts but are not tailored to specific domains. If content from highly specialized domains is to be processed, it may make sense to fine-tune the embeddings on industry-specific data or to train custom models. Multilingual embedding models also play a role when the company’s content exists in different languages. Normalization and quality control of embeddings are important to create a consistent basis for retrieval.

Vector Indexing and Storage
After embedding, chunks are stored and indexed in a vector database. The choice of index and its parameters affects the quality and latency of retrieval. Key decisions include the index type (HNSW or IVF), the distance function, parameters such as M and efSearch, quantization schemes such as PQ or OPQ, sharding and replication, incremental reindexing, and index versioning with rollback capability.
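The quality-latency trade-off is easiest to see against the exact baseline that HNSW or IVF approximate. Below is a minimal sketch of that baseline with invented chunk IDs and toy vectors; in production a library such as FAISS or a vector database would handle this, and the approximate index parameters then trade accuracy of exactly this result against speed.

```python
import heapq
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, index, k=3):
    # Brute-force nearest-neighbour search: the exact result that
    # HNSW or IVF approximate at scale. `index` maps chunk IDs to vectors.
    scored = ((cosine(query, vec), cid) for cid, vec in index.items())
    return [cid for _, cid in heapq.nlargest(k, scored)]
```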

Retrieval
In the retrieval phase, embeddings are searched for a specific user query. The query itself is converted into a vector and compared with the stored chunks. Typically, this produces a candidate list of the top-k hits. To improve the quality of these results, various strategies exist. These include hybrid methods that combine vector search with classic keyword search, as well as multi-query retrieval, where a query is reformulated in several variants. Metadata can also be incorporated into the weighting so that, for example, recent documents or certain departments are preferentially considered.
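A hybrid method can be sketched as a weighted fusion of a vector similarity score and a keyword score. The weighting `alpha` and the simple term-overlap measure (a stand-in for BM25) are assumptions for illustration:

```python
def keyword_score(query_terms, chunk_text):
    # Fraction of query terms that appear in the chunk (stand-in for BM25).
    terms = set(query_terms)
    words = set(chunk_text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0

def hybrid_rank(query_terms, vector_scores, chunks, alpha=0.6):
    # Weighted fusion of vector similarity and keyword overlap.
    # `vector_scores` maps chunk IDs to precomputed cosine similarities.
    fused = {
        cid: alpha * vector_scores.get(cid, 0.0)
        + (1 - alpha) * keyword_score(query_terms, text)
        for cid, text in chunks.items()
    }
    return sorted(fused, key=fused.get, reverse=True)
```

A chunk with a mediocre vector score but exact keyword matches can thereby outrank a semantically similar but topically vague one.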

Answer Generation
After the best chunks from the candidate list have been selected, they are embedded into the prompt of the language model. The quality of the answer depends heavily on how this information is integrated into the prompt. A clear structure, the specification of sources and the avoidance of redundancies are crucial. The design of the prompt itself can also improve quality, for example by instructing the model to answer precisely and factually.

Feedback and Continuous Improvement
A RAG system is not a static product but evolves through feedback and monitoring. Users can rate whether an answer was helpful. This feedback can be used to gradually optimize the pipeline. Automated evaluation also plays a role, for example through benchmarks or test questions with known answers. This ensures and improves the system’s quality in the long term.

Security and Access
Access rights must be enforced at the document and chunk level. This includes role- or attribute-based filters already during retrieval, masking of personal data before embedding, tenant separation in indexes and full audit logging of sources used and queries.

Orchestration and Performance
Operation requires reliable control and scaling. This includes caching of embeddings and search results, batching and asynchronous processing, backpressure and time limits, retries on errors, as well as auto scaling and capacity planning for embedding, indexing and search.

Conversation Context and Session State
Multi-turn dialogues require rules for how history and context are used and stored. This includes query rewriting with respect to the conversation history, clear boundaries between short-term and long-term context, session linkage across multiple steps and controlled forgetting of sensitive content.

Knowledge Management and Data Lifecycle
Corporate knowledge changes over time. A data catalog with data lineage, versioning of documents, validity and expiry dates, removal of outdated content as well as defined retention and deletion policies ensure currency, traceability and compliance.

Document Preparation

Document Classification

Document classification assigns incoming files to one or more meaningful types. The goal is to give each document a reliable category early on, for example contract, invoice, minutes, presentation, policy or support ticket. This classification controls the further path through the pipeline. It determines the appropriate extractors, suitable chunking rules, the right embedding model, necessary security levels and later filters in retrieval. Classification is clearly distinct from text extraction, which only extracts content from the format. Here, it is about the document type as a whole.

Functionality. First, features are formed. These include text excerpts, titles, file names, metadata, simple layout signals such as the presence of tables or typical section headings, as well as domain hints like customer number or contract number. A trained model evaluates these features and returns a probability for each class. Depending on requirements, a single class is chosen or multiple assignments are given, for example invoice and confidential. Uncertain cases are intercepted with thresholds and go into manual review or another run with additional rules.
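The thresholding step at the end can be sketched as follows; the class names and the cutoff value are illustrative assumptions, and the probabilities would come from the trained model described above:

```python
def route_document(class_probs, threshold=0.75):
    # Pick the most probable class; below the threshold the document
    # goes to manual review instead of straight into the pipeline.
    label, prob = max(class_probs.items(), key=lambda kv: kv[1])
    if prob < threshold:
        return ("manual_review", label, prob)
    return ("accepted", label, prob)
```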

Without document classification, subsequent steps work blindly. Invoices could be treated like presentations, tables would be lost as plain text and security rules would be applied too late. The result is inappropriate chunks, weaker embeddings, higher latency due to unnecessary processing and answers that rely on incorrect sources. In audit situations, there is no reproducible justification for why a document was processed in a particular way. Reliable document classification is therefore a core component for quality, security and reproducibility in the entire RAG process.

Text Extraction from Various Formats

Text extraction from various formats means reliably converting content from source files into a uniform, machine-readable representation. The focus is on recognizing the file type and applying a suitable parser that delivers only what the format itself provides. This includes running text, simple structural cues such as paragraph boundaries and the document’s native metadata. Extraction does not perform content cleaning or layout refinement. The result of this phase is a consistent text stream with basic structure and a reference to the original positions in the document so that later steps can cite precisely.

A clear example is an email in EML format with an attachment. Extraction reads the message text from the correct text part, takes subject and date as native metadata and stores attachments separately without rewriting or shortening the content. Content cleaning or removal of navigation elements does not take place in this phase.
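For the EML example, Python's standard library already covers the described behavior. This sketch assumes the message text lives in a plain-text part; attachments are listed but not rewritten:

```python
from email import policy
from email.parser import BytesParser

def extract_eml(raw_bytes):
    # Parse an EML message: body text from the correct text part,
    # subject and date as native metadata, attachments kept separately.
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    body_part = msg.get_body(preferencelist=("plain", "html"))
    attachments = [part.get_filename() for part in msg.iter_attachments()]
    return {
        "subject": msg["subject"],
        "date": msg["date"],
        "body": body_part.get_content() if body_part else "",
        "attachments": attachments,
    }
```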

If this measure is missing or poorly implemented, subsequent steps suffer. Empty or corrupted texts mean important content never enters the knowledge base. Incorrect character encoding produces garbled words, rendering later embeddings unusable. Mixed text parts from different document components cause retrieval to consider wrong passages as relevant. Without reliable extraction, the system as a whole becomes imprecise, hard to cite and difficult for readers to follow.

Removal of Boilerplate and Filler Text

Removal of boilerplate and filler text means selectively filtering out text parts that add no substantive value. This includes navigation, headers, footers, cookie notices, imprint sections, sidebars, automatic email signatures and always identical confidentiality notices. This phase is clearly distinct from text extraction, which only provides raw text and basic structure. Technically, cleaning is based on recognition across many pages and on structural features from HTML or PDF objects. Sensitive areas such as tables, code blocks and quotations are protected by rules and not removed. Without this measure, embeddings are based on boilerplate and filler text instead of core information.
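Recognition across many pages can be approximated by counting how often a line recurs; lines that appear on most pages are treated as boilerplate. The 60 percent cutoff is an assumed tuning value, and protected zones such as tables would be exempted before this step:

```python
from collections import Counter

def strip_boilerplate(pages, min_ratio=0.6):
    # Lines that recur on a large share of pages (headers, footers,
    # confidentiality notices) are treated as boilerplate and dropped.
    counts = Counter(line for page in pages for line in set(page))
    cutoff = min_ratio * len(pages)
    boilerplate = {line for line, n in counts.items() if n >= cutoff}
    return [[line for line in page if line not in boilerplate]
            for page in pages]
```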

OCR for Scanned Documents

OCR for scanned documents means processing every source where text exists only as pixels. This includes scanned PDFs, TIFF, JPEG, PNG, fax files, photos and screenshots as well as embedded images in Word or PowerPoint. As soon as a document contains selectable text, text extraction is used instead. In mixed documents, zones are processed separately: image areas go through OCR, text areas are extracted.

The process starts with image preprocessing. Skew and distortion are corrected, contrast and sharpness are adjusted and noise is reduced. This is followed by layout analysis, which segments text zones, lines and words and establishes the reading order. Recognition converts the image segments into character sequences, guided by language settings and appropriate dictionaries. The result is text and metadata such as page number, coordinates and confidence score for each recognized unit.

Without this measure, image content remains invisible. Important passages do not enter the embeddings, retrieval cannot find technically relevant sections and answers appear incomplete.

Language and Encoding Detection

Language and encoding detection means reliably determining, before any further processing, in which language content is written and which character encoding the bytes use. The goal is to correctly decode the input text to Unicode and assign each document or section a reliable language tag. This phase is distinct from text extraction, which only reads formats.
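The encoding half of this step can be sketched as a fallback cascade; latin-1 never fails and acts as a last resort. Language tagging itself usually relies on a trained detector such as fastText or langdetect and is omitted here:

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    # Try a list of encodings in order and record which one succeeded,
    # so the decision is reproducible in later pipeline stages.
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no encoding matched")
```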

Table and Form Detection

Table and form detection means identifying structured content in documents as such and converting it into a machine-readable form instead of treating it as plain text. The goal is to reliably recognize rows, columns, headers, merged cells, summary rows and units, and in forms to establish the relationship between field label and field value. This phase is clearly distinct from pure text extraction, because here it is not only about characters but about structure. Technically, the process begins with layout analysis using table boundaries, cell grids or lines as well as white space and alignment. From these cues, cells are generated with coordinates, order and data types. For forms, key-value pairs are created; checkboxes are recognized by their marking and selection fields by the chosen entry. Numbers, currencies and dates are additionally typed. Each detected cell or field retains references to page and position so that later responses can be precisely sourced.
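The typing of numbers, currencies and dates mentioned above can be sketched for detected form values; the accepted formats and currency codes are illustrative assumptions:

```python
import re
from datetime import datetime

def type_field(value):
    # Assign a simple data type to a detected form value:
    # date, currency amount, number, or plain text.
    value = value.strip()
    for fmt in ("%d.%m.%Y", "%Y-%m-%d"):
        try:
            return ("date", datetime.strptime(value, fmt).date())
        except ValueError:
            pass
    m = re.fullmatch(r"(\d+(?:[.,]\d+)?)\s*(EUR|USD)", value)
    if m:
        return ("currency", float(m.group(1).replace(",", ".")), m.group(2))
    if re.fullmatch(r"\d+(?:[.,]\d+)?", value):
        return ("number", float(value.replace(",", ".")))
    return ("text", value)
```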

Without this measure, essential relationships are lost. Tables become unstructured text, columns merge, numbers and currencies are misinterpreted in later steps so that queries about sums, quantities or specific items fail or yield contradictory answers.

Named Entity Recognition (NER) for Metadata

Named Entity Recognition for metadata identifies mentions of significant items in texts and stores them as structured features. These include persons, companies, products, projects, contract numbers, customer numbers, locations, amounts, currencies, time expressions and references to standards or laws. The goal is not to classify the entire document but to capture individual mentions per document and per chunk so that precise foundations for search, filtering and citation are created. Without NER, these relations remain hidden in plain text. Retrieval cannot filter by customer, project or period; reranking relies only on coarse text similarity; important documents slip down the list; and answer generation loses reliable evidence. Citations become imprecise and costs as well as latency increase because unnecessary chunks are examined. NER for metadata thus creates the basis for more precise hits, reproducible citations and reliable answers.
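A rule-based sketch of entity capture per chunk follows; the patterns and ID formats are invented examples. Production systems combine trained NER models with such domain rules rather than relying on regular expressions alone:

```python
import re

# Illustrative patterns; the ID scheme "CTR-…" is an assumption.
PATTERNS = {
    "contract_number": re.compile(r"\bCTR-\d{4,}\b"),
    "amount": re.compile(r"\b\d+(?:\.\d{2})?\s?(?:EUR|USD)\b"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def extract_entities(chunk_text):
    # Collect entity mentions per chunk as structured metadata.
    return {label: pattern.findall(chunk_text)
            for label, pattern in PATTERNS.items()
            if pattern.search(chunk_text)}
```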

Duplicate Detection and Consistency Control

Duplicate detection and consistency control ensure that identical and very similar content enters the knowledge base only once and that versions and metadata are consistent. First, documents are transformed into a comparable form, for example by removing technical artifacts and normalizing whitespace. Then text fingerprints are generated using word or character shingles as well as hash methods like SimHash or MinHash. A similarity score determines whether it is an exact duplicate or a near duplicate. For detected duplicates, the system selects a canonical version based on clear rules, such as highest release level, newest date or most complete metadata. All other versions are discarded or stored as references to the canonical source. At chunk level, the procedure can be applied again so that redundant sections are not embedded multiple times. Consistency control additionally checks mandatory fields and value ranges per document type, detects conflicting numbers, dates or totals, and embeds version and release information as metadata so that retrieval and citation remain unambiguous later.
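The shingling-and-similarity core can be sketched directly with exact Jaccard similarity; SimHash and MinHash are approximations of this score that scale to large collections. The shingle size and threshold are illustrative:

```python
def shingles(text, k=3):
    # Word shingles: overlapping k-word windows over the normalized text.
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    # Jaccard similarity between two shingle sets.
    return len(a & b) / len(a | b) if a | b else 1.0

def is_near_duplicate(text_a, text_b, threshold=0.8):
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold
```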

Without this measure, the index grows with redundant content, storage requirements and embedding costs increase, retrieval shows duplicate or outdated results, reranking is distorted by many almost identical chunks and answers can cite contradictory information. In audits, there is no traceable provenance; users lose trust; and corrections propagate only partially because old versions remain in circulation.

Validation and Data Quality Checks

Validation and data quality checks ensure that only usable and complete content enters the pipeline. The structure and content of a document as well as its metadata are checked. This includes minimum text lengths, correct character encoding, detectable language, consistent page order, successful table and form detection, valid date and number formats, mandatory fields per document type and sufficiently high quality for OCR results. The outcomes of these checks are stored as status and metrics. Documents that violate rules are moved to quarantine or reprocessed. Only verified content passes the intake gate and is released for chunking and embedding.

Without this measure, empty or corrupted texts enter the index. Embeddings then represent noise instead of technical content, retrieval yields inappropriate hits, citations point to wrong locations and answers lose precision. In addition, costs and latency increase because faulty documents are embedded and searched, and traceability suffers because defects are discovered only late.

Layout Analysis and Reading Order for PDFs and Scans

Layout analysis and reading order for PDFs and scans describes identifying a document’s visual structure and reconstructing the correct text sequence for later machine reading. The goal is to reliably detect related blocks such as headings, paragraphs, columns, footnotes, marginalia and graphical elements and to linearize them in an order that matches human reading. This phase is separate from OCR. OCR converts pixels to characters; layout analysis assigns those characters to meaningful blocks and determines their reading order. It is also distinct from table and form detection, which additionally types specific structures.

Processing begins by segmenting the page into text and non-text. Lines, white spaces, alignment, font sizes and spacing provide clues to columns and sections. Within each block, lines and words are detected and tagged with coordinates. Rules for column breaks, heading hierarchies and reading paths ensure correct linearization. References to page and position remain intact so that later steps can cite precisely. In multipage documents, analysis ensures that recurring elements such as headers and footers are not mixed into the main text and that continuations of sections are correctly linked. The result is clean text with basic structure and a set of anchors that describe the original visual layout.
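Linearization can be sketched for the simple case of column layouts. The `(x, y, text)` block representation and the column gap are assumptions about what layout analysis delivers for one page:

```python
def reading_order(blocks, column_gap=200):
    # Sort layout blocks into reading order: group blocks into columns
    # by x coordinate, then read each column top to bottom, left to right.
    columns = {}
    for x, y, text in blocks:
        columns.setdefault(x // column_gap, []).append((y, x, text))
    ordered = []
    for col in sorted(columns):
        ordered.extend(text for _, _, text in sorted(columns[col]))
    return ordered
```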

Without this measure, typical errors occur. Columns are mixed, captions land mid-sentence, footnotes interrupt the flow, line breaks cause incorrect hyphenation and related paragraphs are split. Embeddings then represent a jumbled signal, retrieval finds irrelevant hits and answer generation cannot provide clean citations because source anchors point to incorrect positions. Latency and costs rise as more rework is needed and corrections appear only late. Reliable layout analysis with correct reading order is therefore a prerequisite for precise chunking, reproducible citation and consistent answers.

Entity Resolution on Company IDs for Customers, Projects, Products

Entity resolution on company IDs for customers, projects and products means linking text mentions unambiguously to master data. The goal is that “Müller Maschinenbau” in text appears not only as a string but as a reliable reference to the customer ID from the CRM, with known legal form, location, responsibilities and validity period. This phase differs from named entity recognition. NER marks text spans as organization or product. Entity resolution assigns that marking to a concrete company ID and records the decision with confidence score, timestamp and provenance.

Processing starts with normalizing names, variants and characters. Then candidates from master data tables are retrieved using name, email domain, tax number, article number or project key. A matcher scores candidates with rules and statistical similarities. Unclear cases are resolved by name similarity, address, language, department and co-mentions. The result is a unique assignment or labeled uncertainty. The assignment is stored as metadata per document and per chunk, including the version of the master data so that later search filters, permissions and citations function stably.

Without this measure, duplicates and confusions arise. Retrieval cannot reliably filter by customer or project, identical companies with slight name variations are treated as different entries, customer-specific permissions fail and answers cite incorrect transactions. In audits, there is no clear provenance, corrections propagate incompletely and usage and quality metrics are distorted because events are not linked to the correct company ID. Entity resolution is therefore the foundation for precise filtering, clean tenant separation, reproducible citations and reliable reporting.

Taxonomy and Ontology Alignment with Company Vocabulary

Taxonomy and ontology alignment with company vocabulary means linking terms used in documents to an organization-wide agreed term system. A taxonomy classifies technical terms into classes and subclasses; an ontology additionally describes relationships such as belongs to, causes, applies to or replaces. The goal is that the same or related concepts are findable under a canonical term, regardless of spelling, abbreviation, language or domain jargon. This step differs from named entity recognition, which only marks mentions, and from entity resolution, which maps names to concrete master data records like customer numbers. Here, it is about unifying the level of meaning so that search, filtering, permissions and reporting run on a common vocabulary.

In practice, it works as follows. First, business units create a controlled glossary. Each term has a preferred spelling, allowed synonyms, common abbreviations, available translations and explicitly disallowed variants. Each entry receives a stable concept ID. During document processing, found terms are aligned with this glossary. Per chunk, the concept ID, preferred term, original spelling and text position are stored as metadata. An ontology augments the glossary with relationships between concepts such as belongs to, part of, replaces or is version of. This metadata is used later in the pipeline. Retrieval can filter by concepts and automatically include synonyms. Reranking can favor hits with exact concept matches. Answer generation uses the current designation and can explain the relationship. Note the separation of concerns: content cleaning, chunking and embedding remain individual steps that run after alignment and benefit from the stored concept metadata.
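Glossary alignment per chunk can be sketched as a lookup from synonyms to stable concept IDs; the glossary entries and IDs here are invented examples of such a controlled vocabulary:

```python
# Invented example entries, not a real company vocabulary.
GLOSSARY = {
    "CONCEPT-017": {
        "preferred": "purchase order",
        "synonyms": {"purchase order", "po", "order form", "bestellung"},
    },
    "CONCEPT-042": {
        "preferred": "service level agreement",
        "synonyms": {"service level agreement", "sla"},
    },
}

def align_terms(chunk_text):
    # Map every glossary hit in a chunk to its concept ID and
    # preferred spelling; stored later as chunk metadata.
    text = chunk_text.lower()
    tokens = set(text.split())
    hits = []
    for concept_id, entry in GLOSSARY.items():
        for synonym in sorted(entry["synonyms"], key=len, reverse=True):
            if (" " in synonym and synonym in text) or synonym in tokens:
                hits.append((concept_id, entry["preferred"], synonym))
                break
    return hits
```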

Without this alignment, systematic gaps occur. Search misses content because synonyms, abbreviations and translations are not unified. Reranking sorts by raw text similarity rather than domain equivalence. Answers use inconsistent designations and appear contradictory. Topic filters work unreliably. Metrics on frequencies and trends are not comparable because identical concepts appear under different names. Compliance and domain rules cannot be enforced securely because terms are interpreted inconsistently. Alignment with company vocabulary creates a shared level of meaning, increases recall and precision and makes results reproducible.

Page and Section Anchors for Precise Citations

Page and section anchors for precise citations ensure that every assertion in the system can be traced back unambiguously to a location in the original document. This means a stable link composed of document identifier, page or slide number, section label and optionally paragraph, line or character position. These anchors are generated immediately during extraction or OCR and stored unchanged as metadata on text and chunks. They differ from layout analysis, which only recognizes structure, and from named entity recognition, which marks terms. Anchors provide the address under which a source can later be cited exactly.

Technically, a unique identifier is generated per page and section and linked to positions. In PDFs this includes page, section title and coordinate ranges. In presentations it includes slide number and placeholder label. On websites it includes URL and element identifier. These identifiers remain stable even when content is embedded, chunked or stored in a vector database. Answer generation and evaluation can then cite every statement with document, page or slide and section. Updates can also be handled securely because old and new anchors can be compared and migrated if necessary.

Without this measure, there is no unique source anchor and the exact location in the original document cannot be verified. Readers and reviewers cannot reliably look up the passage, audit and compliance evidence is weakened and confusions between versions or identical sections are more likely. During updates, references break because no stable anchors exist for migration. In sum, traceability and reproducibility decline, even though the quoted wording remains unchanged.

Chunking

Structure-Oriented Chunking

Structure-oriented chunking relies on reliable structure markers from extraction and layout analysis. A chunk corresponds to an existing unit such as a heading block, section, paragraph, list, table or code block. Very short paragraphs of the same level can be merged, very long paragraphs may be split at sentence boundaries within this unit. Cuts across units do not occur. Fixed character limits serve at most as a technical upper bound and do not determine the cut. If such structure markers are missing or unreliable, this method is not used. In that case, other procedures are employed, for example semantic chunking.

Without structure-oriented chunking, mixed chunks occur in which a heading, half a paragraph and unrelated elements collide. Definitions are torn apart, references are lost and retrieval delivers excerpts without a complete statement. Answer generation needs more context, consumes more tokens and can cite sources less accurately. With clean structure orientation, statement units remain intact, citations are precise and subsequent steps work on a stable foundation.

Semantic Chunking with NLP

Semantic chunking with NLP divides text along semantic units rather than external formats. The cut follows the theme and communicative intent, even if a document formally has no clear sections or if existing sections are too broad in content. This complements structure-oriented chunking. Where reliable structure is missing, semantic segmentation delivers more precise sense units.

Implementation begins with sentence segmentation. NLP methods then analyze text flow and look for thematic breaks. Indicators of a shift include signal words, transitions, contrasts or a clear change in content. Chunks are formed at these points so that each unit remains semantically closed. Very short segments can be merged with neighboring segments of the same topic, while very long sections are split at natural sentence boundaries.
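A deliberately crude sketch of break detection using signal words follows; the marker list is an assumption, and real systems typically compare sentence embeddings to locate thematic shifts:

```python
# Illustrative topic-shift markers; an embedding-based detector
# would replace this word list in practice.
SHIFT_MARKERS = {"however", "meanwhile", "conversely", "next", "finally"}

def semantic_chunks(sentences):
    # Start a new chunk when a sentence opens with a shift signal,
    # so each chunk remains a semantically closed unit.
    chunks, current = [], []
    for sent in sentences:
        first = sent.lower().split()[0].rstrip(",") if sent.split() else ""
        if current and first in SHIFT_MARKERS:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```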

Without semantic chunking, mixed sections arise containing too much irrelevant content. Retrieval yields imprecise hits and answer generation must process unnecessary context. As a result, accuracy and efficiency decline, even though the information is present. Semantic chunking creates clearly defined sense units and significantly improves result quality.

Adaptive Chunk Size Based on Context

Adaptive chunk size based on context means that chunks are formed not rigidly by a fixed length but flexibly according to content and technical constraints. While semantic chunking with NLP determines where a cut should be made, adaptive chunk size focuses on how large a chunk should be. The goal is for each chunk to remain a coherent unit but neither too small nor too large.

Fixed chunk sizes have the advantage of simplicity and predictability. They ensure that the embedding model’s maximum input length is not exceeded and produce uniform index sizes in the vector database. However, they often cut texts in the middle of paragraphs or tables. This creates breaks without semantic meaning, making retrieval imprecise and burdening answer generation with unnecessary ballast.

Adaptive chunk size solves this by considering document properties such as paragraph length, text density, table structures and sentence boundaries. Short paragraphs or list elements are combined until a reasonable length is reached. Very long paragraphs are split at logical points before the context window of a language model is exceeded. Token limits still serve as a technical upper bound but no longer solely determine the cut point.
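The merge-and-split logic can be sketched with character budgets standing in for token limits; the length thresholds are illustrative, and a single sentence longer than the budget is kept intact rather than cut mid-sentence:

```python
def adaptive_chunks(paragraphs, min_len=200, max_len=800):
    # Merge short paragraphs up to a budget and split overlong ones
    # at sentence boundaries; limits are illustrative character counts.
    chunks, buffer = [], ""
    for para in paragraphs:
        if len(buffer) + len(para) <= max_len:
            buffer = (buffer + " " + para).strip()
            if len(buffer) >= min_len:
                chunks.append(buffer)
                buffer = ""
        else:
            if buffer:
                chunks.append(buffer)
            buffer = ""
            for sent in para.split(". "):
                candidate = (buffer + " " + sent).strip()
                if len(candidate) > max_len and buffer:
                    chunks.append(buffer)
                    buffer = sent
                else:
                    buffer = candidate
    if buffer:
        chunks.append(buffer)
    return chunks
```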

Without adaptive chunk size, fragments lacking coherence or oversized chunks arise, which are processed inefficiently and reduce accuracy. With an adaptive approach, units remain intact, retrieval quality increases and answer generation can work with fewer but more relevant context chunks.

Sliding Window Chunking

Sliding window chunking describes a method in which chunks overlap to avoid losing transitions between sections. Instead of strictly dividing text into consecutive blocks, a fixed-length window is slid over the text, and the next chunk begins not at the end of the previous one but somewhat earlier. This way, chunks share passages that serve as a buffer and ensure that coherent information is not split by a cut.

Unlike structure-oriented or semantic chunking, sliding window chunking is based not primarily on paragraphs or topic shifts but on a technical strategy for ensuring redundancy. It is often used in addition when text is processed linearly and no clear structure is available. The advantage is that even at an unfortunate cut point, a piece of context appears in both chunks and thus remains semantically complete in at least one of them.
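The mechanism itself is compact; the window size and overlap below are illustrative token counts:

```python
def sliding_window_chunks(words, size=100, overlap=20):
    # Slide a fixed-size window over the token list; consecutive
    # chunks share `overlap` tokens so transitions are never cut.
    step = size - overlap
    chunks = []
    for start in range(0, max(1, len(words) - overlap), step):
        chunks.append(words[start:start + size])
    return chunks
```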

Without sliding window chunking, such transitions are lost. Embeddings then represent only fragments that appear isolated out of context. Retrieval yields hits that fit formally but lack crucial connections. Answer generation works with half information and must guess or supplement, leading to inaccuracy. Sliding window chunking prevents these breaks by deliberately retaining context redundantly.

Hierarchical Chunking

Hierarchical chunking means that a document is decomposed into chunks at multiple granularities simultaneously. Instead of working only with paragraphs or sections, chunks are created at different levels of the document structure, which remain linked. For example, there may be chunks at chapter level, section level and paragraph level. All units contain source anchors and hierarchy references so that later processing can flexibly decide which granularity to use in retrieval.

This approach differs from structure-oriented chunking, which only adopts the obvious document structure, and from semantic chunking, which searches for thematic breaks. Hierarchical chunking combines both perspectives and creates a multi-layered framework. This allows queries to be answered both broadly and in detail.

Example: An internal project report is divided into chapters, subchapters and numbered paragraphs. In hierarchical chunking, each paragraph is stored as its own chunk, but each subsection and each chapter is also stored as a chunk. All levels are linked by metadata. If a query requires a high-level overview, larger units can be considered. For detailed questions, the system accesses the fine-grained paragraphs.

Without hierarchical chunking, one must choose a fixed granularity. If large chunks are chosen, answers remain vague and citations are imprecise. If small chunks are chosen, overview and context are lost. With a hierarchical approach, both perspectives are available and the pipeline can choose the appropriate level per query. This increases precision, flexibility and traceability.

Document-Specific Pipelines

Document-specific pipelines mean that not all documents are processed with the same standard logic for chunking, but that each document type has its own rules and procedures or pipelines. The idea is that a contract, a manual, an email or a table have completely different structures and therefore need to be segmented differently. A pipeline can define for each type how extraction, chunking, enrichment and metadata assignment proceed.

Unlike generic procedures that use the same routine for all content, document-specific pipelines ensure that the peculiarities of a format are respected. For example, contracts might be segmented by paragraphs and clauses, manuals by chapters and subchapters, presentations by slides and bullet points, or tables by rows and columns. All pipelines produce chunks that can be embedded and stored uniformly, but their segmentation is optimally tailored to the respective document class.

Without document-specific pipelines, chunks arise that do not reflect important domain structures. Paragraphs are split, table rows lose their column associations, slides are broken into unusable text blocks. Retrieval then finds only keywords rather than relevant domain units. Answer generation loses precision and traceability. With pipelines tailored to each document type, structure is preserved and the RAG system can operate with both legal precision and technical robustness.
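A simple dispatch table is one way to realize such pipelines. The sketch below is illustrative only: the document types, the "§" clause marker and the "---" slide separator are assumed conventions, not fixed standards.

```python
import re

def chunk_contract(text: str) -> list[str]:
    # Contracts: split at clause markers; the lookahead keeps "§" with each clause.
    return [c.strip() for c in re.split(r"(?=§)", text) if c.strip()]

def chunk_presentation(text: str) -> list[str]:
    # Presentations: one chunk per slide, assuming "---" as slide separator.
    return [s.strip() for s in text.split("---") if s.strip()]

PIPELINES = {"contract": chunk_contract, "presentation": chunk_presentation}

def chunk_document(doc_type: str, text: str) -> list[str]:
    # Unknown document types fall back to plain paragraph splitting.
    pipeline = PIPELINES.get(doc_type, lambda t: [p.strip() for p in t.split("\n\n") if p.strip()])
    return pipeline(text)
```

New document classes can then be supported by registering one more function in `PIPELINES`, without touching the rest of the ingestion code.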

Entity-Focused Chunking

Entity-focused chunking means that a document is segmented not only by formal structures or semantic sections but specifically at points where important entities occur. Entities include persons, companies, products, projects, contract numbers, locations, amounts or time expressions. The idea is that each entity is fully captured in a chunk and its context preserved. This enables targeted search queries and filtering directly on these entities later.

The difference from other methods is that here the entity itself determines the cut point. While structure-oriented chunking follows paragraphs and semantic chunking follows topic shifts, entity-focused chunking ensures that all text passages related to an entity are kept together or at least clearly marked.

Without entity-focused chunking, information about an entity is scattered across multiple unconnected chunks. Retrieval then finds individual fragments but not the overall context. An answer about project ORN 2025 may contain only half the information or mix data from different projects. Entity-focused chunking creates a clear collection per entity that can be searched, filtered and correctly embedded in responses.
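As a minimal sketch of the idea: the regex below is a stand-in for real entity recognition (production systems would typically use an NER model), and the project-code pattern matching names like "ORN 2025" is an assumption for this example.

```python
import re

# Hypothetical pattern for project codes such as "ORN 2025"; a stand-in for NER.
ENTITY_PATTERN = r"\b[A-Z]{3}\s?\d{4}\b"

def entity_chunks(text: str) -> dict[str, str]:
    # Group every sentence under each entity it mentions, so all passages
    # about one entity end up in a single, filterable chunk.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    groups: dict[str, list[str]] = {}
    for sentence in sentences:
        for entity in re.findall(ENTITY_PATTERN, sentence):
            groups.setdefault(entity, []).append(sentence)
    return {entity: " ".join(parts) for entity, parts in groups.items()}
```

The resulting entity keys can also be stored as chunk metadata, enabling exact filters on an entity in addition to semantic search.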

Table- and List-Aware Chunking

Table- and list-aware chunking ensures that structured content such as tables or lists is not dissolved into linear text but remains as independent chunks. While normal procedures divide text into paragraphs or semantic units, this method explicitly recognizes tabular structures, numbering and bullet points and forms logical segments from them. The goal is that columns, rows or list items retain their relationships so that retrieval later clearly understands which information belongs together.

A key difference from purely text-oriented methods is that here the internal structure is preserved. A table is not treated as a long text block but either stored as a complete table chunk or broken down into individual row or column chunks, depending on the use case. Lists can be stored both as a whole chunk and with each list item as its own chunk with a reference to the parent list.

Without table- and list-aware chunking, structure is lost in linear text. Numbers stand side by side without context, bullets get lost in the flow and retrieval only recognizes relationships via words, not tabular logic. Answer generation then cannot provide precise citations but only vague quotes from long text blocks. With table- and list-aware chunking, these structures remain intact and make answers significantly more precise and traceable.
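The row-chunk variant described above can be sketched for Markdown tables as follows; the Markdown input format is an assumption for the example, and real extractors would handle formats like HTML or spreadsheet cells as well.

```python
def table_row_chunks(table: str) -> list[str]:
    # One chunk per data row, with the header repeated so that every chunk
    # keeps its column associations when embedded on its own.
    lines = [ln for ln in table.strip().splitlines() if ln.strip()]
    header, separator, rows = lines[0], lines[1], lines[2:]
    return [f"{header}\n{separator}\n{row}" for row in rows]
```

Repeating the header in each row chunk is the key detail: a retrieved row then still explains which column each value belongs to.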

Domain-Driven Rules

Domain-driven rules for chunking mean that document segmentation is based not only on general structures or semantic units but also on specific guidelines from the relevant domain. These rules reflect the organization’s workflows and terminology and ensure that domain-relevant units are represented as separate chunks.

The difference from generic methods is that here not only technical criteria such as character length or paragraph boundaries apply but explicit domain-specific requirements. In a legal context, this might mean that paragraphs and clauses are never split. In medicine, each diagnosis or report block could be treated as its own chunk. In software development, code functions or configuration blocks could be treated as indivisible units.

Without domain-driven rules, arbitrary cuts occur within important domain structures. A clause is torn apart, a diagnosis is spread across multiple fragments or a code function appears in disjointed parts. Retrieval then finds only fragments and answer generation loses domain coherence and traceability. Domain-driven rules ensure that segmentation follows domain requirements and preserves domain consistency.
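For the legal example, such a rule can be expressed as "a clause is indivisible, regardless of length". The sketch below assumes German-style "§" clause markers and a hypothetical length limit; both are placeholders for whatever the actual domain prescribes.

```python
import re

def legal_chunks(text: str, max_len: int = 500) -> list[str]:
    # Domain rule: a clause ("§ n ...") is never split, regardless of length;
    # ordinary prose may still be split at paragraph boundaries.
    chunks = []
    for part in re.split(r"(?=§\s*\d+)", text):
        part = part.strip()
        if not part:
            continue
        if part.startswith("§") or len(part) <= max_len:
            chunks.append(part)  # clause or short prose: keep whole
        else:
            chunks.extend(p.strip() for p in part.split("\n\n") if p.strip())
    return chunks
```

The same pattern carries over to other domains: swap the clause regex for diagnosis headers in medicine or function boundaries in code, and keep the "indivisible unit" logic.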

Token-Budget-Driven Chunking Depending on the Target Model

Token-budget-driven chunking depending on the target model means that chunk size is chosen to optimally fit the language model’s maximum context length. Language models can process only a certain number of tokens at once. With small context windows, chunks must be very compact in order to fit alongside the user question and planned answer. With large context windows, chunks can be larger because several of them can be considered simultaneously without exceeding technical limits. The aim is not to build a hierarchy of large and small chunks but to adapt the size of each chunk to the available token budget.

A manual with long chapters would be divided into smaller segments of a few hundred tokens for a small context window. With a large context window, segments can remain longer so that each chunk represents a complete concept such as a configuration step. Nevertheless, granularity remains fine enough for retrieval to select relevant segments without having to handle an unwieldy ten-page block as one chunk.

Without this adjustment, chunks would either be cut off or processed inefficiently. With model-aware segmentation, the balance between precision, efficiency and model capacity utilization is maintained.
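The budget arithmetic can be made explicit. In this sketch the question and answer budgets, the word-based token estimate and the top-k value are all assumptions; a real pipeline would use the model's own tokenizer for counting.

```python
def chunk_size_for(context_window: int, question_budget: int = 500,
                   answer_budget: int = 1000, top_k: int = 5) -> int:
    # Tokens left after question and answer are shared among top_k retrieved chunks.
    available = context_window - question_budget - answer_budget
    return max(100, available // top_k)

def split_by_tokens(text: str, max_tokens: int, tokens_per_word: float = 1.3) -> list[str]:
    # Crude word-based token estimate; real pipelines would use the model tokenizer.
    words = text.split()
    step = max(1, int(max_tokens / tokens_per_word))
    return [" ".join(words[i:i + step]) for i in range(0, len(words), step)]
```

With a 4k-token model the per-chunk budget comes out at a few hundred tokens, while a 128k-token model allows chunks that are orders of magnitude larger, exactly the trade-off described above.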

Multi-Granular Storage of the Same Passage Both as Section and as Paragraph with Linkage

Multi-granular storage of the same passage both as section and as paragraph with linkage means that a document passage is stored at multiple granularities. A longer section is stored not only as one large chunk but also broken down into smaller chunks such as paragraphs. Both variants are indexed and linked so that retrieval can select the appropriate level of detail depending on the query.

The difference from hierarchical chunking is that the entire document is not built up across many levels; instead, specific passages are stored multiple times where it is known that a section can be relevant both as a whole and in its details.

An example is a contract with a liability clause spanning several paragraphs. This clause is stored as a large chunk so it can be retrieved in full for a general search on “liability.” At the same time, the individual paragraphs are stored as their own chunks so that a targeted question about “liability limit to amount X” returns only the relevant paragraph. Metadata record that both variants share the same source.

Without multi-granular storage, one must choose a single granularity. Storing only large chunks produces vague, overloaded answers because the model receives too much context. Storing only small chunks loses the overall context and weakens traceability. With multi-granular storage, both options are available; retrieval and answer generation can choose as needed and results remain both precise and complete.
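The shared-source linkage can be sketched as follows; the record schema with `granularity` and `source` fields is a hypothetical layout, and the paragraph split on blank lines is a simplification for the example.

```python
def multi_granular(section_id: str, section_text: str) -> list[dict]:
    # Store the section once as a coarse chunk and once per paragraph;
    # the shared "source" field links all variants to the same passage.
    records = [{"id": section_id, "granularity": "section",
                "text": section_text, "source": section_id}]
    paragraphs = [p.strip() for p in section_text.split("\n\n") if p.strip()]
    for i, para in enumerate(paragraphs):
        records.append({"id": f"{section_id}.p{i}", "granularity": "paragraph",
                        "text": para, "source": section_id})
    return records
```

At query time the `source` field also supports deduplication: if both the section and one of its paragraphs are retrieved, only one of them needs to be passed to the model.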

Strict Block Boundaries for Code, Tables and Formulas

Strict block boundaries for code, tables and formulas means that these special structures are never split during chunking but treated as closed units. The reason is that such blocks only make sense in their entirety. A half code block is no longer executable, a split table loses its column associations and a formula missing parts is worthless.

This approach differs from methods that split text freely by character length or paragraph boundaries. While splitting prose mid-sentence may cause less damage, cutting structured blocks almost always leads to loss of information. Therefore, hard boundaries are defined for code, tables and formulas that chunking respects.

Without this measure, unusable fragments arise. A formula might contain only the left side of an equation, a table only half its columns or a code block only its header without body. Retrieval then yields incomplete hits and answer generation must guess what is missing. Strict block boundaries ensure that complex structures remain intact and enter the RAG process correctly as units of knowledge.
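One way to enforce such boundaries is to detect protected blocks first and only split the prose around them. The sketch below handles Markdown-style code fences as an illustrative case; the fence string is built programmatically here only to keep it out of the surrounding regex and test literals.

```python
import re

FENCE = "`" * 3  # a Markdown code fence (three backticks)

def chunk_with_protected_blocks(text: str) -> list[str]:
    # Fenced code blocks are emitted whole; prose between them is split
    # at paragraph boundaries. The capturing group keeps the blocks in
    # the output of re.split.
    pattern = "(" + FENCE + ".*?" + FENCE + ")"
    chunks = []
    for piece in re.split(pattern, text, flags=re.DOTALL):
        if piece.startswith(FENCE):
            chunks.append(piece)  # never split a code block
        else:
            chunks.extend(p.strip() for p in piece.split("\n\n") if p.strip())
    return chunks
```

The same detect-then-protect pattern extends to tables and formulas by adding further block patterns, for example for pipe-delimited table lines or `$$ ... $$` formula blocks.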

Conclusion

Optimizing a RAG application begins with document preparation and chunking. Both phases lay the groundwork for all subsequent steps such as embedding, indexing, retrieval and answer generation. Only when texts are reliably extracted, cleaned, semantically structured and sensibly segmented can later methods work precisely. Errors or omissions in these early phases affect the entire process. The consequences range from inaccurate embeddings to inefficient retrieval to weak or contradictory answers.

This article is Part 1 of a multi-part series. In the next parts, further methods covering embedding, vector indexing and storage, retrieval, answer generation as well as feedback and continuous improvement will be described. The goal is to show step by step how a RAG system can be made more robust, precise and traceable through targeted measures.
