In Part 1 we saw how crucial clean document preparation and thoughtful chunking are to the quality of Retrieval-Augmented Generation (RAG). These basics form the starting point for a whole range of further optimizations that shape the entire process. In Part 2 we continue the series and focus on the next building blocks, which build on this foundation and further develop the use of RAG in the enterprise.
Embedding
Domain-specific Embeddings
Domain-specific embeddings mean that vector representations of texts are not generated with generally trained embedding models, but with models adapted to the technical language and content of a specific industry or company. General models are trained on very large, unspecific text corpora, including books, websites, Wikipedia and other sources. They understand everyday language and many standard concepts, but often miss the nuances in, for example, legal contracts, technical manuals or medical reports. Domain-specific embeddings are created either by fine-tuning an existing model with data from the respective domain or by training a custom model on a corpus of internal documents, guidelines, protocols and manuals.
An example is the medical field. When a general model processes the word “bleeding,” it only recognizes the general meaning. In medical documents, however, it is crucial whether it is internal bleeding or postoperative bleeding. An embedding model fine-tuned on medical texts precisely captures these distinctions in the vector space.
Without domain-specific embeddings, technical terms remain vague, abbreviations are misunderstood and important relationships are not captured. The retrieval may then find formally similar passages but often misses the truly relevant sections. The answer generation loses precision because the semantic space does not properly represent the specialized language. With domain-specific embeddings, however, the system becomes more robust, the hit quality increases, and answers can address the actual context of the company with higher accuracy.
Multilingual Embeddings
Multilingual embeddings mean that texts in different languages are mapped into a common vector space. Such a model understands that a sentence in German, English or French with the same content has the same semantic meaning, even if the words and grammar are completely different. The advantage is that users can search regardless of their language and the system still finds the relevant documents. At the same time, documents in multiple languages can be processed without maintaining a separate model for each language.
An example is an internationally active company that stores manuals in English, protocols in German and contracts in French. If an employee searches in German for “Lieferverzug” (delivery delay), the system should also correctly recognize and return the English “delivery delay” and French “retard de livraison.” Multilingual embeddings ensure that all three variants lie close together in the vector space and are thus treated as semantically equivalent.
Without this measure, language silos emerge. A search covers only documents in the same language, even though other languages contain the same content. The retrieval misses relevant hits, answer generation appears incomplete, and users lose trust in the system’s comprehensiveness. With multilingual embeddings, knowledge is linked across languages and accessible to all users, regardless of the language in which the query is made or the document is stored.
Dimensionality Reduction
Dimensionality reduction means reducing the number of dimensions in embeddings without losing the semantic core of the information. Embedding models typically generate several hundred to over a thousand dimensions for each text vector. While this high dimensionality improves expressiveness, it makes search and storage resource-intensive. Reduction shrinks the vectors so they can be compared faster in vector databases and stored with less memory.
An example is an embedding model with 1024 dimensions that covers all company documents. If these vectors are reduced to 256 dimensions before storage, the memory requirement drops to a quarter and similarity search runs significantly faster. Techniques such as Principal Component Analysis (PCA) ensure that the most important information axes are preserved while noise or less relevant dimensions are removed.
Without this measure, latency and costs increase with the index size. Queries take longer because each comparison requires more compute operations, and the database scales poorly as many unnecessary dimensions are stored. With dimensionality reduction, embeddings remain manageable, retrieval and ranking become more efficient, and the results stay precise.
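As a minimal sketch of this step, PCA can be computed with plain NumPy via a singular value decomposition; the random vectors below merely stand in for real embeddings, and the 1024-to-256 setup mirrors the example above:

```python
import numpy as np

# Random vectors stand in for real 1024-dimensional embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 1024)).astype(np.float32)

# PCA via SVD: center the data, then project onto the top 256 principal axes.
mean = embeddings.mean(axis=0)
centered = embeddings - mean
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:256].T   # shape (500, 256)
```

Note that queries must be projected with the same mean and the same components as the stored vectors, otherwise the reduced spaces are not comparable.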
Embedding Normalization
Embedding normalization means bringing embedding vectors to a uniform length after computation, before storing them in the vector database or using them for retrieval. Language models generate vectors with numerical values across many dimensions. These vectors can have different lengths even if their meaning is similar. Length is a technical by-product of computation.
The benefit is seen in the similarity measure used. With cosine similarity, only the angle between vectors matters, not their length. To ensure this measure works correctly, vectors are normalized to unit length. Even with distance measures like Euclidean distance, normalization prevents semantically similar texts from appearing far apart just because their embeddings are scaled differently.
Without normalization, retrieval results can be distorted. Contents that are semantically very similar appear further apart than they actually are. By normalizing to unit vectors, the semantic proximity is preserved and comparisons become stable and reliable.
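A sketch of L2 normalization with NumPy: two vectors that point in the same direction but differ in length become identical unit vectors, so cosine and Euclidean comparisons agree on their similarity.

```python
import numpy as np

vecs = np.array([[3.0, 4.0], [300.0, 400.0]])   # same direction, different length
norms = np.linalg.norm(vecs, axis=1, keepdims=True)
unit = vecs / norms   # every row now has length 1
```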
Hybrid Embeddings (Text + Metadata)
Hybrid embeddings with text and metadata mean that not only the pure document text is converted into a vector but also descriptive information is incorporated. Metadata includes, for example, document type, creation date, author, department, project assignment, language or security level. These details are combined with the text and embedded as a joint vector. The goal is to reflect not only the content meaning but also the organizational attributes in the semantic space.
Technically, this is done by merging text and metadata into a single input.
“[Document type: guideline] [Effective from: 2024] The actual text of the guideline…”
The metadata part is included as a prefix or structured supplementary information in the input text before the embedding model computes the vector. Alternatively, separate vectors can be created for text and metadata and then fused, for example by concatenation or weighted averaging.
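The prefix variant can be sketched as a small helper; the field names and their ordering here are illustrative, not a fixed convention:

```python
def build_embedding_input(text, metadata):
    # Prepend metadata as bracketed key-value pairs, as in the example above.
    prefix = " ".join(f"[{key}: {value}]" for key, value in metadata.items())
    return f"{prefix} {text}"

doc = build_embedding_input(
    "The actual text of the guideline...",
    {"Document type": "guideline", "Effective from": "2024"},
)
```

The resulting string is what gets passed to the embedding model, so content and organizational attributes end up in one vector.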
Without hybrid embeddings, metadata remains separate from the main vector space and can only be used as additional filters in the database. This means each query must run on two tracks: semantic similarity of the texts and conditions such as document type or department. With hybrid embeddings, text content and metadata are merged into a single representation. This enables retrieval to work more directly and precisely, and results not only match semantically but also organizationally.
Special Embeddings for Tables and Code
Special embeddings for tables and code mean that content of these types is not handled by the same models as regular prose but by specially trained models that capture their structure. Tables consist of cells arranged in rows and columns, and their meaning arises from the relationship between headers, values and units. A general language model interprets a table as a sequence of words and numbers. A table-specific embedding, however, recognizes that a number in a certain column belongs to a header and can thus be understood as a metric. This allows queries like revenue per quarter or average duration to be answered correctly because the vector representation captures these relations.
The same principle applies to source code. Regular language models process code as ordinary text, seeing variables, brackets and keywords only as character strings. Code-specific embeddings, on the other hand, take into account syntax, the hierarchy of function calls and dependencies between variables and modules. This makes it possible to search specifically for function definitions, parameters or used libraries and generate answers reflecting the logical structure of code.
An important aspect is that embeddings from different models are not directly comparable. Each vector space has its own geometry, so cosine similarity only works reliably within one model. To overcome this challenge, there are several approaches. Either a unified model is used that processes text, tables and code, or contents are separated into different indexes and results are merged after retrieval. Alternatively, embedding spaces can be mapped onto each other via additional mapping models or adapters so they become comparable in a common space. Without one of these measures, vectors from different models would be incompatible and retrieval would yield random or irrelevant hits.
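The merge-after-retrieval variant can be sketched with reciprocal rank fusion (RRF), which combines ranked lists using only the rank of each hit, so incomparable scores from different vector spaces never have to be mixed; the document IDs and the constant k=60 below are illustrative:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    # Combine ranked ID lists from separate indexes (e.g. text, tables, code).
    # Only ranks contribute to the fused score, never raw distances.
    fused = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

merged = reciprocal_rank_fusion([
    ["t1", "t2", "t3"],   # hits from the text index
    ["c9", "t2", "c4"],   # hits from the code index
])
```

Here "t2" wins because it appears in both lists, which is exactly the behavior wanted when merging heterogeneous indexes.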
Without special embeddings for tables and code, underlying structures are lost. Retrieval then returns vague hits because numbers or function names are processed without their contextual relationships. Answers appear imprecise, refer to wrong values or provide incomplete code snippets. With specialized embeddings, structure and logic are preserved, and with consistent handling of different vector spaces, these contents can be reliably searched and used for robust answers.
Evaluation and Fine-tuning of Embeddings
Evaluation and fine-tuning of embeddings means regularly checking the quality of the generated vectors and adapting them as needed to the requirements of your domain. Embeddings are the foundation of retrieval; they determine which chunks are considered similar and which are not. If vectors do not reliably reflect semantic relations in the enterprise context, false hits and imprecise answers result.
Evaluation is done using test datasets with known pairs of queries and relevant documents. Typical metrics include recall at K or MRR. This measures whether relevant content actually appears among the top hits. Additionally, qualitative tests are conducted with domain experts who assess if the results meet their information needs.
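Both metrics are straightforward to compute once pairs of queries and relevant documents exist; a minimal sketch with hypothetical document IDs:

```python
def recall_at_k(retrieved, relevant, k):
    # Fraction of the relevant documents that appear among the top-k hits.
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mean_reciprocal_rank(runs):
    # runs: list of (retrieved list, relevant set) pairs, one per test query.
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

r = recall_at_k(["d3", "d1", "d7"], {"d1", "d2"}, k=3)                 # 0.5
m = mean_reciprocal_rank([(["d3", "d1"], {"d1"}), (["d2"], {"d2"})])   # 0.75
```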
Fine-tuning follows evaluation. An existing embedding model is retrained with additional examples from the proprietary domain. This way it learns to include terms, abbreviations or relationships that do not occur in general training data. In medicine, for example, these might be specific diagnoses; in finance, technical terms from contracts and balance sheets.
Without evaluation and fine-tuning, it remains unclear whether embeddings truly hold up in the specific domain. A model that performs well on general benchmarks can still deliver poor results in an enterprise because industry-specific concepts are not captured. Retrieval then shows irrelevant hits, important documents sink down and answer generation relies on unsuitable foundations. With regular evaluation and targeted fine-tuning, hit accuracy can be systematically improved and the reliability of the entire RAG system secured.
Document Pooling Strategy
Document pooling strategy means clearly documenting how the representation of a text is derived from the token embeddings of the model. Common methods include using the CLS token or mean pooling over all tokens. The choice of method directly affects the resulting vectors, as it determines which level of information the embedding emphasizes.
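The two strategies can be sketched in a few lines of NumPy; the tiny arrays below stand in for real token embeddings and an attention mask:

```python
import numpy as np

def cls_pool(token_embeddings):
    # Use the first token's vector (the CLS position) as the text embedding.
    return token_embeddings[0]

def mean_pool(token_embeddings, attention_mask):
    # Average only over real tokens; padding positions are masked out.
    mask = attention_mask[:, None].astype(np.float32)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

tokens = np.array([[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]])  # last row = padding
mask = np.array([1, 1, 0])
```

The two functions return different vectors for the same input, which is precisely why the chosen strategy must be documented and applied consistently.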
If these differences are not documented, inconsistencies arise once different teams or pipelines use different strategies. In large enterprises with multiple departments, various teams often develop their own RAG pipelines, sometimes with different frameworks or model versions. If one team uses CLS pooling and another mean pooling, each produces valid embeddings but they are no longer directly comparable. Benchmarks then lose their significance, shared indexes contain inconsistent data and retrieval results vary by source.
When upgrading a model version, the default pooling setting may also change. Without documentation, this often goes unnoticed, yet vectors subtly change and cause breaks in representation.
With a clearly documented and mandated pooling strategy, embeddings remain consistent, traceable and reproducible. Benchmarks are reliable, comparisons between teams are possible and retrieval yields stable results regardless of which pipeline embedded a document.
Out-of-Distribution Detection for Inputs Outside the Training Domain
Out-of-distribution detection for inputs outside the training domain means that the system checks whether a text or chunk actually falls within the scope the embedding model covered during training. Embedding models are always trained on specific datasets, for example news, Wikipedia articles or technical documentation. Content that deviates significantly, such as legal clauses, medical diagnoses or internal abbreviations, may be processed by the model, but the resulting vectors do not necessarily capture the semantic meaning correctly.
Technically, detection occurs via methods such as distance measures in the vector space, uncertainty metrics or additional classifiers specifically trained for OOD detection. If a chunk is very far from all known training vectors, it suggests the model cannot provide a reliable representation.
An example is an embedding model trained on everyday English suddenly processing chemical structural formulas or legal paragraphs. The vectors for these contents end up in areas of the vector space that never occurred during training. Retrieval may then select wrong neighbors because distances no longer reflect true semantic proximity.
There are three established approaches for OOD detection in embeddings. Distance-based methods check how far a new vector is from known training or reference vectors. If the distance to the nearest neighbor or several neighbors is significantly beyond the usual range, the input is flagged. Density- or probability-based methods model the distribution of training vectors and assess whether a new input likely comes from the same distribution. Methods like Gaussian mixture models, kernel density estimation or likelihood scores are used. A third option is classifiers explicitly trained for OOD detection. They learn to distinguish in-distribution from out-of-distribution by training with negative examples from foreign domains and measuring uncertainty via techniques like softmax entropy or Monte Carlo dropout. These three approaches can be applied individually or combined to reliably detect problematic inputs.
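The distance-based approach can be sketched with NumPy; the random reference vectors stand in for real in-distribution embeddings, and the 99th-percentile calibration is one plausible choice, not a fixed rule:

```python
import numpy as np

rng = np.random.default_rng(0)
reference = rng.normal(size=(1000, 64)).astype(np.float32)  # in-distribution sample

def nearest_dist(vec, refs):
    return float(np.linalg.norm(refs - vec, axis=1).min())

# Calibrate the threshold from the reference data itself: the 99th percentile
# of nearest-neighbour distances within the sample (index 1 skips self-distance 0).
calib = [np.sort(np.linalg.norm(reference - v, axis=1))[1] for v in reference[:100]]
threshold = float(np.percentile(calib, 99))

def is_ood(vec):
    # Flag the input if it is farther from every reference vector than the threshold.
    return nearest_dist(vec, reference) > threshold
```

An input far outside the reference cloud (such as a constant vector of large magnitude) is flagged, while vectors from the reference distribution pass.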
Without out-of-distribution detection, such problematic embeddings are added to the index. This leads to imprecise retrieval, inappropriate answers and, in the worst case, misinformation. With consistent OOD control, the system can label these chunks, post-process them with specialized models or exclude them from embedding. This keeps the vector database quality high and the RAG system delivers more reliable results.
Language and Format Normalization of Numbers, Dates, Units
Language and format normalization of numbers, dates and units means converting these expressions into a consistent and comparable form before embedding. The background is that identical content can be notated very differently, for instance 1.000 € vs. EUR 1000, 12.03.2025 vs. 2025-03-12 or “5 kg” vs. “5 kilograms”. For a language model, these notations are initially different character strings, even if they mean the same. Normalization puts them into a uniform format so embeddings can correctly reflect semantic equivalence.
Technically, this is done via parsers and transformation rules that recognize number formats, currencies, measurement units and dates and convert them into standardized representations. For example, the ISO format for dates or the SI system for units can be chosen as targets. Normalization is clearly distinct from named entity recognition because it is not about identifying entities but about unifying the representation of values already recognized.
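A minimal sketch of such transformation rules; the patterns cover only the formats from the examples above, and a real normalizer would need a far larger rule set:

```python
import re

def normalize_dates(text):
    # Rewrite German-style DD.MM.YYYY dates to ISO 8601 (YYYY-MM-DD).
    return re.sub(r"\b(\d{2})\.(\d{2})\.(\d{4})\b", r"\3-\2-\1", text)

def normalize_units(text):
    # Expand a few spelled-out units to SI symbols (illustrative rules only).
    rules = {r"\bkilograms?\b": "kg", r"\bkilometres?\b": "km"}
    for pattern, repl in rules.items():
        text = re.sub(pattern, repl, text)
    return text

line = normalize_units(normalize_dates("Delivered 12.03.2025: 5 kilograms"))
```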
Without this measure, erroneous or skewed embeddings arise. Retrieval treats semantically identical values as different, queries like “all invoices from March 2025” miss hits if dates are not standardized. Answer generation may produce conflicting citations because one document writes “3/12/2025” and another “12.03.2025”. With consistent normalization, meanings remain clearly recognizable and results precise and reliable.
Vector Indexing and Storage
After embedding, chunks are stored and indexed in a vector database. The choice of index and its parameters affects retrieval quality and latency.
An index is a data structure that accelerates access to data, similar to a table of contents in a book. Indexing means that new vectors are sorted on insertion so they can be found faster later.
In a vector database this happens as follows: Each chunk of a document is converted into a vector. Instead of storing these vectors as a simple list, they are organized in an index. An example is the IVF index (Inverted File Index). It forms clusters, i.e. groups of similar vectors. When a search query arrives, the system scans only the relevant clusters, not all data. Another example is HNSW (Hierarchical Navigable Small World Graph). Here, vectors are connected as nodes in a network. The search navigates this network like a road map, “jumping” from one vector to the next until the most similar ones are found.
Without indexing, the system would have to compare every stored vector with each search query. With thousands of entries this is still feasible, but with millions or billions it becomes practically impossible. A RAG system would then take minutes instead of milliseconds to deliver an answer. The index makes the difference between a theoretically correct but unusable system and a fast, production-ready solution.
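The IVF idea can be illustrated with a toy implementation in NumPy; a single assignment step with randomly chosen centroids stands in for proper k-means training:

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.normal(size=(2000, 32)).astype(np.float32)

# Toy IVF: pick random vectors as centroids and assign each vector to its
# nearest centroid (one k-means step, enough to illustrate the structure).
n_clusters = 20
centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)]
dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
assignments = dists.argmin(axis=1)

def search(query, nprobe=3, top_k=5):
    # Probe only the nprobe nearest clusters instead of scanning all vectors.
    centroid_dists = np.linalg.norm(centroids - query, axis=1)
    probe = np.argsort(centroid_dists)[:nprobe]
    candidates = np.where(np.isin(assignments, probe))[0]
    cand_dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(cand_dists)[:top_k]]

hits = search(vectors[0])
```

Probing 3 of 20 clusters means roughly 15% of the vectors are compared per query, which is exactly the trade-off real IVF indexes exploit at much larger scale.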
Choosing the Right Index Type for the Search Profile
Choosing the right index type for the search profile means selecting the vector index method that matches the system’s requirements for accuracy, speed and data volume. Different index types have different strengths. A Hierarchical Navigable Small World (HNSW) index, for example, offers very fast and precise queries but requires a lot of memory. An IVF index (Inverted File Index) is more memory-efficient and suited to very large datasets but uses approximate methods that can miss hits if misconfigured.
The search profile describes the RAG system’s retrieval requirements. A system that frequently searches small datasets with high precision benefits more from HNSW. A system that must index billions of vectors needs scalable methods like IVF or hybrid variants with quantization. Factors such as query frequency, response latency and available hardware also play a role.
Without choosing a suitable index type, systematic issues arise. An overly complex index slows down queries and causes unnecessary memory and operational costs. An overly simplistic index misses relevant hits, causing answer generation to rely on incorrect or incomplete information. Therefore, selecting the index type is a cornerstone for balancing efficiency, accuracy and cost in the RAG system.
Systematically Tuning HNSW Parameters
Systematically tuning HNSW parameters means adjusting the knobs of the HNSW index so as to strike a good balance between search speed, accuracy and memory usage. HNSW stands for Hierarchical Navigable Small World Graph and is one of the most widely used algorithms for approximate nearest neighbor search in vector spaces. It constructs a multi-layered network of vectors where each vector is connected to its neighbors. The search navigates this network, starting on a coarse upper layer and then descending through progressively denser layers toward closer neighbors until the best hits are found.
Important parameters include M, which controls the maximum number of connections per node, and efSearch, which determines the number of candidates considered during a query. A higher M means more connections and thus better accuracy but increases memory use and index construction time. A higher efSearch increases the chances of finding the best neighbors but makes the search slower. efConstruction is another parameter influencing how meticulously neighbors are chosen during index building. A higher value improves index quality but lengthens the build time.
Calibrating IVF Parameters
An Inverted File Index, or IVF, is a method for achieving faster search in very large vector databases. The basic idea is not to compare all vectors individually but to first partition the data into groups called clusters. Each vector belongs to exactly one cluster, and each cluster has a centroid. When a query arrives, the system first identifies which clusters are closest to the query vector and then searches only within those clusters. This drastically speeds up search since not all vectors need to be examined.
Calibrating IVF parameters means carefully deciding how many clusters to create and how many of them to probe per query. Too few clusters cause many vectors to accumulate in each cluster, so every probed cluster must be scanned almost in full and search slows down. Too many clusters mean the system must manage a huge number of small groups; with a fixed number of probes, relevant vectors near cluster boundaries are missed, and the many centroids add memory and management overhead.
For example, consider a database with 500 million documents. If only 1,000 clusters are created, an average of 500,000 vectors end up in each cluster, and search precision suffers. If instead 10 million clusters are created, the clusters are small but the system takes a long time to manage them and to check enough relevant clusters for each query. A middle ground, such as 100,000 clusters with a well-chosen search strategy, ensures that search remains both fast and accurate.
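A hedged rule of thumb for the starting value: guidance commonly cited in the faiss documentation is an nlist on the order of 4*sqrt(N) to 16*sqrt(N), which is then benchmarked for recall and latency:

```python
import math

n_vectors = 500_000_000
# Starting point per the commonly cited sqrt(N) heuristic; the final value
# must be validated against recall and latency measurements on real queries.
nlist = 4 * int(math.sqrt(n_vectors))      # 89,440 clusters
avg_cluster_size = n_vectors // nlist      # about 5,590 vectors per cluster
```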
Without such calibration, the IVF index works either too coarsely or too expensively. This results in missed hits or long response times. For large RAG systems this careful tuning is crucial so that vector search functions reliably and efficiently.
Using Quantization and Compression Wisely
Using quantization and compression wisely means storing vectors in a space-saving way without unnecessarily degrading search quality. Embeddings often consist of hundreds or thousands of floating-point numbers with high precision. However, this level of precision is usually not entirely needed for semantic search. Instead, values can be reduced or condensed to use less memory and speed up queries.
A commonly used technique is product quantization (PQ). The vector is split into smaller sub-vectors and each sub-vector is represented by a compact code from a codebook. This drastically reduces memory usage while largely preserving semantic closeness in the vector space. A variant, optimized product quantization (OPQ), first applies a rotation to better distribute vector information across the sub-vectors, improving accuracy.
For example, a database with 500 million vectors in 768 dimensions occupies roughly 1.5 terabytes in raw float32 format, before index overhead, replicas and backups are added. With product quantization, the same dataset can shrink to a fraction of its size, often by more than 90%, without rendering hit quality unusable. This makes even extremely large knowledge bases searchable in a performant manner.
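The arithmetic behind this example, with an assumed PQ configuration of 96 one-byte sub-codes per vector (a plausible choice, not a prescription): the raw float32 vectors alone come to about 1.5 TB before any index overhead, while the PQ codes fit in roughly 48 GB.

```python
n, d = 500_000_000, 768
raw_bytes = n * d * 4            # float32: 4 bytes per dimension
pq_bytes = n * 96                # assumed PQ: 96 sub-vectors, 1 byte each
saving = 1 - pq_bytes / raw_bytes   # fraction of memory saved, here ~97%
```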
Without quantization and compression, RAG systems quickly hit technical limits. Memory requirements and costs skyrocket, backups become unwieldy and queries become too slow. With a fitting compression strategy, large vector volumes can be managed efficiently, balancing accuracy and performance.
Maintaining Filter-friendly Payload Indexes
Maintaining filter-friendly payload indexes means not only storing the vectors themselves but also organizing additional metadata so it can be used for filtering quickly and reliably. In practice, finding semantically similar chunks via vector search alone is rarely sufficient. Results often need to be restricted by attributes such as department, document type, publication date or access level. To perform such filtering efficiently, the vector database must maintain its own structures that support these filters with minimal overhead.
For example: A company wants answers only from documents of a certain department, say accounting. The vector database contains chunks from all departments, but each chunk has a metadata field “department”. A filter-friendly payload index ensures that this information is stored directly in the index and applied efficiently to each query. Without such indexes, the system would first retrieve all vector hits and then filter them afterward, consuming memory and time.
Without filter-friendly payload indexes, queries slow down as data volumes grow. Users see irrelevant hits from other areas or wait long for answers because filtering happens too late. With a well-maintained payload index, attributes can be considered during retrieval, making the hit set leaner, improving quality and enabling answer generation to work on a much more targeted basis.
Enforcing Pre-filtering Before ANN
Enforcing pre-filtering before Approximate Nearest Neighbor (ANN) means applying constraints such as document type, language, project context or access rights before starting the ANN search itself. ANN is very efficient at finding the most similar vectors among millions but becomes slower and less precise as the search space grows. Therefore, it is crucial to limit the candidate set from the outset to only the truly relevant vectors.
For example: A company stores contracts, support tickets and manuals together in a vector database. A user searches for information on a product issue, which is only contained in support tickets. If pre-filtering is enforced before the ANN search, the system only searches vectors with metadata “Document type = Support Ticket.” Without this measure, the ANN search would scan the entire corpus and filter afterward, leading to higher compute costs, longer response times and the risk of irrelevant hits like contract clauses in the candidate list.
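A pre-filter can be sketched as metadata selection before any distance computation; the documents, the "type" field and the brute-force search below stand in for a real vector database:

```python
import numpy as np

docs = [
    {"id": 0, "type": "contract"},
    {"id": 1, "type": "support_ticket"},
    {"id": 2, "type": "support_ticket"},
    {"id": 3, "type": "manual"},
]
vectors = np.random.default_rng(2).normal(size=(4, 8)).astype(np.float32)

def prefiltered_search(query, doc_type, top_k=2):
    # Restrict the candidate set by metadata BEFORE the vector comparison,
    # so only matching documents ever enter the distance calculation.
    ids = [d["id"] for d in docs if d["type"] == doc_type]
    dists = np.linalg.norm(vectors[ids] - query, axis=1)
    return [ids[i] for i in np.argsort(dists)[:top_k]]

hits = prefiltered_search(vectors[1], "support_ticket")
```

Contract clauses and manuals can never appear in the result, regardless of how close their vectors happen to be.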
Without consistent pre-filtering, unnecessary costs emerge, the system loses speed and users risk retrieving inappropriate information. With properly implemented pre-filtering, queries remain lean, hit quality improves and relevance and compliance requirements are reliably met.
Designing Sharding and Replication Cleanly
The core problem in vector databases is that a single server quickly hits physical and technical limits as data volume and usage grow. With millions or even billions of embeddings, memory runs out, index structures like HNSW or IVF become too large for one machine and parallel queries cause CPU and I/O bottlenecks. Additionally, availability becomes an issue: if the only server goes down, the entire vector search is unavailable.
Sharding solves this by splitting the complete dataset into subsets. Each subset holds a portion of the embeddings and is stored on its own server or cluster node. No server needs to hold the full index, distributing memory and compute load across nodes. However, a search query must run against all relevant shards since the closest vectors could reside in any shard. Partial results are then merged. Without replication, a shard outage leads to quality loss because its embeddings are missing from the search.
Replication complements sharding by keeping each subset on multiple servers. Every shard has at least one copy. Redundancy brings two benefits. First, data remains available even if a server or shard fails because a replica can take over. Second, read queries can be distributed across replicas, increasing performance under heavy load. This prevents a single shard failure from causing incomplete or incorrect search results.
The difference is clear: sharding distributes, replication duplicates. Together they create a scalable and fault-tolerant system.
The comparison to RAID on hard drives illustrates this well. Sharding corresponds to “striping” like RAID 0, where data is split into blocks across disks for speed. Replication resembles RAID 1, where each disk holds an exact copy of the other for redundancy. In vector databases, both are often combined to manage large data volumes efficiently and ensure high availability.
Without sharding and replication, a vector database quickly hits scaling and availability limits in large projects. With both, it remains high-performing, fault-tolerant and consistent even with billions of vectors, avoiding immediate quality loss from individual failures.
Defining Consistency and Read Isolation
Defining consistency and read isolation in a distributed vector database means specifying how queries and updates interact and which data state users see when they search. Because embeddings are distributed across shards and replicas, new entries or changes may arrive at different times on each server. Consistency defines whether a query must return the same state across all nodes or if temporary differences are allowed. Strong consistency guarantees that after an update all nodes immediately reflect the same state, while eventual consistency allows brief discrepancies between replicas.
Read isolation refers to whether queries during an ongoing write operation can see intermediate results. Under low isolation, a search can access data that is still being modified and not yet fully stable. High isolation ensures queries see either the old or the new state but never a mixture.
Without defined rules for consistency and isolation, vector searches yield contradictory hits. Users get different results for the same query, answer generation relies on incomplete data and audit trails become unreliable. Clear specifications ensure predictability, traceability and trust in the system.
Incremental Index Updates
Incremental index updates mean adding or replacing embeddings in the existing vector index without rebuilding it entirely. For large datasets, a full rebuild is extremely compute-intensive and can take hours or days. An incremental process updates only the affected parts of the index while the rest remains unchanged and immediately usable.
The workflow is: Once a document is processed and embedded, its vector is inserted at the appropriate place in the index. When a document is replaced or deleted, the corresponding vectors are marked or removed. Techniques like background processes or delta indexes ensure changes are continuously applied without interrupting vector search operations.
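The upsert and delete workflow can be sketched with a toy flat index; a production system would update an ANN structure such as HNSW instead of a dict, but the lifecycle operations are the same:

```python
import numpy as np

class IncrementalIndex:
    """Toy flat index supporting upserts and deletes without a full rebuild."""

    def __init__(self):
        self.vectors = {}   # doc_id -> embedding

    def upsert(self, doc_id, vec):
        # New documents are inserted; changed documents replace stale vectors.
        self.vectors[doc_id] = np.asarray(vec, dtype=np.float32)

    def delete(self, doc_id):
        self.vectors.pop(doc_id, None)

    def search(self, query, top_k=3):
        ids = list(self.vectors)
        mat = np.stack([self.vectors[i] for i in ids])
        dists = np.linalg.norm(mat - np.asarray(query, dtype=np.float32), axis=1)
        return [ids[i] for i in np.argsort(dists)[:top_k]]

idx = IncrementalIndex()
idx.upsert("a", [1.0, 0.0])
idx.upsert("b", [0.0, 1.0])
idx.upsert("a", [0.9, 0.1])   # document "a" was re-embedded and replaced
idx.delete("b")               # document "b" was retired
```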
Without incremental index updates, the index would have to be periodically fully rebuilt. This leads to long downtimes, outdated search results and high operational costs. With incremental updates, the index stays current, vector search remains performant and the system can reliably handle continuously growing data volumes.
Using Results and Centroid Caches
Using results and centroid caches means storing intermediate calculations that are frequently repeated during vector searches. This accelerates queries without altering result quality. There are two levels where caching is beneficial.
On the first level, complete search queries are cached. When many users ask the same or very similar question, identical or near-identical result lists occur. For instance, in an internal support portal, employees repeatedly search for “VPN guide”. Without a cache, the nearest neighbor search runs every time even though the same five documents are consistently found. With a result cache, this list is computed once and then directly reused.
The second level involves internal steps within the index. In index types like IVF, queries must first assign the query vector to the nearest cluster centroids before fine search within clusters begins. These distance calculations to centroids repeat for many similar queries. With a cache, distance values or cluster selections can be retrieved instantly instead of being recalculated each time.
Without such caches, the system redundantly performs trivial queries or internal steps, leading to unnecessary latency and growing load at high query volumes. Thoughtful caching delivers significant performance gains, especially in scenarios with frequently recurring queries or topic-focused searches.
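A result cache of the first kind can be sketched as a small LRU store keyed by the normalized query text. The class and its size limit are illustrative assumptions, not a specific library's API:

```python
import hashlib
from collections import OrderedDict

class ResultCache:
    """LRU cache for complete search result lists, keyed by normalized
    query text (illustrative sketch)."""

    def __init__(self, max_size: int = 1024):
        self.max_size = max_size
        self._store: OrderedDict[str, list[str]] = OrderedDict()

    def _key(self, query: str) -> str:
        # Normalizing case and whitespace lets "VPN guide" and
        # "vpn  GUIDE" hit the same cache entry.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query: str):
        key = self._key(query)
        if key in self._store:
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]
        return None  # cache miss: run the nearest neighbor search

    def put(self, query: str, results: list[str]) -> None:
        key = self._key(query)
        self._store[key] = results
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used
```

A centroid cache on the second level works analogously, but stores the cluster selection per query region instead of the final result list, and is usually built into the index library itself.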
Optimizing Storage Layout
Optimizing storage layout means organizing how embeddings and index structures are physically stored in memory to make search operations as fast and resource-efficient as possible. In vector databases, performance heavily depends on how data is organized in RAM or on disk.
A central concept is whether data is stored row-wise or column-wise. Row-wise storage means each row contains a full vector. This speeds up distance calculations because all values of a vector reside contiguously in memory and can be loaded in a single access. Column-wise storage can be beneficial when only certain dimensions are needed or when applying compression to specific columns.
Alignment also matters. CPUs and GPUs do not load data byte by byte but in fixed-size blocks, e.g. 64-byte cache lines on CPUs or memory transactions of up to 128 bytes on GPUs. If a vector does not start on such a boundary, the hardware must load multiple blocks and assemble them, even if the vector would otherwise fit in a single block. Aligning vectors to these boundaries, known as memory alignment, lets them be fetched in a single access and processed much faster. Techniques like memory mapping are also used, where large index structures are mapped directly into the address space without a full copy into RAM.
Without optimizing the storage layout, avoidable bottlenecks emerge. Queries slow down due to fragmented memory access and hardware underutilization. With an optimized layout, vector search speed increases significantly and even very large datasets can be handled efficiently.
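The row-wise layout can be illustrated with NumPy: storing all embeddings as one C-contiguous float32 matrix keeps each vector's components adjacent in memory, and with a dimension that is a multiple of 16, every row starts on a 64-byte offset relative to the buffer start. Note the assumption: matching CPU cache lines exactly also requires the allocator to return a 64-byte-aligned buffer, which NumPy does not guarantee by default.

```python
import numpy as np

dim = 768       # typical embedding dimension; illustrative value
n = 10_000

# Row-major (C-order) layout: each vector's components are contiguous,
# so one distance computation reads one contiguous memory region.
vectors = np.ascontiguousarray(
    np.random.rand(n, dim).astype(np.float32)
)
assert vectors.flags["C_CONTIGUOUS"]

# Each float32 row occupies dim * 4 bytes. With dim = 768 that is
# 3072 bytes, a multiple of 64, so consecutive rows keep the same
# alignment offset instead of drifting across cache-line boundaries.
row_bytes = vectors.strides[0]
assert row_bytes == dim * 4
assert row_bytes % 64 == 0
```

For indexes larger than RAM, `np.memmap` (or the equivalent in the vector database) maps the file directly into the address space, so the operating system pages in only the regions a query actually touches.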
Managing Data Lifecycle in the Index
Managing the data lifecycle in the index means that entries in a vector database are not static forever but follow a defined lifecycle. Embeddings represent knowledge at the time of creation. When documents change, are corrected or expire, the associated vectors must be updated or removed. Otherwise, outdated or erroneous knowledge remains in the index, leading to inappropriate hits during retrieval.
Technically, lifecycle is managed with metadata. Each vector gets attributes like creation date, version number, validity period or expiration date. Updates replace old embeddings with new ones when the source document changes. Obsolete chunks can be automatically archived or deleted. Deletion obligations under regulations like GDPR are also implemented through this management.
Without data lifecycle control, the index grows uncontrolled, retrieval quality degrades due to conflicting or outdated hits and compliance risks arise as deleted or expired content remains discoverable. Active lifecycle management ensures only current, consistent and valid content is indexed, improving precision and trustworthiness of the RAG system.
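The metadata-driven lifecycle described above can be sketched as follows; the entry fields and the sweep job are illustrative assumptions, not a specific database's schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class IndexEntry:
    """Vector index entry with lifecycle metadata (illustrative)."""
    doc_id: str
    version: int
    created_at: datetime
    expires_at: Optional[datetime] = None  # e.g. GDPR retention limit

def is_active(entry: IndexEntry, now: Optional[datetime] = None) -> bool:
    now = now or datetime.now(timezone.utc)
    return entry.expires_at is None or entry.expires_at > now

def sweep(entries: list[IndexEntry]) -> list[IndexEntry]:
    # Periodic cleanup job: drop expired entries and, per document,
    # keep only the highest version so outdated embeddings never
    # surface in retrieval.
    latest: dict[str, IndexEntry] = {}
    for e in entries:
        if not is_active(e):
            continue
        current = latest.get(e.doc_id)
        if current is None or e.version > current.version:
            latest[e.doc_id] = e
    return list(latest.values())
```

In a real system the sweep would run as a scheduled background task against the vector store's metadata filters rather than over an in-memory list.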
Corpus Routing and Index Selection per Query
Corpus routing and index selection per query means directing each search not blindly over the entire dataset but specifically to the relevant sub-corpus or index. Large enterprises often maintain parallel data worlds, e.g. an index for technical manuals, one for legal contracts, one for support tickets and one for marketing materials. Merging all this into one giant index makes retrieval imprecise and inefficient.
Routing decides based on the query which index to address or whether multiple indexes should be used in parallel. Query attributes such as detected keywords, language, document type or organizational hints like department or project can be evaluated. Manually set parameters or metadata in the user query also play a role. A routing module then forwards the query only to the relevant indexes, saving resources and search time.
For example, a query like “Show me maintenance intervals for machine X” is clearly technical and is routed only to the manuals and maintenance documents index. The marketing or contracts index is not searched. This reduces response time, avoids irrelevant hits and significantly improves result quality.
Without corpus routing, every query would search all indexes. This leads to unnecessary load, higher latency and many irrelevant hits. It also raises the risk of unrelated chunks climbing in the ranking and distorting the answer. With targeted routing, search stays focused, efficient and delivers more relevant responses.
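A minimal routing module might look like the keyword-based sketch below. The index names and keyword lists are invented for illustration; a production router could just as well use a lightweight classifier or the language model itself to make the decision:

```python
# Hypothetical sub-corpus names and trigger keywords (assumptions).
ROUTING_RULES = {
    "manuals": {"maintenance", "machine", "interval", "manual"},
    "contracts": {"clause", "liability", "termination", "contract"},
    "support": {"ticket", "error", "vpn", "login"},
}

def route(query: str, default: str = "all") -> list[str]:
    """Return the indexes a query should be sent to.

    Falls back to searching everything when no rule matches,
    so recall is never silently lost."""
    words = set(query.lower().split())
    targets = [name for name, keywords in ROUTING_RULES.items()
               if words & keywords]
    return targets or [default]
```

With these rules, "Show me maintenance intervals for machine X" is routed only to the manuals index, while an unmatched query still reaches the full dataset.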
Hybrid Environment with Synchronized IDs
The core issue in enterprise search systems is that no single technology covers all requirements simultaneously. Users often need results from various sources because a query rarely concerns just one level. A search for "ISO 27001" requires not only semantically similar content from policies but also exact text matches and organizational information, such as approval status or affected departments, from metadata search. Only when these results are combined does a complete picture emerge: responses become semantically precise, textually verifiable and organizationally correct.
Strictly speaking, RAG works only with vector search and a language model. In practice, it is often extended with full-text and metadata search as well. This covers all relevant aspects: semantic proximity, exact text excerpts and organizational context information. To prevent duplicate or conflicting results, a hybrid environment with synchronized IDs is needed: each source refers via a common identifier to the same origin, so hits can be consolidated and fed consistently into answer generation. Without synchronized IDs, duplicate, conflicting or incomplete results arise; with them, search remains precise, comprehensive and consistent.
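Consolidating hits over synchronized IDs can be sketched as a weighted merge. The three backends are abstracted here as plain `(doc_id, score)` lists with scores already scaled to [0, 1], and the weights are illustrative, not tuned values:

```python
from collections import defaultdict

def merge_hits(vector_hits: list[tuple[str, float]],
               fulltext_hits: list[tuple[str, float]],
               metadata_hits: list[tuple[str, float]],
               weights: tuple[float, float, float] = (0.5, 0.3, 0.2)
               ) -> list[tuple[str, float]]:
    """Consolidate hits from three search backends that share a
    common document ID space (illustrative sketch)."""
    combined: dict[str, float] = defaultdict(float)
    for hits, weight in zip((vector_hits, fulltext_hits, metadata_hits),
                            weights):
        for doc_id, score in hits:
            # The shared ID guarantees the scores refer to the same
            # source document, so they can simply be accumulated.
            combined[doc_id] += weight * score
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

The key point is not the weighting scheme (rank-based fusion such as RRF is a common alternative) but that the merge is only possible because all three systems agree on the identifier.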
Setting and Documenting the Distance Function: Cosine, L2, Inner Product
Setting and documenting the distance function means defining the mathematical criterion by which similarity between two embeddings is calculated. Three established methods are cosine distance, L2 distance and inner product. Cosine distance measures the angle between two vectors and is well suited when vector direction matters and length is irrelevant. L2 distance, also called Euclidean distance, considers geometric distance in space, taking into account both direction and length. Inner product or dot product evaluates how strongly two vectors point in the same direction, with vector length also influencing the result.
Choosing the distance function depends heavily on the use case and the embedding model used. Some models are explicitly trained for cosine similarity, others for Euclidean distance or dot product. Using an unsuitable distance function can cause relevant neighbors to be overlooked or irrelevant hits to be preferred. Therefore, it is crucial to clearly document the distance function and apply it consistently across all system components.
Without documented and consistent distance function selection, inconsistencies arise. Two teams might query the same index with different methods and get different results. Benchmarks lose their significance and answer quality is no longer reproducible. With a clearly chosen and documented distance function, search stays reliable and results traceable.
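The three measures can be written out directly. One useful consequence worth documenting alongside the choice: for vectors normalized to unit length, all three produce the same ranking, since ||a − b||² = 2 − 2·(a·b).

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Angle-based: ignores vector length, only direction matters.
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def l2_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Euclidean: takes both direction and length into account.
    return float(np.linalg.norm(a - b))

def inner_product(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product: higher means more similar, and vector length
    # influences the result (unlike the two distances above).
    return float(a @ b)
```

This also shows why the choice must match the embedding model: for vectors of equal direction but different length, cosine distance reports zero while L2 and inner product diverge, so the three functions are not interchangeable on unnormalized embeddings.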
Conclusion
In Part 2 it became clear that the quality of a RAG system depends not only on the language model but to a large extent on embeddings and how they are stored. Domain-specific adaptation, multilingual representation, normalization, out-of-distribution detection and the choice of suitable index structures determine how reliably and efficiently semantic search works and what foundation answer generation has. Without clean embeddings and robust indexing, even the most powerful model remains imprecise.
In the next parts of the series, the retrieval process moves into the spotlight. It will cover how to improve the selection and weighting of relevant chunks, which strategies are used for re-ranking and hybrid methods, and how metadata and context are systematically integrated into the search. Step by step, a complete picture emerges of how RAG systems can be optimized from the underlying data to the final answer.
