This blog post covers the concept of embeddings and vector databases. It first explains what embeddings are and how they are used in Natural Language Processing (NLP), then describes vectors in a three-coordinate space and their extension to multidimensional vectors, and finally introduces ChromaDB, a specialized vector database.

What is an Embedding?

An embedding is a technique from machine learning and data processing that transforms objects such as words, sentences, or documents into points in a continuous vector space. In this space, similar objects are represented by similar vectors, meaning they lie close together. Embeddings are frequently used to capture and analyze the semantic meaning of texts.

Vectors and the Numerical Space

Vectors in Three Dimensions

A vector is a list of numbers that can be viewed as coordinates in a space. From everyday experience, three dimensions are familiar: x, y, and z. They are easy to visualize:

  • x: horizontal axis

  • y: vertical axis

  • z: depth

A point in this three-dimensional space can be described by a vector such as (x, y, z). For example, the vector (1, 2, 3) specifies a point’s position in that space. If words were placed in such a space, the vector for “living being” would lie closer to the vector for “human” than to the vector for “handball”.
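This notion of closeness can be made concrete with the Euclidean distance between vectors. The word positions below are invented purely for illustration; a real embedding model would assign the coordinates:

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points in n-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Invented 3-dimensional positions for three words
living_being = (1.0, 2.0, 3.0)
human        = (1.2, 2.1, 2.8)
handball     = (7.0, 0.5, 9.0)

print(euclidean_distance(living_being, human))     # small: semantically close
print(euclidean_distance(living_being, handball))  # large: semantically distant
```

The same function works unchanged for vectors of any length, which is exactly what makes it useful for the high-dimensional spaces discussed next.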

Multidimensional Vectors

In mathematics and machine learning, high-dimensional spaces with more than three dimensions are often used. Each additional dimension represents another independent attribute or feature of the data. For example:

  • 4 dimensions: (x, y, z, w)

  • 100 dimensions: (x1, x2, x3, …, x100)

Although these additional dimensions cannot be visualized, they help to capture complex data and their relationships more precisely. Each dimension adds a new type of information to the overall representation of a point (or word), e.g., temperature, weight, or color. In such high-dimensional vector spaces, the coordinates therefore no longer serve a spatial description: each dimension represents a feature or property, and the positions of the vectors express how the words are semantically related to each other.

Text Corpus Analysis and Embeddings

To maximize the efficiency and accuracy of ChromaDB’s vector database, a careful analysis of the text corpus is essential. This analysis serves to prepare the texts and extract relevant features that are then embedded into the vector space. Through this process, semantic similarities between texts can be captured and utilized more precisely, which significantly enhances ChromaDB’s performance in practice. This process includes several steps:

  1. Corpus Preparation: collection and cleaning of texts.
  2. Tokenization: splitting the text into words or sentences.
  3. Normalization: unifying the words (e.g., lowercase, stemming).
  4. Stop Words Removal: removing frequently occurring but uninformative words.
  5. Feature Extraction: methods like Bag of Words or TF-IDF for weighting words.
  6. Modeling and Analysis: using word embeddings or topic modeling to capture semantic meanings and topics.
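The first steps of this pipeline can be sketched in plain Python. The tokenizer and stop-word list below are deliberately minimal stand-ins; a real pipeline would typically use a library such as NLTK or spaCy and a TF-IDF weighting instead of raw counts:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "this", "about", "of", "and", "to", "in"}  # toy list

def preprocess(text):
    """Tokenization, normalization (lowercase), and stop-word removal."""
    tokens = re.findall(r"[a-zA-Z]+", text.lower())    # split into lowercase words
    return [t for t in tokens if t not in STOP_WORDS]  # drop uninformative words

def bag_of_words(texts):
    """Feature extraction: per-document word counts (Bag of Words)."""
    return [Counter(preprocess(t)) for t in texts]

corpus = [
    "This is a document about pineapple.",
    "This is a document about oranges.",
]
for counts in bag_of_words(corpus):
    print(counts)
```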

ChromaDB

ChromaDB is a specialized vector database designed to store, manage, and query vectors. Vector databases are especially useful for applications in NLP, where semantic similarity between texts needs to be captured. ChromaDB allows you to add documents, which are converted into vectors, and enables you to perform queries based on these vectors.

ChromaDB Example

The following example shows how ChromaDB can be used to add documents and perform a query. Note: running the examples in Visual Studio Code triggered numerous bugs, whereas PyCharm worked without any issues.

  • ids: IDs of the returned documents. Value: [['id1', 'id2']]

  • distances: distances between the query and the returned documents (the smaller the value, the more similar). Value: [[1.0404009819030762, 1.2430799007415771]]

  • metadatas: metadata of the returned documents. Value: [[None, None]]

  • embeddings: embeddings of the documents (not included in this case). Value: None

  • documents: the actual document texts returned by the query. Value: [['This is a document about pineapple', 'This is a document about oranges']]

  • uris: URIs of the returned documents (not included in this case). Value: None

  • data: additional data, if present (not included in this case). Value: None

  • included: fields included in the results. Value: ['metadatas', 'documents', 'distances']

Distance Measurement to Determine Similarity Between Vectors

The distance values in the output represent the similarity between the query and the returned documents. ChromaDB uses embedding models to project documents into a multidimensional vector space, where similarity is measured by the distances between these vectors.

The distance between vectors in a multidimensional space can be calculated with various metrics. One of the most common is the Euclidean distance; other options include cosine similarity and the Manhattan distance.
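These three metrics can be written out directly. The vectors u and v are arbitrary illustration values, not output of any embedding model:

```python
import math

def euclidean(a, b):
    """L2 distance: straight-line distance between the vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between the vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

u = [1.0, 2.0, 3.0]
v = [2.0, 2.0, 4.0]
print(euclidean(u, v))          # about 1.414
print(manhattan(u, v))          # 2.0
print(cosine_similarity(u, v))  # close to 1.0
```

Note the difference in direction: for the two distances, smaller values mean more similar vectors, while for cosine similarity, larger values (up to 1.0) do.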

Persistence of Entries