Understanding Vector Embeddings: Applications, Creation, and Similarity Search

Week 10: Vector Databases, Cloud Architecture, and Operations

Introduction: Vector embeddings are lists of numbers that represent data numerically, and they are a core concept in machine learning. They are central to many applications, including:

  • Natural Language Processing (NLP)
  • Recommendation systems
  • Search algorithms

Examples of real-world systems that rely on embeddings:

  • Recommendation engines
  • Voice assistants
  • Language translators

What Are Vector Embeddings?

Vector embeddings are lists of numbers that represent complex data. They let us perform mathematical operations on abstract data such as:

  • Entire paragraphs of text
  • Images
  • Numerical data (transformed for easier operations)

Embeddings reduce high-dimensional data into meaningful numerical representations.

Why Are Vector Embeddings Useful?

Vectors are special because they translate semantic similarity, as perceived by humans, into proximity in vector space. By representing real-world objects and concepts as vector embeddings, we can measure their semantic similarity by how close they are to each other in that space. Some data is already numeric or maps to numbers in a straightforward way, such as:

  • Ordinal data (values with a natural order)
  • Categorical data (values that can be encoded as integers or one-hot vectors)

Machine learning algorithms operate on numerical data, so abstract data such as an entire document of text must first be converted into a numeric format. Vector embeddings are suitable for many machine learning tasks, such as:

  • Clustering
  • Recommendation
  • Classification

Examples of Vector Embedding Applications

Clustering Tasks

Clustering algorithms group similar points into the same cluster while keeping points from different clusters as dissimilar as possible.

Recommendation Systems

For an unseen object, find the most similar objects based on their vector embeddings and recommend those similar objects.

Classification Tasks

Predict the label of an unseen object by taking a majority vote over the labels of its most similar neighbors.
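As a concrete illustration of the classification case, here is a minimal nearest-neighbor sketch in NumPy; the embeddings, labels, and query are made-up toy data.

```python
import numpy as np
from collections import Counter

def knn_classify(query, embeddings, labels, k=3):
    """Predict a label by majority vote over the k nearest embeddings."""
    # Euclidean distance from the query to every stored embedding
    distances = np.linalg.norm(embeddings - query, axis=1)
    nearest = np.argsort(distances)[:k]          # indices of the k closest vectors
    votes = Counter(labels[i] for i in nearest)  # tally labels among those neighbors
    return votes.most_common(1)[0][0]

# Toy 2-D "embeddings" with known labels
embeddings = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
labels = ["sports", "sports", "politics", "politics"]

print(knn_classify(np.array([0.85, 0.15]), embeddings, labels))  # -> sports
```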

Creating Vector Embeddings: Feature Engineering

One way to create vector embeddings is to engineer the vector values by hand using domain knowledge; this approach is known as feature engineering:

  • Uses domain knowledge to design the vector values.
  • Example: In medical imaging, features such as shape, color, and regions of an image are quantified to capture its semantics.

Limitation: Requires significant domain expertise and is expensive to scale.
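As a toy illustration of this hand-crafted approach (not a real medical-imaging pipeline), the sketch below quantifies color and a crude region statistic to build a small feature vector; the specific features are invented for the example.

```python
import numpy as np

def handcrafted_features(image):
    """Build a small feature vector from an RGB image array of shape (H, W, 3)."""
    color_mean = image.mean(axis=(0, 1))         # average intensity per channel (3 values)
    color_std = image.std(axis=(0, 1))           # spread per channel (3 values)
    brightness = image.mean(axis=2)
    bright_fraction = (brightness > 0.5).mean()  # crude "region" feature (1 value)
    return np.concatenate([color_mean, color_std, [bright_fraction]])

image = np.random.rand(64, 64, 3)                # stand-in for a real image in [0, 1]
print(handcrafted_features(image).shape)         # (7,) -- a 7-dimensional embedding
```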

Creating Vector Embeddings: Model Training

Instead of engineering embeddings by hand, we can train models to generate the vectors for us. Deep Neural Networks (DNNs) are a common tool for this. Characteristics of model-generated embeddings:

  • High dimensional (often hundreds to a few thousand dimensions).
  • Dense (most or all values are non-zero), in contrast to sparse representations.

Embedding Models for Text Data

Text data can be transformed into vector embeddings using models like:

  • Word2Vec: Captures word relationships.
  • GloVe: Encodes global word co-occurrence.
  • BERT: Embeds words, sentences, and paragraphs with deep context.
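As a minimal sketch, here is how text embeddings can be generated with the sentence-transformers library; the library and the all-MiniLM-L6-v2 model (a BERT-derived sentence encoder) are illustrative choices, not something prescribed by these notes.

```python
from sentence_transformers import SentenceTransformer

# Load a pretrained sentence-embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Vector embeddings map text to points in a vector space.",
    "Embeddings place similar sentences close together.",
    "The weather is sunny today.",
]

# encode() returns one dense vector per sentence (384 dimensions for this model)
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 384)
```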

Embedding Models for Image and Audio Data

Images

Use Convolutional Neural Networks (CNNs). Examples:

  • VGG
  • Inception

Audio Recordings

Transform audio into vector embeddings using image-based approaches: for example, apply a CNN to a spectrogram representation of the audio frequencies.
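A hedged sketch of extracting an image embedding with a pretrained VGG16 from torchvision; using the 4096-dimensional penultimate layer as the embedding is a common but arbitrary design choice, and the input tensor stands in for a real image.

```python
import torch
from torchvision.models import vgg16, VGG16_Weights

# Load VGG16 pretrained on ImageNet and drop its final classification layer,
# keeping the 4096-dimensional penultimate activations as the embedding
weights = VGG16_Weights.DEFAULT
model = vgg16(weights=weights)
model.classifier = model.classifier[:-1]
model.eval()

preprocess = weights.transforms()      # resizing/normalization expected by the model

image = torch.rand(3, 224, 224)        # placeholder for a real image tensor
with torch.no_grad():
    embedding = model(preprocess(image).unsqueeze(0))

print(embedding.shape)                 # torch.Size([1, 4096])
```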

Distance Metrics

When creating an index, we can choose from the following similarity metrics:

  • Euclidean
  • Cosine
  • Dot Product

For the most accurate results, choose the similarity metric used to train the embedding model for your vectors.

Euclidean Distance Metric

Querying indexes with this metric returns a similarity score equal to the squared Euclidean distance between the result and query vectors. When using metric='euclidean', the most similar results are those with the lowest similarity score.

Cosine Similarity Metric

Cosine similarity measures the angle between two vectors and is often used to find similarities between documents. Scores are normalized to the range [-1, 1], with higher values indicating greater similarity.

Dot Product Metric

The dot product multiplies two vectors element-wise and sums the results to determine their similarity. The more positive the result, the more closely aligned the two vectors are, although the score is also influenced by the vectors' magnitudes.
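To make the three metrics concrete, a small NumPy example computing each of them for one pair of vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 3.0])

# Squared Euclidean distance: lower means more similar
euclidean_sq = np.sum((a - b) ** 2)                              # 2.0

# Cosine similarity: in [-1, 1], higher means more similar; ignores magnitude
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # ~0.93

# Dot product: more positive means more similar; sensitive to magnitude
dot = np.dot(a, b)                                               # 13.0

print(euclidean_sq, cosine, dot)
```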

Serverless Index Architecture

Pinecone serverless runs as a managed service on the AWS, GCP, and Azure cloud platforms. Within a given cloud region, client requests go through an API gateway to either a control plane or data plane. All vector data is written to highly efficient, distributed object storage.

API Gateway

Requests to Pinecone serverless contain an API key assigned to a specific project. Each incoming request is load-balanced through an edge proxy to an authentication service. The authentication service verifies that the API key is valid for the targeted project. Requests are routed to either the control plane or the data plane, depending on the type of work.

Control Plane

Handles requests to manage organizational objects such as:

Projects

A project is a logical grouping of resources like indexes and API keys. Projects help in isolating data and configurations, enabling multiple teams or applications to work independently within the same Pinecone account. API keys are scoped to specific projects, ensuring access control at the project level.

Indexes

An index is a collection of vector data optimized for similarity search. Indexes are partitioned into namespaces to logically organize data for better query management. Each index can auto-scale and is backed by high-performance, distributed storage.

API Keys

API keys are credentials used to authenticate and authorize client requests. Each API key is tied to a specific project, ensuring fine-grained access control.

Data Plane

Handles requests to write and read records in indexes. Indexes are partitioned into logical namespaces, and all requests are scoped by namespace. Writes and reads follow separate paths, and the compute resources for each auto-scale independently based on demand, so queries do not impact write throughput and vice versa.

Object Storage

For each namespace in a serverless index:

  • Pinecone clusters records that are likely to be queried together.
  • It identifies a centroid dense vector to represent each cluster.
  • Clusters and centroids are stored as data files in distributed object storage.

This design provides virtually limitless data scalability and high availability.
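The clustering-and-centroid idea can be sketched in a few lines: group the stored vectors, keep one centroid per group, and at query time scan only the cluster whose centroid is closest to the query. This is a conceptual illustration using k-means, not Pinecone's actual implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))   # stand-in for the records in one namespace

# Cluster the records and keep a centroid for each cluster
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(vectors)
centroids = kmeans.cluster_centers_

def query(q, top_k=5):
    # Route the query to the cluster with the nearest centroid ...
    cluster = np.argmin(np.linalg.norm(centroids - q, axis=1))
    members = np.where(kmeans.labels_ == cluster)[0]
    # ... then scan only that cluster's records
    dists = np.linalg.norm(vectors[members] - q, axis=1)
    return members[np.argsort(dists)[:top_k]]

print(query(rng.normal(size=64)))       # ids of the approximate nearest neighbors
```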

Generating Vectors in MongoDB

The workflow, end to end (a condensed sketch follows below):

  • Load the nomic-embed-text-v1 embedding model.
  • Create a function named get_embedding that uses the model to generate an embedding for a given text input.
  • Test the function by generating a single embedding for the string "foo".
  • Get a subset of documents from the sample_airbnb.listingsAndReviews collection that have a non-empty summary field.
  • Generate an embedding from each document's summary field using the get_embedding function you defined.
  • Update each document with a new embedding field containing that value, using the MongoDB PyMongo driver.
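A condensed sketch of those steps, assuming a MongoDB Atlas cluster with the sample_airbnb dataset loaded and a connection string available in the MONGODB_URI environment variable (both assumptions, not stated in these notes):

```python
import os
from pymongo import MongoClient
from sentence_transformers import SentenceTransformer

# Load the nomic-embed-text-v1 embedding model
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

def get_embedding(text):
    """Generate a vector embedding for the given text input."""
    return model.encode(text).tolist()

print(len(get_embedding("foo")))  # sanity check on a single string

client = MongoClient(os.environ["MONGODB_URI"])  # assumed connection string
collection = client["sample_airbnb"]["listingsAndReviews"]

# Subset of documents with a non-empty summary field
docs = collection.find({"summary": {"$exists": True, "$nin": [None, ""]}}).limit(50)

# Add an embedding field computed from each document's summary
for doc in docs:
    collection.update_one(
        {"_id": doc["_id"]},
        {"$set": {"embedding": get_embedding(doc["summary"])}},
    )
```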

Steps of Implementing Similarity Search

Similarity search is one of the most popular uses of vector embeddings. Search algorithms such as k-nearest neighbors (KNN) and approximate nearest neighbor (ANN) search require calculating distances between vectors to determine similarity, and vector embeddings are what those distances are computed over. Nearest neighbor search can in turn be used for tasks like de-duplication, recommendation, anomaly detection, and reverse image search.

Step 1: Embed Data

Use an embedding model to convert data into vector embeddings. These embeddings are required for similarity search. Examples of embedding models include:

  • Sentence Transformers
  • OpenAI Embeddings
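For example, the OpenAI Python client can produce an embedding in a few lines; the model name text-embedding-3-small is just one option, and an OPENAI_API_KEY environment variable is assumed.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",   # example model choice
    input=["A cozy two-bedroom apartment near the beach."],
)

vector = response.data[0].embedding
print(len(vector))  # embedding dimension (1536 for this model)
```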

Step 2: Create an Index

Create an index to store your vector embeddings. Specify the following:

  • Dimension: Matches the output of the embedding model.
  • Similarity Metric: For example, cosine similarity or Euclidean distance.
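A minimal sketch using the current Pinecone Python client; the index name, cloud, and region are placeholder choices, and the dimension assumes a 1536-dimensional embedding model.

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

pc.create_index(
    name="quickstart",                 # placeholder index name
    dimension=1536,                    # must match the embedding model's output size
    metric="cosine",                   # or "euclidean" / "dotproduct"
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

index = pc.Index("quickstart")
```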

Step 3: Ingest Data

Load vector embeddings and metadata into your index. Use Pinecone’s import or upsert feature. Partition data using namespaces for:

  • Faster queries.
  • Multitenant isolation between customers.
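Continuing the same sketch, a few records with metadata can be upserted into a namespace; the IDs, vector values, and metadata are illustrative.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("quickstart")

# Each record carries an ID, its embedding values, and optional metadata
index.upsert(
    vectors=[
        {"id": "listing-1", "values": [0.1] * 1536, "metadata": {"city": "Porto"}},
        {"id": "listing-2", "values": [0.2] * 1536, "metadata": {"city": "Sydney"}},
    ],
    namespace="customer-a",  # namespaces keep tenants' data isolated
)
```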

Step 4: Search

Convert queries into vector embeddings. Use these embeddings to search your index for semantically similar vectors. Similarity search enables efficient retrieval based on meaning rather than exact matches.
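A query follows the same pattern: embed the query text with the same model used at ingest time, then ask the index for its nearest neighbors. Here query_embedding is a placeholder for that model's output.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("quickstart")

query_embedding = [0.15] * 1536  # placeholder: the embedded query text

results = index.query(
    vector=query_embedding,
    top_k=3,
    namespace="customer-a",
    include_metadata=True,
)

for match in results.matches:
    print(match.id, match.score, match.metadata)
```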

Step 5: Optimize Performance

  • Filter queries: Use metadata to limit the scope of your search.
  • Rerank results: Improve relevance based on query context.
  • Hybrid search: Combine similarity-based searching with keyword-based searching.

Ensure scalability and low latency for large datasets.
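For example, a metadata filter can be attached directly to the query from Step 4; the filter field and value below are illustrative.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("quickstart")

results = index.query(
    vector=[0.15] * 1536,                # placeholder query embedding
    top_k=3,
    namespace="customer-a",
    filter={"city": {"$eq": "Porto"}},   # only consider records whose metadata matches
    include_metadata=True,
)
```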

Conclusion

Implementing similarity search involves:

  • Embedding data.
  • Creating and managing an index.
  • Ingesting and querying data efficiently.

Tools like Pinecone facilitate efficient vector-based search. Optimize performance by leveraging filtering, reranking, and hybrid methods.

RAG Architecture

Retrieval-augmented generation (RAG) aims to improve the quality of a pre-trained LLM's output by augmenting the prompt with data retrieved from a knowledge base. The success of RAG lies in retrieving the most relevant results from the knowledge base, and this is where embeddings come into the picture.
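Putting the pieces together, a bare-bones retrieval step for RAG might look like the sketch below; embed_fn, index, and llm are placeholders for whatever embedding model, vector index, and LLM client you use, and the code assumes each record's metadata stores its source text under a "text" key.

```python
def retrieve_context(question, index, embed_fn, top_k=3):
    """Embed the question and pull the most relevant chunks from the knowledge base."""
    results = index.query(
        vector=embed_fn(question),
        top_k=top_k,
        include_metadata=True,
    )
    # Assumes each record's metadata stores its source text under "text"
    return [match.metadata["text"] for match in results.matches]

def answer_with_rag(question, index, embed_fn, llm):
    context = "\n\n".join(retrieve_context(question, index, embed_fn))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)  # call whatever LLM client you are using
```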
