In the previous post, you developed an image similarity search app using Jupyter Notebook. Utilizing the pgvector extension on Azure Cosmos DB for PostgreSQL, you were able to detect images that are semantically similar to a reference image or a text prompt.
By default, pgvector performs exact nearest neighbor search, calculating the similarity between the query vector and every vector in the database. While this type of search provides perfect recall, it often leads to longer search times. To enhance efficiency for large datasets, you should create indexes to enable approximate nearest neighbor search, which trades off result quality for speed. Pgvector supports two types of approximate indexes:
In this tutorial, you will:
The entire functional project is available in my GitHub repository. If you want to follow along, just fork the repository and clone it to have it locally available.
To proceed with this tutorial, ensure that you have the following prerequisites installed and configured:
Additionally, make sure you have set up your workspace according to the guidelines provided on the project’s repository.
The IVFFlat algorithm accelerates vector search by grouping the vectors in the dataset into clusters (also known as Voronoi regions or cells) and limiting the search scope to the few nearest clusters for each query rather than the entire dataset.
Let’s gain an intuitive understanding of how IVFFlat works. Suppose we represent our vectors as points in a two-dimensional space. We then apply k-means clustering to compute the cluster centroids. After identifying the centroids, we assign each vector in our dataset to its closest centroid. This process partitions the vector space into several non-intersecting regions, forming a Voronoi diagram, as depicted in the following image:
Now, each vector falls within a region. In the context of similarity search, each region consists of vectors that are semantically similar. Since vectors within the same region are more likely to be similar to each other than those in different regions, IVFFlat makes the search process more efficient.
Let’s consider the query vector shown in the following image. To find the nearest neighbors to this query vector using IVFFlat, we perform the following steps:

1. Calculate the distance between the query vector and each cluster centroid.
2. Select the cluster whose centroid is closest to the query vector.
3. Perform an exhaustive search over the vectors in the selected cluster to retrieve the nearest neighbors.
Although the IVFFlat algorithm accelerates the search process and provides good search quality, it can lead to errors. For example, the above image illustrates the scenario where the query vector resides at the edge of two cells. Despite the query vector being close to a datapoint in the blue region, this vector will not be considered a nearest neighbor candidate since the search scope is limited to the orange region.
To address this issue and enhance search quality, we can expand the search scope by selecting several regions to search for nearest neighbor candidates. However, this approach comes with a trade-off: it increases search time.
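To make the idea concrete, here is a minimal, self-contained Python sketch of the IVFFlat search scheme. It is an illustration only, not pgvector's implementation: the centroids are hand-picked for a toy 2-D dataset, whereas pgvector computes them with k-means during index build.

```python
import math

def l2(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign every vector to its nearest centroid: one inverted list per cluster."""
    lists = {i: [] for i in range(len(centroids))}
    for v in vectors:
        nearest = min(range(len(centroids)), key=lambda i: l2(v, centroids[i]))
        lists[nearest].append(v)
    return lists

def ivf_search(query, centroids, lists, probes=1, k=3):
    """Scan only the `probes` clusters whose centroids lie closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))
    candidates = [v for i in order[:probes] for v in lists[i]]
    return sorted(candidates, key=lambda v: l2(query, v))[:k]

# Toy dataset with two well-separated clusters and hand-picked centroids.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.25), (5.0, 5.1), (5.2, 4.9), (4.9, 5.2)]
centroids = [(0.1, 0.1), (5.0, 5.0)]
ivf = build_ivf(data, centroids)

# With probes=1 only the nearest cluster is scanned; raising probes widens the
# search scope (and the search time), exactly the trade-off described above.
print(ivf_search((0.15, 0.1), centroids, ivf, probes=1, k=2))
```

Increasing `probes` in the sketch mirrors widening the search scope to several Voronoi regions, at the cost of scanning more candidate vectors.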
The code for creating an IVFFlat index and inserting data into a PostgreSQL table can be found at data_upload/upload_data_to_postgresql_ivfflat.py.
To create an IVFFlat index using pgvector, two parameters need to be specified:

- The distance method determines the metric used to measure the similarity between vectors. Euclidean (L2) distance, inner product, and cosine distance correspond to the operator classes `vector_l2_ops`, `vector_ip_ops`, and `vector_cosine_ops`, respectively. It is essential to select the same distance metric for both the creation and querying of the index.
- The `lists` parameter specifies the number of clusters that will be created. Pgvector suggests that an appropriate number of lists is `rows / 1000` for datasets with up to 1 million rows and `sqrt(rows)` for larger datasets. It is also advisable to create at least 10 clusters.

To create an IVFFlat index in a PostgreSQL table, you can use the following statement:
```sql
CREATE INDEX ON <table_name> USING ivfflat (<vector_column_name> <distance_method>) WITH (lists = <number_of_clusters>);
```
It is important to note that the index should be created once the table is populated with data. If new vectors are added to the table, the index should be rebuilt to update the cluster centroids and accurately represent the new dataset.
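As a quick illustration of the sizing guidance above, the heuristic can be written as a small helper function (my own convenience wrapper, not part of pgvector):

```python
import math

def recommended_lists(rows: int) -> int:
    """pgvector's suggested starting point for the `lists` parameter:
    rows/1000 for up to 1M rows, sqrt(rows) beyond that, at least 10."""
    lists = rows // 1000 if rows <= 1_000_000 else round(math.sqrt(rows))
    return max(lists, 10)

print(recommended_lists(11_000))     # 11 clusters for an ~11k-row table
print(recommended_lists(4_000_000))  # 2000 clusters for a 4M-row table
```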
The number of regions to consider during search is determined by the `probes` parameter. According to the pgvector documentation, the recommended value for the `probes` parameter is `sqrt(lists)`. The default value is 1. To specify the value of the `probes` parameter, you can execute the following statement:
```sql
SET ivfflat.probes = <number_of_probes>;
```
To specify the number of probes for a single query, use the `LOCAL` keyword (`SET LOCAL`), which scopes the setting to the current transaction.
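Putting the two recommendations together, you can derive a starting `probes` value from the number of lists and emit the corresponding statement (again just an illustrative helper, not part of pgvector):

```python
import math

def recommended_probes(lists: int) -> int:
    """pgvector's suggested starting point: probes = sqrt(lists)."""
    return max(1, round(math.sqrt(lists)))

# SET applies to the whole session; SET LOCAL would scope the setting to the
# current transaction, i.e. effectively to a single query.
print(f"SET ivfflat.probes = {recommended_probes(100)};")  # SET ivfflat.probes = 10;
```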
The HNSW index is based on the construction of a multi-layered graph structure that is optimized for approximate nearest neighbor search. In this graph, the datapoints (also referred to as nodes or vertices) are connected to each other by edges, making it possible to navigate through the graph by following these edges.
The base layer of the multi-layer graph essentially represents the entire dataset, while the higher layers consist of fewer nodes, providing a simplified overview of the layers below. The higher layers contain longer links, allowing for longer jumps between nodes for faster search, while the lower layers contain shorter links, enabling more accurate search. Nearest neighbor search begins at the top layer, where the longest links are present. We then navigate to the nearest node and gradually move to lower layers until a local minimum is found. This process is illustrated in the following image:
The search process in an HNSW graph can be compared to the process of planning a trip between two cities. Much like how we start our journey with major roads and gradually transition to smaller ones as we approach our destination, the HNSW search procedure begins with longer links at the top layer and gradually moves to lower layers as we approach the desired data points.
The HNSW algorithm is based on two fundamental techniques: the probability skip list and navigable small world (NSW) graphs. A detailed explanation of how the index is constructed and searched is beyond the scope of this article. For further information, refer to the resources provided at the end of this article.
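The layer-descending search described above can be sketched in a few lines of Python. This is a deliberately stripped-down illustration: a greedy walk that tracks a single best candidate on a hand-built two-layer graph over 1-D points. A real HNSW implementation maintains a candidate list of size `ef` and builds the layered graph probabilistically.

```python
def greedy_search(graph, entry, query):
    """Walk one layer greedily: hop to the closest neighbor until no
    neighbor is closer than the current node (a local minimum)."""
    current = entry
    while True:
        nearest = min(graph[current], key=lambda n: abs(n - query))
        if abs(nearest - query) < abs(current - query):
            current = nearest
        else:
            return current

def layered_search(layers, entry, query):
    """Descend from the top (sparse) layer to the bottom (full) layer,
    reusing the node found on each layer as the entry point for the next."""
    for graph in layers:
        entry = greedy_search(graph, entry, query)
    return entry

# Ten 1-D points 0..9. The bottom layer links immediate neighbors (short,
# accurate links); the top layer keeps only 0, 4, 9 with long-range links.
bottom = {0: [1], **{i: [i - 1, i + 1] for i in range(1, 9)}, 9: [8]}
top = {0: [4], 4: [0, 9], 9: [4]}

print(layered_search([top, bottom], entry=0, query=6.3))  # 6
```

The top layer gets the walk close to the target in a couple of long hops, and the bottom layer refines the result with short hops, mirroring the trip-planning analogy below.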
Compared to the IVFFlat index, the HNSW index generally provides better query performance in terms of the tradeoff between recall and speed, but at the expense of higher build time and more memory usage. Additionally, it doesn't require a training step to build the index. This means you can create an HNSW index even before any data is inserted into the table, unlike the IVFFlat index, which needs to be rebuilt when data changes to accurately represent new cluster centroids.
The code for creating an HNSW index and inserting data into a PostgreSQL table can be found at data_upload/upload_data_to_postgresql_hnsw.py.
To create an HNSW index through the pgvector extension, three parameters need to be specified:

- The distance method determines the metric used to measure the similarity between vectors. Euclidean (L2) distance, inner product, and cosine distance correspond to the operator classes `vector_l2_ops`, `vector_ip_ops`, and `vector_cosine_ops`, respectively. It is essential to select the same distance metric for both the creation and querying of the index.
- The `m` parameter specifies the maximum number of connections with neighboring datapoints per point per layer. Its default value is 16.
- The `ef_construction` parameter defines the size of the list that holds the nearest neighbor candidates while building the index. The default value is 64.

To create an HNSW index in a PostgreSQL table, you can use the following statement:
```sql
CREATE INDEX ON <table_name> USING hnsw (<vector_column_name> <distance_method>) WITH (m = <m>, ef_construction = <ef_construction>);
```
For approximate nearest neighbor search with the HNSW index, an additional parameter needs to be considered. The `ef_search` parameter specifies the size of the list that holds the nearest neighbor candidates during query execution. The default value is 40. The `ef_search` parameter can be altered (for a single query or a session) using the following command:
```sql
SET hnsw.ef_search = <value>;
```
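One practical consequence of `ef_search` is worth noting: an HNSW index scan considers at most `ef_search` candidates, so a query whose `LIMIT` exceeds it can come back with fewer rows than requested. A tiny helper (my own, purely illustrative) makes the resulting rule of thumb explicit:

```python
def ef_search_for(limit: int, default: int = 40) -> int:
    """Choose an ef_search value that is never below the query's LIMIT,
    falling back to pgvector's default of 40 for small result sets."""
    return max(default, limit)

print(f"SET hnsw.ef_search = {ef_search_for(12)};")   # SET hnsw.ef_search = 40;
print(f"SET hnsw.ef_search = {ef_search_for(100)};")  # SET hnsw.ef_search = 100;
```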
To search for similar images through the HNSW index of the pgvector extension, we can use SQL `SELECT` statements and the built-in distance operators. The structure of a `SELECT` statement was explained in the Exact Nearest Neighbor Search blog post.
It's important to note that PostgreSQL does not guarantee the use of an approximate index, as it may determine that a sequential scan would be more efficient for a given query. To check whether PostgreSQL uses the index for a query, you can prefix the `SELECT` statement with the `EXPLAIN ANALYZE` keywords:
```sql
EXPLAIN ANALYZE SELECT image_title, artist_name FROM paintings ORDER BY vector <=> '[0.001363333, …, -0.0010466448]' LIMIT 12;
```
An example of a query plan that utilizes the IVFFlat index is provided below:
```
Limit  (cost=1537.64..1539.15 rows=12 width=72) (actual time=1.803..1.984 rows=12 loops=1)
  ->  Index Scan using paintings_ivfflat_vector_idx on paintings  (cost=1537.64..2951.85 rows=11206 width=72) (actual time=1.801..1.981 rows=12 loops=1)
        Order By: (vector <=> '[0.001363333, ..., -0.0010466448]'::vector)
Planning Time: 0.135 ms
Execution Time: 2.014 ms
```
In the Jupyter Notebooks for the IVFFlat index and the HNSW index, you'll explore text-to-image and image-to-image search scenarios. You will use the same text prompts and reference images as in the Exact Nearest Neighbors search example, allowing for a comparison of the accuracy of the results.
Here is an example of the images retrieved using the IVFFlat and the HNSW index.
The pgvector extension and PostgreSQL provide additional features that you can leverage to build AI-powered search applications. For example, you can integrate vector search with conventional keyword-based search methods into hybrid search systems, which generally have better performance.
If you want to learn more about the HNSW algorithm, check out these learning resources: