Inspired by GraphRAG and PostgreSQL Integration in Docker with Cypher Query and AI Agents, which demonstrated how Apache AGE brings Cypher-based graph querying into PostgreSQL for GraphRAG pipelines, this post takes that idea further: combining AGE's graph traversal with pgvector's semantic search to build a unified analytical engine where vectors and graphs reinforce each other in a single PostgreSQL instance.
This post targets workloads where entity types, relationship semantics, and schema cardinality are known before ingestion. Embeddings are generated from structured attribute fields; graph edges are typed and written by deterministic ETL. No LLM is involved at any stage. Use this approach when you have structured data and need operational query performance with deterministic, auditable, sub-millisecond retrieval.
The problem nobody talks about: the multi-database, multi-hop tax
If you run technology for a large enterprise, you already know the data problem. It is not that you do not have enough data. It is that your data lives in too many places, connected by too many fragile pipelines, serving too many conflicting views of the same reality.
Here is a pattern that repeats across industries. One team needs to find entities "similar to" a reference item — not by exact attribute match, but by semantic meaning derived from unstructured text like descriptions, reviews, or specifications. That is a vector similarity problem.
Another team needs to traverse relationships: trace dependency chains, map exposure paths, or answer questions like "if this node is removed, what downstream nodes are affected?" That is a graph traversal problem.
Meanwhile, the authoritative master data (IDs, attributes, pricing, transactional history) already lives in Postgres.
Now you are operating three databases. Three bills. Three sets of credentials. Three backup strategies. A fragile ETL layer stitching entity IDs across systems, breaking silently whenever someone adds a new attribute to the master table. And worst of all, nobody can ask a question that spans all three systems without custom application code.
An Azure PostgreSQL database can already do all three jobs. Two extensions, pgvector for vector similarity search and Apache AGE for graph traversal, bring these capabilities natively into the database. No new infrastructure. No sync pipelines. No multi-database tax!
This post walks through exactly how to combine them, why each piece matters at scale, and what kinds of queries become possible when you stop treating vectors and graphs as separate concerns.
The architecture: two extensions, one engine
pgvector adds a native vector data type and distance operators (<=> cosine distance, <-> Euclidean distance, <#> negative inner product) with HNSW and IVFFlat index support.
pg_diskann adds a third index type that keeps the index on disk instead of in memory, enabling large-scale vector search without proportional RAM.
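As a quick sanity check of what those three operators compute, here is a pure-Python sketch of the same math (the tiny example vectors are illustrative; pgvector performs these calculations server-side against indexed columns):

```python
import math

def cosine_distance(a, b):
    # pgvector's <=> operator: 1 minus cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def l2_distance(a, b):
    # pgvector's <-> operator: Euclidean (L2) distance
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def neg_inner_product(a, b):
    # pgvector's <#> operator: negative inner product
    return -sum(x * y for x, y in zip(a, b))

a, b = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
print(cosine_distance(a, b))   # orthogonal vectors: distance 1.0
print(l2_distance(a, b))
print(neg_inner_product(a, b))
```

Note that `<=>` returns a distance, not a similarity, which is why the queries below compute `1 - (a <=> b)` to recover a similarity score.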
Example 1: a product similarity query that correlates related products sold across multiple markets using cosine similarity.
- The LIMIT clause in the subquery restricts the similarity search to the single closest product recommendation.
- The threshold of > 0.75 keeps only pairs with high embedding similarity (roughly, at least 75% cosine similarity).
-- Table DDL - for illustration purposes only
CREATE TABLE IF NOT EXISTS products (
id SERIAL PRIMARY KEY,
sku TEXT UNIQUE NOT NULL,
name TEXT NOT NULL,
brand TEXT NOT NULL,
category TEXT NOT NULL,
subcategory TEXT,
market TEXT NOT NULL,
region TEXT,
description TEXT,
ingredients TEXT,
avg_rating FLOAT DEFAULT 0.0,
review_count INT DEFAULT 0,
price_usd FLOAT,
launch_year INT,
status TEXT DEFAULT 'active',
embedding vector(384)
);
SELECT us.name AS us_product,
       us.brand AS us_brand,
       in_p.name AS india_match,
       in_p.brand AS india_brand,
       ROUND((1 - (us.embedding <=> in_p.embedding))::NUMERIC, 4) AS similarity
FROM products us
CROSS JOIN LATERAL (
    SELECT name,
           brand,
           embedding
    FROM products
    WHERE market = 'India'
      AND category = us.category
    ORDER BY embedding <=> us.embedding
    LIMIT 1
) in_p
WHERE us.market = 'US'
  AND us.category = 'Skincare'
  AND us.avg_rating >= 4.0
  AND ROUND((1 - (us.embedding <=> in_p.embedding))::NUMERIC, 4) > 0.75
ORDER BY similarity DESC
LIMIT 20;
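The similarity expression in that query is simply 1 minus pgvector's cosine distance, rounded and compared to the 0.75 cutoff. The same thresholding logic can be sketched in pure Python (the product names and distances here are hypothetical, purely to illustrate the filter):

```python
def similar_pairs(pairs, threshold=0.75):
    """Given (us_product, india_product, cosine_distance) tuples, keep the
    pairs whose similarity (1 - distance) exceeds the threshold, sorted
    best match first -- mirroring the SQL query's WHERE and ORDER BY."""
    scored = [(us, ind, round(1.0 - dist, 4)) for us, ind, dist in pairs]
    kept = [row for row in scored if row[2] > threshold]
    return sorted(kept, key=lambda row: row[2], reverse=True)

candidates = [
    ("HydraGlow Serum", "AquaGlow Serum", 0.12),   # similarity 0.88: kept
    ("SunShield SPF50", "SunGuard SPF50", 0.31),   # similarity 0.69: filtered
]
print(similar_pairs(candidates))
```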
AGE adds a cypher() function that executes Cypher queries against a labelled property graph stored in the database and managed under the ag_catalog schema. Vertices and edges become first-class PostgreSQL rows with agtype properties.
The AGE extension supports MATCH, CREATE, MERGE, WITH, and aggregations.
Example 2: a graph query that returns products sold through multiple retail channels.
SET search_path = ag_catalog, "$user", public;
SELECT * FROM cypher('cpg_graph', $$
MATCH (p:Product)-[:SOLD_AT]->(walmart:RetailChannel {name: 'Walmart'})
MATCH (p)-[:SOLD_AT]->(target:RetailChannel {name: 'Target'})
MATCH (b:Brand)-[:MANUFACTURES]->(p)
RETURN b.name AS brand,
p.name AS product,
p.category AS category,
p.market AS market,
p.price_usd AS price
ORDER BY p.category, b.name
$$) AS (brand agtype, product agtype, category agtype,
market agtype, price agtype);
The critical takeaway here is that both extensions participate in the same query planner and executor. A CTE that calls pgvector's <=> operator can feed results into a cypher() call in the next CTE, all within a single transaction, sharing the same execution resources and transactional controls the database already provides.
Finally, the setup boils down to four statements:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS age;
SET search_path = ag_catalog, "$user", public;
SELECT create_graph('cpg_graph');
The bridge: pgvector → Apache AGE
This is the architectural centrepiece: the mechanism that turns vector similarity scores into traversable graph edges. Without this "bridge," pgvector and AGE are two isolated extensions.
Why bridge at all?
pgvector answers: "What is similar to X?"
AGE answers: "What is connected to Y, and how?"
These are fundamentally different questions operating on fundamentally different data structures. pgvector works on a flat vector space and every query is a distance calculation against an ANN index.
AGE works on a labelled property graph where every query is a pattern match across typed nodes and edges.
What if the question is now: what is like X, and connected to Y, and how?
This is where the bridge comes to life.
The bridge takes cosine similarity scores from pgvector and writes them as SIMILAR_TO edges in the AGE property graph, turning a distance computation into a traversable relationship.
Once similarity is an edge, Cypher queries can combine it with structural edges in a single declarative pattern.
# For each similar pair found by pgvector, materialise SIMILAR_TO edges
# in both directions (graph traversal is directional).
for ind_prod_id, us_prod_id, score in pairs:
    execute_cypher(cur, f"""
        MATCH (a:Product {{product_id: {ind_prod_id}}}),
              (b:Product {{product_id: {us_prod_id}}})
        CREATE (a)-[:SIMILAR_TO {{score: {score:.4f},
                                 method: 'pgvector_cosine'}}]->(b)
        CREATE (b)-[:SIMILAR_TO {{score: {score:.4f},
                                 method: 'pgvector_cosine'}}]->(a)
    """)
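Because the loop interpolates IDs and scores directly into Cypher text, it helps to factor the statement construction into a small function that can be tested without a database. A minimal sketch (the build_similarity_cypher helper and the numeric-ID assumption are mine, not part of AGE):

```python
def build_similarity_cypher(ind_prod_id: int, us_prod_id: int, score: float) -> str:
    """Build the Cypher that creates SIMILAR_TO edges in both directions
    between two Product vertices. Doubled braces emit literal { } in the
    f-string; score is fixed to 4 decimal places to match the SQL rounding."""
    return (
        f"MATCH (a:Product {{product_id: {ind_prod_id}}}), "
        f"(b:Product {{product_id: {us_prod_id}}}) "
        f"CREATE (a)-[:SIMILAR_TO {{score: {score:.4f}, method: 'pgvector_cosine'}}]->(b) "
        f"CREATE (b)-[:SIMILAR_TO {{score: {score:.4f}, method: 'pgvector_cosine'}}]->(a)"
    )

stmt = build_similarity_cypher(101, 202, 0.8812)
print(stmt)
```

Only numeric IDs and scores are interpolated here; if you ever interpolate free-form strings into Cypher this way, they need escaping or parameterisation.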
Under the hood, the cypher() function translates Cypher into DML against ag_catalog tables; the resulting writes are plain PostgreSQL heap inserts, just like any other row.
The score property is the edge weight on the SIMILAR_TO relationship. Its value is the similarity score computed from pgvector using cosine similarity, so a higher score means the two products are more semantically similar.
The method property is metadata on that same edge, recording how the score was produced. In this case, pgvector_cosine is just a string label indicating that the relationship was derived using pgvector-based cosine similarity.
Cosine similarity is symmetric, but property graph traversal is directional: MATCH (a)-[:SIMILAR_TO]->(b) won't find the reverse path unless both directional edges exist, which is why the loop creates the edge in both directions.
Why this combination matters
One backup strategy. One monitoring stack. One connection pool. One failover target. One set of credentials. One set of restore considerations. For teams already running Azure PostgreSQL databases in production, this adds capabilities without adding any net-new infrastructure.
Unified cost model
The planner assigns cost estimates to index scans for both execution engines using the same cost framework it uses for B-tree lookups and sequential scans. It can decide whether to use the HNSW index or fall back to a sequential scan based on table statistics and server parameters.
And there is no separate storage layer or database engine to learn.
Bringing all this knowledge together
Examples 1 and 2 demonstrated native vector search and native graph search, respectively, in a classic product catalog scenario. Now, let's bring it all together. What if the question is: what is like X, and connected to Y, and how?
In this use case, pgvector finds the cross-market matches (as shown in example 1), then Cypher checks which of those matches are sold at both Walmart and Target:
SET search_path = ag_catalog, "$user", public;
-- Cross-market matching (pgvector) → Retail channel overlap (graph)
WITH cross_market AS (
SELECT us.id AS us_id,
us.name AS us_product,
us.brand AS us_brand,
in_p.id AS india_id,
in_p.name AS india_match,
in_p.brand AS india_brand,
ROUND((1 - (us.embedding <=> in_p.embedding))::numeric, 4) AS similarity
FROM products us
CROSS JOIN LATERAL (
SELECT id, name, brand, embedding
FROM products
WHERE market = 'India'
AND category = us.category
ORDER BY embedding <=> us.embedding
LIMIT 1
) in_p
WHERE us.market = 'US'
AND us.category = 'Skincare'
AND us.avg_rating >= 4.0
AND ROUND((1 - (us.embedding <=> in_p.embedding))::numeric, 4) > 0.75
),
dual_channel AS (
SELECT (pid::text)::int AS product_id,
brand::text AS brand
FROM cypher('cpg_graph', $$
MATCH (p:Product)-[:SOLD_AT]->(w:RetailChannel {name: 'Walmart'})
MATCH (p)-[:SOLD_AT]->(t:RetailChannel {name: 'Target'})
MATCH (b:Brand)-[:MANUFACTURES]->(p)
RETURN p.product_id AS pid,
b.name AS brand
$$) AS (pid agtype, brand agtype)
)
SELECT cm.us_product,
cm.us_brand,
cm.india_match,
cm.india_brand,
cm.similarity,
CASE WHEN dc.product_id IS NOT NULL
THEN 'Yes' ELSE 'No'
END AS india_match_at_walmart_and_target
FROM cross_market cm
LEFT JOIN dual_channel dc ON dc.product_id = cm.india_id
ORDER BY cm.similarity DESC
LIMIT 20;
Conclusion
The Azure PostgreSQL database ecosystem has quietly assembled the components for a unified semantic + structural analytics engine, in the form of extensions.
pgvector with pg_diskann delivers production-grade approximate nearest-neighbour (ANN) search.
Apache AGE delivers Cypher-based property graph traversal. Together with a "bridge," they enable query patterns that are impossible in either system alone, and they do it within the ACID guarantees, operational tooling, and SQL vocabulary you already have.
Stop paying for three databases when one will do!