Blog Post

Microsoft Blog for PostgreSQL
7 MIN READ

Combining pgvector and Apache AGE - knowledge graph & semantic intelligence in a single engine

Raunak's avatar
Raunak
Icon for Microsoft rankMicrosoft
Apr 15, 2026

Inspired by GraphRAG and PostgreSQL Integration in Docker with Cypher Query and AI Agents, which demonstrated how Apache AGE brings Cypher based graph querying into PostgreSQL for GraphRAG pipelines. This post takes that idea further combining AGE's graph traversal with pgvector's semantic search to build a unified analytical engine where vectors and graphs reinforce each other in a single PostgreSQL instance.

This post targets workloads where entity types, relationship semantics, and schema cardinality are known before ingestion. Embeddings are generated from structured attribute fields; graph edges are typed and written by deterministic ETL. No LLM is involved at any stage. You should use this approach when you have structured data and need operational query performance, and deterministic, auditable, sub-millisecond retrieval.

The problem nobody talks about the multi database/ multi hop tax

If you run technology for a large enterprise, you already know the data problem. It is not that you do not have enough data. It is that your data lives in too many places, connected by too many fragile pipelines, serving too many conflicting views of the same reality.

Here is a pattern that repeats across industries. One team needs to find entities "similar to" a reference item — not by exact attribute match, but by semantic meaning derived from unstructured text like descriptions, reviews, or specifications. That is a vector similarity problem.

Another team needs to traverse relationships trace dependency chains, map exposure paths, or answer questions like "if this node is removed, what downstream nodes are affected?" That is a graph traversal problem.

Meanwhile, the authoritative master data of IDs, attributes, pricing, transactional history already lives in Postgres.

Now you are operating three databases. Three bills. Three sets of credentials. Three backup strategies. A fragile ETL layer stitching entity IDs across systems, breaking silently whenever someone adds a new attribute to the master table. And worst of all, nobody can ask a question that spans all three systems without custom application code.

Azure PostgreSQL database can already do all three jobs. Two extensions pgvector for vector similarity search and Apache AGE extension for graph traversal bringing these capabilities natively into the database. No new infrastructure. No sync pipelines. No multi database tax!

This post walks through exactly how to combine them, why each piece matters at scale, and what kinds of queries become possible when you stop treating vectors and graphs as separate concerns.

The architecture: Two extensions, One engine

pgvector adds a native vector data type and distance operators (<=>, <->, <#>) with HNSW and IVFFlat index support. 

pg_diskann adds a third index type that keeps the index on disk instead of in memory, enabling large scale vector search without proportional RAM.

example 1 - to run a product similarity query such as the one below which corelates products sold across multiple markets which are related (cosine similarity).

-       The limit clause in sub query limits the similarity search to closest 1 product recommendation

-       High similarity score of > 0.75 (aka 75% similarity in embeddings)

-- Table DDL - for illuatration purposes only 
CREATE TABLE IF NOT EXISTS products (
    id              SERIAL PRIMARY KEY,
    sku             TEXT UNIQUE NOT NULL,
    name            TEXT NOT NULL,
    brand           TEXT NOT NULL,
    category        TEXT NOT NULL,
    subcategory     TEXT,
    market          TEXT NOT NULL,
    region          TEXT,
    description     TEXT,
    ingredients     TEXT,
    avg_rating      FLOAT DEFAULT 0.0,
    review_count    INT DEFAULT 0,
    price_usd       FLOAT,
    launch_year     INT,
    status          TEXT DEFAULT 'active',         
    embedding       vector(384)               
);

 

SELECT     us.name                                                    AS us_product,
           us.brand                                                   AS us_brand,
           in_p.name                                                  AS india_match,
           in_p.brand                                                 AS india_brand,
           Round((1 - (us.embedding <=> in_p.embedding))::NUMERIC, 4) AS similarity
FROM       products us
cross join lateral
           (
                    SELECT   name,
                             brand,
                             embedding
                    FROM     products
                    WHERE    market = 'India'
                    AND      category = us.category
                    ORDER BY embedding <=> us.embedding limit 1 ) in_p
WHERE      us.market = 'US'
AND        us.category = 'Skincare'
AND        us.avg_rating >= 4.0
AND        round((1 - (us.embedding <=> in_p.embedding))::NUMERIC, 4)> 0.75
ORDER BY   similarity DESC limit 20;

 

AGE adds a cypher() function that executes cypher queries against a labeled property graph stored in the database managed and maintained under the ag_catalog schema. Vertices and edges become first class PostgreSQL rows with agtype properties.

The age extension supports MATCH, CREATE, MERGE, WITH, and aggregations.

example 2 - to run a product similarity query such as the one below which returns common products sold via multiple retail channels.

SET search_path = ag_catalog, "$user", public;

SELECT * FROM cypher('cpg_graph', $$
    MATCH (p:Product)-[:SOLD_AT]->(walmart:RetailChannel {name: 'Walmart'})
    MATCH (p)-[:SOLD_AT]->(target:RetailChannel {name: 'Target'})
    MATCH (b:Brand)-[:MANUFACTURES]->(p)
    RETURN b.name     AS brand,
           p.name     AS product,
           p.category AS category,
           p.market   AS market,
           p.price_usd AS price
    ORDER BY p.category, b.name
$$) AS (brand agtype, product agtype, category agtype,
        market agtype, price agtype);

The critical point and takeaway here is that both extensions participate in the same query planner and executor. A CTE that calls pgvector's <=> operator can feed results into a cypher() call in the next CTE all within a single transaction, sharing all available processes and control the database has to offer.

Finally, you are looking at code that looks like -

CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS age;

SET search_path = ag_catalog, "$user", public;
SELECT create_graph('knowledge_graph');

The bridge: pgvector → Apache AGE

This is the architectural centrepiece where the mechanism that turns vector similarity scores into traversable graph edges. Without this “bridge” pgvector and AGE are two isolated extensions.

Why bridge at all?

pgvector answers: "What is similar to X?" 
AGE answers: "What is connected to Y, and how?" 

These are fundamentally different questions operating on fundamentally different data structures. pgvector works on a flat vector space and every query is a distance calculation against an ANN index.

AGE works on a labelled property graph where every query is a pattern match across typed nodes and edges.

What if now the question is – What is like X and connected to Y and how?

This is where the bridge gets activated comes into life.

This takes cosine similarity distance scores from pgvector and writes them as SIMILAR_TO edges in the AGE property graph turning a distance computation into a traversable relationship.

Once similarity is an edge, cypher queries can then combine it with structural edges  in a single declarative pattern.

for ind_prod_id, us_prod_id, similarity in pairs:
    execute_cypher(cur, f"""
        MATCH (a:Product {{product_id: { ind_prod_id }}}),
              (b:Product {{product_id: { us_prod_id }}})
        CREATE (a)-[:SIMILAR_TO {{score: {score:.4f},
                                  method: 'pgvector_cosine'}}]->(b)
        CREATE (b)-[:SIMILAR_TO {{score: {score:.4f},
                                  method: 'pgvector_cosine'}}]->(a)
    """)

The cypher() function translates Cypher into DML against ag_catalog tables under the hood, these are plain PostgreSQL heap inserts just like another row.

The score property is the edge weight on the SIMILAR_TO relationship. Its value is the similarity score computed from pgvector using cosine similarity, so a higher score means the two products are more semantically similar.

The method property is metadata on that same edge. It records how the score was produced. In this case, pgvector_cosine is just a string label indicating that the relationship was derived using pgvector based cosine similarity.

Cosine similarity is symmetric, but property graph traversal is directional i.e. MATCH (a)-[:SIMILAR_TO]->(b) won't find the reverse path unless both directional edges exist.

Why this combination matters

One backup strategy. One monitoring stack. One connection pool. One failover target. One set of credentials. One database restore considerations - for teams already running Az PostgreSQL databases in production this adds capabilities without adding any net new infrastructure.

Unified cost model

The planner assigns cost estimates to index scan for both execution engines using the same cost framework it uses for B-tree lookups and sequential scans. It can decide whether to use the HNSW index or fall back to a sequential scan based on table statistics and server parameters.

As you have learnt so far, there is no separate storage or database engine to learn.

Bringing all this knowledge together

Examples 1 and 2 were all about native vector search and native graph search example in a classic product catalog scenario, respectively. Now, let’s bring this to life - What if now the question is – What is like X and connected to Y and how?

In this use case - pgvector finds the cross market matches (as shown in example 1), then Cypher checks which of those matches are sold at both Walmart and Target:

SET search_path = ag_catalog, "$user", public;

-- Cross-market matching (pgvector) → Retail channel overlap (graph)
WITH cross_market AS (
    SELECT us.id    AS us_id,
           us.name  AS us_product,
           us.brand AS us_brand,
           in_p.id    AS india_id,
           in_p.name  AS india_match,
           in_p.brand AS india_brand,
           ROUND((1 - (us.embedding <=> in_p.embedding))::numeric, 4) AS similarity
    FROM products us
    CROSS JOIN LATERAL (
        SELECT id, name, brand, embedding
        FROM products
        WHERE market = 'India'
          AND category = us.category
        ORDER BY embedding <=> us.embedding
        LIMIT 1
    ) in_p
    WHERE us.market = 'US'
      AND us.category = 'Skincare'
      AND us.avg_rating >= 4.0
      AND ROUND((1 - (us.embedding <=> in_p.embedding))::numeric, 4) > 0.75
),
dual_channel AS (
    SELECT (pid::text)::int AS product_id,
           brand::text       AS brand
    FROM cypher('cpg_graph', $$
        MATCH (p:Product)-[:SOLD_AT]->(w:RetailChannel {name: 'Walmart'})
        MATCH (p)-[:SOLD_AT]->(t:RetailChannel {name: 'Target'})
        MATCH (b:Brand)-[:MANUFACTURES]->(p)
        RETURN p.product_id AS pid,
               b.name       AS brand
    $$) AS (pid agtype, brand agtype)
)
SELECT cm.us_product,
       cm.us_brand,
       cm.india_match,
       cm.india_brand,
       cm.similarity,
       CASE WHEN dc.product_id IS NOT NULL
            THEN 'Yes' ELSE 'No'
       END AS india_match_at_walmart_and_target
FROM cross_market cm
LEFT JOIN dual_channel dc ON dc.product_id = cm.india_id
ORDER BY cm.similarity DESC
LIMIT 20;

Conclusion

The Azure PostgreSQL database ecosystem has quietly assembled the components for a unified semantic + structural analytics engine in form of extensions.

pgvector with pg_diskann delivers production grade approximate nearest-neighbour search with ANN indexes.

Apache AGE delivers cypher based property graph traversal. Together with a “bridge,” they enable query patterns that are impossible in either system alone and they do it within the ACID guarantees, operational tooling, and SQL vocabulary knowledge you already have.

Stop paying for three databases when one will do!

 

Updated Apr 14, 2026
Version 1.0
No CommentsBe the first to comment