Public Preview of Shortest Path on SQL Server 2019
Published Jun 26 2019 02:14 AM 13K Views
Microsoft

SQL Server 2017 and Azure SQL Database introduced native graph database capabilities used to model many-to-many relationships. The first implementation of SQL Graph introduced support for nodes to represent entities, edges to represent relationships and a new MATCH predicate to support graph pattern matching and traversal.

 

We are expanding the graph database capabilities with several new features. In this blog we discuss one of these features that is now available for public preview in SQL Server 2019, Shortest Path, which can be used to find a shortest path between two nodes in a graph. The shortest path function can also be used to compute a transitive closure or for arbitrary length traversals.

 

With CTP3.1, we are introducing a new function SHORTEST_PATH, which can be used inside MATCH to find a shortest path between any 2 nodes in a graph or to perform arbitrary length traversals. Users can specify a pattern they would like to search for in the graph using a regular expression style syntax.

 

To understand the new syntax and how shortest path works, from Adventure Works Cycles, a bicycle wholesaler that sells and manufactures bicycles. We have converted their bill of materials hierarchy into a graph, which has Product node and IsPartOf edge table. Download sample scripts here.

 

AdventureWorksBOM.png

Now, let’s look at some sample queries to see how shortest path can help with recursive or arbitrary length traversals in a graph schema.  

 

Find all assemblies where a given product is used

Imagine that the inventory of a given product (Bearing Ball) at Adventure Works Cycles is running low and that they would like to find all the assemblies in their manufacturing pipeline, where a Bearing Ball is used. This will tell them what assemblies will be affected due to the scarcity of a given product.

 

The following query demonstrates how shortest path can be used to find all assemblies that require a Bearing Ball in the bill of materials:

 

-- Find all the assemblies where Bearing Ball is used
SELECT 
	P1.ProductID, 
	P1.Name,
	STRING_AGG(P2.Name,'->') WITHIN GROUP (GRAPH PATH) AS [Assembly],
	LAST_VALUE(P2.ProductID) WITHIN GROUP (GRAPH PATH) AS [Final ProductID]
FROM
	PRODUCT P1,
	PRODUCT FOR PATH P2,
	ISPARTOF FOR PATH IPO
WHERE 
	MATCH(SHORTEST_PATH(P1(-(IPO)->P2)+))
	AND P1.ProductID = 2
       ORDER BY P1.ProductID	

This query starts from ProductID = 2 (Bearing Ball) and keeps traversing up the BOM tree until it finds all products that contain a Bearing Ball.

 

Here are first few lines from the result set:

      image.png

 

The STRING_AGG function on P2.Name fetches all the nodes that were traversed in this path, thereby, giving us a good view of assembly line up to ‘Touring-3000 Yellow, 58’ bike. Other rows in the result set show the other assemblies which have dependency on Bearing Ball at some level.

 

This query uses a lot of new syntax. Let’s understand what this new syntax does.

 

SHORTEST_PATH

The SHORTEST_PATH function lets you find:

  1. A shortest path between two given nodes/entities
  2. Single source shortest path(s).
  3. Shortest path from multiple source nodes to multiple target nodes.

It takes an arbitrary length pattern as input, described below. It returns a shortest path that exists between two nodes. This function can only be used inside MATCH. It accepts an arbitrary length pattern and finds a shortest path in the graph, which matches that pattern. If there exist two or more shortest paths of the same length between any pair of source and destination node(s), the function will return the one that was found first during traversal. Note that, an arbitrary length pattern can only be specified inside a SHORTEST_PATH function. Finding weighted shortest paths, all paths or all shortest paths is not supported yet.

 

Arbitrary Length Pattern

This pattern includes the nodes and edges that must be traversed repeatedly until the desired node is reached or until the maximum number of iterations as specified in the pattern is met. Each time the query is executed, the result of executing this pattern will be a collection of nodes and edges traversed along the path from the start node to the end node. This is a regular expression style syntax pattern and the following two pattern quantifiers are supported in this release:

 

‘+’  :   Repeat the pattern 1 or more times. Terminate as soon as a shortest path is found.

We recommend that users provide a start and an end node, along with the pattern, whenever ‘+’ is used in the pattern predicate. For a big graph, a query with ‘+’ and no terminating condition may take a long time to finish. Refer to the example shown in ‘Find Shortest path between 2 given nodes’ section below, to see how filters on start and end node can be applied.

{1,n} : Repeat the pattern 1 to ‘n’ times. Terminate as soon as a shortest  is found or when the max number of iterations ‘n’ specified in the pattern is reached.

 

FOR PATH

FOR PATH must be used with any node or edge table name in the FROM clause, which will participate in an arbitrary length pattern. FOR PATH tells the engine that the node or edge table will return an ordered collection representing the list of nodes or edges along the path from the start node to the end node. The attributes from these tables cannot be projected directly in the SELECT clause and cannot be filtered directly in the WHERE clause. To project or filter attributes from these tables, graph path aggregate functions must be used.

 

Graph Path Aggregate Functions

Since the nodes and edges involved in arbitrary length pattern return a collection (of node(s) and edge(s) traversed in that path), users cannot project the attributes directly using the conventional tablename.attributename syntax. For queries where it is required to project attribute values from the intermediate node or edge tables, in the path traversed, we have extended aggregate functions to work with graph node or edge tables. In this release, the following graph path aggregate functions are available: STRING_AGG, LAST_VALUE, SUM, AVG, MIN, MAX and COUNT. These functions can be used in SELECT or ORDER BY clause.

 

Find all the products which are up to 3 levels away from a given product.

Now, if Adventure Works Cycles wants to find products in their BOM, which are 1 to 3 levels away from Bearing Ball, they can just modify the query shown in example above as follows:

 

SELECT 
	P1.ProductID, 
	P1.Name,
	STRING_AGG(P2.Name,'->') WITHIN GROUP (GRAPH PATH) AS [Assembly],
	LAST_VALUE(P2.ProductID) WITHIN GROUP (GRAPH PATH) AS [Final ProductID]
FROM
	PRODUCT P1,
	PRODUCT FOR PATH P2,
	ISPARTOF FOR PATH IPO
WHERE 
	MATCH(SHORTEST_PATH(P1(-(IPO)->P2){1,3}))
	AND P1.ProductID = 2
ORDER BY P1.ProductID	

Here is a sample result set:

      image.png

 

Find Shortest Path between 2 given nodes

The queries shown above, start from a given node (Bearing Ball) and then search all paths in the graph starting from that node, until they cannot find any other paths that originate from this node and end at other nodes.

 

Imagine that Adventure Works Cycles wants to find the shortest path between 2 given nodes in the assembly. They can use following query to do this. This query will again start at ProductID = 2 (Bearing Ball) but terminates as soon as it finds ProductID = 768 (bike ‘Road-650 Black, 44’).

 

SELECT
	ProductID, Name, [Assembly], FinalProductID, Levels
FROM (
	SELECT 
		P1.ProductID, 
		P1.Name,
		STRING_AGG(P2.Name,'->') WITHIN GROUP (GRAPH PATH) AS [Assembly],
		LAST_VALUE(P2.ProductID) WITHIN GROUP (GRAPH PATH) AS FinalProductID,
		COUNT(P2.ProductID) WITHIN GROUP (GRAPH PATH) AS Levels
	FROM
		PRODUCT P1,
		PRODUCT FOR PATH P2,
		ISPARTOF FOR PATH IPO
	WHERE 
		MATCH(SHORTEST_PATH(P1(-(IPO)->P2)+))
		AND P1.ProductID = 2
 ) AS Q
 WHERE Q.FinalProductID = 768

Note that in this query, we used a subquery to apply filter on the last node traversed in the path (FinalProductID = 768). The filter on the FinalProductID in the outer query is pushed down while computing the shortest path. Hence, the query will terminate as soon as the FinalProductID is found. LAST_VALUE() function, which fetches the attribute values from last node traversed, cannot be used in the WHERE clause, since it is a window aggregate function.  

 

By using the COUNT graph path aggregate function in select clause, we can see the number of hops/levels between the 2 ProductIDs. Here is the result set of this query:

 

      image.png

 

In this query, the predicate/filter on last node (Q.FinalProductID) is pushed down. That is, the shortest path query terminates as soon as it finds ProductID = 768.

Writing a shortest path query with a terminating condition on the last node, as shown in the query above, is highly recommended. This is a much more optimal query than the ones which start at one node and then traverse the entire graph to find all the connections that exist from that node to others. For huge graphs, finding the shortest path from a given start node to all the other nodes in the graph, will be a very expensive operation.

 

Conclusion

Starting with SQL Server 2019 CTP3.1, you can use SHORTEST_PATH function to compute shortest path(s) between nodes in a graph, using a regular expression style search pattern. SHORTEST_PATH can only be used inside a MATCH function and works only with graph node and edge tables. This function also allows you to write transitive closure or arbitrary length traversal query from a given ‘start’ node.

 

Next Steps

Please download and give this feature a try.  We are looking forward to hearing your comments or feedback for this feature.

 

9 Comments
Copper Contributor

This all sounds very promising, but how can it be used for weighted paths, eg: distances on a map? It seems it is only useful for counting nodes, but not distances. Am I missing something?

 

Microsoft

@freom You cannot use this to calculate a weighted shortest path. But, you can store weights as edge properties/attributes and calculate the weight for the shortest path by distance using the aggregate functions (sum, in this case) over edge attributes. 

Copper Contributor
@Shreya Verma Are there any plans to allow extra edge columns to be used in WHERE clauses? SHORTEST_PATH will not work properly if you have cycles in your graph and use an extra column (i.e. a column type) to differentiate between the different edge types (i.e. parent/child)
Copper Contributor

@Shreya Verma - Thanks for introducing to the new features from SQL Graph DB, I have been trying to use this in my latest project for storing hierarchy / tree structures where we join multiple individual hierarchies to create a combined tree structure with intersections at leaf levels from one or more hierarchies to store the Cartesian product by only creating a separate Edge Table to store that relationship, which is very powerful.

 

While working with my example of creating separate Edge Table to store  combined tree structure, I have observed limitations of SHORTEST_PATH, in my use case when we have multiple parent nodes for same child node, SHORTEST_PATH is not able to produce output from all possible child nodes, it will find path to the last node for only first iteration of Parent to Child, please see my example below

 

In this example I have 2 tree structures of "Region" and "Vehicles", purpose is to create "Sales by Region" hierarchy

 

When SHORTEST_PATH query is executed, I am expecting to get full path from all nodes of Region to all possible nodes of Vehicles, but the path stops at the first region of "Illinois" and does not show data for "Wisonsin"

 

Please suggest any corrections which can help generate full path  .. many thanks

 

---Create Sample Schema
CREATE SCHEMA graph
GO

--Tree Nodes
CREATE TABLE graph.Tree_Nodes
(
	tree_id bigint
	,tree_parent bigint
	,tree_name NVARCHAR(50)
) AS NODE


--Combined Tree Edges
CREATE TABLE graph.Combined_Tree_Edges
(
	tree_id bigint
) AS EDGE

--Tree Insert
INSERT INTO graph.Tree_Nodes VALUES (1,null,'Region')
INSERT INTO graph.Tree_Nodes VALUES (2,1,'USA')
INSERT INTO graph.Tree_Nodes VALUES (3,2,'Illinois')
INSERT INTO graph.Tree_Nodes VALUES (4,2,'Wisonsin')
INSERT INTO graph.Tree_Nodes VALUES (5,null,'Vehicles')
INSERT INTO graph.Tree_Nodes VALUES (6,5,'Cars')
INSERT INTO graph.Tree_Nodes VALUES (7,5,'Trucks')
INSERT INTO graph.Tree_Nodes VALUES (8,6,'BMW')
INSERT INTO graph.Tree_Nodes VALUES (9,6,'Merc')
INSERT INTO graph.Tree_Nodes VALUES (10,6,'Tesla')
INSERT INTO graph.Tree_Nodes VALUES (11,7,'Volvo')
INSERT INTO graph.Tree_Nodes VALUES (12,7,'Daimler')


--Create Edge relations on graph.Combined_Tree_Edges
INSERT INTO graph.Combined_Tree_Edges ($from_id, $to_id,tree_id) 
select	 ParentFromNode.$node_id
		,ChildToNode.$node_id
		,ChildToNode.tree_id
from graph.Tree_Nodes ParentFromNode 
INNER JOIN graph.Tree_Nodes ChildToNode ON ChildToNode.tree_parent=ParentFromNode.tree_id
WHERE ChildToNode.tree_id IN (2,3,4,6,7,8,9,10,11,12)

--Intersection Edges
INSERT INTO graph.Combined_Tree_Edges ($from_id, $to_id,tree_id) 
VALUES
		(
			(SELECT $node_id FROM graph.Tree_Nodes WHERE tree_id=3) --Illinois
			,(SELECT $node_id FROM graph.Tree_Nodes WHERE tree_id=5) --Vehicles
			,3
		)

INSERT INTO graph.Combined_Tree_Edges ($from_id, $to_id,tree_id) 
VALUES
		(
			(SELECT $node_id FROM graph.Tree_Nodes WHERE tree_id=4) --Wisonsin
			,(SELECT $node_id FROM graph.Tree_Nodes WHERE tree_id=5) --Vehicles
			,4
		)


--Final Select Query
SELECT
	Parent.tree_id AS ParentId,
	Parent.tree_name AS ParentName,
	STRING_AGG(CAST(Child.tree_id AS NVARCHAR(10)), ' | ') WITHIN GROUP (GRAPH PATH) AS PathIds,
	STRING_AGG(CAST(Child.tree_name AS NVARCHAR(10)), ' | ') WITHIN GROUP (GRAPH PATH) AS PathNames,
	LAST_VALUE(Child.tree_parent) WITHIN GROUP (GRAPH PATH) AS LastNodeParentId,
	LAST_VALUE(Child.tree_id) WITHIN GROUP (GRAPH PATH) AS LastNodeId,
	LAST_VALUE(Child.tree_name) WITHIN GROUP (GRAPH PATH) AS LastNodeName
FROM
	graph.Tree_Nodes AS Parent,
	graph.Combined_Tree_Edges FOR PATH AS ChildOf,
	graph.Tree_Nodes FOR PATH  AS Child
WHERE MATCH(SHORTEST_PATH(Parent(-(ChildOf)->Child)+))
AND Parent.tree_id IN (1) ---Region

/*
--Cleanup
DROP TABLE graph.Combined_Tree_Edges
GO

DROP TABLE graph.Tree_Nodes
GO

DROP SCHEMA graph
GO

*/
 

Final Select OutputFinal Select Output

 

Excel ExampleExcel Example

 

Copper Contributor

Thank you  @Shreya Vermathis is a great article and inspires hopes for using graph db for BOM traversing in SQL Server 2019.

However in all my testing it performed poorly comparing to using recursive-CTE-based tree explosion.

The execution times were 10 to 20 times longer, output Rows estimates 4 to 10 times worse, plan cost comparison always 0% for recursion and 100% for the graph query using SHORTEST_PATH().

---------

Is this a known issue?

At this time it looks like Graph Db in SQL Server was created to support MATCH clause rather then improve performance of hierarchical data querying.

---------

Can you recommend any remedies?

- I played with different indexing on the node and edge tables. An index on the edge ($from_id, $to_id) helps, but not dramatically.

- OPTION (HASH JOIN) helps when querying for All product trees, but not for One product tree.

The sample data I used had 5 levels of hierarchy.

Microsoft

@alex_io is it possible for you to share a reproduction? without looking at the query it is hard to tell why it is performing poorly for you. 

 

Thanks,

shreya

Copper Contributor

@Shreya Verma ,

I figured out my main mistake:
For the recursive explosion I, of course, filter the root records - because this is how you seed the recursive query.

For the Graph query I was not doing that. 
When I added filtering of the roots, the graph query performance became very close to recursive, thou still worse.

My trees data has only about 1% or products/roots with more then 2 levels of hierarchy. So I suspect when I test with more levels of nesting for most of the data the performance of the graph queries may win.

I will post my updates.

Copper Contributor

Hi @Shreya Verma,
can you provide an example of Last_Node() concerning this explanation?
"More than one shortest path patterns are used in a query and one pattern begins at the LAST node of the previous pattern".
(from https://docs.microsoft.com/en-us/sql/relational-databases/graphs/sql-graph-shortest-path?view=sql-se...

I can't find one in www and I can't get along with syntax diagrams provided on

https://docs.microsoft.com/en-us/sql/t-sql/queries/match-sql-graph?view=sql-server-ver15

 

Thanks

Copper Contributor
Are there any workarounds to filtering fields from the Edge Table in the current SQL version (2019) for the Shortest Path query?
Version history
Last update:
‎Jun 26 2019 01:09 PM
Updated by: