Photo by Brandon Green on Unsplash
Model Tree Structures with CockroachDB
Model hierarchies using the Nested Set Model with CockroachDB
Overview
Modeling hierarchical tree structures in SQL databases has traditionally been difficult. At least until the introduction of Recursive Common Table Expressions (Recursive CTEs) that made it much simpler to craft queries to traverse a recursive path within a hierarchy, aka a tree structure.
A tree is an undirected, connected graph with no cycle, so when modeling product catalogs with categories for example, it's formally a special form of graph:
You don't need to use a graph database for that though. There are two main approaches to modeling tree structures in relational SQL databases:
- Adjacency List Model
- Nested Set Model
The Adjacency List Model with recursive CTEs provides the easiest approach. It's much less taxing on mutations of the tree structure, more performant and supports arbitrarily deep nestings / levels. The SQL is also much simpler and client code cleaner as well.
The Nested Set Model requires tracking two references per item instead of a single one in the adjacency model, making things less efficient and more difficult to implement in client code and SQL.
The Nested Set Model is not commonly used in SQL databases anymore since the introduction of recursive CTEs decades ago. That doesn't make it less interesting though, and it's likely still in use in many systems, in particular MySQL. Although completely unrelated to this, it's like exploring shortest path algorithms like Dijkstra's. Just of those things interesting to know a bit more about.
Let's look at an implementation of the nested set model using Spring Boot and CockroachDB to follow the historical order. We could have used the adjacency list model with recursive CTEs instead, but that's for a follow up post.
Source Code
The source code for this example is available in GitHub.
The Adjacency List Model
The adjacency list model is pretty straightforward. It's a parent-child relationship where each item in the table contains a pointer to its parent. The benefits are that it's simple, provides referential integrity and its also easy and efficient to update the structure using SQL.
The main drawback is that it performs poorly with deeply nested tree's (many levels), unless using recursive CTEs. Combined with CTEs, this is the preferred and portable approach to use in modern databases over the nested set model.
create table category
(
id int not null default unordered_unique_rowid(),
name varchar(64) not null,
description varchar(256),
parent_id int,
primary key (id)
);
Listing all the categories:
select id,
name,
parent_id
from category order by name;
id | name | parent_id |
109012213791195137 | 0-60 | 5980017278022057985 |
1854157069397262337 | 101-250 | 5980017278022057985 |
3771564610750251009 | 251-500 | 5980017278022057985 |
1342998511690711041 | 61-80 | 5980017278022057985 |
8686117704118304769 | 81-100 | 5980017278022057985 |
9138447991692328961 | Anis | 1692308957788635137 |
7371629562879541249 | Argentina | 9184328412896165889 |
4597412192419315713 | Australia | 9184328412896165889 |
7882788120586092545 | Austria | 3785638359585783809 |
1066449347072491521 | Available in store | 1942399474596052993 |
3447586912556285953 | Box | 1468043770094419969 |
5176406219513135105 | Bread | 1692308957788635137 |
888979374256422913 | Butter | 1692308957788635137 |
3175119135100370945 | Champagne | 1468043770094419969 |
5261939428061085697 | Chardonnay | 2664488342975152129 |
9124655717833506817 | Country | null |
939081920110919681 | Dessert | 1468043770094419969 |
3785638359585783809 | Europe | 9124655717833506817 |
7803834389618753537 | Findings under 150 | 1942399474596052993 |
2379389375939346433 | France | 3785638359585783809 |
6464154237964386305 | Germany | 3785638359585783809 |
2664488342975152129 | Grape | null |
7704051510374825985 | Honey | 1692308957788635137 |
412442238685282305 | Italy | 3785638359585783809 |
4060217199367028737 | Kanonkop | 3346396658428805121 |
1692308957788635137 | Label | null |
9197698474289922049 | Masi | 3346396658428805121 |
4113521523081609217 | Merlot | 2664488342975152129 |
5410171187671334913 | Miguel Torres | 3346396658428805121 |
9184328412896165889 | New World | 9124655717833506817 |
5980017278022057985 | Price | null |
3346396658428805121 | Producer | null |
6128495328236929025 | Recent News | 1942399474596052993 |
4539709822193631233 | Red | 1468043770094419969 |
3694686757736153089 | Riesling | 2664488342975152129 |
2098758824158822401 | Rose | 1468043770094419969 |
2381641175753031681 | South Africa | 9184328412896165889 |
2032612204631818241 | Spain | 3785638359585783809 |
6512286458981908481 | Sparkling | 1468043770094419969 |
1942399474596052993 | Special | null |
1468043770094419969 | Types | null |
3501067158131310593 | Vanilla | 1692308957788635137 |
7836344749428834305 | White | 1468043770094419969 |
4807111050068754433 | Young | 1692308957788635137 |
Tree listing using recursive CTEs
Just to show the usefulness of recursive CTEs, this query will list the entire tree structure in one single query:
with recursive category_tree (id, name, parent_id, level, path) as
(select id,
name,
parent_id,
0,
concat('/', name, '/')
from category
where parent_id is null -- anchor
union all
select c.id,
c.name,
c.parent_id,
ct.level + 1,
concat(ct.path, c.name, '/')
from category c
join category_tree ct
on c.parent_id = ct.id)
select *
from category_tree
order by path;
To filter down on a particular sub-tree name, just change the CTE anchor query:
with recursive category_tree (id, name, parent_id, level, path) as
(select id,
name,
parent_id,
0,
concat('/', name, '/')
from category
where parent_id is null
and name = 'Label' -- anchor
union all
select c.id,
c.name,
c.parent_id,
ct.level + 1,
concat(ct.path, c.name, '/')
from category c
join category_tree ct
on c.parent_id = ct.id)
select *
from category_tree
order by path;
Now, forget about the CTE approach and let's do it the hard way.
The Nested Set Model
In the Nested Set Model, each item in the table contains two pointers to track the containment of sublevels. This allows for a query to extract sub-trees in one single query rather than use recursion or one self-join per level, which is the drawback of the adjacency model without a recursive CTE.
As such, a tree hierarchy is not expressed as vertices and edges in a parent-child relationship, but instead as nested containers. Instead, picture the wine catalog like this:
Let's update the schema:
create table category
(
id int not null default unordered_unique_rowid(),
name varchar(64) not null,
description varchar(256),
lft int not null,
rgt int not null,
parent_id int,
primary key (id)
);
select id,
name,
lft,
rgt
from category order by name;
id | name | lft | rgt |
109012213791195137 | 0-60 | 2 | 3 |
1854157069397262337 | 101-250 | 4 | 5 |
3771564610750251009 | 251-500 | 6 | 7 |
1342998511690711041 | 61-80 | 8 | 9 |
8686117704118304769 | 81-100 | 10 | 11 |
9138447991692328961 | Anis | 14 | 15 |
7371629562879541249 | Argentina | 65 | 66 |
4597412192419315713 | Australia | 67 | 68 |
7882788120586092545 | Austria | 53 | 54 |
1066449347072491521 | Available in store | 28 | 29 |
3447586912556285953 | Box | 36 | 37 |
5176406219513135105 | Bread | 16 | 17 |
888979374256422913 | Butter | 18 | 19 |
3175119135100370945 | Champagne | 38 | 39 |
5261939428061085697 | Chardonnay | 82 | 83 |
9124655717833506817 | Country | 51 | 72 |
939081920110919681 | Dessert | 40 | 41 |
3785638359585783809 | Europe | 52 | 63 |
7803834389618753537 | Findings under 150 | 30 | 31 |
2379389375939346433 | France | 55 | 56 |
6464154237964386305 | Germany | 57 | 58 |
2664488342975152129 | Grape | 81 | 88 |
7704051510374825985 | Honey | 20 | 21 |
412442238685282305 | Italy | 59 | 60 |
4060217199367028737 | Kanonkop | 74 | 75 |
1692308957788635137 | Label | 13 | 26 |
9197698474289922049 | Masi | 76 | 77 |
4113521523081609217 | Merlot | 86 | 87 |
5410171187671334913 | Miguel Torres | 78 | 79 |
9184328412896165889 | New World | 64 | 71 |
5980017278022057985 | Price | 1 | 12 |
3346396658428805121 | Producer | 73 | 80 |
6128495328236929025 | Recent News | 32 | 33 |
4539709822193631233 | Red | 42 | 43 |
3694686757736153089 | Riesling | 84 | 85 |
2098758824158822401 | Rose | 44 | 45 |
2381641175753031681 | South Africa | 69 | 70 |
2032612204631818241 | Spain | 61 | 62 |
6512286458981908481 | Sparkling | 46 | 47 |
1942399474596052993 | Special | 27 | 34 |
1468043770094419969 | Types | 35 | 50 |
3501067158131310593 | Vanilla | 22 | 23 |
7836344749428834305 | White | 48 | 49 |
4807111050068754433 | Young | 24 | 25 |
The left and right values for each item is determined by traversing the graph starting from the root (there are multiple roots here) and descending to each leaf node, before assigning a number to the right and moving on. Gradually progressing from left to right. This is known as the modified preorder tree traversal (MPTT) algorithm.
Category structure example with the left and right numbers laid out:
Tree listing using Nested Set Model
To show the nested set model in action, this is the query that will list the entire tree structure using a cross join:
select category0_.name as col_0_0_, count(category1_.name) - 1 as col_1_0_
from category category0_
cross join category category1_
where category0_.lft between category1_.lft and category1_.rgt
group by category0_.name, category0_.lft
order by category0_.lft;
col_0_0_ | col_1_0_ |
Price | 0 |
0-60 | 1 |
101-250 | 1 |
251-500 | 1 |
61-80 | 1 |
81-100 | 1 |
Label | 0 |
Anis | 1 |
Bread | 1 |
Butter | 1 |
Honey | 1 |
Vanilla | 1 |
Young | 1 |
Special | 0 |
Available in store | 1 |
Findings under 150 | 1 |
Recent News | 1 |
Types | 0 |
Box | 1 |
Champagne | 1 |
Dessert | 1 |
Red | 1 |
Rose | 1 |
Sparkling | 1 |
White | 1 |
Country | 0 |
Europe | 1 |
Austria | 2 |
France | 2 |
Germany | 2 |
Italy | 2 |
Spain | 2 |
New World | 1 |
Argentina | 2 |
Australia | 2 |
South Africa | 2 |
Producer | 0 |
Kanonkop | 1 |
Masi | 1 |
Miguel Torres | 1 |
Grape | 0 |
Chardonnay | 1 |
Riesling | 1 |
Merlot | 1 |
To filter down on a particular sub-tree name:
SELECT node.name AS name, (COUNT(parent.name) - (sub_tree.depth + 1)) AS depth
FROM category AS node,
category AS parent,
category AS sub_parent,
(SELECT node.name, (COUNT(parent.name) - 1) AS depth
FROM category AS node,
category AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
AND node.name = 'Label'
GROUP BY node.name, node.lft
ORDER BY node.lft) AS sub_tree
WHERE node.lft BETWEEN parent.lft AND parent.rgt
AND node.lft BETWEEN sub_parent.lft AND sub_parent.rgt
AND sub_parent.name = sub_tree.name
GROUP BY node.name, node.lft, depth
ORDER BY node.lft;
name | depth |
Label | 0 |
Anis | 1 |
Bread | 1 |
Butter | 1 |
Honey | 1 |
Vanilla | 1 |
Young | 1 |
The subquery could also be extracted to a CTE to improve readability a bit by rewriting the query:
with sub_tree as (SELECT node.name, (COUNT(parent.name) - 1) AS depth
FROM category AS node,
category AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
AND node.name = 'Label'
GROUP BY node.name, node.lft
ORDER BY node.lft)
SELECT node.name AS name, (COUNT(parent.name) - (sub_tree.depth + 1)) AS depth
FROM category AS node,
category AS parent,
category AS sub_parent,
sub_tree
WHERE node.lft BETWEEN parent.lft AND parent.rgt
AND node.lft BETWEEN sub_parent.lft AND sub_parent.rgt
AND sub_parent.name = sub_tree.name
GROUP BY node.name, node.lft, depth
ORDER BY node.lft;
Demo Project
To explore the nested set model in a semi-realistic context, let us use a service with the following product catalog schema:
The entity model contains of products, product variations with attributes and categories organized into a tree structure. A fairly common layout for an online webshop and similar.
The main artifacts of interest are:
- Category - Domain entity that models a hierarchy of categories.
- JpaCategoryRepository - The JPA implementation using Criteria API to craft the nested set model operations.
- ProductCatalogTest - Functional tests of the product catalog.
This implementation is using JPA with Hibernate and partly also the dreaded but type safe Criteria API which happens to be fairly suitable for this problem domain.
To take one query example, finding all products within a given non-leaf category such as "Europe" that has several country categories below it:
Client code:
@Test
@Order(11)
@Transactional
@Commit
public void whenFindingBeveragesByInheritedCategory_expectResults() {
// Expects beverages under 'Europe'
Category category = categoryRepository.getByTypeAndName(DistrictCategory.class, "Europe");
List<Product> products = productRepository.findByCategory(category);
Assertions.assertEquals(products.size(), 1);
}
SQL:
First for the category:
select distinct districtca0_.id as id2_1_,
districtca0_.description as descript3_1_,
districtca0_.lft as lft4_1_,
districtca0_.name as name5_1_,
districtca0_.parent_id as parent_i8_1_,
districtca0_.rgt as rgt6_1_,
districtca0_.country as country7_1_,
categorize1_.category_id as category1_0_0__,
categorize1_.expires_at as expires_2_0_0__,
categorize1_.product_id as product_3_0_0__
from category districtca0_
left outer join categorized_product categorize1_ on districtca0_.id = categorize1_.category_id
where districtca0_.category_type = 'DISTRICT'
and districtca0_.name='Europe';
Then for the products in that category:
select product2_.id as id1_2_,
product2_.description as descript2_2_,
product2_.name as name3_2_,
product2_.sku_code as sku_code4_2_
from category category0_
inner join categorized_product categorize1_ on category0_.id = categorize1_.category_id
inner join product product2_ on categorize1_.product_id = product2_.id
where category0_.lft between 52 and 63;
Conclusion
Modeling tree structures in SQL databases used to be a pain in the past until recursive CTEs were introduced. Before that, the nested set model offered a more efficient method in contrast to the adjacency list model without CTEs.