@
Mithril 我做了下『闭包表』和『改良后的邻接表』的测试 *(结尾附上一键建表和查询的 `SQL` 供测试)*
- 数据源:[2022 年中国全国 5 级行政区划](
https://github.com/adyliu/china_area/blob/master/area_code_2022.csv.gz)
- 数据库:`MySQL 8.0.29` 和 `SQLite 3.39.0`
- 表结构:『闭包表』和『 `(<pid, id>, is_leaf)` 型邻接表』
- 测试项:『查询根节点所有后代』和『查询根节点第 5 层后代』
结果如下 *(多次测试稳定后)*:
## 『查询根节点所有后代』速度对比
表结构 MySQL SQLite
——————————————————————————
闭包表 1.3秒 0.13秒
递归邻接表 1.2秒 0.60秒
理想中递归损耗很小的邻接表 0.6秒 0.12秒
Markdown 表格:
| 表结构 | `MySQL` | `SQLite` |
| :------------------------: | :-----: | :------: |
| 闭包表 | 1.3 秒 | 0.13 秒 |
| 递归邻接表 | 1.2 秒 | 0.60 秒 |
| 理想中递归损耗很小的邻接表 | 0.6 秒 | 0.12 秒 |
## 『查询根节点第 5 层后代』速度对比
表结构 MySQL SQLite
——————————————————————————
闭包表 1.2秒 0.12秒
递归邻接表 0.5秒 0.13秒
理想中递归损耗很小的邻接表 0.4秒 0.10秒
Markdown 表格:
| 表结构 | `MySQL` | `SQLite` |
| :------------------------: | :-----: | :------: |
| 闭包表 | 1.2 秒 | 0.12 秒 |
| 递归邻接表 | 0.5 秒 | 0.13 秒 |
| 理想中递归损耗很小的邻接表 | 0.4 秒 | 0.10 秒 |
## 目前观点
1. 4W 多次的 `ref` 级 `WHERE pid = ?`,还是能和 66W 次 `eq_ref` 级的 `WHERE id = ?` 过过招,甚至更快的。而且,磁盘 IO 越慢,这个差异应该越大。
2. 数据库们的 `WITH RECURSIVE` 查询,损耗有点大。
- `MySQL` 好歹每次递归都将上一次所有结果当作一张表来计算。但大概 5 次递归的耗时,就比非递归的多一倍了
- `SQLite` 最摆烂,每次递归只取以前结果的一行来计算,直到取完为止。所以有 66W 次的递归,耗时大概 5 倍。。😡
> Extract a single row from the queue.
>
> Pretend that the single row just extracted is the only row in the recursive table and run the recursive-select, adding all results to the queue.
## 『查询根节点所有后代』通用 `SQL`
*( V 站排版原因,行首有全角空格)*
```sql
PRAGMA cache_size = -204800; -- 允许 SQLite 缓存 200 MB
-- 闭包表查询
SELECT COUNT(*), SUM(code), SUM(CHAR_LENGTH(name)) -- SQLite 写法:SUM(LENGTH(name))
FROM closure_tree
FORCE INDEX (idx_closure_tree) -- 我这测试,MySQL 不加这行,耗时翻好几倍。SQLite 需去掉此行
JOIN closure ON id = descendant
WHERE ancestor = 0;
-- 递归邻接表查询
WITH RECURSIVE
find(id, code, name, is_leaf) AS (
SELECT id, code, name, is_leaf
FROM adjacent
WHERE pid = 0
UNION ALL
SELECT
b.id, b.code,
b.name, b.is_leaf
FROM find a
JOIN adjacent b ON NOT a.is_leaf AND b.pid =
a.id )
SELECT COUNT(*), SUM(code), SUM(CHAR_LENGTH(name)) -- SQLite 写法:SUM(LENGTH(name))
FROM find;
-- 理想中,没有递归损耗的邻接表查询
SELECT COUNT(*), SUM(b.code), SUM(CHAR_LENGTH(
b.name)) -- SQLite 写法:SUM(LENGTH(name))
FROM adjacent a
CROSS JOIN adjacent b ON b.pid =
a.id -- SQLite 需要 CROSS JOIN ,否则耗时翻几倍
WHERE NOT a.is_leaf;
```
## 『查询根节点第 5 层后代』通用 `SQL`
*( V 站排版原因,行首有全角空格)*
```sql
PRAGMA cache_size = -204800; -- 允许 SQLite 缓存 200 MB
-- 闭包表查询
SELECT COUNT(*), SUM(code), SUM(CHAR_LENGTH(name)) -- SQLite 写法:SUM(LENGTH(name))
FROM closure_tree
FORCE INDEX (idx_closure_tree) -- 我这测试,MySQL 不加这行,耗时翻好几倍。SQLite 需去掉此行
JOIN closure ON id = descendant
WHERE ancestor = 0
AND distance = 5;
-- 递归邻接表查询
WITH RECURSIVE
var(depth) AS (
SELECT 5
),
-- 递归部分查前 N - 1 层
find(id, is_leaf, depth) AS (
SELECT 0, FALSE, var.depth - 1
FROM var
UNION ALL
SELECT
b.id, b.is_leaf, a.depth - 1
FROM find a
JOIN adjacent b ON b.pid =
a.id WHERE a.depth > 0
AND NOT a.is_leaf
)
-- 最后一次性 JOIN 第 N 层
SELECT COUNT(*), SUM(b.code), SUM(CHAR_LENGTH(
b.name)) -- SQLite 写法:SUM(LENGTH(
b.name))
FROM find a
CROSS JOIN adjacent b ON
a.id = b.pid -- SQLite 要加 CROSS ,否则耗时翻几倍
WHERE a.depth = 0;
-- 理想中,没有递归损耗的邻接表查询(需要根据层数 N ,动态生成 SQL )
SELECT COUNT(*), SUM(t5.code), SUM(CHAR_LENGTH(
t5.name)) -- SQLite 写法:SUM(LENGTH(
t5.name))
FROM adjacent t1
JOIN adjacent t2 ON t2.pid =
t1.id JOIN adjacent t3 ON t3.pid =
t2.id JOIN adjacent t4 ON t4.pid =
t3.id JOIN adjacent t5 ON t5.pid =
t4.id WHERE t1.pid = 0;
```
## `MySQL` 一键建表 `SQL`
*(在我低配笔记本和固态上,大约执行了 1 分钟)*
*( V 站排版原因,行首有全角空格)*
```sql
-- 允许 200 MB 的内存表
SET max_heap_table_size = 200 << 20;
-- 建临时数据表,装载 csv 数据,以及计算序号和父子关系
CREATE TABLE data (
code BIGINT NOT NULL,
p_code BIGINT NOT NULL,
type TINYINT NOT NULL,
name VARCHAR(25) NOT NULL,
id INT NOT NULL,
pid INT NOT NULL,
PRIMARY KEY (code) USING BTREE,
INDEX USING BTREE (id),
INDEX USING BTREE (pid, id)
) ENGINE = MEMORY;
-- 加载 csv
LOAD DATA INFILE 'area_code_2022.csv'
INTO TABLE data
CHARACTER SET UTF8MB4
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
(code, name, type, p_code);
-- 按照 code 顺序计算 id
UPDATE data
JOIN (SELECT code, ROW_NUMBER() OVER win row_num
FROM data
WINDOW win AS (ORDER BY code)) t USING(code)
SET id = row_num;
-- 计算 parent_id (不存在的标 0 )
UPDATE data a
LEFT JOIN data b ON b.code = a.p_code
SET a.pid = IFNULL(
b.id, 0);
-- 建邻接表,并从临时数据表填充数据
CREATE TABLE adjacent (
id INT NOT NULL,
pid INT NOT NULL,
is_leaf BOOL NOT NULL,
type TINYINT NOT NULL,
code BIGINT NOT NULL,
name VARCHAR(25) NOT NULL,
PRIMARY KEY (pid, id)
)
SELECT -1 pid, 0 id, FALSE is_leaf, 0 type, 0 code, '' name
UNION ALL
SELECT pid, id, type = 5 is_leaf, type, code, name
FROM data;
-- 建闭包表主表,并从临时数据表填充数据
CREATE TABLE closure (
id INT NOT NULL,
type TINYINT NOT NULL,
code BIGINT NOT NULL,
name VARCHAR(25) NOT NULL,
PRIMARY KEY (id)
)
SELECT 0 id, 0 type, 0 code, '' name
UNION ALL
SELECT id, type, code, name
FROM data;
-- 建闭包表树形关系表
CREATE TABLE closure_tree (
ancestor INT NOT NULL,
descendant INT NOT NULL,
distance TINYINT NOT NULL,
PRIMARY KEY (descendant, distance)
);
-- 递归构建树形关系
INSERT INTO closure_tree(ancestor, descendant, distance)
WITH RECURSIVE
parent_of(orig_id, id, dist) AS (
SELECT id, id, 0
FROM data
UNION ALL
SELECT orig_id, pid, dist + 1
FROM parent_of
JOIN data USING(id)
WHERE id
)
SELECT id, orig_id, dist
FROM parent_of;
-- 为闭包表树形关系表建二级索引
CREATE INDEX idx_closure_tree ON closure_tree (ancestor, distance);
-- 丢弃临时数据表
DROP TABLE data;
```
## `SQLite` 一键建表 `SQL`
下列 `SQL` 需要依赖 `SQLite Shell` 的 `.import --csv`,但核心 `SQLite` 库不提供此功能。
因此,需要使用命令行的 `SQLite` 来运行*(`Windows` 可去官网下载个 1~2 MB 的 `sqlite3.exe`)*。
下面使用 `Bash Shell` 来包装执行命令与 `SQL`,大约需要运行 30 秒,然后在同目录下生成 150 MB 左右的 `test.db`。
*( V 站排版原因,行首有全角空格)*
```sql
#!/bin/bash
sqlite3 :memory: <<'EOF'
-- 在内存中计算,最后整理紧凑才写入文件
PRAGMA TEMP_STORE = MEMORY;
-- 导入 csv 文件至临时表
CREATE TEMP TABLE csv (code INTEGER PRIMARY KEY, name TEXT, type INT, p_code INT);
.import --csv area_code_2022.csv csv
-- 建邻接表
CREATE TABLE adjacent (
id INT NOT NULL,
pid INT NOT NULL,
is_leaf INT NOT NULL,
type INT NOT NULL,
code INT NOT NULL,
name TEXT NOT NULL,
PRIMARY KEY (pid, id)
) WITHOUT ROWID;
-- 填充邻接表
INSERT INTO adjacent (pid, id, is_leaf, type, code, name)
SELECT -1, 0, FALSE, 0, 0, ""
UNION ALL
SELECT p_code, ROW_NUMBER() OVER (), type = 5, type, code, name
FROM csv
ORDER BY code;
-- 建临时索引,提速 code 搜索
CREATE INDEX i ON adjacent (code);
-- 更新 pid
UPDATE adjacent
SET pid =
t2.id FROM adjacent t2
WHERE adjacent.pid = t2.code;
-- 丢弃临时索引
DROP INDEX i;
-- 建 id -> pid 索引
CREATE INDEX idx_adjacent_id ON adjacent (id);
-- 建闭包表主表
CREATE TABLE closure (
id INTEGER PRIMARY KEY,
type INT NOT NULL,
code INT NOT NULL,
name TEXT NOT NULL
);
-- 建闭包表树形关系表
CREATE TABLE closure_tree (
ancestor INT NOT NULL,
descendant INT NOT NULL,
distance INT NOT NULL,
PRIMARY KEY (descendant, distance)
) WITHOUT ROWID;
-- 填充闭包表主表
INSERT INTO closure (id, type, code, name)
SELECT id, type, code, name
FROM adjacent;
-- 递归构建树形关系
WITH RECURSIVE
parent_of(orig_id, id, dist) AS (
SELECT id, id, 0
FROM adjacent
UNION ALL
SELECT orig_id, pid, dist + 1
FROM parent_of
JOIN adjacent USING(id)
WHERE id
)
INSERT INTO closure_tree (ancestor, descendant, distance)
SELECT id, orig_id, dist
FROM parent_of;
-- 为闭包表树形关系表建二级索引
CREATE INDEX idx_closure_tree ON closure_tree (ancestor, distance);
-- 整理紧实数据库后,写入磁盘
ANALYZE;
VACUUM INTO 'test.db';
EOF
```