Query Optimization¶

约 2038 个字 6 行代码 25 张图片预计阅读时间 7 分钟

Abstract

Introduction
Transformation of Relational Expressions
Statistical Information for Cost Estimation
Cost-based Optimization
Dynamic Programming for Choosing Evaluation Plans
Nested Subqueries
Materialized Views
Advanced Topics in Query Optimization

Introduction¶

Alternative ways of evaluating a given query

Equivalent expressions
逻辑优化：关系代数表达式（尽量先做选择，投影）
Different algorithms for each operation
物理层面：每个算子选择不同的算法

Example

Estimation of plan cost based on:

Statistical information about relations. Examples: number of tuples, number of distinct values for an attribute
Statistics estimation for intermediate results（Cardinality Estimation） to compute cost of complex expressions
估计中间结果的大小
现在有基于深度学习的估计方法
Cost formulae for algorithms, computed using statistics

关系数据库里可以用查看执行计划。

Generating Equivalent Expressions¶

Two relational algebra expressions are said to be equivalent if the two expressions generate the same set of tuples on every legal database instance
形式上不一样，但是结果（输出）是一样的，产生了相同的集合。

Equivalence Rules¶

selection
1. 可以把算子拆分
  如果某些属性有索引，那么可以先拆分，在索引 select 之后再执行其他算子，否则不如不拆分。
2. 算子可交换
  先执行有索引的算子。
3. 投影的属性可以只保留最后一次的
4. 选择算子可以和合并结合
join

自然连接是结合的（先连接中间结果小的）

如果选择算子只和一个关系有关，那么我们可以先执行选择。（选择算子要早进行，推到叶子上）
projection

同理投影也要早做。
如果连接要用到投影后不保留的属性，我们在第一次投影时要把连接用的属性也保留下来。
set operation

这里的减法，减数关系就不用做选择了（减去多的总是没问题的）对交集也适用
other

Enumeration of Equivalent Expressions¶

Repeat
- apply all applicable equivalence rules on every subexpression of every equivalent expression found so far
- add newly generated expressions to the set of equivalent expressions
Until no new equivalent expressions are generated above

可以这样找到所有的等价表达式。

但是实际中我们基于一些经验规则进行启发式的优化

Statistics for Cost Estimation¶

代价估算需要统计信息

$n_r$: number of tuples in a relation r.
$b_r$: number of blocks containing tuples of r.
$l_r$: size of a tuple of r.
$f_r$: blocking factor of r — i.e., the number of tuples of r that fit into one block.
一个块可以放多少个元组
$V(A, r)$: number of distinct values that appear in r for attribute A; same as the size of $\Pi(r)$.
If tuples of r are stored together physically in a file, then: $b_r = \lceil \dfrac{n_r}{f_r}\rceil$
Histograms

attribute age of relation person

Selection Size Estimation¶

中间结果

$\sigma_{A=v}(r)$
$n_r / V(A,r)$ : number of records that will satisfy the selection.
这样的估算基于值是平均分布的
如果要找的是一个 key, 那么 size estimate=1
$\sigma_{A\leq v}(r)$
- Let $c$ denote the estimated number of tuples satisfying the condition.
- $c = 0$ if $v < \min(A,r)$
  v 比属性 A 的最小值还要小
- $c = n_r\cdot \dfrac{v-\min(A,r)}{\max(A,r) - \min A(A,r)}$
- In absence of statistical information c is assumed to be $n_r / 2$ ( 没有最大、最小统计信息时 ).

概率论。
注意这些公式的要求是条件是相互独立的。

Joins¶

The Cartesian product $r \times s$ contains $n_r\cdot n_s$ tuples; each tuple occupies $s_r + s_s$ bytes.

$R \cap S = \emptyset$
没有公共属性，等价于 $r\times s$
$R \cap S$ is a key for $R$, then a tuple of $s$ will join with at most one tuple from $r$

Example
If $R \cap S$ in S is a foreign key in S referencing R, then the number of tuples in $r\bowtie s$ = the number of tuples in s.
If $R \cap S = \{A\}$ is not a key for R or S.
$n_r * \dfrac{n_s}{V(A,s)}, n_s * \dfrac{n_r}{V(A,r)}$.
以第二个为例子，站在 s 的角度，每一个 s 可以和这么多个元素连接。
通常我们取二者中的较小值。

Example

Size Estimation for Other Operations¶

外部连接 r, s 认为是 r s 自然连接的结果加上 r 的大小。

Estimation of Number of Distinct Values¶

估算 V(A,r).

Selections $\sigma_\theta(r)$, estimate $V(A,\sigma_\theta(r))$

If $\theta$ forces A to take a specified value, $V(A,\sigma_\theta(r))=1$
e.g., A = 3
If $\theta$ forces A to take on one of a specified set of values: $V(A,\sigma_\theta(r))=$ number of specified values
e.g., (A = 1 V A = 3 V A = 4)
If the selection condition $\theta$ is of the form A op v, $V(A,\sigma_\theta(r))=V(A,r)*s$
利用选择率 s 计算
In all the other cases, use approx1imate estimate: $V(A,\sigma_\theta(r))=\min(V(A,r), \n_{\sigma_\theta(r)})$

joins $r\bowtie s$, estimate $V(A,r\bowtie s)$

If all attributes in A are from r, the estimated $V(A,r\bowtie s)=\min(V(A,r), n_{r\bowtie s})$
If A contains attributes A1 from r and A2 from s, then estimated $V(A,r\bowtie s)=\min(V(A1,r)*V(A2-A1,s), V(A1-A2,r)*V(A2,s), n_{r\bowtie s})$

Choice of Evaluation Plans¶

Must consider the interaction of evaluation techniques when choosing evaluation plans

choosing the cheapest algorithm for each operation independently may not yield best overall algorithm
e.g. merge-join may be costlier than hash-join, but may provide a sorted output which reduces the cost for an outer level aggregation.
Mergejoin 代价高，但是有个好处是 join 后是有次序的，对上层操作有利。

如果要找最优的执行计划，可能需要很长时间。通常按照经验规则。
我们主要考虑连接操作的优化。

Cost-Based Join-Order Selection¶

Consider finding the best join-order for $r_1\bowtie r_2\bowtie \ldots r_n$.
There are $(2(n – 1))!/(n – 1)!$ different join orders for above expression.

Example

Using dynamic programming, the least-cost join order for any subset of $\{r_1, r_2, \ldots r_n\}$ is computed only once and stored for future use.

Join Order Optimization Algorithm

先分解成两个小的集合 $S_1, S-S_1$. 递归地细分。
递归到最底层就变为了对单个表的选择方法。

Left Deep Join Trees¶

In left-deep join trees, the right-hand-side input for each join is a relation, not the result of an intermediate join.

左边可以是中间结果，右边必须是一个关系。

Cost of Optimization¶

With dynamic programming
- time complexity of optimization with bushy trees is $O(3^n)$.
- Space complexity is $O(2^n)$
left-deep join tree
- Time complexity of finding best join order is $O(n 2^n)$
- Space complexity remains at $O(2^n)$

Heuristic Optimization¶

Cost-based optimization is expensive.
可以用启发式优化

Heuristic optimization transforms the query-tree by using a set of rules that typically (but not in all cases) improve execution performance:

Perform selection early (reduces the number of tuples)
Perform projection early (reduces the number of attributes)
Perform most restrictive selection and join operations (i.e. with smallest result size) before other similar operations.
Perform left-deep join order

Additional Optimization Techniques¶

Nested Subqueries¶

Nested query example:

select name from instructor 
where exists 
    (select * from teaches
    where instructor.ID = teaches.ID and teaches.year = 2022)

找出 2022 开课的老师的名字。

两重循环，但是低效。

Parameters are variables from outer level query that are used in the nested subquery; such variables are called correlation variables（相关变量）
即来自外循环的变量。如果没有相关变量，我们可以先执行内部，然后再执行外部。

把刚刚那个例子改为一个 select 语句，那么一个老师如果开了很多门课就会出现很多个名字。但是加上 distinct 关键词后又无法区分同名情况。

半连接 $⋉$\theta s$，检验 r 是否满足某个关系。
If a tuple $r_i$ appears n times in r, it appears n times in the result of $r $⋉$\theta s$ , if there is at least one tuple $s_i$ in s matching with $r_i$.

Example

The process of replacing a nested query by a query with a join/semijoin (possibly with a temporary relation) is called decorrelation( 去除相关 )

Decorrelation of scalar aggregate subqueries can be done using groupby/aggregation in some cases

Example

Materialized Views¶

A materialized view is a view whose contents are computed and stored.

有些数据库里把 view 实例化了，真正存储在内部的临时表。

create view department_total_salary(dept_name, total_salary) 
as select dept_name, sum(salary) from instructor group by dept_name

Saves the effort of finding multiple tuples and adding up their amounts.
但是需要时刻保持这个视图和原表一致。

use incremental view maintenance( 增量视图维护 )
The changes (inserts and deletes) to a relation or expressions are referred to as its differential( 差分 )

join: $V^{new}=V^{old}\cup (i_r\bowtie s), V^{new} = V^{old}-(d_r\bowtie s)$

join
select: $V^{new}=V^{old}\cup \sigma_\theta(i_r), V^{new} = V^{old}-\sigma_\theta(d_r)$
projection:
For each tuple in a projection $\Pi_A(r)$, we will keep a count of how many times it was derived.
- On insert of a tuple to r, if the resultant tuple is already in $\Pi_A(r)$ we increment its count, else we add a new tuple with count = 1
- On delete of a tuple from r, we decrement the count of the corresponding tuple in $\Pi_A(r)$ if the count becomes 0, we delete the tuple from $\Pi_A(r)$
Projection
count $v= _Ag_{count(B)}$
- insert: For each tuple r in $i_r$, if the corresponding group is already present in v, we increment its count, else we add a new tuple with count = 1
- delete: for each tuple t in $i_r$.we look for the group t.A in v, and subtract 1 from the count for the group.
  If the count becomes 0, we delete from v the tuple for the group t.A
sum $v= _Ag_{sum(B)}$
min, max

怎么利用这些 view?

Rewriting queries to use materialized views:

Example
Replacing a use of a materialized view by the view definition

Materialized View Selection
有哪些查询？各种查询的比例？

Query Optimization¶

Introduction¶

Generating Equivalent Expressions¶

Equivalence Rules¶

Enumeration of Equivalent Expressions¶

Statistics for Cost Estimation¶

Selection Size Estimation¶

Joins¶

Size Estimation for Other Operations¶

Estimation of Number of Distinct Values¶

Choice of Evaluation Plans¶

Cost-Based Join-Order Selection¶

Left Deep Join Trees¶

Cost of Optimization¶

Heuristic Optimization¶

Additional Optimization Techniques¶

Nested Subqueries¶

Materialized Views¶

评论