Query Optimization

  • Introduction
  • Transformation of Relational Expressions
  • Statistical Information for Cost Estimation
  • Cost-based Optimization
  • Dynamic Programming for Choosing Evaluation Plans
  • Nested Subqueries
  • Materialized Views
  • Advanced Topics in Query Optimization


Alternative ways of evaluating a given query

  • Equivalent expressions
  • Different algorithms for each operation

Estimation of plan cost based on:

  • Statistical information about relations. Examples: number of tuples, number of distinct values for an attribute
  • Statistics estimation for intermediate results(Cardinality Estimation) to compute cost of complex expressions
  • Cost formulae for algorithms, computed using statistics


Generating Equivalent Expressions

Two relational algebra expressions are said to be equivalent if the two expressions generate the same set of tuples on every legal database instance

Equivalence Rules

  • selection

    1. 可以把算子拆分
      如果某些属性有索引,那么可以先拆分,在索引 select 之后再执行其他算子,否则不如不拆分。
    2. 算子可交换
    3. 投影的属性可以只保留最后一次的
    4. 选择算子可以和合并结合
  • join



  • projection


  • set operation


  • other

Enumeration of Equivalent Expressions

  • Repeat
    • apply all applicable equivalence rules on every subexpression of every equivalent expression found so far
    • add newly generated expressions to the set of equivalent expressions
  • Until no new equivalent expressions are generated above



Statistics for Cost Estimation


  • \(n_r\): number of tuples in a relation r.
  • \(b_r\): number of blocks containing tuples of r.
  • \(l_r\): size of a tuple of r.
  • \(f_r\): blocking factor of r — i.e., the number of tuples of r that fit into one block.
  • \(V(A, r)\): number of distinct values that appear in r for attribute A; same as the size of \(\Pi(r)\).
  • If tuples of r are stored together physically in a file, then: \(b_r = \lceil \dfrac{n_r}{f_r}\rceil\)
  • Histograms
attribute age of relation person

Selection Size Estimation


  • \(\sigma_{A=v}(r)\)
    \(n_r / V(A,r)\) : number of records that will satisfy the selection.
    如果要找的是一个 key, 那么 size estimate=1
  • \(\sigma_{A\leq v}(r)\)
    • Let \(c\) denote the estimated number of tuples satisfying the condition.
    • \(c = 0\) if \(v < \min(A,r)\)
      v 比属性 A 的最小值还要小
    • \(c = n_r\cdot \dfrac{v-\min(A,r)}{\max(A,r) - \min A(A,r)}\)
    • In absence of statistical information c is assumed to be \(n_r / 2\) (没有最大、最小统计信息时).



The Cartesian product \(r \times s\) contains \(n_r\cdot n_s\) tuples; each tuple occupies \(s_r + s_s\) bytes.

  • \(R \cap S = \emptyset\)
    没有公共属性,等价于 \(r\times s\)
  • \(R \cap S\) is a key for \(R\), then a tuple of \(s\) will join with at most one tuple from \(r\)


  • If \(R \cap S\) in S is a foreign key in S referencing R, then the number of tuples in \(r\bowtie s\) = the number of tuples in s.

  • If \(R \cap S = \{A\}\) is not a key for R or S.
    \(n_r * \dfrac{n_s}{V(A,s)}, n_s * \dfrac{n_r}{V(A,r)}\).
    以第二个为例子,站在 s 的角度,每一个 s 可以和这么多个元素连接。


Size Estimation for Other Operations

外部连接 r, s 认为是 r s 自然连接的结果加上 r 的大小。

Estimation of Number of Distinct Values

估算 V(A,r).

Selections \(\sigma_\theta(r)\), estimate \(V(A,\sigma_\theta(r))\)

  • If \(\theta\) forces A to take a specified value, \(V(A,\sigma_\theta(r))=1\)
    e.g., A = 3
  • If \(\theta\) forces A to take on one of a specified set of values: \(V(A,\sigma_\theta(r))=\) number of specified values
    e.g., (A = 1 V A = 3 V A = 4)
  • If the selection condition \(\theta\) is of the form A op v, \(V(A,\sigma_\theta(r))=V(A,r)*s\)
    利用选择率 s 计算
  • In all the other cases, use approx1imate estimate: \(V(A,\sigma_\theta(r))=\min(V(A,r), \n_{\sigma_\theta(r)})\)

joins \(r\bowtie s\), estimate \(V(A,r\bowtie s)\)

  • If all attributes in A are from r, the estimated \(V(A,r\bowtie s)=\min(V(A,r), n_{r\bowtie s})\)
  • If A contains attributes A1 from r and A2 from s, then estimated \(V(A,r\bowtie s)=\min(V(A1,r)*V(A2-A1,s), V(A1-A2,r)*V(A2,s), n_{r\bowtie s})\)

Choice of Evaluation Plans

Must consider the interaction of evaluation techniques when choosing evaluation plans

choosing the cheapest algorithm for each operation independently may not yield best overall algorithm
e.g. merge-join may be costlier than hash-join, but may provide a sorted output which reduces the cost for an outer level aggregation.
Mergejoin 代价高,但是有个好处是 join 后是有次序的,对上层操作有利。


Cost-Based Join-Order Selection

Consider finding the best join-order for \(r_1\bowtie r_2\bowtie \ldots r_n\).
There are \((2(n – 1))!/(n – 1)!\) different join orders for above expression.


Using dynamic programming, the least-cost join order for any subset of \(\{r_1, r_2, \ldots r_n\}\) is computed only once and stored for future use.

Join Order Optimization Algorithm

先分解成两个小的集合 \(S_1, S-S_1\). 递归地细分。

Left Deep Join Trees

In left-deep join trees, the right-hand-side input for each join is a relation, not the result of an intermediate join.


Cost of Optimization

  • With dynamic programming
    • time complexity of optimization with bushy trees is \(O(3^n)\).
    • Space complexity is \(O(2^n)\)
  • left-deep join tree
    • Time complexity of finding best join order is \(O(n 2^n)\)
    • Space complexity remains at \(O(2^n)\)

Heuristic Optimization

Cost-based optimization is expensive.

Heuristic optimization transforms the query-tree by using a set of rules that typically (but not in all cases) improve execution performance:

  • Perform selection early (reduces the number of tuples)
  • Perform projection early (reduces the number of attributes)
  • Perform most restrictive selection and join operations (i.e. with smallest result size) before other similar operations.
  • Perform left-deep join order

Additional Optimization Techniques

Nested Subqueries

Nested query example:

select name from instructor 
where exists 
    (select * from teaches
    where instructor.ID = teaches.ID and teaches.year = 2022)
找出 2022 开课的老师的名字。


Parameters are variables from outer level query that are used in the nested subquery; such variables are called correlation variables(相关变量)

把刚刚那个例子改为一个 select 语句,那么一个老师如果开了很多门课就会出现很多个名字。但是加上 distinct 关键词后又无法区分同名情况。

半连接 \(⋉\)_\theta s$,检验 r 是否满足某个关系。
If a tuple \(r_i\) appears n times in r, it appears n times in the result of \(r \(⋉\)_\theta s\) , if there is at least one tuple \(s_i\) in s matching with \(r_i\).


The process of replacing a nested query by a query with a join/semijoin (possibly with a temporary relation) is called decorrelation(去除相关)

Decorrelation of scalar aggregate subqueries can be done using groupby/aggregation in some cases


Materialized Views

A materialized view is a view whose contents are computed and stored.

有些数据库里把 view 实例化了,真正存储在内部的临时表。

create view department_total_salary(dept_name, total_salary) 
as select dept_name, sum(salary) from instructor group by dept_name

Saves the effort of finding multiple tuples and adding up their amounts.

use incremental view maintenance(增量视图维护)
The changes (inserts and deletes) to a relation or expressions are referred to as its differential(差分)

  • join: \(V^{new}=V^{old}\cup (i_r\bowtie s), V^{new} = V^{old}-(d_r\bowtie s)\)


  • select: \(V^{new}=V^{old}\cup \sigma_\theta(i_r), V^{new} = V^{old}-\sigma_\theta(d_r)\)

  • projection:
    For each tuple in a projection \(\Pi_A(r)\), we will keep a count of how many times it was derived.

    • On insert of a tuple to r, if the resultant tuple is already in \(\Pi_A(r)\) we increment its count, else we add a new tuple with count = 1
    • On delete of a tuple from r, we decrement the count of the corresponding tuple in \(\Pi_A(r)\) if the count becomes 0, we delete the tuple from \(\Pi_A(r)\)

  • count \(v= _Ag_{count(B)}\)

    • insert: For each tuple r in \(i_r\), if the corresponding group is already present in v, we increment its count, else we add a new tuple with count = 1
    • delete: for each tuple t in \(i_r\).we look for the group t.A in v, and subtract 1 from the count for the group.
      If the count becomes 0, we delete from v the tuple for the group t.A
  • sum \(v= _Ag_{sum(B)}\)

  • min, max

怎么利用这些 view?

  • Rewriting queries to use materialized views:


  • Replacing a use of a materialized view by the view definition

Materialized View Selection

最后更新: 2023年5月29日 08:22:24
创建日期: 2023年5月29日 08:22:24