The Processor


Instruction Execution Overview

CPU performance factors

  • Instruction count
    Determined by ISA and compiler
    如同样的功能用 Intel 和 RISC-V 的处理器实现,英特尔的指令用的更少(因为更复杂)
  • CPI and Cycle time
    Determined by CPU hardware

For every instruction, the first two steps are identical

  • Fetch the instruction from the memory
  • Decode and read the registers

Next steps depend on the instruction class

  • Memory-reference
    load, store
  • Arithmetic-logical
  • branches

CPU Overview

  • Use ALU to calculate
    • Arithmetic result
    • Memory address for load/store
    • Branch comparison
      因为我们是单周期,因此 ALU 只能做比较,具体跳转的地址由单独的 Adder 计算。
  • Access data memory for load/store
  • PC \(\leftarrow\) target address or PC + 4


Can’t just join wires together -- Use multiplexers.


  • different sources for unit
  • read/write memory

Logical Design Convention

  • Information encoded in binary
    • Low voltage = 0, High voltage = 1
    • One wire per bit
    • Multi-bit data encoded on multi-wire buses
  • Combinational element
    • Operate on data
    • Output is a function of input
  • State (sequential) elements
    Store information

Building a datapath

Elements that process data and addresses in the CPU.

同类指令的 opcode 是一样的(I 型指令的里逻辑运算、load 指令、jal 不同),具体功能由 Func 决定。(因此不把所有操作编到 opcode 内)

Instruction execution in RISC-V

  • Fetch:
    • Take instructions from the instruction memory
    • Modify PC to point the next instruction
  • Instruction decoding & Read Operand:
    • Will be translated into machine control command
    • Reading Register Operands, whether or not to use
  • Executive Control:
    • Control the implementation of the corresponding ALU operation
  • Memory access:
    • Write or Read data from memory
    • Only ld/sd
  • Write results to register:
    • If it is R-type instructions, ALU results are written to rd
    • If it is I-type instructions, memory data are written to rd
  • Modify PC for branch instructions

Instruction fetching

  • R-format Instructions
    • Read 2 register operands
    • Perform arithmetic/logical operation
    • Write register result
  • Load/Store Instructions
    • Read register operands
    • Calculate address using 12-bit offset
      Use ALU, but sign-extend offset
    • Load: Read memory and update register
    • Store: Write register value to memory
  • Branch Instructions
    • Read 2 register operands
    • Compare operands
      use ALU, substract and check Zero output
    • Calculate target address
      • Sign-extend displacement
      • Shift left 1 place (halfword displacement)
      • Add to PC value

First-cut data path does an instruction in one clock cycle

  • Each datapath element can only do one function at a time
  • Hence, we need separate instruction and data memories

Path Built using Multiplexer

  • R-type instruction Datapath

  • I-type instruction Datapath

    • For ALU
    • For load
  • S-type (store) instruction Datapath

  • SB-type (branch) instruction Datapath
  • UJ-type instruction Datapath

Full datapath

A simple Implementation Scheme

Analyse for cause and effect

  • Information comes from the 32 bits of the instruction
  • Selecting the operations to perform (ALU, read/write, etc.)
  • Controlling the flow of data (multiplexor inputs)
  • ALU's operation based on instruction type and function code
  • 7 个控制信号和一个 4 位的 ALU_operation.
  • MemToReg 有三个源,分别是 load 指令,PC 来的(jal, jalr 存 PC+4), ALU 出来的(R 型,I 型)

ALU symbol & Control

ALU used for

  • Load/Store: F = add
  • Branch: F = subtract
  • R-type: F depends on opcode

因此我们进行两级解码 2-level decode.

First level

  • 一级解码后,可以决定除了 ALU_opration 以外的控制信号
  • 同时我们会解码出 2 位的 ALU_op.

second level

ALU operation is decided by 2-bit ALUOp derived from opcode, and funct7 & funct3 fields of the instruction.

Datapath with Control


An overview of pipelining


对于单周期 CPU, CPI 是 1, 但时钟周期会很长。
假设取指令 200ps, 寄存器读 100ps, ALU 计算 200ps, 内存访问 200ps, 寄存器写 100ps.
那么 add 需要 600ps, ld 需要 800ps, sd 需要 700ps, beq 需要 500ps.

Longest delay determines clock period. (ld)
We will improve performance by pipelining.



  1. IF: Instruction fetch from memory
  2. ID: Instruction decode & register read
  3. EX: Execute operation or calculate address
  4. MEM: Access memory operand
  5. WB: Write result back to register

\(CPI \approx 1\)

  • One instruction will be issued (or finished) each cycle.
  • During any cycle, one instruction is present in each stage.

Pipeline Performance

不同操作的时间也不同,流水线 CPU 的时钟周期为最长的操作时间。

流水线可以提高吞吐量(throughput) 但并没有改变每条指令的执行时间(Latency)。

RISC-V ISA designed for pipelining

RISC-V 适合流水线设计。

  • All instructions are 32-bits
    Easier to fetch and decode in one cycle
  • Few and regular instruction formats
    Can decode and read registers in one step
  • Load/store addressing
    Can calculate address in 3rd stage, access memory in 4th stage



  • Structure hazards
    A required resource is busy.
  • Data hazard
    Need to wait for previous instruction to complete its data read/write.
  • Control hazard
    Deciding on control action depends on previous instruction.

Structure Hazards

如果只有一块内存,但 IF 和 MEM 阶段都需要使用这块内存,那么 IF 就会被 stall 暂停,造成一个 bubble. (即流水线内有一个时刻是清空的,因为没有我们没有去取指令)

Pipelined datapaths require separate instruction/data memories. (Or separate instruction/data caches)

Data Hazards

An instruction depends on completion of data access by a previous instruction.


注意这里 WB 是在前半个周期将数据写入寄存器, ID 是在后半个周期将数据从寄存器中取出。
但实际上我们要用的结果在 EX 时已经产生了。

  • Forwarding(Bypassing)
    Use result when it is computed

    • Don’t wait for it to be stored in a register
    • Requires extra connections in the datapath

Load-Use Data Hazard


流水线的 CPI 不可能等于 1, 因为上图这种情况一定会发生(ld 无法避免)。


修改后, ld x2, 8(x0) 已经从内存中读出数据还没有写回寄存器,可以直接送到 add x3, x1, x2 中去。

Control Hazards

Branch determines flow of control

  • Fetching next instruction depends on branch outcome
  • Pipeline can’t always fetch correct instruction

可以预测,或者将比较放在 ID 级。


Branch Prediction
可以预测 branch 命中或者不命中。

  • Static branch prediction
  • Dynamic branch prediction


  • Pipelining improves performance by increasing instruction throughput
    • Executes multiple instructions in parallel
    • Each instruction has the same latency
  • Subject to hazards(Structure, data, control)
  • Instruction set design affects complexity of pipeline implementation

RISC-V Pipelined Datapath


注意到,第五个阶段写回时,写寄存器号应该是从 MEM/WB 中的,而不是 IF 出的寄存器号。

  • Single-clock-cycle pipeline diagram: Shows pipeline usage in a single cycle; Highlight resources used
  • multi-clock-cycle diagram: Graph of operation over time

Pipelined Control

控制信号 ID 阶段全部译出,随后通过寄存器传到后续阶段中去。
有些信号在 WB 阶段用,有些在 M 阶段用,有些在 EX. 使用之后就不再需要把信号传递下去。

Data Hazards



and 语句中将 EX 的结果前递, or 语句中要将 DM 的结果前递(此时 EX 的值已经不是第一条指令的结果了)

Rd 要一直传下来,我们可以借此判断要读出的寄存器是否是之前指令要写回的寄存器。
Data hazards when

  • EX/MEM.RegisterRd = ID/EX.RegisterRs1
    在第四个时钟周期时把 EX/MEM 的寄存器值送到 ALU.
  • EX/MEM.RegisterRd = ID/EX.RegisterRs2
  • MEM/WB.RegisterRd = ID/EX.RegisterRs1
  • MEM/WB.RegisterRd = ID/EX.RegisterRs2

只有在我们要改变寄存器的值的时候才能用前递的值(有些指令可能根本就不会写回寄存器,没有有效的 Rd)
EX/MEM.RegWrite, MEM/WB.RegWrite 不为 0.
此外 EX/MEM.RegisterRd, MEM/WB.RegisterRd 也不能为 0.

Mux control Source Explanation
ForwardA = 00 ID/EX The first ALU operand comes from the register file.
ForwardA = 10 EX/MEM The first ALU operand is forwarded from the prior ALU result.
ForwardA = 01 MEM/WB The first ALU operand is forwarded from data memory or an earlier ALU result.

Double Data Hazard

我们前递时要加一个条件,只有在 EX/MEM 的条件不成立时,才能查看 MEM/WB 的条件。

Load-Use Hazard Detection

如果我们在 ld 指令 EX/MEM 时暂停,此时 ld 后面有两条指令需要暂停,其实我们可以更早的发现这个问题。

Load-use hazard when

  • ID/EX.MemRead and ((ID/EX.RegisterRd = IF/ID.RegisterRs1) or (ID/EX.RegisterRd = IF/ID.RegisterRs2))

If detected, stall and insert bubble


在 ld 后面 stall,使得这级的控制信号为 0, 同时阻止 PC, IF/ID 寄存器的更新。
相当于又进行了一次 decode.


  • Stalls reduce performance
    • But are required to get correct results
  • Compiler can arrange code to avoid hazards and stalls

Branch Hazards

是否跳转在 MEM 阶段决定。

最朴素的方法是无视这种情况,因为前三个阶段并不涉及对寄存器和内存修改,即使我们预测后续不执行这些指令也不会带来影响。但这样可能带来 CPI 的显著降低。

一种方法是在 ID 级决定是否跳转(此时已经有了两个源操作数和立即数)


这样我们只可能多余一条指令,插入 bubble 即可。但是这样也不可接受。

Dynamic Branch Prediction

In deeper and superscalar(多发射) pipelines, branch penalty is more significant.

Use dynamic prediction.

  • Branch prediction buffer (aka branch history table)
  • Indexed by recent branch instruction addresses
  • Stores outcome (taken/not taken)
  • To execute a branch
    • Check table, expect the same outcome
    • Start fetching from fall-through or target
    • If wrong, flush pipeline and flip prediction

可以处理 for 循环的预测。

1-Bit Predictor


  • Mispredict as taken on last iteration of inner loop
  • Then mispredict as not taken on first iteration of inner loop next time around

可以修改预测方式:只有连续错误两次才会修改我们的预测。即 2-Bit Predictor.

Calculating the Branch Target

即使 branch 命中了,我们也有一个时钟的 bubble 来计算地址。
但 branch 要跳转的地址其实是可以存下来的。

Branch target buffer

  • Cache of target addresses
  • Indexed by PC when instruction fetched
    If hit and instruction is branch predicted taken, can fetch target immediately

Exceptions and Interrupts

“Unexpected” events requiring change in flow of control.

  • Exception 异常
    Arises within the CPU
    e.g. undefined opcode, syscall, …
  • Interrupt 中断
    From an external I/O controller

Handling Exceptions

  • Save PC of offending (or interrupted) instruction
    In RISC-V: Supervisor Exception Program Counter(SEPC)
  • Save indication of the problem
    In RISC-V: Supervisor Exception Cause Register(SCAUSE)
    64 bits, but most bits unused
    e.g. Exception code field: 2 for undefined opcode, 12 for hardware malfunction...
  • Jump to handler
    • Direct
    • Vectored Interrupts
      Handler address determined by the cause.
      Exception vector address to be added to a vector table base register


Handler Actions


  • Read cause, and transfer to relevant handler
  • Determine action required
  • If restartable
    • Take corrective action
    • use SEPC to return to program
  • Otherwise
    • Terminate program
    • Report error using SEPC, SCAUSE, ...

Exceptions in a Pipeline

Another form of control hazard.

Consider malfunction on add in EX stage. add x1, x2, x1

  • Prevent x1 from being clobbered
  • Complete previous instructions
  • Flush add and subsequent instructions
  • Set SEPC and SCAUSE register values
  • Transfer control to handler

Instruction-Level Parallelism (ILP)

Pipelining: executing multiple instructions in parallel

To increase ILP

  • Deeper pipeline
  • Multiple issue
    这样可以做到 CPI < 1. 使用 Instructions Per Cycle (IPC)

Multiple issue

  • Static multiple issue
  • Dynamic multiple issue
    CPU 自己决定执行那些指令,编译器会帮忙重排指令,但还是由 CPU 解决竞争。

Static multiple issue

RISC-V with Static Dual Issue

ALU/branch 指令不用访问内存。


Loop Unrolling

Use different registers per replication

  • Called “register renaming”
  • Avoid loop-carried “anti-dependencies”
Loop Unrolling Example

最后更新: 2023年6月2日 23:16:40
创建日期: 2023年4月1日 00:07:40