

# Lecture 5: Memory (2)

## Xianwei Zhang (张献伟)

#### <u>xianweiz.github.io</u>

DCS5367, 10/26/2021





Homework: HW1

- <u>https://xianweiz.github.io/teach/dcs5637/hws/hw1.pdf</u>
- Deadline: 10.31, 23:59
- Submission: 超算习堂 (EasyHPC)
  - Register (<u>https://easyhpc.net/</u>)
  - Join the course (<u>https://easyhpc.net/course/133</u>)
  - Assignment list: HW1

John L. Hennessy and David A. Patterson, *Computer Architecture: A Quantitative Approach*



Advanced Computer Architecture, Fall 2021. Course homepage: https://xianv






#### **Review Questions**

- What is the 'tag' in a cache access? The part of the address that is compared against the stored tags to decide whether the access is a hit or a miss.
- Cache associativity? The cache is organized as sets, each of which contains multiple blocks (ways).
- Disadvantages of higher associativity? Larger tags, and higher overhead in the tag comparators and the data mux.
- Write back and write through? Write back first writes into the cache and updates memory only when the block is evicted; write through updates memory on every write.
- Steps of address translation? MMU → page table lookup → page fault (if the page is not present) → retry
- Why do we say DRAM is 'dynamic'? The cells leak charge, so the data must be refreshed periodically.
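
To make the tag and set questions concrete, here is a minimal sketch of how an address could be split into tag, set index, and block offset. The struct and function names are illustrative, and the cache geometry (block size, number of sets) is passed in as an assumption rather than taken from any particular processor.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an address as seen by a set-associative cache (illustrative names). */
struct cache_addr {
    uint64_t tag;          /* compared against stored tags to decide hit/miss */
    uint64_t set_index;    /* selects the set */
    uint64_t block_offset; /* byte within the block */
};

/* Split addr for a cache with the given block size and number of sets. */
static struct cache_addr split_address(uint64_t addr, unsigned block_bytes,
                                       unsigned n_sets)
{
    assert(block_bytes > 0 && n_sets > 0);
    struct cache_addr a;
    uint64_t block_number = addr / block_bytes;
    a.block_offset = addr % block_bytes;
    a.set_index    = block_number % n_sets;
    a.tag          = block_number / n_sets;
    return a;
}
```

For a fixed capacity and block size, doubling the associativity halves `n_sets`, so one bit moves from the index into the tag, which is why higher associativity means larger tags.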



## Page Mode

- A "DRAM row" is also called a "DRAM page"
  - Usually larger than the OS page, e.g., 8KB vs. 4KB
- Row buffers act as a cache within DRAM
- Open page
  - Row buffer hit: ~20 ns access time (only needs to move data from the row buffer to the pins)
  - Row buffer conflict: ~60 ns (must first precharge the bitlines, then read the new row, then move data to the pins)
- Closed page
  - Empty row buffer access: ~40 ns (must first read the array, then move data from the row buffer to the pins)
  - Steps
    - Activate command opens row (placed into row buffer)
    - Read/write command reads/writes column in the row buffer
    - Precharge command closes the row and prepares the bank for next access
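
The three latencies above can be read as sums of the precharge, activate, and column-access steps. The sketch below assumes each step costs about 20 ns, which is only a decomposition chosen to reproduce the ~20/40/60 ns figures; the enum and function names are hypothetical.

```c
#include <assert.h>

/* Row-buffer state seen by an incoming access (illustrative names). */
enum row_state { ROW_HIT, ROW_EMPTY, ROW_CONFLICT };

/* Classify an access; open_row < 0 means no row is currently open. */
static enum row_state classify(long open_row, long req_row)
{
    if (open_row < 0)
        return ROW_EMPTY;
    return open_row == req_row ? ROW_HIT : ROW_CONFLICT;
}

/* Approximate latency in ns, assuming ~20 ns per step. */
static int dram_access_ns(enum row_state s)
{
    const int t_pre = 20; /* precharge: close the open row       */
    const int t_act = 20; /* activate: array -> row buffer       */
    const int t_cas = 20; /* column access: row buffer -> pins   */
    switch (s) {
    case ROW_HIT:      return t_cas;
    case ROW_EMPTY:    return t_act + t_cas;
    case ROW_CONFLICT: return t_pre + t_act + t_cas;
    }
    assert(0 && "unknown row state");
    return -1;
}
```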

#### DRAM Bandwidth

- Reading from a cell in the core array is a very slow process
  - DDR: Core speed = ½ interface speed
  - DDR2/GDDR3: Core speed = ¼ interface speed
  - DDR3/GDDR4: Core speed = ⅛ interface speed
  - ... likely to be worse in the future
- Calculation: *transfer\_rate* \* *interface\_width*
  - Example: 266 MT/s \* 64b = 2128 MB/s

| Standard | I/O clock rate | M transfers/s | DRAM name  | MiB/s/DIMM | DIMM name |
|----------|----------------|---------------|------------|------------|-----------|
| DDR1     | 133            | 266           | DDR266     | 2128       | PC2100    |
| DDR1     | 150            | 300           | DDR300     | 2400       | PC2400    |
| DDR1     | 200            | 400           | DDR400     | 3200       | PC3200    |
| DDR2     | 266            | 533           | DDR2-533   | 4264       | PC4300    |
| DDR2     | 333            | 667           | DDR2-667   | 5336       | PC5300    |
| DDR2     | 400            | 800           | DDR2-800   | 6400       | PC6400    |
| DDR3     | 533            | 1066          | DDR3-1066  | 8528       | PC8500    |
| DDR3     | 666            | 1333          | DDR3-1333  | 10,664     | PC10700   |
| DDR3     | 800            | 1600          | DDR3-1600  | 12,800     | PC12800   |
| DDR4     | 1333           | 2666          | DDR4-2666  | 21,300     | PC21300   |
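
The bandwidth column follows directly from the calculation rule above. A minimal sketch, with a hypothetical function name, that reproduces a few of the listed values:

```c
#include <assert.h>
#include <stdint.h>

/* Peak bandwidth in MB/s = (mega-transfers per second) * (bus width in bytes). */
static uint64_t peak_bandwidth_mbs(uint64_t mega_transfers_per_s,
                                   unsigned bus_width_bits)
{
    assert(bus_width_bits % 8 == 0);
    return mega_transfers_per_s * (bus_width_bits / 8);
}
```

For example, `peak_bandwidth_mbs(266, 64)` gives 2128 and `peak_bandwidth_mbs(1600, 64)` gives 12800, matching the DDR266 and DDR3-1600 rows.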



#### Memory Power

- Dynamic + static
  - read/write + standby
- Reduce power
  - Drop operating voltage
  - Power-down mode: disable the memory, except internal automatic refresh





#### DRAM Variants

- DDR
  - DDR3: 1.5V, 800MHz, 64b → 1.6G\*64b = 12.8GB/s
- GDDR: graphics memory for GPUs
  - GDDR5: based on DDR3, 8Gb/s, 32b → 8G\*32b = 32GB/s
- LPDDR: low power DRAM, a.k.a., mobile memory
  - Lower voltage, narrower channel, optimized refresh



#### DDR5 & GDDR6

- DDR
  - DDR4: 1-1.2V, 1333MHz, 64b → 21.3GB/s x 4 = 85.2GB/s
  - DDR5: 1.1V, 6.4Gbps, 64b → 51.2GB/s x 4 = 204.8GB/s
- GDDR (peak per-pin rate, with the rate on the listed card in parentheses; bus width; bandwidth; capacity; example card)
  - GDDR5: 8Gb/s (~7), 256b, 224GB/s, 4GB, <u>GTX 980</u>
  - GDDR5X: 12Gb/s (~10), 256b, 320GB/s, 8GB, GTX 1080
  - GDDR6: 16Gb/s (~14), 256b, 448GB/s, 8GB, <u>RTX 2080</u>
  - GDDR6X: 21Gb/s (~19), 320b, 760GB/s, 10GB, <u>RTX 3080</u>







#### Stacked DRAMs

- Stacked DRAMs in the same package as the processor
  - High Bandwidth Memory (HBM)
- HBM consumes less power while still delivering significantly higher bandwidth in a small form factor
  - To keep the TDP low, HBM's per-pin data rate is limited to 1Gb/s, but its 4096-bit memory bus makes up for it





#### HBM

- A typical stack consists of four DRAM dies on a base die, with two 128-bit channels per DRAM die
  - This gives 8 channels in total, i.e., a 1024-bit interface
  - 4 HBM stacks give a width of 4 \* 1024 = 4096b, at 1Gb/s per pin
  - Bandwidth: 4096b \* 1Gb/s = 512GB/s
- Nvidia Tesla P100: HBM2, 4096b, 16GB, 732.2GB/s
- <u>Nvidia Tesla A100</u>: HBM2e, 5120b, 40GB, 1555GB/s
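
The HBM numbers above are bus width times per-pin data rate, divided by 8 to convert bits to bytes. A worked version of the arithmetic follows; the last two lines are per-pin rates implied by the quoted P100 and A100 figures, derived here rather than stated on the slide.

$$
\begin{aligned}
W_{\text{stack}} &= 4 \times 2 \times 128\,\text{b} = 1024\,\text{b} \\
W_{\text{total}} &= 4 \times 1024\,\text{b} = 4096\,\text{b} \\
BW &= \frac{4096\,\text{b} \times 1\,\text{Gb/s}}{8\,\text{b/B}} = 512\,\text{GB/s} \\
r_{\text{P100}} &= \frac{732.2\,\text{GB/s} \times 8\,\text{b/B}}{4096\,\text{b}} \approx 1.43\,\text{Gb/s} \\
r_{\text{A100}} &= \frac{1555\,\text{GB/s} \times 8\,\text{b/B}}{5120\,\text{b}} \approx 2.43\,\text{Gb/s}
\end{aligned}
$$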





#### eDRAM

- eDRAM: embedded DRAM
  - DRAM integrated on the same die with ASIC/logic
- No pin limitations
  - Can be accessed over wide on-chip buses
- System power savings
  - Avoids off-chip I/O transfers



On-chip DRAM vs. external DRAM (figure):

- On-chip DRAM: connects directly with logic; large bandwidth
- External DRAM: connects with the logic chip by base bonding; small bandwidth

Use of eDRAM in various products

| Product name                                                | eDRAM        |
|-------------------------------------------------------------|--------------|
| IBM z15                                                     | 256+ MB      |
| IBM's System Controller (SC) SCM, with L4 cache for the z15 | 960 MB       |
| Intel Haswell, Iris Pro Graphics 5200 (GT3e)                | 128 MB       |
| Intel Broadwell, Iris Pro Graphics 6200 (GT3e)              | 128 MB       |
| Intel Skylake, Iris Graphics 540 and 550 (GT3e)             | 64 MB        |
| Intel Skylake, Iris Pro Graphics 580 (GT4e)                 | 64 or 128 MB |
| Intel Coffee Lake, Iris Plus Graphics 655 (GT3e)            | 128 MB       |
| PlayStation 2                                               | 4 MB         |
| PlayStation Portable                                        | 4 MB         |
| Xbox 360                                                    | 10 MB        |
| WiiU                                                        | 32 MB        |








#### DRAM Scaling

| Production year | Chip size | DRAM type | RAS time (ns) | CAS time (ns) | Total, best case, no precharge (ns) | Total, precharge needed (ns) |
|-----------------|-----------|-----------|---------------|---------------|-------------------------------------|------------------------------|
| 2000            | 256M bit  | DDR1      | 21            | 21               | 42         | 63         |
| 2002            | 512M bit  | DDR1      | 15            | 15               | 30         | 45         |
| 2004            | 1G bit    | DDR2      | 15            | 15               | 30         | 45         |
| 2006            | 2G bit    | DDR2      | 10            | 10               | 20         | 30         |
| 2010            | 4G bit    | DDR3      | 13            | 13               | 26         | 39         |
| 2016            | 8G bit    | DDR4      | 13            | 13               | 26         | 39         |



## Scaling Issues

- DRAM cells become leakier
  - More frequent refreshes
- Slower access
  - Longer sensing and restoring time
- Decreased reliability
  - Cross-talking noise, enlarged process variations







#### DRAM Research

- Sharing/sensing timing reduction
  - Optimize DRAM internal structures [CHARM'ISCA13, TL-DRAM'HPCA13, etc]
  - Utilize existing timing margins [NUAT'HPCA14, AL-DRAM'HPCA15, etc]
- DRAM restore studies
  - Identify the restore scaling issue [Co-arch'MEM14, tWR'Patent15, etc]
  - Reduce restore timings [AL-DRAM'HPCA15, MCR'ISCA15, RT'HPCA16]
- Memory-based approximate computing
  - Skip DRAM refresh [Flikker'ASPLOS11, Alloc'CASES15, etc]
  - Restore [DrMP'PACT17]





## DRAM Research (cont'd)

- Today's DRAM parameters are set for the worst case
- Examples:
  - Refresh: only very few rows need to be refreshed at the worst-case rate
  - Timings: overall timing constraints are determined by the worst case
- Idea: use the common case instead





#### Refresh Issues

- With higher DRAM capacity, more time will be spent on refresh operations, greatly blocking normal reads/writes
- With further scaled DRAMs, more cells need to be refreshed at likely higher rates than today
- Overheads on both performance and energy



#### Emerging Memory

3D XPoint™ memory media: breaks the memory/storage barrier (figure)



|                                 | Hard disk<br>drive (HDD)  | Dynamic<br>RAM (DRAM) | NAND single-<br>level cell<br>(SLC) flash | Phase<br>change RAM<br>(PCRAM) SLC | Spin-torque<br>transfer RAM<br>(STT-RAM) | Resistive<br>RAM<br>(ReRAM) |
|---------------------------------|---------------------------|-----------------------|-------------------------------------------|------------------------------------|------------------------------------------|-----------------------------|
| Data retention                  | Y                         | N                     | Y                                         | Y                                  | Y                                        | Y                           |
| Cell size<br>(F = feature size) | N/A                       | 6 to 10F <sup>2</sup> | 4 to 6F <sup>2</sup>                      | 4 to 12F <sup>2</sup>              | 6 to 50F <sup>2</sup>                    | 4 to 10F <sup>2</sup>       |
| Access granularity<br>(Bytes)   | 512                       | 64                    | 4,192                                     | 64                                 | 64                                       | 64                          |
| Endurance (writes)              | >10 <sup>15</sup>         | >10 <sup>15</sup>     | 10 <sup>4</sup> to 10 <sup>5</sup>        | 10 <sup>8</sup> to 10 <sup>9</sup> | >10 <sup>15</sup>                        | 10 <sup>11</sup>            |
| Read latency                    | 5 ms                      | 50 ns                 | 25 us                                     | 50 ns                              | 10 ns                                    | 10 ns                       |
| Write latency                   | 5 ms                      | 50 ns                 | 500 us                                    | 500 ns                             | 50 ns                                    | 50 ns                       |
| Standby power                   | Disk access<br>mechanisms | Refresh               | N                                         | N                                  | N                                        | N                           |





#### NVM (Non-Volatile Memory)

- Numerous emerging memory candidates
  - Many fall between NAND and DRAM
- Pros and cons
  - Non-volatility at a fraction of DRAM cost/bit
  - Ideal for large memory systems
  - Slower access and limited lifetime







## Future Memory System

- Demands
  - Low latency
  - Large size
  - High bandwidth
  - Low power/energy
- Hybrid memory
  - DRAM + emerging
- Abstracted interface
  - Hide device characteristics
- Changing processor-memory relationship
  - Processor-centric to memory-centric






## NDP/PIM

https://cseweb.ucsd.edu//~swanson/papers/IEEEMicro2014WONDP.pdf

- Near data processing
  - Minimize data movement by computing at the most appropriate location in the hierarchy
  - In NDP, computation can be performed right at the data's home, either in caches, main memory, or persistent storage
- Processing-in-memory
  - Do computation inside the memory





## Memory Dependability

- Memory is susceptible to cosmic rays
- Soft errors: dynamic/transient errors
  - Detected and fixed by error correcting codes (ECC)
- Hard errors: permanent errors
  - Use spare rows to replace defective rows
- Chip-level errors
  - Chipkill: a RAID-like error recovery technique
- Stuck-at errors
  - May use data-dependent sparing
- Endurance problems
- Cross-talk (bit-line & word-line)
- Read/write disturbance
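
As a toy illustration of how an error correcting code fixes a soft error, the sketch below implements a Hamming(7,4) code that corrects any single flipped bit. This is not what DRAM modules use (ECC DIMMs typically protect 64 data bits with 8 check bits); the function names are hypothetical and the code is chosen only because it is small enough to trace by hand.

```c
#include <assert.h>
#include <stdint.h>

/* Bit at 1-based position pos of a 7-bit codeword. */
static unsigned bit_at(uint8_t cw, unsigned pos)
{
    return (cw >> (pos - 1)) & 1u;
}

/* Encode 4 data bits; parity bits sit at positions 1, 2, 4. */
static uint8_t hamming74_encode(uint8_t data)
{
    assert(data < 16);
    unsigned d3 = data & 1u, d5 = (data >> 1) & 1u;
    unsigned d6 = (data >> 2) & 1u, d7 = (data >> 3) & 1u;
    unsigned p1 = d3 ^ d5 ^ d7; /* covers positions 1,3,5,7 */
    unsigned p2 = d3 ^ d6 ^ d7; /* covers positions 2,3,6,7 */
    unsigned p4 = d5 ^ d6 ^ d7; /* covers positions 4,5,6,7 */
    return (uint8_t)(p1 | (p2 << 1) | (d3 << 2) | (p4 << 3) |
                     (d5 << 4) | (d6 << 5) | (d7 << 6));
}

/* Correct up to one flipped bit, then return the 4 data bits. */
static uint8_t hamming74_decode(uint8_t cw)
{
    unsigned s1 = bit_at(cw, 1) ^ bit_at(cw, 3) ^ bit_at(cw, 5) ^ bit_at(cw, 7);
    unsigned s2 = bit_at(cw, 2) ^ bit_at(cw, 3) ^ bit_at(cw, 6) ^ bit_at(cw, 7);
    unsigned s4 = bit_at(cw, 4) ^ bit_at(cw, 5) ^ bit_at(cw, 6) ^ bit_at(cw, 7);
    unsigned syndrome = s1 | (s2 << 1) | (s4 << 2); /* position of the bad bit, 0 if none */
    if (syndrome != 0)
        cw ^= (uint8_t)(1u << (syndrome - 1));
    return (uint8_t)(bit_at(cw, 3) | (bit_at(cw, 5) << 1) |
                     (bit_at(cw, 6) << 2) | (bit_at(cw, 7) << 3));
}
```

The syndrome directly names the position of the flipped bit, which is the property that lets hardware correct a transient error without knowing what the data was.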



Number of memory errors per hour for multi-bit corruptions



https://upcommons.upc.edu/bitstream/handle/2117/96529/Unprotected%20Computing.pdf


## Storage Class Memory (SCM)

- An era of very big, PB-level memory pools
- This large-scale memory pooling is made possible by Compute Express Link (CXL)
- CXL is a standard for linking memory-bus devices together: CPUs, GPUs, and memory (and a few more exotic devices such as TPUs and DPUs)







https://www.computeexpresslink.org/



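## To Further Optimize Cache

- Average memory access time (AMAT) = ( hit-rate \* hit-latency ) + ( miss-rate \* miss-latency )
  - Equivalently, hit-latency + miss-rate \* miss-penalty, where miss-latency = hit-latency + miss-penalty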
- Basic requirements
  - Hit latency
  - Miss rate
  - Miss penalty
- Two more requirements
  - Cache bandwidth
  - Power consumption
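
A small sketch of the AMAT formula, assuming that miss-latency means the total time of an access that misses (hit latency plus miss penalty). Under that assumption the slide's form and the "hit time + miss rate × miss penalty" form give the same number. The function names and the sample values are illustrative.

```c
#include <assert.h>

/* AMAT in the slide's form: weighted average of hit and miss latencies. */
static double amat(double hit_rate, double hit_latency, double miss_latency)
{
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return hit_rate * hit_latency + (1.0 - hit_rate) * miss_latency;
}

/* Equivalent form: every access pays the hit time, misses pay a penalty on top. */
static double amat_penalty_form(double hit_time, double miss_rate,
                                double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}
```

With a 95% hit rate, a 1-cycle hit, and a 100-cycle miss penalty, both forms give about 6 cycles, so even a 5% miss rate dominates the average.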





## Advanced Cache Optimizations

- Reducing the hit time
  - Small and simple first-level caches
  - Way prediction
- Increasing cache bandwidth
  - Pipelined caches
  - Multibanked caches
  - Non-blocking caches
- Reducing miss penalty
  - Critical word first
  - Merging write buffers
- Reducing miss penalty or miss rate via parallelism
  - Hardware prefetching
  - Compiler-controlled prefetching
- Reducing miss rate
  - Compiler optimizations



#### #1: Small & Simple 1<sup>st</sup>-level Cache

- To reduce hit time and power
- The L1 cache size has recently increased either slightly or not at all
  - Limited size: pressure of both a fast clock cycle and power limitations encourages small sizes
  - Lower level of associativity: reduce both hit time and power



## #2: Way Prediction

- To reduce hit time
  - Add extra bits in the cache to predict the way of the next cache access
    - Block predictor bits
  - Multiplexor is set early to select the desired block
    - And in that clock cycle, only a single tag comparison is performed in parallel with reading the cache data
  - A miss results in checking the other blocks for matches in the next clock cycle
- Misprediction gives a longer hit time
  - Prediction accuracy
    - > 90% for two-way
    - > 80% for four-way
    - I-cache has better accuracy than D-cache
  - First used on MIPS R10000 in mid-90s, now used on ARM Cortex-A8
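
The cost of a misprediction can be folded into an average hit time. The sketch below assumes a correctly predicted hit takes the base hit time and a mispredicted one costs a fixed number of extra cycles (one extra cycle matches "checking the other blocks in the next clock cycle"); the function name is hypothetical.

```c
#include <assert.h>

/* Average hit time in cycles with way prediction. */
static double way_pred_hit_cycles(double accuracy, double hit_cycles,
                                  double mispredict_extra_cycles)
{
    assert(accuracy >= 0.0 && accuracy <= 1.0);
    return hit_cycles + (1.0 - accuracy) * mispredict_extra_cycles;
}
```

At 90% accuracy with a 1-cycle hit and a 1-cycle misprediction cost, the average is about 1.1 cycles; at 80% it is about 1.2.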





## #3: Pipelined

- To increase bandwidth
  - Primarily targets L1, where access bandwidth constrains instruction throughput
  - Multibanks are also used in L2/L3, but mainly for power
- Pipelining L1
  - Stages
    - Address calculation
    - Disambiguation (decoder)
    - Cache access (parallel tag and data)
    - Result drive (aligner)
  - Allows a higher clock rate, at the cost of increased latency
  - Examples
    - Pentium: 1 cycle; Pentium Pro through Pentium III: 2 cycles; Pentium 4 through Core i7: 4 cycles





#### #3: Multibanked

- Organize cache as independent banks to support simultaneous access
  - ARM Cortex-A8 supports 1-4 banks for L2
  - Intel i7 supports 4 banks for L1 and 8 banks for L2
- Interleave banks according to block address
  - Banking works best when the accesses naturally spread across banks
- Multiple banks also are a way to reduce power consumption in both caches and DRAM
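
Interleaving by block address can be sketched as a modulo on the block number, so that consecutive blocks land in consecutive banks. The function name and parameters are illustrative; real designs may hash the address differently.

```c
#include <assert.h>
#include <stdint.h>

/* Bank that holds the block containing addr, with sequential interleaving. */
static unsigned bank_of(uint64_t addr, unsigned block_bytes, unsigned n_banks)
{
    assert(block_bytes > 0 && n_banks > 0);
    return (unsigned)((addr / block_bytes) % n_banks);
}
```

With 64-byte blocks and 4 banks, addresses 0, 64, 128, 192 map to banks 0, 1, 2, 3, and address 256 wraps back to bank 0, so a sequential stream spreads evenly while a stride of 256 bytes would hit the same bank every time.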





## #4: Nonblocking Caches

- To increase cache bandwidth
- Allow hits before previous misses complete
  - "Hit under miss"
  - "Hit under multiple miss"
- Nonblocking caches are nontrivial to implement
  - Arbitrating contention between hits and misses; tracking outstanding misses
  - Miss Status Handling Registers (MSHRs)
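
A minimal sketch of how MSHRs might track outstanding misses: a new miss allocates an entry, a second miss to the same block merges into it, and the cache stalls only when every entry is in use. The table size, field names, and function names are assumptions for illustration, not a description of any specific processor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define N_MSHR 4 /* hypothetical number of outstanding misses supported */

struct mshr {
    bool     valid;
    uint64_t block_addr;
    unsigned waiting;    /* requests waiting on this block */
};

static struct mshr mshrs[N_MSHR];

enum miss_result { MISS_NEW, MISS_MERGED, MISS_STALL };

/* Record a miss to block_addr. */
static enum miss_result mshr_on_miss(uint64_t block_addr)
{
    int free_slot = -1;
    for (int i = 0; i < N_MSHR; i++) {
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr) {
            mshrs[i].waiting++;          /* secondary miss: no new memory request */
            return MISS_MERGED;
        }
        if (!mshrs[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return MISS_STALL;               /* all MSHRs busy: cache must block */
    mshrs[free_slot].valid = true;
    mshrs[free_slot].block_addr = block_addr;
    mshrs[free_slot].waiting = 1;
    return MISS_NEW;
}

/* The block arrived from memory: free its entry, return how many requests it satisfies. */
static unsigned mshr_on_fill(uint64_t block_addr)
{
    for (int i = 0; i < N_MSHR; i++) {
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr) {
            mshrs[i].valid = false;
            return mshrs[i].waiting;
        }
    }
    return 0;
}
```

The number of entries bounds how many misses can overlap, which is why "hit under multiple miss" costs more hardware than "hit under miss".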







## #5: Critical Word First & Early Restart

- To reduce miss penalty
- Processor normally needs just one word of the block at a time
  - Don't wait for the full block to be loaded before sending the requested word and restarting the processor
- Critical word first
  - Request the missed word from memory first
  - Send it to the processor as soon as it arrives
- Early restart
  - Request words in normal order
  - Send the missed word to the processor as soon as it arrives
- Effectiveness depends on block size and likelihood of another access to the portion of the block that has not yet been fetched
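
The difference between the two schemes is how many words must arrive before the processor can restart. The sketch below assumes critical word first fetches the rest of the block in wrap-around order, which is a common choice but not stated on the slide; the function names are hypothetical.

```c
#include <assert.h>

/* Arrival order of the words of a block under critical word first,
   assuming wrap-around order after the critical word. */
static void cwf_order(unsigned words_per_block, unsigned critical,
                      unsigned *out)
{
    assert(critical < words_per_block);
    for (unsigned k = 0; k < words_per_block; k++)
        out[k] = (critical + k) % words_per_block;
}

/* Word transfers before the processor restarts under early restart
   (words arrive in normal order 0, 1, 2, ...). */
static unsigned early_restart_wait(unsigned critical)
{
    return critical + 1;
}

/* Under critical word first the requested word is always the first to arrive. */
static unsigned cwf_wait(void)
{
    return 1;
}
```

For an 8-word block with a miss on word 5, critical word first delivers words in the order 5, 6, 7, 0, 1, 2, 3, 4 and restarts after 1 transfer, while early restart waits for 6.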



## #6: Merging Write Buffers

- To reduce miss penalty
- When storing to a block that is already pending in the write buffer, update the existing write buffer entry
- Advantages
  - Multiword writes are usually faster than writing one word at a time
  - Reduces stalls due to a full write buffer
- Does not apply to I/O addresses
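
A minimal sketch of write merging, assuming a four-entry buffer where each entry covers four 8-byte words. The sizes, struct layout, and function names are illustrative assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WB_ENTRIES      4 /* hypothetical buffer depth */
#define WORDS_PER_ENTRY 4 /* each entry covers four 8-byte words */

struct wb_entry {
    bool     valid;
    uint64_t block;                 /* address / 32 */
    uint8_t  word_valid;            /* bit w set: data[w] holds a pending write */
    uint64_t data[WORDS_PER_ENTRY];
};

static struct wb_entry wb[WB_ENTRIES];

/* Buffer a write; returns false if the buffer is full (the processor would stall). */
static bool wb_write(uint64_t addr, uint64_t value)
{
    assert(addr % 8 == 0);
    uint64_t block = addr / (8 * WORDS_PER_ENTRY);
    unsigned word  = (unsigned)((addr / 8) % WORDS_PER_ENTRY);
    struct wb_entry *slot = NULL;

    for (int i = 0; i < WB_ENTRIES; i++) {
        if (wb[i].valid && wb[i].block == block) {
            slot = &wb[i];          /* merge into the pending entry */
            break;
        }
        if (!wb[i].valid && slot == NULL)
            slot = &wb[i];          /* remember the first free entry */
    }
    if (slot == NULL)
        return false;
    if (!slot->valid) {
        slot->valid = true;
        slot->block = block;
        slot->word_valid = 0;
    }
    slot->data[word] = value;
    slot->word_valid |= (uint8_t)(1u << word);
    return true;
}

/* Number of entries currently occupied. */
static int wb_entries_used(void)
{
    int n = 0;
    for (int i = 0; i < WB_ENTRIES; i++)
        n += wb[i].valid ? 1 : 0;
    return n;
}
```

Four sequential word writes to addresses 96, 104, 112, and 120 occupy one entry instead of four, which is what keeps the buffer from filling up and stalling the processor.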







## #7: Compiler Optimizations

- To reduce miss rate, without any hardware changes
- Loop interchange
  - Swap nested loops to access memory in sequential order
  - Improving spatial locality
    - Maximizes use of data in a cache block before they are discarded

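```
/* Before: the inner loop walks down a column, so consecutive
   accesses are 100 elements apart in row-major memory */
for ( j = 0; j < 100; j = j + 1 )
    for ( i = 0; i < 5000; i = i + 1 )
        x[i][j] = 2 * x[i][j];

/* After: the inner loop walks along a row, so consecutive
   accesses are adjacent in memory */
for ( i = 0; i < 5000; i = i + 1 )
    for ( j = 0; j < 100; j = j + 1 )
        x[i][j] = 2 * x[i][j];
```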



## #7: Compiler Optimizations (cont'd)

- Blocking to reduce cache misses
  - Instead of accessing entire rows or columns, subdivide matrices into blocks
  - Exploits a combination of spatial and temporal locality, and can even help register allocation
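
Blocking is usually shown on matrix multiplication. The sketch below compares a straightforward version with a blocked one that works on B-by-B submatrices; the sizes `N` and `B` are small illustrative values, and `B` is assumed to divide `N` so that no boundary checks are needed.

```c
#include <assert.h>

#define N 8 /* matrix dimension (illustrative) */
#define B 4 /* blocking factor, assumed to divide N */

/* Straightforward x = y * z: streams through all of z for every row of y. */
static void matmul_naive(double y[N][N], double z[N][N], double x[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0.0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Blocked x = y * z: reuses a B x B tile of z while it is still in the cache. */
static void matmul_blocked(double y[N][N], double z[N][N], double x[N][N])
{
    assert(N % B == 0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 0.0;

    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r; /* accumulate the partial sum for this tile */
                }
}
```

In practice `B` would be chosen so that the tiles being reused fit in the cache; the partial sum `r` held in a local variable is also where blocking helps register allocation.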





#### #8: Hardware Prefetching

- To reduce miss penalty or miss rate
- Prefetch items before the processor requests them
  - Instruction: fetches two blocks on miss, the requested and the next consecutive
  - Data: prefetch predicted blocks





## #8: Hardware Prefetching (cont'd)

- What to prefetch? (prefetch useful data)
  - Next sequential
  - Stride
  - General pattern
- Where to place?
  - Directly into caches
  - External buffers
- When to prefetch?
  - Prefetched data should arrive in time: neither too late to be useful nor so early that it is evicted before use
- Prefetching relies on extra memory bandwidth
  - Should not interfere much with demand accesses
  - Otherwise it hurts performance
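
The "stride" option can be sketched as a small state machine: once two consecutive accesses show the same non-zero stride, predict the next address. This is a toy single-stream detector with hypothetical names; hardware prefetchers typically keep one such entry per load instruction or per memory region.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct stride_pf {
    bool     seen;       /* at least one access observed */
    uint64_t last_addr;
    int64_t  stride;     /* most recently observed stride */
    unsigned confidence; /* times the current stride has repeated */
};

/* Observe a demand access. Returns true and sets *prefetch_addr
   when the stride has repeated and a prefetch should be issued. */
static bool stride_observe(struct stride_pf *p, uint64_t addr,
                           uint64_t *prefetch_addr)
{
    bool issue = false;
    assert(p != 0 && prefetch_addr != 0);
    if (p->seen) {
        int64_t s = (int64_t)(addr - p->last_addr);
        if (s != 0 && s == p->stride) {
            if (p->confidence < 3)
                p->confidence++;
        } else {
            p->stride = s;       /* new pattern: start over */
            p->confidence = 0;
        }
        if (p->confidence >= 1) {
            *prefetch_addr = addr + (uint64_t)p->stride;
            issue = true;
        }
    }
    p->last_addr = addr;
    p->seen = true;
    return issue;
}
```

For the access stream 0, 64, 128 the third access confirms a stride of 64 and triggers a prefetch of address 192; an unrelated access afterwards resets the confidence, so no prefetch is issued and no bandwidth is wasted on it.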







## #9: Compiler-controlled Prefetching

- To reduce miss penalty or miss rate
- Compiler inserts prefetch instructions to request data before the processor needs it
- Two flavors
  - Register prefetch: loads the value into a register
  - Cache prefetch: loads data into the cache
- Typically *nonfaulting* prefetches
  - Simply turns into no-ops if they would normally result in an exception
- Compilers must take care to gain performance
  - Issuing prefetch instructions incurs an instruction overhead
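
A cache prefetch can also be written by hand. The sketch below uses the GCC/Clang builtin `__builtin_prefetch`, which is a hint that never faults; the prefetch distance of 16 elements is an arbitrary illustrative value that would need tuning, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 16 /* elements ahead; illustrative, needs tuning */

/* Sum an array while prefetching elements a fixed distance ahead. */
static double sum_with_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Second argument 0 = prefetch for read; third argument 3 = keep in all cache levels. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        sum += a[i];
    }
    return sum;
}
```

Each prefetch is an extra instruction in the loop body, which is the instruction overhead mentioned above; for a simple sequential scan like this one a hardware prefetcher would likely do as well, so explicit prefetches tend to pay off on less regular access patterns.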



#### #10: Use HBM

- Use HBM to build massive L4 caches, with sizes from 128MB to 1GB
- Tags of HBM cache
  - 64B block: 1GB L4 requires 94MB of tags
    - Issue: the tags cannot fit in on-chip caches
  - 4KB block: 1GB L4 requires <1MB of tags
    - Issues: inefficient use of huge blocks, and high transfer overhead
- One approach (L-H, MICRO'2011):
  - Each SDRAM row is a block index
  - Each row contains set of tags and 29 data segments



#### Summary

| Technique | Hit time | Bandwidth | Miss penalty | Miss rate | Power consumption | Hardware cost/complexity | Comment |
|-----------|----------|-----------|--------------|-----------|-------------------|--------------------------|---------|
| Small and simple caches | + | | | - | + | 0 | Trivial; widely used |
| Way-predicting caches | + | | | | + | 1 | Used in Pentium 4 |
| Pipelined & banked caches | - | + | | | | 1 | Widely used |
| Nonblocking caches | | + | + | | | 3 | Widely used |
| Critical word first and early restart | | | + | | | 2 | Widely used |
| Merging write buffer | | | + | | | 1 | Widely used with write through |
| Compiler techniques to reduce cache misses | | | | + | | 0 | Software is a challenge, but many compilers handle common linear algebra calculations |
| Hardware prefetching of instructions and data | | | + | + | | 2 instr., 3 data | Most provide prefetch instructions; modern high-end processors also automatically prefetch in hardware |
| Compiler-controlled prefetching | | | + | + | | 3 | Needs nonblocking cache; possible instruction overhead; in many CPUs |
| HBM as additional level of cache | | +/- | - | + | + | 3 | Depends on new packaging technology. Effects depend heavily on hit rate improvements |


