

# The CARM Tool

Cache-aware Roofline Model for HPC

José Morgado, Leonel Sousa, Aleksandar Ilic

28th POP Seminar, 5 September 2024





- Modern HPC systems and applications are complex and heterogeneous
  - Hard to model
  - Hard to optimize
- The CARM as a solution
  - Easy to understand
  - Accurate performance overview
  - Good optimization hints
- CARM is supported only by Intel Advisor



Intel Sapphire Rapids CPU










The Cache-Aware Roofline Model





#### Main contributions of the CARM Tool

- Porting CARM to AMD/ARM/RISC-V
- Providing application analysis in the scope of CARM
- Combining all features in a single tool

#### Porting CARM to other architectures

- Requires tailored microbenchmarks
- Understanding of underlying architecture

#### Experimental results

- CARM Architecture Analysis
- CARM Application Analysis







State of the Art

The CARM Tool

Results

Conclusion

## **State of the Art**

State of the art CPUs

The Cache-Aware Roofline Model

State of the art Roofline Tools

State of the art PMU and DBI Tools



## **State of the art - CPUs**







## The Cache-aware Roofline Model

- How does CARM work?
  - Sloped Roof
  - Flat Roof

- How to generate CARM?
  - Floating-Point Microbenchmarks
  - Memory Microbenchmarks

- The CARM Tool
  - Automatic Benchmarking
  - Automatic CARM Generation
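
As a rough illustration of the two roofs listed above, the sketch below takes attainable performance as the minimum of the flat roof (peak FP throughput) and the sloped roof (arithmetic intensity times the bandwidth of one memory level). The function names and every numeric value are hypothetical; they are not measurements or code from the CARM Tool.

```python
def carm_attainable_gflops(ai, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s at arithmetic intensity `ai` (FLOP/byte).

    Sloped roof: ai * bandwidth_gbs (memory bound).
    Flat roof:   peak_gflops (compute bound).
    """
    return min(peak_gflops, ai * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which the sloped roof meets the flat roof."""
    return peak_gflops / bandwidth_gbs


# Hypothetical machine: one flat roof and one sloped roof per memory level.
PEAK_GFLOPS = 100.0
BANDWIDTHS_GBS = {"L1": 400.0, "L2": 200.0, "L3": 100.0, "DRAM": 25.0}

if __name__ == "__main__":
    for level, bw in BANDWIDTHS_GBS.items():
        attainable = carm_attainable_gflops(0.125, PEAK_GFLOPS, bw)
        print(f"{level}: {attainable} GFLOP/s at AI=0.125, "
              f"ridge at AI={ridge_point(PEAK_GFLOPS, bw)}")
```

In CARM the bytes in the arithmetic intensity are counted from the core's point of view, so the same application point can be compared against every memory-level roof at once.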










## **State of the art – Roofline Tools**

| Feature                 | Intel Advisor | AMD uProf | ERT        | CARM Tool               |
|-------------------------|---------------|-----------|------------|-------------------------|
| Supported architectures | Intel         | AMD       | Intel   F  | Intel, AMD, ARM, RISC-V |
| Supported roofline      | CARM          | ORM       | ? CARI     | CARM                    |
| Application analysis    | DBI           | PMUs      | No support | DBI, PMUs               |
| Open-source             | No            | No        | Yes        | Yes                     |





## State of the art – PMU Tools

| Feature      | PAPI            | Perf                    |
|--------------|-----------------|-------------------------|
| Supports     | Intel, AMD, ARM | Intel, AMD, ARM, RISC-V |
| Advantage    | Portability     | Availability            |
| Disadvantage | Only ROI*       | Portability             |





## State of the art – DBI Tools

| Feature      | Intel SDE         | DynamoRIO                   |
|--------------|-------------------|-----------------------------|
| Supports     | Intel, AMD        | Intel, AMD, ARM, RISC-V (?) |
| Advantage    | Detailed Analysis | Customizable Analysis       |
| Disadvantage | Not Customizable  | Overhead                    |
| Open-Source  | No                | Yes                         |








## The CARM Tool

- High-Level Overview
  - Graphical User Interface
  - Automatic Benchmarking
  - Application Analysis

- Low-Level Overview
  - Benchmark Generation
  - Frequency Measuring
  - Benchmarking



# The CARM Tool – High Level

### Interfacing

- Command Line Interface
- Graphical User Interface

### Automatic Benchmarking

- CARM Benchmarks
- Memory / FP / Mixed Benchmarks

### Application Analysis

- PMU Analysis
- DBI Analysis
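
Whichever source provides the counts (PMU or DBI), placing an application on a CARM plot comes down to two ratios. The sketch below assumes three inputs (floating-point operations, bytes moved by loads and stores, and elapsed time). The `AppSample` type and `carm_point` function are hypothetical names used only for illustration; they are not the tool's API.

```python
from dataclasses import dataclass


@dataclass
class AppSample:
    """Counts for one profiled region (hypothetical container)."""
    fp_ops: int        # floating-point operations executed
    bytes_moved: int   # bytes loaded and stored, as seen by the core
    seconds: float     # elapsed wall-clock time


def carm_point(sample):
    """Return (arithmetic intensity in FLOP/byte, performance in GFLOP/s)."""
    ai = sample.fp_ops / sample.bytes_moved
    gflops = sample.fp_ops / sample.seconds / 1e9
    return ai, gflops


if __name__ == "__main__":
    # Made-up counts, not results from the slides.
    sample = AppSample(fp_ops=2_000_000_000, bytes_moved=8_000_000_000, seconds=0.5)
    print(carm_point(sample))
```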





# The CARM Tool – Supported ISAs

| Intel   | AMD     | ARM    | RISC-V         |
|---------|---------|--------|----------------|
| Scalar  | Scalar  | Scalar | Scalar         |
| SSE     | SSE     | Neon   | RVV 0.7.1, 1.0 |
| AVX2    | AVX2    | SVE    |                |
| AVX-512 | AVX-512 |        |                |








## The CARM Tool – Graphical User Interface

#### Results Visualization







## The CARM Tool – Graphical User Interface

#### Benchmark Execution







## The CARM Tool – Graphical User Interface

### Application Profiling







### The CARM Tool – Automatic Benchmarking

```
// AVX512 Microbenchmarks Pseudo-Code
__asm__ __volatile__ (
  "movq %0, %%r8"                        // Outer loop iteration count
  "Loop2_%=:"
  "movq %1, %%rax"                       // Pointer to test data array
  "movq $388, %%rdi"                     // Inner loop iterations
  "Loop1_%=:"
  "vmovapd 0(%%rax), %%zmm0"             // Vector load
  "vmovapd %%zmm1, 64(%%rax)"            // Vector store
  "vfmadd132pd %%zmm2, %%zmm2, %%zmm2"   // Vector FMA
  // ... intermediate unrolled instructions not shown ...
  "vmovapd 16384(%%rax), %%zmm29"
  "vmovapd %%zmm30, 16448(%%rax)"
  "vfmadd132pd %%zmm31, %%zmm31, %%zmm31"
  "addq $16512, %%rax"                   // Pointer bump
  "subq $1, %%rdi"
  "jnz Loop1_%="                         // Inner loop end
  "vmovapd 0(%%rax), %%zmm0"             // Remainder after the inner loop
  "vmovapd %%zmm1, 64(%%rax)"
  "vfmadd132pd %%zmm2, %%zmm2, %%zmm2"
  "subq $1, %%r8"
  "jnz Loop2_%="                         // Outer loop end
  :                                      // No outputs
  : "r"(num_reps_t), "r"(test_var)       // Inputs: %0 and %1
  : "rax", "rdi", "r8", "zmm0-31"        // Clobbers (zmm0-31 is shorthand)
);
```
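
The constants in the pseudo-code above fix how much data and how many FP operations one outer-loop iteration represents. The sketch below works that out, assuming the instructions not shown repeat the same load / store / FMA group at 128-byte strides (the visible offsets 0, 64, 16384 and 16448 are consistent with this, but the slide does not show the middle of the loop). The function name is hypothetical.

```python
VECTOR_BYTES = 64        # one 512-bit zmm register
LANES_FP64 = 8           # doubles per zmm register
FLOPS_PER_FMA_LANE = 2   # one multiply and one add


def microbenchmark_counts(pointer_bump=16512, inner_iterations=388):
    """Bytes and FLOPs per outer-loop iteration of the mixed benchmark.

    Assumes each group is one vector load, one vector store and one FMA.
    """
    group_bytes = 2 * VECTOR_BYTES                     # load + store
    groups_per_inner = pointer_bump // group_bytes     # groups per inner iteration
    groups = groups_per_inner * inner_iterations + 1   # +1 remainder group
    total_bytes = groups * group_bytes
    total_flops = groups * LANES_FP64 * FLOPS_PER_FMA_LANE
    return groups_per_inner, total_bytes, total_flops


if __name__ == "__main__":
    per_inner, nbytes, nflops = microbenchmark_counts()
    print(per_inner, nbytes, nflops, nflops / nbytes)
```

Under these assumptions the loop touches about 6.1 MiB per outer iteration at an arithmetic intensity of 0.125 FLOP/byte.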

```
// RISC-V RVV Vector Length Detection
asm volatile (
    "li      t0, 8192\n\t"
    "vsetvli t0, t0, e64, m1\n\t"
    "sw      t0, %[vl]\n\t"
    : [vl] "=m" (vec_length)   // Output: detected vector length
    :
    : "t0", "t1", "t2"
);
```

Figure (Automatic Benchmarking, Run.py): Bench: microbenchmarks generation; Config: automatic/manual configuration files; Test: microbenchmarks execution.



## The CARM Tool – Application Analysis







## The CARM Tool – Low Level

#### Benchmark Generation

- Following a general structure
- Adapted to each ISA extension

### Frequency Measuring

- Assembly based
- Adapted to each ISA

### Benchmarking

- Timing tests
- Actual benchmarking












### The CARM Tool – Benchmark Generation






## The CARM Tool – Frequency Measuring

#### Frequency Measuring Approach

- Assembly function that leads to 1 IPC
- Adapted to each ISA
- Under 1% error on all tested machines
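
The arithmetic behind this approach is small: if a loop retires a known number of instructions at 1 IPC, the instruction count equals the cycle count, and dividing by the elapsed time gives the core frequency. The sketch below shows that calculation only. The function names and sample numbers are hypothetical, and the timing itself (TSC or `clock_gettime`) is left outside the sketch.

```python
def estimate_frequency_hz(instructions, elapsed_seconds, ipc=1.0):
    """Core frequency implied by a loop that sustains `ipc` instructions/cycle."""
    cycles = instructions / ipc
    return cycles / elapsed_seconds


def relative_error_percent(measured_hz, nominal_hz):
    """Signed deviation of the estimate from a reference frequency."""
    return (measured_hz - nominal_hz) / nominal_hz * 100.0


if __name__ == "__main__":
    # Made-up run: 2.98e9 dependent adds retired in one second.
    measured = estimate_frequency_hz(2_980_000_000, 1.0)
    print(measured, relative_error_percent(measured, 3.0e9))
```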

### Timing methods considered

- Time Stamp Counter (TSC): Intel, AMD
- `clock_gettime` function: ARM, RISC-V

#### AARCH64

    clktestarm:
    clktest_loop:
        add x29, x29, x8
        //....//
        add x29, x29, x8
        sub x0, x0, x9
        cbnz x0, clktest_loop
        ret

### **Frequency Measuring**

- CoreClockCheckerRISCV.s
- CoreClockCheckerARM.s
- CoreClockCheckerx86.s



# The CARM Tool – Benchmarking

#### Timing Tests

- Preliminary execution of benchmarks
- Ensures benchmarks last enough time
- Avoids unnecessarily long benchmarking
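
One common way to implement such a preliminary timing test is to grow the repetition count until a run lasts at least a target duration. The slides do not say how the CARM Tool picks its counts, so the doubling strategy, the `calibrate_repetitions` name and the `measure` callback below are all assumptions made for illustration.

```python
def calibrate_repetitions(measure, min_seconds, start=1, max_reps=1 << 40):
    """Smallest power-of-two multiple of `start` whose run lasts `min_seconds`.

    `measure(reps)` must run the benchmark `reps` times and return seconds.
    Doubling keeps the calibration short while avoiding overly long runs.
    """
    reps = start
    while reps < max_reps:
        if measure(reps) >= min_seconds:
            return reps
        reps *= 2
    return max_reps


if __name__ == "__main__":
    # Stand-in for a real benchmark: pretend each repetition takes 1 microsecond.
    fake_measure = lambda reps: reps * 1e-6
    print(calibrate_repetitions(fake_measure, min_seconds=0.1))
```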

### Benchmarking

- Tests are repeated 1024 times
- Thread barriers ensure parallel execution
- Best run per thread is selected
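
The three points above can be sketched as a control-flow skeleton: every thread waits on a barrier before each repetition, times its own run, and keeps its fastest time. This is only an illustration of the pattern with hypothetical names. The real benchmarks are native code, and Python threads do not execute compute kernels truly in parallel.

```python
import threading
import time


def run_benchmark(kernel, num_threads, repetitions=1024):
    """Run `kernel` `repetitions` times per thread; return best time per thread."""
    barrier = threading.Barrier(num_threads)
    best = [float("inf")] * num_threads

    def worker(tid):
        for _ in range(repetitions):
            barrier.wait()                      # all threads start together
            start = time.perf_counter()
            kernel()
            elapsed = time.perf_counter() - start
            if elapsed < best[tid]:             # keep the fastest repetition
                best[tid] = elapsed

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return best


if __name__ == "__main__":
    print(run_benchmark(lambda: sum(range(1000)), num_threads=2, repetitions=16))
```

Taking the best run rather than the mean filters out repetitions disturbed by the operating system, which fits the goal of reaching architectural limits.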









### Results

- CARM-based Architecture Analysis
- Comparison with state-of-the-art Roofline Tools
  - Intel Advisor
  - Empirical Roofline Toolkit
- CARM-based SpMV Application Analysis



# **Results - CARM Architecture Analysis**

#### Machines Utilized

- One from each vendor
- Covering all ISA extensions supported

### Analysis Objective

Verify if CARM benchmarks can reach architectural limits

| Machine          | Venus              | Cara      | Armq   | Milk-V       |
|------------------|--------------------|-----------|--------|--------------|
| Vendor           | Intel              | AMD       | ARM    | RISC-V       |
| Architecture     | Skylake-X          | Zen 3     | Vulcan | XuanTie C920 |
| ISA Extensions   | SSE, AVX2, AVX-512 | SSE, AVX2 | Neon   | RVV 0.7.1    |
| FP / LD/ST Units | 2 / 3              | 2 / 3     | 2 / 2  | 2 / 1        |



#### Best Load/Store Ratio

- Follows Ld/St unit ratio
- Loads outperform stores

### Memory Architectural Limits

- Accurately achieved for L1
- Progressively harder to reach for lower memory levels






| Deviation / IPC* | Venus   | Cara    | Armq     | Milk-V  |
|------------------|---------|---------|----------|---------|
| L1               | -1.54%  | -0.093% | -0.01%   | -0.01%  |
| L2               | +18.57% | -0.01%  | 0.9 IPC  | 0.4 IPC |
| L3               | -17%    | -16.7%  | 0.75 IPC | 0.1 IPC |
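
The percentage entries in the table are deviations of a measured value from a theoretical limit. A minimal sketch of that calculation follows, assuming the L1 limit is the number of memory accesses per cycle times the access width times the core frequency. The helper names and the example numbers are hypothetical and do not reproduce any machine in the table.

```python
def peak_l1_bandwidth_gbs(accesses_per_cycle, bytes_per_access, freq_ghz):
    """Theoretical L1 bandwidth in GB/s (assumed model: ports * width * clock)."""
    return accesses_per_cycle * bytes_per_access * freq_ghz


def deviation_percent(measured, theoretical):
    """Signed deviation of a measurement from the theoretical limit."""
    return (measured - theoretical) / theoretical * 100.0


if __name__ == "__main__":
    # Made-up machine: 3 accesses per cycle, 64-byte vectors, 2.0 GHz.
    limit = peak_l1_bandwidth_gbs(3, 64, 2.0)
    print(limit, deviation_percent(378.0, limit))
```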



# Results - CARM FP Arithmetic Analysis

#### Architectural Limits

- Accurately achieved
- Intel slows down with wider ISA extensions

#### Mixed Benchmark Validation

- FMA leads to more deviation
- Wider ISA extensions lead to more deviation
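
The FP architectural limit is usually written as FP units times vector lanes times operations per instruction times core frequency. The sketch below uses that formula to show why a lower clock under wide vectors caps the gain from widening. It assumes the slowdown noted above appears as a reduced core frequency, which the slides do not state explicitly, and all frequencies are made-up values.

```python
def peak_gflops(fp_units, vector_bits, freq_ghz, fma=True, element_bits=64):
    """Theoretical peak FP throughput in GFLOP/s for one core."""
    lanes = vector_bits // element_bits
    flops_per_instruction = lanes * (2 if fma else 1)
    return fp_units * flops_per_instruction * freq_ghz


if __name__ == "__main__":
    scalar = peak_gflops(fp_units=2, vector_bits=64, freq_ghz=3.0)
    widest = peak_gflops(fp_units=2, vector_bits=512, freq_ghz=2.5)
    # 8x wider vectors, but less than 8x the throughput at the lower clock.
    print(scalar, widest, widest / scalar)
```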






| Deviation  | Intel  | AMD    | ARM    | RISC-V |
|------------|--------|--------|--------|--------|
| Scalar     | -0.78% | -0.15% | +0.54% | +0.66% |
| Widest ISA | -5.9%  | -0.12% | +0.85% | +0.63% |




| Deviation            | Venus   | Cara    | Armq    | Milk-V  |
|----------------------|---------|---------|---------|---------|
| Mixed Scalar Add     | -3.44%  | -1.11%  | -17.72% | -4.55%  |
| Mixed Scalar FMA     | -5.34%  | -10.27% | -17.25% | -8.29%  |
| Mixed Widest ISA Add | -3.44%  | -0.16%  | -25.93% | -10.45% |
| Mixed Widest ISA FMA | -24.83% | -9.14%  | -24.88% | -10.57% |



## Results – CARM Comparison

#### Comparison with ERT

- Higher L1 bandwidth and GFLOPS achieved
- Discrepancies at lower memory levels due to imprecise cache size values

#### Comparison with Intel Advisor

- Similar L1 bandwidth and GFLOPS achieved
- Discrepancies at lower memory levels due to Ld/St ratio variations










# **Results – Application Analysis**

- Cross-Architecture SpMV Analysis
  - Using the Eigen library
  - SpMV performance comparison

- SpMV Performance
  - RCM Re-Ordering improves performance
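
To see where an SpMV kernel lands on a CARM plot, it helps to count FP operations and memory bytes for the CSR format. The sketch below is a plain-Python CSR SpMV that tallies both; it is not the Eigen implementation used in the study. The byte sizes (8-byte values, 4-byte indices) and the counting rules are assumptions chosen for illustration.

```python
VALUE_BYTES = 8   # double-precision values (assumed)
INDEX_BYTES = 4   # 32-bit indices (assumed)


def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a CSR matrix, counting FLOPs and bytes moved."""
    y = [0.0] * (len(row_ptr) - 1)
    flops = 0
    bytes_moved = 0
    for row in range(len(y)):
        acc = 0.0
        bytes_moved += 2 * INDEX_BYTES            # row_ptr[row], row_ptr[row + 1]
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
            flops += 2                            # multiply + add
            # loads: matrix value, column index, x element
            bytes_moved += VALUE_BYTES + INDEX_BYTES + VALUE_BYTES
        y[row] = acc
        bytes_moved += VALUE_BYTES                # store y[row]
    return y, flops, bytes_moved


if __name__ == "__main__":
    # 3x3 example matrix: [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
    y, flops, nbytes = spmv_csr([0, 2, 3, 5], [0, 2, 1, 0, 2],
                                [1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0])
    print(y, flops, nbytes, flops / nbytes)
```

Reordering such as RCM leaves these counts unchanged; it changes which memory level serves the `x` accesses, which is why it can move the measured point upward on the plot.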





Figure panels: DBI - Original Matrix; DBI - RCM Matrix; PMU - Original Matrix; PMU - RCM Matrix






## **Conclusion**

- Current collaborations
  - CERN, SYCLOPS, POP3, BSC

- Completed collaborations
  - SparCity SuperTwin, Live-CARM
- CARM Tool / CHAMP Hub GitHub and Paper



## **Conclusion – Current Collaborations**

- CERN – Adaptive Perf Tool
- BSC – Paraver Tool
- SYCLOPS – SYCL DB
- POP3 – Application Profiling





# **Conclusion – Completed Collaborations**

#### SparCity – SuperTwin

- CARM Tool source code is shipped with SuperTwin
- SuperTwin interfaces with the CARM Tool to get CARM results

#### Live-CARM

- Based on live performance counter data from SuperTwin
- Application and benchmark analysis was conducted










## CARM Tool – Available on GitHub

- The CARM Tool is open source and available on GitHub
  - https://github.com/champ-hub/carm-roofline
- The CARM Tool paper has been accepted for publication at IISWC 2024
- The first of many tools to be developed in the scope of the CHAMP hub



Heterogeneous Computing and Performance Modeling Hub



# Thank You

Any questions?
