Abstract
For the past several decades, optimizing compilers have been a
primary area of focus in both industry and academia. This continued research interest is a testament to the complexity of this
task, primarily stemming from the vast number of parameters that
must be explored to attain near-optimal results. One of the key
compiler optimizations is "Register Blocking (RB)" also known as
"Register-level Tiling" or "unroll-and-jam". RB can strongly reduce
the number of executed Load/Store (L/S) instructions, and as a
consequence the number of data accesses in the memory hierarchy,
but due to its inherent complexities, fine-tuning is essential for its
effective implementation. To address this problem, this work proposes a
new methodology for RB. The RB factors, the loops
to apply RB, the number of allocated variables/registers per array
reference, and the loops’ ordering are generated by an analytical
model, leveraging the target hardware (HW) architecture details and
loop kernel characteristics. The proposed methodology has been
evaluated on both embedded and general-purpose CPUs across
seven well-known loop kernels, achieving high speedups and L/S
instruction gains over the GCC compiler, hand-optimized codes,
and the popular Pluto tool.
| Original language | English |
|---|---|
| Pages | 71-79 |
| Number of pages | 9 |
| Publication status | Accepted/In press - 15 Feb 2024 |
| Event | 21st ACM International Conference on Computing Frontiers, Ischia, Italy, 7 May 2024 → 9 May 2024 (https://www.computingfrontiers.org/2024/program.html) |
Conference

| Conference | 21st ACM International Conference on Computing Frontiers |
|---|---|
| Abbreviated title | CF '24 |
| Country/Territory | Italy |
| City | Ischia |
| Period | 7/05/24 → 9/05/24 |
Keywords
- Compiler Optimization
- Register Blocking
- Register Tiling
- Unroll-and-Jam
- High Performance Computing
- Data Reuse
- CPUs