Rahul Shah and James Demmel
EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2025-58
May 14, 2025
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-58.pdf
Modern high-performance computing (HPC) increasingly relies on performance-portable software frameworks to efficiently exploit heterogeneous architectures such as multi-core CPUs and GPUs. Meanwhile, randomized numerical linear algebra (RandNLA) offers theoretically grounded, scalable algorithms for accelerating linear algebra computations, but implementing these techniques in a performance-portable and reproducible manner on diverse hardware remains challenging. This thesis addresses these challenges by integrating a full Kokkos back end into the RandBLAS library, enabling thread-scalable sparse-dense matrix multiplication (SpMM) on both CPUs and GPUs. We further replace Kokkos's default pseudorandom number generator with a counter-based Philox engine from Random123, eliminating a prior CUDA-specific shim; this approach yields deterministic, vectorized random streams and reduced register pressure, translating to speedups of up to 2x on an NVIDIA A100 GPU and 1.3x on an Intel Xeon CPU. A two-stage "sketch-and-solve" pipeline (sparse embedding via SpMM followed by dense factorization) leverages these advances to accelerate low-rank approximation computations with minimal accuracy loss, advancing toward the goal of a fully GPU-accelerated randomized LAPACK. All code will be upstreamed to the BallisticLA organization, providing a reproducible, RNG-agnostic foundation for large-scale Monte Carlo studies and future RandNLA development.
Advisor: James Demmel
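The abstract attributes the deterministic random streams to a counter-based generator. The idea can be illustrated with a minimal sketch, under loud assumptions: the mixing function below is the SplitMix64 finalizer used as a stand-in, not Philox, and the names `mix64`, `counter_uniform`, `sketch_entry`, `fill_rowwise`, and `fill_in_blocks` are hypothetical and not part of RandBLAS, Kokkos, or Random123. The point it shows is that each entry of a random sign sketch depends only on a (key, counter) pair, so the matrix is identical no matter in which order, or by which worker, the entries are generated.

```python
MASK64 = (1 << 64) - 1


def mix64(x: int) -> int:
    """SplitMix64 finalizer: a stateless 64-bit mixing function.

    Stand-in for a real counter-based engine such as Philox (assumption).
    """
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def counter_uniform(key: int, counter: int) -> float:
    """Return a value in [0, 1) that is a pure function of (key, counter)."""
    bits = mix64(mix64(key) ^ (counter & MASK64))
    # Keep the top 53 bits so the result is an exactly representable double.
    return (bits >> 11) / float(1 << 53)


def sketch_entry(key: int, i: int, j: int, n_cols: int) -> float:
    """Random sign for entry (i, j); the counter is the flat index."""
    return 1.0 if counter_uniform(key, i * n_cols + j) < 0.5 else -1.0


def fill_rowwise(key: int, d: int, n: int) -> list:
    """Generate a d-by-n sign matrix in plain row-major order."""
    return [[sketch_entry(key, i, j, n) for j in range(n)] for i in range(d)]


def fill_in_blocks(key: int, d: int, n: int, block: int) -> list:
    """Generate the same matrix in a scrambled, block-wise order,
    mimicking an arbitrary parallel schedule."""
    S = [[0.0] * n for _ in range(d)]
    for j0 in reversed(range(0, n, block)):
        for i in reversed(range(d)):
            for j in range(j0, min(j0 + block, n)):
                S[i][j] = sketch_entry(key, i, j, n)
    return S
```

Calling `fill_rowwise(7, 4, 10)` and `fill_in_blocks(7, 4, 10, 3)` produces the same matrix, because no generator state is carried from one entry to the next. A stateful generator would give results that depend on the traversal order.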
BibTeX citation:
@mastersthesis{Shah:EECS-2025-58,
    Author = {Shah, Rahul and Demmel, James},
    Editor = {Buluç, Aydin},
    Title = {Kokkos GPU Implementation of CPU-Based BLAS/LAPACK Operations and RandBLAS Randomization},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-58.html},
    Number = {UCB/EECS-2025-58},
    Abstract = {Modern high-performance computing (HPC) increasingly relies on performance-portable software frameworks to efficiently exploit heterogeneous architectures such as multi-core CPUs and GPUs. Meanwhile, randomized numerical linear algebra (RandNLA) offers theoretically grounded, scalable algorithms for accelerating linear algebra computations, but implementing these techniques in a performance-portable and reproducible manner on diverse hardware remains challenging. This thesis addresses these challenges by integrating a full Kokkos back end into the RandBLAS library, enabling thread-scalable sparse-dense matrix multiplication (SpMM) on both CPUs and GPUs. We further replace Kokkos's default pseudorandom number generator with a counter-based Philox engine from Random123, eliminating a prior CUDA-specific shim; this approach yields deterministic, vectorized random streams and reduced register pressure, translating to speedups of up to 2x on an NVIDIA A100 GPU and 1.3x on an Intel Xeon CPU. A two-stage "sketch-and-solve" pipeline (sparse embedding via SpMM followed by dense factorization) leverages these advances to accelerate low-rank approximation computations with minimal accuracy loss, advancing toward the goal of a fully GPU-accelerated randomized LAPACK. All code will be upstreamed to the BallisticLA organization, providing a reproducible, RNG-agnostic foundation for large-scale Monte Carlo studies and future RandNLA development}
}
EndNote citation:
%0 Thesis
%A Shah, Rahul
%A Demmel, James
%E Buluç, Aydin
%T Kokkos GPU Implementation of CPU-Based BLAS/LAPACK Operations and RandBLAS Randomization
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 14
%@ UCB/EECS-2025-58
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-58.html
%F Shah:EECS-2025-58