Kokkos GPU Implementation of CPU-Based BLAS/LAPACK Operations and RandBLAS Randomization

Rahul Shah and James Demmel

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2025-58
May 14, 2025

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-58.pdf

Modern high-performance computing (HPC) increasingly relies on performance-portable software frameworks to efficiently exploit heterogeneous architectures such as multi-core CPUs and GPUs. Meanwhile, randomized numerical linear algebra (RandNLA) offers theoretically grounded, scalable algorithms for accelerating linear algebra computations, but implementing these techniques in a performance-portable and reproducible manner on diverse hardware remains challenging. This thesis addresses these challenges by integrating a full Kokkos back end into the RandBLAS library, enabling thread-scalable sparse-dense matrix multiplication (SpMM) on both CPUs and GPUs. We further replace Kokkos's default pseudorandom number generator with a counter-based Philox engine from Random123, eliminating a prior CUDA-specific shim; this approach yields deterministic, vectorized random streams and reduced register pressure, translating to speedups of up to 2x on an NVIDIA A100 GPU and 1.3x on an Intel Xeon CPU. A two-stage "sketch-and-solve" pipeline (sparse embedding via SpMM followed by dense factorization) leverages these advances to accelerate low-rank approximation computations with minimal accuracy loss, advancing toward the goal of a fully GPU-accelerated randomized LAPACK. All code will be upstreamed to the BallisticLA organization, providing a reproducible, RNG-agnostic foundation for large-scale Monte Carlo studies and future RandNLA development.
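
To illustrate why a counter-based generator gives deterministic streams regardless of how work is scheduled, the following is a minimal pure-Python sketch in the style of Philox4x32-10. It is not the Random123, Kokkos, or RandBLAS API: the names `philox4x32`, `entry`, and `fill` are hypothetical, the constants are the ones commonly quoted for Philox4x32 and are not validated against reference test vectors here, and the mapping from matrix indices to counters is an assumption made for the example.

```python
import random

# Constants commonly quoted for Philox4x32 (assumption: not checked against
# Random123's known-answer tests in this sketch).
M0, M1 = 0xD2511F53, 0xCD9E8D57   # round multipliers
W0, W1 = 0x9E3779B9, 0xBB67AE85   # key-schedule increments
MASK = 0xFFFFFFFF


def philox4x32(counter, key, rounds=10):
    """Map a 4-word counter and a 2-word key to four 32-bit words.

    The function keeps no state between calls: the output depends only on
    its arguments, which is the defining property of a counter-based RNG.
    """
    c0, c1, c2, c3 = (w & MASK for w in counter)
    k0, k1 = (w & MASK for w in key)
    for _ in range(rounds):
        p0 = M0 * c0                      # 64-bit product, split into hi/lo
        p1 = M1 * c2
        hi0, lo0 = p0 >> 32, p0 & MASK
        hi1, lo1 = p1 >> 32, p1 & MASK
        c0, c1, c2, c3 = hi1 ^ c1 ^ k0, lo1, hi0 ^ c3 ^ k1, lo0
        k0 = (k0 + W0) & MASK             # bump the key for the next round
        k1 = (k1 + W1) & MASK
    return (c0, c1, c2, c3)


def entry(seed, i, j):
    """Hypothetical helper: entry (i, j) of a random matrix, in [0, 1)."""
    return philox4x32((i, j, 0, 0), (seed, 0))[0] / 2**32


def fill(seed, rows, cols, order):
    """Fill a rows x cols matrix, visiting the entries in the given order."""
    out = [[0.0] * cols for _ in range(rows)]
    for i, j in order:
        out[i][j] = entry(seed, i, j)
    return out


rows, cols, seed = 4, 6, 2025
index = [(i, j) for i in range(rows) for j in range(cols)]
forward = fill(seed, rows, cols, index)

shuffled = index[:]
random.Random(0).shuffle(shuffled)        # stand-in for an arbitrary thread schedule
scrambled = fill(seed, rows, cols, shuffled)

print(forward == scrambled)               # prints True: visiting order is irrelevant
```

Because each entry is a pure function of the seed and its indices, any thread can generate any entry without coordinating with the others, which is one plausible reading of how a counter-based engine supports reproducible sketching operators across CPU and GPU back ends.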

Advisor: James Demmel


BibTeX citation:

@mastersthesis{Shah:EECS-2025-58,
    Author = {Shah, Rahul and Demmel, James},
    Editor = {Buluç, Aydin},
    Title = {Kokkos GPU Implementation of CPU-Based BLAS/LAPACK Operations and RandBLAS Randomization},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-58.html},
    Number = {UCB/EECS-2025-58},
    Abstract = {Modern high-performance computing (HPC) increasingly relies on performance-portable software frameworks to efficiently exploit heterogeneous architectures such as multi-core CPUs and GPUs. Meanwhile, randomized numerical linear algebra (RandNLA) offers theoretically grounded, scalable algorithms for accelerating linear algebra computations, but implementing these techniques in a performance-portable and reproducible manner on diverse hardware remains challenging. This thesis addresses these challenges by integrating a full Kokkos back end into the RandBLAS library, enabling thread-scalable sparse-dense matrix multiplication (SpMM) on both CPUs and GPUs. We further replace Kokkos's default pseudorandom number generator with a counter-based Philox engine from Random123, eliminating a prior CUDA-specific shim; this approach yields deterministic, vectorized random streams and reduced register pressure, translating to speedups of up to 2x on an NVIDIA A100 GPU and 1.3x on an Intel Xeon CPU. A two-stage "sketch-and-solve" pipeline (sparse embedding via SpMM followed by dense factorization) leverages these advances to accelerate low-rank approximation computations with minimal accuracy loss, advancing toward the goal of a fully GPU-accelerated randomized LAPACK. All code will be upstreamed to the BallisticLA organization, providing a reproducible, RNG-agnostic foundation for large-scale Monte Carlo studies and future RandNLA development.}
}

EndNote citation:

%0 Thesis
%A Shah, Rahul
%A Demmel, James
%E Buluç, Aydin
%T Kokkos GPU Implementation of CPU-Based BLAS/LAPACK Operations and RandBLAS Randomization
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 14
%@ UCB/EECS-2025-58
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-58.html
%F Shah:EECS-2025-58