Kokkos GPU Implementation of CPU-Based BLAS/LAPACK Operations and RandBLAS Randomization

Rahul Shah and James Demmel

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2025-58

May 14, 2025

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-58.pdf

Modern high-performance computing (HPC) increasingly relies on performance-portable software frameworks to efficiently exploit heterogeneous architectures such as multi-core CPUs and GPUs. Meanwhile, randomized numerical linear algebra (RandNLA) offers theoretically grounded, scalable algorithms for accelerating linear algebra computations, but implementing these techniques in a performance-portable and reproducible manner on diverse hardware remains challenging. This thesis addresses these challenges by integrating a full Kokkos back end into the RandBLAS library, enabling thread-scalable sparse-dense matrix multiplication (SpMM) on both CPUs and GPUs. We further replace Kokkos's default pseudorandom number generator with a counter-based Philox engine from Random123, eliminating a prior CUDA-specific shim; this approach yields deterministic, vectorized random streams and reduced register pressure, translating to speedups of up to 2x on an NVIDIA A100 GPU and 1.3x on an Intel Xeon CPU. A two-stage "sketch-and-solve" pipeline (sparse embedding via SpMM followed by dense factorization) leverages these advances to accelerate low-rank approximation computations with minimal accuracy loss, advancing toward the goal of a fully GPU-accelerated randomized LAPACK. All code will be upstreamed to the BallisticLA organization, providing a reproducible, RNG-agnostic foundation for large-scale Monte Carlo studies and future RandNLA development.
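To illustrate why a counter-based generator gives deterministic streams regardless of how work is scheduled, the following is a minimal Python sketch, not code from the thesis, RandBLAS, Kokkos, or Random123. It follows the published Philox4x32-10 round structure as best recalled here and has not been checked against Random123's known-answer vectors, so treat it as an assumption-laden illustration. The names `philox4x32`, `entry`, and `fill` are hypothetical.

```python
MASK32 = 0xFFFFFFFF
# Multiplier and key-schedule constants published for Philox4x32.
M0, M1 = 0xD2511F53, 0xCD9E8D57
W0, W1 = 0x9E3779B9, 0xBB67AE85


def philox4x32(counter, key, rounds=10):
    """Map a 4-word counter and 2-word key (all 32-bit) to 4 output words.

    The function keeps no state: the same (counter, key) always yields
    the same output, independent of any previous calls.
    """
    c0, c1, c2, c3 = counter
    k0, k1 = key
    for _ in range(rounds):
        p0 = M0 * c0  # 64-bit product, split into high and low words
        p1 = M1 * c2
        hi0, lo0 = p0 >> 32, p0 & MASK32
        hi1, lo1 = p1 >> 32, p1 & MASK32
        c0, c1, c2, c3 = hi1 ^ c1 ^ k0, lo1, hi0 ^ c3 ^ k1, lo0
        # Advance the round key for the next round.
        k0 = (k0 + W0) & MASK32
        k1 = (k1 + W1) & MASK32
    return (c0, c1, c2, c3)


def entry(seed, row, col):
    """Uniform value in [0, 1) tied to matrix position (row, col)."""
    word = philox4x32((row, col, 0, 0), (seed, 0))[0]
    return word / 2**32


def fill(seed, rows, cols, order):
    """Fill a rows x cols matrix, visiting positions in the given order."""
    out = [[None] * cols for _ in range(rows)]
    for r, c in order:
        out[r][c] = entry(seed, r, c)
    return out


if __name__ == "__main__":
    forward = [(r, c) for r in range(3) for c in range(4)]
    a = fill(7, 3, 4, forward)
    b = fill(7, 3, 4, list(reversed(forward)))
    print(a == b)  # same matrix whatever the visiting order
```

Because each entry depends only on its seed and index, threads on a CPU or GPU can generate disjoint pieces of a sketching operator in any order and still reproduce the same matrix.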

Advisor: James Demmel

BibTeX citation:

@mastersthesis{Shah:EECS-2025-58,
    Author= {Shah, Rahul and Demmel, James},
    Editor= {Buluç, Aydin},
    Title= {Kokkos GPU Implementation of CPU-Based BLAS/LAPACK Operations and RandBLAS Randomization},
    School= {EECS Department, University of California, Berkeley},
    Year= {2025},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-58.html},
    Number= {UCB/EECS-2025-58},
    Abstract= {Modern high-performance computing (HPC) increasingly relies on performance-portable software frameworks to efficiently exploit heterogeneous architectures such as multi-core CPUs and GPUs. Meanwhile, randomized numerical linear algebra (RandNLA) offers theoretically grounded, scalable algorithms for accelerating linear algebra computations, but implementing these techniques in a performance-portable and reproducible manner on diverse hardware remains challenging. This thesis addresses these challenges by integrating a full Kokkos back end into the RandBLAS library, enabling thread-scalable sparse-dense matrix multiplication (SpMM) on both CPUs and GPUs. We further replace Kokkos's default pseudorandom number generator with a counter-based Philox engine from Random123, eliminating a prior CUDA-specific shim; this approach yields deterministic, vectorized random streams and reduced register pressure, translating to speedups of up to 2x on an NVIDIA A100 GPU and 1.3x on an Intel Xeon CPU. A two-stage "sketch-and-solve" pipeline (sparse embedding via SpMM followed by dense factorization) leverages these advances to accelerate low-rank approximation computations with minimal accuracy loss, advancing toward the goal of a fully GPU-accelerated randomized LAPACK. All code will be upstreamed to the BallisticLA organization, providing a reproducible, RNG-agnostic foundation for large-scale Monte Carlo studies and future RandNLA development.},
}

EndNote citation:

%0 Thesis
%A Shah, Rahul 
%A Demmel, James 
%E Buluç, Aydin 
%T Kokkos GPU Implementation of CPU-Based BLAS/LAPACK Operations and RandBLAS Randomization
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 14
%@ UCB/EECS-2025-58
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-58.html
%F Shah:EECS-2025-58