The One Billion Row Challenge in CUDA: from 17m to 17s
On my journey to learn CUDA, I decided to tackle the One Billion Row Challenge with it.
The challenge is simple, but implementing it in CUDA was not. Here I share my solution, which runs in 16.8 seconds
on a V100. It’s certainly not the fastest solution, but it is the first
of its kind (no cuDF, hand-written kernels only). I challenge other CUDA
enthusiasts to make it faster.
Baseline in pure C++
You can’t improve what you don’t measure. Since I’m going to be writing C++ anyway for CUDA,...
Read more at tspeterkim.github.io