Random Number Generation

Numba provides a random number generation algorithm that can be executed on the GPU. Due to technical issues with how NVIDIA implemented cuRAND, however, Numba’s GPU random number generator is not based on cuRAND. Instead, Numba’s GPU RNG is an implementation of the xoroshiro128+ algorithm. The xoroshiro128+ algorithm has a period of 2**128 - 1, which is shorter than the period of the XORWOW algorithm used by default in cuRAND, but xoroshiro128+ still passes the BigCrush tests of random number generator quality.

When using any RNG on the GPU, it is important to make sure that each thread has its own RNG state, and they have been initialized to produce non-overlapping sequences. The numba.cuda.random module provides a host function to do this, as well as CUDA device functions to obtain uniformly or normally distributed random numbers.

Note

Numba (like cuRAND) uses the Box-Muller transform <https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform> to generate normally distributed random numbers from a uniform generator. However, Box-Muller generates pairs of random numbers, and the current implementation only returns one of them. As a result, generating normally distributed values is half the speed of uniformly distributed values.

numba.cuda.random.create_xoroshiro128p_states(n, seed, subsequence_start=0, stream=0)

Returns a new device array initialized for n random number generators.

This intializes the RNG states so that each state in the array corresponds subsequences in the separated by 2**64 steps from each other in the main sequence. Therefore, as long no CUDA thread requests more than 2**64 random numbers, all of the RNG states produced by this function are guaranteed to be independent.

The subsequence_start parameter can be used to advance the first RNG state by a multiple of 2**64 steps.

Parameters:
  • n (int) – number of RNG states to create
  • seed (uint64) – starting seed for list of generators
  • subsequence_start (uint64) –
  • stream (CUDA stream) – stream to run initialization kernel on
numba.cuda.random.init_xoroshiro128p_state

Use SplitMix64 to generate an xoroshiro128p state from 64-bit seed.

This ensures that manually set small seeds don’t result in a predictable initial sequence from the random number generator.

Parameters:
  • states (1D array, dtype=xoroshiro128p_dtype) – array of RNG states
  • index (uint64) – offset in states to update
  • seed (int64) – seed value to use when initializing state
numba.cuda.random.init_xoroshiro128p_states(states, seed, subsequence_start=0, stream=0)

Initialize RNG states on the GPU for parallel generators.

This intializes the RNG states so that each state in the array corresponds subsequences in the separated by 2**64 steps from each other in the main sequence. Therefore, as long no CUDA thread requests more than 2**64 random numbers, all of the RNG states produced by this function are guaranteed to be independent.

The subsequence_start parameter can be used to advance the first RNG state by a multiple of 2**64 steps.

Parameters:
  • states (1D DeviceNDArray, dtype=xoroshiro128p_dtype) – array of RNG states
  • seed (uint64) – starting seed for list of generators
numba.cuda.random.rotl

Left rotate x by k bits.

numba.cuda.random.uint64_to_unit_float32

Convert uint64 to float64 value in the range [0.0, 1.0)

numba.cuda.random.uint64_to_unit_float64

Convert uint64 to float64 value in the range [0.0, 1.0)

numba.cuda.random.xoroshiro128p_jump

Advance the RNG in states[index] by 2**64 steps.

Parameters:
  • states (1D array, dtype=xoroshiro128p_dtype) – array of RNG states
  • index (int64) – offset in states to update
numba.cuda.random.xoroshiro128p_next

Return the next random uint64 and advance the RNG in states[index].

Parameters:
  • states (1D array, dtype=xoroshiro128p_dtype) – array of RNG states
  • index (int64) – offset in states to update
Return type:

uint64

numba.cuda.random.xoroshiro128p_normal_float32

Return a normally distributed float32 and advance states[index].

The return value is drawn from a Gaussian of mean=0 and sigma=1 using the Box-Muller transform. This advances the RNG sequence by two steps.

Parameters:
  • states (1D array, dtype=xoroshiro128p_dtype) – array of RNG states
  • index (int64) – offset in states to update
Return type:

float32

numba.cuda.random.xoroshiro128p_normal_float64

Return a normally distributed float32 and advance states[index].

The return value is drawn from a Gaussian of mean=0 and sigma=1 using the Box-Muller transform. This advances the RNG sequence by two steps.

Parameters:
  • states (1D array, dtype=xoroshiro128p_dtype) – array of RNG states
  • index (int64) – offset in states to update
Return type:

float64

numba.cuda.random.xoroshiro128p_uniform_float32

Return a float32 in range [0.0, 1.0) and advance states[index].

Parameters:
  • states (1D array, dtype=xoroshiro128p_dtype) – array of RNG states
  • index (int64) – offset in states to update
Return type:

float32

numba.cuda.random.xoroshiro128p_uniform_float64

Return a float64 in range [0.0, 1.0) and advance states[index].

Parameters:
  • states (1D array, dtype=xoroshiro128p_dtype) – array of RNG states
  • index (int64) – offset in states to update
Return type:

float64

Example

Here is a sample program that uses the random number generator:

from __future__ import print_function, absolute_import

from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
import numpy as np

@cuda.jit
def compute_pi(rng_states, iterations, out):
    """Find the maximum value in values and store in result[0]"""
    thread_id = cuda.grid(1)

    # Compute pi by drawing random (x, y) points and finding what
    # fraction lie inside a unit circle
    inside = 0
    for i in range(iterations):
        x = xoroshiro128p_uniform_float32(rng_states, thread_id)
        y = xoroshiro128p_uniform_float32(rng_states, thread_id)
        if x**2 + y**2 <= 1.0:
            inside += 1

    out[thread_id] = 4.0 * inside / iterations

threads_per_block = 64
blocks = 24
rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
out = np.zeros(threads_per_block * blocks, dtype=np.float32)

compute_pi[blocks, threads_per_block](rng_states, 10000, out)
print('pi:', out.mean())