cuRAND :: CUDA 工具包文档

3. Device API 概述

要使用 device API，请在定义使用 cuRAND device 函数的内核的文件中包含文件 curand_kernel.h。device API 包括用于伪随机数生成和准随机数生成的函数。

3.1. 伪随机序列

伪随机序列的函数支持位生成和从分布中生成。

3.1.1. 使用 XORWOW 和 MRG32k3a 生成器进行位生成

__device__ unsigned int
curand(curandStateXORWOW_t *state)

__device__ unsigned int
curand(curandStateMRG32k3a_t *state)

在调用 curand_init() 之后，curand() 返回一个周期大于 $2^{190}$ 的伪随机数序列。如果每次调用 curand() 时都使用相同的初始状态，并且在调用 curand() 之间状态未被修改，则始终生成相同的序列。

__device__ void
curand_init(unsigned long long seed,
            unsigned long long sequence,
            unsigned long long offset,
            curandStateXORWOW_t *state)

__device__ void
curand_init(unsigned long long seed,
            unsigned long long sequence,
            unsigned long long offset,
            curandStateMRG32k3a_t  *state)

curand_init() 函数使用给定的种子、序列号和序列内的偏移量来设置调用者分配的初始状态。不同的种子保证产生不同的起始状态和不同的序列。相同的种子始终产生相同的状态和相同的序列。设置的状态将是在从种子状态调用 $2^{67}$ ⋅ sequence + offset 次 curand() 之后的状态。

使用不同种子生成的序列通常不具有统计相关的值，但某些种子选择可能会给出统计相关的序列。使用相同种子和不同序列号生成的序列将不具有统计相关的值。

为了获得最高质量的并行伪随机数生成，每个实验都应分配一个唯一的种子。在一个实验中，计算的每个线程都应分配一个唯一的序列号。如果一个实验跨越多个内核启动，建议内核启动之间的线程使用相同的种子，并且序列号以单调递增的方式分配。如果启动了相同的线程配置，则可以在内核启动之间将随机状态保存在全局内存中，以避免状态设置时间。

3.1.2. 使用 MTGP32 生成器进行位生成

MTGP32 生成器是对广岛大学开发的代码的改编（参见 [1]）。在该算法中，为多个序列生成样本，每个序列都基于一组计算出的参数。cuRAND 使用已为周期为 2¹¹²¹⁴ 的 32 位生成器预先生成的 200 个参数集。可以生成其他参数集，如 [1] 中所述，并使用这些参数集代替。每个参数集（序列）都有一个状态结构，该算法允许每个 200 个序列最多 256 个并发线程（在单个块内）的线程安全生成和状态更新。

请注意，两个不同的块不能安全地在同一状态下运行。另请注意，在一个块内，最多 256 个线程可以在给定状态下运行。

对于 MTGP32 生成器，提供了两个 host 函数来帮助设置设备内存中不同序列的参数，并设置初始状态。

__host__ curandStatust curandMakeMTGP32Constants(mtgp32paramsfastt params[],
                                                 mtgp32kernelparamst *p)

此函数将参数集数据从预生成格式 (mtgp32_params_fast_t) 重新组织为内核函数使用的格式 (mtgp32_kernel_params_t)，并将它们复制到设备内存。

__host__ curandStatus_t
curandMakeMTGP32KernelState(curandStateMtgp32_t *s,
                            mtgp32_params_fast_t params[],
                            mtgp32_kernel_params_t *k,
                            int n,
                            unsigned long long seed)

此函数基于指定的参数集和种子初始化 n 个状态，并将它们复制到 s 指示的设备内存。请注意，如果您使用的是预生成状态，则 n 的最大值为 200。

cuRAND MTGP32 生成器提供了两个内核函数来生成随机位。

__device__ unsigned int
curand(curandStateMtgp32_t *state)

此函数计算线程索引，并为该索引生成结果并更新状态。线程索引 t 计算为

t = (blockDim.x * blockDim.y * threadIdx.z) + (blockDim.x * threadIdx.y) + threadIdx.x

可以从单个内核启动重复调用此函数，但有以下约束

只能从具有 256 个或更少线程的块安全地调用它。
给定状态不能被多个块使用。
给定块可以使用多个状态生成随机数。
在代码的给定点，块中的所有线程或没有线程都必须调用此函数。

__device__ unsigned int
curandmtgp32specific(curandStateMtgp32_t *state, unsigned char index,
                     unsigned char n)

此函数为线程特定的 index 指定的位置生成结果并更新状态，并将状态中的偏移量前进 n 个位置。curand_mtgp32_specific 可以在内核启动中多次调用，但有以下约束

最多 256 个线程可以为给定状态调用此函数。
在一个块内，对于给定状态，如果 n 个线程正在调用该函数，则索引必须从 0...n-1 运行。索引不必与线程号匹配，并且可以根据调用程序的要求在线程之间分配。在代码的给定点，必须使用来自 0...n-1 的所有索引或不使用任何索引。
给定状态不能被多个块使用。
给定块可以使用多个状态生成随机数。

图 1. MTGP32 块和线程操作

Illustration of MTGP32 progressing through its state array

图 1 说明了 MTGP32 中的块和线程如何在生成器状态上运行。每一行代表一个 32 位整数 s(n) 的循环状态数组。在数组上运行的线程被标识为 T(m)。所示的特定情况与 host API 的内部实现相匹配，该 API 启动 64 个块，每个块包含 256 个线程。每个块在不同的序列上运行，由一组唯一的参数 P(n) 确定。MTGP32 序列的一个完整状态由 351 个 32 位整数定义。每个线程 T(m) 在其中一个整数 s(n+m) 上运行，将其与 s(n+m+1) 和拾取元素 s(n+m+p) 组合，其中 p <= 95。它将新状态存储在状态数组中的位置 s(n+m+351)。线程同步后，基索引 n 会前进更新状态的线程数。为了避免被覆盖，数组本身的长度必须至少为 256 + 351 个整数。实际上，为了索引效率，它的大小为 1024 个整数。

对可以在给定状态数组上运行的块中线程数量的限制源于需要确保状态 s(n+351) 在作为拾取状态需要之前已更新。如果存在线程 T(256)，则它可以使用 s(n+256+95)，即 s(n+351)，在线程零更新 s(n+351) 之前。如果应用程序要求一个块中超过 256 个线程调用 MTGP32 生成器函数，则它必须使用多个 MTGP32 状态，可以通过使用多个参数集或使用具有不同种子的多个生成器来实现。另请注意，生成器函数在每次调用结束时同步线程，因此对于块中的 256 个线程调用生成器效率最高。

3.1.3. 使用 Philox_4x32_10 生成器进行位生成

__device__ unsigned int
curand(curandStatePhilox4_32_10_t *state)

在调用 curand_init() 之后，curand() 返回一个周期为 $2^{128}$ 的伪随机数序列。如果每次调用 curand() 时都使用相同的初始状态，并且在调用 curand() 之间状态未被修改，则始终生成相同的序列。

__device__ void
curand_init(unsigned long long seed,
            unsigned long long subsequence,
            unsigned long long offset,
            curandStatePhilox4_32_10_t *state)

的伪随机数序列。curand_init() 函数使用给定的种子、子序列和偏移量来设置调用者分配的初始状态。不同的种子保证产生不同的起始状态和不同的序列。子序列和偏移量共同定义了周期为 $2^{128}$ 的序列中的偏移量。偏移量定义了长度为 $2^{64}$ 的子序列中的偏移量。当子序列中的最后一个元素生成后，下一个随机数是来自连续子序列的第一个元素。相同的种子始终产生相同的状态和相同的序列。

使用不同种子生成的序列通常不具有统计相关的值，但某些种子选择可能会给出统计相关的序列。

为了获得最高质量的并行伪随机数生成，每个实验都应分配一个唯一的种子值。在一个实验中，计算的每个线程都应分配一个唯一的 ID 号。如果一个实验跨越多个内核启动，建议内核启动之间的线程使用相同的种子，并且 ID 号以单调递增的方式分配。如果启动了相同的线程配置，则可以在内核启动之间将随机状态保存在全局内存中，以避免状态设置时间。

3.1.4. 分布

__device__ float
curand_uniform (curandState_t *state)

此函数返回一个介于 0.0 和 1.0 之间的均匀分布的伪随机浮点数序列。它可能返回从 0.0 到 1.0 的值，其中包含 1.0 但排除 0.0。分布函数可以使用来自基本生成器的任意数量的无符号整数值。消耗的值的数量不能保证是固定的。

__device__ float
curand_normal (curandState_t *state)

此函数返回一个均值为 0.0，标准差为 1.0 的单精度正态分布浮点数。此结果可以缩放和平移，以生成具有任何均值和标准差的正态分布值。

__device__ float
curand_log_normal (curandState_t *state, float mean, float stddev)

此函数返回一个基于具有给定均值和标准差的正态分布的单精度对数正态分布浮点数。

__device__ unsigned int
curand_poisson (curandState_t *state, double lambda)

此函数返回一个基于具有给定 lambda 的泊松分布的单精度泊松分布无符号整数。用于从均匀分布结果中导出泊松结果的算法因 lambda 的值和生成器的类型而异。某些算法为单个输出绘制多个样本。

__device__ double
curand_uniform_double (curandState_t *state)

__device__ double
curand_normal_double (curandState_t *state)

__device__ double
curand_log_normal_double (curandState_t *state, double mean, double stddev)

以上三个函数是 curand_uniform()、curand_normal() 和 curand_log_normal() 的双精度版本。

对于伪随机生成器，双精度函数使用多次调用 curand() 来生成 53 个随机位。

__device__ float2
curand_normal2 (curandState_t *state)

__device__ float2
curand_log_normal2 (curandState_t *state)

__device__ double2
curand_normal2_double (curandState_t *state)

__device__ double2
curand_log_normal2_double (curandState_t *state)

以上函数每次调用生成两个正态分布或对数正态分布的伪随机结果。由于底层实现使用 Box-Muller 变换，因此通常比每次调用生成单个结果更有效。

__device__ uint4
curand4 (curandStatePhilox4_32_10_t *state)

__device__ float4
curand_uniform4 (curandStatePhilox4_32_10_t *state)

__device__ float4
curand_normal4 (curandStatePhilox4_32_10_t *state)

__device__ float4
curand_log_normal4 (curandStatePhilox4_32_10_t *state, float mean, float stddev)

__device__ uint4
curand_poisson4 (curandStatePhilox4_32_10_t *state, double lambda)

__device__ uint4
curand_discrete4 (curandStatePhilox4_32_10_t *state, curandDiscreteDistribution_t discrete_distribution)

__device__ double2
curand_uniform2_double (curandStatePhilox4_32_10_t *state)

__device__ double2
curand_normal2_double (curandStatePhilox4_32_10_t *state)

__device__ double2
curand_log_normal2_double (curandStatePhilox4_32_10_t *state, double mean, double stddev)

以上函数每次调用生成四个单精度或两个双精度结果。由于底层实现使用 Philox 生成器，因此通常比每次调用生成单个结果更有效。

3.2. 准随机序列

虽然默认生成器类型是来自 XORWOW 的伪随机数，但可以使用以下函数生成基于 Sobol 32 位整数的 Sobol 序列

__device__ void
curand_init (
    unsigned int *direction_vectors, 
    unsigned int offset,
    curandStateSobol32_t *state)

__device__ void
curand_init (
    unsigned int *direction_vectors,
    unsigned int scramble_c, 
    unsigned int offset,
    curandStateScrambledSobol32_t *state)

__device__ unsigned int
curand (curandStateSobol32_t *state)

__device__ float
curand_uniform (curandStateSobol32_t *state)

__device__ float 
curand_normal (curandStateSobol32_t *state)

__device__ float 
curand_log_normal (
    curandStateSobol32_t *state,  
    float mean, 
    float stddev)

__device__ unsigned int 
curand_poisson (curandStateSobol32_t *state, double lambda)

__device__ double
curand_uniform_double (curandStateSobol32_t *state)

__device__ double
curand_normal_double (curandStateSobol32_t *state)

__device__ double
curand_log_normal_double (
    curandStateSobol32_t *state, 
    double mean, 
    double stddev)

curand_init() 函数初始化准随机数生成器状态。没有种子参数，只有方向向量和偏移量。对于扰码 Sobol 生成器，还有一个额外的参数 scramble_c，它是扰码序列的初始值。对于 curandStateSobol32_t 类型和 curandStateScrambledSobol32_t 类型，方向向量是一个 32 个无符号整数值的数组。对于 curandStateSobol64_t 类型和 curandStateScrambledSobol64_t 类型，方向向量是一个 64 个无符号长长整型值的数组。扰码序列的偏移量和初始常量对于 32 位 Sobol 生成器是无符号整型类型。对于 64 位 Soblol 生成器，这些参数是无符号长长整型类型。对于 curandStateSobol32_t 类型和 curandStateScrambledSobol32_t 类型，序列正好是 $2^{32}$ 个元素长，其中每个元素为 32 位。对于 curandStateSobol64_t 类型和 curandStateScrambledSobol64_t 类型，序列正好是 $2^{64}$ 个元素长，其中每个元素为 64 位。每次调用 curand() 都返回下一个准随机元素。调用 curand_uniform() 返回 0.0 到 1.0 之间的准随机浮点数或双精度浮点数，其中包含 1.0 但排除 0.0。类似地，调用 curand_normal() 返回均值为 0.0，标准差为 1.0 的正态分布浮点数或双精度浮点数。调用 curand_log_normal() 返回从具有指定均值和标准差的正态分布导出的对数正态分布浮点数或双精度浮点数。所有生成函数都可以使用任何类型的 Sobol 生成器调用。

例如，生成填充单位立方体的准随机坐标需要跟踪三个准随机生成器。所有三个都将从 offset = 0 开始，并且维度分别为 0、1 和 2。对每个生成器状态的 curand_uniform() 的单次调用将生成 $x$ , $y$ 和 $z$ 坐标。方向向量表可以通过 curandGetDirectionVectors32() 和 curandGetDirectionVectors64() 函数在 host 上访问。所需的方向向量应在使用前复制到设备内存中。

用于准随机生成的正态分布函数使用逆累积密度函数来保留准随机序列的维度。因此，与伪随机生成器一样，没有一次生成多个结果的函数。

双精度 Sobol32 函数返回双精度结果，该结果使用来自底层生成器的 32 位内部精度。

双精度 Sobol64 函数返回双精度结果，该结果使用来自底层生成器的 53 位内部精度。这些位取自 64 位样本的高阶 53 位。

3.3. 跳跃前进

有几个函数可以从生成器状态跳过。

__device__ void
skipahead(unsigned long long n, curandState_t *state)

__device__ void
skipahead(unsigned int n, curandStateSobol32_t *state)

使用此函数等效于调用 curand() $n$ 次而不使用返回值，但速度要快得多。

__device__ void
skipahead_sequence(unsigned long long n, curandState_t *state)

此函数等效于调用 curand() $n \cdot 2^{67}$ 次而不使用返回值，并且速度要快得多。

3.4. 离散分布的 Device API

离散分布（例如泊松分布）需要额外的 API，这些 API 在 HOST 端执行预处理以生成特定分布的直方图。对于泊松分布，此直方图对于不同的 lambda 值是不同的。在具有至少 48KB L1 缓存的 GPU 上，这些分布将表现出最佳性能。

curandStatus_t 
curandCreatePoissonDistribution(
    double lambda, 
    curandDiscreteDistribution_t *discrete_distribution)

curandCreatePoissonDistribution() 函数用于为给定 lambda 的泊松分布创建直方图。

__device__ unsigned int 
curand_discrete (
    curandState_t *state,
    curandDiscreteDistribution_t discrete_distribution)

此函数返回一个基于给定离散分布直方图的单精度离散分布无符号整数。

curandStatus_t 
curandDestroyDistribution(
    curandDiscreteDistribution_t discrete_distribution)

curandDestroyDistribution() 函数用于清理与直方图相关的结构。

3.5. 性能注意事项

调用 curand_init() 比调用 curand() 或 curand_uniform() 慢。对于 curand_init() 的大偏移量比小偏移量花费更多时间。保存和恢复随机生成器状态比重复重新计算起始状态要快得多。

如下所示，生成器状态可以存储在内核启动之间的全局内存中，在本地内存中用于快速生成，然后存储回全局内存中。

__global__ void example(curandState *global_state)
{
    curandState local_state;
    local_state = global_state[threadIdx.x];
    for(int i = 0; i < 10000; i++) {
        unsigned int x = curand(&local_state);
        ...
    }
    global_state[threadIdx.x] = local_state;
}

随机生成器状态的初始化通常比随机数生成需要更多的寄存器和本地内存。将 curand_init() 和 curand() 的调用分离到单独的内核中可能有利于获得最佳性能。

状态设置可能是一项昂贵的操作。加速设置的一种方法是对每个线程使用不同的种子，并使用恒定的序列号 0。当需要创建许多生成器时，这尤其有用。虽然设置速度更快，但此方法对生成序列的数学特性提供的保证较少。如果恰好存在从种子初始化生成器状态的哈希函数与生成器的周期性之间的不良交互，则对于某些种子值，可能会出现具有高度相关输出的线程。我们不知道任何问题值；如果它们确实存在，则可能很少见。

3.6. Device API 示例

此示例使用 cuRAND device API 来生成使用 XORWOW 或 MRG32k3a 生成器的伪随机数。对于整数，它计算设置了低位的比例。对于均匀分布的实数，它计算大于 0.5 的比例。对于正态分布的实数，它计算均值一个标准差范围内的比例。

/*
 * This program uses the device CURAND API to calculate what
 * proportion of pseudo-random ints have low bit set.
 * It then generates uniform results to calculate how many
 * are greater than .5.
 * It then generates  normal results to calculate how many
 * are within one standard deviation of the mean.
 */
#include <stdio.h>
#include <stdlib.h>

#include <cuda_runtime.h>
#include <curand_kernel.h>

#define CUDA_CALL(x) do { if((x) != cudaSuccess) { \
    printf("Error at %s:%d\n",__FILE__,__LINE__); \
    return EXIT_FAILURE;}} while(0)

__global__ void setup_kernel(curandState *state)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    /* Each thread gets same seed, a different sequence
       number, no offset */
    curand_init(1234, id, 0, &state[id]);
}

__global__ void setup_kernel(curandStatePhilox4_32_10_t *state)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    /* Each thread gets same seed, a different sequence
       number, no offset */
    curand_init(1234, id, 0, &state[id]);
}

__global__ void setup_kernel(curandStateMRG32k3a *state)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    /* Each thread gets same seed, a different sequence
       number, no offset */
    curand_init(0, id, 0, &state[id]);
}

__global__ void generate_kernel(curandState *state,
                                int n,
                                unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    int count = 0;
    unsigned int x;
    /* Copy state to local memory for efficiency */
    curandState localState = state[id];
    /* Generate pseudo-random unsigned ints */
    for(int i = 0; i < n; i++) {
        x = curand(&localState);
        /* Check if low bit set */
        if(x & 1) {
            count++;
        }
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store results */
    result[id] += count;
}

__global__ void generate_kernel(curandStatePhilox4_32_10_t *state,
                                int n,
                                unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    int count = 0;
    unsigned int x;
    /* Copy state to local memory for efficiency */
    curandStatePhilox4_32_10_t localState = state[id];
    /* Generate pseudo-random unsigned ints */
    for(int i = 0; i < n; i++) {
        x = curand(&localState);
        /* Check if low bit set */
        if(x & 1) {
            count++;
        }
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store results */
    result[id] += count;
}

__global__ void generate_uniform_kernel(curandState *state,
                                int n,
                                unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int count = 0;
    float x;
    /* Copy state to local memory for efficiency */
    curandState localState = state[id];
    /* Generate pseudo-random uniforms */
    for(int i = 0; i < n; i++) {
        x = curand_uniform(&localState);
        /* Check if > .5 */
        if(x > .5) {
            count++;
        }
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store results */
    result[id] += count;
}

__global__ void generate_uniform_kernel(curandStatePhilox4_32_10_t *state,
                                int n,
                                unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int count = 0;
    float x;
    /* Copy state to local memory for efficiency */
    curandStatePhilox4_32_10_t localState = state[id];
    /* Generate pseudo-random uniforms */
    for(int i = 0; i < n; i++) {
        x = curand_uniform(&localState);
        /* Check if > .5 */
        if(x > .5) {
            count++;
        }
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store results */
    result[id] += count;
}

__global__ void generate_normal_kernel(curandState *state,
                                int n,
                                unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int count = 0;
    float2 x;
    /* Copy state to local memory for efficiency */
    curandState localState = state[id];
    /* Generate pseudo-random normals */
    for(int i = 0; i < n/2; i++) {
        x = curand_normal2(&localState);
        /* Check if within one standard deviaton */
        if((x.x > -1.0) && (x.x < 1.0)) {
            count++;
        }
        if((x.y > -1.0) && (x.y < 1.0)) {
            count++;
        }
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store results */
    result[id] += count;
}

__global__ void generate_normal_kernel(curandStatePhilox4_32_10_t *state,
                                int n,
                                unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int count = 0;
    float2 x;
    /* Copy state to local memory for efficiency */
    curandStatePhilox4_32_10_t localState = state[id];
    /* Generate pseudo-random normals */
    for(int i = 0; i < n/2; i++) {
        x = curand_normal2(&localState);
        /* Check if within one standard deviaton */
        if((x.x > -1.0) && (x.x < 1.0)) {
            count++;
        }
        if((x.y > -1.0) && (x.y < 1.0)) {
            count++;
        }
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store results */
    result[id] += count;
}

__global__ void generate_kernel(curandStateMRG32k3a *state,
                                int n,
                                unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int count = 0;
    unsigned int x;
    /* Copy state to local memory for efficiency */
    curandStateMRG32k3a localState = state[id];
    /* Generate pseudo-random unsigned ints */
    for(int i = 0; i < n; i++) {
        x = curand(&localState);
        /* Check if low bit set */
        if(x & 1) {
            count++;
        }
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store results */
    result[id] += count;
}

__global__ void generate_uniform_kernel(curandStateMRG32k3a *state,
                                int n,
                                unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int count = 0;
    double x;
    /* Copy state to local memory for efficiency */
    curandStateMRG32k3a localState = state[id];
    /* Generate pseudo-random uniforms */
    for(int i = 0; i < n; i++) {
        x = curand_uniform_double(&localState);
        /* Check if > .5 */
        if(x > .5) {
            count++;
        }
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store results */
    result[id] += count;
}

__global__ void generate_normal_kernel(curandStateMRG32k3a *state,
                                int n,
                                unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int count = 0;
    double2 x;
    /* Copy state to local memory for efficiency */
    curandStateMRG32k3a localState = state[id];
    /* Generate pseudo-random normals */
    for(int i = 0; i < n/2; i++) {
        x = curand_normal2_double(&localState);
        /* Check if within one standard deviaton */
        if((x.x > -1.0) && (x.x < 1.0)) {
            count++;
        }
        if((x.y > -1.0) && (x.y < 1.0)) {
            count++;
        }
    }
    /* Copy state back to global memory */
    state[id] = localState;
    /* Store results */
    result[id] += count;
}

int main(int argc, char *argv[])
{
    const unsigned int threadsPerBlock = 64;
    const unsigned int blockCount = 64;
    const unsigned int totalThreads = threadsPerBlock * blockCount;

    unsigned int i;
    unsigned int total;
    curandState *devStates;
    curandStateMRG32k3a *devMRGStates;
    curandStatePhilox4_32_10_t *devPHILOXStates;
    unsigned int *devResults, *hostResults;
    bool useMRG = 0;
    bool usePHILOX = 0;
    int sampleCount = 10000;
    bool doubleSupported = 0;
    int device;
    struct cudaDeviceProp properties;

    /* check for double precision support */
    CUDA_CALL(cudaGetDevice(&device));
    CUDA_CALL(cudaGetDeviceProperties(&properties,device));
    if ( properties.major >= 2 || (properties.major == 1 && properties.minor >= 3) ) {
        doubleSupported = 1;
    }

    /* Check for MRG32k3a option (default is XORWOW) */
    if (argc >= 2)  {
        if (strcmp(argv[1],"-m") == 0) {
            useMRG = 1;
            if (!doubleSupported){
                printf("MRG32k3a requires double precision\n");
                printf("^^^^ test WAIVED due to lack of double precision\n");
                return EXIT_SUCCESS;
            }
        }else if (strcmp(argv[1],"-p") == 0) {
		usePHILOX = 1;
	}
        /* Allow over-ride of sample count */
        sscanf(argv[argc-1],"%d",&sampleCount);
    }

    /* Allocate space for results on host */
    hostResults = (unsigned int *)calloc(totalThreads, sizeof(int));

    /* Allocate space for results on device */
    CUDA_CALL(cudaMalloc((void **)&devResults, totalThreads *
              sizeof(unsigned int)));

    /* Set results to 0 */
    CUDA_CALL(cudaMemset(devResults, 0, totalThreads *
              sizeof(unsigned int)));

    /* Allocate space for prng states on device */
    if (useMRG) {
        CUDA_CALL(cudaMalloc((void **)&devMRGStates, totalThreads *
                  sizeof(curandStateMRG32k3a)));
    }else if(usePHILOX) {
        CUDA_CALL(cudaMalloc((void **)&devPHILOXStates, totalThreads *
                  sizeof(curandStatePhilox4_32_10_t)));
    }else {
        CUDA_CALL(cudaMalloc((void **)&devStates, totalThreads *
                  sizeof(curandState)));
    }

    /* Setup prng states */
    if (useMRG) {
        setup_kernel<<<64, 64>>>(devMRGStates);
    }else if(usePHILOX)
    {
        setup_kernel<<<64, 64>>>(devPHILOXStates);
    }else {
        setup_kernel<<<64, 64>>>(devStates);
    }

    /* Generate and use pseudo-random  */
    for(i = 0; i < 50; i++) {
        if (useMRG) {
            generate_kernel<<<64, 64>>>(devMRGStates, sampleCount, devResults);
        }else if (usePHILOX){
            generate_kernel<<<64, 64>>>(devPHILOXStates, sampleCount, devResults);
	}else {
            generate_kernel<<<64, 64>>>(devStates, sampleCount, devResults);
        }
    }

    /* Copy device memory to host */
    CUDA_CALL(cudaMemcpy(hostResults, devResults, totalThreads *
        sizeof(unsigned int), cudaMemcpyDeviceToHost));

    /* Show result */
    total = 0;
    for(i = 0; i < totalThreads; i++) {
        total += hostResults[i];
    }
    printf("Fraction with low bit set was %10.13f\n",
        (float)total / (totalThreads * sampleCount * 50.0f));

    /* Set results to 0 */
    CUDA_CALL(cudaMemset(devResults, 0, totalThreads *
              sizeof(unsigned int)));

    /* Generate and use uniform pseudo-random  */
    for(i = 0; i < 50; i++) {
        if (useMRG) {
            generate_uniform_kernel<<<64, 64>>>(devMRGStates, sampleCount, devResults);
        }else if(usePHILOX) {
            generate_uniform_kernel<<<64, 64>>>(devPHILOXStates, sampleCount, devResults);
	}else {
            generate_uniform_kernel<<<64, 64>>>(devStates, sampleCount, devResults);
        }
    }

    /* Copy device memory to host */
    CUDA_CALL(cudaMemcpy(hostResults, devResults, totalThreads *
        sizeof(unsigned int), cudaMemcpyDeviceToHost));

    /* Show result */
    total = 0;
    for(i = 0; i < totalThreads; i++) {
        total += hostResults[i];
    }
    printf("Fraction of uniforms > 0.5 was %10.13f\n",
        (float)total / (totalThreads * sampleCount * 50.0f));
    /* Set results to 0 */
    CUDA_CALL(cudaMemset(devResults, 0, totalThreads *
              sizeof(unsigned int)));

    /* Generate and use normal pseudo-random  */
    for(i = 0; i < 50; i++) {
        if (useMRG) {
            generate_normal_kernel<<<64, 64>>>(devMRGStates, sampleCount, devResults);
        }else if(usePHILOX) {
            generate_normal_kernel<<<64, 64>>>(devPHILOXStates, sampleCount, devResults);
	}else {
            generate_normal_kernel<<<64, 64>>>(devStates, sampleCount, devResults);
        }
    }

    /* Copy device memory to host */
    CUDA_CALL(cudaMemcpy(hostResults, devResults, totalThreads *
        sizeof(unsigned int), cudaMemcpyDeviceToHost));

    /* Show result */
    total = 0;
    for(i = 0; i < totalThreads; i++) {
        total += hostResults[i];
    }
    printf("Fraction of normals within 1 standard deviation was %10.13f\n",
        (float)total / (totalThreads * sampleCount * 50.0f));

    /* Cleanup */
    if (useMRG) {
        CUDA_CALL(cudaFree(devMRGStates));
    }else if(usePHILOX)
    {
        CUDA_CALL(cudaFree(devPHILOXStates));
    }else {
        CUDA_CALL(cudaFree(devStates));
    }
    CUDA_CALL(cudaFree(devResults));
    free(hostResults);
    printf("^^^^ kernel_example PASSED\n");
    return EXIT_SUCCESS;
}

以下示例使用 cuRAND host MTGP 设置 API 和 cuRAND device API 来生成使用 MTGP32 生成器的整数，并计算设置了低位的比例。

/*
 * This program uses the device CURAND API to calculate what
 * proportion of pseudo-random ints have low bit set.
 */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <curand_kernel.h>
/* include MTGP host helper functions */
#include <curand_mtgp32_host.h>
/* include MTGP pre-computed parameter sets */
#include <curand_mtgp32dc_p_11213.h>


#define CUDA_CALL(x) do { if((x) != cudaSuccess) { \
    printf("Error at %s:%d\n",__FILE__,__LINE__); \
    return EXIT_FAILURE;}} while(0)

#define CURAND_CALL(x) do { if((x) != CURAND_STATUS_SUCCESS) { \
    printf("Error at %s:%d\n",__FILE__,__LINE__); \
    return EXIT_FAILURE;}} while(0)

__global__ void generate_kernel(curandStateMtgp32 *state,
                                int n,
                                int *result)
{
    int id = threadIdx.x + blockIdx.x * 256;
    int count = 0;
    unsigned int x;
    /* Generate pseudo-random unsigned ints */
    for(int i = 0; i < n; i++) {
        x = curand(&state[blockIdx.x]);
        /* Check if low bit set */
        if(x & 1) {
            count++;
        }
    }
    /* Store results */
    result[id] += count;
}

int main(int argc, char *argv[])
{
    int i;
    long long total;
    curandStateMtgp32 *devMTGPStates;
    mtgp32_kernel_params *devKernelParams;
    int *devResults, *hostResults;
    int sampleCount = 10000;

    /* Allow over-ride of sample count */
    if (argc == 2) {
        sscanf(argv[1],"%d",&sampleCount);
    }

    /* Allocate space for results on host */
    hostResults = (int *)calloc(64 * 256, sizeof(int));

    /* Allocate space for results on device */
    CUDA_CALL(cudaMalloc((void **)&devResults, 64 * 256 *
              sizeof(int)));

    /* Set results to 0 */
    CUDA_CALL(cudaMemset(devResults, 0, 64 * 256 *
              sizeof(int)));

    /* Allocate space for prng states on device */
    CUDA_CALL(cudaMalloc((void **)&devMTGPStates, 64 *
              sizeof(curandStateMtgp32)));

    /* Setup MTGP prng states */

    /* Allocate space for MTGP kernel parameters */
    CUDA_CALL(cudaMalloc((void**)&devKernelParams, sizeof(mtgp32_kernel_params)));

    /* Reformat from predefined parameter sets to kernel format, */
    /* and copy kernel parameters to device memory               */
    CURAND_CALL(curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, devKernelParams));

    /* Initialize one state per thread block */
    CURAND_CALL(curandMakeMTGP32KernelState(devMTGPStates,
                mtgp32dc_params_fast_11213, devKernelParams, 64, 1234));

    /* State setup is complete */

    /* Generate and use pseudo-random  */
    for(i = 0; i < 10; i++) {
        generate_kernel<<<64, 256>>>(devMTGPStates, sampleCount, devResults);
    }

    /* Copy device memory to host */
    CUDA_CALL(cudaMemcpy(hostResults, devResults, 64 * 256 *
        sizeof(int), cudaMemcpyDeviceToHost));

    /* Show result */
    total = 0;
    for(i = 0; i < 64 * 256; i++) {
        total += hostResults[i];
    }


    printf("Fraction with low bit set was %10.13g\n",
        (double)total / (64.0f * 256.0f * sampleCount * 10.0f));

    /* Cleanup */
    CUDA_CALL(cudaFree(devKernelParams));
    CUDA_CALL(cudaFree(devMTGPStates));
    CUDA_CALL(cudaFree(devResults));
    free(hostResults);
    printf("^^^^ kernel_mtgp_example PASSED\n");
    return EXIT_SUCCESS;
}

以下示例使用 cuRAND device API 来生成具有 64 位扰码 Sobol 生成器的均匀双精度数。它使用结果来推导球体体积的近似值。

/*
 * This program uses the device CURAND API to calculate what
 * proportion of quasi-random 3D points fall within a sphere
 * of radius 1, and to derive the volume of the sphere.
 *
 * In particular it uses 64 bit scrambled Sobol direction
 * vectors returned by curandGetDirectionVectors64, to
 * generate double precision uniform samples.
 *
 */

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <curand.h>

#define THREADS_PER_BLOCK 64
#define BLOCK_COUNT 64
#define TOTAL_THREADS (THREADS_PER_BLOCK * BLOCK_COUNT)

/* Number of 64-bit vectors per dimension */
#define VECTOR_SIZE 64


#define CUDA_CALL(x) do { if((x) != cudaSuccess) { \
    printf("Error at %s:%d\n",__FILE__,__LINE__); \
    return EXIT_FAILURE;}} while(0)

#define CURAND_CALL(x) do { if((x) != CURAND_STATUS_SUCCESS) { \
    printf("Error at %s:%d\n",__FILE__,__LINE__); \
    return EXIT_FAILURE;}} while(0)

/* This kernel initializes state per thread for each of x, y, and z */

__global__ void setup_kernel(unsigned long long * sobolDirectionVectors,
                             unsigned long long *sobolScrambleConstants,
                             curandStateScrambledSobol64 *state)
{
    int id = threadIdx.x + blockIdx.x * THREADS_PER_BLOCK;
    int dim = 3*id;
    /* Each thread uses 3 different dimensions */
    curand_init(sobolDirectionVectors + VECTOR_SIZE*dim,
                sobolScrambleConstants[dim],
                1234,
                &state[dim]);

    curand_init(sobolDirectionVectors + VECTOR_SIZE*(dim + 1),
                sobolScrambleConstants[dim + 1],
                1234,
                &state[dim + 1]);

    curand_init(sobolDirectionVectors + VECTOR_SIZE*(dim + 2),
                sobolScrambleConstants[dim + 2],
                1234,
                &state[dim + 2]);
}

/* This kernel generates random 3D points and increments a counter if
 * a point is within a unit sphere
 */
__global__ void generate_kernel(curandStateScrambledSobol64 *state,
                                int n,
                                long long int *result)
{
    int id = threadIdx.x + blockIdx.x * THREADS_PER_BLOCK;
    int baseDim = 3 * id;
    long long int count = 0;
    double x, y, z;

    /* Generate quasi-random double precision coordinates */
    for(int i = 0; i < n; i++) {
        x = curand_uniform_double(&state[baseDim]);
        y = curand_uniform_double(&state[baseDim + 1]);
        z = curand_uniform_double(&state[baseDim + 2]);

        /* Check if within sphere of radius 1 */
        if( (x*x + y*y + z*z) < 1.0) {
            count++;
        }
    }
    /* Store results */
    result[id] += count;
}

int main(int argc, char *argv[])
{
    int i;
    long long total;
    curandStateScrambledSobol64 *devSobol64States;
    curandDirectionVectors64_t *hostVectors64;
    unsigned long long int * hostScrambleConstants64;
    unsigned long long int * devDirectionVectors64;
    unsigned long long int * devScrambleConstants64;
    long long int *devResults, *hostResults;
    int sampleCount = 10000;
    int iterations = 100;
    double fraction;
    double pi = 3.1415926535897932;

    /* Allow over-ride of sample count */
    if (argc == 2) {
        sscanf(argv[1],"%d",&sampleCount);
    }

    /* Allocate space for results on host */
    hostResults = (long long int*)calloc(TOTAL_THREADS,
                                      sizeof(long long int));

    /* Allocate space for results on device */
    CUDA_CALL(cudaMalloc((void **)&devResults,
                         TOTAL_THREADS * sizeof(long long int)));

    /* Set results to 0 */
    CUDA_CALL(cudaMemset(devResults, 0,
                         TOTAL_THREADS * sizeof(long long int)));

    /* Get pointers to the 64 bit scrambled direction vectors and constants*/
    CURAND_CALL(curandGetDirectionVectors64( &hostVectors64,
                                             CURAND_SCRAMBLED_DIRECTION_VECTORS_64_JOEKUO6));

    CURAND_CALL(curandGetScrambleConstants64( &hostScrambleConstants64));


    /* Allocate memory for 3 states per thread (x, y, z), each state to get a unique dimension */
    CUDA_CALL(cudaMalloc((void **)&devSobol64States,
              TOTAL_THREADS * 3 * sizeof(curandStateScrambledSobol64)));

    /* Allocate memory and copy 3 sets of vectors per thread to the device */

    CUDA_CALL(cudaMalloc((void **)&(devDirectionVectors64),
                         3 * TOTAL_THREADS * VECTOR_SIZE * sizeof(long long int)));

    CUDA_CALL(cudaMemcpy(devDirectionVectors64, hostVectors64,
                         3 * TOTAL_THREADS * VECTOR_SIZE * sizeof(long long int),
                         cudaMemcpyHostToDevice));

    /* Allocate memory and copy 3 scramble constants (one costant per dimension)
       per thread to the device */

    CUDA_CALL(cudaMalloc((void **)&(devScrambleConstants64),
                         3 * TOTAL_THREADS * sizeof(long long int)));

    CUDA_CALL(cudaMemcpy(devScrambleConstants64, hostScrambleConstants64,
                         3 * TOTAL_THREADS * sizeof(long long int),
                         cudaMemcpyHostToDevice));

    /* Initialize the states */

    setup_kernel<<<BLOCK_COUNT, THREADS_PER_BLOCK>>>(devDirectionVectors64,
                                                     devScrambleConstants64,
                                                     devSobol64States);

    /* Generate and count quasi-random points  */

    for(i = 0; i < iterations; i++) {
        generate_kernel<<<BLOCK_COUNT, THREADS_PER_BLOCK>>>(devSobol64States, sampleCount, devResults);
    }

    /* Copy device memory to host */

    CUDA_CALL(cudaMemcpy(hostResults,
              devResults,
              TOTAL_THREADS * sizeof(long long int),
              cudaMemcpyDeviceToHost));

    /* Tally and show result */

    total = 0;
    for(i = 0; i < TOTAL_THREADS; i++) {
        total += hostResults[i];
    }

    fraction = (double)total / ((double)TOTAL_THREADS * (double)sampleCount * (double)iterations);
    printf("Fraction inside sphere was %g\n", fraction);
    printf("(4/3) pi = %g, sampled volume = %g\n",(4.0*pi/3.0),8.0 * fraction);

    /* Cleanup */

    CUDA_CALL(cudaFree(devSobol64States));
    CUDA_CALL(cudaFree(devDirectionVectors64));
    CUDA_CALL(cudaFree(devScrambleConstants64));
    CUDA_CALL(cudaFree(devResults));
    free(hostResults);
    printf("^^^^ kernel_sobol_example PASSED\n");


    return EXIT_SUCCESS;
}

3.7. Thrust 和 cuRAND 示例

以下示例演示了 cuRAND 和 Thrust 的混合使用。它是 monte_carlo.cu 的最小修改版本，monte_carlo.cu 是标准 Thrust 示例之一。该示例估计 $π$ 的方法是在单位正方形中随机选取点，并计算到原点的距离，以查看这些点是否在四分之一单位圆内。

#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>
#include <curand_kernel.h>

#include <iostream>
#include <iomanip>

// we could vary M & N to find the perf sweet spot

struct estimate_pi : 
    public thrust::unary_function<unsigned int, float>
{
  __device__
  float operator()(unsigned int thread_id)
  {
    float sum = 0;
    unsigned int N = 10000; // samples per thread

    unsigned int seed = thread_id;

    curandState s;

    // seed a random number generator
    curand_init(seed, 0, 0, &s);

    // take N samples in a quarter circle
    for(unsigned int i = 0; i < N; ++i)
    {
      // draw a sample from the unit square
      float x = curand_uniform(&s);
      float y = curand_uniform(&s);

      // measure distance from the origin
      float dist = sqrtf(x*x + y*y);

      // add 1.0f if (u0,u1) is inside the quarter circle
      if(dist <= 1.0f)
        sum += 1.0f;
    }

    // multiply by 4 to get the area of the whole circle
    sum *= 4.0f;

    // divide by N
    return sum / N;
  }
};

int main(void)
{
  // use 30K independent seeds
  int M = 30000;

  float estimate = thrust::transform_reduce(
        thrust::counting_iterator<int>(0),
        thrust::counting_iterator<int>(M),
        estimate_pi(),
        0.0f,
        thrust::plus<float>());
  estimate /= M;

  std::cout << std::setprecision(3);
  std::cout << "pi is approximately ";
  std::cout << estimate << std::endl;
  return 0;
}

3.8. 泊松 API 示例

此示例显示了泊松分布的 3 种 API 类型之间的差异。它是商店中队列的模拟。host API 是生成泊松分布随机数的大向量的最稳健的方法。（即，它在 lambda 值的完整范围内具有最佳的统计特性）离散 Device API 几乎与 HOST API 一样稳健，并允许在内核内部生成泊松分布随机数。简单的 Device API 最不稳健，但在为许多不同的 lambda 生成泊松分布随机数时更有效。

/*
 * This program uses CURAND library for Poisson distribution
 * to simulate queues in store for 16 hours. It shows the
 * difference of using 3 different APIs:
 * - HOST API -arrival of customers is described by Poisson(4)
 * - SIMPLE DEVICE API -arrival of customers is described by
 *     Poisson(4*(sin(x/100)+1)), where x is number of minutes
 *     from store opening time.
 * - ROBUST DEVICE API -arrival of customers is described by:
 *     - Poisson(2) for first 3 hours.
 *     - Poisson(1) for second 3 hours.
 *     - Poisson(3) after 6 hours.
 */

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <curand.h>

#define CUDA_CALL(x) do { if((x) != cudaSuccess) { \
    printf("Error at %s:%d\n",__FILE__,__LINE__); \
    return EXIT_FAILURE;}} while(0)
#define CURAND_CALL(x) do { if((x)!=CURAND_STATUS_SUCCESS) { \
    printf("Error at %s:%d\n",__FILE__,__LINE__);\
    return EXIT_FAILURE;}} while(0)


#define HOURS 16
#define OPENING_HOUR 7
#define CLOSING_HOUR (OPENING_HOUR + HOURS)

#define access_2D(type, ptr, row, column, pitch)\
    *((type*)((char*)ptr + (row) * pitch) + column)

enum API_TYPE {
    HOST_API = 0,
    SIMPLE_DEVICE_API = 1,
    ROBUST_DEVICE_API = 2,
};

/* global variables */
API_TYPE api;
int report_break;
int cashiers_load_h[HOURS];
__constant__ int cashiers_load[HOURS];

__global__ void setup_kernel(curandState *state)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    /* Each thread gets same seed, a different sequence
       number, no offset */
    curand_init(1234, id, 0, &state[id]);
}

__inline__ __device__
void update_queue(int id, int min, unsigned int new_customers,
                  unsigned int &queue_length,
                  unsigned int *queue_lengths, size_t pitch)
{
    int balance;
    balance = new_customers - 2 * cashiers_load[(min-1)/60];
    if (balance + (int)queue_length <= 0){
        queue_length = 0;
    }else{
        queue_length += balance;
    }
    /* Store results */
    access_2D(unsigned int, queue_lengths, min-1, id, pitch)
        = queue_length;
}


__global__ void simple_device_API_kernel(curandState *state,
                    unsigned int *queue_lengths, size_t pitch)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int new_customers;
    unsigned int queue_length = 0;
    /* Copy state to local memory for efficiency */
    curandState localState = state[id];
    /* Simulate queue in time */
    for(int min = 1; min <= 60 * HOURS; min++) {
        /* Draw number of new customers depending on API */
        new_customers = curand_poisson(&localState,
                                4*(sin((float)min/100.0)+1));
        /* Update queue */
        update_queue(id, min, new_customers, queue_length,
                     queue_lengths, pitch);
    }
    /* Copy state back to global memory */
    state[id] = localState;
}


__global__ void host_API_kernel(unsigned int *poisson_numbers,
                    unsigned int *queue_lengths, size_t pitch)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    unsigned int new_customers;
    unsigned int queue_length = 0;
    /* Simulate queue in time */
    for(int min = 1; min <= 60 * HOURS; min++) {
        /* Get random number from global memory */
        new_customers = poisson_numbers
                    [blockDim.x * gridDim.x * (min -1) + id];
        /* Update queue */
        update_queue(id, min, new_customers, queue_length,
                     queue_lengths, pitch);
    }
}

__global__ void robust_device_API_kernel(curandState *state,
                   curandDiscreteDistribution_t poisson_1,
                   curandDiscreteDistribution_t poisson_2,
                   curandDiscreteDistribution_t poisson_3,
                   unsigned int *queue_lengths, size_t pitch)
{
    int id = threadIdx.x + blockIdx.x * 64;
    unsigned int new_customers;
    unsigned int queue_length = 0;
    /* Copy state to local memory for efficiency */
    curandState localState = state[id];
    /* Simulate queue in time */
    /* first 3 hours */
    for(int min = 1; min <= 60 * 3; min++) {
        /* draw number of new customers depending on API */
        new_customers =
                    curand_discrete(&localState, poisson_2);
        /* Update queue */
        update_queue(id, min, new_customers, queue_length,
                                    queue_lengths, pitch);
    }
    /* second 3 hours */
    for(int min = 60 * 3 + 1; min <= 60 * 6; min++) {
        /* draw number of new customers depending on API */
        new_customers =
                    curand_discrete(&localState, poisson_1);
        /* Update queue */
        update_queue(id, min, new_customers, queue_length,
                                    queue_lengths, pitch);
    }
    /* after 6 hours */
    for(int min = 60 * 6 + 1; min <= 60 * HOURS; min++) {
        /* draw number of new customers depending on API */
        new_customers =
                    curand_discrete(&localState, poisson_3);
        /* Update queue */
        update_queue(id, min, new_customers, queue_length,
                                    queue_lengths, pitch);
    }
    /* Copy state back to global memory */
    state[id] = localState;
}

/* Set time intervals between reports */
void report_settings()
{
    do{
        printf("Set time intervals between queue reports");
        printf("(in minutes > 0)\n");
        if (scanf("%d", &report_break) == 0) continue;
    }while(report_break <= 0);
}


/* Set number of cashiers each hour */
void add_cachiers(int *cashiers_load)
{
    int i, min, max, begin, end;
    printf("Cashier serves 2 customers per minute...\n");
    for (i = 0; i < HOURS; i++){
        cashiers_load_h[i] = 0;
    }
    while (true){
        printf("Adding cashier...\n");
        min = OPENING_HOUR;
        max = CLOSING_HOUR-1;
        do{
            printf("Set hour that cahier comes (%d-%d)",
                                                min, max);
            printf(" [type 0 to finish adding cashiers]\n");
            if (scanf("%d", &begin) == 0) continue;
        }while (begin > max || (begin < min && begin != 0));
        if (begin == 0) break;
        min = begin+1;
        max = CLOSING_HOUR;
        do{
            printf("Set hour that cahier leaves (%d-%d)",
                                                min, max);
            printf(" [type 0 to finish adding cashiers]\n");
            if (scanf("%d", &end) == 0) continue;
        }while (end > max || (end < min && end != 0));
        if (end == 0) break;
        for (i = begin - OPENING_HOUR;
             i < end   - OPENING_HOUR; i++){
            cashiers_load_h[i]++;
        }
    }
    for (i = OPENING_HOUR; i < CLOSING_HOUR; i++){
        printf("\n%2d:00 - %2d:00     %d cashier",
                i, i+1, cashiers_load_h[i-OPENING_HOUR]);
        if (cashiers_load[i-OPENING_HOUR] != 1) printf("s");
    }
    printf("\n");
}

/* Set API type */
API_TYPE set_API_type()
{
    printf("Choose API type:\n");
    int choose;
    do{
        printf("type 1 for HOST API\n");
        printf("type 2 for SIMPLE DEVICE API\n");
        printf("type 3 for ROBUST DEVICE API\n");
        if (scanf("%d", &choose) == 0) continue;
    }while( choose < 1 || choose > 3);
    switch(choose){
        case 1: return HOST_API;
        case 2: return SIMPLE_DEVICE_API;
        case 3: return ROBUST_DEVICE_API;
        default:
            fprintf(stderr, "wrong API\n");
            return HOST_API;
    }
}

void settings()
{
    add_cachiers(cashiers_load);
    cudaMemcpyToSymbol("cashiers_load", cashiers_load_h,
            HOURS * sizeof(int), 0, cudaMemcpyHostToDevice);
    report_settings();
    api = set_API_type();
}

void print_statistics(unsigned int *hostResults, size_t pitch)
{
    int min, i, hour, minute;
    unsigned int sum;
    for(min = report_break; min <= 60 * HOURS;
                            min += report_break) {
        sum = 0;
        for(i = 0; i < 64 * 64; i++) {
            sum += access_2D(unsigned int, hostResults,
                                        min-1, i, pitch);
        }
        hour = OPENING_HOUR + min/60;
        minute = min%60;
        printf("%2d:%02d   # of waiting customers = %10.4g |",
                    hour, minute, (float)sum/(64.0 * 64.0));
        printf("  # of cashiers = %d  |  ",
                    cashiers_load_h[(min-1)/60]);
        printf("# of new customers/min ~= ");
        switch (api){
            case HOST_API:
                printf("%2.2f\n", 4.0);
                break;
            case SIMPLE_DEVICE_API:
                printf("%2.2f\n",
                            4*(sin((float)min/100.0)+1));
                break;
            case ROBUST_DEVICE_API:
                if (min <= 3 * 60){
                    printf("%2.2f\n", 2.0);
                }else{
                    if (min <= 6 * 60){
                        printf("%2.2f\n", 1.0);
                    }else{
                        printf("%2.2f\n", 3.0);
                    }
                }
                break;
            default:
                fprintf(stderr, "Wrong API\n");
        }
    }
}


int main(int argc, char *argv[])
{
    int n;
    size_t pitch;
    curandState *devStates;
    unsigned int *devResults, *hostResults;
    unsigned int *poisson_numbers_d;
    curandDiscreteDistribution_t poisson_1, poisson_2;
    curandDiscreteDistribution_t poisson_3;
    curandGenerator_t gen;

    /* Setting cashiers, report and API */
    settings();

    /* Allocate space for results on device */
    CUDA_CALL(cudaMallocPitch((void **)&devResults, &pitch,
                64 * 64 * sizeof(unsigned int), 60 * HOURS));

    /* Allocate space for results on host */
    hostResults = (unsigned int *)calloc(pitch * 60 * HOURS,
                sizeof(unsigned int));

    /* Allocate space for prng states on device */
    CUDA_CALL(cudaMalloc((void **)&devStates, 64 * 64 *
              sizeof(curandState)));

    /* Setup prng states */
    if (api != HOST_API){
        setup_kernel<<<64, 64>>>(devStates);
    }
    /* Simulate queue  */
    switch (api){
        case HOST_API:
            /* Create pseudo-random number generator */
            CURAND_CALL(curandCreateGenerator(&gen,
                                CURAND_RNG_PSEUDO_DEFAULT));
            /* Set seed */
            CURAND_CALL(curandSetPseudoRandomGeneratorSeed(
                                            gen, 1234ULL));
            /* compute n */
            n = 64 * 64 * HOURS * 60;
            /* Allocate n unsigned ints on device */
            CUDA_CALL(cudaMalloc((void **)&poisson_numbers_d,
                                n * sizeof(unsigned int)));
            /* Generate n unsigned ints on device */
            CURAND_CALL(curandGeneratePoisson(gen,
                                poisson_numbers_d, n, 4.0));
            host_API_kernel<<<64, 64>>>(poisson_numbers_d,
                                        devResults, pitch);
            /* Cleanup */
            CURAND_CALL(curandDestroyGenerator(gen));
            break;
        case SIMPLE_DEVICE_API:
            simple_device_API_kernel<<<64, 64>>>(devStates,
                                        devResults, pitch);
            break;
        case ROBUST_DEVICE_API:
            /* Create histograms for Poisson(1) */
            CURAND_CALL(curandCreatePoissonDistribution(1.0,
                                                &poisson_1));
            /* Create histograms for Poisson(2) */
            CURAND_CALL(curandCreatePoissonDistribution(2.0,
                                                &poisson_2));
            /* Create histograms for Poisson(3) */
            CURAND_CALL(curandCreatePoissonDistribution(3.0,
                                                &poisson_3));
            robust_device_API_kernel<<<64, 64>>>(devStates,
                            poisson_1, poisson_2, poisson_3,
                            devResults, pitch);
            /* Cleanup */
            CURAND_CALL(curandDestroyDistribution(poisson_1));
            CURAND_CALL(curandDestroyDistribution(poisson_2));
            CURAND_CALL(curandDestroyDistribution(poisson_3));
            break;
        default:
            fprintf(stderr, "Wrong API\n");
    }
    /* Copy device memory to host */
    CUDA_CALL(cudaMemcpy2D(hostResults, pitch, devResults,
                pitch, 64 * 64 * sizeof(unsigned int),
                60 * HOURS, cudaMemcpyDeviceToHost));
    /* Show result */
    print_statistics(hostResults, pitch);
    /* Cleanup */
    CUDA_CALL(cudaFree(devStates));
    CUDA_CALL(cudaFree(devResults));
    free(hostResults);
    return EXIT_SUCCESS;
}