I am trying to run my program on Google Colab using the Tesla T4 GPU that is available there. I am using Numba's @cuda.jit, and when I time my estimation, the CPU version comes out faster than the GPU version. Is there something wrong with my implementation of the GPU code, or should it just not run faster for this? I assumed the Monte Carlo method would benefit from the GPU. I am sure there are faster ways to do this, but I am trying to keep it simple and write it the way that makes sense to me before I optimize it further.
import numpy as np
import matplotlib.pyplot as plt
import time
from random import *
from numba import jit, cuda, njit
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32

# This is the 10-sphere pi estimation using Monte Carlo, run on the CPU.
def pi_value(trial):
    hit = 0
    for i in range(trial):
        x1 = random()
        x2 = random()
        x3 = random()
        x4 = random()
        x5 = random()
        x6 = random()
        x7 = random()
        x8 = random()
        x9 = random()
        x10 = random()
        if (x1**2+x2**2+x3**2+x4**2+x5**2+x6**2+x7**2+x8**2+x9**2+x10**2)**(1/2) <= 1:
            hit += 1
    return hit

iter10 = 10000000
dimen = 10

# The function is run twice; only the second run's time is reported.
start = time.time()
hit = pi_value(iter10)
end = time.time()

start1 = time.time()
hit1 = pi_value(iter10)
end1 = time.time()

run_time = end1 - start1
# The volume of the unit 10-ball is pi**5/120, so the fraction of the unit
# hypercube [0, 1]**10 that falls inside it is pi**5/122880.
piv = (122880 * (hit1 / iter10))**(1/5)
print("For the {dimen} sphere with {trials} random points, the value of pi is estimated to be {pi}, and executed in {run_time} seconds.".format(dimen=dimen, trials=iter10, pi=piv, run_time=run_time))

# This is the 10-sphere estimation run on the GPU.
@cuda.jit
def pi_value(rng_states, iterations, out):
    thread_id = cuda.grid(1)
    hit = 0
    for i in range(iterations):
        x1 = xoroshiro128p_uniform_float32(rng_states, thread_id)
        x2 = xoroshiro128p_uniform_float32(rng_states, thread_id)
        x3 = xoroshiro128p_uniform_float32(rng_states, thread_id)
        x4 = xoroshiro128p_uniform_float32(rng_states, thread_id)
        x5 = xoroshiro128p_uniform_float32(rng_states, thread_id)
        x6 = xoroshiro128p_uniform_float32(rng_states, thread_id)
        x7 = xoroshiro128p_uniform_float32(rng_states, thread_id)
        x8 = xoroshiro128p_uniform_float32(rng_states, thread_id)
        x9 = xoroshiro128p_uniform_float32(rng_states, thread_id)
        x10 = xoroshiro128p_uniform_float32(rng_states, thread_id)
        if (x1**2+x2**2+x3**2+x4**2+x5**2+x6**2+x7**2+x8**2+x9**2+x10**2)**(1/2) <= 1:
            hit += 1
    # Each thread writes its own estimate of pi; the host averages them.
    out[thread_id] = (122880 * (hit / iterations))**(1/5)

threads_per_block = 128
blocks = 32
rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
out = np.zeros(threads_per_block * blocks, dtype=np.float32)
pi_value[blocks, threads_per_block](rng_states, 10000000, out)
print('pi:', out.mean())
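For reference, here is the direction I was thinking of taking it if the problem is that every thread currently repeats the full 10,000,000 iterations: split the samples across threads, have each thread write only its own hit count, and sum on the host. The kernel name hits_kernel, the samples_per_thread split, the larger block count, and the warm-up/cuda.synchronize() timing are just my guesses at the right approach, not something I have verified.

import time
import numpy as np
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32

@cuda.jit
def hits_kernel(rng_states, samples_per_thread, out):
    # Each thread counts hits for its own share of the samples only.
    thread_id = cuda.grid(1)
    if thread_id >= out.shape[0]:
        return
    hit = 0
    for i in range(samples_per_thread):
        s = 0.0
        for d in range(10):  # 10 coordinates per random point
            x = xoroshiro128p_uniform_float32(rng_states, thread_id)
            s += x * x
        if s <= 1.0:  # same test as sqrt(s) <= 1, without the square root
            hit += 1
    out[thread_id] = hit

threads_per_block = 128
blocks = 256  # more blocks than before so there are enough threads to fill the GPU
n_threads = threads_per_block * blocks
total_samples = 10000000
samples_per_thread = total_samples // n_threads  # integer division, so slightly under 10**7 in total

rng_states = create_xoroshiro128p_states(n_threads, seed=1)
out = np.zeros(n_threads, dtype=np.int32)

# Warm-up call so the timed run does not include JIT compilation.
hits_kernel[blocks, threads_per_block](rng_states, samples_per_thread, out)
cuda.synchronize()

start = time.time()
hits_kernel[blocks, threads_per_block](rng_states, samples_per_thread, out)
cuda.synchronize()  # kernel launches are asynchronous, so wait before stopping the clock
run_time = time.time() - start

ratio = out.sum() / (samples_per_thread * n_threads)
piv = (122880 * ratio)**(1/5)
print("pi is estimated to be {pi}, executed in {run_time} seconds.".format(pi=piv, run_time=run_time))

With the work split this way, the GPU run simulates roughly the same 10**7 points as the CPU version instead of 10**7 points per thread, which I think is what my current kernel is doing.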
Any help would be appreciated, thanks!
https://stackoverflow.com/questions/67443978/monte-carlo-pi-estimation-in-python-on-gpu-using-numba-cuda-jit May 08, 2021 at 12:04PM