Monday, March 22, 2021

Can I Speed up This Aerodynamics Calculation With Numba, Vectorization, or Multiprocessing?

Problem:

I am trying to increase the speed of an aerodynamics function in Python.

Function Set:

import numpy as np
from numba import njit


def calculate_velocity_induced_by_line_vortices(
    points, origins, terminations, strengths, collapse=True
):
    # Expand the dimensionality of the points input. It is now of shape (N x 1 x 3).
    # This will allow NumPy to broadcast the upcoming subtractions.
    points = np.expand_dims(points, axis=1)

    # Define the vectors from the vortex to the points. r_1 and r_2 now both are of
    # shape (N x M x 3). Each row/column pair holds the vector associated with each
    # point/vortex pair.
    r_1 = points - origins
    r_2 = points - terminations

    r_0 = r_1 - r_2
    r_1_cross_r_2 = nb_2d_explicit_cross(r_1, r_2)
    r_1_cross_r_2_absolute_magnitude = (
        r_1_cross_r_2[:, :, 0] ** 2
        + r_1_cross_r_2[:, :, 1] ** 2
        + r_1_cross_r_2[:, :, 2] ** 2
    )
    r_1_length = nb_2d_explicit_norm(r_1)
    r_2_length = nb_2d_explicit_norm(r_2)

    # Define the radius of the line vortices. This is used to get rid of any
    # singularities.
    radius = 3.0e-16

    # Set the lengths and the absolute magnitudes to zero, at the places where the
    # lengths and absolute magnitudes are less than the vortex radius.
    r_1_length[r_1_length < radius] = 0
    r_2_length[r_2_length < radius] = 0
    r_1_cross_r_2_absolute_magnitude[r_1_cross_r_2_absolute_magnitude < radius] = 0

    # Calculate the vector dot products.
    r_0_dot_r_1 = np.einsum("ijk,ijk->ij", r_0, r_1)
    r_0_dot_r_2 = np.einsum("ijk,ijk->ij", r_0, r_2)

    # Calculate k and then the induced velocity, ignoring any divide-by-zero or nan
    # errors. k is of shape (N x M).
    with np.errstate(divide="ignore", invalid="ignore"):
        k = (
            strengths
            / (4 * np.pi * r_1_cross_r_2_absolute_magnitude)
            * (r_0_dot_r_1 / r_1_length - r_0_dot_r_2 / r_2_length)
        )

        # Set the shape of k to be (N x M x 1) to support NumPy broadcasting in the
        # subsequent multiplication.
        k = np.expand_dims(k, axis=2)

        induced_velocities = k * r_1_cross_r_2

    # Set the values of the induced velocity to zero where there are singularities.
    induced_velocities[np.isinf(induced_velocities)] = 0
    induced_velocities[np.isnan(induced_velocities)] = 0

    if collapse:
        induced_velocities = np.sum(induced_velocities, axis=1)

    return induced_velocities


@njit
def nb_2d_explicit_norm(vectors):
    return np.sqrt(
        (vectors[:, :, 0]) ** 2 + (vectors[:, :, 1]) ** 2 + (vectors[:, :, 2]) ** 2
    )


@njit
def nb_2d_explicit_cross(a, b):
    e = np.zeros_like(a)
    e[:, :, 0] = a[:, :, 1] * b[:, :, 2] - a[:, :, 2] * b[:, :, 1]
    e[:, :, 1] = a[:, :, 2] * b[:, :, 0] - a[:, :, 0] * b[:, :, 2]
    e[:, :, 2] = a[:, :, 0] * b[:, :, 1] - a[:, :, 1] * b[:, :, 0]
    return e

Context:

This function is used by Ptera Software, an open-source solver for flapping wing aerodynamics. As shown by the profile output below, it is by far the largest contributor to Ptera Software's run time.

[Profiler output screenshot]

Currently, Ptera Software takes just over 3 minutes to run a typical case, and my goal is to get this below 1 minute.

The function takes in a group of points, origins, terminations, and strengths. At every point, it finds the induced velocity due to the line vortices, which are characterized by the groups of origins, terminations, and strengths. If collapse is true, then the output is the cumulative velocity induced at each point due to the vortices. If false, the function outputs each vortex's contribution to the velocity at each point.
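To illustrate the shapes involved, here is a simplified NumPy-only sketch of the same Biot-Savart calculation (the sizes N = 4 and M = 6 are made up for the example; this is a reference sketch, not the optimized function above):

```python
import numpy as np


def induced_velocity_reference(points, origins, terminations, strengths, collapse=True):
    # Simplified, NumPy-only sketch of the same calculation, used here just to
    # show the input/output shapes and the meaning of the collapse flag.
    r_1 = points[:, None, :] - origins[None, :, :]       # (N, M, 3)
    r_2 = points[:, None, :] - terminations[None, :, :]  # (N, M, 3)
    r_0 = r_1 - r_2
    cross = np.cross(r_1, r_2)                           # (N, M, 3)
    cross_sq = np.einsum("ijk,ijk->ij", cross, cross)    # (N, M)
    r_1_len = np.linalg.norm(r_1, axis=2)
    r_2_len = np.linalg.norm(r_2, axis=2)
    with np.errstate(divide="ignore", invalid="ignore"):
        k = (
            strengths
            / (4 * np.pi * cross_sq)
            * (np.einsum("ijk,ijk->ij", r_0, r_1) / r_1_len
               - np.einsum("ijk,ijk->ij", r_0, r_2) / r_2_len)
        )
    # Zero out singular entries, as the real function does.
    v = np.nan_to_num(k[:, :, None] * cross, posinf=0.0, neginf=0.0)
    return v.sum(axis=1) if collapse else v


rng = np.random.default_rng(0)
points = rng.random((4, 3))
origins = rng.random((6, 3))
terminations = rng.random((6, 3))
strengths = rng.random(6)

collapsed = induced_velocity_reference(points, origins, terminations, strengths)
separate = induced_velocity_reference(
    points, origins, terminations, strengths, collapse=False
)
print(collapsed.shape)  # (4, 3): one net velocity per point
print(separate.shape)   # (4, 6, 3): one contribution per point/vortex pair
```

As expected, summing the per-vortex contributions over axis 1 reproduces the collapsed output.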

During a typical run, the velocity function is called approximately 2000 times. The earliest calls involve relatively small inputs (around 200 points, origins, terminations, and strengths). Later calls involve large inputs (around 400 points and around 6,000 origins, terminations, and strengths). An ideal solution would be fast for inputs of all sizes, but speeding up the large-input calls is more important.

For testing, I recommend running the following script with your own implementation of the function:

import timeit

import matplotlib.pyplot as plt
import numpy as np

n_repeat = 2
n_execute = 10 ** 3
min_oom = 0
max_oom = 3

times_py = []

for i in range(max_oom - min_oom + 1):
    n_elem = 10 ** i
    n_elem_pretty = np.format_float_scientific(n_elem, 0)
    print("Number of elements: " + n_elem_pretty)

    # Benchmark Python.
    print("\tBenchmarking Python...")
    setup = '''
import numpy as np

these_points = np.random.random((''' + str(n_elem) + ''', 3))
these_origins = np.random.random((''' + str(n_elem) + ''', 3))
these_terminations = np.random.random((''' + str(n_elem) + ''', 3))
these_strengths = np.random.random(''' + str(n_elem) + ''')

def calculate_velocity_induced_by_line_vortices(points, origins, terminations,
                                                strengths, collapse=True):
    pass
'''
    statement = '''
results_orig = calculate_velocity_induced_by_line_vortices(these_points, these_origins,
                                                           these_terminations,
                                                           these_strengths)
'''

    times = timeit.repeat(repeat=n_repeat, stmt=statement, setup=setup, number=n_execute)
    time_py = min(times) / n_execute
    time_py_pretty = np.format_float_scientific(time_py, 2)
    print("\t\tAverage Time per Loop: " + time_py_pretty + " s")

    # Record the times.
    times_py.append(time_py)

sizes = [10 ** i for i in range(max_oom - min_oom + 1)]

fig, ax = plt.subplots()

ax.plot(sizes, times_py, label='Python')
ax.set_xscale("log")
ax.set_xlabel("Size of List or Array (elements)")
ax.set_ylabel("Average Time per Loop (s)")
ax.set_title(
    "Comparison of Different Optimization Methods\nBest of "
    + str(n_repeat)
    + " Runs, each with "
    + str(n_execute)
    + " Loops"
)
ax.legend()
plt.show()

Previous Attempts:

My prior attempts at speeding up this function involved vectorizing it (which worked great, so I kept those changes) and trying out Numba's JIT compiler. I had mixed results with Numba. When I tried to use Numba on a modified version of the entire velocity function, my results were much slower than before. However, I found that Numba significantly sped up the cross-product and norm functions, which I implemented above.

Updates:

Update 1:

Based on Mercury's comment (which has since been deleted), I replaced

points = np.expand_dims(points, axis=1)
r_1 = points - origins
r_2 = points - terminations

with two calls to the following function:

@njit
def subtract(a, b):
    c = np.empty((a.shape[0], b.shape[0], 3))
    for i in range(a.shape[0]):
        for j in range(b.shape[0]):
            for k in range(3):
                c[i, j, k] = a[i, k] - b[j, k]
    return c
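A quick way to sanity-check a loop-based replacement like this is to compare it against the original broadcast subtraction (the array sizes below are made up; the `try`/`except` fallback is only there so the sketch also runs without Numba installed):

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fall back to plain Python if Numba is unavailable
    def njit(func):
        return func


@njit
def subtract(a, b):
    # Explicit-loop pairwise subtraction: c[i, j] = a[i] - b[j].
    c = np.empty((a.shape[0], b.shape[0], 3))
    for i in range(a.shape[0]):
        for j in range(b.shape[0]):
            for k in range(3):
                c[i, j, k] = a[i, k] - b[j, k]
    return c


rng = np.random.default_rng(0)
points = rng.random((4, 3))
origins = rng.random((6, 3))

# The explicit loops should match NumPy's broadcast subtraction.
loop_result = subtract(points, origins)
broadcast_result = np.expand_dims(points, axis=1) - origins
print(np.allclose(loop_result, broadcast_result))  # True
```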

This reduced the run time from 227 s to 220 s. This is better! However, it is still not fast enough.

I have also tried setting njit's fastmath flag to true and replacing the calls to np.einsum with a Numba function. Neither increased the speed.
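For reference, a Numba replacement for the np.einsum("ijk,ijk->ij", ...) calls might look like the following. This is a sketch of that attempt, not the exact function I used (the name nb_2d_explicit_dot is made up, and the `try`/`except` fallback just lets the sketch run without Numba):

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fall back to plain Python if Numba is unavailable
    def njit(func):
        return func


@njit
def nb_2d_explicit_dot(a, b):
    # Element-wise dot product over the last axis: equivalent to
    # np.einsum("ijk,ijk->ij", a, b) for (N, M, 3) inputs.
    c = np.empty((a.shape[0], a.shape[1]))
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            c[i, j] = (
                a[i, j, 0] * b[i, j, 0]
                + a[i, j, 1] * b[i, j, 1]
                + a[i, j, 2] * b[i, j, 2]
            )
    return c


rng = np.random.default_rng(0)
a = rng.random((4, 6, 3))
b = rng.random((4, 6, 3))
print(np.allclose(nb_2d_explicit_dot(a, b), np.einsum("ijk,ijk->ij", a, b)))  # True
```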

Update 2:

With Jérôme Richard's answer, the run time is now 156 s, which is a decrease of 29%! I'm satisfied enough to accept this answer, but feel free to make other suggestions if you think you can improve on their work!

https://stackoverflow.com/questions/66750661/can-i-speed-up-this-aerodynamics-calculation-with-numba-vectorization-or-multi March 23, 2021 at 01:07AM
