I'm trying to launch a training job on Google AI Platform with a custom container. As I want to use GPUs for the training, the base image I've used for my container is:
FROM nvidia/cuda:11.1.1-cudnn8-runtime-ubuntu18.04
With this image (and TensorFlow 2.4.1 installed on top of it), I expected to be able to use the GPUs on AI Platform, but that does not seem to be the case. When training starts, the logs show the following:
W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (gke-cml-0309-144111--n1-highmem-8-43e-0b9fbbdc-gnq6): /proc/driver/nvidia/version does not exist
I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
WARNING:tensorflow:There are non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
Is this a good way to build an image that uses GPUs on Google AI Platform? Or should I instead start from a TensorFlow image and manually install all the drivers needed to use the GPUs?
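To narrow down what the container actually sees, a check along these lines could be run at the start of the training job. This is only a minimal sketch using the Python standard library: the function name `driver_diagnostics` and its parameters are hypothetical, and the library name and paths are simply the ones that appear in the log above.

```python
import ctypes
import os


def driver_diagnostics(lib_name="libcuda.so.1",
                       proc_path="/proc/driver/nvidia/version"):
    """Collect basic facts about NVIDIA driver visibility in this process.

    Hypothetical helper: the defaults are taken from the log messages above.
    """
    report = {}

    # Directories listed in LD_LIBRARY_PATH, and whether each one exists.
    raw = os.environ.get("LD_LIBRARY_PATH", "")
    dirs = [d for d in raw.split(":") if d]
    report["ld_library_path"] = {d: os.path.isdir(d) for d in dirs}

    # The log reports that this file does not exist; check it directly.
    report["proc_driver_present"] = os.path.exists(proc_path)

    # Try to load the driver library through the dynamic loader.
    try:
        ctypes.CDLL(lib_name)
        report["libcuda_loadable"] = True
    except OSError as exc:
        report["libcuda_loadable"] = False
        report["libcuda_error"] = str(exc)

    return report


if __name__ == "__main__":
    for key, value in driver_diagnostics().items():
        print(f"{key}: {value}")
```

If the directories in LD_LIBRARY_PATH do not exist and the library cannot be loaded, the problem is more likely that no GPU driver is exposed to the container at all than that something is wrong with the CUDA toolkit inside the image.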
EDIT: I read here (https://cloud.google.com/ai-platform/training/docs/containers-overview) the following:
For training with GPUs, your custom container needs to meet a few special requirements. You must build a different Docker image than what you'd use for training with CPUs. Pre-install the CUDA toolkit and cuDNN in your Docker image. Using the nvidia/cuda image as your base image is the recommended way to handle this. It has the matching versions of CUDA toolkit and cuDNN pre-installed, and it helps you set up the related environment variables correctly. Install your training application, along with your required ML framework and other dependencies in your Docker image.
They also give a Dockerfile example here for training with GPUs. So what I did seems OK. Unfortunately, I still get the errors mentioned above, which may or may not explain why I cannot use GPUs on Google AI Platform.
https://stackoverflow.com/questions/66550195/could-not-load-dynamic-library-libcuda-so-1-error-on-google-ai-platform-with-cus March 09, 2021 at 11:49PM