Tuesday, February 9, 2021

Bazel targets built against TensorFlow C++ API don't execute factory registration functions

I am encountering an issue that seems to be identical to the one described here: https://chromium.googlesource.com/external/github.com/tensorflow/tensorflow/+/v1.4.0/tensorflow/docs_src/mobile/linking_libs.md#global-constructor-magic

One of the subtlest problems you may run up against is the "No session factory registered for the given session options" error when trying to call TensorFlow from your own application. ... TensorFlow uses a registration pattern in a lot of places:

    class RegisterMul {
     public:
      RegisterMul() {
        global_kernel_registry()->Register("Mul", [](){
          return new MulKernel();
        });
      }
    };
    RegisterMul g_register_mul;

This sets up a class RegisterMul with a constructor that tells the global kernel registry what function to call when somebody asks it how to create a "Mul" kernel. Then there's a global object of that class, and so the constructor should be called at the start of any program.

The global object that's defined is not used by any other code, so linkers not designed with this in mind will decide that it can be deleted. As a result, the constructor is never called, and the class is never registered.

The solution is to force the linker to not strip any code from the library, even if it believes it's unused. On iOS, this step can be accomplished with the -force_load flag, specifying a library path, and on Linux you need --whole-archive. These persuade the linker to not be as aggressive about stripping, and should retain the globals.
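For reference, the pattern the quoted text describes can be reproduced in a small standalone sketch. Everything here (`Kernel`, `MulKernel`, `GlobalKernelRegistry`, `CreateKernel`) is a hypothetical stand-in written for illustration, not TensorFlow's actual registry code; it only shows that the factory becomes available purely as a side effect of constructing a global object that nothing else references.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins for illustration; not TensorFlow's real types.
struct Kernel {
  virtual ~Kernel() = default;
  virtual std::string Name() const = 0;
};

struct MulKernel : Kernel {
  std::string Name() const override { return "Mul"; }
};

using KernelFactory = std::function<std::unique_ptr<Kernel>()>;

// Function-local static, so the registry is constructed on first use,
// even if a registrar's constructor runs before other globals exist.
std::map<std::string, KernelFactory>& GlobalKernelRegistry() {
  static std::map<std::string, KernelFactory> registry;
  return registry;
}

class RegisterMul {
 public:
  RegisterMul() {
    // Registration happens only if this constructor actually runs.
    bool inserted =
        GlobalKernelRegistry()
            .emplace("Mul",
                     [] { return std::unique_ptr<Kernel>(new MulKernel()); })
            .second;
    assert(inserted && "kernel registered twice");
    (void)inserted;  // Avoid an unused-variable warning when NDEBUG is set.
  }
};

// No other code refers to this object. If its object file is left out of
// the final link, the constructor never runs and "Mul" is never registered.
RegisterMul g_register_mul;

// Returns a new kernel, or nullptr if no factory was registered for `name`.
std::unique_ptr<Kernel> CreateKernel(const std::string& name) {
  auto it = GlobalKernelRegistry().find(name);
  if (it == GlobalKernelRegistry().end()) return nullptr;
  return it->second();
}
```

Compiled as a single translation unit, `CreateKernel("Mul")` returns a kernel because `g_register_mul` is always constructed before `main` runs. The failure mode described above appears when the registrar lives in a static archive member that no other code references, since the linker only pulls in archive members needed to resolve undefined symbols.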

However, I'm having trouble turning "on Linux you need --whole-archive" into something that actually works. I am:

  • Compiling TensorFlow 2.2.2 from source into a Python wheel
  • Using that wheel in another project built with Bazel
  • Exposing that wheel's C++ API via a Bazel target (see below)
  • Referencing that C++ API target from custom C++ code

I've tried adding:

    alwayslink=True,
    linkopts = ["-Wl,--whole-archive"],

to the Bazel targets for each of my targets in the above list, and it has not made a difference. My target for the wheel is:

    cc_library(
        name = "c_api",
        # This is the only .so or .so.* in the wheel
        srcs=["//:tensorflow/libtensorflow_framework.so.2"],
        hdrs = glob([
            "tensorflow/include/**/*.h",
            "tensorflow/include/**/*.inc",
            "tensorflow/include/**/Eigen/**/*",
        ]),
        alwayslink=True,
        linkopts = ["-Wl,--whole-archive"],
        includes = [
            "tensorflow/include",
        ],
        visibility = ["//visibility:public"],
        deps = ["@zoox//third_party/cuda:cuda_libs"],
    )

And the actual end target that's failing:

    cc_library(
        name = "tensorflow_wrapper",
        srcs = ["tensorflow_wrapper.cpp"],
        hdrs = ["tensorflow_wrapper.h"],
        tags = ["offline-only"],
        deps = [
            ":utils",
            "//other/stuff:etc",
            # This is the above target for the wheel
            "@pypi__tensorflow_python3_deps//:c_api",
        ],
        alwayslink = 1,
        linkopts = ["-Wl,--whole-archive"],
    )

The :c_api target is enough to get custom ops working, and the C++ code that's failing still compiles and runs. It just doesn't have the required factories registered, so it fails at the point where they are needed.

What do I need to change to have the registration pattern that TensorFlow uses actually execute?

https://stackoverflow.com/questions/66073142/bazel-targets-built-against-tensorflow-c-api-dont-execute-factory-registratio February 06, 2021 at 10:55AM
