Building a TensorFlow Environment on an AMD GPU (Installing TensorFlow and Running a Sample)

Introduction

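This article documents the steps needed to get a TensorFlow sample running on an AMD GPU.
The goal is to find out whether a machine repurposed from cryptocurrency mining can host a ROCm-based TensorFlow environment.
The previous article covered installing ROCm,
so this one covers installing TensorFlow and running a sample.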

This article is a more detailed version of the one I posted on Qiita.
Next article: Running TensorFlow (CPU vs. GPU comparison)

Configuration

CPU: Celeron G3930
GPU: Radeon Vega 56
OS: Ubuntu 18.04 LTS (Kernel 4.15)
ROCm Version: 2.1

Preparing to Install TensorFlow

Installing the Python prerequisites: I ran the following commands, based on the "Install required python packages" section ("On Python 3-based systems") of the official GitHub repository.

sudo apt-get update && sudo apt-get install -y \
    python3-numpy \
    python3-dev \
    python3-wheel \
    python3-mock \
    python3-future \
    python3-pip \
    python3-yaml \
    python3-setuptools && \
    sudo apt-get clean && \
    sudo rm -rf /var/lib/apt/lists/*
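
To confirm that the packages above are visible to the Python 3 interpreter, a small check like the following can help. It is a minimal sketch: the import names in `REQUIRED_MODULES` are my assumption about which module each apt package provides, and `missing_modules` is a hypothetical helper, not part of any ROCm or TensorFlow tooling.

```python
import importlib.util

# Import names assumed to correspond to the apt packages above
# (python3-dev ships headers only, so it has no module to check).
REQUIRED_MODULES = ["numpy", "wheel", "mock", "future", "pip", "yaml", "setuptools"]


def missing_modules(names):
    """Return the names from `names` that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("missing:", missing if missing else "none")
```

An empty result only shows that the modules can be found, not that their versions match what tensorflow-rocm expects.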

Installing TensorFlow

Install tensorflow-rocm with pip.

pip3 install tensorflow-rocm

Running TensorFlow

Checking that TensorFlow works → the import failed because a required library was missing.

python3
>>> import tensorflow
Traceback (most recent call last):
  File "/home/tk/.local/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/home/tk/.local/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/home/tk/.local/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/usr/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/usr/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libCXLActivityLogger.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/tk/.local/lib/python3.6/site-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/home/tk/.local/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/home/tk/.local/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/home/tk/.local/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/home/tk/.local/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/home/tk/.local/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/usr/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/usr/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libCXLActivityLogger.so: cannot open shared object file: No such file or directory


Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.
>>> 
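
The last line of the traceback names the shared library that could not be opened. As a rough way to test whether the dynamic loader can find that library without importing TensorFlow, something like the sketch below may be useful. `can_load` is a hypothetical helper; it assumes that a library `ctypes` can open is also one TensorFlow can open, which may not hold if the two use different search paths.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open `library_name`."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # Library name taken from the ImportError above.
    name = "libCXLActivityLogger.so"
    print(name, "->", "found" if can_load(name) else "NOT found")
```

Running the same check again after installing additional packages gives a quick before/after comparison.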

A search showed that others had hit the same problem, so I followed their report and ran the command below.

$ sudo apt-get update && \
      sudo apt-get install -y --allow-unauthenticated \
      rocm-dkms rocm-dev rocm-libs \
      rocm-device-libs \
      hsa-ext-rocr-dev hsakmt-roct-dev hsa-rocr-dev \
      rocm-opencl rocm-opencl-dev \
      rocm-utils \
      rocm-profiler cxlactivitylogger \
      miopen-hip miopengemm

After these packages finished installing, I checked TensorFlow again → the import now succeeds, with only deprecation warnings.

python3
Python 3.6.7 (default, Oct 22 2018, 11:32:17) 
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
WARNING:tensorflow:From /home/tk/.local/lib/python3.6/site-packages/tensorflow/python/ops/distributions/distribution.py:265: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /home/tk/.local/lib/python3.6/site-packages/tensorflow/python/ops/distributions/bernoulli.py:169: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
>>> 
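
A successful import does not by itself show that TensorFlow can see the GPU. One possible follow-up check is to list the devices TensorFlow reports, as in the sketch below. It assumes the `device_lib` module from TensorFlow's Python client, which is an internal module rather than a stable public API; `list_tf_devices` is a hypothetical wrapper that returns an error message instead of raising when TensorFlow cannot be imported.

```python
def list_tf_devices():
    """Return (devices, error); devices is a list of (name, type) pairs."""
    try:
        # Internal TensorFlow module; assumed to be available in this build.
        from tensorflow.python.client import device_lib
    except ImportError as exc:
        return None, str(exc)
    devices = [(d.name, d.device_type) for d in device_lib.list_local_devices()]
    return devices, None


if __name__ == "__main__":
    devices, error = list_tf_devices()
    if error is not None:
        print("TensorFlow could not be imported:", error)
    else:
        for name, device_type in devices:
            print(device_type, name)
```

If the list contains only CPU entries, the ROCm side of the installation is the first place to look.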

Running the Sample

Install Git

sudo apt install git

Git clone: I cloned the repository following the official GitHub instructions.

cd ~
git clone https://github.com/tensorflow/models.git

From this repository, I ran the CIFAR10 tutorial as a benchmark.

cd ~/models/tutorials/image/cifar10
export HIP_VISIBLE_DEVICES=0
python3 ./cifar10_train.py
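
As an aside, the `HIP_VISIBLE_DEVICES` setting used above can also be applied from inside a Python script instead of the shell. The sketch below assumes the variable is read when the HIP runtime initializes, so it has to be set before TensorFlow is imported; `restrict_hip_devices` is a hypothetical helper name.

```python
import os


def restrict_hip_devices(indices):
    """Set HIP_VISIBLE_DEVICES to a comma-separated list of device indices."""
    value = ",".join(str(i) for i in indices)
    os.environ["HIP_VISIBLE_DEVICES"] = value
    return value


if __name__ == "__main__":
    # Assumed to be effective only if done before TensorFlow is imported.
    restrict_hip_devices([0])
    print("HIP_VISIBLE_DEVICES =", os.environ["HIP_VISIBLE_DEVICES"])
```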

Result → training ran without problems.

2019-02-12 16:58:19.769456: step 7720, loss = 0.85 (3943.5 examples/sec; 0.032 sec/batch)
2019-02-12 16:58:20.132842: step 7730, loss = 0.87 (3522.4 examples/sec; 0.036 sec/batch)
2019-02-12 16:58:20.468507: step 7740, loss = 0.88 (3813.3 examples/sec; 0.034 sec/batch)
2019-02-12 16:58:20.791237: step 7750, loss = 1.07 (3966.2 examples/sec; 0.032 sec/batch)
2019-02-12 16:58:21.121733: step 7760, loss = 0.90 (3873.0 examples/sec; 0.033 sec/batch)
2019-02-12 16:58:21.487553: step 7770, loss = 0.83 (3499.0 examples/sec; 0.037 sec/batch)
2019-02-12 16:58:21.851375: step 7780, loss = 0.81 (3518.2 examples/sec; 0.036 sec/batch)
2019-02-12 16:58:22.170275: step 7790, loss = 0.87 (4013.8 examples/sec; 0.032 sec/batch)
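
To compare runs, it can be easier to reduce log lines like these to a few numbers. The following sketch extracts the examples/sec values from the eight lines above and summarizes them. The regular expression assumes the exact log format shown here, and `throughputs` is a hypothetical helper.

```python
import re
import statistics

# Matches the throughput field of a cifar10_train.py log line.
THROUGHPUT = re.compile(r"\(([\d.]+) examples/sec;")


def throughputs(log_text):
    """Extract every examples/sec value from the log text as floats."""
    return [float(value) for value in THROUGHPUT.findall(log_text)]


# The eight log lines shown above, copied verbatim.
LOG = """\
2019-02-12 16:58:19.769456: step 7720, loss = 0.85 (3943.5 examples/sec; 0.032 sec/batch)
2019-02-12 16:58:20.132842: step 7730, loss = 0.87 (3522.4 examples/sec; 0.036 sec/batch)
2019-02-12 16:58:20.468507: step 7740, loss = 0.88 (3813.3 examples/sec; 0.034 sec/batch)
2019-02-12 16:58:20.791237: step 7750, loss = 1.07 (3966.2 examples/sec; 0.032 sec/batch)
2019-02-12 16:58:21.121733: step 7760, loss = 0.90 (3873.0 examples/sec; 0.033 sec/batch)
2019-02-12 16:58:21.487553: step 7770, loss = 0.83 (3499.0 examples/sec; 0.037 sec/batch)
2019-02-12 16:58:21.851375: step 7780, loss = 0.81 (3518.2 examples/sec; 0.036 sec/batch)
2019-02-12 16:58:22.170275: step 7790, loss = 0.87 (4013.8 examples/sec; 0.032 sec/batch)
"""

if __name__ == "__main__":
    values = throughputs(LOG)
    print("samples:", len(values))
    print("mean examples/sec: %.1f" % statistics.mean(values))
    print("min/max: %.1f / %.1f" % (min(values), max(values)))
```

Eight samples are too few for a real benchmark; feeding a longer stretch of the log through the same function gives a more stable figure.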

The GPU load also indicates that the training is actually running on the GPU.

$ sudo /opt/rocm/bin/rocm-smi -u
========================        ROCm System Management Interface        ========================
================================================================================================
GPU[0]      : Cannot get GPU use.
GPU[1]      : Current GPU use: 64%
================================================================================================
========================               End of ROCm SMI Log              ========================
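
Because the load appeared to fluctuate, it may help to record the utilization as numbers rather than read it by eye. The sketch below parses output in the format shown above. The pattern is an assumption based only on this one sample, since the rocm-smi output format can differ between ROCm versions, and `parse_gpu_use` is a hypothetical helper.

```python
import re

# Matches lines such as "GPU[1]      : Current GPU use: 64%".
GPU_USE = re.compile(r"GPU\[(\d+)\]\s*:\s*Current GPU use:\s*(\d+)%")


def parse_gpu_use(smi_output):
    """Map GPU index to utilization percent; GPUs without a reading are omitted."""
    return {int(idx): int(pct) for idx, pct in GPU_USE.findall(smi_output)}


# The two GPU lines from the rocm-smi output above.
SAMPLE = """\
GPU[0]      : Cannot get GPU use.
GPU[1]      : Current GPU use: 64%
"""

if __name__ == "__main__":
    print(parse_gpu_use(SAMPLE))
```

Calling this on output captured at regular intervals during training would show how much the load actually varies.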

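Summary

I managed to get through installing ROCm and TensorFlow and running a sample,
but other people's benchmarks show higher throughput than mine, and the GPU load on my machine fluctuated considerably, so I plan to look into parameter tuning.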
