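CUDA is unavailable
Symptoms
- nvidia-smi displays the GPU information normally.
- nvidia-smi topo -m shows no NVLink connections (see figure).
- Running the following command prints False and emits a warning:
python -c "import torch; print(torch.cuda.is_available())"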
/<path>/lib/python3.10/site-packages/torch/cuda/__init__.py:118: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
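The symptom above can be checked automatically. The following is a minimal sketch, not part of the original procedure: has_error_802 is a hypothetical helper name, and it only looks for the "Error 802" text quoted in the warning above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original notes): reads command output on
# stdin and succeeds only if it contains the CUDA "Error 802" message.
has_error_802() {
    grep -q "Error 802: system not yet initialized"
}

# Example usage on the affected machine (left commented out here):
# python -c "import torch; print(torch.cuda.is_available())" 2>&1 \
#     | has_error_802 && echo "Error 802 found: check nvidia-fabricmanager"
```

The pipe merges stderr into stdout because PyTorch prints the warning on stderr.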
Solution
A search turned up several similar reports, all caused by a missing nvidia-fabricmanager; installing and enabling it resolves the problem:
sudo apt install nvidia-fabricmanager-<driver-version>
sudo systemctl enable nvidia-fabricmanager
sudo systemctl start nvidia-fabricmanager
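The <driver-version> placeholder above appears to be the driver branch (for example 550 for driver 550.54.15), judging by the package name used later in these notes. A small sketch of deriving the package name from the installed driver follows; fm_package_for_driver is a hypothetical helper, and the branch-naming rule is an assumption based on that one example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a full driver version to the fabric manager
# package name. Assumes the package suffix is the driver branch, i.e. the
# part of the version before the first dot.
fm_package_for_driver() {
    local driver_version="$1"
    local branch="${driver_version%%.*}"   # "550.54.15" -> "550"
    echo "nvidia-fabricmanager-${branch}"
}

# Example usage on the affected machine (left commented out here):
# driver_version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
# sudo apt install "$(fm_package_for_driver "$driver_version")"
```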
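However, the installed driver is 550.54.15, while the Ubuntu 18.04 apt repository only offers versions up to 530. In testing, the Ubuntu 20.04 deb package could be installed directly without dependency problems: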
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/nvidia-fabricmanager-550_550.54.15-1_amd64.deb
sudo dpkg -i nvidia-fabricmanager-550_550.54.15-1_amd64.deb
sudo systemctl enable nvidia-fabricmanager
sudo systemctl start nvidia-fabricmanager
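Since the fabric manager package is picked to match the driver version, it can help to build the download URL from the version and to confirm the match after installation. The sketch below assumes the URL pattern seen in the wget command above also holds for other versions, which is not verified; fm_deb_url and fm_matches_driver are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the ubuntu2004 deb URL for a given driver
# version. The pattern is copied from the single URL used above and is an
# assumption for any other version.
fm_deb_url() {
    local version="$1"                     # e.g. "550.54.15"
    local branch="${version%%.*}"          # e.g. "550"
    local repo="https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64"
    echo "${repo}/nvidia-fabricmanager-${branch}_${version}-1_amd64.deb"
}

# Hypothetical helper: succeed if the package version (with its Debian
# revision, e.g. "550.54.15-1") matches the driver version ("550.54.15").
fm_matches_driver() {
    local package_version="$1"
    local driver_version="$2"
    [ "${package_version%%-*}" = "$driver_version" ]
}

# Example usage on the affected machine (left commented out here):
# driver_version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
# package_version="$(dpkg-query -W -f='${Version}' nvidia-fabricmanager-550)"
# fm_matches_driver "$package_version" "$driver_version" || echo "version mismatch"
# systemctl is-active nvidia-fabricmanager
```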

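GPU unavailable in Docker after CUDA is restored
nvidia-docker was removed together with the driver when the driver was uninstalled.
Solution
Because nvidia-fabricmanager and the driver were installed manually in the steps above and do not come from the apt repository, apt cannot resolve the dependencies automatically, so nvidia-container-toolkit has to be installed manually.
In testing, the Ubuntu 20.04 deb packages work without dependency problems.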
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-drivers-550_550.54.15-1_amd64.deb
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-drivers-fabricmanager-550_550.54.15-1_amd64.deb
apt-get download nvidia-container-toolkit
apt-get download libnvidia-container1
apt-get download nvidia-container-toolkit-base
apt-get download libnvidia-container-tools
sudo dpkg -i --ignore-depends=nvidia-driver-550 cuda-drivers-550_550.54.15-1_amd64.deb
sudo dpkg -i cuda-drivers-fabricmanager-550_550.54.15-1_amd64.deb
sudo dpkg -i nvidia-container-toolkit-base_1.15.0-1_amd64.deb
sudo dpkg -i libnvidia-container1_1.15.0-1_amd64.deb
sudo dpkg -i libnvidia-container-tools_1.15.0-1_amd64.deb
sudo dpkg -i nvidia-container-toolkit_1.15.0-1_amd64.deb
After the installation completes, restart Docker:
sudo systemctl restart docker
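After the restart, it may be worth confirming that all four container packages are registered with dpkg and that a container can see the GPU. This is a sketch rather than part of the original procedure: missing_packages is a hypothetical helper, and the ubuntu image in the commented test command is an assumption (any image works, since the toolkit mounts nvidia-smi into the container).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each package from the argument list that dpkg
# does not report as fully installed.
missing_packages() {
    local pkg
    for pkg in "$@"; do
        if ! dpkg-query -W -f='${Status}' "$pkg" 2>/dev/null \
                | grep -q "install ok installed"; then
            echo "$pkg"
        fi
    done
}

# Example usage on the affected machine (left commented out here):
# missing_packages nvidia-container-toolkit nvidia-container-toolkit-base \
#     libnvidia-container1 libnvidia-container-tools
# sudo docker run --rm --gpus all ubuntu nvidia-smi
```

If the first command prints nothing and the second shows the same GPU table as on the host, Docker can use the GPU again.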
References
https://stackoverflow.com/questions/66371130/cuda-initialization-unexpected-error-from-cudagetdevicecount
https://blog.csdn.net/weixin_41674971/article/details/125291799
https://forums.fast.ai/t/notes-on-using-nvidia-a100-40gb/89894