Using ROCm on WSL2

I recommend reading the relevant section first.

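I complained about Radeon in an earlier post, but I have recently gone back to using a Radeon system.

To be fair, Radeon GPUs are not bad for pure gaming or for classic GPU compute programming.

For an ML developer, however, the documentation for putting a Radeon GPU to work is rather unfriendly...

I ran into several problems while installing ROCm to play around with it, so this post documents the installation steps.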

1. Use Ubuntu 22.04

WSL support matrices by ROCm version — Use ROCm on Radeon GPUs

Available from PyTorch.org nightly builds, not tested extensively by AMD.

rocm.docs.amd.com

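As of `October 28, 2024`, Ubuntu 24.04 has been out for about half a year, yet it is still not supported.

I first tried 24.04, hit a series of errors, and after some digging found that 22.04 is required.

Download Ubuntu 22.04 from the Microsoft Store and use it as the WSL distribution.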
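Before starting the ROCm install, it may be worth confirming which release the WSL distribution is actually running. The following is a minimal sketch that reads `/etc/os-release`; the helper names `parse_os_release` and `is_ubuntu_2204` are made up for this illustration.

```python
from pathlib import Path


def parse_os_release(text: str) -> dict[str, str]:
    """Parse the KEY=value lines of an os-release file into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info


def is_ubuntu_2204(info: dict[str, str]) -> bool:
    """True when the distribution reports itself as Ubuntu 22.04."""
    return info.get("ID") == "ubuntu" and info.get("VERSION_ID") == "22.04"


if __name__ == "__main__":
    path = Path("/etc/os-release")
    if path.exists():
        print("Ubuntu 22.04:", is_ubuntu_2204(parse_os_release(path.read_text())))
    else:
        print("/etc/os-release not found")
```

Run it inside the WSL shell; if it does not report 22.04, the errors described above are likely to show up during the install.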

2. Install ROCm

Install Radeon software for WSL with ROCm — Use ROCm on Radeon GPUs

After the Unified Driver Deb Package repositories are installed, run the installer script with appropriate --usecase parameters to install the driver components. AMD recommends installing the WSL usecase by default. Post-install verification check Run a po

rocm.docs.amd.com

Just follow that guide step by step.

On Ubuntu 22.04 the installation completes without any problems.

After that, run the `rocminfo` command.

The CPU is listed as Agent 1 and the GPU as Agent 2. (This is the typical single-GPU setup.)

root@Cyp:~# rocminfo
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  ENABLED
DMAbuf Support:          YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    CPU
  Uuid:                    CPU-XX
  Marketing Name:          CPU
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
  Chip ID:                 0(0x0)
  Cacheline Size:          64(0x40)
  Internal Node ID:        0
  Compute Unit:            12
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    16334012(0xf93cbc) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    16334012(0xf93cbc) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1030
  Marketing Name:          AMD Radeon RX 6800
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        16(0x10)
  Queue Min Size:          4096(0x1000)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L3:                      131072(0x20000) KB
  Chip ID:                 29631(0x73bf)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   1950
  Internal Node ID:        1
  Compute Unit:            60
  SIMDs per CU:            2
  Shader Engines:          4
  Shader Arrs. per Eng.:   2
  Coherent Host Access:    FALSE
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 118
  SDMA engine uCode::      0
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    16703716(0xfee0e4) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1030
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

Agent 2 shows the installed Radeon GPU.

In my case it is an RX 6800.
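The `rocminfo` dump is long, so a short script can pull out just the agent summary. This is a rough sketch that assumes the output keeps the layout shown above (an `Agent N` banner followed by indented `Key: value` lines); `parse_agents` is a hypothetical helper, not part of ROCm.

```python
import re
import subprocess


def parse_agents(rocminfo_text: str) -> list[dict[str, str]]:
    """Return one dict per HSA agent with its name, marketing name and device type."""
    agents = []
    current = None
    for line in rocminfo_text.splitlines():
        if re.fullmatch(r"Agent \d+", line.strip()):
            current = {}
            agents.append(current)
            continue
        if current is None or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        # Keep only the first occurrence so the ISA-level "Name" does not
        # overwrite the agent-level one
        if key in ("Name", "Marketing Name", "Device Type") and key not in current:
            current[key] = value
    return agents


if __name__ == "__main__":
    try:
        out = subprocess.run(
            ["rocminfo"], capture_output=True, text=True, check=True
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"rocminfo could not be run: {exc}")
    else:
        for agent in parse_agents(out):
            print(agent)
```

On a single-GPU machine this should print two entries, one with `Device Type` set to `CPU` and one set to `GPU`.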

3. Install PyTorch

Install PyTorch for ROCm — Use ROCm on Radeon GPUs

AMD recommends the PIP install method to create a PyTorch environment when working with ROCm™ for machine learning development. Note The latest version of Python module numpy v2.0 is incompatible with the torch wheels for this version. Downgrade to an ol

rocm.docs.amd.com

Follow the guide above and PyTorch installs and works as is.

I rarely use TensorFlow these days, so I did not try installing it.

Go through the "Verify PyTorch installation" section of the guide to confirm that PyTorch actually runs on the GPU.
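The AMD guide quoted above notes that NumPy 2.0 is incompatible with the torch wheels for that ROCm version. A small check like the following can flag this before running anything; it is only a sketch, the helper names are invented here, and the "older than 2.0" rule is taken from that note and may not apply to later wheels.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string: str) -> int:
    """Return the leading integer of a version string such as '1.26.4'."""
    return int(version_string.split(".")[0])


def numpy_is_compatible(version_string: str) -> bool:
    """True when NumPy is older than 2.0 (assumption based on the AMD note)."""
    return major_version(version_string) < 2


if __name__ == "__main__":
    try:
        installed = version("numpy")
    except PackageNotFoundError:
        print("numpy is not installed")
    else:
        print(f"numpy {installed} compatible: {numpy_is_compatible(installed)}")
```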

4. Final check

First, check once more that PyTorch detects the GPU.

test.py

import torch

# Check the ROCm (HIP) version; torch.version.hip is None on non-ROCm builds
hip_version = getattr(torch.version, "hip", None)
if hip_version:
    print(f"ROCm version: {hip_version}")
else:
    print("ROCm is not available")

# Check GPU detection; ROCm builds of PyTorch expose the GPU through the torch.cuda API
if torch.cuda.is_available():
    print("CUDA GPU available")
else:
    print("No GPU detected")

# Number and name of the available GPUs
if torch.cuda.is_available():
    print(f"Number of GPUs: {torch.cuda.device_count()}")
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")

output

ROCm version: 6.1.40093-bd86f1708
CUDA GPU available
Number of GPUs: 1
GPU Name: AMD Radeon RX 6800

5. Problems

import torch

# Basic GPU information
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Current GPU:", torch.cuda.current_device())
    print("GPU name:", torch.cuda.get_device_name(0))
    print("GPU memory:", torch.cuda.get_device_properties(0).total_memory / 1024**3, "GB")

# Quick test with very small tensors
print("\n=== Quick Test ===")
x = torch.rand(100, 100, device="cuda")
y = torch.rand(100, 100, device="cuda")
print("Small tensors created on GPU")

# Simple compute test
z = torch.matmul(x, y)
print("Matrix multiplication completed")

# GPU memory status
if torch.cuda.is_available():
    print("\nGPU Memory Status:")
    print(f"Allocated: {torch.cuda.memory_allocated(0)/1024**2:.2f} MB")
    print(f"Cached: {torch.cuda.memory_reserved(0)/1024**2:.2f} MB")

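When I run the simple code above, it stalls while allocating the tensors `x` and `y` on the GPU.

The script stops at the `x` line and does not proceed even after waiting more than five minutes.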
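Since the failure mode is a hang rather than an error, one way to test for it without staring at a frozen terminal is to run the GPU probe in a child process with a timeout. This is a minimal sketch; `run_with_timeout` and the 60-second limit are arbitrary choices for illustration, and the probe one-liner simply mirrors the tensor allocation from the script above.

```python
import subprocess
import sys


def run_with_timeout(command: list[str], timeout_s: float) -> str:
    """Run a command and report 'ok', 'failed' or 'timeout'."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    # Hypothetical probe: allocate a small tensor on the GPU and wait for it
    gpu_probe = (
        "import torch; "
        "torch.rand(100, 100, device='cuda'); "
        "torch.cuda.synchronize()"
    )
    print(run_with_timeout([sys.executable, "-c", gpu_probe], timeout_s=60))
```

A result of `timeout` corresponds to the hang described here, while `failed` means the probe exited with an error (for example, no usable GPU).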

There are several related discussions on Reddit and GitHub; see them for reference.

From the ROCm community on Reddit

Explore this post and more from the ROCm community

www.reddit.com

Supposedly it is a group membership problem... but adding the groups did not fix it for me.
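For anyone who wants to rule out the group issue quickly, the sketch below lists which of the expected groups the current user is missing. The group names `render` and `video` are an assumption based on AMD's general Linux install guidance, not something confirmed in the linked thread, and both helper functions are hypothetical.

```python
import grp
import os


def missing_groups(user_groups: set[str], required: set[str]) -> set[str]:
    """Return the required groups that the user does not belong to."""
    return required - user_groups


def current_group_names() -> set[str]:
    """Names of the groups the current process belongs to."""
    names = set()
    for gid in os.getgroups():
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            # A group id without a matching entry in the group database
            continue
    return names


if __name__ == "__main__":
    # Assumed group names; adjust if your setup uses different ones
    required = {"render", "video"}
    missing = missing_groups(current_group_names(), required)
    print("Missing groups:", sorted(missing) if missing else "none")
```

An empty result only rules out this one cause; as noted, fixing the groups did not resolve the hang in this setup.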

Then again, the RX 6800 is not a GPU for which AMD officially supports ROCm, so this may be expected.

System requirements (Linux) — ROCm installation (Linux)

The following table shows the supported AMD Instinct™ accelerators, and Radeon™ PRO and Radeon GPUs. If a GPU is not listed on this table, it’s not officially supported by AMD. Accelerators and GPUs listed in the following table support compute workl

rocm.docs.amd.com

One amusing detail:

On Windows, HIP, a component of ROCm, does support the RX 6800.

System requirements (Windows) — HIP SDK installation (Windows)

ROCm components are described in What is ROCm? Support on Windows is provided with two levels on enablement. Note Some math libraries are Linux exclusive.

rocm.docs.amd.com

The page above lists support for 22H2, but I am currently on Windows 24H2, so...

I do not really feel like trying it. (I only tried the WSL2 install out of boredom, and my patience has run out.)

From the pytorch community on Reddit

Explore this post and more from the pytorch community

www.reddit.com

There is a two-year-old report that it works well on top of DirectML.

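In conclusion, the AI boom that started with ChatGPT seems to have made AMD pay much more attention to the ROCm ecosystem,

but for now it is still far less convenient to use than Nvidia.

Researchers, developers, and students alike dislike complicated setup, yet ROCm is annoying to set up, has little documentation,

and is heavily fragmented across versions. (On Nvidia, even GTX 1000 series cards work fine.)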

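As an aside, Ollama also supports Radeon, so I ran the Qwen 2.5 14B model on the RX 6800 as well,

but token generation is about 1/4 the speed of the RTX 3070 I used before. (In raw performance, the RX 6800 is actually more than 10% ahead.)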
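A comparison like this can be put into numbers using the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama returns from `/api/generate`. The sketch below is an assumption about how one might measure it, not how the figure above was obtained; the default host, the `qwen2.5:14b` model tag and the prompt are placeholders.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def speed_ratio(speed_a: float, speed_b: float) -> float:
    """How fast A is relative to B (0.25 means A runs at a quarter of B's speed)."""
    return speed_a / speed_b


def measure(model: str, prompt: str, host: str = "http://localhost:11434") -> float:
    """Ask a local Ollama server for one non-streamed completion and time it."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))


if __name__ == "__main__":
    try:
        # Placeholder model tag and prompt
        print(f"{measure('qwen2.5:14b', 'Explain ROCm in one paragraph.'):.1f} tokens/s")
    except OSError as exc:
        print(f"Could not reach Ollama: {exc}")
```

Running the same prompt on both machines and passing the two results to `speed_ratio` gives a figure comparable to the rough 1/4 estimate.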

I have decided not to associate the Radeon RX line with ML until 2027.

6. On native Ubuntu, rather than WSL, it works correctly.

RX6800+ROCm VS Tesla T4+Cuda

In the previous post ROCm would not run on WSL2 and I had half given up, but I have gathered some energy and am fooling around with it once more. The RX6800 on the WSL2 environment that I tried in the previous post

cypsw.tistory.com

License: Attribution, NonCommercial, ShareAlike
