AMD GPU SPM (Streaming Performance Monitor) Technical Reference
Part 1: User-Space Perspective

1. What is SPM

SPM (Streaming Performance Monitor) is a hardware mechanism for collecting performance counters as a stream. It is provided by the RLC (Run List Controller) unit of AMD GPUs.

Unlike traditional "start - stop - read" snapshot-style performance counters, SPM continuously streams microarchitecture counter data (CU occupancy, cache hit rate, VALU/SALU utilization, etc.) into a user-provided memory buffer at hardware clock granularity with near-zero overhead, without stopping the GPU workload.

2. User-Space API Layering

    ------------------------------------------------------------------
    Application / Tool (rocprofiler, custom profiler)
        hsa_amd_spm_acquire()
        hsa_amd_spm_set_dest_buffer()    <- double-buffer pattern
        hsa_amd_spm_release()
    ------------------------------------------------------------------
                    |
                    v
    ------------------------------------------------------------------
    HSA Runtime (libhsa-runtime64.so)
        hsa_ext_amd.cpp:
            hsa_amd_spm_acquire(agent)
                -> agent->driver().SPMAcquire(node_id)
        amd_kfd_driver.cpp:
            KfdDriver::SPMAcquire(node_id)
                -> HSAKMT_CALL(hsaKmtSPMAcquire(node_id))
    ------------------------------------------------------------------
                    |
                    v
    ------------------------------------------------------------------
    Thunk Layer (libhsakmt.so)
        spm.c:
            hsaKmtSPMAcquire(PreferredNode)
                -> validate_nodeid -> gpu_id
                -> ioctl(fd, AMDKFD_IOC_RLC_SPM, {op=ACQUIRE, gpu_id})
            hsaKmtSPMSetDestBuffer(node, size, timeout, ...)
                -> ioctl(fd, AMDKFD_IOC_RLC_SPM, {op=SET_DEST_BUF, ...})
            hsaKmtSPMRelease(PreferredNode)
                -> ioctl(fd, AMDKFD_IOC_RLC_SPM, {op=RELEASE, gpu_id})
    ------------------------------------------------------------------
                    |
                    v
    KFD ioctl (0x84) AMDKFD_IOC_RLC_SPM

3. Three API Semantics

| API | KFD Op | Semantics |
|---|---|---|
| hsa_amd_spm_acquire(agent) | SPM_OP_ACQUIRE | Acquire exclusive SPM access on the GPU for the calling process. Only one owner at a time. |
| hsa_amd_spm_set_dest_buffer() | SPM_OP_SET_DEST_BUF | Set/replace the destination buffer. KFD starts DMA-ing RLC SPM ring data into the user buffer. Supports a timeout-based wait for the previous buffer to fill. |
| hsa_amd_spm_release(agent) | SPM_OP_RELEASE | Release exclusive SPM access. Stops data streaming; remaining data is available upon return. |

4. Typical Usage Flow (Double-Buffer Pattern)

    User Process                                    KFD / Hardware
    |                                               |
    | 1. hsa_amd_spm_acquire(gpu)                   |
    | -----ioctl(ACQUIRE)-------------------------> |
    |                                               | Lock SPM for this process
    |                                               | Set spm_pasid = caller PASID
    |                                               | Program RLC_SPM_MC_CNTL.VMID
    |                                               |
    | 2. Allocate buf_A, buf_B (user memory)        |
    |                                               |
    | 3. set_dest_buffer(buf_A, size, ...)          |
    | -----ioctl(SET_DEST_BUF)--------------------> |
    |                                               | Start RLC SPM streaming
    |                                               | DMA counters -> buf_A
    |                                               |
    | 4. set_dest_buffer(buf_B, size, t=500)        |
    | -----ioctl(SET_DEST_BUF)--------------------> |
    |    (blocks up to 500 ms for buf_A fill)       | Switch to buf_B
    |    returns: size_copied for buf_A             |
    |                                               |
    | 5. Parse buf_A while HW fills buf_B           |
    |    ... repeat ping-pong ...                   |
    |                                               |
    | 6. hsa_amd_spm_release(gpu)                   |
    | -----ioctl(RELEASE)-------------------------> |
    |                                               | Stop SPM, release lock
    | 7. Parse final buffer                         |

5. KFD ioctl Data Structures

    enum kfd_ioctl_spm_op {
        KFD_IOCTL_SPM_OP_ACQUIRE,       // Acquire exclusive access
        KFD_IOCTL_SPM_OP_RELEASE,       // Release exclusive access
        KFD_IOCTL_SPM_OP_SET_DEST_BUF   // Set/replace destination buffer
    };

    struct kfd_ioctl_spm_args {
        __u64 dest_buf;        // User-space destination buffer address
        __u32 buf_size;        // Buffer size in bytes
        __u32 op;              // Operation (enum kfd_ioctl_spm_op)
        __u32 timeout;         // [in/out] Timeout in ms; updated with time remaining
        __u32 gpu_id;          // Target GPU ID
        __u32 bytes_copied;    // [out] Bytes copied to previous buffer
        __u32 has_data_loss;   // [out] Nonzero if ring overflowed
    };

    struct kfd_ioctl_spm_buffer_header {
        __u32 version;         // Bits 0-23: minor, bits 24-31: major
        __u32 bytes_copied;    // Per-sub-block data amount
        __u32 has_data_loss;   // Per-sub-block data loss indicator
        __u32 reserved[5];
    };

6. Consumer: rocprofiler

rocprofiler (projects/rocprofiler/src/core/session/spm/spm.cpp) is the primary user-space consumer:

    rocprofiler_spm_session
      |
      +-- startSpm()
      |     +-- hsa_amd_spm_acquire(gpu_agent)
      |     +-- Submit AQL start packet (configure HW counters)
      |     +-- Allocate 3 x 32MB buffers (triple-buffer)
      |     +-- set_dest_buffer(buf[0])
      |
      +-- spmBufferSetup() thread: ping-pong set_dest_buffer
      |
      +-- spmDataParse() thread: decode counter samples
      |
      +-- stopSpm()
            +-- Submit AQL stop packet
            +-- hsa_amd_spm_release(gpu_agent)

7. User-Space File Inventory

| Layer | File | Role |
|---|---|---|
| HSA Public API | hsa_ext_amd.h | Declare hsa_amd_spm_{acquire,release,set_dest_buffer} |
| HSA Runtime | hsa_ext_amd.cpp | API entry, validate agent, dispatch to driver |
| HSA Runtime | amd_kfd_driver.cpp | KfdDriver::SPM{Acquire,Release,SetDestBuffer} |
| HSA Runtime | thunk_loader.h / thunk_loader.cpp | HSAKMT_DEF / HSAKMT_PFN dynamic symbol load |
| HSA Runtime | hsa_api_trace.cpp | API trace hook registration |
| HSA Runtime | hsa_table_interface.cpp | HSA table dispatch |
| Thunk | spm.c | Three ioctl wrapper functions |
| Thunk | hsakmt.h | Declare hsaKmtSPM* |
| Thunk | kfd_ioctl.h | kfd_ioctl_spm_args, ioctl cmd definition |
| Thunk | libhsakmt.ver | Exported symbol table |
| DXG backend | dxg/spm.cpp | Windows DXG backend stub |
| VirtIO backend | virtio/hsakmt_virtio_topology.c | VirtIO backend implementation |
| Consumer | rocprofiler/spm/spm.cpp | rocprofiler SPM session management |

Part 2: Kernel-Space Perspective

1. SPM Hardware Architecture

    --- GPU Die ------------------------------------------------------

    +------+   +------+   +------+
    | CU 0 |   | CU 1 |   | CU N |      Shader Engines
    +--+---+   +--+---+   +--+---+
       |          |          |
       +----------+----------+
                  |                     Performance Counter Muxes
           +------v------+
           |     RLC     |              Run List Controller
           |  +-------+  |
           |  |  SPM  |  |              Streaming Performance Monitor
           |  | Engine|  |
           |  +-------+  |              - Configurable sample interval
           |             |              - Ring buffer in GPU-visible memory
           +------+------+              - Per-VMID access control
                  |
           +------v------+
           | MC / MMHUB  |              Memory Controller
           | (VRAM /     |
           |  GART)      |              SPM data -> ring buffer in memory
           +-------------+

    ------------------------------------------------------------------

Key hardware registers:

| Register | Role |
|---|---|
| RLC_SPM_MC_CNTL | SPM engine master control; RLC_SPM_VMID field selects owning VMID |
| RLC_SPM_RING_RDPTR | SPM ring buffer read pointer |
| RLC_SPM_RING_WRPTR | SPM ring buffer write pointer |
| RLC_SPM_PERFMON_* | Performance counter selection and sample interval |

2. Kernel Driver Layers

    ------------------------------------------------------------------
    KFD chardev ioctl handler
      AMDKFD_IOC_RLC_SPM (cmd 0x84)
      +-- kfd_ioctl_spm()   [out-of-tree / ROCK kernel]
          |
          +-- ACQUIRE:
          |     mutex_lock(spm_mutex)
          |     if (dev->spm_pasid != 0) return -EBUSY
          |     dev->spm_pasid = current->pasid
          |     update_spm_vmid(adev, vmid)
          |
          +-- SET_DEST_BUF:
          |     if (dev->spm_pasid != current->pasid) return -EINVAL
          |     configure ring buffer base/size
          |     if (timeout) wait_for_completion_timeout()
          |     copy bytes_copied, has_data_loss to user
          |
          +-- RELEASE:
                if (dev->spm_pasid != current->pasid) return -EINVAL
                stop SPM engine
                dev->spm_pasid = 0
                update_spm_vmid(adev, 0xf)   // reset to default
    ------------------------------------------------------------------
                    |
                    v
    ------------------------------------------------------------------
    amdgpu GFX IP callbacks (per-generation)
      gfx_v9_0.c / gfx_v10_0.c / gfx_v11_0.c / gfx_v12_1.c:
        .update_spm_vmid = gfx_vN_0_update_spm_vmid
            -> WREG32(RLC_SPM_MC_CNTL, vmid)
        .init_spm_golden = gfx_vN_0_init_spm_golden
            -> Program golden settings for SPM engine
    ------------------------------------------------------------------

3. update_spm_vmid Implementation (GFX9 Example)

    static void gfx_v9_0_update_spm_vmid(struct amdgpu_device *adev, int xcc_id,
                                         struct amdgpu_ring *ring, unsigned int vmid)
    {
        u32 data;

        amdgpu_gfx_off_ctrl(adev, false);   // Disable GFXOFF power-save

        // Read-modify-write RLC_SPM_MC_CNTL register
        data = RREG32_SOC15(GC, 0, mmRLC_SPM_MC_CNTL);
        data &= ~RLC_SPM_MC_CNTL__RLC_SPM_VMID_MASK;
        data |= vmid << RLC_SPM_MC_CNTL__RLC_SPM_VMID__SHIFT;
        WREG32_SOC15(GC, 0, mmRLC_SPM_MC_CNTL, data);

        amdgpu_gfx_off_ctrl(adev, true);    // Re-enable GFXOFF
    }

Key points:

- GFXOFF must be disabled before accessing RLC registers.
- The VMID field determines which process context triggers SPM data capture.
- SR-IOV uses the NO_KIQ variant to avoid a KIQ ring deadlock.

4. spm_pasid Mutual Exclusion Model

    kfd_dev (per-GPU):
      +-- spm_pasid: unsigned int   // 0 = no owner, nonzero = owning PASID
      +-- spm_mutex: mutex          // protects spm_pasid and HW state

    ACQUIRE:
      lock(spm_mutex)
      if spm_pasid != 0 -> -EBUSY   (another process owns it)
      spm_pasid = caller_pasid
      program HW VMID
      unlock(spm_mutex)

    RELEASE:
      lock(spm_mutex)
      if spm_pasid != caller_pasid -> -EINVAL
      stop HW, spm_pasid = 0
      update_spm_vmid(adev, 0xf)    // reset
      unlock(spm_mutex)

SPM is an exclusive resource on each GPU: only one process can hold SPM on a given GPU at a time. This is a hardware limitation: the RLC SPM engine has only one ring buffer and one VMID slot.

5. Data Flow: Hardware to User-Space

    Hardware                  Kernel                        User
    --------                  ------                        ----
    CU perf counters --> RLC SPM engine
                              |
                              |  (HW auto-sample at configured interval)
                              v
                     RLC SPM Ring Buffer (GPU-visible memory, kernel-managed)
                              |
                              |  (KFD copies via CPU or SDMA on SET_DEST_BUF)
                              v
                     kfd_ioctl_spm:
                       wait_for_completion_timeout()
                       copy_to_user(bytes_copied, has_data_loss)
                              |
                              v
                     User dest_buf (user-allocated, CPU-accessible)
                              |
                              v
                     rocprofiler parses SPM samples -> per-counter time-series data

6. Upstream vs. Out-of-Tree Status

| Component | Upstream (drm-next) | ROCK / DKMS |
|---|---|---|
| RLC_SPM_MC_CNTL register defs | Yes | Yes |
| update_spm_vmid() GFX callbacks | Yes | Yes |
| init_spm_golden() golden regs | Yes | Yes |
| spm_pasid in kfd_priv.h | Yes (field only) | Yes |
| AMDKFD_IOC_RLC_SPM ioctl handler | No | Yes |
| kfd_ioctl_spm_args UAPI header | No | Yes (libhsakmt ships its own) |

Note: The SPM ioctl (cmd 0x84) currently exists only in AMD's out-of-tree ROCK/amdgpu-dkms kernel. It has not been upstreamed to mainline Linux. The upstream kernel only has the low-level hardware register interfaces (update_spm_vmid, init_spm_golden), not the user-space ioctl entry point.

7. SPM vs. Traditional Performance Counters

| Feature | Traditional PMC (Snapshot) | SPM (Streaming) |
|---|---|---|
| Sampling mode | start - stop - read | Continuous HW auto-sample |
| Overhead | CP/RLC interaction per read | Near-zero; HW auto-DMA |
| Time resolution | Per-dispatch granularity | Configurable sample period (us-level) |
| Data volume | Tens of counter values per read | Continuous time-series stream |
| Exclusivity | Multiple processes can read different counters | Single-process exclusive |
| Typical use case | rocprof counter mode | rocprof SPM mode, temporal analysis |
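Appendix sketch: spm_pasid ownership rules

The mutual exclusion model in Part 2, section 4 can be tried out in user space. The following is a minimal sketch, not the KFD implementation. The struct `spm_dev_model`, the `spm_model_*` functions and the `hw_vmid` field are hypothetical names introduced here for illustration. A pthread mutex stands in for the kernel mutex, and a plain integer stands in for the VMID field of RLC_SPM_MC_CNTL.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

#define SPM_MODEL_DEFAULT_VMID 0xfu   /* reset value used in the text */

/* Hypothetical user-space stand-in for the per-GPU state. */
struct spm_dev_model {
    pthread_mutex_t spm_mutex;  /* protects spm_pasid and the modelled HW state */
    uint32_t spm_pasid;         /* 0 = no owner, nonzero = owning PASID */
    uint32_t hw_vmid;           /* models the VMID field of RLC_SPM_MC_CNTL */
};

static void spm_model_init(struct spm_dev_model *dev)
{
    assert(dev != NULL);
    pthread_mutex_init(&dev->spm_mutex, NULL);
    dev->spm_pasid = 0;
    dev->hw_vmid = SPM_MODEL_DEFAULT_VMID;
}

/* ACQUIRE: succeeds only while no process owns SPM on this device. */
static int spm_model_acquire(struct spm_dev_model *dev, uint32_t pasid,
                             uint32_t vmid)
{
    int ret = 0;

    assert(pasid != 0);  /* 0 is reserved to mean "no owner" */
    pthread_mutex_lock(&dev->spm_mutex);
    if (dev->spm_pasid != 0) {
        ret = -EBUSY;               /* another process owns it */
    } else {
        dev->spm_pasid = pasid;
        dev->hw_vmid = vmid;        /* "program HW VMID" */
    }
    pthread_mutex_unlock(&dev->spm_mutex);
    return ret;
}

/* RELEASE: only the current owner may release. */
static int spm_model_release(struct spm_dev_model *dev, uint32_t pasid)
{
    int ret = 0;

    assert(pasid != 0);
    pthread_mutex_lock(&dev->spm_mutex);
    if (dev->spm_pasid != pasid) {
        ret = -EINVAL;              /* caller is not the owner */
    } else {
        dev->spm_pasid = 0;
        dev->hw_vmid = SPM_MODEL_DEFAULT_VMID;  /* reset, as in the text */
    }
    pthread_mutex_unlock(&dev->spm_mutex);
    return ret;
}
```

In this model, a second `spm_model_acquire()` with a different PASID returns `-EBUSY` until the owner calls `spm_model_release()`, and a release attempted by a non-owner returns `-EINVAL`. The check and the state change happen under the same lock, so two racing callers cannot both become the owner.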