Vulkan: Tuning warptile for Mali GPU Performance #13483

rmatif · 2025-05-12T15:20:30Z

rmatif
May 12, 2025

I'm working on Local Diffusion, using stable-diffusion.cpp on Android. Vulkan performance on Mali GPUs is currently very poor.

Disabling mul_mat_l in ggml-vulkan.cpp helped a bit. I then tried modifying the m_warptile and s_warptile values. Reducing the first element (m tile?) from 128 to 64 gave a ~3x inference speedup, but the output images were garbage/noisy.

Questions:

How can I correctly tune m_warptile and s_warptile for Mali GPUs to get both performance and correct output?
Are there specific alignment requirements for these values on Mali?
Do the matmul shaders need to be adapted if these warptile values are changed?

I'm looking for guidance on improving Vulkan matmul performance on Mali without breaking correctness.

jeffbolznv · 2025-05-15T13:19:10Z

jeffbolznv
May 15, 2025
Collaborator

What is the warp size and shared memory size for this GPU? These should be printed out on startup.

The first value is the workgroup size. I'm surprised this broke things unless the workgroup size is smaller than the warp size.

Which is currently faster, m_warptile or s_warptile?

0 replies

rmatif · 2025-05-15T13:59:12Z

rmatif
May 15, 2025
Author

@jeffbolznv

Thanks for the response. Here's the warp size and shared memory size of the GPU:

I pretty much brute-forced all possible combinations while tuning m_warptile and s_warptile, but the output was always broken. That said, I only tested with stable-diffusion.cpp, not llama.cpp. In theory it should behave the same, since it's just mat_mul under the hood, unless the im2col op is somehow affecting it. I'll try it with llama.cpp later today to confirm.

In my case, m_warptile turned out to be significantly faster than s_warptile. I don't remember the exact numbers, but the difference was very noticeable.

2 replies

jeffbolznv May 15, 2025
Collaborator

When you change m_warptile, you also need to change m_wg_denoms to match (first two elements of m_wg_denoms must match second/third element of m_warptile). I don't know why a workgroup size of 64 would fail, though.
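For illustration, the two constraints mentioned in this thread (the first two elements of m_wg_denoms matching the second and third elements of m_warptile, and the workgroup size not being smaller than the warp size) could be checked with something like the sketch below. This is not code from ggml-vulkan.cpp: `TileConfig`, `denoms_match`, and `workgroup_covers_subgroup` are hypothetical names, and the real tables have more entries than are modelled here.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical container for one warptile / wg_denoms pair.
// Only the elements discussed in this thread are interpreted:
//   warptile[0]              = workgroup size
//   warptile[1], warptile[2] = values that wg_denoms[0], wg_denoms[1] must match
struct TileConfig {
    std::vector<uint32_t> warptile;
    std::array<uint32_t, 3> wg_denoms;
};

// True when the first two wg_denoms entries equal the second and third
// warptile entries.
bool denoms_match(const TileConfig &cfg) {
    return cfg.warptile.size() >= 3 &&
           cfg.wg_denoms[0] == cfg.warptile[1] &&
           cfg.wg_denoms[1] == cfg.warptile[2];
}

// True when the workgroup size (first warptile entry) is at least the
// subgroup (warp) size reported by the device.
bool workgroup_covers_subgroup(const TileConfig &cfg, uint32_t subgroup_size) {
    return !cfg.warptile.empty() && cfg.warptile[0] >= subgroup_size;
}
```

Running both checks against an edited table before rebuilding would flag a mismatch such as changing the second warptile element to 32 while leaving m_wg_denoms at {64, 64, 1} (values chosen only as an example).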

rmatif May 15, 2025
Author

@jeffbolznv

You were right. I just tested this in llama.cpp, and changing the first element of m_warptile to 64 doesn't break anything, but it also doesn't provide a performance gain.

The issue seems specific to stable-diffusion.cpp. Since most of the workload involves conv_2d, I thought im2col might require some kind of alignment. Other than that, I'm not sure why it works fine in llama.cpp but not in sd.cpp. Any hints?
