
Eval bug: ggml_cuda_compute_forward: MUL_MAT failed when using FA + MLA on DeepSeekv3 0324, on mixed CPU + GPU #13252


Open
Panchovix opened this issue May 2, 2025 · 2 comments · May be fixed by #13306

Comments

@Panchovix

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 2: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 3: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
version: 5255 (d24d5928)
built with gcc-14 (GCC) 14.2.1 20250210 (Red Hat 14.2.1-8) for x86_64-redhat-linux

Operating systems

Linux

GGML backends

CUDA

Hardware

Ryzen 7 7800X3D, 192GB RAM, 5090+4090x2+A6000

Models

Deepseek V3 0324

Problem description & steps to reproduce

Hi there, many thanks for all the work.

I was trying to use DeepSeek V3 0324 Q2_K_XL (https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD/tree/main/UD-Q2_K_XL) on my mixed CPU + GPU PC, using ~120GB of RAM and the rest on VRAM.

When using the -fa flag, I get a MUL_MAT failed error.

When not using -fa, it works fine.

The model was loaded with:

./llama-server -m '/run/media/pancho/DE1652041651DDD9/HuggingFaceModelDownloader/Storage/GGUFs/DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf' -c 16384 --no-mmap --no-warmup -fa -v -ngl 99 --override-tensor 'blk\.(2[5-9]|[3-6][0-9])\..*_exps\.=CPU' --override-tensor 'blk\.([1-6])\..*_exps\.=CUDA0' --override-tensor 'blk\.([7-9]|1[0])\..*_exps\.=CUDA1' --override-tensor 'blk\.(1[1-5])\..*_exps\.=CUDA2' --override-tensor 'blk\.(1[6-9]|2[0-4])\..*_exps\.=CUDA3'

First Bad Commit

N/A

Relevant log output

/llama-server -m '/run/media/pancho/DE1652041651DDD9/HuggingFaceModelDownloader/Storage/GGUFs/DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf' -c 16384 --no-mmap --no-warmup -fa -ngl 99 --override-tensor 'blk\.(2[5-9]|[3-6][0-9])\..*_exps\.=CPU' --override-tensor 'blk\.([1-6])\..*_exps\.=CUDA0' --override-tensor 'blk\.([7-9]|1[0])\..*_exps\.=CUDA1' --override-tensor 'blk\.(1[1-5])\..*_exps\.=CUDA2' --override-tensor 'blk\.(1[6-9]|2[0-4])\..*_exps\.=CUDA3'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 2: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 3: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
build: 5255 (d24d5928) with gcc-14 (GCC) 14.2.1 20250210 (Red Hat 14.2.1-8) for x86_64-redhat-linux
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CUDA : ARCHS = 860,890,1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 15
main: loading model
srv    load_model: loading model '/run/media/pancho/DE1652041651DDD9/HuggingFaceModelDownloader/Storage/GGUFs/DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 23698 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) - 23698 MiB free
llama_model_load_from_file_impl: using device CUDA2 (NVIDIA GeForce RTX 5090) - 29679 MiB free
llama_model_load_from_file_impl: using device CUDA3 (NVIDIA RTX A6000) - 48281 MiB free
llama_model_loader: loaded meta data with 64 key-value pairs and 1086 tensors from /run/media/pancho/DE1652041651DDD9/HuggingFaceModelDownloader/Storage/GGUFs/DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Deepseek-V3-0324
llama_model_loader: - kv   3:                            general.version str              = V3-0324
llama_model_loader: - kv   4:                           general.basename str              = Deepseek-V3-0324
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 256x20B
llama_model_loader: - kv   7:                            general.license str              = mit
llama_model_loader: - kv   8:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = DeepSeek V3 0324
llama_model_loader: - kv  11:               general.base_model.0.version str              = V3-0324
llama_model_loader: - kv  12:          general.base_model.0.organization str              = Deepseek Ai
llama_model_loader: - kv  13:              general.base_model.0.repo_url str              = https://huggingface.co/deepseek-ai/De...
llama_model_loader: - kv  14:                               general.tags arr[str,4]       = ["deepseek_v3", "deepseek", "unsloth"...
llama_model_loader: - kv  15:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  16:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv  17:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv  18:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv  19:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv  20:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv  21:          deepseek2.attention.head_count_kv u32              = 1
llama_model_loader: - kv  22:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  23: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  24:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  25:        deepseek2.leading_dense_block_count u32              = 3
llama_model_loader: - kv  26:                       deepseek2.vocab_size u32              = 129280
llama_model_loader: - kv  27:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  28:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  29:             deepseek2.attention.key_length u32              = 576
llama_model_loader: - kv  30:           deepseek2.attention.value_length u32              = 512
llama_model_loader: - kv  31:         deepseek2.attention.key_length_mla u32              = 192
llama_model_loader: - kv  32:       deepseek2.attention.value_length_mla u32              = 128
llama_model_loader: - kv  33:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  34:                     deepseek2.expert_count u32              = 256
llama_model_loader: - kv  35:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  36:             deepseek2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  37:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  38:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  39:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  40:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  41:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  42: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  43: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  44:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  45:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  46:                      tokenizer.ggml.tokens arr[str,129280]  = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv  47:                  tokenizer.ggml.token_type arr[i32,129280]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  48:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv  49:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  50:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  51:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  52:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  53:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  54:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  55:               general.quantization_version u32              = 2
llama_model_loader: - kv  56:                          general.file_type u32              = 10
llama_model_loader: - kv  57:                      quantize.imatrix.file str              = DeepSeek-V3-0324-GGUF/imatrix_unsloth...
llama_model_loader: - kv  58:                   quantize.imatrix.dataset str              = unsloth_calibration_DeepSeek-V3-0324.txt
llama_model_loader: - kv  59:             quantize.imatrix.entries_count i32              = 720
llama_model_loader: - kv  60:              quantize.imatrix.chunks_count i32              = 60
llama_model_loader: - kv  61:                                   split.no u16              = 0
llama_model_loader: - kv  62:                        split.tensors.count i32              = 1086
llama_model_loader: - kv  63:                                split.count u16              = 0
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q8_0:  122 tensors
llama_model_loader: - type q2_K:  122 tensors
llama_model_loader: - type q3_K:   54 tensors
llama_model_loader: - type q4_K:  389 tensors
llama_model_loader: - type q5_K:   23 tensors
llama_model_loader: - type q6_K:   15 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q2_K - Medium
print_info: file size   = 233.18 GiB (2.98 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 818
load: token to piece cache size = 0.8223 MB
print_info: arch             = deepseek2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 163840
print_info: n_embd           = 7168
print_info: n_layer          = 61
print_info: n_head           = 128
print_info: n_head_kv        = 1
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 576
print_info: n_embd_head_v    = 512
print_info: n_gqa            = 128
print_info: n_embd_k_gqa     = 576
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 18432
print_info: n_expert         = 256
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = yarn
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 671B
print_info: model params     = 671.03 B
print_info: general.name     = Deepseek-V3-0324
print_info: n_layer_dense_lead   = 3
print_info: n_lora_q             = 1536
print_info: n_lora_kv            = 512
print_info: n_embd_head_k_mla    = 192
print_info: n_embd_head_v_mla    = 128
print_info: n_ff_exp             = 2048
print_info: n_expert_shared      = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm  = 1
print_info: expert_gating_func   = sigmoid
print_info: rope_yarn_log_mul    = 0.1000
print_info: vocab type       = BPE
print_info: n_vocab          = 129280
print_info: n_merges         = 127741
print_info: BOS token        = 0 '<|begin▁of▁sentence|>'
print_info: EOS token        = 1 '<|end▁of▁sentence|>'
print_info: EOT token        = 1 '<|end▁of▁sentence|>'
print_info: PAD token        = 2 '<|▁pad▁|>'
print_info: LF token         = 201 'Ċ'
print_info: FIM PRE token    = 128801 '<|fim▁begin|>'
print_info: FIM SUF token    = 128800 '<|fim▁hole|>'
print_info: FIM MID token    = 128802 '<|fim▁end|>'
print_info: EOG token        = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors:        CUDA0 model buffer size = 18097.53 MiB
load_tensors:        CUDA1 model buffer size = 17719.83 MiB
load_tensors:        CUDA2 model buffer size = 22027.26 MiB
load_tensors:        CUDA3 model buffer size = 38894.36 MiB
load_tensors:          CPU model buffer size = 142037.11 MiB
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
.......load_all_data: using async uploads for device CUDA1, buffer type CUDA1, backend CUDA1
.......load_all_data: using async uploads for device CUDA2, buffer type CUDA2, backend CUDA2
..........load_all_data: using async uploads for device CUDA3, buffer type CUDA3, backend CUDA3
................load_all_data: no device found for buffer type CPU for async uploads
............................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 0.025
llama_context: n_ctx_per_seq (16384) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     0.49 MiB
llama_context: n_ctx = 16384
llama_context: n_ctx = 16384 (padded)
init: kv_size = 16384, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 1
init: layer   0: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA0
init: layer   1: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA0
init: layer   2: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA0
init: layer   3: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA0
init: layer   4: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA0
init: layer   5: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA0
init: layer   6: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA0
init: layer   7: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA0
init: layer   8: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA0
init: layer   9: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA0
init: layer  10: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA0
init: layer  11: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA0
init: layer  12: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA1
init: layer  13: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA1
init: layer  14: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA1
init: layer  15: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA1
init: layer  16: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA1
init: layer  17: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA1
init: layer  18: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA1
init: layer  19: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA1
init: layer  20: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA1
init: layer  21: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA1
init: layer  22: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA1
init: layer  23: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA1
init: layer  24: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA2
init: layer  25: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA2
init: layer  26: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA2
init: layer  27: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA2
init: layer  28: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA2
init: layer  29: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA2
init: layer  30: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA2
init: layer  31: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA2
init: layer  32: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA2
init: layer  33: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA2
init: layer  34: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA2
init: layer  35: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA2
init: layer  36: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA2
init: layer  37: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA2
init: layer  38: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA2
init: layer  39: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  40: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  41: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  42: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  43: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  44: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  45: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  46: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  47: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  48: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  49: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  50: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  51: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  52: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  53: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  54: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  55: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  56: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  57: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  58: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  59: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init: layer  60: n_embd_k_gqa = 576, n_embd_v_gqa = 512, dev = CUDA3
init:      CUDA0 KV buffer size =   408.00 MiB
init:      CUDA1 KV buffer size =   408.00 MiB
init:      CUDA2 KV buffer size =   510.00 MiB
init:      CUDA3 KV buffer size =   748.00 MiB
llama_context: KV self size  = 2074.00 MiB, K (f16): 1098.00 MiB, V (f16):  976.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 5
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context:      CUDA0 compute buffer size =  3238.50 MiB
llama_context:      CUDA1 compute buffer size =   378.00 MiB
llama_context:      CUDA2 compute buffer size =   378.00 MiB
llama_context:      CUDA3 compute buffer size =   378.00 MiB
llama_context:  CUDA_Host compute buffer size =   336.01 MiB
llama_context: graph nodes  = 4660
llama_context: graph splits = 307 (with bs=512), 235 (with bs=1)
clear_adapter_lora: call
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16384
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 16384
slot        reset: id  0 | task -1 | 

...

slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.632294
srv  update_slots: decoding batch, n_tokens = 2048
set_embeddings: value = 0
clear_adapter_lora: call
/run/media/pancho/6AE20D1AE20CEBDF/ChatIAs/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: ggml_cuda_compute_forward: MUL_MAT failed
CUDA error: invalid configuration argument
  current device: 0, in function ggml_cuda_compute_forward at /run/media/pancho/6AE20D1AE20CEBDF/ChatIAs/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2344
  err
CUDA error
[New LWP 64005]
[New LWP 64004]
[New LWP 64003]
[New LWP 64002]
[New LWP 64001]
[New LWP 64000]
[New LWP 63999]
[New LWP 63605]
[New LWP 63604]
[New LWP 63603]
[New LWP 63602]
[New LWP 63601]
[New LWP 63600]
[New LWP 63599]
[New LWP 63598]
[New LWP 63597]
[New LWP 63596]
[New LWP 63595]
[New LWP 63594]
[New LWP 63593]
[New LWP 63592]
[New LWP 63591]
[New LWP 63590]
[New LWP 63589]
[New LWP 63588]
[New LWP 63587]
[New LWP 63586]
[New LWP 63585]
[New LWP 63584]
[New LWP 63583]
[New LWP 63582]
[New LWP 63581]
[New LWP 63580]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.fedoraproject.org/>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
Function(s) ^std::(move|forward|as_const|(__)?addressof) will be skipped when stepping.
Function(s) ^std::(shared|unique)_ptr<.*>::(get|operator) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|(forward_)?list|(unordered_|flat_)?(multi)?(map|set)|span)<.*>::(c?r?(begin|end)|front|back|data|size|empty) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|span)<.*>::operator.] will be skipped when stepping.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f47c40876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
#0  0x00007f47c40876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
#1  0x00007f47c407b9da in __internal_syscall_cancel () from /lib64/libc.so.6
#2  0x00007f47c407ba24 in __syscall_cancel () from /lib64/libc.so.6
#3  0x00007f47c40eb5af in wait4 () from /lib64/libc.so.6
#4  0x00007f47c8b35fb6 in ggml_abort () from libggml-base.so
#5  0x00007f47c8c93963 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from libggml-cuda.so
#6  0x00007f47c8c9edbe in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from libggml-cuda.so
#7  0x00007f47c8b4b344 in ggml_backend_sched_graph_compute_async () from libggml-base.so
#8  0x00007f47d5b9d371 in llama_context::graph_compute(ggml_cgraph*, bool) () from libllama.so
#9  0x00007f47d5ba0ef8 in llama_context::decode(llama_batch&) () from libllama.so
#10 0x00007f47d5ba219b in llama_decode () from libllama.so
#11 0x000000000048b040 in server_context::update_slots() ()
#12 0x000000000045b25c in server_queue::start_loop() ()
#13 0x0000000000426020 in main ()
[Inferior 1 (process 63579) detached]
@ggerganov
Member

CUDA does not support FA + MLA, although it should not crash like this - it should fall back to the CPU.

@Panchovix
Author

It seems PR #13306 fixes this issue.

@CISC linked a pull request May 6, 2025 that will close this issue