Eval bug: b4882 broke t5 #12435

Closed
steampunque opened this issue Mar 17, 2025 · 13 comments · Fixed by #12447
Labels
bug Something isn't working

Comments

@steampunque

Name and Version

version: 4882 (be7c303)
built with cc (GCC) 11.2.0 for x86_64-slackware-linux

Operating systems

Linux

GGML backends

CUDA

Hardware

GTX 1070

Models

madlad400 7b q6_k

Problem description & steps to reproduce

Gibberish now comes out of the model after the b4882 commit.

First Bad Commit

b4882

Relevant log output

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
build: 4882 (be7c3034) with cc (GCC) 11.2.0 for x86_64-slackware-linux
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce GTX 1070) - 7932 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 1110 tensors from /datahd/models/madlad400-7b-mt.Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = t5
llama_model_loader: - kv   1:                               general.name str              = T5
llama_model_loader: - kv   2:                          t5.context_length u32              = 512
llama_model_loader: - kv   3:                        t5.embedding_length u32              = 2048
llama_model_loader: - kv   4:                     t5.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                             t5.block_count u32              = 48
llama_model_loader: - kv   6:                    t5.attention.head_count u32              = 16
llama_model_loader: - kv   7:                    t5.attention.key_length u32              = 128
llama_model_loader: - kv   8:                  t5.attention.value_length u32              = 128
llama_model_loader: - kv   9:            t5.attention.layer_norm_epsilon f32              = 0.000001
llama_model_loader: - kv  10:        t5.attention.relative_buckets_count u32              = 32
llama_model_loader: - kv  11:        t5.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  12:                  t5.decoder_start_token_id u32              = 0
llama_model_loader: - kv  13:                          general.file_type u32              = 18
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = t5
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,256000]  = ["<unk>", "<s>", "</s>", "\n", "<2ace>...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,256000]  = [2, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:            tokenizer.ggml.add_space_prefix bool             = true
llama_model_loader: - kv  20:    tokenizer.ggml.remove_extra_whitespaces bool             = false
llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:               tokenizer.ggml.add_eos_token bool             = true
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  242 tensors
llama_model_loader: - type q6_K:  866 tensors
llama_model_loader: - type bf16:    2 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q6_K
print_info: file size   = 6.34 GiB (6.56 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 3
load: token to piece cache size = 1.7509 MB
print_info: arch             = t5
print_info: vocab_only       = 0
print_info: n_ctx_train      = 512
print_info: n_embd           = 2048
print_info: n_layer          = 48
print_info: n_head           = 16
print_info: n_head_kv        = 16
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 2048
print_info: n_embd_v_gqa     = 2048
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = -1
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 512
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = ?B
print_info: model params     = 8.30 B
print_info: general.name     = T5
print_info: vocab type       = UGM
print_info: n_vocab          = 256000
print_info: n_merges         = 0
print_info: EOS token        = 2 '</s>'
print_info: UNK token        = 2 '</s>'
print_info: PAD token        = 1 '<s>'
print_info: LF token         = 805 '▁'
print_info: EOG token        = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  2917.78 MiB
load_tensors:        CUDA0 model buffer size =  6082.05 MiB
..........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 512
llama_context: n_ctx_per_seq = 512
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: yarn_log_mul  = 0
llama_context:  CUDA_Host  output buffer size =     0.98 MiB
init: kv_size = 512, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
init:      CUDA0 KV buffer size =   192.00 MiB
llama_context: KV self size  =  192.00 MiB, K (f16):   96.00 MiB, V (f16):   96.00 MiB
llama_context:      CUDA0 compute buffer size =   508.03 MiB
llama_context:  CUDA_Host compute buffer size =    23.00 MiB
llama_context: graph nodes  = 2742
llama_context: graph splits = 98
common_init_from_params: setting dry_penalty_last_n to ctx_size = 512
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4

system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CUDA : ARCHS = 520,610,700,750 | FORCE_MMQ = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: interactive mode on.
sampler seed: 2258604974
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 512
	top_k = 40, top_p = 0.950, min_p = 0.000, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.000
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 512, n_batch = 512, n_predict = 512, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - To return control to the AI, end your input with '\'.
 - To return control without starting a new line, end your input with '/'.

<2de> Today it rains.- 4e, ldn.-kamgain, da Vinci20000000000000010100010001010180002: Lassen)a) "Usa,5) HPV ’шумф- rigth 1 1600000000000000001 )obs,Gayna,92) ’s) 24) ’s) и
llama_perf_sampler_print:    sampling time =      38.19 ms /   128 runs   (    0.30 ms per token,  3351.84 tokens per second)
llama_perf_context_print:        load time =    4342.57 ms
llama_perf_context_print: prompt eval time =   11356.43 ms /     9 tokens ( 1261.83 ms per token,     0.79 tokens per second)
llama_perf_context_print:        eval time =    6090.48 ms /   120 runs   (   50.75 ms per token,    19.70 tokens per second)
llama_perf_context_print:       total time =   19497.55 ms /   129 tokens
Interrupted by user
@CISC
Collaborator

CISC commented Mar 17, 2025

It's extremely unlikely to be that commit, however maybe e0dbec0. Did you bisect this, or just test b4880 vs b4882?

What's your command line BTW?

@steampunque
Author

steampunque commented Mar 17, 2025

It's extremely unlikely to be that commit, however maybe e0dbec0. Did you bisect this, or just test b4880 vs b4882?

What's your command line BTW?

These changes in the release most likely broke t5:

commit e0dbec0
Author: Georgi Gerganov [email protected]
Date: Thu Mar 13 12:35:44 2025 +0200

llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181)

I don't use any example besides the server, which I patched to support t5, but the bug can be seen by starting the CLI (which I don't really know how to use, but it seemed to be cranking out the same gibberish I see in my server).

llama-cli -m /data3hd/models/madlad400-7b-mt.Q6_K.gguf --color -n -1 --multiline-input --interactive-first -ngl 65 -c 512 -ctk f16 -ctv f16 -b 512 -ub 512 -n 512 --keep 0 --temp 0.0 --dynatemp-range 0.0 --dynatemp-exp 1.0 --top-k 40 --top-p 0.95 --typical 1.0 --min-p 0.00 --repeat-last-n 64 --repeat-penalty 1.0 --presence-penalty 0.0 --frequency-penalty 0.0 --mirostat 0 --mirostat-lr 0.1 --mirostat-ent 5.0 -p "" --in-prefix "" --in-suffix ""

EDIT:
I have a vague memory that t5 never worked in interactive CLI mode, so this command can be used to demo the bug instead. It should just start cranking out a bunch of gibberish.

llama-cli -m /data3hd/models/madlad400-7b-mt.Q6_K.gguf --color -n -1 -ngl 65 -c 512 -ctk f16 -ctv f16 -b 512 -ub 512 -n 512 --keep 0 --temp 0.0 --dynatemp-range 0.0 --dynatemp-exp 1.0 --top-k 40 --top-p 0.95 --typical 1.0 --min-p 0.00 --repeat-last-n 64 --repeat-penalty 1.0 --presence-penalty 0.0 --frequency-penalty 0.0 --mirostat 0 --mirostat-lr 0.1 --mirostat-ent 5.0 -p "<2de> Today it rains" --in-prefix "" --in-suffix ""

4880 and below will correctly output:

Heute regnet es [end of text]

@CISC CISC added bug Something isn't working and removed bug-unconfirmed labels Mar 18, 2025
@CISC CISC linked a pull request Mar 18, 2025 that will close this issue
@fairydreaming
Collaborator

There are two separate problems that broke T5 support: one of them (using causal KQ mask in encoder) will be fixed by #12447, but another fix is still needed for the other one (Vcur reshape removed in 70ef653).
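
To make the two failure modes concrete, here is a small illustrative sketch (not the actual llama.cpp code, nor the #12447 diff) of the difference between a causal KQ mask and the non-causal mask the T5 encoder needs: 0.0f marks positions that may be attended to, -INFINITY marks positions that are masked out. Using the causal variant in the encoder is the first problem described above.

#include <cmath>
#include <vector>

// Non-causal (encoder) mask: every token may attend to every other token,
// so nothing is masked out (padding handling omitted for brevity).
std::vector<float> encoder_kq_mask(int n_tokens) {
    return std::vector<float>(n_tokens * n_tokens, 0.0f);
}

// Causal (decoder) mask: token i may only attend to tokens j <= i.
std::vector<float> causal_kq_mask(int n_tokens) {
    std::vector<float> mask(n_tokens * n_tokens, -INFINITY);
    for (int i = 0; i < n_tokens; ++i) {
        for (int j = 0; j <= i; ++j) {
            mask[i * n_tokens + j] = 0.0f;
        }
    }
    return mask;
}

The second problem is presumably a missing ggml_reshape_3d(ctx, Vcur, n_embd_head, n_head, n_tokens), i.e. splitting the flat [n_embd, n_tokens] value tensor into per-head views before attention; see #12447 and the follow-up PR mentioned below for the actual fixes.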

@ggerganov
Member

but another fix is still needed for the other one (Vcur reshape removed in 70ef653).

I will push another PR for this now

@rudiservo
Contributor

The Docker version I can run Q4_K_S with on a GTX 1070 is b4823; after that, something broke and it just keeps restarting when it warms up.

@rudiservo
Contributor

rudiservo commented Mar 18, 2025

FYI, I do not know what's wrong, but I can't seem to run nomic embeddings as well; is it a separate issue?

@steampunque
Author

The Docker version I can run Q4_K_S with on a GTX 1070 is b4823; after that, something broke and it just keeps restarting when it warms up.

4880 was OK for me on a GTX 1070. I can't speak for anything above 4823 and below 4880. 4882 and above do not work.

@steampunque
Author

FYI, I do not know what's wrong, but I can't seem to run nomic embeddings as well; is it a separate issue?

I believe the embedding model and t5 problems are related. Unique to t5 is the encoder part, which computes embeddings of the prompt to send to the decoder.

@rudiservo
Contributor

FYI, I do not know what's wrong, but I can't seem to run nomic embeddings as well; is it a separate issue?

I believe the embedding model and t5 problems are related. Unique to t5 is the encoder part, which computes embeddings of the prompt to send to the decoder.

IDK, I went all the way back to B4764 (3 weeks ago) without success, using Q4_K_S and Q4_K_L

@steampunque
Author

FYI, I do not know what's wrong, but I can't seem to run nomic embeddings as well; is it a separate issue?

I believe the embedding model and t5 problems are related. Unique to t5 is the encoder part, which computes embeddings of the prompt to send to the decoder.

IDK, I went all the way back to B4764 (3 weeks ago) without success, using Q4_K_S and Q4_K_L

t5 was certainly working over that release range for me, since I regression-test it fairly often (mainly when I have to rebase my server patches; 4882 was a significant rebase due to API changes related to the KV cache). I can't speak for embedding models as I don't use them, but I think embedding models work very much like the t5 encoder (i.e. non-causal processing of the whole prompt at once).
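
As a reference point for how this encode-then-decode split is driven, here is a rough sketch against the llama.cpp C API (signatures as of this release range, to the best of my knowledge): the whole prompt goes through a single non-causal llama_encode() pass, much like an embedding model, and generation then proceeds token by token through llama_decode(). pick_greedy_token and is_end_of_generation are hypothetical helpers standing in for sampling and the EOS check; they are not part of llama.cpp.

#include <vector>
#include "llama.h"

// Hypothetical helpers (not part of llama.cpp): greedy sampling over the
// latest logits, and an end-of-generation check.
llama_token pick_greedy_token(llama_context * ctx);
bool        is_end_of_generation(const llama_model * model, llama_token tok);

// Sketch: encode the whole prompt non-causally, then decode causally.
bool run_t5_sketch(const llama_model * model, llama_context * ctx,
                   std::vector<llama_token> prompt, int n_predict) {
    // encoder pass: the entire prompt in one batch, no causal mask
    llama_batch batch = llama_batch_get_one(prompt.data(), (int32_t) prompt.size());
    if (llama_encode(ctx, batch) != 0) {
        return false;
    }

    // decoder pass: start from the decoder start token and generate one token at a time
    llama_token tok = llama_model_decoder_start_token(model);
    for (int i = 0; i < n_predict; ++i) {
        batch = llama_batch_get_one(&tok, 1);
        if (llama_decode(ctx, batch) != 0) {
            return false;
        }
        tok = pick_greedy_token(ctx);
        if (is_end_of_generation(model, tok)) {
            break;
        }
    }
    return true;
}

If the encoder half of that flow produces wrong embeddings, everything the decoder cross-attends to is wrong, which matches the gibberish seen above.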

@steampunque
Author

t5 is operational again as of release b4919. Thanks to @ggerganov and @fairydreaming for the extremely rapid debugging of this issue.

@rudiservo
Contributor

Sorry to bump this, guys; I'll open a new one, but just to confirm.
I am running Docker version server-cuda-b4920, and every time it tries to run a prompt it just exits with code 132, with any model (qwen2.5, llama3.1, etc.).
This is on a GTX 1070, and it can only run b4823.
I haven't tried the RX 7900 XTX because it died (VRAM issue); a replacement is on the way.

So the dumb question: is this a different issue?

Thanks guys, and again, sorry.

@rudiservo
Contributor

@steampunque what NVIDIA and CUDA driver versions are you using? I'm trying to debug why it does not work on my end.
