Eval bug: b5237 broke Llama Scout #13287
Comments
@JohannesGaessler Same issue as in #13286?
No, the linked PR specifically fixes imatrix. A CUDA error with illegal memory access is almost always an issue with the CUDA code where some edge case is not being considered correctly.
This issue should be fixed by #13294, please confirm.
Same issue with this patch applied:
bash-5.1$ llama-perplexity -m /data3hd/models/Llama-4-Scout-17B-16E-Instruct.Q3_K_H.gguf -ngl 10 -c 1024 -b 128 -fa -f short.txt
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CUDA : ARCHS = 600,610,700,750 | F16 = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
A smaller gguf is now available if you would like to test it:
Can you edit line 103 of
Potential fix: #13299
Seems to work with this patch. Will be doing further testing later today. Thanks for the fast response!
I ran some regressions. While it no longer crashes, the generation quality appears noticeably degraded with the Q2_K_H quant on Llama 4 Scout. The new perplexity is 10.47605444724703512860 over 59513 tokens, while the old build (prior to b5237) showed a final perplexity of 10.46921429389673307437 over the same 59513 tokens, so only a tiny increase in perplexity. Across prompting, however, b5237 is noticeably worse on my 3 test prompts (2 fairly hard questions and one code generation task): b5237 got both questions wrong and generated worse code, while the build prior to b5237 got both questions right and generated better code. So it seems to have gone backwards in performance for some reason: very tiny in the objective perplexity result, but very noticeable on the test prompts.
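For scale, the perplexity change works out to well under a tenth of a percent. A quick check in plain Python, using just the two values quoted above:

```python
ppl_old = 10.46921429389673307437  # prior to b5237 (b5236)
ppl_new = 10.47605444724703512860  # b5237 with the #13299 patch

abs_delta = ppl_new - ppl_old
rel_delta = abs_delta / ppl_old

print(f"absolute increase: {abs_delta:.5f}")          # ~0.00684
print(f"relative increase: {rel_delta * 100:.3f} %")  # ~0.065 %
```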
Did you select 3 random problems and the model just happened to be able to solve all of them or did you select those 3 problems specifically because the model could solve them? |
They are 3 problems I use to help optimize the hybrid layer quants. Yesterday I optimized the Q2_K_H Llama 4 quant with b5236 and it was working very well across my entire gauntlet of test prompts, but it went 0 for 3 on the harder problems I use after the b5237 update. It's still functional on easier prompts, but suddenly going 0 for 3 on the harder prompts is concerning.
Okay, but that doesn't answer how you selected those 3 prompts. I'm specifically asking because you may be experiencing what is called regression towards the mean. For example, let's say someone tests 100 prompts, the model can solve 3/100 of them, and those 3 prompts are then used to determine model performance. The model likely got very lucky on those 3 prompts. If you add a small perturbation to the model it will mostly just shuffle the performance across the 100 prompts around; the performance on the 3 best prompts will likely get worse, but there may be other prompts where it performs better. If you selected 3 prompts completely at random and the performance got worse on those 3 prompts, that is a very different result than if you specifically selected 3 prompts where the model was performing well above average.

Sometime next week I should be able to do a high-precision benchmark run using Elo HeLLM to check your findings. Please also consider https://github.com/ggml-org/llama.cpp/tree/master/tools/perplexity#llama-3-8b-scoreboard, specifically the columns for the token probabilities. At high-bit quantization the token probabilities change on average by multiple percent, but on average the probability of predicting the "correct" token barely moves.
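To make the selection effect concrete, here is a tiny illustrative simulation (everything in it is made up: 100 prompts with arbitrary per-prompt success probabilities). Picking the 3 prompts the "old" run happened to solve and then re-evaluating after a tiny perturbation makes those 3 prompts look much worse on average, even though overall performance across all 100 prompts barely moves:

```python
import random

random.seed(0)
N_PROMPTS, N_TRIALS = 100, 1000

# Hypothetical per-prompt success probabilities (the model's "true" ability on each prompt).
base = [random.uniform(0.1, 0.6) for _ in range(N_PROMPTS)]

drop_on_selected, overall_shift, n_used = 0.0, 0.0, 0
for _ in range(N_TRIALS):
    # One evaluation run with the "old" build: each prompt passes or fails stochastically.
    old = [1 if random.random() < p else 0 for p in base]
    solved = [i for i, s in enumerate(old) if s]
    if len(solved) < 3:
        continue
    # Select 3 prompts *because* the old build solved them (this is the selection bias).
    picked = random.sample(solved, 3)
    # "New" build: same abilities plus a tiny zero-mean perturbation, then re-evaluate.
    perturbed = [min(1.0, max(0.0, p + random.gauss(0.0, 0.02))) for p in base]
    new = [1 if random.random() < p else 0 for p in perturbed]
    drop_on_selected += sum(old[i] - new[i] for i in picked) / 3
    overall_shift += (sum(old) - sum(new)) / N_PROMPTS
    n_used += 1

print("avg drop on the 3 selected prompts:", round(drop_on_selected / n_used, 3))
print("avg shift across all 100 prompts  :", round(overall_shift / n_used, 3))
```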
I did some follow-up testing. The Q2_K_H quant is still very good at writing prose and I am getting no generation artifacts, which was my main goal (I wasn't too worried about the tricky questions, but I was happy it got them too). Given a strong enough model (this thing is 108B and should be strong), I think it's reasonable to assume the model will handle a wide range of trick questions correctly, so in my view suddenly faltering on these 3 tests was an alarm bell. I followed up with an adaptive beam search test (an experimental feature of my server) and it corrected both of the tricky questions, which suggests the model might have been on the knife edge of answering correctly. I would have expected the severe quants to dominate performance, though, not the higher-precision computations going on in the attention block. I'm guessing there may be some rounding/truncation bias that was close to 0 in the previous version but now comes in with a small bias, just enough to kick the model off a correct generation (one early non-optimal token can do it). Based on all my tests I don't see any glaring issue, so I think this bug report can be closed. Thanks again for the fast response!
I ran a 100-question math bench on b5236 and b5237. These 100 questions are guaranteed to be unseen by any model since I created them all; about 1/3 are quite tricky and 2/3 are GSM8K level (fairly simple). For Q2_K_H Llama 4 Scout, the result confirms the small performance degradation on b5237 that I originally concluded from my 3-prompt screening test.
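As a side note on interpreting a 100-question bench: with n = 100, a difference of a few correct answers is usually within noise. A rough sketch of a two-proportion z-test in plain Python (the scores below are placeholders, not the actual b5236/b5237 results):

```python
from math import sqrt, erf

def two_proportion_z(correct_a, correct_b, n=100):
    """Two-sided z-test for the difference between two accuracies on n questions each."""
    pa, pb = correct_a / n, correct_b / n
    pooled = (correct_a + correct_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * (2 / n))
    z = (pa - pb) / se if se > 0 else 0.0
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Placeholder scores, purely illustrative -- substitute the real b5236/b5237 results.
z, p = two_proportion_z(correct_a=62, correct_b=58)
print(f"z = {z:.2f}, p = {p:.2f}")  # a few points of difference on n=100 is typically not significant
```

Since both builds answer the same 100 questions, a paired test such as McNemar's on the per-question pass/fail pairs would be tighter still.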
Name and Version
b5237
Operating systems
Linux
GGML backends
CUDA
Hardware
4070
Models
Llama 4 Scout. The quant should not matter, but I am using my hybrid layer quant here:
https://huggingface.co/steampunque/Llama-4-Scout-17B-16E-Instruct-GGUF/blob/main/Llama-4-Scout-17B-16E-Instruct.Q3_K_H.gguf
Problem description & steps to reproduce
Crash with an illegal memory access when running a perplexity calculation:
short.txt is the first 9783 bytes of wiki.test.raw.
llama-perplexity -m /data3hd/models/Llama-4-Scout-17B-16E-Instruct.Q3_K_H.gguf -ngl 10 -c 1024 -b 128 -fa -f short.txt
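For anyone reproducing this, a minimal sketch of how the test file can be recreated (assumes wiki.test.raw from the wikitext-2 test set is in the current directory):

```python
# Recreate short.txt as the first 9783 bytes of wiki.test.raw, as described above.
with open("wiki.test.raw", "rb") as src, open("short.txt", "wb") as dst:
    dst.write(src.read(9783))
```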
First Bad Commit
b5237. b5236 works fine.
Relevant log output