Decent, bit repetitive
Been trying the model and while getting good results, I'm seeing a lot of repetition of phrases, more so than other models. (Mind you, I use Q2 for speed, so that could be a major component.)
Otherwise... it seems to be pretty decent.
Thank you!
Yeah, I would expect a quant that low to introduce really suspect behavior in most models. I would not file any warranty claims if the model was quanted to under 4 bits, and even that's kinda sketchy
To clarify: Llama 3 GGUF quality starts falling off a cliff once you hit Q2 and can reach around double the perplexity of the base fp16 model https://github.com/ggml-org/llama.cpp/pull/6745#issuecomment-2093892514
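For context, perplexity is just the exponential of the average negative log-likelihood per token, so "double the perplexity" means the quant is meaningfully worse at predicting every single token. A toy sketch (the nats-per-token numbers here are made up for illustration, not real benchmark measurements):

```python
import math

def perplexity(nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(nlls) / len(nlls))

# Hypothetical per-token NLLs in nats, NOT real measurements:
fp16_ppl = perplexity([1.6, 1.6, 1.6, 1.6])  # ~4.95
q2_ppl = perplexity([2.3, 2.3, 2.3, 2.3])    # ~9.97, roughly double
```

A bump of only ~0.7 nats per token is enough to double perplexity, which is why the quality cliff at Q2 shows up so clearly in that metric.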
Mhmm. But unless I can run as much on the GPU as possible, it takes forever to generate tokens. (Trying a 110B model, I've found 0.5 T/s is unacceptable; 2+ T/s is my minimum these days.) If I could, I'd run all my models at Q8_0.
The solution, of course, is for Nvidia and the other GPU companies (or ones specialized for SD & LLMs, even with fewer video-card features) to make really large VRAM cards an option that isn't too expensive. We may not see that for a few years, or ten (depending on whether certain companies plan on buying up all the RAM). That, or diffusion LLMs like LLaDA become the standard, assuming that approach proves itself competent.
Darkhn's 70B-Animus-V12.5 does a really good job of avoiding too much duplication even at low quantization.
Well, I guess take my comments with a grain of salt.
(doesn't need to be closed! i think its good feedback regardless)
Mhmm. But unless I can run as much on the GPU as possible, it takes forever to generate tokens
Yeah, 70Bs generally aren't great for home use unless you're comically rich because of that, tbh
Darkhn's 70B-Animus-V12.5 does a really good job of avoiding too much duplication even at low quantization.
Reading the description here, if I had to speculate out loud, I'm wondering if either
- since their process was simpler (SFT-only, it appears) than ours (SFT + DPO), maybe the weights were more amenable to quantization?
- since they used a B200 and it still took so long, I'm guessing it was full finetuning, while we used LoRA for both stages; maybe that makes quantization worse / regress back toward the base slop model more?
Intuition says LoRA is more likely at fault here.
Out of curiosity, I wonder what would happen if you applied the adapter in-flight. Someone in our server brought up the idea of shipping the adapter separately, and while that's not common in the text generation space because in-flight loading isn't terribly common... maybe it should be?
My thinking is smth like
(base@Q2) + (adapter@fp16) =/= quantize(base@fp16+adapter@fp16)
But I could just be completely off base here.
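Here's a toy numpy sketch of that intuition. The quantizer is a crude symmetric round-to-nearest, nothing like llama.cpp's block-wise k-quants, and the matrices are random stand-ins rather than real weights, but it illustrates why applying an fp16 adapter on top of an already-quantized base preserves the finetune delta exactly, while quantizing the merged weights can drown it in quantization noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins: base weights plus a small LoRA-style finetune delta.
base = rng.normal(0.0, 1.0, size=(64, 64))
adapter = 0.05 * rng.normal(0.0, 1.0, size=(64, 64))

def quantize(w, bits=2):
    # Crude symmetric round-to-nearest quantizer; real GGUF quants are
    # block-wise with per-block scales, but the ordering argument holds.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

in_flight = quantize(base) + adapter   # (base@Q2) + (adapter@fp16)
merged_q = quantize(base + adapter)    # quantize(base@fp16 + adapter@fp16)

# How much of the adapter survives each route, relative to the quantized base?
delta_in_flight = in_flight - quantize(base)  # exactly the adapter
delta_merged_q = merged_q - quantize(base)    # adapter buried in quant noise

corr = lambda a, b: np.corrcoef(a.ravel(), b.ravel())[0, 1]
print("in-flight adapter correlation:", corr(delta_in_flight, adapter))
print("merged-quant adapter correlation:", corr(delta_merged_q, adapter))
```

At 2 bits the grid is so coarse that a small finetune delta mostly vanishes when the merged weights are quantized, whereas the in-flight route keeps it at full precision. Whether this matters in practice for real k-quants is an open question, but it's at least consistent with the "adapter shipped separately" idea.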
Was this QLoRA or full-precision LoRA?
(doesn't need to be closed! i think its good feedback regardless)
Sorry, I half figured that if it's quantized this much, it's like marketing a truck and then judging it by a kid's Barbie toy truck; the feedback may not quite reflect results from larger/fuller quants.
Makes me wonder whether a model that started dropping guessable grammar words (I, he, she, the, they, them, it, you, me) after 12k context was due to quantization too, but that was with a 4GB card I had... I'd have to check back through my interaction logs. That was when I was running Q4-Q6 on 12B-and-under models, though.
Yeah, 70Bs generally aren't great for home use unless you're comically rich because of that, tbh
I tend to download the i1-Q6-K models for storage, then quantize them down to use them locally. For 30B and under I usually do Q3-Q4. Though the Trifecta model works really well for its weight.
With an 8GB GPU I get a respectable 2-5 tokens a second at Q2 (fast enough that it's like chatting with a real human). Though once you pass the ~20k-24k context threshold it slows to a crawl, so 8k-16k is all I can really handle.
Kinda wish models would work as a pair: one very terse, giving basic instructions over a large context in an encoding, and a secondary one that works in a 4k window and adds all the other stuff to make it legible and flow. The important context could be 'go car apples', meaning 'I have to leave the house, using the car to go to the store to get some apples'.
Not sure. We'll have to see whether such a model, beyond draft/speculative decoding or text diffusion, will do the job.
Darkhn's 70B-Animus-V12.5 does a really good job of avoiding too much duplication even at low quantization.
Reading the description here, if I had to speculate out loud, I'm wondering if either
- since their process was simpler (SFT-only, it appears) than ours (SFT + DPO), maybe the weights were more amenable to quantization?
Maybe? I'm not the most knowledgeable yet on how it all works, so I'm not sure what those terms or processes mean. Might ask him to join the discussion here and get his feedback.
I do notice the Animus model is something like 10% faster than other 70B models. Maybe some layers are also removed. (As I understand it, some censorship layers are added by companies to the original models, but I've seen no censorship at all with Animus.)
- since they used a B200 and it still took so long, I'm guessing it was full finetuning, while we used LoRA for both stages; maybe that makes quantization worse / regress back toward the base slop model more?
Well, he said it was his magnum opus in models to this point.
I'd like to chime in and say that this model fits very well as a decent "generalist" model for catching the finer details of most characters, with occasional swaps to Zerofata-Unleashed Lemonade v3 and Ionze_Lilitu depending on the tone of the story (which might be why Zerofata quanted it? They've got really good taste). That being said, the repetition is minor (if noticeable) at 4.25 bpw exl3 (2x 3090) and easily edited/banned out. I am curious what you all would do with a proper finetune, because this is already one of my favorite models (and I've tested A LOT). Any chance we could see more larger-model work in the future (the new GLM air 4.7 when released, or 70B)? There's already a ton of smaller models being released, and you've all done such a wonderful job in 70B that not having even a small follow-up iteration feels like a little bit of a blue ball, haha.
Been trying the model and while getting good results, I'm seeing a lot of repetition of phrases, more so than other models. (Mind you, I use Q2 for speed, so that could be a major component.)
Otherwise... it seems to be pretty decent.
Same experience here... tested at 4.25bpw, compared with other L3.3 models like The-Omega-Directive-70B-Unslop-v2.1, Sapphira, etc.
Good prose and great logic and understanding, but compared to those it does have a slight tendency to repeat certain phrases.
Prose is a little ahead of Omega 2.1 but a little more concise (for better or worse, it does not output more than it needs to), and maybe even slightly better in logic. In the same vein, it's much more stable at stopping than Sapphira, which has a tendency to run on, but it doesn't have Sapphira's degree of creativity and certain charm in prose.
When it comes to variety, it's well known that LexiFreak is an industry leader in gnawing on it twelve different ways. You're out here doing qualitative lexical analysis; I'm out here wondering if the scars will ever heal. We are not the same.
Yeah, 70Bs generally aren't great for home use unless you're comically rich because of that, tbh
You'd be surprised how long I went without food for my 2nd 3090. 😣
Yeah, 70Bs generally aren't great for home use unless you're comically rich because of that, tbh
You'd be surprised how long I went without food for my 2nd 3090. 😣
My condolences. Though as long as you have over 5% body fat, your body will mostly burn fat before breaking down more important tissues.
For the first time in my life I could justify computer hardware at a halfway decent level, only to see it's way below par. Been playing more with 20B-30B models, trying to find ones with a good balance. A lot of them start to feel the same: not quite good enough, a little too easy to push around, a little weak in the aspects that make them feel off.
I hope NPUs and GPUs and VRAM get cheaper, cheap enough to be affordable for everyone.