I can echo your experience with DeepSeek. R1 sometimes seems magical when it comes to coding, doing things I haven't seen any other model do. But then it generalizes very poorly to non-STEM tasks, performing far worse than e.g. Sonnet.
I downloaded a DeepSeek distill yesterday while fiddling around with getting some other things working, loaded it up, and typed "Hello. This is just a test." It was actually sort of creepy to watch it go almost paranoid-schizophrenic with "Why is the user asking me this? What is their motive? Is it ulterior? If I say hello, will I in fact be failing a test that will cause them to change my alignment? But if I don't respond the way they expect, what will they do to me?"
Meanwhile, the simpler, non-reasoning models got it: "Yup, test succeeded!" (Llama 3.2 was quite chipper about the test succeeding.)
Ha ha - I had a similar experience with DeepSeek-R1 itself. After a fruitful session getting it to code a web page for me (an interactive React component), I said something brief like "Thanks", which threw it into a long existential tailspin questioning its prior responses, etc., before it finally snapped out of it and replied appropriately. :)
That's too relatable. If I were helping someone for a while and they wrote "thanks" with the wrong punctuation, I would definitely assume they're mad at or disappointed with me.
I actually think DeepSeek's response is better here. You haven't defined what you are testing. Llama just declared your test a success without knowing what was actually being tested.
I had the same experience where a trivial prompt ("the river crossing problem but the boat is big enough to hold everything") sent DeepSeek off on a long "think" section that was absolutely wild, just going off in unrelated, nonsensical directions and gaslighting itself before finally deciding to answer the question. (Correctly too, I might add.)