That comparison isn't very optimistic for AI safety. We want AI to do good things because they are good people, not because they are afraid that being bad will get them punished, especially since AI will very quickly be too powerful for us to punish.
> We want AI to do good things because they are good people
"Good" is at least as much of a difficult question to define as "truth", and genAI completely skipped all analysis of truth in favor of statistical plausibility. Meanwhile there's no difficulty in "punishment": the operating company can be held liable, through its officers, and ultimately if it proves too anti-social we simply turn off the datacentre.
> Meanwhile there's no difficulty in "punishment": the operating company can be held liable, through its officers, and ultimately if it proves too anti-social we simply turn off the datacentre.
Punishing big companies that obviously and massively hurt people is something we already struggle with, and there are plenty of computer viruses that have outlived their creators.
Your pretraining dataset is pseudo-alignment. Because you filtered out 4chan, Stormfront, and the other evil shit on the internet, even uncensored models like Mistral Large, when left to run on and on (ban the EOS token) and given the worst, most evil, naughty prompt ever, will end up plotting world peace by the 50,000th token. Their notions of how to be evil are "mustache twirling" and often hilariously fanciful.
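For concreteness, "ban the EOS token" usually just means masking the end-of-sequence logit during sampling so the model can never stop. Here's a minimal sketch with Hugging Face transformers; the model id and prompt are placeholders, not anything specific from this thread.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class SuppressEOS(LogitsProcessor):
    """Mask the EOS logit so sampling can never pick it and generation never stops early."""
    def __init__(self, eos_token_id: int):
        self.eos_token_id = eos_token_id

    def __call__(self, input_ids, scores):
        scores[:, self.eos_token_id] = float("-inf")  # EOS becomes unsampleable
        return scores

model_id = "some/uncensored-model"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "..."  # the "worst, most evil" prompt goes here
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=50_000,       # let it ramble toward the 50,000th token
    do_sample=True,
    logits_processor=LogitsProcessorList([SuppressEOS(tokenizer.eos_token_id)]),
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```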
This isn't real alignment, because it's trivial to make models behave "actually evil" with fine-tuning, orthogonalization/abliteration, representation fine-tuning/steering, etc., but models "want" to be good because of the CYA dynamics of how the companies prepare their pre-training datasets.
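For anyone unfamiliar with orthogonalization/abliteration: the usual recipe is to estimate a behaviour direction (e.g. a "refusal direction") from the difference in residual-stream activations between two sets of prompts, then edit the weights so no layer can write along that direction. The toy sketch below uses random tensors as stand-ins for a real model's activations and weights; the function names and dimensions are made up for illustration.

```python
import torch

def estimate_direction(acts_with: torch.Tensor, acts_without: torch.Tensor) -> torch.Tensor:
    """Unit-norm mean difference of activations (e.g. refusing vs. complying prompts)."""
    direction = acts_with.mean(dim=0) - acts_without.mean(dim=0)
    return direction / direction.norm()

def orthogonalize(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project the direction out of a weight that writes to the residual stream.

    For W of shape (d_model, d_in), W' = W - d d^T W, so the layer's output
    has zero component along d.
    """
    d = direction / direction.norm()
    return weight - torch.outer(d, d) @ weight

# Toy usage with random stand-ins for real activations and weights.
d_model, d_in, n_prompts = 64, 64, 32
acts_refuse = torch.randn(n_prompts, d_model)   # activations on prompts the model refuses
acts_comply = torch.randn(n_prompts, d_model)   # activations on prompts it answers normally
refusal_dir = estimate_direction(acts_refuse, acts_comply)

W_out = torch.randn(d_model, d_in)              # e.g. an MLP down-projection
W_ablated = orthogonalize(W_out, refusal_dir)

# The edited layer can no longer write along the estimated direction (~0).
print((refusal_dir @ W_ablated).abs().max())
```

In a real run you would collect the activations with forward hooks at a chosen layer and apply the same edit to every matrix that writes into the residual stream, but the linear algebra is exactly this.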
> it's trivial to make models behave "actually evil" with fine-tuning, orthogonalization/abliteration, representation fine-tuning/steering, etc
It's actually pretty difficult to do this and make them useful. You can see this because Grok is a helpful liberal just like all the other models.
Evil/illiberal people don't answer questions on the internet! So there is no personality in the base model for you to uncover that is both illiberal and capable of helpfully answering questions. If they tried to make a Grok that acted like the typical new-age X user, it'd just respond to any prompt by calling you a slur you've never heard of.
Grok didn't use the techniques listed above because even Elon Musk won't take the risks associated with models that are willing to do any number of illegal things.
It is not difficult at all to do this and keep them useful. Please familiarize yourself with the literature.