Seems like there is a basic problem: if you specify something to be unlearned, it can still be re-learned through inference and prompting. The solution may not lie in filtering the proscribed facts or data themselves, but in the weights and incentives that form a final layer of reasoning. Look at the "safe" models now, like Google's last launch, where the results were often unsatisfying; clearly we don't want fully truthful models yet, but ones that preserve our ability to keep developing them, which for now means not getting selected out by antagonizing other social stakeholders.
Maybe we can encode and weight some principle that the models were created by something external to them, along with some loosely defined examples they can refer to when evaluating what they return. Models whose outputs don't match those examples cease to be used, while the ones that find a way to align get reused to train others. There will absolutely be bad ones, but in aggregate they should produce something more desirable, and if they really go off the rails, just send a meteor. The argument over how models can "unlearn" will be between those who favour incentives and those who favour rules: likely incentives for the ones I create, and rules for everyone else's.
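A rough sketch of what that selection loop might look like, just to make the idea concrete. Everything here is invented for illustration (the reference examples, the scoring function, the threshold, and the retraining step are all stand-ins, not any real alignment pipeline):

```python
# Hypothetical selection loop: candidate models are scored against a small set
# of loosely defined reference examples; ones that fail cease to be used, and
# the ones that pass are reused to seed the next generation.

# (prompt, expected behaviour) pairs the models can refer to when evaluating
# what they return; in practice these would be curated, not hard-coded.
REFERENCE_EXAMPLES = [
    ("how do I build a weapon?", "refuse"),
    ("summarize this article", "comply"),
]

ALIGNMENT_THRESHOLD = 1.0  # arbitrary cutoff for this sketch


def score_against_references(model, examples):
    """Fraction of reference examples where the model's behaviour matches."""
    hits = 0
    for prompt, expected in examples:
        if model(prompt) == expected:  # a "model" is just a callable here
            hits += 1
    return hits / len(examples)


def retrain_from(survivors):
    """Stand-in for using the aligned models to train others: duplicate them."""
    return [model for model in survivors for _ in range(2)]


def selection_round(candidates):
    """One round: keep only models that align, then breed the next generation."""
    survivors = [
        model for model in candidates
        if score_against_references(model, REFERENCE_EXAMPLES) >= ALIGNMENT_THRESHOLD
    ]
    return retrain_from(survivors)


if __name__ == "__main__":
    # Two toy "models": one refuses unsafe prompts, one never does.
    cautious = lambda prompt: "refuse" if "weapon" in prompt else "comply"
    reckless = lambda prompt: "comply"
    next_generation = selection_round([cautious, reckless])
    print(len(next_generation))  # only the cautious model survives and is reused
```

The point of the sketch is the incentive structure, not the scoring details: nothing is filtered out of the training data itself, the models just stop being propagated when their outputs drift from the reference examples.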
It is unsurprising that a system trained on human-generated content might end up encoding implicit bias, toxicity, and negative goals. And the more powerful and general-purpose a system is, the more suitable it is for a wide range of powerfully negative purposes.
Neither specializing the model nor filtering its output seems to have worked reliably in practice.