Anthropic has said they ran benchmarks where Claude takes a GitHub issue and tries to generate a Git commit that passes the unit/integration tests someone else wrote for the real, shipped feature. You also have things like multimodal image recognition for UIs: you can say "generate code for a UI that looks like such and such" and then verify the result with the model's own multimodal capabilities.
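Roughly, that first eval loop looks like this. A minimal sketch, not their actual harness: the repo path is whatever checkout you're scoring against, and pytest stands in for whatever test runner the project actually uses.

```python
import subprocess

def passes_existing_tests(repo_dir: str, patch: str) -> bool:
    """Apply a model-generated patch, then run the project's own test suite.

    The patch is scored purely on whether the pre-existing tests pass,
    which is the point: the tests were written against the real feature,
    not against the model's output.
    """
    apply = subprocess.run(
        ["git", "apply", "-"],
        cwd=repo_dir,
        input=patch,
        text=True,
        capture_output=True,
    )
    if apply.returncode != 0:
        return False  # patch didn't even apply cleanly

    tests = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        cwd=repo_dir,
        capture_output=True,
    )
    return tests.returncode == 0
```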
Tool use means the model can actually click a button and check that the app transitioned to the next described UI screen, again verified with the multimodal capabilities.
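Something like this, using Playwright to do the clicking. The `screen_matches_description` judge here is hypothetical, standing in for whatever vision-model call you'd make with the screenshot and the written description:

```python
from playwright.sync_api import sync_playwright

def screen_matches_description(png_bytes: bytes, description: str) -> bool:
    # Placeholder: send the screenshot plus the expected-screen description
    # to a multimodal model and have it answer yes/no.
    raise NotImplementedError("wire up your vision-model call here")

def click_and_verify(url: str, selector: str, expected_screen: str) -> bool:
    """Click a button in a real browser, screenshot the result,
    and ask a multimodal model whether the new screen matches."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.click(selector)
        shot = page.screenshot()  # returns PNG bytes
        browser.close()
    return screen_matches_description(shot, expected_screen)
```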
There's still an important point to what I said, even in ML. In fact, consider: if what I said is true, ask what that would mean for how the current status quo goes about demonstrating things. Then think about AI safety lol