> I would start by making the examples yourself initially

What I'm doing right now is this:

  1) I have problem X to solve using the coding agent.
  2) I ask the agent to do X.
  3) I use my own brain: did the agent do it correctly?
  4) If the agent did not do it correctly, I then ask: should the agent have been able to solve this? If so, I try to improve the agent so it can (sketched below).
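
A minimal sketch of that loop in Python -- run_agent and judge are hypothetical names I'm inventing here, and judge is really me, which is exactly the part that resists automation:

  # Hypothetical harness for the loop above; run_agent and judge are
  # placeholders, not real APIs.
  def run_agent(task):
      ...  # placeholder: hand the task to the coding agent

  def judge(task, output):
      ...  # placeholder: step 3, my own one-off judgment

  def eval_loop(tasks):
      failures = []
      for task in tasks:
          output = run_agent(task)             # step 2: ask the agent to do X
          if not judge(task, output):          # step 3: did it do it correctly?
              failures.append((task, output))  # step 4: candidates for improving the agent
      return failures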

The hardest part about automating this is #3 above -- each evaluation is a one-off, and it would be hard even to formalize the evaluation.

SWE-bench, for example, uses unit tests for this, and the agent is blind to them -- the agent has to make a red test (which it has never seen) go green.
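
Roughly, that grading step works like this (an illustrative sketch, not the actual SWE-bench harness): apply the agent's patch, then run the held-out tests and check that the red one went green.

  import subprocess

  def grade(repo_dir, patch_file, hidden_tests):
      # Apply the patch the agent produced without ever seeing the tests.
      subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
      # Run the held-out tests; success means the red test now passes.
      result = subprocess.run(["pytest", *hidden_tests], cwd=repo_dir)
      return result.returncode == 0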
