From reading the blog post and Twitter, and cost of other models, I think it's evident that it IS actually cost per task, see this tweet: https://files.catbox.moe/z1n8dc.jpg
And o1 cost $15/$60 for 1M in/out, so the estimated costs on the graph would match for a single task, not the whole benchmark.