If you're a developer, aim to make your services and applications idempotent and self-healing wherever possible, especially anything run by a queue, job scheduler, or crontab. One thing that makes this happen faster is putting developers on call, so they are the ones who have to fix things when they break.
The book 'Site Reliability Engineering: How Google Runs Production Systems' is often recommended. You could also look at Netflix's published material and tools such as Chaos Monkey: they deliberately break parts of their own system to verify that it recovers well. It would also help if you said which meaning of 'resilient' you have in mind.
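The idea of breaking things on purpose can be tried at a very small scale in a test. The sketch below is not Netflix's tooling; `FlakyService` and `call_with_retry` are hypothetical names, and the failure schedule is scripted rather than random so the result is repeatable.

```python
class FlakyService:
    """Hypothetical dependency that fails on a scripted set of call numbers."""

    def __init__(self, fail_on):
        self.fail_on = set(fail_on)
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.calls in self.fail_on:
            # Injected fault: simulate the dependency being unreachable.
            raise ConnectionError(f"injected failure on call {self.calls}")
        return "ok"


def call_with_retry(func, attempts=3):
    """Call func up to `attempts` times, re-raising the last ConnectionError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except ConnectionError as exc:
            last_error = exc
    raise last_error
```

With `FlakyService(fail_on={1, 2})`, `call_with_retry(service.fetch)` succeeds on the third call. If every attempt fails, the error surfaces instead of being swallowed. A real setup would also add backoff between attempts, which is left out here to keep the sketch short.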
I think Udi Dahan's course Advanced Distributed Systems Design is by far the best and most extensive resource on this topic. It costs money, though; perhaps you can find the videos somewhere.