Large distributed jobs at Google can literally take several times the expected lifetime of an average server to run. When you run such a job, you must expect that it will wind up running on computers of all stages of repair, and some of the computers that it runs on will come out of commission during your job and will have to be replaced by others.
This scale requires defense in depth. Health checks are necessary, but not sufficient. You also need to minimize the problems if the machine you're on is not yet known to be bad. And deal with double checking results because at that scale you can't trust the hardware. And so on.
Large distributed jobs at Google can literally take several times the expected lifetime of an average server to run. When you run such a job, you must expect that it will wind up running on computers of all stages of repair, and some of the computers that it runs on will come out of commission during your job and will have to be replaced by others.
This scale requires defense in depth. Health checks are necessary, but not sufficient. You also need to minimize the problems if the machine you're on is not yet known to be bad. And deal with double checking results because at that scale you can't trust the hardware. And so on.