Title |
Checkpointing |
Abstract |
When you run a job for any substantial length of time, you hope that nothing interrupts it and forces you to start over. But in any environment, interruption is always a possibility. Checkpoint/Restart (C/R) solutions allow you to resume a job from approximately where it left off. Checkpointing strategies also offer benefits beyond fault tolerance, so learning about them is likely to benefit your own work. This roadmap provides an overview of the checkpoint/restart strategy and a survey of different types of C/R solutions. |
Authors |
['Brandon Barker'] |
Expertise Level |
None |
Learning Outcome |
None |
Learning Resource Type |
asynchronous online training |
Target Group |
['Researchers', 'Research groups', 'Student'] |
Keywords |
['checkpointing', 'fault tolerance', 'best practices', 'containers', 'virtualization'] |
Cost |
None |
Duration |
240 |
Language |
en |
License |
None |
Resource URL Type |
URL |
Start Datetime |
None |
URL |
https://cvw.cac.cornell.edu/checkpoint |
Version Date |
2023-01 |
Provider ID |
urn:ogf.org:glue2:access-ci.org:resource:cider:infrastructure.organizations:898 |
Rating |
None |