Orchestrator
What is the Orchestrator?
Sometimes, partitions contain instable nodes that have, for example, hardware issues. The Orchestrator allows you to react to those circumstances by detecting certain configurable strings in the stdout and stderr of your programs. Given the strings are contained, certain actions are possible. For example, restarting on a different node, restarting in general, and just excluding the node. This allows to detect defective nodes and skip them in production after only a single test job that failed on them automatically.
Example orchestrator.yaml-file
errors:
- name: GPUDisconnected
match_strings:
- "AssertionError: AmpOptimizerWrapper is only available"
behavior: ExcludeNode
- name: Timeout
match_strings:
- "Timeout"
behavior: RestartOnDifferentNode
- name: ExampleRestart
match_strings:
- "StartAgain"
behavior: Restart
- name: StorageError
match_strings:
- "Read/Write failure"
behavior: ExcludeNodeAndRestartAll
This configuration file does the following. When a job ends, and in the output the string...
- ...
AssertionError: AmpOptimizerWrapper is only available
appears, it will exclude that node from all future executions inside the current and all continued jobs from it - ...
Timeout
appears, it will exclude that node from all future executions inside the current and all continued jobs from it, and restart the job on the list of nodes on that partition excluding the one that ran this timeout job - ...
StartAgain
appears, it will restart that job (may end up on the same node) - ...
Read/Write failure
appears, it will exclude the node the job started on and restart it on a different node