
🎼 Orchestrator

What is the Orchestrator?

Sometimes, partitions contain unstable nodes that have, for example, hardware issues. The Orchestrator lets you react to such situations by detecting configurable strings in the stdout and stderr of your programs. If one of these strings is found, an action is triggered: restarting the job on a different node, restarting the job in general, or just excluding the node. This makes it possible to detect defective nodes automatically and skip them in production after only a single test job has failed on them.
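To make the idea concrete, here is a minimal Python sketch of the matching step. It is not OmniOpt2's actual implementation: the `ErrorRule` class and the `find_behaviors` function are hypothetical names chosen for illustration, and plain substring matching is an assumption. The rules are written as a Python list that mirrors the `name`, `match_strings` and `behavior` fields of an orchestrator.yaml file.

```python
from dataclasses import dataclass


@dataclass
class ErrorRule:
    # Hypothetical container mirroring one entry under "errors:" in the YAML file.
    name: str
    match_strings: list[str]
    behavior: str


# Illustrative rules only; the field values are taken from the example config.
RULES = [
    ErrorRule("Timeout", ["Timeout"], "RestartOnDifferentNode"),
    ErrorRule("ExampleRestart", ["StartAgain"], "Restart"),
]


def find_behaviors(output: str, rules: list[ErrorRule]) -> list[str]:
    """Return the behavior of every rule that has at least one match string
    contained in the job output (assumed: plain substring search)."""
    return [
        rule.behavior
        for rule in rules
        if any(needle in output for needle in rule.match_strings)
    ]


if __name__ == "__main__":
    # Combined stdout/stderr of a finished job (made-up sample text).
    job_output = "step 12 done\nTimeout while waiting for worker\n"
    print(find_behaviors(job_output, RULES))
```

In this sketch a job whose output matches several rules yields several behaviors. How the real Orchestrator resolves such cases is not covered here.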

Example orchestrator.yaml file

errors:
  - name: GPUDisconnected
    match_strings:
      - "AssertionError: AmpOptimizerWrapper is only available"
    behavior: ExcludeNode
  - name: Timeout
    match_strings:
      - "Timeout"
    behavior: RestartOnDifferentNode
  - name: ExampleRestart
    match_strings:
      - "StartAgain"
    behavior: Restart

This configuration file does the following. When a job ends, and in the output the string...

How to use it

You can call OmniOpt2 with the parameter --orchestrator_file orchestrator.yaml to load such a file.

Valid behaviors: