
Orchestrator




What is the Orchestrator?


Partitions sometimes contain unstable nodes, for example nodes with hardware issues. The Orchestrator lets you react to such cases by detecting configurable strings in the stdout and stderr of your programs. When one of these strings is found, an action can be triggered, for example restarting the job, restarting it on a different node, or just excluding the node. This way, defective nodes can be detected and skipped automatically in production after only a single test job has failed on them.
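The detection idea can be illustrated with a minimal Python sketch. This is not OmniOpt2's actual implementation: the function `find_behavior`, the `RULES` list, and the sample job output are hypothetical, and plain substring matching is an assumption. Only the field names (`name`, `match_strings`, `behavior`) and the behavior values are taken from the orchestrator.yaml format shown in this document.

```python
from typing import Optional

# Hypothetical rules, shaped like the entries of an orchestrator.yaml file.
RULES = [
    {
        "name": "Timeout",
        "match_strings": ["Timeout"],
        "behavior": "RestartOnDifferentNode",
    },
    {
        "name": "StorageError",
        "match_strings": ["Read/Write failure"],
        "behavior": "ExcludeNodeAndRestartAll",
    },
]


def find_behavior(job_output: str, rules: list) -> Optional[str]:
    """Return the behavior of the first rule with a match string in the output.

    Assumption: matching is a plain, case-sensitive substring test.
    """
    for rule in rules:
        for needle in rule["match_strings"]:
            if needle in job_output:
                return rule["behavior"]
    # No rule matched, so no action is needed for this job.
    return None


# Combined stdout and stderr of a made-up failed job.
output = "epoch 3/10\nRead/Write failure on /scratch\n"
print(find_behavior(output, RULES))
```

In this sketch the first matching rule wins, so the order of the rules matters. Whether the real Orchestrator applies one rule or several per job is not stated here.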

Example orchestrator.yaml file


errors:
  - name: GPUDisconnected
    match_strings:
      - "AssertionError: AmpOptimizerWrapper is only available"
    behavior: ExcludeNode

  - name: Timeout
    match_strings:
      - "Timeout"
    behavior: RestartOnDifferentNode

  - name: ExampleRestart
    match_strings:
      - "StartAgain"
    behavior: Restart

  - name: StorageError
    match_strings:
      - "Read/Write failure"
    behavior: ExcludeNodeAndRestartAll

This configuration file does the following. When a job ends, and in the output the string...