Orchestrator

What is the Orchestrator?

Sometimes, partitions contain unstable nodes that have, for example, hardware issues. The Orchestrator lets you react to such situations by detecting configurable strings in the stdout and stderr of your programs. If one of these strings is found, a configured action is triggered. Possible actions include restarting the job, restarting it on a different node, or just excluding the node. This way, a defective node can be detected and automatically skipped in production after only a single test job has failed on it.
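The matching idea described above can be illustrated with a small Python sketch. This is not the Orchestrator's actual implementation. The names ErrorRule, RULES and find_behaviors are hypothetical, and the sketch assumes plain substring matching on the combined job output, which may differ from what the Orchestrator really does. The rule fields (name, match_strings, behavior) and their values mirror the example orchestrator.yaml file in the next section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorRule:
    """One entry of the 'errors' list (hypothetical helper type)."""
    name: str
    match_strings: tuple
    behavior: str


# Rules copied from the example orchestrator.yaml file.
RULES = [
    ErrorRule(
        "GPUDisconnected",
        ("AssertionError: ``AmpOptimizerWrapper`` is only available",),
        "ExcludeNode",
    ),
    ErrorRule("Timeout", ("Timeout",), "RestartOnDifferentNode"),
    ErrorRule("ExampleRestart", ("StartAgain",), "Restart"),
    ErrorRule("StorageError", ("Read/Write failure",), "ExcludeNodeAndRestartAll"),
]


def find_behaviors(output, rules=RULES):
    """Return the behavior of every rule that has at least one of its
    match strings contained in the job output.

    Assumption: matching is a plain substring test.
    """
    return [
        rule.behavior
        for rule in rules
        if any(needle in output for needle in rule.match_strings)
    ]


if __name__ == "__main__":
    job_output = "epoch 3/10\nTimeout while waiting for data loader\n"
    print(find_behaviors(job_output))  # prints ['RestartOnDifferentNode']
```

The function only decides which behaviors apply. Actually excluding a node or resubmitting a job depends on the scheduler and is deliberately left out of this sketch.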

Example orchestrator.yaml file

errors:
  - name: GPUDisconnected
    match_strings:
      - "AssertionError: ``AmpOptimizerWrapper`` is only available"
    behavior: ExcludeNode

  - name: Timeout
    match_strings:
      - "Timeout"
    behavior: RestartOnDifferentNode

  - name: ExampleRestart
    match_strings:
      - "StartAgain"
    behavior: Restart

  - name: StorageError
    match_strings:
      - "Read/Write failure"
    behavior: ExcludeNodeAndRestartAll

This configuration file does the following. When a job ends, and in the output the string...