Debugging

How do I find errors in my runs?

OmniOpt2 saves many logs in the run folder. First, check the .stdout file of the run: when OmniOpt2 detects errors, it reports them there.

If you don't find anything useful in the .stdout file, look into the runs/my_experiment/0/single_runs/ folder. It contains the output of each worker in a separate directory. A worker's output looks something like this:

submitit INFO (2024-07-08 17:34:57,444) - Starting with JobEnvironment(job_id=2387026, hostname=thinkpad44020211128, local_rank=0(1), node=0(1), global_rank=0(1))
submitit INFO (2024-07-08 17:34:57,445) - Loading pickle: /home/norman/repos/OmniOpt/ax/runs/__main__tests__/1/single_runs/2387026/2387026_submitted.pkl
parameters: {'int_param': -100, 'float_param': -100.0, 'choice_param': 1, 'int_param_two': -5}
Debug-Infos:
========
DEBUG INFOS START:
Program-Code: ./.tests/optimization_example --int_param='-100' --float_param='-100.0' --choice_param='1'  --int_param_two='-5'
pwd: /home/norman/repos/OmniOpt/ax
File: ./.tests/optimization_example
Size: 4065 Bytes
Permissions: -rwxr-xr-x
Owner: norman
Last access: 1720413747.240997
Last modification: 1718802483.4208295
Hostname: thinkpad44020211128
========
DEBUG INFOS END

./.tests/optimization_example --int_param='-100' --float_param='-100.0' --choice_param='1'  --int_param_two='-5'
stdout:
RESULT: -222001.32

Result: -222001.32
EXIT_CODE: 0
submitit INFO (2024-07-08 17:34:57,477) - Job completed successfully
submitit INFO (2024-07-08 17:34:57,477) - Exiting after successful completion

The output (stdout and stderr) of your job appears after the stdout: line and before the EXIT_CODE: line. Check this output for errors. Slurm errors for your job would also show up here.
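To pull the job output and the exit code out of a worker log programmatically, a minimal sketch could look like the following. It assumes the log layout shown in the example above; the function name extract_job_output is hypothetical and not part of OmniOpt2.

```python
import re


def extract_job_output(log_text):
    """Return (job_output, exit_code) parsed from one worker log.

    Assumes the job output sits between a line reading 'stdout:' and a
    line starting with 'EXIT_CODE:', as in the example log above.
    """
    output_lines = []
    exit_code = None
    in_output = False
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped == "stdout:":
            in_output = True
            continue
        match = re.match(r"EXIT_CODE:\s*(-?\d+)", stripped)
        if match:
            exit_code = int(match.group(1))
            in_output = False
            continue
        if in_output:
            output_lines.append(line)
    return "\n".join(output_lines).strip(), exit_code
```

If exit_code comes back as None, the log contained no EXIT_CODE: line, which may itself be a hint that the worker was stopped before it could finish writing.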

Also check the exit code. Some exit codes, such as 137, have special meanings; they are listed in the table below.

The block between DEBUG INFOS START: and DEBUG INFOS END describes the command that is about to be executed. OmniOpt2 searches the command string for file paths and, for each one, prints the file's size, permissions, owner and so on. This is useful for checking whether the scripts you call really have the x flag, are readable and so on. Every path-like string is checked, but it is only printed if it points to an existing file.
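To reproduce this kind of check by hand for a command of your own, a rough sketch is shown below. It is not OmniOpt2's actual implementation; the function name describe_files_in_command and the returned dictionary keys are made up for illustration. It splits the command like a shell would, treats each token (and the value after an = sign) as a path candidate, and reports only candidates that are existing files.

```python
import os
import shlex
import stat


def describe_files_in_command(command):
    """List size and permissions of every existing file named in a command."""
    infos = []
    for token in shlex.split(command):
        # Check the token itself and, for --key=value tokens, the value part.
        candidates = [token]
        if "=" in token:
            candidates.append(token.split("=", 1)[1])
        for candidate in candidates:
            if os.path.isfile(candidate):
                st = os.stat(candidate)
                infos.append({
                    "file": candidate,
                    "size": st.st_size,
                    # Same style as 'ls -l', for example -rwxr-xr-x
                    "permissions": stat.filemode(st.st_mode),
                    "readable": os.access(candidate, os.R_OK),
                    "executable": os.access(candidate, os.X_OK),
                })
    return infos
```

A script whose entry shows "executable": False cannot be started directly and would typically fail with exit code 126.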

If, for example, the exit code is 137, your job was most likely killed because it ran out of RAM, and you need to increase the amount of RAM for your workers.
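To find out which workers ended with such an exit code, one option is to scan every file below the single_runs folder for EXIT_CODE: lines. The sketch below does that; since the names of the log files inside each worker directory are not assumed here, it simply reads all files. The function name find_failed_workers is hypothetical.

```python
import os
import re

# Matches lines such as 'EXIT_CODE: 137'
EXIT_CODE_RE = re.compile(r"^EXIT_CODE:\s*(-?\d+)\s*$", re.MULTILINE)


def find_failed_workers(single_runs_dir):
    """Map each file containing a non-zero EXIT_CODE line to that exit code."""
    failed = {}
    for root, _dirs, files in os.walk(single_runs_dir):
        for name in files:
            path = os.path.join(root, name)
            # Binary files (for example pickles) are decoded leniently.
            with open(path, "r", encoding="utf-8", errors="replace") as handle:
                text = handle.read()
            for match in EXIT_CODE_RE.finditer(text):
                code = int(match.group(1))
                if code != 0:
                    failed[path] = code
    return failed
```

Calling find_failed_workers("runs/my_experiment/0/single_runs") would then return a dictionary of file paths and their non-zero exit codes.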

Exit Code Description
0 Success
1 General error
2 Misuse of shell builtins
126 Command invoked cannot execute
127 Command not found
128 Invalid argument to exit
129 Hangup (SIGHUP)
130 Interrupt (SIGINT)
131 Quit (SIGQUIT)
132 Illegal instruction (SIGILL)
133 Trace/breakpoint trap (SIGTRAP)
134 Abort (SIGABRT)
135 Bus error (SIGBUS)
136 Floating-point exception (SIGFPE)
137 Killed (SIGKILL) - maybe caused by OOM killer
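138 User-defined signal 1 (SIGUSR1)
139 Segmentation fault (SIGSEGV)
140 User-defined signal 2 (SIGUSR2)
141 Broken pipe (SIGPIPE)
142 Alarm clock (SIGALRM)
143 Termination (SIGTERM)
147 Stopped (SIGSTOP)
151 Urgent condition on socket (SIGURG)
153 File size limit exceeded (SIGXFSZ)
154 Virtual timer expired (SIGVTALRM)
155 Profiling timer expired (SIGPROF)
156 Window size change (SIGWINCH)
157 I/O now possible (SIGPOLL)
158 Power failure (SIGPWR)
159 Bad system call (SIGSYS)

Exit codes above 128 are 128 plus the number of the signal that ended the process. The signal numbers used here are those of Linux on x86 and ARM; other platforms may number some signals differently.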
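Shells report a process that was ended by a signal with the exit code 128 plus the signal number, so such codes can also be decoded on the machine where the job ran instead of being looked up by hand. The following is a small sketch using Python's standard signal module; the function name describe_exit_code is hypothetical and not part of OmniOpt2.

```python
import signal


def describe_exit_code(code):
    """Give a short, human-readable description of a shell exit code."""
    if code == 0:
        return "success"
    if code > 128:
        try:
            # Exit codes above 128 mean 128 + signal number.
            name = signal.Signals(code - 128).name
        except ValueError:
            return f"exit code {code} (no matching signal on this platform)"
        return f"terminated by {name}"
    return f"exit code {code}"


print(describe_exit_code(137))
```

Because signal numbers differ between platforms for some signals, this should be run on the same kind of system as the job itself.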