Debugging
How to find errors in my runs?
OmniOpt2 writes several logs to the run folder. Start with the .stdout file of the run: when OmniOpt2 detects an error, it reports it there.
If the .stdout file contains nothing useful, look into the runs/my_experiment/0/single_runs/ folder. It contains one directory per worker, each holding that worker's output. A worker log looks something like this:
submitit INFO (2024-07-08 17:34:57,444) - Starting with JobEnvironment(job_id=2387026, hostname=thinkpad44020211128, local_rank=0(1), node=0(1), global_rank=0(1))
submitit INFO (2024-07-08 17:34:57,445) - Loading pickle: /home/norman/repos/OmniOpt/ax/runs/__main__tests__/1/single_runs/2387026/2387026_submitted.pkl
parameters: {'int_param': -100, 'float_param': -100.0, 'choice_param': 1, 'int_param_two': -5}
Debug-Infos:
======== DEBUG INFOS START:
Program-Code: ./.tests/optimization_example --int_param='-100' --float_param='-100.0' --choice_param='1' --int_param_two='-5'
pwd: /home/norman/repos/OmniOpt/ax
File: ./.tests/optimization_example
Size: 4065 Bytes
Permissions: -rwxr-xr-x
Owner: norman
Last access: 1720413747.240997
Last modification: 1718802483.4208295
Hostname: thinkpad44020211128
======== DEBUG INFOS END
./.tests/optimization_example --int_param='-100' --float_param='-100.0' --choice_param='1' --int_param_two='-5'
stdout:
RESULT: -222001.32
Result: -222001.32
EXIT_CODE: 0
submitit INFO (2024-07-08 17:34:57,477) - Job completed successfully
submitit INFO (2024-07-08 17:34:57,477) - Exiting after successful completion
The output of your job (stdout and stderr) appears after the stdout: marker and before the EXIT_CODE: line. Check this output for errors. Slurm errors for your job also show up here.
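To pull the job output and exit code out of a worker log programmatically, something like the following sketch may help. It assumes the markers look as in the sample log above (a stdout: marker, followed later by a line of the form EXIT_CODE: N). The function name extract_job_output is hypothetical and not part of OmniOpt2.

```python
import re


def extract_job_output(log_text):
    """Return (job_output, exit_code) parsed from the text of one worker log.

    Assumption: the job output starts after the first 'stdout:' marker and
    ends at the first following 'EXIT_CODE: N' line, as in the sample log.
    exit_code is None if no EXIT_CODE line is found.
    """
    output = []
    exit_code = None
    in_output = False
    for line in log_text.splitlines():
        stripped = line.strip()
        if not in_output:
            if stripped.startswith("stdout:"):
                in_output = True
                # Keep any text that follows the marker on the same line.
                rest = stripped[len("stdout:"):].strip()
                if rest:
                    output.append(rest)
            continue
        match = re.fullmatch(r"EXIT_CODE: (-?\d+)", stripped)
        if match:
            exit_code = int(match.group(1))
            break
        output.append(line)
    return "\n".join(output), exit_code
```

You could call this on the contents of every log file found below runs/my_experiment/0/single_runs/ and print only those whose exit code is not 0. The exact log file names inside each worker directory are not shown here, so check them in your own run folder.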
Also check the exit code. Some exit codes, such as 137, have a special meaning. The table below lists the special exit codes.
The block between DEBUG INFOS START: and DEBUG INFOS END describes the command that is about to be executed. OmniOpt2 searches the command string for file paths and prints the size, permissions, owner and access times of each file it finds. This helps you check whether the scripts you call really have the x flag, are readable, and so on. Every path-like string is checked, but it is only printed here if it points to an existing file.
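If you want to run a similar check yourself before submitting a job, the sketch below may be useful. It is not OmniOpt2's actual implementation, only an approximation of the idea: split a command string into tokens, and report size, permissions and access rights for every token that names an existing file. The function name describe_paths and the handling of --option=value tokens are assumptions made for this example.

```python
import os
import shlex
import stat


def describe_paths(command):
    """Return one dict per token of `command` that names an existing file."""
    infos = []
    for token in shlex.split(command):
        # Check the token itself and, for --option=value, the value part.
        candidates = [token]
        if "=" in token:
            candidates.append(token.split("=", 1)[1])
        for candidate in candidates:
            if os.path.isfile(candidate):
                st = os.stat(candidate)
                infos.append({
                    "path": candidate,
                    "size": st.st_size,
                    # Same notation as `ls -l`, e.g. -rwxr-xr-x
                    "permissions": stat.filemode(st.st_mode),
                    "readable": os.access(candidate, os.R_OK),
                    "executable": os.access(candidate, os.X_OK),
                })
    return infos
```

A script that shows up with "executable": False here would most likely fail with exit code 126 when the worker tries to run it.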
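For example, exit code 137 means that the job was killed with SIGKILL, most likely because it ran out of RAM. In that case, increase the amount of RAM for your workers. In general, an exit code above 128 means that the process was terminated by a signal, and the code is 128 plus the signal number (137 = 128 + 9).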
Exit Code | Description |
---|---|
0 | Success |
1 | General error |
2 | Misuse of shell builtins |
126 | Command invoked cannot execute |
127 | Command not found |
128 | Invalid argument to exit |
129 | Hangup (SIGHUP) |
130 | Interrupt (SIGINT) |
131 | Quit (SIGQUIT) |
132 | Illegal instruction (SIGILL) |
133 | Trace/breakpoint trap (SIGTRAP) |
134 | Abort (SIGABRT) |
135 | Bus error (SIGBUS) |
136 | Floating-point exception (SIGFPE) |
137 | Killed (SIGKILL) - maybe caused by OOM killer |
139 | Segmentation fault (SIGSEGV) |
141 | Broken pipe (SIGPIPE) |
142 | Alarm clock (SIGALRM) |
143 | Termination (SIGTERM) |
147 | Stopped (SIGSTOP) |
151 | Urgent condition on socket (SIGURG) |
153 | File size limit exceeded (SIGXFSZ) |
154 | Virtual timer expired (SIGVTALRM) |
155 | Profiling timer expired (SIGPROF) |
156 | Window size change (SIGWINCH) |
157 | I/O now possible (SIGPOLL) |
158 | Power failure (SIGPWR) |
159 | Bad system call (SIGSYS) |
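
Instead of looking up signal-related codes by hand, you can let Python's standard signal module translate them. The sketch below assumes the usual shell convention that an exit code above 128 stands for 128 plus a signal number. The function name describe_exit_code is hypothetical, and the signal names come from the machine the code runs on, so run it on the same kind of system as your workers.

```python
import signal


def describe_exit_code(code):
    """Give a short description for a job's exit code."""
    if code == 0:
        return "success"
    if code > 128:
        try:
            # 128 + N conventionally means "terminated by signal N".
            return "terminated by " + signal.Signals(code - 128).name
        except ValueError:
            pass  # no signal with that number on this platform
    return "error (not a known signal)"


for code in (0, 1, 137, 143):
    print(code, describe_exit_code(code))
```

On Linux, 137 is reported as SIGKILL and 143 as SIGTERM.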