Hi,
I use two machines for OpenIFS experiments: a desktop PC for testing and a small 16-core server for production runs. The directory in which I build and run OpenIFS is mounted on both machines, and the two should have identical environments (same compiler versions etc.). However, although I can build and run on the desktop, on the server the build succeeds but the program fails at startup with the following backtrace:
signal_drhook(SIGABRT=6): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGBUS=7): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGSEGV=11): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGSTKFLT=16): New handler installed at 0xac378a; old preserved at 0x0
signal_drhook(SIGFPE=8): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGILL=4): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGTRAP=5): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGINT=2): New handler installed at 0xac378a; old preserved at 0x0
signal_drhook(SIGQUIT=3): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGTERM=15): New handler installed at 0xac378a; old preserved at 0x0
signal_drhook(SIGXCPU=24): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
signal_drhook(SIGSYS=31): New handler installed at 0xac378a; old preserved at 0x2afb433b2d60
JSETSIG: sl->active = 0
signal_harakiri(SIGALRM=14): New handler installed at 0xabeae4; old preserved at 0x0
***Received signal = 4 and ActivatED SIGALRM=14 and calling alarm(10), time = 0.01
[myproc#1,tid#1,pid#2415,signal#4(SIGILL)]: Received signal :: 17MB (heap), 17MB (rss), 0MB (stack), 0 (paging), nsigs 1, time 0.01
tid#1 starting drhook traceback, time = 0.01
[myproc#1,tid#1,pid#2415]: MASTER
[myproc#1,tid#1,pid#2415]: CNT0<1>
tid#1 starting sigdump traceback, time = 0.01
[gdb__sigdump] : Received signal#4(SIGILL), pid=2415
[LinuxTraceBack]: Backtrace(s) for program './master.exe' (pid=2415) :
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/utilities/linuxtrbk.c:109 : master.exe() [0xaf4ce8]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/drhook.c:883 : master.exe() [0xabebe1]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/drhook.c:1119 : master.exe() [0xac3b5d]
(pid=2415): <Unknown> : libpthread.so.0(+0x10330) [0x2afb43e18330]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/user_clock.F90:67 : master.exe() [0xb007cf]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/gstats.F90:153 : master.exe() [0xad288e]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifs/control/cnt0.F90:112 : master.exe() [0x409f7f]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/programs/master.F90:65 : master.exe() [0x408f06]
(pid=2415): <Unknown> : libc.so.6(__libc_start_main+0xf5) [0x2afb44047f45]
(pid=2415): <Unknown> : master.exe() [0x408f7d]
[LinuxTraceBack] : End of backtrace(s)
Done tracebacks, calling exit with sig=4, time = 0.05
ABORT! 1 Dr.Hook calls ABOR1 ...
[myproc#1,tid#1,pid#2415]: MASTER
[myproc#1,tid#1,pid#2415]: CNT0<1>
SDL_TRACEBACK: Calling LINUX_TRBK, THRD = 1
[LinuxTraceBack]: Backtrace(s) for program './master.exe' (pid=2415) :
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/utilities/linuxtrbk.c:109 : master.exe() [0xaf4ce8]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/utilities/linuxtrbk.c:189 : master.exe() [0xaf4d1d]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/module/sdl_mod.F90:71 : master.exe() [0xb0599f]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/abor1.F90:37 : master.exe() [0xab3417]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/drhook.c:1123 : master.exe() [0xac3bb1]
(pid=2415): <Unknown> : libpthread.so.0(+0x10330) [0x2afb43e18330]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/user_clock.F90:67 : master.exe() [0xb007cf]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifsaux/support/gstats.F90:153 : master.exe() [0xad288e]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/ifs/control/cnt0.F90:112 : master.exe() [0x409f7f]
(pid=2415): /network/aopp/cirrus/pred/hatfield/openifs-cy38r1/src/programs/master.F90:65 : master.exe() [0x408f06]
(pid=2415): <Unknown> : libc.so.6(__libc_start_main+0xf5) [0x2afb44047f45]
(pid=2415): <Unknown> : master.exe() [0x408f7d]
[LinuxTraceBack] : End of backtrace(s)
SDL_TRACEBACK: Done LINUX_TRBK, THRD = 1
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 2415 on node cirrus1 exited on signal 9 (Killed).
--------------------------------------------------------------------------
We have made some modifications to OpenIFS, but I don't think the bug is on our side, because the same code runs fine on the desktop PC. Signal 4 is SIGILL, and the first application frame below the signal handler is user_clock.F90:67, so it looks like an illegal instruction in one of the clock functions. Any idea what's going wrong?
Previously I was getting a similar error originating from drhook.c line 4040, but that's gone away for some reason.
I build from scratch on both machines with gcc/gfortran version 4.8.3.
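Since SIGILL generally means the CPU was asked to execute an instruction it does not support, one thing I can check is whether the two machines actually expose the same CPU feature flags. Below is a minimal sketch of that check, assuming both machines are Linux and that I have saved a copy of /proc/cpuinfo from each; the function names and file names are my own placeholders, not part of OpenIFS. It only compares flags and says nothing about which flags the build actually used.

```python
import sys


def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found (e.g. non-x86 or empty input)


def missing_flags(reference_text, target_text):
    """Flags present on the reference machine but absent on the target machine."""
    return sorted(parse_cpu_flags(reference_text) - parse_cpu_flags(target_text))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python cpuflags.py desktop_cpuinfo.txt server_cpuinfo.txt
    with open(sys.argv[1]) as ref, open(sys.argv[2]) as tgt:
        for flag in missing_flags(ref.read(), tgt.read()):
            print(flag)
```

If this prints anything (for example an AVX or SSE4 variant), the server lacks an instruction set extension that the desktop has, which would be consistent with a SIGILL if any object file or library was compiled for the desktop's CPU.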
Thanks,
Sam Hatfield