TCo2559 test run segmentation fault (model settings)

Hello, I’m testing OpenIFS 43r3 on an Amazon Web Services cloud server. Instance type: hpc8a.96xlarge — AMD EPYC, 4.5 GHz, 192 cores/node, 768 GiB memory, 300 Gbps EFA (Elastic Fabric Adapter). We have successfully installed and run TCo1279; however, when we scaled up to TCo2559 (with identical configurations, compilers, etc.), the following error occurred. We have tried everything we can think of but are still unable to identify the cause. Could anyone please offer some insight? I’m attaching the error message and the run script below.
Error message:
576: 07:19:23 STEP 0 H= 0:00 +CPU=101.688
576: STEP 0 :## EC_MEMINFO | TC | MEMORY USED(MB) | MEMORY FREE(MB) INCLUDING CACHED | %USED %HUGE
576: STEP 0 :## EC_MEMINFO | Malloc| Inc Heap | Numa node 0 | Numa node 1 | |
576: STEP 0 :## EC_MEMINFO Node Name | Heap | RSS(tsk) | Small Huge or | Small Huge or | Total |
576: STEP 0 :## EC_MEMINFO | (tsk) | Small Huge | Only Small | Only Small | Memfree+Cached |
576: STEP 0 :## EC_MEMINFO 1 compute- 440 5296 0 41 211736 2642 261966 479381 47576 1.0 0.0 s/p
3677: forrtl: severe (174): SIGSEGV, segmentation fault occurred
3677: forrtl: severe (174): SIGSEGV, segmentation fault occurred
3689: forrtl: severe (174): SIGSEGV, segmentation fault occurred
3689: forrtl: severe (174): SIGSEGV, segmentation fault occurred
3677: Image PC Routine Line Source
3677: libc.so.6 0000154B1483FC30 Unknown Unknown Unknown
3677: oifs 00000000011C006E Unknown Unknown Unknown
3677: oifs 0000000001257537 Unknown Unknown Unknown
3677: oifs 000000000125419B Unknown Unknown Unknown
3677: oifs 000000000124CAE7 Unknown Unknown Unknown
3677: oifs 00000000011F94A3 Unknown Unknown Unknown
3677: libiomp5.so 0000154B1F920909 __kmp_invoke_micr Unknown Unknown
3677: libiomp5.so 0000154B1F8A0FC8 Unknown Unknown Unknown
3677: libiomp5.so 0000154B1F89FCAE Unknown Unknown Unknown
3677: libiomp5.so 0000154B1F921E62 Unknown Unknown Unknown
3677: libc.so.6 0000154B1488B2EA Unknown Unknown Unknown
3677: libc.so.6 0000154B14910500 Unknown Unknown Unknown
3689: forrtl: severe (174): SIGSEGV, segmentation fault occurred
3689: Image PC Routine Line Source

Runscript:
module purge
source /opt/intel/oneapi/2026.0/oneapi-vars.sh

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export FC=mpiifx
export F77=mpiifx
export MPIFC=mpiifx
export FCFLAGS=-free
export CC=mpiicx
export CXX=mpiicpx
export MKLROOT=/opt/intel/oneapi/mkl/2026.0
export IO_LIB_ROOT=/root/libraries_aws_ifx
export PATH=$IO_LIB_ROOT/bin:/opt/python/3.10/bin:$PATH
export LD_LIBRARY_PATH=/root/libraries_aws_ifx/lib:/opt/intel/oneapi/2026.0/lib:/opt/amazon/efa/lib64:$LD_LIBRARY_PATH
export I_MPI_FABRICS=ofi
export I_MPI_PMI_LIBRARY=/opt/slurm/lib/libpmi2.so
export FI_PROVIDER=efa
export I_MPI_OFI_LIBRARY_INTERNAL=0
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=omp
export I_MPI_PIN_ORDER=spread
export I_MPI_SHM_HEAP_VSIZE=4096
export FI_EFA_IFACE=all
export FI_EFA_USE_DEVICE_RDMA=1
export FI_EFA_ENABLE_SHM_TRANSFER=1
export OMP_STACKSIZE=512M
export KMP_STACKSIZE=512M
export OMP_SCHEDULE=STATIC
export KMP_BLOCKTIME=infinite
export OMP_WAIT_POLICY=ACTIVE
export KMP_AFFINITY=granularity=fine,compact,1,0
export OMP_PLACES=cores
export OMP_PROC_BIND=close
export DR_HOOK_IGNORE_SIGNALS=-1
export SZIPROOT=$IO_LIB_ROOT
export HDF5ROOT=$IO_LIB_ROOT
export HDF5_ROOT=$HDF5ROOT
export NETCDFROOT=$IO_LIB_ROOT
export NETCDFFROOT=$IO_LIB_ROOT
export ECCODESROOT=$IO_LIB_ROOT
export NETCDF_DIR=$IO_LIB_ROOT
export HDF5_C_INCLUDE_DIRECTORIES=$HDF5_ROOT/include
export NETCDF_Fortran_INCLUDE_DIRECTORIES=$NETCDFFROOT/include
export NETCDF_C_INCLUDE_DIRECTORIES=$NETCDFROOT/include
export NETCDF_CXX_INCLUDE_DIRECTORIES=$NETCDFROOT/include
export ECCODES_DEFINITION_PATH=/root/libraries_aws_ifx/share/eccodes/definitions
export GRIB_SAMPLES_PATH=$ECCODESROOT/share/eccodes/ifs_samples/grib1_mlgrib2/

Hi Ja-Yeon,

I read your post a while ago but was not sure how to reply, or whether I could offer anything useful that you haven’t tried yet.

Since TCo1279 runs but TCo2559 crashes right at the start, at STEP 0, with a SIGSEGV, and the traceback includes libiomp5.so, this could perhaps be a scaling issue with the hybrid OpenMP+MPI setup.

Could you rebuild with more debugging options enabled, if you haven’t done so already? The traceback reports every routine and line in oifs as Unknown, so a build with debug symbols may give more of a clue. Does the crash also happen with a single OpenMP thread (OMP_NUM_THREADS=1)? Can you change the OpenMP/MPI decomposition?
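
As a rough sketch of the single-thread test and of trying other decompositions: the snippet below only works out how many MPI tasks would fill the allocation for a given number of OpenMP threads per task. The 192 cores per node come from the original post; the node count (NODES) and the tasks_for helper are hypothetical and only there for illustration, so substitute the real allocation.

```shell
#!/bin/bash
# Sketch only: NODES and tasks_for are hypothetical, not taken from the original run script.
CORES_PER_NODE=192   # hpc8a.96xlarge, as stated in the post
NODES=32             # assumption: replace with the real number of nodes

# MPI tasks needed to fill NODES nodes with a given number of OpenMP threads per task.
tasks_for() {
  local threads=$1
  echo $(( NODES * CORES_PER_NODE / threads ))
}

# Test 1: pure MPI, one OpenMP thread per task.
export OMP_NUM_THREADS=1
echo "threads=$OMP_NUM_THREADS tasks=$(tasks_for "$OMP_NUM_THREADS")"

# Test 2: alternative hybrid splits that divide 192 cores evenly.
for t in 2 4 8; do
  echo "threads=$t tasks=$(tasks_for "$t")"
done
```

For the rebuild, the Intel compilers accept -g and -traceback, which normally replace the Unknown entries in a forrtl traceback with routine names and line numbers. Where those flags belong in the OpenIFS build configuration is not shown here.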

Unfortunately I have no first-hand experience, as we do not use OpenIFS at resolutions beyond TCo1279 and do not recommend doing so. Given that 43r3 is an old cycle, it will be very difficult to find support at ECMWF for this; as always, we recommend upgrading to at least cycle 48r1.

Perhaps your contacts in the DestinE community, who run the IFS at such resolutions, can help you further?

Best regards,

Marcus