Hello, I’m testing OpenIFS 43r3 on Amazon Web Services (AWS). The instance type is hpc8a.96xlarge: AMD EPYC, 4.5 GHz, 192 cores/node, 768 GiB memory, 300 Gbps EFA (Elastic Fabric Adapter). We installed the model and ran TCo1279 successfully. However, when we scaled up to TCo2559 with the same configuration, compilers, etc., the run failed at step 0 with the segmentation fault shown below. We have tried everything we can think of but still cannot identify the cause. Could anyone offer some insight? I’m attaching the error message and the run script below.
Error message:
576: 07:19:23 STEP 0 H= 0:00 +CPU=101.688
576: STEP 0 :## EC_MEMINFO | TC | MEMORY USED(MB) | MEMORY FREE(MB) INCLUDING CACHED | %USED %HUGE
576: STEP 0 :## EC_MEMINFO | Malloc| Inc Heap | Numa node 0 | Numa node 1 | |
576: STEP 0 :## EC_MEMINFO Node Name | Heap | RSS(tsk) | Small Huge or | Small Huge or | Total |
576: STEP 0 :## EC_MEMINFO | (tsk) | Small Huge | Only Small | Only Small | Memfree+Cached |
576: STEP 0 :## EC_MEMINFO 1 compute- 440 5296 0 41 211736 2642 261966 479381 47576 1.0 0.0 s/p
3677: forrtl: severe (174): SIGSEGV, segmentation fault occurred
3677: forrtl: severe (174): SIGSEGV, segmentation fault occurred
3689: forrtl: severe (174): SIGSEGV, segmentation fault occurred
3689: forrtl: severe (174): SIGSEGV, segmentation fault occurred
…
3677: Image PC Routine Line Source
3677: libc.so.6 0000154B1483FC30 Unknown Unknown Unknown
3677: oifs 00000000011C006E Unknown Unknown Unknown
3677: oifs 0000000001257537 Unknown Unknown Unknown
3677: oifs 000000000125419B Unknown Unknown Unknown
3677: oifs 000000000124CAE7 Unknown Unknown Unknown
3677: oifs 00000000011F94A3 Unknown Unknown Unknown
3677: libiomp5.so 0000154B1F920909 __kmp_invoke_micr Unknown Unknown
3677: libiomp5.so 0000154B1F8A0FC8 Unknown Unknown Unknown
3677: libiomp5.so 0000154B1F89FCAE Unknown Unknown Unknown
3677: libiomp5.so 0000154B1F921E62 Unknown Unknown Unknown
3677: libc.so.6 0000154B1488B2EA Unknown Unknown Unknown
3677: libc.so.6 0000154B14910500 Unknown Unknown Unknown
3689: forrtl: severe (174): SIGSEGV, segmentation fault occurred
3689: Image PC Routine Line Source
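In case it helps with diagnosis: the oifs frames in the traceback above show only raw addresses. Below is a minimal sketch of how they could be pulled out of the log for one rank and passed to addr2line. It assumes the executable was rebuilt with debug symbols (-g) and is not position-independent; the helper name extract_pcs and the file names oifs.log and ./oifs are hypothetical.

```shell
# Hypothetical helper (not part of OpenIFS): print the program-counter
# values of the "oifs" frames that one MPI rank wrote to the log.
# Assumes log lines look like "<rank>: oifs <hex PC> ...".
extract_pcs() {
  rank="$1"
  logfile="$2"
  awk -v prefix="${rank}:" '$1 == prefix && $2 == "oifs" { print "0x" $3 }' "$logfile"
}

# Example use (assumed file names; needs a binary built with -g):
#   extract_pcs 3677 oifs.log | addr2line -f -C -e ./oifs
```

Filtering on the rank prefix keeps the frames of different ranks apart, since their output is interleaved in the log.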
Run script:
module purge
source /opt/intel/oneapi/2026.0/oneapi-vars.sh
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export FC=mpiifx
export F77=mpiifx
export MPIFC=mpiifx
export FCFLAGS=-free
export CC=mpiicx
export CXX=mpiicpx
export MKLROOT=/opt/intel/oneapi/mkl/2026.0
export IO_LIB_ROOT=/root/libraries_aws_ifx
export PATH=$IO_LIB_ROOT/bin:/opt/python/3.10/bin:$PATH
export LD_LIBRARY_PATH=/root/libraries_aws_ifx/lib:/opt/intel/oneapi/2026.0/lib:/opt/amazon/efa/lib64:$LD_LIBRARY_PATH
export I_MPI_FABRICS=ofi
export I_MPI_PMI_LIBRARY=/opt/slurm/lib/libpmi2.so
export FI_PROVIDER=efa
export I_MPI_OFI_LIBRARY_INTERNAL=0
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=omp
export I_MPI_PIN_ORDER=spread
export I_MPI_SHM_HEAP_VSIZE=4096
export FI_EFA_IFACE=all
export FI_EFA_USE_DEVICE_RDMA=1
export FI_EFA_ENABLE_SHM_TRANSFER=1
export OMP_STACKSIZE=512M
export KMP_STACKSIZE=512M
export OMP_SCHEDULE=STATIC
export KMP_BLOCKTIME=infinite
export OMP_WAIT_POLICY=ACTIVE
export KMP_AFFINITY=granularity=fine,compact,1,0
export OMP_PLACES=cores
export OMP_PROC_BIND=close
export DR_HOOK_IGNORE_SIGNALS=-1
export SZIPROOT=$IO_LIB_ROOT
export HDF5ROOT=$IO_LIB_ROOT
export HDF5_ROOT=$HDF5ROOT
export NETCDFROOT=$IO_LIB_ROOT
export NETCDFFROOT=$IO_LIB_ROOT
export ECCODESROOT=$IO_LIB_ROOT
export NETCDF_DIR=$IO_LIB_ROOT
export HDF5_C_INCLUDE_DIRECTORIES=$HDF5_ROOT/include
export NETCDF_Fortran_INCLUDE_DIRECTORIES=$NETCDFFROOT/include
export NETCDF_C_INCLUDE_DIRECTORIES=$NETCDFROOT/include
export NETCDF_CXX_INCLUDE_DIRECTORIES=$NETCDFROOT/include
export ECCODES_DEFINITION_PATH=/root/libraries_aws_ifx/share/eccodes/definitions
export GRIB_SAMPLES_PATH=$ECCODESROOT/share/eccodes/ifs_samples/grib1_mlgrib2/
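One thing the script above does not show is the shell stack limit. OMP_STACKSIZE and KMP_STACKSIZE apply to threads created by the OpenMP runtime, while the initial thread of each rank uses the limit reported by ulimit -s. Below is a minimal sketch of a check that could be run inside the job before the model is launched; the function name report_stack_limit is hypothetical, and whether this limit is related to the crash is only an assumption.

```shell
# Hypothetical pre-flight check; not part of the original run script.
# Prints the stack limit that the initial (main) thread of each rank inherits.
report_stack_limit() {
  limit=$(ulimit -s)
  if [ "$limit" = "unlimited" ]; then
    echo "stack limit: unlimited"
  else
    # OMP_STACKSIZE/KMP_STACKSIZE do not change this value.
    echo "stack limit: ${limit} (shell units, usually KiB)"
  fi
}

report_stack_limit
```

If the reported value is small, it could be raised with ulimit -s unlimited in the job script, provided the hard limit on the compute nodes allows it.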