Has anyone got dmtcp-2.6.1~rc1 to work

Has anyone got dmtcp-2.6.1~rc1 from the Rocky repo to work?
BTW, dmtcp is a RHEL/Rocky/CentOS rpm that allows users’ programs to perform checkpoint/restart.
It is great for long-running (days) programs like Machine Learning training.

I also tried installing the source rpm and running “rpmbuild -ba dmtcp.spec”, but it fails
in the “make check” stanza of the spec file, at the "plugin-init" test.

Well, I installed it, and then used it to launch and checkpoint nano, so it does work. The package comes from EPEL repository though, not Rocky’s repositories.

Best thing here would have been to copy and paste the output from the make check command so that people could actually view the error messages. Without that, it’s impossible to help.

Good point, Ian! Also, the failure was not in the plugin-init test but in dmtcp1:

$ rpmbuild -ba dmtcp-2.6.1~rc1

Thread model: posix
gcc version 8.4.1 20200928 (Red Hat 8.4.1-1) (GCC)
CFLAGS: -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection
CXXFLAGS: -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection
LDFLAGS: -Wl,-z,relro -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld
openjdk version "1.8.0_302"
OpenJDK Runtime Environment (build 1.8.0_302-b08)
OpenJDK 64-Bit Server VM (build 25.302-b08, mixed mode)
javac 1.8.0_302

Making all in mtcp
Making all in plugin
Verifying there is enough disk space …
== Tests ==
dmtcp1 ckpt:PASSED; rstr:

The test HANGS at this point, and it will FAIL once it hits the timeout value for the test case.

Here is the original problem reported to me by a user of the dmtcp-2.6.1~rc1 rpm that I installed from the Rocky/EPEL repository.

yum install dmtcp-2.6.1~rc1

"There is a problem with the admin DMTCP installation - it segfaults with the simplest of examples.

I’m using a simple C program that counts integers:
(base) login1.ls6(1067)$ more countIntegers.c
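#include <stdio.h>
#include <unistd.h>
int main(void) {
    unsigned long ii = 0;
    printf("Counting Integers\n");
    while (1) {
        printf("%lu ", ii);
        ii = ii + 1;
        /* The listing was cut off after the line above; the rest is assumed. */
        fflush(stdout);  /* show each number immediately */
        sleep(1);        /* assumed pause between numbers */
    }
}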

Now I’ll demonstrate what happens with the binaries from the admin installation - it segfaults:
(base) login1(1001)$ which dmtcp_launch
(base) login1(1002)$ dmtcp_launch -i 3 countIntegers
Counting Integers
0 1 2 3 4 5 6 7 ^C
(base) login1(1003)$ dmtcp_restart_script.sh
[1294494] mtcp_restart.c:340 restore_brk:
current_brk 0x5555557a1000, saved_brk 0x5dd000, restore_begin 0x153a26fa9000, restore_end 0x153a279a9000
Segmentation fault (core dumped)
(base) login1(1004)$"

This is where I tried to download the src.rpm and use rpmbuild, which led to the initial question in this thread.

Would need to be reported to EPEL:

[root@rocky ~]# dnf search dmtcp
Last metadata expiration check: 1:54:49 ago on Fri 01 Jul 2022 17:24:33 CEST.
================================================= Name Exactly Matched: dmtcp =================================================
dmtcp.x86_64 : Checkpoint/Restart functionality for Linux processes
================================================ Name & Summary Matched: dmtcp ================================================
dmtcp-devel.x86_64 : DMTCP developer package

[root@rocky ~]# dnf info dmtcp
Last metadata expiration check: 1:54:55 ago on Fri 01 Jul 2022 17:24:33 CEST.
Available Packages
Name         : dmtcp
Version      : 2.6.1~rc1
Release      : 0.1.el8
Architecture : x86_64
Size         : 854 k
Source       : dmtcp-2.6.1~rc1-0.1.el8.src.rpm
Repository   : epel
Summary      : Checkpoint/Restart functionality for Linux processes
URL          : http://dmtcp.sourceforge.net
License      : LGPLv3+ and ASL-2.0
Description  : DMTCP (Distributed MultiThreaded Checkpointing) is a tool to
             : transparently checkpointing the state of an arbitrary group of
             : applications including multi-threaded and distributed computations.
             : It operates directly on the user binary executable, with no Linux kernel
             : modules or other kernel mods.
             : Among the applications supported by DMTCP are Open MPI, MVAPICH2, MATLAB,
             : R, Python, Perl, and many programming languages and shell scripting
             : languages.  It supports both TCP sockets and InfiniBand connections.
             : With the use of TightVNC, it can also checkpoint and restart X-Window
             : applications.  The OpenGL library for 3D graphics is supported through
             : a special plugin.
             : This package contains DMTCP binaries.

If you build from the source rpm that the EPEL package was built from, chances are it will segfault as well (I guess). It would be better to get the dmtcp source code directly and build that. Or open a bug report with EPEL to get them to fix it.

It seems they have the project on GitHub (dmtcp/dmtcp, “DMTCP: Distributed MultiThreaded CheckPointing”), but there has been no release since 2019. Commits have been made within the last 11 days or so, which means it is still active.

It could be easy enough just to clone it and then follow the instructions in install.md on how to compile.
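
A rough sketch of that clone-and-build route, using only the steps mentioned in this thread (configure, make, make check). The repository URL is assumed from the GitHub project name, and the function name and destination directory are made up for illustration; install.md in the source tree remains the authority on the actual steps.

```shell
# Hypothetical helper: clone DMTCP and build it inside a subshell, so the
# caller's working directory is left untouched.
# Assumption: the upstream repository lives at github.com/dmtcp/dmtcp.
build_dmtcp_from_source() (
    dest="${1:-dmtcp}"    # where to clone (default: ./dmtcp)
    git clone https://github.com/dmtcp/dmtcp.git "$dest" &&
    cd "$dest" &&
    ./configure &&
    make &&
    make check            # runs the test suite, including dmtcp1
)
```

Calling `build_dmtcp_from_source /tmp/dmtcp-src` would clone into that directory and stop at the first failing step; nothing is installed system-wide.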

Thanks! I also saw those commits. I am able to build a working version when I follow the install.md instructions using the current EPEL rpmbuild/BUILD/dmtcp-2.6.1~rc1 source tree, and “make check” runs successfully! So something in the dmtcp.spec is breaking the code, resulting in a bad EPEL dmtcp-2.6.1~rc1 rpm.

Ian, thanks again. I will report the problem to EPEL for Rocky.

I just cloned it as well and am currently running make check, since configure and make worked fine. With the EPEL package being an “RC1” (RC obviously being Release Candidate), it’s going to be unstable. After checking the rpm again and attempting to restart my nano session, it also segfaulted. I will recheck it using the cloned code.

What’s also interesting is that the latest version in the GitHub releases is 2.6.0, so they haven’t made a new stable release yet. Thus EPEL is shipping something newer than what the project has officially released. It is possible that the code they cloned at the time they built 2.6.1rc1 wasn’t completely finished, hence the segfaults. I will see how my clone works shortly; that will at least confirm or rule out that kind of scenario.

Yeah, even with a clone and running a restart I get this:

[root@rocky bin]# ./dmtcp_restart_script_13f8c48414c-40000-1ad11da8a65.sh 
WARNING:  Running dmtcp_restart as root can be dangerous.
  An unknown checkpoint image or bugs in DMTCP may lead to unforeseen
  consequences.  Continuing as root ....
[95577] mtcp_restart.c:348 restore_brk:
  error: new/current break (0x1140f000) != saved break (0x55fff3f48000)
[40000] NOTE at processinfo.cpp:434 in restoreHeap; REASON='Failed to restore area between saved_break and curr_break.'
     _savedBrk = 94557797908480
     curBrk = 289468416
     (strerror((*__errno_location ()))) = Cannot allocate memory

However, I pressed CTRL-C and nano opened along with the text I had left on the screen when the checkpoint was made. I’m not entirely sure if this is normal, or if dmtcp doesn’t support nano, so maybe my example isn’t a good one. I’ll check out the 2.6.0 release to see if it works any differently. However, using 2.6.1rc1, CTRL-C didn’t work and I could only kill the command with a reboot.
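
For anyone comparing the numbers in that log: the NOTE prints the two program breaks in decimal, while the error line prints them in hex. A small sketch to confirm they are the same addresses (the helper name is made up for illustration):

```shell
# Hypothetical helper: print a hex address in decimal.
hex_to_dec() {
    printf '%d\n' "$1"
}

hex_to_dec 0x1140f000       # new/current break -> 289468416 (curBrk)
hex_to_dec 0x55fff3f48000   # saved break       -> 94557797908480 (_savedBrk)
```

So the restarted process has its break at a very different address from the one saved in the checkpoint image, which is the mismatch restore_brk complains about.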

A checkout of 2.6.0 does the same, so I would assume that nano isn’t completely supported. It needed a CTRL-C, but it did open where nano had been left, with the relevant text I had typed.

I added the following to the dmtcp.spec file, ran rpmbuild -ba, installed the new rpm, and was able to run a checkpoint successfully.

%description -n tacc-dmtcp-devel
This package provides files for developing DMTCP plugins.

%undefine _hardened_build
%global _hardened_cflags -Wl,-z,lazy
%global _hardened_ldflags -Wl,-z,lazy

%setup -q
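
One way to check whether a rebuilt binary actually picked up this change is to look at its dynamic section. The LDFLAGS shown earlier in the thread link with -Wl,-z,now, which normally shows up as BIND_NOW (or a NOW flag) in `readelf -d` output, while the spec change asks for -z lazy instead. The sketch below assumes readelf from binutils is installed, and the function name is hypothetical. It is only a diagnostic, not an explanation of why the hardened build breaks restart.

```shell
# Hypothetical helper: report whether an ELF file requests immediate
# symbol binding (-z now). Prints "now", "lazy" or "unknown".
binding_mode() {
    [ -r "$1" ] || { echo unknown; return 1; }
    command -v readelf >/dev/null 2>&1 || { echo unknown; return 1; }
    if readelf -d "$1" 2>/dev/null | grep -Eq 'BIND_NOW|FLAGS_1.*NOW'; then
        echo now
    else
        echo lazy    # no NOW flag found (also printed for non-dynamic files)
    fi
}
```

For example, `binding_mode "$(command -v dmtcp_launch)"` could be run before and after installing the rebuilt rpm to compare the two builds.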