Rocky 9.3 + OpenDKIM + Latest SSSD packages

I run a number of mail servers used for only sending out mail from internal app servers to the public internet. These servers run Postfix for the MTA and OpenDKIM to add signatures to outgoing mail for domains configured for this. These servers are also joined to an active directory domain for admin user login, using realmd/sssd.

Recently a package update the servers received on December 16th caused OpenDKIM to start having a run on file descriptors until errors start being thrown and things begin to break as the service stops responding as Postfix is expecting.

In the mail logs I will see:

opendkim[89244]: OpenDKIM Filter: accept() returned invalid socket (Numerical result out of range), try again
postfix/smtpd[94333]: warning: milter unix:/var/spool/opendkim/opendkim.sock: can't read SMFIC_OPTNEG reply packet header: Connection reset by peer
postfix/smtpd[94333]: warning: milter unix:/var/spool/opendkim/opendkim.sock: read error in initial handshake

This will hit the logs after OpenDKIM has exhausted its file descriptor allotment, so not really a helpful error, just a result of the problem.

LSOF will highlight the problem when showing everything for the opendkim user, normally there are only a handful of these lines present on a healthy server, but on a problem server they will increase until it breaks:

opendkim 231880 opendkim   58u  unix 0xffff9352878650c0      0t0   1227972 type=STREAM (CONNECTED)
opendkim 231880 opendkim   59u  unix 0xffff935291b70880      0t0   1227387 type=STREAM (CONNECTED)
opendkim 231880 opendkim   60u  unix 0xffff935291b72a80      0t0   1227530 type=STREAM (CONNECTED)
opendkim 231880 opendkim   61u  unix 0xffff9353654faec0      0t0   1226695 type=STREAM (CONNECTED)
opendkim 231880 opendkim   62u  unix 0xffff935291b75d80      0t0   1227535 type=STREAM (CONNECTED)
opendkim 231880 opendkim   63u  unix 0xffff93534a4a3300      0t0   1229268 type=STREAM (CONNECTED)
opendkim 231880 opendkim   64u  unix 0xffff935287862200      0t0   1228156 type=STREAM (CONNECTED)
opendkim 231880 opendkim   65u  unix 0xffff9353654fbfc0      0t0   1229846 type=STREAM (CONNECTED)
opendkim 231880 opendkim   66u  unix 0xffff935283732ec0      0t0   1229863 type=STREAM (CONNECTED)
opendkim 231880 opendkim   67u  unix 0xffff93534a7050c0      0t0   1227583 type=STREAM (CONNECTED)
opendkim 231880 opendkim   68u  unix 0xffff9352856b2a80      0t0   1227614 type=STREAM (CONNECTED)

In testing I have narrowed down the problem packages to these ones:

libipa_hbac-2.9.1-4.el9_3.1
libsss_certmap-2.9.1-4.el9_3.1
libsss_idmap-2.9.1-4.el9_3.1
libsss_nss_idmap-2.9.1-4.el9_3.1
libsss_sudo-2.9.1-4.el9_3.1
sssd-2.9.1-4.el9_3.1
sssd-ad-2.9.1-4.el9_3.1
sssd-client-2.9.1-4.el9_3.1
sssd-common-2.9.1-4.el9_3.1
sssd-common-pac-2.9.1-4.el9_3.1
sssd-ipa-2.9.1-4.el9_3.1
sssd-kcm-2.9.1-4.el9_3.1
sssd-krb5-2.9.1-4.el9_3.1
sssd-krb5-common-2.9.1-4.el9_3.1
sssd-ldap-2.9.1-4.el9_3.1
sssd-nfs-idmap-2.9.1-4.el9_3.1
sssd-proxy-2.9.1-4.el9_3.1

When I roll back a test server to the 2.8.2-3 versions of the above packages and restart the OpenDKIM service the problem completely goes away.

I have been unable to find any log output anywhere that is providing anything helpful. To observe the issue I have to hit my test server with a load of about 10 emails/sec, and from a domain that OpenDKIM is configured to sign. If all the mail is from domains that aren’t in the signingtable then the problem does not occur.

Any thoughts on this would be appreciated
Thanks

What do the sssd log files show during this time? I know you’ve concentrated on the opendkim logs but you roll back sssd packages, so makes me think maybe something in sssd logs will give a clue.

I should have noted that initially, nothing shows up in the sssd logs at all. Have monitored these logs when starting/stopping services and testing mail flow:

/var/log/sssd/krb5_child.log
/var/log/sssd/ldap_child.log
/var/log/sssd/sssd.log
/var/log/sssd/sssd_nss.log
/var/log/sssd/sssd_pac.log
/var/log/sssd/sssd_pam.log
/var/log/sssd/sssd_DOMAIN.com.log

Nothing shows up in these logs at all

I should also add that domain login works just fine with the latest and previous versions of the sssd packages. I just can’t figure out exactly how OpenDKIM and the SSSD packages are connected, I am sure they arn’t directly connected, but there must be a behavior change somewhere that is indirectly affecting it.

What does

 cat /proc/sys/fs/file-max

shows, and what

# $i proc id of opendkim
cat /proc/$i/limits

with the old and new packages?

With old sssd packages:

[root@testserver ~]# cat /proc/sys/fs/file-max
9223372036854775807

[root@testserver ~]# cat /proc/1162330/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             30406                30406                processes
Max open files            524288               524288               files
Max locked memory         8388608              8388608              bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       30406                30406                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

With new sssd packages:

[root@testserver ~]# cat /proc/sys/fs/file-max
9223372036854775807

[root@testserver ~]# cat /proc/545457/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             30406                30406                processes
Max open files            524288               524288               files
Max locked memory         8388608              8388608              bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       30406                30406                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

No change that I can see here, I had previously gone through various limits also unable to find any changes. I really don’t think its specifically a limit issue, the problem is that after the sssd updates it appears like the stream handles just aren’t getting closed, or not happening as quickly. So they just build up and build up. This just doesn’t happen when the old sssd versions are installed or on any of the Rocky 8.9 servers that are running the same services.

I added +E to my LSOF and see this:

opendkim  1222322 opendkim   46u  unix 0xffff935296550440      0t0   6756313 type=STREAM ->INO=6757034 1220828,sssd_nss,74u (CONNECTED)
opendkim  1222322 opendkim   47u  unix 0xffff935283dac400      0t0   6756316 type=STREAM ->INO=6757042 1220828,sssd_nss,78u (CONNECTED)
opendkim  1222322 opendkim   48u  unix 0xffff93534a569540      0t0   6757093 type=STREAM ->INO=6757094 1220828,sssd_nss,79u (CONNECTED)
opendkim  1222322 opendkim   49u  unix 0xffff9352965d3740      0t0   6759512 type=STREAM ->INO=6759513 1220828,sssd_nss,81u (CONNECTED)
opendkim  1222322 opendkim   50u  unix 0xffff9353669c7700      0t0   6758434 type=STREAM ->INO=6757098 1220828,sssd_nss,80u (CONNECTED)
opendkim  1222322 opendkim   51u  unix 0xffff93534a56fb40      0t0   6757204 type=STREAM ->INO=6759525 1220828,sssd_nss,82u (CONNECTED)
opendkim  1222322 opendkim   52u  unix 0xffff9352965d2640      0t0   6759540 type=STREAM ->INO=6759541 1220828,sssd_nss,83u (CONNECTED)
opendkim  1222322 opendkim   53u  unix 0xffff93534a569dc0      0t0   6757233 type=STREAM ->INO=6759555 1220828,sssd_nss,84u (CONNECTED)
opendkim  1222322 opendkim   54u  unix 0xffff93531cf972c0      0t0   6758110 type=STREAM ->INO=6757354 1220828,sssd_nss,88u (CONNECTED)
opendkim  1222322 opendkim   55u  unix 0xffff9352965d1dc0      0t0   6759726 type=STREAM ->INO=6757349 1220828,sssd_nss,85u (CONNECTED)
opendkim  1222322 opendkim   56u  unix 0xffff9353669c1dc0      0t0   6758654 type=STREAM ->INO=6757350 1220828,sssd_nss,86u (CONNECTED)
opendkim  1222322 opendkim   57u  unix 0xffff9352d1a06a40      0t0   6758749 type=STREAM ->INO=6758210 1220828,sssd_nss,91u (CONNECTED)
opendkim  1222322 opendkim   58u  unix 0xffff9352d1a04400      0t0   6758743 type=STREAM ->INO=6758196 1220828,sssd_nss,89u (CONNECTED)

So OpenDKIM does have something to do with sssd is some way apparently, going to keep digging.

I enabled debugging for nss and an seeing this is the nss log:

[nss] [accept_fd_handler] (0x0400): [CID#8065] Client [cmd /usr/sbin/opendkim][uid 979][0x55e2e9c96160][41] connected!

These are constantly being logged ( ~2/sec while my mail test is running) and coincides with the count of unix stream sockets the OpenDKIM service has open. Counting the connections in the log matches the increase in open sockets. So something is causing OpenDKIM to not close/reuse its connections to nss.

I am no expert on NSS and exactly how its used, so trying to untangle that web as I go.

I grabbed the original 9.3 release sssd packages from the kickstart dir (2.9.1-4.el9_3) tested and these work just fine, no problems. When I upgrade them to the current (2.9.1-4.el9_3.1) that is when things break.

The newer packages exist as part of a bugfix noted here:
https://access.redhat.com/errata/RHBA-2023:6742

I have to assume this is a bug in the latest release at this point, just can’t quite pin it for sure.

Well, pragmatically, if it works then I would stay with the downgrade. The next minor release of the os (~May 2024) will have 2.9.4 or later. C9S has already 2.9.3. If you are brave enough and have a testsystem around I would also test these packages. At least to be sure that the next minor release will fix this for you.

Good idea, so I tried the latest 2.9.3-2 form the CS9 repo, same issue is unfortunately present. We have plenty of free RedHat licenses, so I might setup a test RedHat 9 box and then work it out with their support unless I find another solution.

Did you ever figure out another solution? Started seeing this myself recently. Running outdated sssd seems to fix opendkim.

No, no solutions as of yet. We have far too many audits and security checks for me to consider staying on older packages to make it work, so for now I am running OpenDKIM separately on some EL 8.9 servers until this is resolved.

Redhat support believes the burden of proof is on the OpenDKIM side so they won’t help unless the EPEL team determines the problem to be in SSSD. I have a bugzilla case open with them, but 40 days later and nothing so far.