Possible memory leak in `dnf makecache` (Rocky 9.4 / dnf 4.14.0-9 / libdnf 0.69.0-8)

Hi everyone,

For the past two months or so, some of my Rocky Linux 9.4 systems have occasionally crashed, frozen, or had processes OOM-killed during the night.

The logs reveal that dnf makecache started seconds before each crash (triggered via the systemd dnf-makecache.timer) but never reached the point where the completion message "Metadata cache created." is logged.
/var/log/messages

Sep 19 06:33:02 hostname systemd[1]: Starting dnf makecache...
Sep 19 06:33:02 hostname dnf[62376]: Amazon Corretto                                  41 kB/s | 2.9 kB     00:00
Sep 19 06:33:02 hostname dnf[62376]: Rocky Linux 9 - Base                            1.4 MB/s | 4.1 kB     00:00
Sep 19 06:33:02 hostname dnf[62376]: Rocky Linux 9 - Base                             44 MB/s | 2.3 MB     00:00
Sep 19 06:33:02 hostname dnf[62376]: Rocky Linux 9 - AppStream                       1.4 MB/s | 4.5 kB     00:00
Sep 19 06:33:03 hostname dnf[62376]: Rocky Linux 9 - AppStream                        33 MB/s | 8.0 MB     00:00
Sep 19 06:33:04 hostname dnf[62376]: Rocky Linux 9 - Extras                          1.0 MB/s | 2.9 kB     00:00
Sep 19 06:33:04 hostname dnf[62376]: Rocky Linux 9 - HighAvailability                1.7 MB/s | 4.0 kB     00:00
Sep 19 06:33:04 hostname dnf[62376]: Extra Packages for Enterprise Linux 9 - x86_64  1.4 MB/s | 4.3 kB     00:00
Sep 19 06:33:04 hostname dnf[62376]: Extra Packages for Enterprise Linux 9 - x86_64   52 MB/s |  23 MB     00:00

/var/log/dnf.log

2024-09-19T06:33:04+0200 DEBUG reviving: failed for 'localmirror-epel', mismatched repomd.
2024-09-19T06:33:04+0200 DEBUG repo: downloading from remote: localmirror-epel
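
To check how often this pattern occurs on a given host, the syslog can be scanned for makecache runs that start but never log the completion message. The following is a minimal sketch, not a tested tool: the function name is hypothetical, and it relies only on the two message strings quoted above ("Starting dnf makecache" and "Metadata cache created.").

```python
START = "Starting dnf makecache"
DONE = "Metadata cache created."


def find_incomplete_runs(lines):
    """Return the start lines of makecache runs that never logged DONE.

    A run counts as incomplete if another start line (or the end of the
    input) is reached before a completion message is seen.
    """
    incomplete = []
    pending = None  # start line of the run currently in progress, if any
    for line in lines:
        if START in line:
            if pending is not None:
                incomplete.append(pending)
            pending = line.rstrip("\n")
        elif DONE in line:
            pending = None
    if pending is not None:
        incomplete.append(pending)
    return incomplete
```

It can be fed the lines of /var/log/messages, for example via `find_incomplete_runs(open("/var/log/messages", errors="replace"))`. Note that a run still in progress at the end of the log is also reported, so the last entry may be a false positive.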

So far I know that

  • This only affects Rocky Linux 9.4 systems
  • None of the affected systems has much free memory (maybe 500-800 MB), but all were running fine in the months and years prior (some since a few weeks after the initial release of Rocky 9.0)

This leads me to the conclusion that

  • either there is a memory leak during the dnf makecache operation
  • or something has changed significantly about the way dnf makecache operates

Have some of you observed similar issues in the past months?
I’ve also found the thread "RL9 dnf OOMs with EPEL but only on one machine!" (#6 by iwalker), which looks like it could be the same issue.

Update: I’ve checked the dnf.rpm.log. Judging by the dates, this could be related to the following dnf updates from May:

2024-05-13T23:17:35+0200 SUBDEBUG Upgrade: libdnf-0.69.0-8.el9.x86_64
2024-05-13T23:18:10+0200 SUBDEBUG Upgrade: python3-libdnf-0.69.0-8.el9.x86_64
2024-05-13T23:18:11+0200 SUBDEBUG Upgrade: dnf-data-4.14.0-9.el9.noarch
2024-05-13T23:18:11+0200 SUBDEBUG Upgrade: python3-dnf-4.14.0-9.el9.noarch
2024-05-13T23:18:12+0200 SUBDEBUG Upgrade: dnf-4.14.0-9.el9.noarch
2024-05-13T23:18:12+0200 SUBDEBUG Upgrade: python3-dnf-plugins-core-4.3.0-13.el9.noarch
2024-05-13T23:18:12+0200 SUBDEBUG Upgrade: dnf-plugins-core-4.3.0-13.el9.noarch

It appears to be related to the size of the repo. The thread "Dnf-makecache.timer is a major liability" is also related and includes a link to Fedora upstream, where there’s a lot of discussion.

Thanks @sweh for providing the link to the related issue & discussion.

500+ MB of RAM for a package manager is excessive and would mean that even a minimal headless server would probably need more than 2 GB of RAM to run stably.

I’ve been working with the “Enterprise Linux” ecosystem since 2012 and have always appreciated that stability issues like this were sorted out beforehand and never made it into a release (at least I never noticed one).

I’ll look into disabling the dnf-makecache.timer or adding swapfiles for now and follow the discussion to see where it leads…
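
To put a number on the memory figure for a specific host, the peak resident memory of a single manual run can be measured. This is only a sketch under some assumptions: the helper name is hypothetical, it relies on Linux reporting `ru_maxrss` in KiB, and running `dnf makecache` this way needs root.

```python
import resource
import subprocess


def peak_rss_kib(cmd):
    """Run cmd to completion and return the peak RSS of child processes in KiB.

    RUSAGE_CHILDREN reports the largest value over all children waited for
    so far, so call this once from a fresh Python process per measurement.
    """
    subprocess.run(cmd, check=False)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
```

For example, `peak_rss_kib(["dnf", "makecache"])` would return the high-water mark in KiB; dividing by 1024 gives MiB for comparison with the 500+ MB mentioned above. If the run gets OOM-killed, the value only reflects what was reached before the kill.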

Linux newbie here.

I’ve been running Rocky Linux 9.4 on AWS for a few days, and it has crashed almost every day.
Looking at the messages log, I noticed that although dnf makecache runs every 1.5 hours or so, the system only crashes when the download size from the EPEL x86_64 repo is 23 MB.

 dnf[5427]: Extra Packages for Enterprise Linux 9 - x86_64   12 MB/s |  23 MB     00:01

When the download is only 17 kB or 7.6 kB or so, it doesn’t crash.

According to the CPU usage record in AWS, when it crashes, CPU usage stays at about 50% on my 2-vCPU system. If I understand correctly, that means one vCPU was running at 100% non-stop until I restarted the system.
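
To check this correlation across a longer log, the dnf progress lines can be parsed for the repo name and download size. The following is a rough sketch: the function names are hypothetical, the regular expression is based only on the line layout shown in this thread, and the units are assumed to be 1024-based, which is close enough to tell kB from MB.

```python
import re

# Matches dnf download progress lines as they appear in the logs above, e.g.
# "dnf[5427]: Extra Packages ... - x86_64   12 MB/s |  23 MB     00:01"
PROGRESS = re.compile(
    r"dnf\[\d+\]:\s+(?P<repo>.*?)\s+[\d.]+\s*[kMG]?B/s\s*\|\s*"
    r"(?P<size>[\d.]+)\s*(?P<unit>[kMG]?B)\s+\d+:\d+"
)

# Assumption: units are multiples of 1024.
UNIT_FACTOR = {"B": 1, "kB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_download(line):
    """Return (repo_name, size_in_bytes) for a progress line, else None."""
    m = PROGRESS.search(line)
    if m is None:
        return None
    size = float(m.group("size")) * UNIT_FACTOR[m.group("unit")]
    return m.group("repo"), int(size)


def large_downloads(lines, min_bytes=1024**2):
    """Return the parsed downloads of at least min_bytes (default 1 MiB)."""
    hits = []
    for line in lines:
        parsed = parse_download(line)
        if parsed is not None and parsed[1] >= min_bytes:
            hits.append(parsed)
    return hits
```

Comparing the timestamps of the large downloads with the crash times should show whether only the full metadata download (23 MB here) precedes a crash.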

I’m not sure what information is required for further troubleshooting, but I am happy to provide some other info if needed.