snmpd memory leak fixes

Description

Two customers have now reported that the snmpd process can consume an extreme amount of memory. I'm attaching the debug from one customer, which shows the snmpd process using ~2.2GB of resident memory.

There are also strange error messages related to snmpd:

Nov 10 09:15:35 snmpd[6049]: SWInst: error initializing pkgng db
Nov 11 09:15:34 snmpd[6049]: SWInst: error initializing pkgng db
Nov 12 09:15:36 snmpd[6049]: SWInst: error initializing pkgng db
Nov 12 13:35:07 snmpd[6049]: actual retrieval of routing table: Cannot allocate memory
Limiting open port RST response from 336 to 200 packets/sec
Nov 13 00:35:25 snmpd[6049]: actual retrieval of routing table: Cannot allocate memory
Limiting open port RST response from 274 to 200 packets/sec
Nov 13 09:15:34 snmpd[6049]: SWInst: error initializing pkgng db

In the debug, in the ps output section, it shows RSS size ~2.2GB:

root 6049 0.0 0.8 2238896 2222604 - S 13Aug19 1144:31.14 |-- /usr/local/sbin/snmpd -p /var/run/net_snmpd.pid -c /etc/local/snmpd.conf -LS3d
root 6052 0.0 0.0 6352 2032 - Is 13Aug19 0:00.00 |-- daemon: /usr/local/bin/snmp-agent.py[6053] (daemon)

Problem/Justification

None

Impact

None

Activity

Dru Lavigne December 23, 2019 at 7:29 PM

Bug Clerk December 23, 2019 at 7:25 PM

Caleb December 18, 2019 at 2:58 PM
Edited

I've confirmed via valgrind that the fix I made in 7844b49061902b4a00520be7b03ccd0c9be922f3 has fixed the memory leak.

This is the valgrind output after running snmpwalk over the entire OID tree for 10 minutes.

After running the same snmpwalk command on the patched snmpd for 1 hour, this is the valgrind output.

I can confidently state that the memory leak that has plagued snmpd for a while has been "fixed".

Caleb December 18, 2019 at 12:33 PM
Edited

The commit that fixes the biggest leak is 7844b49061902b4a00520be7b03ccd0c9be922f3, which adds a missing check on the return value of the CONTAINER_INSERT function in swrun_kinfo.c.

The net-snmp code builds its process table by calling kvm_getprocs() (a libkvm library function, not a syscall) with the KERN_PROC_ALL flag. Since this returns all processes AND kernel-visible threads, the CONTAINER_INSERT function was returning a -1 error because of duplicate PIDs (this is expected since KERN_PROC_ALL was passed). I found an upstream commit that changes KERN_PROC_ALL to KERN_PROC_PROC, which I've pulled in.

I've also ensured that we check the return code of CONTAINER_INSERT and free() the associated memory if needed.
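
To illustrate the pattern described above, here is a minimal, self-contained sketch. It is not the net-snmp code: struct table, table_insert(), table_load() and the live_allocs counter are all hypothetical stand-ins for the real container and for CONTAINER_INSERT. The only behavior it borrows from the description is that an insert returns -1 on a duplicate key, and that the caller must free the rejected entry or it leaks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration only; NOT the net-snmp API. */
#define MAX_ENTRIES 64

struct entry {
    int pid;
};

struct table {
    struct entry *items[MAX_ENTRIES];
    size_t count;
};

/* Number of entries allocated and not yet freed (for demonstration). */
static size_t live_allocs;

static struct entry *entry_create(int pid)
{
    struct entry *e = calloc(1, sizeof(*e));
    if (e == NULL)
        return NULL;
    e->pid = pid;
    live_allocs++;
    return e;
}

static void entry_free(struct entry *e)
{
    if (e != NULL) {
        free(e);
        live_allocs--;
    }
}

/* Returns 0 on success, -1 if the pid is already present or the table is full. */
static int table_insert(struct table *t, struct entry *e)
{
    assert(t != NULL && e != NULL);
    for (size_t i = 0; i < t->count; i++) {
        if (t->items[i]->pid == e->pid)
            return -1;
    }
    if (t->count == MAX_ENTRIES)
        return -1;
    t->items[t->count++] = e;
    return 0;
}

/*
 * Build the table from a list of pids and return how many inserts were
 * rejected. With check_insert == 0 the return code is ignored, so every
 * rejected entry is leaked (the assumed pre-fix behavior). With
 * check_insert != 0 the rejected entry is freed.
 */
static size_t table_load(struct table *t, const int *pids, size_t n,
                         int check_insert)
{
    size_t rejected = 0;
    for (size_t i = 0; i < n; i++) {
        struct entry *e = entry_create(pids[i]);
        if (e == NULL)
            continue;
        if (table_insert(t, e) != 0) {
            rejected++;
            if (check_insert)
                entry_free(e); /* free what the table refused to own */
            /* else: pointer is dropped here and the memory is leaked */
        }
    }
    return rejected;
}

/* Free everything the table owns. */
static void table_clear(struct table *t)
{
    for (size_t i = 0; i < t->count; i++)
        entry_free(t->items[i]);
    t->count = 0;
}
```

Loading the pid list {1, 2, 2, 3, 3, 3} with check_insert set rejects three duplicates and leaves live_allocs at 0 after table_clear(); with check_insert cleared, three entries stay allocated with no remaining owner. In this toy model, a list without duplicate keys (analogous to listing processes only) never reaches the rejection path at all.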

After making the above changes, the memory growth of the snmpd process seems to have REALLY slowed down. At the time of writing, it has only used 50492K of resident memory. This is after running a while true loop to walk the entire OID tree for 14 hours. Without these commits, snmpd would have been at close to 100MB or more of resident memory by now.

Ryan Moeller December 17, 2019 at 7:31 PM

We've also found commit 024c6316eb1f37a7c3645e0a15032173d04e2c11, which lists only processes instead of all kernel threads, and we discovered a missing check for the return value of CONTAINER_INSERT() in SWRun, which we should add for good measure.

Complete

Created November 14, 2019 at 2:21 PM
Updated July 1, 2022 at 4:43 PM
Resolved December 23, 2019 at 7:25 PM