My crappy memory leak

For 3 weeks we had trouble with our server. Usually this server operates very stable, but every Wednesday we had to reboot it due to nobody being able to access the Outlook Web Access (OWA). Everyone got a http 404 error and if I looked at the services, everything looked normal and was up and running. A reboot solved the problem, but next week the problem was back.

After analyzing the event log I saw a number of events from the Application log:

event id: 8026 MSExchangeAL LDAP Bind was unsuccessful on directory (server DNS name) for distinguished name ”. Directory returned error:[0x51] Server Down.

I also had

event id: 1053 Userenv Windows cannot determine the user or computer name. (Not enough storage is available to complete this operation. ). Group Policy processing aborted.

event id: 7001 VssAdmin: Unable to create a shadow copy: Ran out of resources while running the command.

Command-line: ‘C:WINDOWSsystem32vssadmin.exe Create Shadow /AutoRetry=5 /For=?Volume{7166b389-8ec4-11dc-b352-0019bbca120c}’.

And from the System log:

event id: 5 Volsnap The shadow copy of volume D: could not be created due to insufficient non-paged memory pool for a bitmap structure.

event id: 2019 Srv The server was unable to allocate from the system nonpaged pool because the pool was empty.

After searching the Internet, I thought I had it figured, since we recently implemented shadow copies. So I requested a hotfix from Microsoft which addressed memory leaks in VSS http://support.microsoft.com/kb/940252.

But still, next week the error was back. And I know; I jumped the gun, since I should have done more testing to see if there was an actual increase I the counters MS refers to in the KB.

I all the situations, the http error was there long before these events appeared in the event log, and after further checking the MS knowledge base, I figured out why. According to http://support.microsoft.com/kb/933844/en-us IIS stops accepting connections when the nonpaged memory pool goes below 20MB. You will also have a number of “connection refused” in the httperr.log.

2008-05-21 07:18:44 – – – – – – – – – 1_Connections_Refused –

2008-05-21 07:19:44 – – – – – – – – – 1_Connections_Refused –

2008-05-21 07:20:44 – – – – – – – – – 1_Connections_Refused –

If http is critical you can bypass this by setting the registry key in the KB and restart the services. This gives you and extra 12MB of nonpaged memory so operations may continue while you try to find the root cause.

I then started with the tedious job of monitoring the memory for leaks using poolmon.exe. Since the error implied trouble with the nonpaged pool, I concentrated on this. An hour after booting Task Manager showed that the nonpaged pool was 65888K (about 11 AM), 5 hours later, the counter showed 74700K. When I came back the day after at 9 AM, the counter had reached 95960K of nonpaged memory and the counter kept going up without any apparent reason.

Not being very experienced in troubleshooting memory leaks the poolmon output didn’t give me much of an idea, since the counter that kept going up are counters that seem to incorporate several items. MMCM, File and Thre where the ones that seemed to increase the most. Further checking and an excellent post in an MSN blog http://blogs.msdn.com/sudeepg/archive/2007/09/11/iis-not-serving-pages-page-cannot-be-displayed.aspx pointed me in the right direction. The Thread counter represent Thread objects so further checking of processes with a lot of threads or handles could be worthwhile. Bringing up this showed that two processes named statusclient.exe which belongs to HP Printers had more handles than the total of all other processes.

Figure 1 Process Explorer showing processes sorted by handles. Cropped for size reasons

After killing these two processes I expected that nonpaged memory should drop instantly, but instead the utilization dropped slowly and within the first hour, it had dropped from 95960K to 86700K. Checking again a couple of hours later, nonpaged memory use had dropped to 82632K and actually seemed to keep dropping.

The process statusclient.exe doesn’t seem to use much memory by itself but in some way ties up nonpaged memory through its handles. Killing it made the memory count drop considerably.

As a test, I decided to start up the statusclient.exe again to see if the count started to increase, but it still kept dropping!

At start point it had about a 100 handles and nonpaged memory was 82380K. After running for 2 hours the handle count had increased to 3996 and np memory was down to 78908K

Figure 2 Statusclient.exe handles after 2 hours

Figure 3 Nonpaged pool 2 hours after starting Statusclient.exe

After 21 hours, the handle count has increased to 42382. Nonpaged memory is still down and is now running at 77484K.

Figure 4 Statusclient.exe handles after 21 hours

Figure 5 Nonpaged pool after 21 hours

I now decided to restart the process and also establish a second and third RDP connection to the server to launch a second instance of statusclient.exe to see what happens.

I checked the handles again after 9 hours

01:40 NonPaged Pool memory 182196

Session 0 123372 handles

Session 1 130018 handles

Session 2 129845 handles

01:45 kill session 1 statusclient.exe 182196 immediate drop to 178980 NPM and the a gradual drop to 178928.

02:00 NP memory 182196

Session 0 123829

Session 2 130389

02:00 kill session 2 statusclient.exe 178928 immediate drop to 159644 NPM and the a further drop to 158440

Killing the last session resulted in a further drop in the nonpaged pool memory, and remaining stable.

Conclusion

This error was probably the result of someone installing the wrong printer driver on the server (I’m innocent!). I haven’t checked with HP, but I’m assuming it was never intended for use on a server OS. Selecting the right printer driver is vital to achieve stability.

—Terje Skulstad—

This is a repost of a previously published article due to transition to WordPress

2 thoughts on “My crappy memory leak”

  1. Terje I’m trying to troubleshoot the exact same problem… however when I have process explorer loaded I can’t seem to get the Handles column to display… any ideas?

  2. No problem. Right click in the column headings and choose “Select columns…” Go to the Process Performance tab and tick “Handle count” Click Ok and you’re there 🙂

    —Terje—

Leave a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.