Ever have one of those bugs that customers complain about, but you just cannot reproduce it? Here is a good one…
Customers were complaining about being logged out when clicking a download link.
This particular setup is a Cisco CSS 11501 series load balancer with two Dell PowerEdge web servers sitting behind it. Each web server runs Apache, as well as an application server (Python) which handles authentication and processing for THAT server.
For weeks, I could not reproduce this bug. So tonight, when I finally got bitten by it (at home), I was clueless for a while. The code is so simple: a key lookup in a dictionary, yet the failure just was not making sense.
Here is the story:
A while ago, we were having problems with Internet Explorer downloading content over SSL. This turns out to be a common problem with IE, so to fix it, I switched the downloads to plain HTTP instead of SSL, which is more efficient anyway.
We use a Cisco hardware load balancer, which distributes incoming requests across the backend servers. It has a feature called STICKY SOURCE IP, which means that all connections from the same IP to the same site are delivered to the same backend server. This is nice, because you always end up on the same server, the one that already knows about your login.
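To make the stickiness idea concrete, here is a toy model in Python. It is only a sketch of the concept, not how the Cisco box implements it: the `StickyBalancer` class, the server names, and the round-robin choice for first-time visitors are all assumptions made for illustration.

```python
class StickyBalancer:
    """Toy sticky-source-IP balancer (illustrative only, not Cisco's logic)."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.sticky = {}      # (site, source_ip) -> backend name
        self.next_index = 0   # round-robin pointer for first-time visitors

    def route(self, site, source_ip):
        key = (site, source_ip)
        if key not in self.sticky:
            # First request from this IP to this site: pick a backend
            # (round-robin here, which is an assumption) and remember it.
            backend = self.backends[self.next_index % len(self.backends)]
            self.sticky[key] = backend
            self.next_index += 1
        return self.sticky[key]


lb = StickyBalancer(["web1", "web2"])  # hypothetical server names
first = lb.route("secure-site", "203.0.113.7")
# Every later request from the same IP to the same site gets the same answer.
assert lb.route("secure-site", "203.0.113.7") == first
```

Note that the lookup table is keyed by the site as well as the IP: the stickiness only holds for requests that go to the same site.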
So as it turns out, once SSL was turned off for downloads, the load balancer used a different “site” definition to handle the DOWNLOAD request. STICKY SOURCE IP went out the window, and the request was passed to a “random” web server.
About 50% of the time, users (like me tonight) were tossed to the other server, which knew nothing about their login. That is why they were seeing the “WB4_App::$DSEG and/or WB4_App::$AuthToken must be set in order to contact the applications server.” error, a message that should not normally be shown.
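The failure is easy to reproduce in a few lines of Python. This is a minimal sketch, not our actual application code: the `AppServer` class, the token format, and the assumption that the download lands on either server with equal probability are all made up for illustration.

```python
import random


class AppServer:
    """Hypothetical app server that keeps logins in its own local dictionary."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}  # auth token -> username, NOT shared between servers

    def login(self, token, user):
        self.sessions[token] = user

    def download(self, token):
        # The "simple key lookup in a simple dictionary".
        user = self.sessions.get(token)
        if user is None:
            return "not logged in"
        return f"file for {user}"


def failure_rate(trials, seed=0):
    """Fraction of downloads that land on a server without the login."""
    rng = random.Random(seed)
    failures = 0
    for i in range(trials):
        servers = [AppServer("web1"), AppServer("web2")]
        token = f"token-{i}"
        rng.choice(servers).login(token, "alice")
        # Without stickiness, the download may hit either server
        # (uniform choice is an assumption of this sketch).
        if rng.choice(servers).download(token) == "not logged in":
            failures += 1
    return failures / trials


rate = failure_rate(10_000)  # comes out near one half with two servers
```

The lookup itself is correct. It simply runs against the dictionary on the wrong machine, which is why staring at the code got me nowhere.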
To make matters worse, requests from our IP address at work were apparently always landing on the same server, so I could not reproduce the problem there. I’m lucky that it happened to me at home, or I would still be banging my head against the desk…