First, a question: if your web application or portal is not performing as you expect, what is the first thing you would target? In most cases, the response is to review and optimize the database queries, look for memory leaks, and ensure proper resource cleanup. One thing that usually comes to mind last is the set of resource configuration parameters and other server settings. How often do you see server configuration and resource parameters altered from their default values while setting up a server for a production environment? Barring a few organizations, especially in the banking and financial domains, most don't pay enough attention to this aspect.
I have come across several instances where, after a lot of effort spent optimizing the application, the real problem turned out to lie in the configuration of the server or of a resource such as a data source. I am citing a couple of those instances here without mentioning the project and customer details (not sure if that's allowed here, and it certainly wouldn't be ethical).
This one time, what we now call The 9 PM to Midnight Problem, we spent several nights trying all kinds of things. Every night from 9 PM till midnight, our application's performance degraded badly. We were running a clustered environment with 12 clones for the portal and 12 clones for the services backing the portal and other channels. The performance hit was more apparent on channels other than the portal.

We dug into the applications that take most of the load on the services side, thinking they could be the culprits if they were not managing their resources well. After complete profiling, optimization, and trimming logging to a minimum, 3 applications were launched into production the very next night, but no resolution. What bothered us most was why the problem started at 9 sharp. The next step was to roll back all the changes deployed in the 2 weeks preceding the problem; still no luck.

We made it a point to sit in the office at 9 PM and observe every single thing around the environment that could affect performance. We even watched the HTTP connections open on the servers, which, amazingly, shot up big time right after 9: from 120 to 2300 per node (each of our nodes hosts 3 clones). We then looked into the HTTP logs to see what kind of requests we were getting, and it turned out there were a whole lot of HTTP CONNECT requests (not GET or POST as expected). We suspected this might be an attempt by hackers to crash our server (we were naive enough to think our website was that popular and hack-worthy). Anyway, that turned out not to be the case. After spending a few more nights, we troubleshot a lot of other aspects and the real picture emerged. We had 20 database servers and 20 replicas. All retrieval was done from the replica databases and updates were done against the production databases.
Now, the replicas were down from 9 to 12 for synchronization with the production databases, and we had a failover in place to go to production in case the replica databases were down. The problem was that the data source timeout on the services server was left at its default of 300 seconds, so the application waited a full 5 minutes to NOT get a connection from a replica before it even looked for a production DB connection. Hence the issue.
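The fix in our case was simply lowering the data source timeout, but the underlying failover pattern can be sketched with a plain executor-based timeout wrapper. This is a hypothetical illustration (the class and method names are mine, not from any app server API): try the replica with a short, bounded wait, and only then fall back to the production source.

```java
import java.util.concurrent.*;

// Hypothetical sketch: bound the time spent waiting on the replica so that
// failover to the production database kicks in quickly instead of after the
// default 300-second data source timeout. All names here are illustrative.
public class FailoverConnector {

    // Run 'replica', but give up after timeoutMs and run 'fallback' instead.
    static <T> T withTimeout(Callable<T> replica, Callable<T> fallback, long timeoutMs)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(replica);
        try {
            // Wait only timeoutMs for the replica, not the 300-second default.
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);     // stop waiting on the replica
            return fallback.call();  // go straight to production
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate a replica that hangs (down for sync) vs a healthy production DB.
        String source = withTimeout(
                () -> { Thread.sleep(5_000); return "replica"; },
                () -> "production",
                200);
        System.out.println("connected to: " + source); // prints "connected to: production"
    }
}
```

With a 200 ms bound the fallback fires almost immediately; with the 300-second default, every request during the sync window stalled for 5 minutes first.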
What still bothered us was why the HTTP connection count rose so drastically during this period, yet never showed up again after the timeout was fixed. That was because the HTTP server had a retry mechanism which would resend the request to another clone if it did not receive a response within a timeout period, which in this case was only 15 seconds. Interesting, isn't it? That's how the HTTP connections rose exponentially. Also, in the HTTP server configuration file httpd.conf, there is a directive called KeepAlive which was set to On, meaning that after sending a response, the HTTP connection for that client is kept open for another timeout period to save the overhead of setting up a new connection for the next request from the same client. This directive is extremely important for a web server, where the same browser is likely to ask for another page soon enough; on a services server, however, it only inflates the number of utilized connections and pushes you toward the maximum number of supported connections.
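For reference, the relevant httpd.conf directives look roughly like this (the values shown are illustrative, not the ones from our environment):

```
# Browser-facing web server: keep-alive saves connection setup cost,
# since the same browser will fetch more pages shortly.
KeepAlive On
KeepAliveTimeout 5
MaxKeepAliveRequests 100

# Service-to-service tier: held-open connections just consume slots,
# so turning keep-alive off can be the safer choice.
# KeepAlive Off
```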
Hmmm... by now I have written so much that I don't even remember where I started (well, actually I started writing this at home around 8 PM, got a call from my PM in the middle asking me to come down to the office to handle an emergency, which got resolved minutes back, and I am finishing this now at around 2:30 AM). Back to the point: don't you think it would be a good idea to tune your servers and resources to the project's requirements right at the time of setting up the server? I guess that should form an important part of knowing a server: how to configure it for what situation.
Well, enough for this time… Good night!