Programmer's Log

Tuesday, September 19, 2006


We solved a lot of requests last week. Today I finally have some free time again to update this blog with what happened over the weekend. It was a very busy weekend.


I finished implementing the bandwidth reduction for our member sites. The implementation isn't quite where I want it to be yet, but it's alright: simple version first, a more sophisticated version later. Compared to the ASP version, ours cuts roughly 15% more of the data transferred to clients' browsers.

Last Saturday was a really chaotic day. We ran into so many unknown problems. We kept receiving complaints from our operations staff about the sites being slow, and we experienced the lag ourselves too. The network was fine; the Ping Plotter program didn't show any disruption. The CPU usage of the database was at 30%. Everything seemed normal except that the sites were slow. The situation kept getting worse; the trading general dispute director even came to our IT department to check what went wrong, and our big bosses were all present. Eventually we pinpointed what was probably the problem: the upgrade of two RAMSAN units. We had upgraded them on Thursday to double their capacity. My supervisor tried moving the SQL Server log to a different RAMSAN unit. The situation seemed a little better; however, there were still complaints.
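For my own notes, the log move goes roughly like this (assuming we are on SQL Server 2005; the database name, logical log name, and drive letter below are made up, not our real ones):

    -- Point the log file at a path on the other RAMSAN unit
    ALTER DATABASE Trading
        MODIFY FILE (NAME = Trading_log,
                     FILENAME = N'R:\SQLLogs\Trading_log.ldf');

    -- The new path only takes effect after the database goes offline,
    -- the .ldf file is physically copied to the new location,
    -- and the database comes back online.
    ALTER DATABASE Trading SET OFFLINE WITH ROLLBACK IMMEDIATE;
    -- (move the log file to the new location here)
    ALTER DATABASE Trading SET ONLINE;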

In technical terms, our business is database-centric. Like an MMORPG, it relies extensively on database performance, with more than 38,000 users logged in concurrently at peak time. When our database has problems, our business suffers as well.

Half an hour later, we used Windows Performance Monitor to watch the read and write disk queues on those RAMSAN units. Everything was normal again. Strange; nothing big had changed. While the problem was happening we made a series of changes to the disks and none of them helped, so how did everything become normal? The load on the database stayed the same, even a bit higher than half an hour earlier. STRANGE! There were too many factors in play to pinpoint the problem exactly.
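(Note to self: besides the Perfmon counters, a similar picture can be pulled from inside SQL Server itself, assuming we are on SQL Server 2005, by looking at the per-file I/O stall times. A rough sketch:)

    -- Per-file I/O stalls since the last restart, to compare against
    -- the disk queue counters on the RAMSAN units.
    SELECT
        DB_NAME(vfs.database_id) AS database_name,
        mf.physical_name,                 -- which volume the file sits on
        vfs.num_of_reads,
        vfs.io_stall_read_ms,             -- total ms spent waiting on reads
        vfs.num_of_writes,
        vfs.io_stall_write_ms             -- total ms spent waiting on writes
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
        ON  mf.database_id = vfs.database_id
        AND mf.file_id     = vfs.file_id
    ORDER BY vfs.io_stall_read_ms + vfs.io_stall_write_ms DESC;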

Our Agent web server had problems as well. It reached its capacity limit; its CPU usage sat at 97-99% at peak time. "On Sunday we will try hardware load balancing with the F5 for the Agent system," my supervisor told me. I went to set up the server for him to prepare for Sunday.

Saturday night turned into a long night for us. Database performance remains our biggest issue.

Sunday night, with just a bit less load than Saturday, the system ran smoothly. Was our external consultant's assumption correct? We didn't know, and we couldn't claim that what he found was really the cause of Saturday's problems; there were too many factors involved. He had told me in the afternoon that he found the problem: it might have come from the unused indexes we had left on the main table, so he removed them.
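I don't know exactly what query he ran, but roughly this kind of check turns up indexes that are being maintained yet never read (again assuming SQL Server 2005; the index and table names in the drop are made up):

    -- Non-clustered indexes with no seeks, scans or lookups since the last restart
    SELECT
        OBJECT_NAME(i.object_id) AS table_name,
        i.name                   AS index_name,
        us.user_updates          -- writes still pay to maintain the index
    FROM sys.indexes AS i
    LEFT JOIN sys.dm_db_index_usage_stats AS us
        ON  us.object_id   = i.object_id
        AND us.index_id    = i.index_id
        AND us.database_id = DB_ID()
    WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
      AND i.index_id > 1
      AND ISNULL(us.user_seeks, 0)
        + ISNULL(us.user_scans, 0)
        + ISNULL(us.user_lookups, 0) = 0
    ORDER BY us.user_updates DESC;

    -- Dropping one (hypothetical name):
    -- DROP INDEX IX_Trades_OldFilter ON dbo.Trades;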

When it came to the settlement part, we still had big problems, even though we were using our new settlement code; we had tested it thoroughly and it ran much faster than the older version.

We did succeed with load balancing through our F5 firewall on one of our sites. That was the only good news for us the whole weekend.

... definitely have to monitor again on the coming weekend ...
