Programmer's Log

Tuesday, October 24, 2006

It has been 14 days since I last updated my blog. Time sure flies. My schedule filled up with a huge amount of work and lots of problems on the performance side of the system. I stopped implementing new requests, passed my tasks to my juniors, and concentrated on improving the performance of our previous implementation.

We launched 3 sites of our .NET system. For the first week, everything went fine except for trouble with the odds information display. In the second week after launch, we started running into problems, and they also started with the odds information display. Our odds cache wasn't stable. To relieve the load on the web server and the system itself, we had implemented a middle layer of .NET Web Services to obtain the odds information. It used an infinite polling loop to read the information from the database and keep it always fresh in the cache of the Web Services. Our problem was that this polling loop kept stopping unexpectedly. Before launching the 3 new sites, I had worked around this by stopping IIS from recycling its worker process on schedule. This worked fine when we had fewer sites, but the more sites we had, the more the problem showed. We got more and more complaints from the customer as my 3 sites stayed in production. I listed 3 things I had to solve immediately.

  • Odds Cache -> use the .NET Cache object, because it has automatic expiration parameters for each object stored in it.

  • Improve the display speed of odds information.

  • Check up on our server. It has started to respond really slowly to some requests.
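
To illustrate the first item, here is a minimal sketch of the idea behind per-entry expiration: each cached value carries its own expiry time and is reloaded on demand once it goes stale, so nothing depends on a background polling loop staying alive. This is not the .NET Cache API and not our production code. It is written in JavaScript only to keep it short, and `ExpiringCache`, `getOrLoad`, and the injectable `now` clock are hypothetical names.

```javascript
// Hypothetical sketch: a cache whose entries expire on their own and are
// reloaded lazily, instead of being refreshed by an endless polling loop.
class ExpiringCache {
  constructor(now = () => Date.now()) {
    this.now = now;            // injectable clock, useful for testing
    this.entries = new Map();  // key -> { value, expiresAt }
  }

  // Store a value that stays valid for ttlMs milliseconds.
  set(key, value, ttlMs) {
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs });
  }

  // Return the value if present and not expired, otherwise undefined.
  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // drop the stale entry
      return undefined;
    }
    return entry.value;
  }

  // Return the cached value, or call loader() and cache its result.
  getOrLoad(key, ttlMs, loader) {
    const cached = this.get(key);
    if (cached !== undefined) return cached;
    const fresh = loader();
    this.set(key, fresh, ttlMs);
    return fresh;
  }
}
```

With something like `cache.getOrLoad('match:1', 5000, loadOddsFromDb)`, where `loadOddsFromDb` is an assumed loader function, the database is only hit when an entry is missing or older than 5 seconds. The trade-off is that the first request after expiry pays for the reload.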

I tried to improve the odds display first. The solution was much easier than I thought: I just changed the way we call the function that creates our betting slip. It may surprise people that this had an impact on the rendering engine of Internet Explorer 6.

<span onclick="parent.frameA.buildSlip()">0.98</span>


<span onclick="a();">0.98</span>
<script type="text/javascript">
function a() { parent.frameA.buildSlip(); }
</script>

I didn't make this change out of nowhere. I had read some articles about IE having memory leak issues in its JavaScript engine, and the part about circular references created by registering onclick events caught my attention.

I changed the odds cache to the new approach using the Cache object, and did more tuning on our display scripts. This was my last chance before management ordered 2 of our sites switched back to the good old ASP version. My tests showed positive results, and deploying the new changes to production was the last thing I did before leaving the office.

On our critical night, our web server's drive filled up with log files from IIS. I had forgotten to turn off logging. The next morning, my inbox was full of complaints from the operators. :( They should've called me ... As promised to management, we switched 2 sites back to our old versions.

... and to be continued ...

Tuesday, October 10, 2006

Another week started with lots of goals to achieve. The first iteration of the new implementation of our Admin 1 system will conclude at the end of this week. My part isn't done yet. I'm behind my own schedule.

Last week, our external consultant brought in a specialist from Taiwan to help us improve our system. The specialist was introduced as a Microsoft product expert. However, our first impression of him was the total opposite. He introduced COM+ to us, a technology which Microsoft is replacing with .NET. We spent the whole week in those COM+ meetings. The really good thing was that he set up a COM+ server for us to try, instead of telling us to implement something we didn't want to do.

We tried the COM+ servers over the weekend. Coincidentally, all the major leagues were off last weekend for the EURO 2008 qualification rounds, so the load on the system wouldn't be so high. My supervisor took one of our largest sites to test how COM+ performed. The site was slower than one without COM+. It was obvious why things slowed down: a request from a web server now had to go through the COM+ server to get the objects, and after that come back to the COM+ server to execute the request. The COM+ server introduced an intermediate step in connecting to the database. Personally, I think this is one of the reasons Microsoft is replacing COM and COM+ with .NET. The ADO.NET layer provides the same thing COM+ does in terms of reducing the number of connections from the web servers to the database servers, using a pool of SQL connections. I think our specialist knows this too, or at least I hope so.
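
For readers unfamiliar with connection pooling, here is a rough sketch of the general idea: connections are handed out, returned, and reused, so the number of open database connections stays bounded no matter how many requests arrive. This is not how ADO.NET implements its pool. `ConnectionPool`, `acquire`, `release`, and the `createConnection` factory are hypothetical names, and a real pool would typically make callers wait with a timeout rather than fail immediately when the pool is full.

```javascript
// Hypothetical sketch of connection pooling; not ADO.NET's implementation.
class ConnectionPool {
  constructor(createConnection, maxSize) {
    this.createConnection = createConnection; // factory for new connections
    this.maxSize = maxSize;                   // upper bound on open connections
    this.idle = [];                           // connections ready for reuse
    this.inUse = 0;                           // connections currently handed out
  }

  // Hand out an idle connection if there is one, else open a new one.
  acquire() {
    if (this.idle.length > 0) {
      this.inUse += 1;
      return this.idle.pop();
    }
    if (this.inUse >= this.maxSize) {
      // Simplification: fail fast instead of queueing the caller.
      throw new Error('connection pool exhausted');
    }
    this.inUse += 1;
    return this.createConnection();
  }

  // Return a connection to the pool so a later request can reuse it.
  release(connection) {
    this.inUse -= 1;
    this.idle.push(connection);
  }
}
```

The point of the design is that opening a connection is expensive while reusing one is cheap, so each web server keeps a small set of connections open and shares them across requests.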

Advanced Betting System from London came to our company in the middle of the week to introduce their betting software for bookies. It was the first time we got to attend a demo like this. It was a really well-organized system, built on a service-oriented software platform. They didn't use XML for data transport, but a thin Servlet layer instead, to reduce the amount of data transferred over the network. This is exactly what we want our software to be like in the future. My colleague was talking about this a couple of months ago, and we are trying to design our Admin 1 system so that later on we can switch to a service-oriented software platform as easily as possible. My colleague is really good at envisioning and looking ahead to the future development of our software. His Master's degree in Computing Science is showing its true value.

Saturday night was really peaceful for us. The load on the database wasn't high. There were no major problems except some little bugs in the interface of our .NET member sites. We had launched 3 more small sites, and the .NET system is increasing its share of our system.

... Sunday, our off day ...

Sunday, October 01, 2006

Being too confident has become a bad trait for us. Our headaches - our DB performance problems - came back. Our database was jammed again this Saturday.

Last week, we worked hard to move most of the load of the main database out to replication. Going into this Saturday, we were pretty confident that it would be a smooth night. The 9PM peak time came with more than 40 live matches from soccer leagues all around the world. As usual, all the performance graphs and SQL Server profilers were turned on. This time I learnt how to set up a profiler for myself. I also needed to monitor Admin 2; yesterday I continued moving some pages over to read from the replication instead of the main database. Everything was running smoothly. The main DB CPU usage was only around 10%. We started to relax a little.

There were a couple of problems with our ASP sites. My supervisor had changed the way we stored our odds change time, and his odds display JavaScript didn't run correctly anymore. The problems were soon gone, since he is my supervisor ... hehe if I'm good he must be better ... jeez, starting to get cocky ... (I should look at my first sentence again *winks*)

Since all the servers were running nicely, I went to fix bugs in our .NET member sites and get ready for the launch of 9 more sites in the future. All the sites must pass the acceptance tests. We had failed 3 times already; however, the errors were not from our code. We just didn't have correct data, or enough of it, for the testing. The data we copied from the data warehouse into our development environment wasn't enough. I decided to deploy the remaining 9 sites into the real production environment for final testing.

By 11PM, the peak time had almost passed and the load had dropped. Everything was running smoothly. My colleague left work. I still had some deployments under way, so I decided to stay a little late. Then I looked at the performance graph for my usual check, and something unexpected happened. The graph showed unusual behaviour. The database was jammed again. I looked over at my supervisor, and we started receiving complaints.

... SETTLEMENT again ...

Operations calculate the win/loss after the matches are over. This is called the settlement process. The database jammed while the settlement code was executing. The puzzle was why it didn't happen when my supervisor executed the settlement himself. How to solve this? What could we do now? We are running out of options.

... I was off on Sunday ... but still kept monitoring the system from my house ... next week, we will have to continue looking into this issue.

Database is not my specialty ... the only thing I can do is listen to my colleague and my supervisor ... well I will learn ...

This Saturday, our external consultant also invited a SQL specialist from Taiwan to help us improve the database performance. He introduced us to COM+ ... a technology which Microsoft is replacing with .NET Remoting and Web services. Our first impression of him wasn't too good. Let's see what he will have for us.