Programmer's Log

Thursday, September 28, 2006


Adapt to "constant" changes


Technology changes every day, and we have to update ourselves constantly to keep up. In 2002, our software was implemented with ASP and SQL 2000; now we are replacing it with ASP.NET 2.0 and SQL 2005. Next year, we may change to something else ... who knows ... It is not only the technology that changes; customers' requirements also change day by day. Today the requirement is a square, but tomorrow the square may turn into an oval. Businesses change, so the requirements must change as well. To cope with new demands, new technology will be employed and new ways of doing things will be used. Putting everything together creates a circle, and that circle becomes an endless flow for those who choose to be in this technology industry.

Last night, I brought up new features on our Admin 2 website as well. I also found a couple of bugs: bugs in the new features and also bugs in one of the old features.

  • A special character in the team name: my JavaScript returns an error when such a team name occurs

  • The back function of the live delete trace is not working
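A rough sketch of how the first bug can arise and one way to avoid it, written in Python rather than the site's actual code. It assumes the team name is pasted into a JavaScript string literal on the server side; the function names and the team name are made up for illustration.

```python
import json

def js_string_literal(text):
    """Return text as a double-quoted JavaScript string literal."""
    # json.dumps escapes quotes, backslashes and control characters, and by
    # default writes non-ASCII characters as \uXXXX escapes.
    literal = json.dumps(text)
    # Escape "<" so a name containing "</script>" cannot end an inline script.
    return literal.replace("<", "\\u003c")

def naive_assignment(team_name):
    # Fragile: an apostrophe in the name ends the string literal early.
    return "var team = '" + team_name + "';"

def safe_assignment(team_name):
    return "var team = " + js_string_literal(team_name) + ";"

team = "Queen's Park"  # hypothetical team name with a special character
print(naive_assignment(team))  # var team = 'Queen's Park';  (broken JavaScript)
print(safe_assignment(team))   # var team = "Queen's Park";
```

The point is only that escaping happens in one place, so a rare team name cannot break the page script.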



These were rare errors; that's why our operators could not pick them up.

Today we start implementing the Company Admin website after a series of delays due to something called "PROCESS". The management wanted "PROCESS", and "PROCESS" reduced productivity. "PROCESS" is necessary for any development firm; however, it cannot be implemented in a short period of time. I have learnt a lot from this process improvement period. For our company, a heavyweight process like CMMI is too much because our development team is pretty small. A lightweight process like Agile Development is more suitable to start with.

... first iteration is on the way ...

Monday, September 25, 2006


Weekends have become our days for tuning and monitoring the performance of the whole system. Our database performance issues are always at the top of our list. Last weekend wasn't an exception. We went through a series of changes to improve our system. Costs and expensive lessons always accompany such changes. Expensive lessons were learnt, and the cost was 3 hours of downtime on one of the busiest days of our business.

Saturday


Last week we implemented hardware load balancing. The load balancing worked like a charm. The hardware adds additional cookies to the URL and the response; when a request comes back to the firewall, it sends the request to the correct server based on those cookies. A couple of months back, we tried this method and failed. Now, with new hardware, it works really well. I won't have to do software load balancing anymore for our Agent system.
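The cookie idea can be sketched in a few lines. This is only an illustration of cookie-based stickiness in general, not how the hardware actually works internally; the class, the cookie name "lb_server" and the server names are assumptions.

```python
import itertools

class StickyBalancer:
    """Toy cookie-based load balancer: first request is assigned, later ones stick."""

    def __init__(self, servers, cookie_name="lb_server"):
        self.servers = list(servers)
        self.cookie_name = cookie_name          # hypothetical cookie name
        self._round_robin = itertools.cycle(self.servers)

    def route(self, cookies):
        """Return (server, cookies_to_set) for one incoming request."""
        pinned = cookies.get(self.cookie_name)
        if pinned in self.servers:
            return pinned, {}                   # known client: keep the same server
        server = next(self._round_robin)        # new client: pick the next server
        return server, {self.cookie_name: server}

balancer = StickyBalancer(["agent1", "agent2"])
server, to_set = balancer.route({})             # first visit, no cookie yet
print(server, to_set)
print(balancer.route(to_set))                   # same client again, same server
```

Keeping a client on one server matters when session state lives on that server, which is the usual reason for this kind of persistence.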

Our night started with unusual behaviour on our servers, which we detected from the graphs of the Windows Performance Tools. At the moment, we have 2 servers for the Agent System. We deployed our Admin 2 on Agent Server #1. This gave us an idea of how our main database was performing.


Explanation: The top window in the picture was the monitoring graph of our Agent Server #1; the window underneath was the monitoring graph of Agent Server #2. The second graph showed the normal activity of a server. Red line: CPU usage of the server; blue line: requests executing; yellow line: requests coming to the server per second. The first graph indicated that the database was being locked, and our website became slow on the customers' side. The requests executing ran high, which meant that requests kept coming in and some of them had to wait until the database timed out. Time to do something to the database to release all the locks.


Explanation: This picture showed the normal activities of both Agent Servers.

The main database CPU remained at around 30%, but we still received complaints from our operations team that our web admin site was slow, and Admin 2 as well. Was the database jammed again? My supervisor then killed the physical connection to our main database, and the situation got better. We learnt one more thing: a low database CPU doesn't mean the system is performing normally. Even when the CPU is low, the database can still be locked somewhere.

My colleague added more counters to the Windows Performance Tools to monitor "lock requests" and "lock waits" for the database. After monitoring for a while, we found what we wanted. Before the database jammed, the "lock waits" counter jumped, and so did "lock requests". At the same time, the performance tools on our Agent web servers showed the "Request Executing" counter climbing. My supervisor had to kill the physical network connection again so that our database could clear all the current locks.
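Watching a graph by eye works, but the same signal can be checked mechanically. Below is a minimal sketch of a spike detector over counter samples such as "lock waits"; the window size, the factor of 3 and the sample values are arbitrary assumptions, not numbers from our system.

```python
from collections import deque

class SpikeDetector:
    """Flag a sample that jumps well above the recent average."""

    def __init__(self, window=5, factor=3.0):
        self.window = deque(maxlen=window)  # recent "normal" samples
        self.factor = factor                # how many times the baseline counts as a spike

    def add(self, value):
        """Record one sample; return True if it is a spike."""
        full = len(self.window) == self.window.maxlen
        baseline = sum(self.window) / len(self.window) if self.window else 0.0
        # max(..., 1.0) avoids flagging tiny values when the baseline is near zero.
        is_spike = full and value > self.factor * max(baseline, 1.0)
        if not is_spike:
            self.window.append(value)       # keep spikes out of the baseline
        return is_spike

detector = SpikeDetector(window=3, factor=3.0)
samples = [10, 12, 11, 13, 90, 12]          # made-up "lock waits" readings
print([detector.add(s) for s in samples])   # [False, False, False, False, True, False]
```

A check like this could raise an alert before the "Request Executing" counter on the web servers starts to climb.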

We continued to monitor some more. There were still lots of reads against our main database:

  • balance of our agency system

  • bet list of nearly 40,000 customers

  • reports of our admin 2

  • total bets, and forecast in our company admin



"We will move all those reads out of our main database." That was our conclusion for the night. I was in charge of changing the code of the agent system, and we encountered a big issue. The source code in our repository was the ongoing development code, so I could not make the fix there. I had to pull out the production source, create a new project, and make my amendments in there. Plenty of room for errors was awaiting me.

Can hardware issues sometimes affect performance too? Our external consultant thought the hyper-threading feature might be contributing to the performance issue. He decided to turn hyper-threading off at the next system reboot.


Sunday


Another weekend I had to come to work. My supervisor has worked like this for the past 2 years to keep the system up and running. I guess I'm starting to follow him. Well, it is the nature of our jobs as programmers.

I had 2 jobs to finish:

  • move all report stored procedure calls from our Admin 2 to the replication database

  • bring the master agent list to replication, and allow for some delay before customers see new updates from replication
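The idea behind both jobs is to route read-only calls to replication and leave everything else on the main database. A minimal sketch, assuming a fixed list of read-only stored procedures; the procedure names and database labels below are hypothetical.

```python
from collections import Counter

MAIN = "main"
REPLICATION = "replication"

# Hypothetical stored procedure names, for illustration only.
READ_ONLY_PROCEDURES = {"report_daily_turnover", "get_master_agent_list"}

def pick_database(procedure_name):
    """Send known read-only procedures to replication, everything else to main."""
    if procedure_name in READ_ONLY_PROCEDURES:
        return REPLICATION
    return MAIN

def calls_per_database(procedure_names):
    """Tally where a batch of calls would go, as a quick sanity check."""
    return Counter(pick_database(name) for name in procedure_names)

calls = ["report_daily_turnover", "place_bet",
         "get_master_agent_list", "report_daily_turnover"]
print(calls_per_database(calls))
```

Defaulting to the main database is the safe choice for writes, but it also means a forgotten entry silently keeps its reads on main, so a tally like the one above is worth checking after each change.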



I finished the first task quickly. I grabbed my junior colleague to test my changes while I proceeded to the next task. I guess he was unlucky to let me see him online on his day off. The test was fast, and everything worked fine. I deployed the new version of our Admin 2 quickly. Everything went so smoothly that I thought I would be out of the office pretty soon.

Yesterday, before I left, I had already set up the new project for our agent code in order to implement the new code without bringing our ongoing development into our development servers. I had already tested the code in our development environment, and everything ran smoothly. After getting the signal from my supervisor, another version of the agent system was deployed to our servers. I made a test site to test the new code. I got errors and could not log in at all. They were strange errors. Parameter could not be null. Parameter value: value! What the h*** was going on? In the development system, everything worked fine. I had to revert to our old code and start debugging. After a while, I found that our new code for the new functionality was causing the problem: I had passed a null value into the String.IndexOf function of .NET. It completely threw me that the .NET Framework would give an error message like that for that kind of mistake.
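The same class of mistake is easy to reproduce in other languages: Python's str.find raises a TypeError when given None, much as String.IndexOf rejects a null argument. A small sketch of checking inputs early so the error names what is missing; the function and parameter names are made up for illustration.

```python
def find_separator(login_name, separator):
    """Return the index of separator in login_name, or -1 if it is absent."""
    if login_name is None or separator is None:
        # Fail early with a message that names the missing input.
        missing = "login_name" if login_name is None else "separator"
        raise ValueError(missing + " is missing (None); check this environment's settings")
    return login_name.find(separator)

print(find_separator("agent_01", "_"))  # 5

try:
    find_separator("agent_01", None)    # e.g. a setting that exists in dev but not in production
except ValueError as error:
    print(error)
```

A message like this points at the missing value directly instead of at the library call that happened to receive it.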

The database was restarted. While we waited anxiously for the main DB to come up again, strange things happened. In the performance windows, the CPU of the main DB quickly rose to 100%. Our supervisor killed the LAN connection to the main DB again. After he killed the connection, the power in his house went out, and it was time for him to go to the airport to catch his plane to Manila. No one turned the LAN connection back on. I had to call our external consultant to turn it back on, since there was no remote connection left for me to go in through. When the DB was back on, the CPU rose to 100% again. We thought that turning off hyper-threading might be causing the problem. The main DB was shut down and restarted again.

Our external consultant monitored the SQL Profiler and told me that every single SP call from the agent system was going to the main DB instead of the replication servers. I rolled back to the old code immediately, grabbed my colleague to monitor for me, and went to figure out what had gone wrong. There were 2 things I picked up:


  • My junior had changed my default setting inside one of the main components of our agent software packages.

  • The default connection to the main DB hadn't been changed; it must be changed.



This was totally my mistake. I should've checked more carefully. It was an expensive lesson for me: I should've been more careful, replaced the code one site at a time, from the least accessed sites to the most accessed ones, and monitored the SQL Profiler along the way. I underestimated the changes I made. I built the system, and I still made mistakes. After fixing all the bugs, I applied the lesson this time: I replaced the agent websites one by one and closely monitored the profiler which my colleagues had set up for me.
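The "least accessed site first" lesson can be written down as a small routine. This is only a sketch of the idea: the site names, the traffic numbers, and the deploy and health-check callbacks are all hypothetical stand-ins (in practice the health check was a person watching the profiler).

```python
def staged_rollout(sites, deploy, looks_healthy):
    """Deploy from the least to the most accessed site; stop at the first failure.

    Returns (sites deployed successfully, site that failed or None).
    """
    deployed = []
    for site in sorted(sites, key=lambda s: s["requests_per_day"]):
        deploy(site["name"])
        if not looks_healthy(site["name"]):
            return deployed, site["name"]   # stop before touching busier sites
        deployed.append(site["name"])
    return deployed, None

sites = [
    {"name": "agent-c", "requests_per_day": 900000},
    {"name": "agent-a", "requests_per_day": 20000},
    {"name": "agent-b", "requests_per_day": 150000},
]
log = []
result = staged_rollout(sites, deploy=log.append,
                        looks_healthy=lambda name: name != "agent-b")
print(result)  # (['agent-a'], 'agent-b')
print(log)     # ['agent-a', 'agent-b']
```

In this made-up run the busiest site is never touched, because the problem shows up on a smaller site first.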

After 3 hours of downtime and unstable performance, our system was back on again. At the same time, my supervisor was implementing performance improvements as well; he moved all the reports from our member sites to replication. I decided to stay at work through the peak time to monitor a bit more closely and to make sure everything was running normally. Our system performed extremely well, with no more jams for the night. Our agent servers didn't show anything unusual anymore. The main DB CPU stayed under 15% the whole time. Next week, we will have to monitor again. Hopefully everything is solved.

It was a long weekend ... going home and looking for my bed ... that was the only thing left in my brain when I left work ...

Tuesday, September 19, 2006


We solved a lot of requests last week. Today, I have free time again to update this blog with the action from last weekend. It was a really busy weekend.


I finished implementing the bandwidth reduction for our member sites. The implementation isn't quite where I want it to be, but it is alright: a simple version first, and a more complicated version will come later. Compared to the ASP version, our version sends probably 15% less data to the clients' browsers.

Last Saturday was a really chaotic day. We encountered so many unknown problems. We kept receiving complaints from our operations team about the sites being slow, and we experienced the lag ourselves too. The network was really smooth; the ping plotter program didn't show any disruption. The CPU usage of the database was at 30%. Everything seemed normal except that the sites were slow. The situation kept getting worse; the trading general dispute director also came to our IT department to check what was wrong, and our big bosses were all present. Eventually we pinpointed the probable cause: the upgrade of 2 RAMSAN units. We had upgraded them on Thursday to double their capacity. Our supervisor tried moving the SQL Server log to a different RAMSAN unit. The situation seemed a little better; however, there were still complaints.

In technical terms, our business is database-centric. Like an MMORPG, the business relies heavily on database performance, with more than 38,000 users logged in concurrently at peak time. When our database has problems, our business suffers too.

Half an hour later, we used the Windows performance monitoring system to monitor the read and write disk queues of those RAMSAN units. Everything was normal again. Strange, there were no big changes. At the time of the problem, we made a series of changes to the disks, and it didn't help. Then how did everything become normal? The load on the database remained the same, even a bit more than half an hour earlier. STRANGE !!! We had so many factors to consider in order to pinpoint the problem exactly.

Our Agent web server had a problem as well. It reached its capacity limit: its CPU usage was always at 97-99% at peak time. "On Sunday, we will try hardware load balancing from the F5 for the Agent system," my supervisor said to me. I went to set up the server for him to prepare for Sunday.

Saturday night became a long night for us. Database performance still remains our biggest issue.

Sunday night, with just a bit less load than Saturday, the system performed smoothly. Was our external consultant's assumption correct? We didn't know, and we could not claim that what he found was the actual problem we had on Saturday; there were too many factors to consider. He told me in the afternoon that he had found the problem: it may have come from the unused indices which we had left on the main table. He removed them.

When it came to the settlement part, we still had big problems even though we were using our new code for the settlement; we had tested it thoroughly and it was running much faster than the older versions.

We succeeded with load balancing from our F5 firewall on one of our sites. That was the only good news for us for the whole weekend.

... definitely have to monitor again on the coming weekend ...

Wednesday, September 13, 2006


"We will down the system for ..." is always the sentence that no one in the online business wants to hear from their IT department; however, system upgrades always require downtime. Today, we delivered new functionality to our system, mainly in the AGENT system.


To implement these auto/manual position-taking functions, our DBAs had made lots of changes to our database. System downtime was unavoidable, and management allotted us two hours to deliver the software. Two hours was supposed to be a lot of time for us. In the end, we just barely made it. All the replication servers had to be resynchronized, and we had 7 of them. Of course, synchronizing all of these servers took some of our precious time.

My supervisor, my colleague, and I started earlier than usual to make the time allotted for the system downtime. My supervisor took care of resynchronizing all the data and all the replications. My colleague was to update the database code and prepare the new database structure scripts. I was in charge of the application code. Even though we had tested thoroughly with test scripts and passed the user acceptance tests, I was still a bit nervous. The funny thing was that I had done these things so many times, and I still felt a little sweaty.

At 11:45 AM Vietnam time, the system was up again. It took us an hour to finish.

BUGS ... ERRORS ... we found those words again for this update. After the new releases were deployed, one of our DBAs, my junior colleague, found one critical bug. This bug only appears when agents and masters interact on the same functions. This one was a "must" fix ... I dealt with the bug myself, even though the code was developed by juniors and they were all present. Fixing bugs was a chance for me to review their code.

Even at deployment time, I had my juniors implement a small new request; I had somehow forgotten this request before the deployment. Last-minute implementation still happens. Something called "PROCESS" isn't functioning well. I do know why and am still observing. Precious experience.

I was really impressed with my junior; she has really improved. She could definitely replace me, and I will train her in depth on everything I know.

I haven't finished improving the odds display yet; so many "must" requests intervened. Anyway, I've gotta wrap it up in the next 2 days...

Tuesday, September 12, 2006


Another week passed by so fast. I was not very productive last week. Things must change this week. Every day, we receive more and more requests. I just cleaned my desk at work, but it will definitely fill up with papers again quickly. There are so many things on our hands: implementing new requests, improving and optimizing the current code, and training our juniors. Well, I guess nothing is free, so just live with it.


The first day of this week wasn't so bad. I decided to stay a bit late to finish all the pending requests that management marked as URGENT on the ASP.NET member site:

  • A delay of 2 seconds between mix parlay bets was implemented. Our DBAs finished this one. We don't show users that they are restricted from placing a new parlay within 2 seconds; we just tell them that they can't bet because of "Odds changed"

  • Provided a Help section for the mix parlay

  • Implemented the count of current games on the menu

  • Prepared all the code for the splitting of the servers tomorrow
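The 2-second parlay rule from the first item was done by our DBAs in the database, so the following is only a sketch of the rule itself in Python, with a made-up class name and an in-memory store; the real implementation is not shown here.

```python
class ParlayThrottle:
    """Reject a member's parlay bet if it comes too soon after the previous one."""

    def __init__(self, min_gap_seconds=2.0):
        self.min_gap = min_gap_seconds
        self._last_bet_at = {}  # member id -> time of the last accepted bet

    def try_bet(self, member_id, now):
        """Return (accepted, message). `now` is a timestamp in seconds."""
        last = self._last_bet_at.get(member_id)
        if last is not None and now - last < self.min_gap:
            # The member is not told about the time limit, only "Odds changed".
            return False, "Odds changed"
        self._last_bet_at[member_id] = now
        return True, "OK"

throttle = ParlayThrottle()
print(throttle.try_bet("member1", now=100.0))  # (True, 'OK')
print(throttle.try_bet("member1", now=101.0))  # (False, 'Odds changed')
print(throttle.try_bet("member1", now=102.0))  # (True, 'OK')
```

Note that a rejected attempt does not reset the timer in this sketch; whether it should is a design decision the original requirement does not spell out.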



Today, Boss comes back ... definitely more things to come ... *sigh* . We will have action soon.

Saturday, September 09, 2006


Backlash



For the past few days, I have been struggling with the new odds display and the bandwidth reduction, fighting my own code. A couple of months ago, I implemented our Member website with only one purpose in mind: "make it better than the ASP version, as much as possible". I did succeed; our ASP.NET version is better and faster, and goes back to the database much less during the betting procedure. However, now that I want to improve it, I face so many obstacles that they drive me to the point of reimplementing everything. My inexperience in designing and coding now shows its true colours.


In order to reduce the bandwidth, I must detect all possible odds changes.

  • In the ASP version, the stored procedure returns the bet types by row: each odds is on its own row, so detecting an odds change means checking a single value per row, the odds change time

  • In the ASP.NET version, for my coding convenience, I requested that all the odds types of a match be put in one row of the record set. Bet types 1, 3, 7, 8 are on 1 row with 4 different change times. To detect odds changes, I have to check all 4 values. It gains in performance because moving down a row in the dataset costs more than checking 4 values. But the source code will definitely be more complicated.
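The difference between the two layouts can be sketched like this. The row shapes, field names and timestamps are assumptions for illustration; only the bet type numbers 1, 3, 7, 8 come from the actual record set.

```python
BET_TYPES = (1, 3, 7, 8)

def changed_single_row(row, last_refresh):
    """One-odds-per-row layout: a single change time to compare."""
    return row["changed_at"] > last_refresh

def changed_bet_types(match_row, last_refresh):
    """One-match-per-row layout: check each bet type's own change time."""
    return [bt for bt in BET_TYPES if match_row["changed_at"][bt] > last_refresh]

# Made-up rows; times are plain integers to keep the example simple.
odds_row = {"odds_id": 9001, "bet_type": 3, "changed_at": 130}
match_row = {"match_id": 501, "changed_at": {1: 100, 3: 130, 7: 90, 8: 131}}

print(changed_single_row(odds_row, last_refresh=120))  # True
print(changed_bet_types(match_row, last_refresh=120))  # [3, 8]
```

The second function does four comparisons per row instead of one, which is the extra complexity traded for fewer rows to walk.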



After detecting all odds changes, I must send these changes out in a compact form, then populate the data into the correct odds positions, which are identified by the odds id. Everything sounds easy. ..BUT.. things are not as easy as they should be; I have to deal with my bet engine improvement, which I used to be proud of.

  • In the ASP version, each odds is just a link. Once users click the link, they go to the database to get all the information about the choice they want to bet on; then one more trip to the database creates the bet slip.

  • My improvement in the ASP.NET version was to store all the possible information, at the time I built the display, in a JavaScript string called the "betstring". This "betstring" contains all the necessary information about the user's choice. There is no need to go back to the database to collect that information any more.



Because the bet slip is built from the "betstring", I need all the information about the user's choice, including the team names, league names and score. That information usually makes up most of the data sent to the browser. If I continue to include it in the refresh data, I won't save much bandwidth. My previous improvement now has its own impact.

Everything can be solved, except that my code will look UGLY.
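One possible way out, sketched in Python instead of the page's JavaScript: send the bulky, rarely changing match information once on first load, send only odds ids and values on refresh, and assemble the "betstring" at click time. The field names and the "|" separated format are invented for this sketch; the real betstring format is different.

```python
def build_static_info(matches):
    """Sent once on first load: the bulky text, keyed by match id."""
    return {m["match_id"]: (m["league"], m["home"], m["away"]) for m in matches}

def build_refresh(changes):
    """Sent on every refresh: only the odds id and its new value."""
    return [(c["odds_id"], c["value"]) for c in changes]

def make_betstring(static_info, match_id, odds_id, value, score):
    """Assembled when the user clicks an odds, from data already in the browser."""
    league, home, away = static_info[match_id]
    return "|".join([str(odds_id), league, home, away, score, f"{value:.2f}"])

matches = [{"match_id": 501, "league": "Sample League", "home": "Team A", "away": "Team B"}]
static_info = build_static_info(matches)
refresh = build_refresh([{"odds_id": 9001, "value": 0.95, "match_id": 501}])

print(refresh)  # [(9001, 0.95)]
print(make_betstring(static_info, 501, 9001, 0.95, "0-0"))
# 9001|Sample League|Team A|Team B|0-0|0.95
```

The score is the awkward part, since it changes during live games; in this sketch it is passed in separately rather than stored with the static data.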

Monday, September 04, 2006


Monday, another working week starts. We were supposed to have a long weekend because our national holiday fell on Saturday. I didn't think I'd have that luxury, and as usual I was right.


Problems arose. I got messages from our Customer Support Department at 1:30 AM. I was unlucky enough to be online at that time checking my email, and got caught. Anyway, problems had been detected; they had to be fixed.

  • 777goal correct score and total goal didn't display --> I had pointed the cache server to a different database, and the stored procedures there were not up to date.

  • s33bet members could not place parlay bets --> Data conversion problem from the database. We had just made some performance modifications to the database, and some columns had been converted to different data types. Bugs spotted.

  • s33bet minimum bet displayed incorrectly --> a new minimum bet value had been introduced in our ASP version, and we hadn't carried it over to our .NET version



SOLUTION: come to work on holidays. Quick and simple, but not as dirty as it sounds ...

In the evening, my mailbox received 2 more new requests, just display requests for s33bet; geared up, I marked them DONE in 10 minutes. Anyway, I must update our requirement collector about these requests. He has the responsibility of documenting everything we do. That's called PROCESS. I'm thinking of writing a blog post on PROCESS in our company soon. Hopefully it won't be called CHAOS ... j.k ...

Now, back to bringing the new bandwidth reduction method into our .NET member site ...

Friday, September 01, 2006


It is time to implement the new odds display functionality in our .NET member site. I created a different set of JavaScript and a new set of code-behind files. I will start from scratch. Let's see what we have to do.


  • When users navigate to the odds display page for the first time, we need to build the tables.

  • When users refresh, the code-behind must send just enough information to the browser, and the JavaScript will spot which odds have changed in order to update the display.

  • Deal with situations such as new live games, odds closed, live games ending, new games offered, etc.


I broke everything down into small problems, and started solving them one by one.

1. Remove all the duplicate data in our old odds display engine first.
    Team information for multiple handicap odds and the show time information will be removed first. Since we introduced double handicap for each match, this will reduce our bandwidth by quite a bit.

2. Implement code to detect all changes since the last refresh.
    This one will be a challenge. So many possible cases have to be considered. The hardest cases would be a match or an odds being removed or added. I will detect just odds value changes first.
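A first sketch of step 2, comparing the odds from the last refresh with the current ones. It assumes both snapshots are simple mappings from odds id to odds value, which is a simplification; the ids and values below are made up. Value changes are what I plan to handle first, and the added and removed sets are included only to show where the harder cases would surface.

```python
def diff_odds(previous, current):
    """Compare two {odds_id: value} snapshots.

    Returns (changed, added, removed):
      changed - {odds_id: new_value} for odds present in both with a new value
      added   - sorted odds ids that appear only in the current snapshot
      removed - sorted odds ids that appear only in the previous snapshot
    """
    changed = {oid: value for oid, value in current.items()
               if oid in previous and previous[oid] != value}
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    return changed, added, removed

previous = {9001: 0.95, 9002: 1.80, 9003: 0.88}
current = {9001: 0.97, 9002: 1.80, 9004: 2.10}

changed, added, removed = diff_odds(previous, current)
print(changed)  # {9001: 0.97}
print(added)    # [9004]
print(removed)  # [9003]
```

Only the `changed` mapping would need to go to the browser on a normal refresh, which is where the bandwidth saving comes from.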