
A surprise learning opportunity from SQL Azure

June 28, 2012

Have you ever noticed that saying something like “I never get caught speeding” seems to just invite trouble? (No, I haven’t gotten caught. Yet.) (Crap. I can’t believe I just said that!)

In my last article about GoldMail migrating to Windows Azure, I stated one of the unforeseen benefits: “Since we went into production a year and a half ago, Operations and Engineering haven’t been called in the middle of the night or on a weekend even once because of a problem with the services and applications running on Windows Azure.”

Yep, just asking for trouble. Fewer than 10 days later, I get a call from the CEO after work hours asking if we’re having a problem, because he can’t seem to upload a GoldMail. He sent me the log file for the desktop application (the Composer); it was timing out when calling the backend (WCF Services running on Windows Azure) to update the database. Uh-oh.

I fired up Cerebrata’s Azure Diagnostics Manager and asked for the last hour of diagnostics for the Composer’s Azure Service. It proceeded to pull thousands and thousands of records. And thousands more. And then some more thousands. So I stopped it, and tried the last 5 minutes. (Still thousands of records). There were repeated tracing statements like this:

[InitializeMail] GMID created = xxxxxxxxxxxx …
[CopyAGoldMail] Adding record for GMID xxxxxxxxxxxx …

(We always put the method in square brackets as the first part of all trace statements.)

Did I mention there were thousands of these statements in the logs, over and over again, each set with a different GMID? I think when it was all said and done, there were hundreds of thousands of records written to the diagnostics in the time between this problem happening and somebody reporting it.

Lesson #1: Don’t have your Azure service send you an e-mail whenever there’s an error in your service. Yes, we’ve considered this, and after looking at this trace logging, I was never so glad as I was then that we didn’t implement this idea. If I ever do implement something like this, I’ll be sure to put in some kind of check so it doesn’t send me a hundred thousand e-mails. Yikes!
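Here is a minimal sketch of the kind of check Lesson #1 has in mind, written in Python rather than our .NET stack. The class name, the limits, and the `send` callable are all hypothetical; the idea is simply to cap the number of e-mails per time window and count the rest instead of sending them.

```python
import time


class ThrottledAlerter:
    """Send at most `max_alerts` notifications per `window_seconds`."""

    def __init__(self, send, max_alerts=5, window_seconds=3600,
                 clock=time.monotonic):
        self._send = send              # hypothetical callable that delivers one e-mail
        self._max_alerts = max_alerts
        self._window = window_seconds
        self._clock = clock            # injectable so the window can be tested
        self._window_start = None
        self._sent = 0
        self.suppressed = 0            # errors seen but not e-mailed in this window

    def alert(self, message):
        now = self._clock()
        # Open a fresh window when none exists or the current one has expired.
        if self._window_start is None or now - self._window_start >= self._window:
            self._window_start = now
            self._sent = 0
            self.suppressed = 0
        if self._sent < self._max_alerts:
            self._sent += 1
            self._send(message)
            return True
        self.suppressed += 1
        return False
```

With something like this in front of the mail call, an error loop produces a handful of e-mails and a suppressed count, not a hundred thousand messages.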

Lesson #2: Windows Azure Table Storage is pretty damn performant. It was writing around 40,000 records per minute to the diagnostics.

Seeing all of these [CopyAGoldMail] errors was weird, because CopyAGoldMail is only called from a utility that I wrote for the systems engineer, and he hadn’t run it. The immediate problem was that it was obviously running in an infinite loop. We checked the Azure instance, and both instances were running at almost 100% CPU and were very difficult to RDP into. We rebooted the instances to stop the infinite loop.

In the meantime, I looked at the code, and it looked like it was happening when executing a stored procedure that added a record to the database. Rather than return a value for @@identity, it was returning 0. The code wasn’t checking for this specifically, and I thought maybe it was causing a problem, so I decided to change the stored procedure to return an error code that was specifically being checked for until I could examine the problem more closely later.

I logged into SQL Server Management Studio, opened the database, right-clicked on the stored procedure and scripted it out (drop and create), changed it and tried to execute it. It didn’t work. I got back an error that our SQL Azure database had hit its size limit and I could not add anything. SURPRISE!

A little frantic bing-ing and one of these later:


And everything started writing to the database again.

Lesson #3: You won’t get any notification that you are going to hit the size limit of the SQL Azure database, or that you HAVE hit the size limit of your SQL Azure database.
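Since nothing warned us, a scheduled check of our own could have. This is a rough sketch in Python, not something we actually ran: the function name and the 80% warning level are assumptions, and how you measure the used size is left to the caller.

```python
def quota_status(used_mb, max_mb, warn_ratio=0.8):
    """Classify database usage against its size quota.

    `used_mb` and `max_mb` are supplied by the caller; the 0.8 warning
    ratio is an arbitrary illustrative default.
    """
    if max_mb <= 0:
        raise ValueError("max_mb must be positive")
    ratio = used_mb / max_mb
    if ratio >= 1.0:
        return "full"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"
```

A job that runs this every so often and raises a flag on "warning" would give some lead time before inserts start failing.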

Another interesting fact: it started at exactly 5 p.m. PDT. Coincidence? Or did we really hit the limit at exactly 5 p.m.? Or does Azure check the limit at specific times rather than continuously?

After it was all said and done, I still needed to figure out where the infinite loop was, and who the heck had run CopyAGoldMail. Every call to our service authenticates the caller, so I just needed to look backwards through the trace diagnostics and find the first entry for CopyAGoldMail. So I started by retrieving trace diagnostics in 15-minute intervals from 5 p.m. forward until I found an interval with thousands and thousands of records. Then I backed off and started looking for 5-minute intervals, until I found where the CopyAGoldMail tracing started.
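The coarse-then-fine search I did by hand can be sketched like this. It is Python for illustration only; `count_records` is a hypothetical callable that returns how many matching trace entries fall in a time range, standing in for whatever diagnostics tool you query.

```python
from datetime import timedelta


def first_window(count_records, start, end, step, threshold):
    """Return the start of the first `step`-wide window holding at least
    `threshold` records, or None if no window qualifies."""
    t = start
    while t < end:
        if count_records(t, min(t + step, end)) >= threshold:
            return t
        t += step
    return None


def locate_onset(count_records, start, end, busy_threshold,
                 coarse=timedelta(minutes=15), fine=timedelta(minutes=5)):
    """Find a busy coarse window, then back off and scan fine windows
    for the first one that contains any matching record."""
    busy = first_window(count_records, start, end, coarse, busy_threshold)
    if busy is None:
        return None
    # Back off one coarse step: the first entries may precede the busy window.
    fine_start = max(start, busy - coarse)
    fine_end = min(end, busy + coarse)
    return first_window(count_records, fine_start, fine_end, fine, 1)
```

The result is the start of the 5-minute window to read closely, which keeps each diagnostics query small enough to actually finish.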

What was interesting is that most of the writes to the database failed with a SQL Exception and were trace logged accordingly:

System.Data.SqlClient.SqlException (0x80131904): The database ‘GoldMail’ has reached its size quota. Partition or delete data, drop indexes, or consult the documentation for possible resolutions.
at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)
at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning()
at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async)
at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, DbAsyncResult result)
at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(DbAsyncResult result, String methodName, Boolean sendToPipe)
at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()
at ComposerWebRole.<snip>

But the calls to add a new record to the MAIL table did not throw exceptions. Every execution returned 0 for the @@identity value. It was a straight INSERT statement, no rocket science involved, and not much different from the INSERT statement that resulted in the aforementioned exception.

Lesson #4: Not all SQL Azure errors result in thrown exceptions.

After tracking down the start of the CopyAGoldMail tracing, I still couldn’t determine who had called it. It looked like somebody had uploaded a GoldMail and then it had just jumped from adding a GoldMail to copying a GoldMail on its own for no reason I could discern. Both methods are included in the service contract, but there’s no overlap between them.

Sherlock Holmes said that when you have eliminated the impossible, whatever remains, however improbable, must be the truth. It’s impossible for a service to randomly execute code in a different class. What remains? The trace logging. Yes, it turned out that CopyAGoldMail and AddAGoldMail are very similar, and someone copied some of the code and didn’t change the name of the method in the trace logging. (Yes, it was me.) So it wasn’t CopyAGoldMail that had the infinite loop, which is why I couldn’t find it there. It was in AddAGoldMail. It wasn’t handling the case of @@identity being returned with a value of 0, and it was looping infinitely until it got a non-zero value back, which it would have after updating the database size if we hadn’t rebooted the instances.
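The shape of the fix is a loop with an upper bound that fails loudly when the insert keeps returning 0. This is a sketch in Python, not our actual service code; `run_insert` is a hypothetical callable that performs the INSERT and returns the identity value.

```python
class InsertFailed(Exception):
    """Raised when an insert never yields a usable identity value."""


def insert_with_identity(run_insert, max_attempts=3):
    """Call `run_insert` until it returns a non-zero identity,
    giving up after `max_attempts` tries instead of looping forever."""
    for _ in range(max_attempts):
        new_id = run_insert()
        if new_id:  # 0 (or None) means the row was not created
            return new_id
    raise InsertFailed(
        "insert returned no identity after %d attempts" % max_attempts
    )
```

The attempt limit of 3 is arbitrary. The point is that a zero identity becomes an exception someone sees, not 100% CPU on both instances.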

Lesson #5: If you’re going to rely on your trace logging when you have a problem, make sure it’s right!
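One way to make this mistake harder to repeat is to stop typing the method name into trace statements by hand. Here is a sketch in Python using a decorator (in .NET you would reach for a different mechanism); the function and logger names are hypothetical stand-ins.

```python
import functools
import logging


def traced(func):
    """Hand the wrapped function a `trace` callable that prefixes every
    message with the function's real name in square brackets."""
    logger = logging.getLogger("trace")

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        def trace(message):
            logger.info("[%s] %s", func.__name__, message)
        return func(trace, *args, **kwargs)

    return wrapper


@traced
def add_a_goldmail(trace, gmid):
    # Copy-pasting this body into another function cannot carry the old
    # name along, because the tag comes from the function itself.
    trace("Adding record for GMID %s" % gmid)
    return gmid
```

Calling `add_a_goldmail("abc")` logs `[add_a_goldmail] Adding record for GMID abc`, whatever code was pasted into the body.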

3/8/2014 GoldMail is now doing business as PointAcross.

GoldMail and Windows Azure

June 12, 2012

I am the director of software engineering at a small startup called GoldMail. We have a desktop product that lets you add images, text, Microsoft PowerPoint presentations, etc., and then record voice over them. You then share them, which uploads them to our backend, and we give you a URL. You can then send this URL to someone in an e-mail with a hyperlinked image, post it on Facebook or Twitter or LinkedIn, or even embed it in your website. When the recipient clicks the link, the GoldMail plays in any browser using either our Flash player or our HTML5 player, depending on the device. We track the views, so you can see how many people actually watched your GoldMail.

GoldMail is like video without the overhead. I use it here on my blog, and many of our customers use it for marketing and sales. (I also use it for vacation pictures.)

What can Azure do for us?

About a year and a half ago, I attended an MSDN event hosted by Bruno Terkaly talking about Windows Azure, and I was impressed, especially at the possibilities it provided for a startup. Rather than buying and hosting enough servers to handle your “Oprah moment” – the day she talks about your product on her show, followed by a huge increase in traffic which tests whether you really did scale your infrastructure correctly – you can start small with low cost, and then scale up as you need to. (Presumably, your rising infrastructure needs will mirror your rising revenue stream!) Also, we primarily use the Microsoft development stack, and it was very appealing to leverage the .NET skills we already had.

At the time, we had several servers maintained by a hosting company in Silicon Valley, and the cost was a substantial part of our monthly burn rate. I met with the VP of Operations (Samar Kawar) and we estimated what we thought it would cost to host our infrastructure entirely in Azure. The estimate was so low, we doubled it before presenting it to management.

Unfortunately, our contract with our hosting company was about to roll over for another year. We couldn’t do the migration before the end date of the contract, so the project was shelved. A month or two later, we discovered that the contract rolled over to a month-to-month contract. Not surprisingly, the project was suddenly resurrected and given top priority!

My total experience at that point with Azure was:

  1. attending a Microsoft event explaining the capabilities,
  2. reading the book Azure in Action by Brian Prince and Chris Hay, and
  3. spending a lot of time thinking about it in the shower.

So of course I told management we could finish the development in about 30 days, and have everything go into production within 60 days.

My boss thought I was completely nuts. So I told him I could get our Silverlight application to work in Azure in 15 minutes + deployment time. He was skeptical. So I showed him the quick way to turn a web application (or Silverlight, in this case) into a cloud project. I did this (which is outlined in the first page of this article):

  1. Opened the Visual Studio solution.
  2. Added a cloud project with no roles.
  3. Right-clicked on the cloud project and selected “Add web role in project”, selected the Silverlight project, and hit OK.
  4. Set the Azure configuration values for the storage account.
  5. Right-clicked on the cloud project and published it to Azure.
  6. Ran the application, and it worked.

My boss said, “Okay, go for it. I still think you’re nuts, but maybe not completely nuts.”

What do we have?

Let’s talk infrastructure. Here’s what we had:

  • Desktop client application (used to create the GoldMails)
  • Flash client application (used to play the GoldMails)
  • HTML5 client application (used to play the GoldMails on mobile devices)
  • Silverlight application (communicates with CRM system and SQL Server database)
  • A bunch of .NET 2.0 asmx web services used to talk to the SQL Server database from the client applications
  • Web services used by client applications to talk to our CRM system
  • Web applications for user management and affiliate management; these used the web services to communicate with our CRM system
  • Our company website
  • SQL Server database
  • Integration service to communicate between our CRM system and the SQL Server database.

We wanted to migrate everything except for the CRM system.

I set benchmarks for the project, and we set to work.

What’s the big plan?

Migrate the SQL Server database to SQL Azure. We did this using the SQL Azure Migration Wizard on CodePlex; it worked great. I discovered some triggers and some CLR routines that wouldn’t migrate, but they weren’t critical, and were easily re-architected. We did this migration first, because you can’t test anything without the database!

Migrate the .NET 2.0 asmx web services. All access to the SQL Server (now SQL Azure) database goes through these services. I re-architected the services, separating them by product so we could scale them separately. And because no programmer worth his (or her) salt is going to miss an opportunity to upgrade to the latest version, I converted them to .NET 4 WCF services running in web roles.

I had never used WCF, so I had to stop and figure it out. I used two books for reference, Learning WCF by Michele Bustamante and Essential WCF by Resnick, Crane, and Bowen. Between those two books, I managed to grasp the principles and create simple WCF services. (Trying to learn WCF quickly was actually more painful than learning Azure.)

The WCF service for the players writes update requests to a queue, and a worker role pulls the entries from the queue and writes them to the database. I did this because I didn’t want the customer to have to wait for a response from the web service before continuing with their workflow, and to even out some of the database access.

The WCF service used by the desktop application submits an entry to a queue after a customer shares a GoldMail. A worker role retrieves the entry from the queue and creates small versions of the customer’s large slides, to be used for our mobile player. I removed this function from the client application, which reduced the amount of time it took to share a GoldMail by 50%.
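The pattern in both cases is the same: the service drops a message on a queue and returns, and a worker does the slow part later. Here is an in-process sketch in Python, with a thread and `queue.Queue` standing in for an Azure queue and a worker role; the function names and message shape are hypothetical.

```python
import queue
import threading


def start_worker(work_queue, handle):
    """Run `handle` on each queued item in a background thread.
    A None item is the signal to stop."""
    def loop():
        while True:
            item = work_queue.get()
            try:
                if item is None:
                    return
                handle(item)   # e.g. write to the database, resize slides
            finally:
                work_queue.task_done()

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread


def record_view(work_queue, gmid):
    """Service-side call: enqueue the update and return right away,
    so the caller never waits on the database."""
    work_queue.put({"gmid": gmid})
    return "accepted"
```

The caller sees "accepted" as soon as the message is queued, and the worker can drain the queue at whatever pace the database tolerates.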

Migrate the Integration service (CRM <—> SQL Server). This was a SQL Server job with a lot of big queries. We migrated this to a worker role, feeding the changes into queues which were then processed separately. This was a critical path project; without this piece, none of the client applications could go into production. This was completely re-architected to introduce significant improvements in the process.

Migrate the Silverlight application. As I did in the demo to my boss, I added a cloud project, set the Silverlight application project as the web role, and updated the Azure configuration. This application is used by our desktop product.

Modify the desktop application. This had to be changed to call the services in Azure instead of the old .NET 2.0 asmx web services. Because all of the access methods are in a proxy layer that resides in one method, only that one method had to be changed to update our desktop product. Plus, as noted above, I removed the creation of the smaller images for the mobile player.

Modify the Flash application. This had to be changed to call the new services. I had to do a bit of trial-and-error to figure out the right bindings for the WCF service so it would be callable from Flash.

Migrate the HTML5 application. This was easily migrated to a web role.

Migrate the web services used to talk to the CRM system. We converted these to .NET 4 WCF services running in a web role.

Migrate the web applications (signup/affiliate) that update the CRM system. We changed these to access the new services and migrated the web applications to web roles.

Migrate our website. We had a component that was installed on the webserver that we were using to check for browser and machine type (Mac/PC) and redirect accordingly. I removed the component and changed the webpage that used it to check the user agent string instead. We also use URL Rewrites, so I had to figure out how to configure that in my web.config so it would work in Azure. Then I just added a cloud project, set the website as the web role, and updated the Azure configuration (just like the Silverlight application).
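Checking the user agent string can be as simple as the sketch below. It is Python for illustration, not the code on our site, and the substrings and the extra "mobile" bucket are my assumptions about what such a check might look for.

```python
def platform_from_user_agent(user_agent):
    """Guess the visitor's platform from the User-Agent header."""
    ua = (user_agent or "").lower()
    # Check mobile first: iPhone and iPad user agents also contain
    # "Mac OS X", so a Mac test placed first would misclassify them.
    if "iphone" in ua or "ipad" in ua or "android" in ua:
        return "mobile"
    if "macintosh" in ua or "mac os x" in ua:
        return "mac"
    if "windows" in ua:
        return "pc"
    return "unknown"
```

The page then redirects based on the returned value. User agent sniffing is only a best guess, so an "unknown" fallback page is worth having.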

The devil is in the details

I figured out how to do the builds and set up the configurations and turned it over to our release manager/build engineer. I provided information to the other engineers about installing the Azure tools and how to set up a WCF service, set up the configurations for the projects, moved values from the web.config to the Azure configurations, and handled a hundred other small details.

As we progressed, we put the services and changes into staging for QA to test them while we were working on the next set of products. 33 days in, we migrated the major bits that used the SQL Azure database — the WCF services, the desktop application, the Flash player, the Silverlight application and the Integration service. It took us about a day to put all of it in production, primarily because the migration of the database took so long.

A week later, everything else was released to production. So the whole cycle took us about 40 days. We would never have been able to do this migration without a great team of people really pulling together and making it happen.

Of course, just because everything went into production, it didn’t mean we were finished. We had to deal with any problems that came up, right? Most of these were small and easily fixed, but we did hit one that caused us some grief: SQL Azure connection problems.

SQL Azure Connection Management

We had done load testing, but hadn’t seen any problems with SQL Azure. After going into production, the trace logs were full of exceptions from trying to open connections to the database or execute stored procedures. I talked to Microsoft about the issues, and they recommended putting in “exponential retry code”. (To be honest, I was a little disturbed that they had an official phrase for it.) They sent me to their connection management article to explain why.

Exponential retry code means if the call to SQL Azure fails, you call again immediately. Then if the second call fails, you wait a few seconds and try again. If it fails again, you wait longer and try again, etc. They do have a framework that they recommend you use, but I wasn’t aware of it at the time. I put retry code in using the brute force method – I added retry code to all of our services that call SQL Server. This helped the problem, but not enough. After I made myself a squeaky wheel, Microsoft assigned someone from the SQL Azure team to me to look at the problem.
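Here is roughly what that retry logic looks like, sketched in Python rather than the .NET we actually used, and not Microsoft’s framework. The function name, the five-attempt limit, and the one-second base delay are assumptions for illustration.

```python
import time


def call_with_retry(operation, max_attempts=5, base_delay=1.0,
                    is_transient=lambda exc: True, sleep=time.sleep):
    """Call `operation`, retrying on failure: the first retry is
    immediate, then the wait doubles (base_delay, 2x, 4x, ...)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Give up on the last attempt or on errors not worth retrying.
            if attempt == max_attempts or not is_transient(exc):
                raise
            # After failure 1 wait 0; after failure n >= 2 wait base * 2**(n - 2).
            delay = 0.0 if attempt == 1 else base_delay * 2 ** (attempt - 2)
            sleep(delay)
```

In real use, `is_transient` should only accept connection-type failures. Retrying an error like “database has reached its size quota” just delays the bad news.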

Microsoft took copies of our code, our database and our trace logging, and went off to ruminate on it. They came back about a month later and said, “It’s not you, it’s us.” I told them I wanted to break up. (Ha!) They were handling the case of large databases and large resource requirements – they had throttling and that sort of thing set up – but they didn’t have any minimum resource levels, and hadn’t given a great deal of thought to companies like ours with services that only connect periodically. Basically, we were being kept out of the playground by the bullies having all the fun.

Over the following months, we saw huge improvements in SQL Azure performance. A year and a half later, we rarely see connection problems, and when we do, they usually succeed on the first retry.

What was the final outcome of the migration?

Overall, the migration took us about two months. But in the interest of full disclosure, I have to admit it didn’t really take two months. I worked over 700 hours of overtime in those two months. I worked 9 a.m. to 2 a.m. pretty much every day, splitting my time between doing the programming, providing architecture advice and programming help to the other developers, and managing the project. It was very intense, and a lot of fun, and the outcome gave me great satisfaction.

I was really interested to see what our costs would be after doing the migration. We had sized all of our instances to match the servers we had, which was too large, but we figured it was better than too small. After the dust settled, we found that the cost of all of our services, databases, etc., was reduced by about 85% when we moved from a traditional IT environment to running on Windows Azure.

Over time, I added performance indicators and sized our instances more appropriately. Even after adding more hosted services, we are now paying 90% less than we used to pay for actual servers and hosting by running in the cloud.

Windows Azure is a great platform for startups because you can start with minimal cost and ramp up as your company expands. You used to have to buy the hardware to support your Oprah moment even though you didn’t need that much resource to start with, but this is no longer necessary. (See current Azure pricing here.)

An additional benefit that I didn’t foresee was the simplification of deployment for our applications. We used to manually deploy changes to the web services and applications out to all of the servers whenever we had a change. Manual upgrades are always fraught with peril, and it took time to make sure the right versions were in the right places. Now the build engineer just publishes the services to the staging instances, and then we VIP swap everything at once.

And here’s one of the best benefits for me personally – Since we went into production a year and a half ago, Operations and Engineering haven’t been called in the middle of the night or on a weekend even once because of a problem with the services and applications running on Windows Azure. And we know we can quickly and easily scale our infrastructure, even though Oprah has retired.

Note: GoldMail is now doing business as PointAcross. 3/7/2014