Have you ever noticed that saying something like “I never get caught speeding” seems to just invite trouble? (No, I haven’t gotten caught. Yet.) (Crap. I can’t believe I just said that!)
In my last article about GoldMail migrating to Windows Azure, I stated one of the unforeseen benefits: “Since we went into production a year and a half ago, Operations and Engineering haven’t been called in the middle of the night or on a weekend even once because of a problem with the services and applications running on Windows Azure.”
Yep, just asking for trouble. Fewer than 10 days later, I get a call from the CEO after work hours asking if we’re having a problem, because he can’t seem to upload a GoldMail. He sent me the log file for the desktop application (the Composer); it was timing out when calling the backend (WCF Services running on Windows Azure) to update the database. Uh-oh.
I fired up Cerebrata’s Azure Diagnostics Manager and asked for the last hour of diagnostics for the Composer’s Azure service. It proceeded to pull thousands and thousands of records. And thousands more. And then some more thousands. So I stopped it and tried the last 5 minutes (still thousands of records). There were repeated tracing statements like this:
[InitializeMail] GMID created = xxxxxxxxxxxx …
[CopyAGoldMail] Adding record for GMID xxxxxxxxxxxx …
(We always put the method in square brackets as the first part of all trace statements.)
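Our services are written in C#, but the idea behind the convention can be sketched in a few lines of Python: derive the method name in the prefix from the actual caller instead of hard-coding it, so a copied-and-pasted trace call can never claim to be in the wrong method. (The `trace` helper and `AddAGoldMail` stub here are hypothetical, not our real code.)

```python
import inspect

def trace(message: str) -> str:
    """Build a trace line prefixed with the *actual* calling method's name.

    inspect.stack()[0] is this frame; [1] is whoever called trace().
    """
    caller = inspect.stack()[1].function
    return f"[{caller}] {message}"

def AddAGoldMail():
    # Even if this line was copy-pasted from another method,
    # the prefix still reports the method it actually runs in.
    return trace("Adding record for GMID xxxxxxxxxxxx ...")
```

Deriving the name at runtime costs a little performance, but it removes a whole class of copy-paste mistakes from the logs.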
Did I mention there were thousands of these statements in the logs, over and over again, each set with a different GMID? I think when all was said and done, there were hundreds of thousands of records written to the diagnostics in the time between this problem starting and somebody reporting it.
Lesson #1: Don’t have your Azure service send you an e-mail whenever there’s an error in your service. Yes, we’ve considered this, and after looking at this trace logging, I was never so glad as I was then that we didn’t implement this idea. If I ever do implement something like this, I’ll be sure to put in some kind of check so it doesn’t send me a hundred thousand e-mails. Yikes!
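If we ever do build that alerting, the "some kind of check" would probably look like a cooldown per error type: the first occurrence sends a mail, and repeats inside the window are counted but suppressed. A minimal Python sketch of the pattern (the class and the `send` callback are hypothetical, not anything we've actually built):

```python
import time

class ThrottledAlerter:
    """Send at most one alert per error key within a cooldown window,
    so an error firing 100,000 times produces a couple of e-mails, not 100,000."""

    def __init__(self, cooldown_seconds: float, send):
        self.cooldown = cooldown_seconds
        self.send = send          # callback that actually e-mails Ops
        self.last_sent = {}       # error key -> timestamp of last send

    def alert(self, key: str, body: str, now: float = None) -> bool:
        """Return True if the alert was sent, False if it was suppressed."""
        now = time.monotonic() if now is None else now
        last = self.last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False          # still inside the cooldown: swallow it
        self.last_sent[key] = now
        self.send(key, body)
        return True
```

A production version would also want to report how many occurrences were suppressed, so one e-mail can say "this happened 40,000 more times."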
Lesson #2: Windows Azure Table Storage is pretty damn performant. It was writing around 40,000 records per minute to the diagnostics.
Seeing all of these [CopyAGoldMail] errors was weird, because CopyAGoldMail is only called from a utility that I wrote for the systems engineer, and he hadn’t run it. The immediate problem was that it was obviously running in an infinite loop. We checked the Azure role, and both instances were running at almost 100% CPU and were very difficult to RDP into. We rebooted the instances to stop the infinite loop.
In the meantime, I looked at the code, and the problem seemed to occur when executing a stored procedure that added a record to the database. Rather than returning a value for @@identity, it was returning 0. The code wasn’t checking for this specifically, and I thought that might be causing the problem, so I decided to change the stored procedure to return an error code that was specifically checked for, until I could examine the problem more closely later.
I logged into SQL Server Management Studio, opened the database, right-clicked the stored procedure and scripted it out (drop and create), changed it, and tried to execute it. It didn’t work. I got back an error that our SQL Azure database had hit its size limit and I could not add anything. SURPRISE!
A little frantic bing-ing and one of these later:
ALTER DATABASE GoldMail MODIFY (EDITION = 'BUSINESS', MAXSIZE = 10GB)
And everything started writing to the database again.
Lesson #3: You won’t get any notification that you are about to hit the size limit of your SQL Azure database, or that you HAVE hit it.
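Since SQL Azure won’t warn you, you have to watch the size yourself. At the time, the usual way to read the current size was to sum reserved pages from `sys.dm_db_partition_stats` (8 KB per page); a watchdog can poll that and alert before the cap. A sketch of the idea in Python (the query and the threshold logic are illustrative assumptions, not our production code):

```python
# Common way (at the time) to read a SQL Azure database's used size:
# reserved pages * 8192 bytes per page.
SIZE_QUERY = """
SELECT SUM(reserved_page_count) * 8192
FROM sys.dm_db_partition_stats
"""

def check_size(used_bytes: int, max_bytes: int, warn_fraction: float = 0.8) -> str:
    """Classify how close the database is to its MAXSIZE cap.

    Run this on a schedule with the result of SIZE_QUERY and page Ops
    (gently, see Lesson #1) when it returns "warning".
    """
    if used_bytes >= max_bytes:
        return "full"
    if used_bytes >= warn_fraction * max_bytes:
        return "warning"
    return "ok"
```

Alerting at 80% leaves time to run the ALTER DATABASE during business hours instead of at 5 p.m. on a weeknight.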
Another interesting fact: it started at exactly 5 p.m. PDT. Coincidence? Or did we really hit the limit at exactly 5 p.m.? Or does Azure check the limit at specific times rather than continuously?
After all of that, I still needed to figure out where the infinite loop was, and who the heck had run CopyAGoldMail. Every call to our service authenticates the caller, so I just needed to look backwards through the trace diagnostics and find the first entry for CopyAGoldMail. I started by retrieving trace diagnostics in 15-minute intervals from 5 p.m. forward until I found an interval with thousands and thousands of records. Then I backed off and searched that window in 5-minute intervals until I found where the CopyAGoldMail tracing started.
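That interval-narrowing hunt is just a two-pass search: scan forward in coarse windows until the log volume spikes, then re-scan the noisy window in finer steps. A Python sketch, where `count_in(a, b)` stands in for querying the diagnostics table for the record count in a time range (all names here are hypothetical):

```python
def find_flood_start(count_in, start, end, coarse=15, fine=5, threshold=1000):
    """Find roughly when a flood of trace records began.

    count_in(a, b) -> number of trace records in the window [a, b).
    Times are in minutes; coarse/fine mirror the 15- and 5-minute
    passes described above. Returns the start of the first fine
    window whose volume exceeds the threshold, or None.
    """
    t = start
    while t < end:
        if count_in(t, t + coarse) > threshold:
            # Found the noisy coarse window; narrow it down.
            u = t
            while u < t + coarse:
                if count_in(u, u + fine) > threshold:
                    return u
                u += fine
            return t
        t += coarse
    return None
```

With ~40,000 records a minute being written, any window containing the flood clears the threshold immediately, so a handful of queries pins down the start instead of paging through hundreds of thousands of rows.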
What was interesting is that most of the writes to the database failed with a SQL Exception and were trace logged accordingly:
System.Data.SqlClient.SqlException (0x80131904): The database 'GoldMail' has reached its size quota. Partition or delete data, drop indexes, or consult the documentation for possible resolutions.
at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)
at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async)
at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, DbAsyncResult result)
at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(DbAsyncResult result, String methodName, Boolean sendToPipe)
But the calls to add a new record to the MAIL table did not throw exceptions. Every execution returned 0 for the @@identity value. It was a straight INSERT statement, no rocket science involved, and not much different from the INSERT statement that resulted in the aforementioned exception.
Lesson #4: Not all SQL Azure errors result in thrown exceptions.
After tracking down the start of the CopyAGoldMail tracing, I still couldn’t determine who had called it. It looked like somebody had uploaded a GoldMail and then it had just jumped from adding a GoldMail to copying a GoldMail on its own for no reason I could discern. Both methods are included in the service contract, but there’s no overlap between them.
Sherlock Holmes said that when you have eliminated the impossible, whatever remains, however improbable, must be the truth. It’s impossible for a service to randomly execute code in a different class. What remains? The trace logging. Yes, it turned out that CopyAGoldMail and AddAGoldMail are very similar, and someone copied some of the code and didn’t change the method name in the trace logging. (Yes, it was me.) So it wasn’t CopyAGoldMail that had the infinite loop, which is why I couldn’t find it there. It was AddAGoldMail. It wasn’t handling the case of @@identity coming back as 0, so it looped forever waiting for a non-zero value, which it eventually would have gotten once we increased the database size, had we not already rebooted the instances.
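The fix is the same in any language: treat a 0 identity as a failure, bound the retries, and raise a real error instead of spinning. A minimal Python sketch of the pattern (again, our service is C#; `execute_insert` here is a stand-in for running the stored procedure):

```python
class InsertFailed(Exception):
    """The insert kept reporting an identity of 0 and we gave up."""

def add_gold_mail(execute_insert, max_attempts=3):
    """Insert a record and return the new identity value.

    execute_insert() stands in for executing the stored procedure and
    returns the new @@identity, or 0 when the insert silently failed.
    Instead of looping until a non-zero identity comes back (the bug),
    give up after a few attempts and surface a real error the caller
    -- and the trace log -- can see.
    """
    for _ in range(max_attempts):
        new_id = execute_insert()
        if new_id != 0:
            return new_id
    raise InsertFailed(
        f"stored procedure returned identity 0 on {max_attempts} attempts"
    )
```

Had AddAGoldMail failed fast like this, the symptom would have been a clean "database full" error in the logs at 5 p.m. instead of two instances pegged at 100% CPU writing 40,000 trace records a minute.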
Lesson #5: If you’re going to rely on your trace logging when you have a problem, make sure it’s right!