A surprise learning opportunity from SQL Azure

Have you ever noticed that saying something like “I never get caught speeding.” seems to just invite trouble? (No, I haven’t gotten caught. Yet.) (Crap. I can’t believe I just said that!)

In my last article about GoldMail migrating to Windows Azure, I stated one of the unforeseen benefits: “Since we went into production a year and a half ago, Operations and Engineering haven’t been called in the middle of the night or on a weekend even once because of a problem with the services and applications running on Windows Azure.”

Yep, just asking for trouble. Fewer than 10 days later, I get a call from the CEO after work hours asking if we’re having a problem, because he can’t seem to upload a GoldMail. He sent me the log file for the desktop application (the Composer); it was timing out when calling the backend (WCF Services running on Windows Azure) to update the database. Uh-oh.

I fired up Cerebrata’s Azure Diagnostics Manager and asked for the last hour of diagnostics for the Composer’s Azure service. It proceeded to pull thousands and thousands of records. And thousands more. And then some more thousands. So I stopped it and tried just the last 5 minutes instead (still thousands of records). There were repeated tracing statements like this:

[InitializeMail] GMID created = xxxxxxxxxxxx …
[CopyAGoldMail] Adding record for GMID xxxxxxxxxxxx …

(We always put the method in square brackets as the first part of all trace statements.)

Did I mention there were thousands of these statements in the logs, over and over again, each set with a different GMID? I think when all was said and done, there were hundreds of thousands of records written to the diagnostics in the time between this problem starting and somebody reporting it.

Lesson #1: Don’t have your Azure service send you an e-mail whenever there’s an error in your service. Yes, we’ve considered this, and after looking at this trace logging, I was never so glad as I was then that we didn’t implement this idea. If I ever do implement something like this, I’ll be sure to put in some kind of check so it doesn’t send me a hundred thousand e-mails. Yikes!
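For what it’s worth, the guard I have in mind doesn’t need to be fancy. Here’s a minimal sketch, assuming the alerts go out by e-mail: count the alerts in a rolling window and suppress anything over the cap. The class and names here are hypothetical, not code from our service.

using System;

public class ThrottledAlerter
{
    private readonly int _maxPerWindow;
    private readonly TimeSpan _window;
    private readonly object _lock = new object();
    private DateTime _windowStart = DateTime.UtcNow;
    private int _sentThisWindow;

    public ThrottledAlerter(int maxPerWindow, TimeSpan window)
    {
        _maxPerWindow = maxPerWindow;
        _window = window;
    }

    // Returns true if this alert is under the cap and should actually be sent.
    public bool ShouldSend()
    {
        lock (_lock)
        {
            DateTime now = DateTime.UtcNow;
            if (now - _windowStart > _window)
            {
                // New window: reset the counter.
                _windowStart = now;
                _sentThisWindow = 0;
            }
            _sentThisWindow++;
            return _sentThisWindow <= _maxPerWindow;
        }
    }
}

You’d construct it once, something like new ThrottledAlerter(10, TimeSpan.FromMinutes(5)), and only send when ShouldSend() returns true.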

Lesson #2: Windows Azure Table Storage is pretty damn performant. It was writing around 40,000 records per minute to the diagnostics.

Seeing all of these [CopyAGoldMail] errors was weird, because CopyAGoldMail is only called from a utility that I wrote for the systems engineer, and he hadn’t run it. The immediate problem was that it was obviously running in an infinite loop. We checked the Azure instances; both were running at almost 100% CPU and were very difficult to RDP into. We rebooted the instances to stop the infinite loop.

In the meantime, I looked at the code, and the problem seemed to occur when executing a stored procedure that added a record to the database. Rather than returning a value for @@identity, it was returning 0. The code wasn’t checking for this specifically, and I thought that might be causing the problem, so I decided to change the stored procedure to return an error code that the calling code explicitly checked for, until I could examine the problem more closely later.
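For the record, the caller-side half of that change looks something like this. It’s a sketch with made-up procedure and parameter names, not our actual schema; the point is that a RETURN code from a stored procedure has to be explicitly asked for and checked.

using System;
using System.Data;
using System.Data.SqlClient;

static class MailSprocCaller
{
    // Check the sproc's RETURN code explicitly, rather than assuming a
    // failed insert will surface as an exception.
    public static void AddRecordChecked(string connectionString, string title)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand("AddMailRecord", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.AddWithValue("@Title", title);

            // RETURN <n> inside the sproc comes back through this parameter.
            SqlParameter returnCode = cmd.Parameters.Add("@ReturnValue", SqlDbType.Int);
            returnCode.Direction = ParameterDirection.ReturnValue;

            conn.Open();
            cmd.ExecuteNonQuery();

            int result = (int)returnCode.Value;
            if (result != 0)
            {
                // e.g. the sproc does RETURN -1 when the insert fails
                throw new InvalidOperationException(
                    string.Format("[AddRecordChecked] Stored procedure returned error code {0}", result));
            }
        }
    }
}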

I logged into SQL Server Management Studio, opened the database, right-clicked on the stored procedure and scripted it out (drop and create), changed it, and tried to execute it. It didn’t work. I got back an error that our SQL Azure database had hit its size limit and I could not add anything. SURPRISE!

A little frantic bing-ing and one of these later:

ALTER DATABASE GoldMail MODIFY (EDITION = 'BUSINESS', MAXSIZE = 10GB)

And everything started writing to the database again.

Lesson #3: You won’t get any notification that you are going to hit the size limit of the SQL Azure database, or that you HAVE hit the size limit of your SQL Azure database.
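If you want a warning, you have to build one yourself. Querying sys.dm_db_partition_stats was the documented way to get the space used by a SQL Azure database; the rest of this is a hypothetical sketch of a watchdog you could schedule, with a made-up threshold and a placeholder connection string.

using System;
using System.Data.SqlClient;

class DatabaseSizeCheck
{
    static void Main()
    {
        // Made-up numbers for illustration: alert when the database passes
        // 80% of a 1 GB cap. Adjust to your own MAXSIZE.
        const double maxSizeMb = 1024;
        const double warnFraction = 0.8;

        using (var conn = new SqlConnection("<your SQL Azure connection string>"))
        using (var cmd = new SqlCommand(
            "SELECT SUM(reserved_page_count) * 8.0 / 1024 FROM sys.dm_db_partition_stats", conn))
        {
            conn.Open();
            double usedMb = Convert.ToDouble(cmd.ExecuteScalar());
            Console.WriteLine("Database is using {0:F0} MB.", usedMb);

            if (usedMb >= maxSizeMb * warnFraction)
            {
                // This is where you'd e-mail yourself (with a throttle!).
                Console.WriteLine("Over {0:P0} of the {1:F0} MB cap; time to bump MAXSIZE.",
                    warnFraction, maxSizeMb);
            }
        }
    }
}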

Another interesting fact: the failures started at exactly 5 p.m. PDT. Coincidence? Or did we really hit the limit at exactly 5 p.m.? Or does Azure check the limit at specific times rather than continuously?

After all was said and done, I still needed to figure out where the infinite loop was, and who the heck had run CopyAGoldMail. Every call to our service authenticates the caller, so I just needed to look backwards through the trace diagnostics and find the first entry for CopyAGoldMail. I started by retrieving trace diagnostics in 15-minute intervals from 5 p.m. forward until I found an interval with thousands and thousands of records. Then I narrowed the search to 5-minute intervals until I found where the CopyAGoldMail tracing started.
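If you ever have to do this kind of interval search yourself, it helps to know that the trace logs land in a table called WADLogsTable, where the PartitionKey encodes the event time. Here’s a sketch of the range query; it’s written against the later WindowsAzure.Storage table client rather than the SDK we were on at the time, and the date and connection string are placeholders.

using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

class WadLogSearch
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("<your diagnostics storage connection string>");
        var table = account.CreateCloudTableClient().GetTableReference("WADLogsTable");

        // Example interval: a 15-minute window; the date is a placeholder.
        // WAD event times are in UTC, so convert your local time accordingly.
        DateTime start = new DateTime(2011, 9, 2, 0, 0, 0, DateTimeKind.Utc);
        DateTime end = start.AddMinutes(15);

        // PartitionKey is "0" + the event time in ticks, zero-padded to 19
        // digits, so a lexical range scan is also a chronological one.
        string lower = "0" + start.Ticks.ToString("d19");
        string upper = "0" + end.Ticks.ToString("d19");

        string filter = TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.GreaterThanOrEqual, lower),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.LessThan, upper));

        var query = new TableQuery<DynamicTableEntity>().Where(filter);
        foreach (DynamicTableEntity entity in table.ExecuteQuery(query))
        {
            Console.WriteLine(entity.Properties["Message"].StringValue);
        }
    }
}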

What was interesting is that most of the writes to the database failed with a SQL Exception and were trace logged accordingly:

System.Data.SqlClient.SqlException (0x80131904): The database 'GoldMail' has reached its size quota. Partition or delete data, drop indexes, or consult the documentation for possible resolutions.
at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)
at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning()
at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async)
at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, DbAsyncResult result)
at System.Data.SqlClient.SqlCommand.InternalExecuteNonQuery(DbAsyncResult result, String methodName, Boolean sendToPipe)
at System.Data.SqlClient.SqlCommand.ExecuteNonQuery()
at ComposerWebRole.<snip>

But the calls to add a new record to the MAIL table did not throw exceptions. Every execution returned 0 for the @@identity value. It was a straight INSERT statement, no rocket science involved, and not much different from the INSERT statement that resulted in the aforementioned exception.

Lesson #4: Not all SQL Azure errors result in thrown exceptions.
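The defensive version of that insert looks something like this. It’s a sketch with hypothetical names, not our production code, but it shows the two habits that would have saved us here: check the value that comes back, and cap the retries.

using System;
using System.Data;
using System.Data.SqlClient;
using System.Diagnostics;

static class MailWriter
{
    // Treat a zero or NULL identity as a failure instead of trusting SQL
    // Azure to throw, and cap the retries so a persistent failure can't
    // become an infinite loop (and a few hundred thousand trace records).
    public static int AddMailRecord(string connectionString, string title)
    {
        const int maxAttempts = 3;

        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand("AddMailRecord", conn))
            {
                cmd.CommandType = CommandType.StoredProcedure;
                cmd.Parameters.AddWithValue("@Title", title);

                conn.Open();
                // The sproc ends with SELECT @@identity (or SCOPE_IDENTITY()).
                object result = cmd.ExecuteScalar();
                int newId = (result == null || result == DBNull.Value) ? 0 : Convert.ToInt32(result);

                if (newId > 0)
                    return newId;

                Trace.WriteLine(string.Format(
                    "[AddMailRecord] Insert returned {0}, attempt {1} of {2}", newId, attempt, maxAttempts));
            }
        }

        throw new InvalidOperationException("[AddMailRecord] Insert failed; giving up after retries.");
    }
}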

After tracking down the start of the CopyAGoldMail tracing, I still couldn’t determine who had called it. It looked like somebody had uploaded a GoldMail and then it had just jumped from adding a GoldMail to copying a GoldMail on its own for no reason I could discern. Both methods are included in the service contract, but there’s no overlap between them.

Sherlock Holmes said that when you have eliminated the impossible, whatever remains, however improbable, must be the truth. It’s impossible for a service to randomly execute code in a different class. What remains? The trace logging. Yes, it turned out that CopyAGoldMail and AddAGoldMail are very similar, and someone copied some of the code and didn’t change the name of the method in the trace logging. (Yes, it was me.) So it wasn’t CopyAGoldMail that had the infinite loop, which is why I couldn’t find it there. It was in AddAGoldMail. It wasn’t handling the case of @@identity coming back as 0, and it kept looping until it could get a non-zero value, which it eventually would have gotten after we increased the database size, if we hadn’t rebooted the instances first.

Lesson #5: If you’re going to rely on your trace logging when you have a problem, make sure it’s right!

Update 3/8/2014: GoldMail is now doing business as PointAcross.


5 Responses to “A surprise learning opportunity from SQL Azure”

  1. spottedmahn Says:

    Great post, thanks for sharing.

    Related to the logging and emailing, I’ve learned the same thing. The emailing of errors needs to be limited; 1,000 emails of the same error don’t help me! I like to leave the limit around 25 or so, so that when I get a quick 10+ emails I know there’s a problem without even checking an email.

    As to the method name, I utilize the following to prevent that:

    public static string GetCategory(StackFrame sFrameIn)
    {
        var ClassName = Common.GetClassName(sFrameIn);
        var MethName = Common.GetMethodName(sFrameIn);

        // Category = [ClassName].[MethodName]
        var Category = string.Format("{0}.{1}", ClassName, MethName);
        return Category;
    }

    static string GetClassName(StackFrame sFrameIn)
    {
        var MethodBase = sFrameIn.GetMethod();
        var DeclaringTypeName = MethodBase.DeclaringType.Name;
        return DeclaringTypeName;
    }

    static string GetMethodName(StackFrame sFrameIn)
    {
        var MethodBase = sFrameIn.GetMethod();
        var MethodName = MethodBase.Name;

        // prior to adding this check, the debug log for a constructor was: WcmsCommon..ctor
        // we want WcmsCommon.Constructor instead
        if (MethodName == ".ctor")
        {
            MethodName = "Constructor";
        }

        return MethodName;
    }

    public static void Trace<T>(this T ObjThis, string InMsg)
    {
        var prev = Common.GetPreviousStackFrame(new StackTrace(true));
        Common.TraceMethodNameAsCategory(InMsg, prev);
    }

    where Common.TraceMethodNameAsCategory eventually calls:

    Trace.WriteLine(TextIn, CategoryIn);

    Usage:

    someObject.TraceFormat("My Message: {0}", someValue);

    It will then log the Class and Method name automatically!

    Regards,
    Michael DePouw

    • robindotnet Says:

      Hi Michael,

      Thanks for the information. I agree about the e-mails. We used to have a Windows service that was critical to our operations, and I set it up to e-mail us when it had a problem. I didn’t put a limit on it at first, but quickly realized while testing that this was folly (after getting 200 e-mails in about 3 seconds). I added code so it would stop sending messages after the 10th one in a few minutes, thank goodness. It was always a contest to see if the CEO would notice there was a problem before the e-mails arrived. (He seems to be using the product all the time, and catches bugs that nobody else can find! Isn’t that always the way?)

      As for the class name and method name, thanks for the code. I’m sure that people will find that useful. We use the following:

      string className = System.Reflection.MethodBase.GetCurrentMethod().DeclaringType.Name;

      We use this at the top of our classes in our desktop product to get the class name. I flip-flopped back and forth over doing this in our Azure services, and finally decided not to. I think Reflection is expensive, and if the class is being called thousands of times, I’m concerned it might impact performance. Do you have any information about that?

      Thanks,
      Robin

  2. spottedmahn Says:

    Hi Robin,

    Thanks for the reply 🙂

    For us/me the concern is debugging, not performance. I want code that is quick to write and rich with debugging information. The volume is such that we haven’t seen any performance issues, and I like to log! Our volume is low, though.

    http://stackoverflow.com/questions/224232/what-is-the-cost-of-net-reflection

    http://www.west-wind.com/weblog/posts/2004/Apr/11/Net-Reflection-and-Performance

    Regards,
    Mike DePouw

    • robindotnet Says:

      Thanks for the links. For me, hardcoding the method name in the tracing statements gives me both debugging and performance. 🙂 I do a *lot* of trace logging. In fact, I sometimes lie awake at night and wonder how soon we’re going to hit that 1TB limit on our storage account, because we haven’t cleared out the diagnostics since we went live in November of 2010! After all, I now know what happens if you hit the limit on your SQL Azure database, as noted in this post. Why not learn that about storage, too?

      In the second link, he says this: “This process is slow compared to direct access of a property, field or method, but it’s hardly a show stopper if used sparingly.” If I were going to use reflection rather than hardcoding the method name, I would not be using it sparingly. Like I said, it’s one thing in the desktop application, where the user is running one instance, but for a WCF service that can be called by thousands of people simultaneously, I think it would have an impact. So for us, I’m sticking with my current method. If your method works for you, then you should stick with that, and maybe reconsider it when you have a huge volume of hits to your service. I’d do it in a heartbeat if performance wasn’t an issue.

      Robin

      • spottedmahn Says:

        Yeah, we’re not doing that much logging! If one is developing a WCF service that can be called by thousands of people simultaneously then performance is a much bigger concern. For our desktop client app it works great.

        Thanks again for sharing and conversing 🙂

        Regards,
        Mike D.
