Diagnostics logging and structured logging

This post is going to be about one of my favorite things: diagnostics logging. When I worked for a startup, I was responsible for the entire backend for all of the products. This consisted of a lot of WCF services running in Web Roles, an Azure SQL Database, and some Worker Roles. Out of habit, I always include a lot of trace logging. I thought maybe I had overdone it in this case, since I was tracing the entry point of every method in the WCF services called by their clients, along with all of the variables and some additional information. But then I went to a talk by the Halo group, heard how much logging they do, and realized I'm an amateur in comparison!

Diagnostics logging in action

Included in all of my tracing was the user GUID, which allowed us to pull together a picture of what a particular customer was doing: you could see when he logged in, when he called the backend, and what for. This was very helpful when a customer had a problem. Additionally, the back end was largely developed when we migrated from a traditional hosting center to Azure, and it was .NET 2.0 ASMX web services. This was back in 2010, before Azure was mainstream, and we didn't have the capabilities we have today to attach the Visual Studio debugger to a service running in the cloud. If you wanted to know what was going on, you added trace logging.
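
To make that concrete, here is a minimal sketch of the kind of tracing I mean (the class and method names are made up, not our actual code): every line carries the user GUID, so you can search the logs for one customer.

    using System;
    using System.Diagnostics;

    public class ProjectService
    {
        public void ReloadProject(Guid userId, string projectId)
        {
            // Every trace line includes the user GUID so one customer's
            // activity can be pulled out of millions of rows later.
            Trace.TraceInformation("ReloadProject entered: user={0}, project={1}",
                userId, projectId);
            try
            {
                // ... the actual work would go here ...
                Trace.TraceInformation("ReloadProject succeeded: user={0}, project={1}",
                    userId, projectId);
            }
            catch (Exception ex)
            {
                Trace.TraceError("ReloadProject failed: user={0}, project={1}, error={2}",
                    userId, projectId, ex);
                throw;
            }
        }
    }

In a Web Role or Worker Role with the Azure diagnostics trace listener configured, lines like these end up in the diagnostics logs table I talk about below.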

For example, we had a worker role that pulled requests from a queue in order to resize the customer's photos into mobile-sized pictures. This role would read the queue, pull messages off, deconstruct them into whatever structure was needed, download the pictures involved, resize them, and then upload them again. Every step was a failure point. If you have a hundred pictures, that's 100 downloads, 100 resizes, and 100 uploads, plus deconstructing the message itself: what if it was in an invalid format and unrecognizable? I put trace logging throughout the code, so if something happened, I would know exactly where it went wrong and why.
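
Here's a simplified sketch of what that loop looked like, assuming the classic Azure storage SDK for the queue. ResizeRequest, DownloadPicture, Resize, and UploadPicture are hypothetical stand-ins for the real message format and image work; the point is the trace call at every failure point.

    using System;
    using System.Diagnostics;
    using Microsoft.WindowsAzure.Storage.Queue;

    public class ResizeWorker
    {
        public void ProcessNextMessage(CloudQueue queue)
        {
            CloudQueueMessage message = queue.GetMessage();
            if (message == null) return; // nothing queued right now

            Trace.TraceInformation("Dequeued message {0}", message.Id);

            ResizeRequest request;
            try
            {
                request = ResizeRequest.Parse(message.AsString);
            }
            catch (FormatException ex)
            {
                // Unrecognizable message: log it and delete it rather than
                // letting it poison the queue forever.
                Trace.TraceError("Invalid message {0}: {1}", message.Id, ex.Message);
                queue.DeleteMessage(message);
                return;
            }

            foreach (string pictureUrl in request.PictureUrls)
            {
                try
                {
                    Trace.TraceInformation("Downloading {0}", pictureUrl);
                    byte[] original = DownloadPicture(pictureUrl);

                    Trace.TraceInformation("Resizing {0}", pictureUrl);
                    byte[] resized = Resize(original);

                    Trace.TraceInformation("Uploading {0}", pictureUrl);
                    UploadPicture(pictureUrl, resized);
                }
                catch (Exception ex)
                {
                    // The trace tells us exactly which picture and which step failed.
                    Trace.TraceError("Failed on {0}: {1}", pictureUrl, ex);
                }
            }

            queue.DeleteMessage(message);
        }

        // Hypothetical helpers standing in for the real implementations.
        private byte[] DownloadPicture(string url) { throw new NotImplementedException(); }
        private byte[] Resize(byte[] original) { throw new NotImplementedException(); }
        private void UploadPicture(string url, byte[] data) { throw new NotImplementedException(); }
    }

    public class ResizeRequest
    {
        public string[] PictureUrls;
        public static ResizeRequest Parse(string body)
        {
            // Hypothetical format; the real code deconstructed our own message type.
            return new ResizeRequest { PictureUrls = body.Split(';') };
        }
    }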

All of my trace logging went straight into the Windows Azure Diagnostics logs table. A couple of years later, I discovered ETW (Event Tracing for Windows) and SLAB, the Semantic Logging Application Block. I would have rewritten all of our logging to use them, but at a startup, that's not a luxury you have! It's definitely something I would recommend and use going forward.

What’s wrong with plain ordinary trace logging

The problem with ordinary trace logging is that you create millions of rows of unstructured data, each containing the identifying information for the row (like timestamp and partition key) and a single field for whatever message you want to convey. This made it very time-consuming and difficult to look through the logs and find what I needed. I used Cerebrata's Azure Management Studio, which has some built-in filtering by service; that helped reduce the amount of information I had to wade through, but I still had to search millions of records for the user GUID to find out what was going on.

Some services generated so much logging that you had to be very determined to find what you wanted; one example was the service that continuously polled our CRM system for changes and, when it found any, migrated them to the SQL Server database. Aside from troubleshooting, it would have been really helpful to have some kind of structure to that logging, so we could gather the information, consolidate it, and learn something from it. But all you have with Windows Azure Diagnostics is a single string field.

Rather than a single field that says something like “user [GUID] reloaded project abc”, how much more useful would it be to have fields you can query directly: { user: ”guid”, project: ”abc”, action: ”reload” }? This would be possible if all the logging for the services used structured logging and was stored in Azure Table Storage. It would have allowed us to query the table for specific values of user GUID, project, and action. Think of the metrics you could gather: how many customers are accessing project “abc”, how many are reloading the project versus abandoning it and creating a new one, and so on.
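
As a sketch, here's what that event could look like as a strongly typed ETW event, which is the model SLAB builds on (the class and event names here are illustrative, not a real API from the post):

    using System.Diagnostics.Tracing;

    public sealed class MyAppEventSource : EventSource
    {
        public static readonly MyAppEventSource Log = new MyAppEventSource();

        [Event(1, Level = EventLevel.Informational)]
        public void ProjectAction(string user, string project, string action)
        {
            // Each argument becomes a named payload field,
            // not one big concatenated message string.
            WriteEvent(1, user, project, action);
        }
    }

A listener such as SLAB's Azure table sink can then write user, project, and action as separate columns, which is exactly what makes the queries above possible.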

This is what structured logging is all about – making it easier to consume logs.

The problem is that structured logging is a completely different paradigm; you don't just put in random trace statements whenever you want to. You have to think about what you want to log, where you want to log it, and how. How do you want to query the data? Do you want to be able to retrieve all database retry errors together, all method entry points together, all data for a specific user GUID? These are the kinds of things you have to think about up front.
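
One way to bake those query axes into the events themselves is with ETW keywords, so that related events (database retries, method entry points) can be filtered as a group. This is just a sketch with illustrative names:

    using System.Diagnostics.Tracing;

    public sealed class BackendEventSource : EventSource
    {
        public static readonly BackendEventSource Log = new BackendEventSource();

        // Keywords let a listener subscribe to whole categories of events.
        public class Keywords
        {
            public const EventKeywords Database = (EventKeywords)1;
            public const EventKeywords MethodEntry = (EventKeywords)2;
        }

        [Event(1, Level = EventLevel.Warning, Keywords = Keywords.Database)]
        public void DatabaseRetry(string user, int attempt, string error)
        {
            WriteEvent(1, user, attempt, error);
        }

        [Event(2, Level = EventLevel.Verbose, Keywords = Keywords.MethodEntry)]
        public void MethodEntered(string user, string method)
        {
            WriteEvent(2, user, method);
        }
    }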

Summary

This post talked about the practice of diagnostics logging, why structured logging is a good idea, and how it can help you with troubleshooting and data mining. In my next post, I will talk specifically about the features of ETW and SLAB.
