Query Avro data by using Azure Data Lake Analytics

This article discusses how to query Avro data to efficiently route messages from Azure IoT Hub to Azure services. Message routing allows you to filter data using rich queries based on message properties, message body, device twin tags, and device twin properties. To learn more about the querying capabilities in Message Routing, see IoT Hub message routing query syntax.

The challenge is that when Azure IoT Hub routes messages to Azure Blob storage, it writes the content in Avro format by default. The Avro format captures both the message body and the message properties, and it isn't used for any other endpoints. Although Avro is great for data and message preservation, it's a challenge to query. By comparison, JSON or CSV format is much easier to query. IoT Hub now supports writing data to Blob storage in JSON as well as Avro.

For more information, see Azure Storage as a routing endpoint.

To address nonrelational big-data needs and formats and overcome this challenge, you can apply many of the big-data patterns for transforming and scaling data. One of them, the "pay per query" pattern, is exemplified by Azure Data Lake Analytics and is the focus of this article. Although you can easily execute the query in Hadoop or other solutions, Data Lake Analytics is often better suited for this "pay per query" approach.

There's an "extractor" for Avro in U-SQL. For more information, see U-SQL Avro example.

Query and export Avro data to a CSV file

In this section, you query Avro data and export it to a CSV file in Azure Blob storage, although you could easily place the data in other repositories or data stores.

  1. Set up Azure IoT Hub to route data to an Azure Blob storage endpoint by using a property in the message body to select messages.

    Screen capture showing the Custom endpoints tab from the Message Routing working pane for an IoT hub in the Azure portal, highlighting the Blob storage section.

    Screen capture showing the routes tab from the Message Routing working pane for an IoT hub in the Azure portal, highlighting the routing query and endpoint for a route.

    For more information on setting up routes and custom endpoints, see Create and delete routes and endpoints by using the Azure portal.
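A body-based route uses the IoT Hub routing query syntax. As a minimal sketch, a query that selects messages containing a hypothetical `sensorType` field in the body might look like this (the field name and value are placeholders for illustration):

```
$body.sensorType = 'temperature'
```

Body-based queries only match when the message's content type is `application/json` and the content encoding is UTF-8, which is why the device-side settings in the next step matter.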

  2. Ensure that your device has the encoding, content type, and needed data in either the properties or the message body, as referenced in the product documentation. When you view these attributes in Device Explorer, as shown here, you can verify that they're set correctly.

    Screenshot of the Data tab from Device Explorer, highlighting the content-type and content-encoding attributes of a message.
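To make the record shape concrete, here's a rough stdlib-only sketch of what one routed record contains once decoded from Avro. The field names follow the Avro schema used later in this article; the sample values are made up for illustration:

```python
import json

# A decoded IoT Hub record, as routed to Blob storage in Avro format.
# Field names match the Avro schema shown later in this article;
# the values here are hypothetical.
record = {
    "EnqueuedTimeUtc": "2024-01-01T00:00:00.0000000Z",
    "Properties": {"level": "info"},                  # application properties
    "SystemProperties": {
        "contentType": "application/json",            # set by the device
        "contentEncoding": "utf-8",
    },
    "Body": b'{"message": "hello", "status": "ok"}',  # raw message bytes
}

# Because the content encoding is UTF-8 and the content type is JSON,
# the body can be decoded and parsed directly.
body = json.loads(record["Body"].decode("utf-8"))
print(body["status"])  # -> ok
```

If the device doesn't set the content type and encoding, the body remains an opaque byte array and the JSON parsing in step 8 won't work.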

  3. Set up an Azure Data Lake Store instance and a Data Lake Analytics instance. Azure IoT Hub doesn't route to a Data Lake Store instance, but a Data Lake Analytics instance requires one.

    Screen capture of the working pane for a resource group in the Azure portal, highlighting a Data Lake Analytics instance and a Data Lake Store instance.

  4. In Data Lake Analytics, configure Azure Blob storage as an additional store, the same Blob storage that Azure IoT Hub routes data to.

    Screenshot of the Data sources pane from Data Lake Analytics, highlighting an Azure Storage instance as an additional data source.

  5. As discussed in the U-SQL Avro example, you need four DLL files. Upload these files to a location in your Data Lake Store instance.

    Screenshot of the Data explorer pane from Data Lake Store, highlighting four uploaded DLL files.

  6. In Visual Studio, create a U-SQL project.

    Screenshot of the New Project dialog from Visual Studio, highlighting the U-SQL Project template.

  7. Paste the content of the following script into the newly created file. Modify the three highlighted sections: your Data Lake Analytics account, the associated DLL file paths, and the correct path for your storage account.

    Screen capture of the U-SQL editor in the working pane of Visual Studio for a U-SQL script, highlighting the three sections to be modified.

    The actual U-SQL script for simple output to a CSV file:

        DROP ASSEMBLY IF EXISTS [Avro];
        CREATE ASSEMBLY [Avro] FROM @"/Assemblies/Avro/Avro.dll";
        DROP ASSEMBLY IF EXISTS [Microsoft.Analytics.Samples.Formats];
        CREATE ASSEMBLY [Microsoft.Analytics.Samples.Formats] FROM @"/Assemblies/Avro/Microsoft.Analytics.Samples.Formats.dll";
        DROP ASSEMBLY IF EXISTS [Newtonsoft.Json];
        CREATE ASSEMBLY [Newtonsoft.Json] FROM @"/Assemblies/Avro/Newtonsoft.Json.dll";
        DROP ASSEMBLY IF EXISTS [log4net];
        CREATE ASSEMBLY [log4net] FROM @"/Assemblies/Avro/log4net.dll";
    
        REFERENCE ASSEMBLY [Newtonsoft.Json];
        REFERENCE ASSEMBLY [log4net];
        REFERENCE ASSEMBLY [Avro];
        REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
    
        // Blob container storage account filenames, with any path
        DECLARE @input_file string = @"wasb://hottubrawdata@kevinsayazstorage/kevinsayIoT/{*}/{*}/{*}/{*}/{*}/{*}";
        DECLARE @output_file string = @"/output/output.csv";
    
        @rs =
        EXTRACT
        EnqueuedTimeUtc string,
        Body byte[]
        FROM @input_file
    
        USING new Microsoft.Analytics.Samples.Formats.ApacheAvro.AvroExtractor(@"
        {
            ""type"":""record"",
            ""name"":""Message"",
            ""namespace"":""Microsoft.Azure.Devices"",
            ""fields"":
           [{
                ""name"":""EnqueuedTimeUtc"",
                ""type"":""string""
            },
            {
                ""name"":""Properties"",
                ""type"":
                {
                    ""type"":""map"",
                    ""values"":""string""
                }
            },
            {
                ""name"":""SystemProperties"",
                ""type"":
                {
                    ""type"":""map"",
                    ""values"":""string""
                }
            },
            {
                ""name"":""Body"",
                ""type"":[""null"",""bytes""]
            }]
        }"
        );
    
        @cnt =
        SELECT EnqueuedTimeUtc AS time, Encoding.UTF8.GetString(Body) AS jsonmessage
        FROM @rs;
    
        OUTPUT @cnt TO @output_file USING Outputters.Text(); 
    

    It took Data Lake Analytics five minutes to run the preceding script, which was limited to 10 analytic units and processed 177 files. The result is shown in the CSV file output that's displayed in the following images:

    Screen capture of the Job View tab in the working pane of Visual Studio for the submitted U-SQL script, showing the job graph.

    Screen capture of the File Preview tab in the working pane of Visual Studio for the submitted U-SQL script, showing the output converted into rows in a comma-separated values (.csv) file.
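The six `{*}` wildcards in the `@input_file` path correspond to the default blob name format that IoT Hub uses for a storage endpoint, `{iothub}/{partition}/{YYYY}/{MM}/{DD}/{HH}/{mm}`. A quick stdlib sketch of how a blob name maps onto that pattern (the hub name and partition are placeholders):

```python
from datetime import datetime, timezone
from fnmatch import fnmatch

# Default blob name format for an IoT Hub storage endpoint:
# {iothub}/{partition}/{YYYY}/{MM}/{DD}/{HH}/{mm}
# "myhub" and partition 0 are placeholder values.
t = datetime(2024, 1, 2, 3, 4, tzinfo=timezone.utc)
blob_name = f"myhub/0/{t:%Y/%m/%d/%H/%M}"
print(blob_name)  # -> myhub/0/2024/01/02/03/04

# The six {*} wildcards in the U-SQL input path match the six
# partition and date/time segments after the hub name.
pattern = "myhub/*/*/*/*/*/*"
print(fnmatch(blob_name, pattern))  # -> True
```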

    To parse the JSON, continue to step 8.

  8. Most IoT messages are in JSON format. By adding the following lines, you can parse the JSON in the message body, which lets you add WHERE clauses and output only the data you need.

       @jsonify =
         SELECT Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple(Encoding.UTF8.GetString(Body))
           AS message FROM @rs;
    
        /*
        @cnt =
            SELECT EnqueuedTimeUtc AS time, Encoding.UTF8.GetString(Body) AS jsonmessage
            FROM @rs;
    
        OUTPUT @cnt TO @output_file USING Outputters.Text();
        */
    
        @cnt =
            SELECT message["message"] AS iotmessage,
                message["event"] AS msgevent,
                message["object"] AS msgobject,
                message["status"] AS msgstatus,
                message["host"] AS msghost
            FROM @jsonify;
    
        OUTPUT @cnt TO @output_file USING Outputters.Text();
    

    The output displays a column for each item in the SELECT command.

    Screen capture of the File Preview tab in the working pane of Visual Studio for the submitted U-SQL script, showing the queried output of a JSON file converted into a comma-separated values (.csv) file.
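The JsonTuple step above can be sketched in ordinary Python for local testing: decode each record's body, parse it as JSON, and project out the same fields. The field names come from the script above; the sample bodies are made up:

```python
import csv
import io
import json

# Sample record bodies, standing in for the Body column extracted
# from the Avro files; the values are hypothetical.
bodies = [
    b'{"message": "m1", "event": "start", "object": "pump", "status": "ok", "host": "dev01"}',
    b'{"message": "m2", "event": "stop", "object": "pump", "status": "ok", "host": "dev01"}',
]

fields = ["message", "event", "object", "status", "host"]

out = io.StringIO()
writer = csv.writer(out)
for raw in bodies:
    msg = json.loads(raw.decode("utf-8"))
    # dict.get returns None for missing keys, similar to how
    # JsonTuple yields null for absent fields.
    writer.writerow([msg.get(f) for f in fields])

print(out.getvalue())
```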

Next steps

In this article, you learned how to query Avro data to efficiently route messages from Azure IoT Hub to Azure services.