Parsing JSON Data Sources using SSIS

json_ssisSQL Server Integration Services or SSIS has long been the go-to tool in the BI developer’s toolkit. Any time a customer or co-worker offers up a CSV file with the proper formatting and delimiters, one can quickly turn it into a workable data set. However, not all flat-file data sources have the clean structure that we have come to expect. As the popularity of web services increase, we are seeing more and more JSON encoded result sets. SSIS was not natively built to handle JSON objects since they are not strictly tabular and can contain nested objects (one of the features that makes JSON so appealing).

In this article I will demonstrate how you can take a JSON data source and use it in SSIS with a few lines of code in a Script Component.  This tutorial is for the 2012 version of SQL Server Integration Services but can likely be used in preceding versions the same way.

Getting the Dataset ready
The following steps are assuming that you already have the JSON file stored locally.  The example file I am using is from the Yelp dataset challenge.  Yelp has kindly made available millions of customer reviews in JSON format to the public for purposes of data analytics.  Since the reviews file is over 2 GB in size I had to use a much smaller sample to ensure I had things working.  In order to do that I copied it over to my Linux box and ran the head command:

head --lines 1000 yelp_academic_dataset_review.json > yelp_academic_dataset_review_short.json

The file contains one review per line in the following JSON object format:

{
	"votes": {
			  "funny": 0, 
			  "useful": 0, 
			  "cool": 0
			}, 
	"user_id": "r1Uk-h_bddXb2ScCetoMPA", 
	"review_id": "PY6nIMMLP2pqARlbIOJuRA", 
	"stars": 5, 
	"date": "2014-05-01", 
	"text": "My experience here has been consistently great and memorable.", 
	"type": "review", 
	"business_id": "mVHrayjG3uZ_RLHkLj-AMg"
}

Note that the “votes” property has its own structure with “funny”, “useful”, and “cool” vote types nested within? This is an example of how JSON can contain many levels of object hierarchies. This normally can create quite an issue for tools such as SSIS that require a tablular format to work with. In the following section we will cover how to overcome this.

Creating the Connection in SSIS

In SSIS, create a new package and drag a Data Flow component onto the control flow canvas. Double-click the data flow component to edit. Now that we are in the data flow it is time to create a new connection. At the bottom of the canvas, right-click in the Data Connections area and select “New Flat File Connection…”. Browse to the location where the yelp reviews JSON file is stored. Then configure the connection in the following way:

General Tab of the Connection Manager

Before closing the dialog, select the Advanced tab and configure like below.  It is important to note that only one column is selected and the data type is “text stream [DT_TEXT]”.
ConnectionManager2

On the preview tab it should clearly look like each json object is on its own line starting with a curly brace.  The row delimiter should be {LF}.
ConnectionManager3

Once the connection is made in the connection manager section, you can now add it as a data source.  To do this, open the SSIS Tools menu and drag a “Flat File Data Source” component onto the data flow canvas.  Choose the connection that was made in the previous step and check the box for what should be the only available column.

DataFlowSourceComponent

Parsing JSON using the Script Component

Now that we have a pipeline of JSON text coming in, we need to turn it into a data set that SSIS can work with.  This can be accomplished with just a few lines of C# code.  Drag a “Script Component” onto the data flow canvas.  Once dropped, you will be prompted to choose how this component will behave.  Choose the “Transformation” option.  Then connect the output of the flat file data source to it.  Before going into the code editor we need to set up the inputs and outputs so we can refer to them in our code.  Double-click the script component and select the input page.  Check the flat file output column.  The usage of this should be read-only.

ScriptTransformationInputs

On the Inputs and Outputs page, choose “Output 0” in the tree.  The “Add Column” button should now be enabled.  Add the columns with the following data types:

  • reviewID – string [DT_STR] (50)
  • userID – string [DT_STR] (50)
  • reviewStars – four-byte signed integer [DT_I4]
  • reviewDate – date [DT_DATE]
  • reviewType – string [DT_STR] (50)
  • businessID – string [DT_STR] (50)
  • reviewText – string [DT_STR] (8000)
  • voteFunny – four-byte signed integer [DT_I4]
  • voteCool – four-byte signed integer [DT_I4]
  • voteUseful – four-byte signed integer [DT_I4]

ScriptTransformationOutputs

Now that everything is set up it is time to write the code!  On the Script page, click the “Edit Script” button at the bottom of the dialog.  Once in the script editor, one important requirement is to add a reference.  In the “Project” menu, select “Add Reference”. On the “Framework” page, select “System.Web.Extensions”. This will allow us to use the JavaScript Deserialize method.

system_web_extensions_reference

The goal of the Script transformation is to take the JSON (viewed as raw text by SSIS) and transform it into metadata that SSIS can understand and work with. So to do this we will need to create 2 new classes to assign the data as it comes into the transformation.

The Votes Class

This class will hold the “votes” data. This is actually nested within the parent “review” object in the JSON sting.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace SC_8e4c7c591ab04bccabdbd7222237ed5d
{
    class Votes
    {
        public int funny { get; set; }
        public int useful { get; set; }
        public int cool { get; set; }
    }
}

The Yelp Review Class

This class holds the review data. Note that the votes property is of the “Votes” class we created before.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace SC_8e4c7c591ab04bccabdbd7222237ed5d
{
    class YelpReview
    {
        public Votes votes { get; set; }
        public string user_id { get; set; }
        public string review_id { get; set; }
        public int stars { get; set; }
        public DateTime date { get; set; }
        public string text { get; set; }
        public string type { get; set; }
        public string business_id { get; set; }
    }
}

Back on the main script page, we will need to add 2 references. The following can be added to the top of the script file (You may need to expand the #references section to see them.

using System.Web.Script.Serialization;
using Microsoft.SqlServer.Dts.Pipeline;

One additional item that needs to be added is the namespace. Since the 2 classes created earlier were created under the SC_8e4c7c591ab04bccabdbd7222237ed5d namespace, the main script needs to be as well.

namespace SC_8e4c7c591ab04bccabdbd7222237ed5d // Add this to see other classes

The script component in “Transformation” mode has a method that is fired for each row that passes through the pipeline. This method is called inputName_ProcessInputRow(inputNameBuffer Row) where inputName in this case is “input0”. Within this method we turn the text blob into a deserialized data object:

JavaScriptSerializer js = new JavaScriptSerializer();

// Give the input column a variable to make it easier to reference.
BlobColumn combinedColumn = Row.Column0;
        
// Convert from blob to string
string reviewConverted = System.Text.Encoding.ASCII.GetString(combinedColumn.GetBlobData(0, Convert.ToInt32(combinedColumn.Length)));
        
// Deserialize the string
YelpReview yelpReview = js.Deserialize<YelpReview>(reviewConverted);

And finally we can assign the output columns to the “YelpReview” object properties.

// Assign values to output columns
Row.reviewDate = yelpReview.date;
Row.reviewStars = yelpReview.stars;
Row.reviewID = yelpReview.review_id;
Row.reviewText = yelpReview.text;
Row.userID = yelpReview.user_id;
Row.voteCool = yelpReview.votes.cool;
Row.voteFunny = yelpReview.votes.funny;
Row.voteUseful = yelpReview.votes.useful;

And here is the final script!

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.Web.Script.Serialization;
using Microsoft.SqlServer.Dts.Pipeline;

namespace SC_8e4c7c591ab04bccabdbd7222237ed5d // Add this to see other classes
{
    ///<summary>
    /// This is the class to which to add your code.  Do not change the name, attributes, or parent
    /// of this class.
    /// </summary>


    [Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
    public class ScriptMain : UserComponent
    {

        ///<summary>
        /// This method is called once for every row that passes through the component from Input0.
        /// </summary>


        ///<param name="Row">The row that is currently passing through the component</param>;
        public override void Input0_ProcessInputRow(Input0Buffer Row)
        {     
            JavaScriptSerializer js = new JavaScriptSerializer();

            // Give the input column a variable to make it easier to reference.
            BlobColumn combinedColumn = Row.Column0;
        
            // Convert from blob to string
            string reviewConverted = System.Text.Encoding.ASCII.GetString(combinedColumn.GetBlobData(0, Convert.ToInt32(combinedColumn.Length)));
        
            // Deserialize the string
            YelpReview yelpReview = js.Deserialize<YelpReview>(reviewConverted);
        
            // Assign values to output columns
            Row.reviewDate = yelpReview.date;
            Row.reviewStars = yelpReview.stars;
            Row.reviewID = yelpReview.review_id;
            Row.reviewText = yelpReview.text;
            Row.userID = yelpReview.user_id;
            Row.voteCool = yelpReview.votes.cool;
            Row.voteFunny = yelpReview.votes.funny;
            Row.voteUseful = yelpReview.votes.useful;
        }
  }
}

Test the code by selecting “Build” -> “Build Solution” from the main menu. If successful, close the Script Editor. To view the data coming from the transformation, you can add a component that doesn’t require a lot of configuration like a “Multicast” to the canvas. Connect the output pipeline from the Script task to the new component and right-click the pipeline to “Enable Data Viewer”.
EnableDataViewer
Now we are ready to test! Save project and hit “f5” to run it through. You should see green checks at each component and the data viewer should show the columns like the following:
OutputAfterDeserialization
Now that we have a “Normal” data set created, we can add downstream destinations or other transformations to analyze the text. In fact, this will be covered in a subsequent post!

14 Comments


  1. Thanks for the post, I got a question, where add the Votes and YelpReview classes?

    Regards

    Reply

    1. Hi Miguel,

      You could either add those classes to the bottom of your script component script after the methods but before the ending namespace curly brace, or in solution explorer: Right-click the project name then click “Add” -> “Class” and give it the same name as Votes / YelpReview.

      Reply

  2. Did the yelp file contain line feeds after every json object? The file connection manager is not able to recognize json objects for the json file I’m using.

    Reply

    1. Hi Michael,

      Sorry for the late reply….

      As I remember, I didn’t do anything special to the file as for the line breaks. They should be detected by the connection manager as {LF} under “Misc” properties section (Advanced page).

      Reply

  3. Great post. I have managed to implement this within my project. Just would like to know. How can you use this method for JSON arrays?

    Reply

    1. Hi Neal,

      I have only tested on files using one JSON object per line. One thing you could do is run the arrays through a script component first and split them into a single object on each line.

      Here is an example that uses using Newtonsoft.Json.Linq:
      https://dotnetfiddle.net/uox4Vt

      Reply

  4. Very informative post . But I am getting an error at step
    BlobColumn combinedColumn = Row.Column0;

    Cannot implicitly convert type ‘string’ to ‘Microsoft.SqlServer.Dts.Pipeline.BlobColumn’

    Am I missing something or any help

    Many thanks

    Reply

  5. Sorry the error I am getting is

    System.ArgumentException: Invalid JSON primitive

    Reply

  6. Nice topic. Lately I faced the need to parse json data coming for a webservice. At first glance, I wanted to implement something like this. This article helped me to understand how things work.That is why I wanted to say THANK YOU for this content.
    However, for each interface/webservice I needed to perform tons of repetitive tasks, so I searched the web until I found this: https://jsonsource.codeplex.com. The develoepr is a young universitary student, yet he rocks. Jsonsource basically does what you do in this article but provides an hady GUI. Have you tried that component?

    Reply

  7. Hi,
    I have inheritance of 3 classes [ unlike 2 classes shown in the example)
    When I try to refer the element of the third class
    I am getting Object reference not set for the object
    Any help

    Reply

  8. In addition I ma getting this error

    “JSON Deserialization Type is not supported for deserialization of an array”

    as there is one of the list is an array element

    Reply

  9. I am getting Object reference not set for the object

    Reply

  10. Thanks for the post.

    However, would appreciate any help on the below error code

    // Give the input column a variable to make it easier to reference.
    BlobColumn combinedColumn = Row.Column0;

    Cannot implicitly convert type ‘string’ to ‘Microsoft.SqlServer.Dts.Pipeline.BlobColumn’

    I’m new to C#, so need pointers to get this error fixed. Thanks.

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *