To JSON or not to XML, that is the question!

Monday February 15th, 2016

In today’s broad and diverse analytics environment, being able to work effectively with multi-structured data is the key to unlocking new value and insights. In an ideal world we would be able to direct data assets onto the best platforms and tools available to process them, perform our analysis and deliver the insights to exactly where they are needed. However, a pragmatic approach is the reality for many, as the analytics ecosystem within the organisation is still maturing and evolving, with only parts of it available in a production capacity.

An example of this pragmatism is making use of existing data assets in your Enterprise Data Warehouse (EDW), in particular XML data. XML has been around for a long time, but has fallen out of favour in recent times, being effectively replaced by JSON. New analytics tools and platforms tend to be much more JSON friendly, and XML remains complex to work with.

Within the Teradata Database, there is currently no built-in direct conversion from XML to JSON. However, a feature introduced in version 15.0 offers us a path to follow: the table operator framework has been extended to support scripting languages in-database. The languages supported are Ruby, Perl, R, Python and shell scripts. This offers us some intriguing capabilities, as we can code custom data conversions and transformations using these languages and have them executed in parallel.

So let’s see how we can use the script feature to convert data from XML to JSON.   I have chosen Perl for this example, as I personally have experience with it, and it has a variety of modules available for handling XML and JSON data.

Tip: The Teradata Express VMs won’t have the add-on modules pre-installed. If you are interested in the details of getting this done, please see the Appendix section. If you are trying this on a real Teradata system, talk to your friendly Teradata Customer Services representative to organize installation.

I am going to use the XML::XML2JSON module.

To test this on a Teradata system, I will use XML data which is readily available: the XML explain plan.

The Teradata XML explain plan data is good for testing since it’s large, complex and “ugly”, in that it is split into multiple lines when captured. For example:

[Screenshot: one XML explain plan split across several XMLTextInfo rows]
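To see the effect yourself, a query along these lines against the DBC Query Log XML view shows one plan spread across several rows (the QueryID value here is just an example):

SELECT QueryID, XMLRowNo, XMLTextInfo
FROM DBC.QryLogXMLV              -- the DBC Query Log XML view
WHERE QueryID = 123456789012345  -- any captured query id
ORDER BY XMLRowNo;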

Since it’s split into multiple lines, we need to ensure that the different parts of the XML record are given to the script in the correct order, so that we can re-assemble the entire XML message. Note that the encoding is UTF-16, so we will ensure the data is processed using UNICODE, to preserve as much content as possible.

The script accepts all input on standard input (STDIN), reading a line at a time. Note that with XML, a single document will span many lines of input, so we will need to accumulate the lines and build up the document until we have the complete XML document to process.

We will also need an easy way to determine when a new record has been given to us, so that we can start a new XML document. In the query I will select the literal “A NEW LINE” to do this.

Here is the Perl code to perform the data conversion:

[Screenshot: the Perl conversion script]
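A minimal sketch of that script, following the logic described below, looks like this (the encoding layers, regular expressions and variable names are illustrative assumptions rather than the exact original code):

#!/usr/bin/perl
# Sketch of the XML-to-JSON conversion script.
use strict;
use warnings;
use XML::XML2JSON;

# Rows arrive on STDIN; the DBQL XML is stored as UTF-16, so make sure an
# appropriate Unicode layer is in place (UTF-8 is shown here as an assumption).
binmode( STDIN,  ':encoding(UTF-8)' );
binmode( STDOUT, ':encoding(UTF-8)' );

my $xml2json = XML::XML2JSON->new();
my $full_xml = '';

while ( my $line = <STDIN> ) {

    # "A NEW LINE" marks the start of a new input record; the XML content is
    # everything after the marker (columns are tab-delimited by default).
    if ( $line =~ /A NEW LINE/ ) {
        $line =~ s/^.*A NEW LINE\t//;
    }

    # Remove the trailing newline so the pieces can be stitched back together.
    chomp $line;

    # The XML declaration marks the start of a new document, so reset.
    $full_xml = '' if $line =~ /<\?xml/;

    $full_xml .= $line;

    # The closing tag marks the end of the explain plan document.
    if ( $line =~ m{</QryPlanXML>} ) {
        my $json = '';
        # Wrap the conversion in eval so bad XML yields empty JSON, not a crash.
        eval { $json = $xml2json->convert($full_xml); };
        print "$json\n";
        $full_xml = '';
    }
}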

This code loops through the lines read from standard input. If a line contains “A NEW LINE” we know it’s the start of a new input record, so we take the XML content of the line, which is everything after the “A NEW LINE” text (Teradata uses the tab character as the column delimiter by default; you can change it if required).

Tip: Even though we don’t use the id_column and row_number columns supplied to the script, we still need them to ensure Teradata sends the rows to the correct AMPs in the right order.

We next “chomp” the line, which removes any trailing newline characters. This is required because Teradata splits the XML into multiple records, which we need to “stitch” back together.

If the line contains the string “<?xml” we know it’s a new XML document, so we reset the full_xml variable.

When we see a line with the string “</QryPlanXML>” we know it’s the end of the XML document, so we can go ahead and convert it into JSON. We have the JSON conversion wrapped in an “eval” statement to catch errors, so any XML failing the conversion will result in empty JSON data.

Now that we have the code to do the conversion, we will install the script into Teradata.

Tip: The detailed instructions are found in the Teradata SQL Functions, Operators, Expressions and Predicates manual, which you can download or browse online at www.info.teradata.com.

I created a database named “xml_to_json” to store my script and output objects. The commands to install the script are:

[Screenshot: the commands used to install the script]
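As a sketch, assuming the script file has been copied to /tmp on the node (the path, database and file names are illustrative; see the manual referenced below for the full syntax), the installation looks something like this:

-- the installing user needs this right on the target database
GRANT CREATE EXTERNAL PROCEDURE ON xml_to_json TO DBC;

DATABASE xml_to_json;

-- register the script; 'sz' indicates a text file located on the server
CALL SYSUIF.INSTALL_FILE('xml_to_json', 'xml_to_json.pl', 'sz!/tmp/xml_to_json.pl');

-- tell the session where to find installed script files when running SCRIPT
SET SESSION SEARCHUIFDBPATH = xml_to_json;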

Specifically, the user (DBC in my case) needs CREATE EXTERNAL PROCEDURE access to the database to install the script file. One quirk is that having the script code on the client resulted in an error stating that “source code on the client is not supported by the driver” (I was using ODBC), so there appear to be some restrictions in place. Place your script in a directory that the server (i.e. the Teradata Node) can access.

If you make changes to the script, you need to reload it, which can be done with the replace file function:

[Screenshot: the replace file command]
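Again as a sketch, with the same illustrative names and path (check the manual referenced below for the exact arguments, in particular the trailing flag, which controls how the replaced copy is handled):

CALL SYSUIF.REPLACE_FILE('xml_to_json', 'xml_to_json.pl', 'sz!/tmp/xml_to_json.pl', 0);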

All messages output by executing the script are placed into a file on the Teradata Node(s).  This file is:  /var/opt/teradata/tdtemp/uiflib/scriptlog

Tip: If you write to standard error (in Perl: print STDERR "some message…";) the output also gets placed into scriptlog, which allows you to add debug messages whilst fine-tuning your code.

To execute the script, we use the new SCRIPT syntax. In my example the converted JSON output is placed into a new table, so we only have to do the parsing once and can re-use the answer set many times for analytics.

[Screenshot: the SCRIPT query that performs the conversion]
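Its shape, as a sketch with an illustrative target table name and my test date filter (check the manual referenced above for the exact SCRIPT operator syntax), is roughly:

CREATE TABLE xml_to_json.explain_json AS (
    SELECT CAST(converted.json_data AS JSON) AS json_data
    FROM SCRIPT(
        ON ( SELECT QueryID,
                    XMLRowNo,
                    'A NEW LINE' AS new_rec_marker,
                    XMLTextInfo
             FROM DBC.QryLogXMLV
             WHERE CAST(CollectTimeStamp AS DATE) = DATE '2016-02-02' )  -- illustrative date filter
        PARTITION BY QueryID
        ORDER BY QueryID, XMLRowNo ASC
        SCRIPT_COMMAND('perl ./xml_to_json/xml_to_json.pl')
        RETURNS ('json_data CLOB')
    ) AS converted
) WITH DATA;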

Firstly we will concentrate on the FROM clause, which selects data from the output of the SCRIPT operator. The ON section defines the data input, in our case a SELECT statement reading data from the DBC Query Log XML view for the 2nd of February 2016. The select list contains QueryID, XMLRowNo, the literal “A NEW LINE” and finally XMLTextInfo (i.e. the XML data). The QueryID and XMLRowNo columns are not required by our Perl script, however we must have them in the select list to use them for partitioning and ordering.

The PARTITION BY QueryID ORDER BY QueryID, XMLRowNo ASC clause is needed to ensure our records are processed correctly in parallel.  Records are processed by every AMP on the system, so the PARTITION BY clause ensures that rows for the same Query ID are processed by the same AMP.  The ORDER BY clause is needed to present the multiple parts of the xml document in the correct order (ie: part 1, part 2, part 3).

The SCRIPT_COMMAND clause tells Teradata how to execute the script within a shell. Teradata dynamically copies the script into the file system on the Teradata Node, with the script location being ./<database name>/<script file name>.

The RETURNS clause defines the data being returned by the script, in our case a single column “json_data” with a CLOB datatype.

Finally, the outermost SELECT performs a CAST of json_data into the JSON data type; notice that we can immediately query the resulting JSON, using the new JSON dot notation to access the key/value pairs:

[Screenshot: querying the JSON result using dot notation]
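Purely as an illustration of the dot notation over the new table (the top-level key name assumes the root element of the explain plan is QryPlanXML; deeper keys depend on how XML2JSON structures the converted document):

SELECT ej.json_data.QryPlanXML   -- follow further keys with additional dots
FROM xml_to_json.explain_json AS ej;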

After this query executes successfully, we have an output table with the query explain plans converted into JSON.

We can now query this data using the Teradata 15 JSON features, for example:

[Screenshot: an analysis query using the Teradata 15 JSON features]
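As an indication of the kind of query involved (JSONExtractValue is one of the Teradata 15 JSON methods; the JSONPath expression here is illustrative):

SELECT ej.json_data.JSONExtractValue('$..StatementType') AS statement_type,
       COUNT(*) AS query_count
FROM xml_to_json.explain_json AS ej
GROUP BY 1
ORDER BY 2 DESC;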

The output for my test data is:

[Screenshot: query output showing counts by StatementType, including 791 rows with no match]

Notice that I have 791 records which didn’t have a match for StatementType. This warrants further investigation, as it could be an indication that the script needs further refinement (i.e. it may be that some XML is failing to be parsed).

[Screenshot: follow-up query and output for the unmatched queries]

A subsequent query shows that most of these queries are “EXEC <macro>” statements, which makes sense, as the query plan for a macro is not generated at execution time; rather, it is generated and stored when the macro is created. We do expect some XML explain plans to fail parsing, as the accuracy of captured DBQL data is sacrificed in some circumstances, for example when the system is in flow control, but these should be relatively few in number.

Having data in JSON format provides convenient querying functionality that is simple and easy for business users to understand. It also enables the data to be consumed and analysed by the many current- (and next-) generation tools that support JSON natively. Combined with Teradata REST services, we have a complete set of functionality to interact with the broader data ecosystem of today’s modern enterprises.

The introduction of the scripting functionality within Teradata is very exciting. The possibilities to understand, discover and extract  value from your existing data assets are broader than ever before.

Understanding the data that is available within your analytics ecosystem today is the key to making informed decisions about where that data, the processing of that data, and the analytic capability should be in the future. It’s now more important than ever to be making data-driven decisions, given the multitude and complexity of the options available in our big data world.

Appendix:  Installing Perl Libraries on a Teradata Express VM

Firstly, the Teradata Express VMs won’t have custom libraries installed, so that needs to be done. This example is for the Teradata 15.10 VM running SLES 11.1; other versions may require different steps to get the install done.

First, we need the VM to talk to the internet:

From the VMWare player, access the virtual machine settings: Player -> Manage -> Virtual Machine Settings…

Click on the “Network Adapter” entry and enable “Bridged” mode as shown below.

[Screenshot: VMware Player network adapter settings with Bridged mode enabled]

Restart/Boot the VM.  From the command prompt in the VM, issue: shutdown -i6 -g0 -y

When the VM has booted again, login as root and execute the command: netstat -rn

Note the IP address of your gateway (usually your router) for destination “0.0.0.0”.

[Screenshot: netstat -rn output showing the gateway for destination 0.0.0.0]

Edit /etc/resolv.conf, comment out any existing lines (put “#” in front of them), and add the following line: nameserver <gateway ip address>

You should now be able to access the internet from the VM.

I highly recommend you use “cpan” to install libraries, as it resolves all dependencies and downloads and installs everything for you. To use cpan:

cpan

To install the library we need:

install XML::XML2JSON

Take the default answers, select “yes” for everything, and after a few minutes everything should be installed.

One quirk I did find was that the library files didn’t have the correct permissions set. Fix this in one fell swoop with:

chmod -R a+rx /usr/lib/perl5/*

This just adds read and execute permissions to all files in your perl libraries.

The XML::XML2JSON module also, for some unknown reason, raises warning messages when it loads the JSON module to perform a conversion. This gets written to standard error, so it ends up in the scriptlog file, one message per XML document processed. To fix this, I commented out that line in the Perl module:

vi /usr/lib/perl5/site_perl/5.10.0/XML/XML2JSON.pm

Change this line:

warn "loaded module: $Self->{_loaded_module}";

To:

#warn "loaded module: $Self->{_loaded_module}";

This will disable the warning, reducing the risk of scriptlog getting large in size and causing problems.  Remember to repeat this change if the module is upgraded.

 

Nathan Green is a Senior Ecosystem Architect and has been working for Teradata for more than 23 years. In this role, Nathan is focused on providing trusted advice to customers on how to effectively build a diverse ecosystem for analytics, given the rapid change and evolution in tools and technologies within the analytics landscape.

Nathan is recognized as one of the Architecture leaders within Teradata and is involved with helping Teradata evolve to continue to innovate and provide thought leadership to our customers in this big data world.

Prior to joining Teradata, Nathan studied at Monash University (Caulfield Chisolm campus), completing his Bachelor of Computing (Information Technology) there in 1991. After graduating he worked with a small software development company, developing in C, SQL and Ingres. He was then employed by NCR as a C programmer before moving into the Teradata division as an Engineer, where he has progressed through many varied roles since.

 
