When we talk to our customers today about Teradata’s place in the enterprise, it’s always in the context of an Analytics Ecosystem. This reflects our core belief that, to unlock the full potential of analytics within an organisation, you need a variety of platforms, tools and technologies that work together to enable new questions to be answered (which is the purpose of analytics: to answer business questions). This environment to ask and answer questions must be sustainable and be able to grow (ie: scale), hence the ecosystem nomenclature. Investment in building this ecosystem should be primarily focused on enabling new capability – how can the tools that I have available allow me to answer new questions when used together?
Teradata has been steadily adding features for extensibility and investing in products to enable ecosystem platforms to process data together (eg: Querygrid and Presto). One of the cool extensibility features released in Teradata 15.0 was the table script operators, allowing us to specify a script to execute as part of a query. I have previously written a blog using these table script operators to convert XML into JSON.
Also, over the last 12 months using REST API’s to provide interfaces between applications has increased in popularity, to the point now that you expect all products by default to provide some form of REST API.
What if we put the two together? What can we enable if we use the Teradata Script Operator to perform REST API calls to an external processing service, from within an SQL Query? This would allow us to leverage the strengths of Teradata: complex joins, efficient redistribution of data across a scalable parallel data processing platform, filtering, aggregation etc… and enhance that functionality with external services that can be called on demand from within a query – in parallel. These external services could be practically anything that provides us with a REST API, and is interactive (ie: performant) enough that our query still completes in a reasonable time.
To demonstrate this concept, I thought an interesting example would be to see if we could use one of the “hot” open source technologies of today – Spark. Can we bring Teradata and Spark together, by using a table script operator (written in python) to conduct REST calls to a Spark cluster, to perform some custom analytic functions and return the result to our SQL query.
Setting up the Spark Cluster
Spark is still very much a batch oriented processing framework, although it does have the ability to provide low latency responses through long running job contexts. By itself the REST API is also very batch focused, and requires uploading of jar files to initiate a job. This doesn’t sound very interactive or low latency for us to use within a Teradata SQL query.
Luckily, some clever people from Cloudera have developed a REST server called Livy which provides the capability to issue “interactive” scala or python against a spark cluster, similar to the interactive command line use of spark-shell, but via REST calls.
So using Livy + Spark provides us with a model for “Spark as a service”, allowing us to interactively provide code to the spark cluster for execution.
To setup the Spark cluster I used a Centos VM, installed docker and created the Spark cluster using containers, with the Livy server running on the Spark Master node. For details of getting this up and running, see “Appendix A: Configuring the Spark Cluster” at the end of this blog.
Setting up the Teradata VM
There is some setup required on the Teradata VM to install the python “requests” module, which we will use to perform the REST calls to the spark cluster.
For details of setting up the Teradata VM, see “Appendix B: Configuring the Teradata VM” at the end of this blog.
Performing analysis on Spark
Now that we have our Teradata Database and Spark Cluster setup, we can use the two together to perform analytics.
For this example we will use Spark to perform some simple text analytics on string data provided by Teradata (SQL statement text). We will supply the string data to Spark, tokenize it and produce ngrams.
Python is used to provide the REST function calls to interact with Livy. We will execute the “analytics” using scala.
The code used in this example can be found here.
Remember I mentioned Spark provides interactive processing through long running job contexts? What that means for us, is that we have to startup a session, wait for it to start, then submit our code for each row of input, then close the session when we are finished. So we do incur overhead of starting up the session, but while we process each row in parallel (on each AMP in the Teradata system, 2 for the VM) we are executing the code in the “interactive” mode.
When the session has started up, we can issue the spark “query” for each record of input.
The SQL is passed in as a text string and a dataframe is created (“sentenceDataFrame”). The dataframe is tokenized and a list of tokens is returned. The tokens are passed into the NGram function and we return the list of ngram’s generated.
The results of the REST call to the spark cluster is returned to Teradata as a JSON object, in this case with a list of ngram’s for each SQL statement processed.
For the Teradata side of things, we need to create the table script operator and execute it.
Example code for setup of the table script operator is here.
The example SQL will execute the script operator (ie: the python code) which performs the REST calls to the spark cluster, generates the ngrams and returns the results to be stored as a JSON structure for each record, ready for further analysis (using Teradata’s convenient “dot” notation).
This is just one example of the concept of using the combination of an open source platform and Teradata to enable new and innovative types of analytics within the ecosystem, leveraging the strengths of the different platforms, tools and technologies.
Using REST, API’s, Teradata’s extensibility features and programming languages allow us to create analytic capability that was never possible before. It won’t all be smooth sailing, as many of the combinations of these tools will be untried and untested, but the effort is worthwhile to create innovative analytic capability which is a leap ahead of your competition.
Appendix A: Configuring the Spark Cluster
1. Download Centos VM Image
2. Follow setup instructions here
3. Startup VM, wait for a few minutes, as the image will auto-update. It will notify you that updates are available, choose the “install and restart” option.
4.When it has restarted and all updates completed, install docker.
5. Follow the Centos install instructions here
6. We will also be using docker compose, to build the spark cluster.
7. Install docker compose with instructions here: https://docs.docker.com/compose/install/
8. The curl commands to install docker-compose must be ran as root (not sudo).
#curl -L https://github.com/docker/compose/releases/download/1.11.1/docker-compose-`uname -s`-`uname -m` > /usr/local/bin/docker-compose
#chmod +x /usr/local/bin/docker-compose
9. Pull the docker centos container: sudo docker pull centos
10. Make a directory to store your docker configuration files: mkdir spark
11. I used the spark docker config files here as a basis for my cluster configuration
12. We need to adjust for centos (yum instead of apt-get), get the specific version of Spark that is compatible with Livy and also configure the Spark Master container to also run the Livy REST server.
13. The two files needed to create the spark cluster are Dockerfile and docker-compose.yml. You can download them from github here.
14. Create a directory and place the files inside (or clone the git repository)
15. Build the image for the containers:
sudo docker build -t spark-2:latest .
16. Startup the Spark cluster:
17.You can see the status of the cluster via the web ui: <vm ip address>:8080
18. To stop the cluster press <ctrl-c>.
19. To remove containers/cleanup:
20. If you reboot the VM and the docker demon doesn’t start automatically, issue this command to start it:
service docker start
Appendix B: Configuring the Teradata VM
1. The default version of python on the VM (2.6.9) does not have pip installed, so we have to manually install the requests module.
2. Firstly setup internet access on the VM.
3. Comment out existing nameserver lines.
4. Add this line (or modify an existing one):
5. Download and install the requests module:
wget –no-check-certificate here
tar -xvf ./requests-2.6.0.tar.gz
python ./setup.py install
6. When you install as root, the “other” permissions are blank, so no other users can access the modules installed, this needs to be fixed:
chmod -R a+rx /usr/local/lib64/python2.6
7. The memory limit available for script operators also needs to be increased. This is done using the Teradata “cufconfig” utility (as the root user).
8. Firstly, setup the terminal window to use Teradata commands:
9. You can see the current settings with:
10. Create a file with the new setting in it:
11. The file contents should have this single line:
12. Use the cufconfig utility again to set the new limit:
cufconfig -f /tmp/newcuf.txt