Teradata 15.0 has gotten tremendous interest from customers and the press because it enables SQL access to native JSON data. This heralds the end of the belief that data warehouses can’t handle unstructured data. But there’s an equally momentous innovation in this release called Teradata QueryGrid.
What is Teradata QueryGrid?
In Teradata’s Unified Data Architecture (UDA), there are three primary platforms: the data warehouse, the discovery platform, and the data platform. The huge gray arrows represent data flowing between these systems. A year or two ago, these arrows were extract files moved in batch mode.
Teradata QueryGrid is both a vision and a technology. The vision --simply said-- is that a business person connected to the Teradata Database or Aster Database can submit a single SQL query that joins data together from a second system for analysis. There’s no need to plead with the programmers to extract data and load it into another machine. The business person doesn’t have to care where the data is – they can simply combine relational tables in Teradata with tables or flat files found in Hadoop on demand. Imagine a data scientist working on an Aster discovery problem and needing data from Hadoop. By simply adjusting the queries she is already using, Hadoop data is fetched and combined with tables in the Aster Database. That should be a huge “WOW” all by itself but let’s look further.
You might be saying “That’s not new. We’ve had data virtualization queries for many years.” Teradata QueryGrid is indeed a form of data virtualization. But Teradata QueryGrid doesn’t suffer from the normal limitations of data virtualization such as slow performance, clogged networks, and security concerns.
Today, the vision is translated into reality as connections between Teradata Database and Hadoop as well as Aster Databases and Hadoop. Teradata QueryGrid also connects the Teradata Data Warehouse to Oracle databases. In the near future, it will extend to all combinations of UDA servers such as Teradata to Aster, Aster to Aster, Teradata to Teradata, and so on.
Seven League Boots for SQL
With QueryGrid, you can add a clause in a SQL statement that says “Call up Hadoop, pass Hive a SQL request, receive the Hive results, and join it to the data warehouse tables.” Running a single SQL statement spanning Hadoop and Teradata is amazing in itself – a giant step forward. Notice too that all the database security, advanced SQL functions, and system management in the Teradata or Aster system is supporting these queries. The only effort required is for the database administrator to set up a “view” that connects the systems. It’s self-service for the business user after that. Score: complexity zero, business users one.
Parallel Performance, Performance, Performance
Historically, data virtualization tools lack the ability to move data between systems in parallel. Such tools send a request to a remote database and the data comes back serially through an Ethernet wire. Teradata QueryGrid is built to connect to remote systems in parallel and exchange data through many network connections simultaneously. Wanna move a terabyte per minute? With the right configurations it can be done. Parallel processing by both systems makes this incredibly fast. I know of no data virtualization system that does this today.
Inevitably, the Hadoop cluster will have a different number of servers compared to the Teradata or Aster MPP systems. The Teradata and Aster systems start the parallel data exchange by matching up units of parallelism between the two systems. That is, all the Teradata parallel workers (called AMPs) connect to a buddy Hadoop worker node for maximum throughput. Anytime the configuration changes, the workers match-up changes. This is non-trivial rocket-science class technology. Trust me – you don’t want to do this yourself and the worst situation would be to expose this to the business users. But Teradata QueryGrid does it all for you completely invisible to the user.
Put Data in the Data Lake FAST
Imagine complex predictive analytics using R® or SAS® are run inside the Teradata data warehouse as part of a merger and acquisition project. In this case, we want to pass this data to the Hadoop Data Lake where it is combined with temporary data from the company being acquired. With moderately simple SQL stuffed in a database view, the answers calculated by the Teradata Database can be sent to Hadoop to help finish up some reports. Bi-directional data exchange is another breakthrough in the Teradata Query Grid, new in release 15.0. The common thread in all these innovations is that the data moves from the memory of one system to the memory of the other. No extracts, no landing the data on disk until the final processing step – and sometimes not even then.
What we don’t want to do is transfer terabytes of data from Hadoop and throw away 90% of it since it’s not relevant. To minimize data movement, Teradata QueryGrid sends the remote system SQL filters that eliminate records and columns that aren’t needed. An example constraint could be “We only want records for single women age 30-40 with an average account balance over $5000. Oh, and only send us the account number, account type, and address.” This way, the Hadoop system discards unnecessary data so it doesn’t flood the network with data that will be thrown away. After all the processing is done in Hadoop, data is joined in the data warehouse, summarized, and delivered to the user’s favorite business intelligence tool.
Teradata QueryGrid delivers some important benefits:
• It’s easy to use: any user with any BI tool can do it
• Low DBA labor: it’s mostly setting up views and testing them once
• High performance: reducing hours to minutes means more accuracy and faster turnaround for demanding users
• Cross-system data on demand: don’t get stuck in programmer’s work queue
• Teradata/Aster strengths: security, workload management, system management
• Minimum data movement improves performance and reduces network use
• Move the processing to the data
Big data is now taking giant steps through your analytic architecture --frictionless, invisible, and in parallel. Nice boots!
Latest posts by Dan Graham (see all)
- Data Lake Best and Worst Practices - November 10, 2014
- MongoDB and Teradata QueryGrid – Even Better Together - June 19, 2014
- How $illy is Cost per Terabyte? - May 16, 2014