Big Analytics


It is well known that there are two extreme alternatives for storing database tables on any storage medium: row-by-row (as done by traditional “row-store” technology) or column-by-column (as done by the recently popular “column-store” implementations). Row-stores store the entire first row of the table, followed by the entire second row, and so on. Column-stores store the entire first column of the table, followed by the entire second column, and so on. A large body of research literature and commercial whitepapers discusses the advantages of each approach, along with various proposals for hybrid solutions (which I discussed in more detail in my previous post).

Despite the many conflicting arguments in favor of these different approaches, there is little question that column-stores compress data much better than row-stores. The reason is fairly intuitive: in a column-store, entire columns are stored contiguously --- in other words, a series of values from the same attribute domain are stored consecutively. In a row-store, values from different attribute domains are interspersed, thereby reducing the self-similarity of the data. In general, the more self-similarity (lower entropy) a dataset has, the more compressible it is. Hence, column-stores are more compressible than row-stores.

In general, compression rates are very sensitive to the particular dataset being compressed, so it is impossible to guarantee how much a particular database system/compression algorithm will compress an arbitrary dataset. However, as a general rule of thumb, it is reasonable to expect around 8X compression from a column-store on many kinds of datasets. 8X compression means that the compressed dataset is 1/8th the original size, and scan-based queries over the dataset can thus proceed approximately 8 times as fast. This stellar compression and the resulting performance improvement are a major contributor to the recent popularity of column-stores.
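To make this concrete, here is a minimal, hypothetical experiment in Python. It uses the general-purpose zlib codec as a stand-in for the specialized encodings real systems use, and serializes the same toy orders table (similar to the example below) once row-by-row and once column-by-column before compressing each layout. The column-major layout, which keeps values from the same attribute domain adjacent, should compress noticeably better; the exact ratios depend on the data and the codec.

    import random
    import zlib

    random.seed(42)
    N = 100_000

    # A toy orders table with low-cardinality columns, like the example below.
    dates    = [random.choice(["03/22/2015", "03/23/2015"]) for _ in range(N)]
    products = [random.choice(["bicycle", "lock", "tire"]) for _ in range(N)]
    prices   = [random.choice(["300", "280", "250", "70", "18"]) for _ in range(N)]

    # Row-store layout: values from different attribute domains are interleaved.
    row_major = "|".join("|".join(rec) for rec in zip(dates, products, prices)).encode()

    # Column-store layout: each column is stored contiguously.
    col_major = "|".join("|".join(col) for col in (dates, products, prices)).encode()

    for name, blob in (("row-major", row_major), ("column-major", col_major)):
        ratio = len(blob) / len(zlib.compress(blob, 9))
        print(f"{name}: {len(blob):,} bytes raw, {ratio:.1f}x compressed")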

It is precisely this renowned compression of column-stores that makes the compression rate of RainStor (a recent Teradata acquisition) so impressive in comparison. RainStor claims five times the compression that column-stores achieve on the same datasets, and 40X compression overall.

Although the reason why column-stores compress data better than row-stores is fairly intuitive, the reason why RainStor can compress data better than column-stores is less intuitive. Therefore, we will now explain this in more detail.

Take for example the following table, which is a subset of a table describing orders from a particular retail enterprise that sells bicycles and related parts. (A real table would have many more rows and columns, but we keep this example simple so that it is easier to understand what is going on).

Record   Order date    Ship date     Product     Price
1        03/22/2015    03/23/2015    “bicycle”   300
2        03/22/2015    03/24/2015    “lock”       18
3        03/22/2015    03/24/2015    “tire”       70
4        03/22/2015    03/23/2015    “lock”       18
5        03/22/2015    03/24/2015    “bicycle”   250
6        03/22/2015    03/23/2015    “bicycle”   280
7        03/22/2015    03/23/2015    “tire”       70
8        03/22/2015    03/23/2015    “lock”       18
9        03/22/2015    03/24/2015    “bicycle”   280
10       03/23/2015    03/24/2015    “lock”       18
11       03/23/2015    03/25/2015    “bicycle”   300
12       03/23/2015    03/24/2015    “bicycle”   280
13       03/23/2015    03/24/2015    “tire”       70
14       03/23/2015    03/25/2015    “bicycle”   250
15       03/23/2015    03/25/2015    “bicycle”   280


The table contains 15 records and shows four attributes --- the order date, the ship date, the product that was purchased, and the purchase price. Note that there are relationships between some of these columns --- in particular, the ship date is usually one or two days after the order date, and the price of each product is usually consistent across orders, with slight variations depending on what coupons the customer used to make the purchase.

A column-store would likely use “run-length encoding” to compress the order date column. Since records are sorted by order date, this would compress the column to its near-minimum --- it can be compressed as (03/22/2015, 9); (03/23/2015, 6) --- which indicates that 03/22/2015 is repeated 9 straight times, followed by 03/23/2015, which is repeated 6 times. The ship date column, although not sorted, is still very compressible, as each value can be expressed using a small number of bits indicating how much larger (or smaller) it is than the previous value in the column. However, the other two columns --- product and price --- would likely be compressed using a variant of dictionary compression, where each value is mapped to the minimal number of bits needed to represent it. For large datasets, where there are many unique values for price (or even for product), the number of bits needed to represent a dictionary entry is non-trivial, and the same dictionary entry is repeated in the compressed dataset for every repeated value in the original dataset.
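As an illustration, here is a hypothetical Python sketch of both encodings applied to two of the example columns (not the exact encoding of any particular column-store product). Note how the run-length-encoded column collapses to two pairs, while the dictionary-encoded column still carries one code per row:

    from itertools import groupby

    order_dates = ["03/22/2015"] * 9 + ["03/23/2015"] * 6
    products = ["bicycle", "lock", "tire", "lock", "bicycle", "bicycle", "tire",
                "lock", "bicycle", "lock", "bicycle", "bicycle", "tire",
                "bicycle", "bicycle"]

    # Run-length encoding: collapse each run of identical values to (value, count).
    rle = [(value, sum(1 for _ in run)) for value, run in groupby(order_dates)]
    print(rle)  # [('03/22/2015', 9), ('03/23/2015', 6)]

    # Dictionary encoding: map each distinct value to a small integer code; the
    # code stream still repeats one entry per row of the original column.
    dictionary = {v: i for i, v in enumerate(dict.fromkeys(products))}
    codes = [dictionary[v] for v in products]
    print(dictionary)  # {'bicycle': 0, 'lock': 1, 'tire': 2}
    print(codes)       # [0, 1, 2, 1, 0, 0, 2, 1, 0, 1, 0, 0, 2, 0, 0]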

In contrast, in RainStor, every unique value in the dataset is stored once (and only once), and every record is represented as a binary tree, where a breadth-first traversal of the tree enables the reconstruction of the original record. For example, the table shown above is compressed in RainStor using the forest of binary trees shown below. There are 15 binary trees (each of the 15 roots of these trees is shown as a green circle at the top of the figure), corresponding to the 15 records in the original dataset.

Forest of Binary Trees Compression

For example, the binary tree corresponding to record 1 is shown on the left side of the figure. The root points to two children --- the internal nodes “A” and “E”. In turn, node “A” points to 03/22/2015 (corresponding to the order date of record 1) and to 03/23/2015 (corresponding to the ship date of record 1). Node “E” points to “bicycle” (corresponding to the product of record 1) and to “300” (corresponding to the price of record 1).

Note that records 4, 6, and 7 also have an order date of 03/22/2015 and a ship date of 03/23/2015. Therefore, the roots of the binary trees corresponding to those records also point to internal node “A”. Similarly, note that record 11 is also associated with the purchase of a bicycle for $300. Therefore, the root for record 11 also points to internal node “E”.

These shared internal nodes are what make RainStor’s compression algorithm fundamentally different from any algorithm that a column-store is capable of performing. Column-stores are forced to create dictionaries and search for patterns only within individual columns. In contrast, RainStor’s compression algorithm finds patterns across different columns --- identifying the relationship between ship date and order date and the relationship between product and price, and leveraging these relationships to share branches in the trees that are formed, thereby eliminating redundant information. RainStor thus has fundamentally more room to search for patterns in the dataset, and it compresses data by referencing these patterns via the (compressed) location of the root of the shared branch.
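The idea can be sketched in a few lines of Python. The following is a toy model of the forest (a hand-rolled sketch, not RainStor's actual implementation): identical (left, right) pairs are interned in a shared node table, so the 15 example records reduce to just 9 distinct internal nodes plus one small root per record.

    # Toy model of the forest-of-binary-trees idea (not RainStor's actual code).
    table = [
        ("03/22/2015", "03/23/2015", "bicycle", 300),
        ("03/22/2015", "03/24/2015", "lock", 18),
        ("03/22/2015", "03/24/2015", "tire", 70),
        ("03/22/2015", "03/23/2015", "lock", 18),
        ("03/22/2015", "03/24/2015", "bicycle", 250),
        ("03/22/2015", "03/23/2015", "bicycle", 280),
        ("03/22/2015", "03/23/2015", "tire", 70),
        ("03/22/2015", "03/23/2015", "lock", 18),
        ("03/22/2015", "03/24/2015", "bicycle", 280),
        ("03/23/2015", "03/24/2015", "lock", 18),
        ("03/23/2015", "03/25/2015", "bicycle", 300),
        ("03/23/2015", "03/24/2015", "bicycle", 280),
        ("03/23/2015", "03/24/2015", "tire", 70),
        ("03/23/2015", "03/25/2015", "bicycle", 250),
        ("03/23/2015", "03/25/2015", "bicycle", 280),
    ]

    node_ids = {}  # (left, right) pair -> node id, so equal subtrees are shared
    nodes = []     # node id -> (left, right)

    def intern(left, right):
        """Store each distinct (left, right) pair once and return its id."""
        key = (left, right)
        if key not in node_ids:
            node_ids[key] = len(nodes)
            nodes.append(key)
        return node_ids[key]

    roots = []
    for order_date, ship_date, product, price in table:
        date_node = intern(order_date, ship_date)  # e.g. node "A" for record 1
        item_node = intern(product, price)         # e.g. node "E" for record 1
        roots.append((date_node, item_node))       # one root per record

    print(len(roots), "roots,", len(nodes), "shared internal nodes")  # 15 roots, 9 nodes

    # Reconstructing record 1 is a breadth-first read of its tree:
    date_node, item_node = roots[0]
    print(nodes[date_node] + nodes[item_node])
    # ('03/22/2015', '03/23/2015', 'bicycle', 300)

In this toy encoding, every unique value and every repeated cross-column pattern is stored exactly once, and each record costs only a root holding two small node ids.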

For a traditional archiving solution, compression rate is arguably the most important feature (right up there with immutability). Indeed, RainStor’s compression algorithm enables it to be used for archival use-cases, and RainStor provides all of the additional features you would expect from an archiving solution: encryption, LDAP/AD/PAM/Kerberos/PCI authentication and security, audit trails and logging, retention rules, expiry policies, and integrated implementation of existing compliance standards (e.g. SEC 17a-4).

However, what brings RainStor to the next level in the archival solutions market is that it is an “active” archive, meaning that the data managed by RainStor can be queried at high performance. RainStor provides a mature SQL stack for native querying of compressed RainStor data, including ANSI SQL 1992 and 2003 parsers and a full MPP query execution engine. For enterprises with Hadoop clusters, RainStor is fully integrated with the Cloudera and Hortonworks distributions of Hadoop --- RainStor compressed data files can be partitioned over an HDFS cluster and queried in parallel with HiveQL (or MapReduce or Pig). Furthermore, RainStor integrates with YARN for resource management, with HCatalog for metadata management, and with Ambari for system monitoring and management.

The reason why most archival solutions are not “active” is that the compression algorithms used to reduce the data size before archival are so heavy-weight that significant processing resources must be invested in decompressing the data before it can be queried. It is therefore preferable to leave the data archived in compressed form and decompress it only at times of significant need. In general, a user should expect significantly reduced query performance relative to querying uncompressed data, to account for the additional decompression time.

The beauty of RainStor’s compression algorithm is that even though it achieves compression ratios comparable to other archival products, it is not so heavy-weight that the data must be decompressed prior to querying it. In particular, the binary tree structures shown above are fairly straightforward to perform query operations on directly, without decompression prior to access. For example, a count distinct or a group-by operation can be performed via a scan of the leaves of the binary trees. Furthermore, selections can be performed via a reverse traversal of the binary trees from the leaves that match the selection predicate. In general, since there is a one-to-one mapping between records in the uncompressed dataset and the binary trees in RainStor’s compressed files, all query operations can be expressed in terms of operations on these binary trees. Therefore, RainStor queries can benefit from the I/O improvement of scanning in less data (due to the smaller size of the compressed files on disk/in memory) without paying the cost of fully decompressing these files after they are read from storage. This leads to RainStor’s claims of 2X-100X performance improvement on most queries --- an industry-leading claim in the archival market.
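In the spirit of the toy forest above, here is a self-contained, hypothetical sketch of these three operations, with the shared node table for the first five example records written out by hand. Aggregates touch each distinct node once (weighted by how many roots reference it), and the selection matches a leaf pair once and then walks back up to the referencing roots:

    from collections import Counter, defaultdict

    # Hand-built mini forest for the first five example records: a shared node
    # table plus one (date_node, item_node) root per record.
    nodes = [("03/22/2015", "03/23/2015"),  # 0: date pair (node "A" in the figure)
             ("03/22/2015", "03/24/2015"),  # 1: date pair
             ("bicycle", 300),              # 2: item pair (node "E" in the figure)
             ("lock", 18),                  # 3: item pair
             ("tire", 70),                  # 4: item pair
             ("bicycle", 250)]              # 5: item pair
    roots = [(0, 2), (1, 3), (1, 4), (0, 3), (1, 5)]  # records 1..5

    # COUNT(DISTINCT product): scan the distinct item leaves, not the records.
    print(len({nodes[item][0] for _, item in roots}))  # 3

    # SELECT product, SUM(price) GROUP BY product: visit each distinct item
    # node once, weighted by its reference count.
    refs = Counter(item for _, item in roots)
    totals = defaultdict(int)
    for item, n in refs.items():
        product, price = nodes[item]
        totals[product] += price * n
    print(dict(totals))  # {'bicycle': 550, 'lock': 36, 'tire': 70}

    # WHERE price = 18: match the leaf pair once, then reverse-traverse to the
    # roots that reference it.
    print([r + 1 for r, (_, item) in enumerate(roots) if nodes[item][1] == 18])
    # -> [2, 4], i.e. records 2 and 4

None of this requires materializing the original rows; the operators work over the node table and the roots directly.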

In short, RainStor’s strong claims around compression and performance are backed up by the technology that is used under the covers. Its compression algorithm is able to identify and remove redundancy both within and across columns. Furthermore, the resulting data structures produced by the algorithm are amenable to direct operation on the compressed data. This allows the compressed files to be queried at high performance, and positions RainStor as a leading active-archive solution.

_________________________________________________________________________


Daniel Abadi is an Associate Professor at Yale University, founder of Hadapt, and a Teradata employee following the recent acquisition. He does research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil. from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high-performance transactional systems (the H-Store project, commercialized by VoltDB), and Hadapt (acquired by Teradata). http://twitter.com/#!/daniel_abadi.

Data-Driven Business and the Need for Speed

Posted on: February 14th, 2014 by Guest Blogger

What stands in the way of effective use of data? Let’s face it: it’s people as much as technology. Silos of data and expertise cause friction in many companies. To create a winning strategy for data-driven business, you need fast, effective collaboration between teams, something much more like a pit crew than an org chart. To gain speed, you must find a way to break the silos between groups.

In most organizations, multiple teams are involved with data. Data may be used directly, passed on to another stage as an input, or both. Each use case involves end users, user interface designers, data scientists, business analysts, programmers, and operations people.

For BI to work optimally it must break through at least three silos:

  • Silo 1: BI architects
  • Silo 2: Data analysts
  • Silo 3: Business users

First, those focusing on data prep, programming and advanced analysis must not only work together but must be directed by the needs of the business users, ideally with a tight feedback loop. You don’t just want to get something in the hands of employees; you want to get the right thing in front of each of them, and that means communicating and understanding their needs.

Other types of silos exist between lines of business. For an insight to have maximum impact, it must find its way to those who need it most. Everyone must be on the lookout for signals that can be used by other lines of business and pass them along.

Silos Smashed: An End-to-End View
There’s no way to achieve an end-to-end view without breaking down silos. Usage statistics can show what types of analysis are popular, which increases transparency. Cross-functional teams that include all stakeholders (BI architects, data analysts, and business users) also help in breaking down barriers.

With the silos smashed, you can acquire value from data faster. Speed is key and breaking down silos can do for BI what pit crews do for racing. Data-driven business moves fast, and we must find more efficient ways of working together, cutting time off each iteration as we move toward real-time delivery of the right data to the right person at the right time.

For example, we should be able to see how changing a price affects customers and impacts the bottom line. And we should be able to do that on a store manager’s mobile device in real time, not waiting until the back office runs the report and comes down from Mount Olympus with the answer. Immediacy of data for decision-making -- that’s what drives competitive advantage.

While I'm at Strata, I'll be looking for new ideas about how to break the silos and speed time to value from data. I’m interested to hear your thoughts as well.

By: Dan Woods, Forbes Blogger and Co-Founder of Evolved Media

To learn more:

Optimize the Value of All the Data white paper
The Intelligent Enterprise infographic

Big Apple Hosts the Final Big Analytics Roadshow of the Year

Posted on: November 26th, 2013 by Teradata Aster

Speaking of ending things on a high note, New York City on December 6th will play host to the final event in the Big Analytics 2013 Roadshow series. Big Analytics 2013 New York is taking place at the Sheraton New York Hotel and Towers in the heart of Midtown on bustling 7th Avenue.

Reflecting on the illustrious journey of the Big Analytics 2013 Roadshow – it kicked off in San Francisco this year and traveled through major international destinations including Atlanta, Dallas, Beijing, Tokyo, and London before culminating at the Big Apple – the series truly encapsulated today’s appetite for collecting, processing, understanding, and analyzing data.


Big Analytics Roadshow 2013 stops in Atlanta

Drawing business & technical audiences across the globe, the roadshow afforded the attendees an opportunity to learn more about the convergence of technologies and methods like data science, digital marketing, data warehousing, Hadoop, and discovery platforms. Going beyond the “big data” hype, the event offered learning opportunities on how technologies and ideas combine to drive real business innovation. Our unyielding focus on results from data is truly what made the events so successful.

Continuing the rich lineage of delivering quality Big Data information, the New York event promises to pack a tremendous amount of Big Data learning & education. The keynotes for the event include such industry luminaries as Dan Vesset, Program VP of Business Analytics at IDC; Tasso Argyros, Senior VP of Big Data at Teradata; and Peter Lee, Senior VP of Tibco Software.


Teradata team at the Dallas Big Analytics Roadshow


The keynotes will be followed by three tracks around Big Data Architecture, Data Science & Discovery & Data Driven Marketing. Each of these tracks will feature industry luminaries like Richard Winter of WinterCorp, John O’Brien of Radiant Advisors & John Lovett of Web Analytics Demystified. They will be joined by vendor presentations from Shaun Connolly of Hortonworks, Todd Talkington of Tableau & Brian Dirking of Alteryx.

As with every Big Analytics event, New York presents an exciting opportunity to hear firsthand from leading organizations like Comcast, Gilt Groupe & Meredith Corporation on how they are using Big Data Analytics & Discovery to deliver tremendous business value.

In summary, the event promises to be nothing less than the Oscars of Big Data and will bring together the who’s who of the Big Data industry. So, mark your calendars, pack your bags and get ready to attend the biggest Big Data event of the year.

Big Elephant Eats Data Warehouse

Posted on: September 19th, 2013 by Dan Graham

-- Teradata PR pit boss: “Dan, have you seen this Big Elephant Eats Data Warehouse article at BigMedia.com? This cub reporter guy’s like Rip Van Winkle waking up and trying to explain how the iPhone works. He’s just making things up. Get this Willy Everlern reporter on the phone.”

Ring ring ringtone. “Hello, Willy? Willy Everlern? This is Dan at Teradata again.”
--Willy: “Oh hi Dan. What’s happening out in Silicon Valley?”

--Dan: “It’s your latest blog Willy. That Big Elephant Eats Data Warehouse is clear, simple, and wrong. Hadoop has not stalled our data warehouse sales at all.”

--Willy: “Hey, I didn’t say that. Read it again. It says ‘Hadoop is eating the data warehouse market. Database heavyweights like Teradata are seeing slow growth because of Hadoop.’ See, I said slow growth --not no growth.”

--Dan: “Iszzat so? Willy, Hadoop is not Godzilla stomping on data warehouses -- it’s a cute baby elephant, remember? In a recent Data Warehousing Institute (TDWI) customer survey, 78% of customers said ‘Hadoop complements a DW; it’s rarely a replacement.’ And IDC says Teradata database software grew at 14% last year and 14% the year before. How can you call that slow growth? I wish my retirement funds grew that slow every year. IDC also says analytic databases accounted for $11B in 2012 -- that’s just databases; no BI, ETL, hardware, or services. According to Wikibon, the Hadoop market was around $256 million last year for software AND services. So even if half of that $256M was software revenue, it’s only about 1% of analytic database software revenue.”

--Willy: “Well, I did do what you told me last time and talked to a Hadoop vendor, who told me three of your customers -- A, B, and E -- offloaded data from Teradata to Hadoop. That’s why I said what I said.”

--Dan: “I’m glad you brought that up. All three of those companies offloaded low value data and processing from their Teradata Warehouse to Hadoop back in 2011. They did it so they could put new high value workloads into the data warehouse. Optimizing assets is just common sense for any CIO. But those new applications grew so fast that company A and company E bought huge Teradata system upgrades in 2012 at millions of dollars each. If that’s what Hadoop does to our data warehouses, we need more Hadoop. I encourage you to talk to vendors but when they tell you things like that, check out the other side of the story. Willy, that big white elephant isn’t taking market share or slowing our growth.”

--Willy: “Yellow.”

--Dan: “What?”

--Willy: “You called Hadoop a white elephant. It’s yellow.”

--Dan: “Sorry, Willy. I’m a joker. It’s a congenital disease in my family.”

--Dan: “But on a serious note, my boss was really upset with the statement that ‘Hadoop is a whole new paradigm of analytics.’ Willy, this one hurts. Companies like Teradata and SAS have been in the analytics business for 30 years. The BI/data warehouse community has been doing consumer 360-degree analysis, fraud detection, recommendation engines, risk, and profitability analysis for 20+ years. According to Gartner, ‘There continues to be much hype about the advantages of open source community innovation. In this case, it’s often innovating what has been in SQL since standards like SQL92.’ Copying what databases have done since 1992 is not innovation -- it’s 20 years of catching up.”

--Willy: “You don’t like Hadoop do you?”

--Dan: “Actually, I like Hadoop when it’s applied to the right workload --but I’m allergic to hype. You know Willy, Teradata sells Hadoop appliances so we are committed to its success. At Teradata, we co-invented SQL-H and HCatalog with Hortonworks for high speed data exchange. We even promote a reference architecture called Universal Data Architecture with Hadoop smack dab in the middle of it. But back to your point, if you want to see Hadoop innovation, look into YARN and Tez. Those Hortonworks guys are onto something.”

--Willy: “Well, you still have to admit that Hadoop is free where data warehouses cost $20,000 per terabyte. I found that on a dozen blogs and websites.”

--Dan: “Willy, don’t believe everything you hear on the internet. There are websites out there that still think the moon landing was faked and TV wrestling is real. That stuff about Hadoop being free at $1000 a terabyte is self-contradicting. That’s Silly Con Valley hype at its worst. Recently The Data Warehousing Institute said ‘Hadoop is not free, as many people have mistakenly said about it. A number of Hadoop users speaking at recent TDWI conferences have explained that Hadoop incurs substantial payroll costs due to its intensive hand coding (normally done by high-payroll personnel such as data scientists) and its immature, non-productive tools.’ Don’t get me wrong. Some Silicon Valley companies don’t use hype. I’ll also point you to Dr. Shacham -- Chief Data Scientist at PayPal -- who did tests showing that the cost of a query on Hadoop was roughly the same as on Teradata systems. That one’s a stunner!

Plus earlier this summer, Richard Winter, the all-time big data virtuoso, published research showing data warehouses are cheaper than Hadoop for – are you sitting down – queries and analytics. By the way Willy, we just had a ridiculous price reduction on our extreme data appliance that puts us incredibly close to Hadoop’s cost per terabyte.”

--Willy: “OK. OK. I get it. So there is a lot of internet hype about Hadoop. It’s getting so I don’t know who to trust anymore.”

--Dan: “Well, I stick by my suggestion from last month. You should probably talk to vendors first, then talk to Gartner, IDC, The Data Warehousing Institute, Ventana, and then some customers. And don’t forget to give me a call -- I can hook you up with our customers who are living with Teradata and Hadoop.”

Later.
--Teradata PR pit boss: “Seems like Willy Everlern is struggling to learn.”
--Dan: “He’s not alone. I’m learning every day – I hope.”
---------
TDWI, Integrating Hadoop Into Business Intelligence and Data Warehousing, March 2013
IDC, Worldwide Business Analytics Software 2013–2017 Forecast and 2012 Vendor Shares, June 2013
Wikibon, Hadoop-NoSQL Software and Services Market Forecast 2012-2017, http://wikibon.org/wiki/v/Hadoop-NoSQL_Software_and_Services_Market_Forecast_2012-2017
Gartner, Merv Adrian, Hadoop Summit Recap Part Two, http://blogs.gartner.com/merv-adrian, July 2013
Dr. Nachum Shacham, Chief Data Scientist, eBay/PayPal, http://www-conf.slac.stanford.edu/xldb2011/talks/xldb2011_tue_1330_Shacham.pdf
Richard Winter, www.wintercorp.com/tcod-report, August 2013

Willy Everlern and Big Data Hype-ocracy

Posted on: September 4th, 2013 by Dan Graham

Willy Everlern is a young reporter at BigMedia.com who doesn’t understand data warehouses or computers.  His boss pushes Willy into many topics, so it’s hard for Willy to master any of them.  Even worse, Willy thinks compiling articles and one-liners from a bunch of internet blogs is research.  He doesn’t call the analyst firms, vendors, or customers for facts.  So some of Willy’s articles are speculations built on a firm foundation of hype.   Willy’s recent BigMedia.com article has gotten Mike-O, a PR JD (Jive Detector)  at Teradata, in a huff.

--Mike-O: “Dan, have you seen this Big Data: Beyond the Data Warehouse article at BigMedia.com?  It’s wrong on too many levels.  Looks like he’s diagonally parked in a parallel universe. Call this guy NOW.”

--Ring ring: “Hello, Willy?  This is Dan at Teradata.”

--Willy: “Oh, howdy Dan.  How’s everything going these days?”

--Dan: “Willy, we need to talk about your recent blog article.  There are a whole bunch of errors that have my customers calling us all confused. And you scared one investor to death.”

--Willy: “What errors?  I worked really hard on Big Data: Beyond the Data Warehouse. “

--Dan: “Well, let’s look at your article. Skip to where it says ‘Database technology is ill-suited for big data.’ That may be true of most databases, but it’s not true of Teradata databases. Our core competency for over 30 years has been scalability to the largest databases in the world. Our Petabyte Club of customers has over 50 member installations. One of those machines is over 60 petabytes in size. And our Aster database is chugging along with some 800-terabyte installations. That’s big in most people’s thinking. I can get you reference calls with some of these customers. You should also check out Ventana’s blog on Teradata Addresses the Foundation of Big Data Analytics.1”

--Willy: “Yeh, but it’s not the new Map Reduce stuff.  You missed my point.”

--Dan: “Well, if MapReduce is your definition of big data, Gartner says Teradata Aster is one of two databases that implement MapReduce directly inside a DBMS.2 Actually, Aster is the first database with full MapReduce in it. (Psst, Willy -- keep this a secret, but we share technology between Teradata Database and Aster. Think about it.) So we’ve got scale-out AND MapReduce in our databases. I admit, only one other vendor can do that, so you weren’t completely off base. But Teradata is an exception when it comes to big data. It’s our specialty.”

--Willy:  “OK, I’ll give you this one.  You got me.”

--Dan: “Just a little advice Willy – never buy sushi from a vending machine.”

--Willy: “What?”

--Dan: “Forget it.  I’m just playing with your head.”

--Dan: “So Willy, there’s another thing to discuss. Skip to where the article says ‘Databases can’t handle multi-structured data.’ Willy, the majority of databases can’t handle tweets, web-surfing logs, and internet-connected sensors. You got that part right. But Teradata and Aster databases are an exception again. They peel apart multi-structured data easily. Remember that 60-petabyte Teradata system I mentioned? The customer uses it to unravel weblogs into name-value pairs and then do analysis of consumer purchasing behavior. Weblogs are as unstructured as it gets -- the stuff looks like a cosmic hairball of data. Aster’s SQL-MapReduce goes even further. Aster can actually do joins of Twitter data, Facebook data, and consumer history to correlate patterns across them. In English, that means Aster can look at social data and tell you when a consumer is gonna jump to the competition.”

--Willy: “Wow.  So your aging old databases are really some kind of modern social network movie and The Matrix all in one!”

--Dan: “Slow down Willy -- you’re scaring me. Let’s stick to the facts. I’ve got one more topic and I’ll let you go. Now look in your blog where it says ‘Data warehouses are inflexible, resistant to change.’ Willy, I’m going to agree with you on this one, but it’s not what you think. It’s not a technical problem, it’s a people problem. It affects all the database vendors. Some DBAs have become data jailers, keeping the data locked up. And some BI governance committees have gone too far. Business users could get more of what they want if they adopted Agile development methodologies. Plus, Teradata built something called Data Lab that’s really flexible for A/B testing and new ‘what if’ ideas. Companies can prototype something in a few weeks and promote it into production in another couple of weeks. It helps, but we still need people to adopt the Agile methodology.3 There’s nothing wrong with the database software. We could use your help getting the message out on Agile.”

--Willy:  “Hmmm.  Well, I wish I’d known all that a few weeks ago. How was I supposed to know?”

--Dan: “Willy, it’s my job to run and fetch whatever you need. Anytime you want to mention Teradata, just call and I’ll be working for you. Seriously, Mike-O pumps Barry Manilow music into the office for 3 hours anytime he spots ‘jive’ from one of you bloggerazzi guys.”

--Willy: “Oh my god – that’s torture.  Look, I’ll call you next time if you can point me to the analyst stuff that makes me look smart.  Thanks Dan.”

--Dan:  “I’ll be glad to help, Willy.  Three hours of ‘Oh Mandy’ is highly motivational stuff.”
Later.

--Mike-O, PR JD: “Well, will he ever learn?”

--Dan: “I don’t know.  I don’t know.”


1 - Ventana, Tony Cosentino, http://tonycosentino.ventanaresearch.com/2013/05/03/teradata-addresses-the-foundation-of-big-data-analytics

2 - Gartner, IT Market Clock for Database Management Systems, 2012, September 2012

3 - TDWI, Benefits of Agile Data Warehousing: A Real-World Story, July 2013

Big Insights from Big Analytics Roadshow

Posted on: January 25th, 2013 by Teradata Aster

Last month in New York we completed the 4th and final event in the Big Analytics 2012 roadshow. This series of events shared ideas on practical ways to address the big data challenge in organizations and change the conversation from “technology” to “business value”. In New York alone, 500 people attended from across both business and IT, and we closed out the event with two speaker panels. The data science panel was, in my opinion, one of the most engaging and interesting panels I’ve ever seen at an event like this. The topic was whether organizations really need a data scientist (and what’s different about the skill set from other analytic professionals). Mike Gualtieri from Forrester Research did a great job leading & prodding the discussion.

Overall, these events were a great way to learn and network. They featured great speakers from cutting-edge companies, universities, and industry thought leaders, including LinkedIn, DJ Patil, Barnes & Noble, Razorfish, Gilt Groupe, eBay, Mike Gualtieri from Forrester Research, Wayne Eckerson, and Mohan Sawhney from the Kellogg School of Management.

As an aside, I’ve long observed a historic disconnect between marketing groups and the IT organizations and data warehouses that support them. I noticed this first when I worked at Business Objects, where very few reporting applications ever included Web clickstream data. The marketing department always used a separate tool or application like Web Side Story (now part of Adobe) to handle this. A bridge is being built to connect these worlds – both in terms of technology that can handle web clickstream and other customer interaction data, and in terms of new analytic techniques that make it easier for marketing/business analysts to understand their customers more intimately and better serve them a relevant experience.

We ran a survey at the events, and I wanted to share some top takeaways. The events were split into business and technical tracks, with themes of “data science” and “digital marketing”, so we can compare the responses of attendees who were more interested in the technical content with those drawn to the business content. The survey data includes responses from 507 people in San Francisco, 322 in Boston, 441 in Chicago, and 894 in New York City, for a total of 2,164 respondents.

You can get the full set of graphs here, but here are a couple of my own observations / conclusions in looking at the data:

1) “Who is talking about big data analytics in your organization?” - IT and Marketing were by far the largest responses, with nearly 60% of IT organizations and 43% of marketing departments talking about it. New York had slightly higher numbers of CIOs and CEOs talking about big data, at 23% and 21%, respectively.

Survey Data: Figure 1

2) “Where is big data analytics in your company?” - Across all cities, “customer interactions in Web/social/mobile” was the biggest area of big data analytics, at 62%. With all the hype around machine/sensor data, it was, surprisingly, being discussed in only 20% of organizations. Since web servers and mobile devices are machines, it would have been interesting to see how the “machine-generated data” responses would have fared if we had taken the more specific example of customer interactions away.

Survey Data: Figure 2

3) This chart is a more detailed breakdown of the areas where big data analytics is found, broken down by city. NYC has a few more “other” responses. Some of the “other” answers in NYC included:

  1. Claims
  2. Client Data Cloud
  3. Development, and Data Center Systems
  4. Customer Solutions
  5. Data Protection
  6. Education
  7. Financial Transaction
  8. Healthcare data
  9. Investment Research
  10. Market Data
  11.  Predictive Analytics (sales and servicing)
  12. Research
  13. Risk management /analytics
  14. Security

Survey Data: Figure 3

4) “What are the greatest big analytics application opportunities for businesses today?” - On average, general “data discovery or data science” was highest at 72%, with “digital marketing optimization” second at just under 60% of respondents. In New York, “fraud detection and prevention”, at 39%, was slightly higher than in other cities, perhaps tied to the number of financial institutions in attendance.

Survey Data: Figure 4

In summary, there are lots of applications for big data analytics, but having a discovery platform that supports iterative exploration of ALL types of data, and that serves both business/marketing analysts and savvy data scientists, is important. The divide between business groups like marketing and IT is closing. Marketers are becoming more technically savvy and are among the most demanding customers for analytic solutions that can harness the deluge of customer interaction data. They need to partner closely with IT to architect the right solutions for “big analytics” and to provide the right toolsets for self-service access to this information without always requiring developer or IT support.

We are planning to sponsor the Big Analytics roadshow again in 2013 and take it international, as well. If you attended the event and have feedback or requests for topics, please let us know. I hear that there will be a “call for papers” going out soon. You can view the speaker bios & presentations from the Big Analytics 2012 events for ideas.

2 months & 10 questions on new Aster Big Analytics Appliance

Posted on: December 18th, 2012 by Teradata Aster

It’s been about two months since Teradata launched the Aster Big Analytics Appliance, and since then we have had the opportunity to showcase the appliance to various customers, prospects, partners, analysts, and journalists. We are pleased to report that since the launch the appliance has already received the “Ventana Big Data Technology of the Year” award and has been well received by industry experts and customers alike.

Over the past two months, starting with the launch tweetchat, we have received numerous inquiries about the appliance, and we think now is a good time to answer the top 10 most frequently asked questions about the new Teradata Aster offering. Without further ado, here are the top 10 questions and their answers:

WHAT IS THE TERADATA ASTER BIG ANALYTICS APPLIANCE?

The Aster Big Analytics Appliance is a powerful, ready-to-run platform that is pre-configured and optimized specifically for big data storage and analysis. A purpose-built, integrated hardware and software solution for analytics at big data scale, the appliance runs Teradata Aster’s patented SQL-MapReduce® and SQL-H technology on a time-tested, fully supported Teradata hardware platform. Depending on workload needs, it can be configured exclusively with Aster nodes, exclusively with Hortonworks Data Platform (HDP) Hadoop nodes, or with a mixture of Aster and Hadoop nodes. Additionally, integrated backup nodes are available for data protection and high availability.

WHO WILL BENEFIT MOST BY DEPLOYING THE APPLIANCE?

The appliance is designed for organizations looking for a turnkey integrated hardware and software solution to store, manage, and analyze structured and unstructured data (i.e., multi-structured data formats). The appliance meets the needs of both departmental and enterprise-wide buyers and can scale linearly to support massive data volumes.

WHY DO I NEED THIS APPLIANCE?

This appliance can help you gain valuable insights from all of your multi-structured data. Using these insights, you can optimize business processes to reduce cost and better serve your customers. More importantly, these insights can help you innovate by identifying new markets, new products, new business models etc. For example, by using the appliance a telecommunications company can analyze multi-structured customer interaction data across multiple channels such as web, call center and retail stores to identify the path customers take to churn. This insight can be used proactively to increase customer retention and improve customer satisfaction.

WHAT’S UNIQUE ABOUT THE APPLIANCE?

The appliance is an industry first in tightly integrating SQL-MapReduce®, SQL-H, and Apache Hadoop: a tightly integrated hardware and software solution to store, manage, and analyze big data. It delivers integrated interfaces for analytics and administration, so all types of multi-structured data can be quickly and easily analyzed through SQL-based interfaces. This means that you can continue to use your favorite BI tools and all existing skill sets while deploying new data management and analytics technologies like Hadoop and MapReduce. Furthermore, the appliance delivers enterprise-class reliability, allowing technologies like Hadoop to be used for mission-critical applications with stringent SLA requirements.

WHY DID TERADATA BRING ASTER & HADOOP TOGETHER?

With the Aster Big Analytics Appliance, we are not just putting Aster and Hadoop in the same box. The Aster Big Analytics Appliance is the industry’s first unified big analytics appliance, providing a powerful, ready to run big analytics and discovery platform that is pre-configured and optimized specifically for big data analysis. It provides intrinsic integration between the Aster Database and Apache Hadoop, and we believe that customers will benefit the most by having these two systems in the same appliance.

Teradata’s vision stems from the Unified Data Architecture. The Aster Big Analytics Appliance offers customers the flexibility to configure the appliance to meet their needs. Hadoop is best for capturing, storing, and refining multi-structured data in batch, whereas Aster is a big analytics and discovery platform that helps derive new insights from all types of data. Depending on the customer’s needs, the appliance can be configured with all Aster nodes, all Hadoop nodes, or a mix of the two.

WHAT SKILLS DO I NEED TO DEPLOY THE APPLIANCE?

The Aster Big Analytics Appliance is an integrated hardware and software solution for big data analytics, storage, and management. It is designed as a plug-and-play solution that does not require special skill sets.

DOES THE APPLIANCE MAKE DATA SCIENTISTS OR DATA ANALYSTS IRRELEVANT?

Absolutely not. By integrating the hardware and software in an easy to use solution and providing easy to use interfaces for administration and analytics, the appliance allows data scientists to spend more time analyzing data.

In fact, with this simplified solution, your data scientists and analysts are freed from the constraints of data storage and management and can now spend their time on value-added insight generation that ultimately leads to a greater fulfillment of your organization’s end goals.

HOW IS THE APPLIANCE PRICED?

Teradata doesn’t disclose product pricing as part of its standard business operating procedures. However, independent research conducted by industry analyst Dr. Richard Hackathorn, president and founder of Bolder Technology Inc., confirms that on a TCO and time-to-value basis the appliance presents a more attractive option than commonly available do-it-yourself solutions. http://teradata.com/News-Releases/2012/Teradata-Big-Analytics-Appliance-Enables-New-Business-Insights-on--All-Enterprise-Data/

WHAT OTHER ASTER DEPLOYMENT OPTIONS ARE AVAILABLE?

Besides deploying via the appliance, customers can also acquire and deploy Aster as a software-only solution on commodity hardware or in a public cloud.

WHERE CAN I GET MORE INFORMATION?

You can learn more about the Big Analytics Appliance via http://asterdata.com/big-analytics-appliance/  – home to release information, news about the appliance, product info (data sheet, solution brief, demo) and Aster Express tutorials.


Join the conversation on Twitter for additional Q&A with our experts:

Manan Goel @manangoel | Teradata Aster @asterdata


For additional information please contact Teradata at http://www.teradata.com/contact-us/

Santa Claus and Data Scientists

Posted on: December 3rd, 2012 by Teradata Aster

Who do you believe in more – Santa Claus or Data Scientists? That’s the debate we’re having in New York City on Dec 12th at Big Analytics 2012. Because the event is sold out, this panel discussion will be simulcast live to dig a little deeper behind the hype.

Some believe that data scientists are a new breed of analytic professional that merges mathematics, statistics, programming, visualization, and systems operations (and perhaps a little quantum mechanics and string theory for good measure) all in one. Others say that data scientists are simply data analysts who live in California.

Whatever you believe, the skills gap for “data scientists” and analytic professionals is real and not expected to close until 2018. Businesses see the light in terms of data-driven competitive advantage, but are they willing to fork out $300,000/yr for a person who can do data science magic? That’s what CIO Journal is reporting, with the guidance that “CIOs need to make sure that they are hiring for these positions to solve legitimate business problems, and not just because everyone else is doing it too”.

Universities like Northwestern University have built programs and degrees in analytics to help close the gap. Technology vendors are bridging the gap to make new analytic techniques on big data tenable to a broader set of analysts in mainstream organizations. But is data science really new? What are businesses doing to unlock and monetize new insights? What skills do you need to be a “data scientist”? How can you close the gap? What should you be paying attention to?

Mike Gualtieri from Forrester Research will be moderating a panel to answer these questions and more with:

  • Geoff Guerdat, Director of Data Architecture, Gilt Groupe
  • Bill Franks, Chief Analytics Officer, Teradata
  • Bernard Blais, SAS
  • Jim Walker, Director of Product Marketing, Hortonworks


Join the discussion at 3:30 EST on Dec 12th where you can ask questions and follow the discussion thread on Twitter with #BARS12, or follow along on TweetChat at: http://tweetchat.com/room/BARS12

... it certainly beats sitting up all night with milk and cookies looking out for Santa!

Shift the Economics of Your Data – Big Analytics 2012

Posted on: May 21st, 2012 by Teradata Aster

The conversation around “big data” has been evolving beyond a technology discussion to focus on analytics and applications to the business. As such, we’ve worked with our partners and customers to expand the scope of the Big Data Summit events we initiated back in 2009 and have created Big Analytics 2012 - a new series of roadshow events kicking off in San Francisco on April 19, 2012.

According to previous attendees and market surveys, the greatest big data application opportunities in businesses are:

- Digital marketing applications such as multi-channel analytics and testing to better understand and engage your customers

- Using data science and analytics to explore and develop new markets or data-driven services

Companies like LinkedIn, Edmodo, eBay, and others have effectively applied data science and analytics to take advantage of the new economics of data. And they are ready to share details of what they have learned along the way.

Big Analytics 2012 is a half-day event, is absolutely free to attend, and will include insight from industry insiders in two different tracks: Digital Marketing Optimization, and Data Science and Analytics. It is a great way to meet and hear from your peers, such as:

- executives who want to learn more about leveraging advanced analytics for competitive advantage

- interactive marketing innovators who want access to "game changing" insights for digital marketing optimization

- enterprise architects and business intelligence professionals looking to provide big data infrastructure

- data scientists and business analysts who are responsible for developing new data-driven products or business insights

Come to learn from the panel of experts and stay for an evening networking reception that will put you in touch with big data and analytics professionals from throughout the industry. Big Analytics 2012 will be coming soon to a city near you. Click here to learn more about the event and to register now.