In my previous blog I introduced Part 1 of my blog series on Hadoop and covered the HDFS component. As the building block to the rest of Hadoop, HDFS plays an important role in the storage of data within Hadoop. In this blog I’m now going to cover the terms Pig and Hive.
Whenever I mention Hadoop in conversations I often say the words pig and hive and always look at the person’s face to see their reaction. Most often the look on their face is one of bewilderment as the terms conjure up thoughts of something else.
However, Pig and Hive are two very important components of the Hadoop environment: they work on top of HDFS via the MapReduce framework and provide us with the interface to mine the data contained within HDFS.
When I speak to customers on the topic of Hadoop I always make the comment that it is very easy to get data into Hadoop, but hard to extract value from the data once it’s in there. By hard I mean that if you want users to interact with the data in HDFS then they will need to learn how to script in either Pig or Hive. It’s not particularly hard to learn, but it’s yet another skill that your users will need to have. Skills in the market are still thin on the ground, so you’ll need to look to re-skilling your existing users instead.
This blog is intended to provide a high-level overview of the two capabilities within Hadoop. If you are looking for a more in-depth comparison between the two, I highly recommend the article by Alan Gates from Yahoo.
What is Pig?
Pig is a scripting platform that allows users to write MapReduce operations using a scripting language called Pig Latin. Pig Latin is a data flow language, whereas SQL is a declarative language. SQL is great for asking a question of your data, while Pig Latin allows you to write a data flow that describes how your data will be transformed. The types of operations it is used for are therefore filtering, transforming, joining and writing data. These operations are exactly what MapReduce was intended for.
The Pig platform itself takes the Pig Latin script and transforms it into a MapReduce job that is then executed against a dataset. It is designed for running batch operations against large data sets. The types of use cases it is ideal for are therefore:
- ETL of data within Hadoop
- Iterative data processing
- Initial research on raw data sets
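To make the data flow idea a little more concrete, here is a minimal sketch in plain Python. It is not Pig Latin and it does not use Pig or Hadoop at all; it only mimics the filter, join and write steps mentioned above as a sequence of transformations. The records, field names and function names are all made up for illustration.

```python
# Hypothetical in-memory records standing in for files stored in HDFS.
visits = [
    {"user_id": 1, "page": "/home", "ms": 120},
    {"user_id": 2, "page": "/cart", "ms": 900},
    {"user_id": 1, "page": "/cart", "ms": 450},
    {"user_id": 3, "page": "/home", "ms": 60},
]
users = [
    {"user_id": 1, "country": "AU"},
    {"user_id": 2, "country": "NZ"},
]


def filter_slow(rows, min_ms):
    """Step 1 (filter): keep rows at or above a response-time threshold."""
    return [r for r in rows if r["ms"] >= min_ms]


def join_users(rows, user_rows):
    """Step 2 (join): inner join of visits to users on user_id."""
    by_id = {u["user_id"]: u for u in user_rows}
    return [
        {**r, "country": by_id[r["user_id"]]["country"]}
        for r in rows
        if r["user_id"] in by_id
    ]


def to_lines(rows):
    """Step 3 (write): turn each row into a tab-separated output line."""
    return [
        "\t".join([str(r["user_id"]), r["country"], r["page"], str(r["ms"])])
        for r in rows
    ]


def run_flow():
    """Run the steps in order, each one feeding the next."""
    slow = filter_slow(visits, 100)
    joined = join_users(slow, users)
    return to_lines(joined)


if __name__ == "__main__":
    for line in run_flow():
        print(line)
```

The point of the sketch is the shape rather than the detail: you describe each step of the transformation in order, instead of stating the final answer you want and leaving the steps to the engine.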
What is Hive?
Although Pig is very powerful and useful, it still requires you to master a new language. To overcome this barrier, the smart cookies at Facebook developed Hive, which allows people familiar with SQL (Structured Query Language) to write HQL (Hive Query Language) statements. An HQL statement is read by the Hive service and then transformed into a MapReduce job. This approach makes it quick and easy for people who already know SQL syntax to start writing Hive queries. There are a few caveats, however, and these include:
- HQL is not a full replica of SQL. You therefore need to be aware of the things you typically do in SQL that HQL cannot do.
- Hive is not suited to the simple, quick transactional statements that SQL databases handle well. Keep in mind that HQL is transformed into a MapReduce job which is then executed against a large dataset, so don’t expect blazingly fast response times, as MapReduce is not intended for this purpose.
- Hive handles read-based queries only, not write operations, so forget about updates and deletes in Hive. Such operations may become possible in the future, however.
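One rough way to picture the translation step is to take a simple grouped count, which in HQL might look like `SELECT page, COUNT(*) FROM visits GROUP BY page`, and break it into map, shuffle and reduce phases. The sketch below does this in plain Python. It does not use Hive or Hadoop, it is not how Hive actually generates its jobs, and the `visits` table, its `page` column and all function names are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical rows of a "visits" table.
rows = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]


def map_phase(table_rows):
    """Map: emit a (key, 1) pair for every row, keyed on the GROUP BY column."""
    for row in table_rows:
        yield row["page"], 1


def shuffle(pairs):
    """Shuffle: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values for each key, which gives COUNT(*) per group."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_page(table_rows):
    """Chain the three phases, as a stand-in for one MapReduce job."""
    return reduce_phase(shuffle(map_phase(table_rows)))


if __name__ == "__main__":
    print(count_by_page(rows))
```

Even a tiny query goes through every phase, which helps explain why a quick lookup that a relational database answers instantly can feel slow when it runs as a MapReduce job.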
In summary, both Pig and Hive scripts get converted to MapReduce jobs at the end of the day, so the two can often be used interchangeably, although each suits particular purposes. The following table lists some particular functions and comments on both Pig and Hive.
If we look at the high-level architecture of Pig and Hive and their position in the overall Hadoop environment, we can see how the two components interact with MapReduce to eventually get access to the data.
So the choice is up to you and what you are most comfortable with. The openness of Hadoop really gives you choice and flexibility when it comes to deciding what tool to use. If you are from the SQL world then you’ll find Hive the easiest to get used to. However, if you are competent in the Python language you’ll probably find Pig the most applicable. Keep in mind the limitations of both and you’ll be on your way to developing applications that extract value from the data in Hadoop in no time at all!
Ben Davis is a Senior Architect for Teradata Australia based in Canberra. With 18 years of experience in consulting, sales and technical data management roles, he has worked with some of the largest Australian organisations in developing comprehensive data management strategies. He holds a Degree in Law and a postgraduate Masters in Business and Technology, and is currently finishing his PhD in Information Technology with a thesis on executing large scale algorithms within cloud environments.