Using Hive with Parquet in CDH 4.3

I was at Strata/Hadoop World NYC this week, and man, was it a lot of fun. There were many great speakers and technologies, and it’s amazing to see how the Big Data (and especially Hadoop) ecosystem is growing. This year in particular, I noticed a significant number of attendees from Europe, something that was not the case at Strata 2012.

Anyway, one of the technologies that I was most impressed with (and kind of ashamed I didn’t look at earlier…) is Parquet, an optimized columnar data format compatible with most of the Hadoop stack. It was developed jointly by Twitter and Cloudera with contributions from Criteo, and it looks awesome.

Now that the 3-day conference is over, I thought I’d give Parquet a spin and see how it can be used for Hive queries and how much it improves performance on some toy problems. For these experiments I’ve been using the Cloudera quickstart VM with CDH 4.3.

Read on →

First post

Okay, so I’ve been meaning to create a website/blog for a very long time now, but kept pushing it back. With my recent introduction to GitHub Pages, I decided that the time was perfect to get my hands dirty, and after a couple of hours of work, this website emerged, based on the amazing oscailte template for Octopress.

Here are some of the things I’m planning to cover in this blog:

  • Thoughts about Big Data issues that I’ve encountered, and hopefully solutions.
  • Some of the new technologies or coding tricks that amaze me.
  • Anything related in some way to the field of Data Science.