I was at Strata/Hadoop World NYC this week, and man, was it a lot of fun. There were many amazing speakers and technologies, and it’s great to see how the Big Data (and especially Hadoop) ecosystem is growing. In particular, I noticed a significant number of attendees from Europe this year, something that was not the case at Strata 2012.
Anyway, one of the technologies that I was most impressed with (and kind of ashamed I didn’t look at earlier…) is Parquet, an optimized columnar data format compatible with most of the Hadoop stack. It was developed jointly by Twitter and Cloudera with contributions from Criteo, and it looks awesome.
Now that the 3-day conference is over, I thought I’d give Parquet a spin and see how it can be used for Hive queries and how much it improves performance on some toy problems. For these experiments I’ve been using the Cloudera quickstart VM with CDH 4.3.