Addressing a few questions a reader had about Google’s BigData offering, BigQuery:


After reading “First Impression: Google’s BigData offering called BigQuery”, a reader (Shadab Shah) had a few questions about it, and in this blog post I am going to address them:

Q1. Are there any browser-based tools to query data in BigQuery?

A1: They have a browser-based tool, which they call the “BigQuery Browser Tool”, that you can use to query data.

Apart from the browser-based tool, there are other options too:

1) A command-line tool called the “bq command-line tool”. You can find more information here: https://developers.google.com/bigquery/

2) An API. You can “include” big data analytics capabilities in a web app via a RESTful API. (Point #2 content credit: Michael Manoochehri’s comment)

Q2) Where is the Data Stored? 

If I just say “Google Cloud”, that would not be a complete answer. There’s a complementary service called “Google Cloud SQL”, and I do not want you to confuse the data stored for BigQuery with Google Cloud SQL. There’s a difference between BigQuery and Google Cloud SQL; you can read about it here: https://developers.google.com/bigquery/docs/overview

Having said that, it’s stored on Google’s cloud. If you wish to use BigQuery, you’ll have to upload your data-set in CSV format; once you do, it’s stored in Google’s cloud and is ready to be analyzed via BigQuery.

Q3) Where do I find lots of data to play with BigQuery?

Google has a few sample data-sets that you can play with:

bigquery sample data

That’s about it for this post. Thanks to Shadab Shah for the questions; I hope this post is useful.


First Impression: Google’s BigData offering called BigQuery


As part of an assignment for the University of Washington’s (UW) cloud class, I played with Google’s BigData offering, BigQuery, and I am writing this blog post to share what I think about it. Please note that the views are my own and do not represent those of the instructor or my fellow students at UW. Also, I am not a BigData “expert”; think of me as a student trying to get my head around the various offerings out there. So if you feel otherwise about what I have written, just let me know in the comments section. Anyhow, read along to know what I think of BigQuery:

First up what is BigQuery?

It’s a platform to analyze your data (lots of it) by running SQL-like queries. And it really is SQL-like, so if you are from the SQL world like me, you should have no trouble getting up and running quickly by referring to the nicely written documentation.

Another point to consider here is that even though it’s SQL-like, you’ll be able to analyze a considerable number of rows in a few seconds. Let me give you an example: I played with a sample (called gsod) which had 115M rows, and in my experiments I was able to get answers to simple computations like max, mean, avg, etc. in less than a couple of seconds, and slightly more complex queries with WHERE, joins and GROUP BY in around 5-6 seconds. Your results may vary depending on the type of query you run, but the bottom line is that it is FAST. That’s good news!
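To give a feel for the kind of simple aggregate query I mean, here is a minimal sketch in Python that only builds the query text and the matching bq command line; it does not contact BigQuery. The table reference `[publicdata:samples.gsod]` and the column `mean_temp` are my assumptions about how the gsod sample is exposed, so check them against the sample’s schema before running anything. The helper names are made up for this illustration.

```python
import shlex

# Assumed names, not taken from the post: verify against the sample's schema.
TABLE = "[publicdata:samples.gsod]"
COLUMN = "mean_temp"


def build_aggregate_query(table=TABLE, column=COLUMN):
    """Return a SQL-like query computing the max and average of one column."""
    return (
        f"SELECT MAX({column}) AS max_val, "
        f"AVG({column}) AS avg_val "
        f"FROM {table}"
    )


def build_bq_command(query):
    """Return the argument list for running the query with the bq tool."""
    return ["bq", "query", query]


if __name__ == "__main__":
    query = build_aggregate_query()
    # Print a shell-quoted command that could be pasted into a terminal.
    print(shlex.join(build_bq_command(query)))
```

Pasting the printed command into a shell where the bq tool is installed and authenticated should run the query; the same query text can also be pasted into the browser tool.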

BigQuery is Fast!

But what bothers me is this: how am I supposed to “upload” lots of data to Google’s cloud? It takes time, right? I guess that’s an issue with every cloud-based BigData offering. But here’s what I am thinking: if your data is already on a cloud, e.g. Amazon’s or Microsoft’s, does it not make sense to run analytics on Amazon’s or Microsoft’s cloud instead of porting your data to Google’s?

[Sidenote: I like it that Hadoop on Azure allows Amazon S3 data source. Nice move!]

My concern: the time spent uploading a truckload of data to Google’s cloud just so that we can use it with BigQuery.

And even if you have your data in the GAE datastore, you’ll have to upload your data to BigQuery separately. Source

Zooming out for a moment, I feel the goal of BigQuery was to offer an easy-to-use BigData platform, and I feel that’s what they have delivered:

An easy-to-use + easy-to-setup “Hadoop+Hive”-like offering.

[Update: Aug 20th 2012: I have been thinking about it more, and I realized that BigQuery is more about satisfying real-time Big Data scenarios, while Hadoop/Hive/MapReduce is more about batch-oriented analysis, which is great if you need to pre-process tons and tons of data.]

But this “easiness” means that it is NOT as advanced as a Hadoop installation (or Hadoop on Azure, or Amazon’s Elastic MapReduce). Then again, it’s easier and faster to get started with BigQuery. I guess it just depends on what you are trying to achieve, and based on that you’ll have to figure out which is the right tool for your scenario. No generic answer here, sorry!

And BTW, BigQuery supports only CSV. Talk about Variety (one of the V’s of BigData!). Let’s not get into that. I just wanted to point it out because if you’re looking to analyze data-sets that cannot be converted to CSV for running SQL-like queries on top of them, then BigQuery is not for you.

Conclusion:

Try out BigQuery. It’s easy to get started. It’s powerful if SQL-like queries are all you need to analyze your data. If you are a BigData enthusiast/expert/student, it’ll be a nice exercise to mentally compare other BigData offerings with BigQuery.

If you decide to try BigQuery or have already tried it out, I’d love to hear what you think of it. Please leave a comment!

UPDATE (based on Michael Manoochehri’s comment): I didn’t mean to imply that it is prohibitively expensive to upload data to BigQuery, because I know it’s NOT! Here is the result that Michael Manoochehri shared: “As a test I once ingested about 350 Gb of CSV data (split into 10gb raw files, then I gzipped each one into ~1Gb). I ingested the entire batch using the bq command line tool, and had the entire dataset in BigQuery in just a few hours. I agree that it’s not 100% trivial to move 300 Gb of data from a local cluster into Google’s cloud – but it’s not really that difficult.”
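The “split, then gzip” preparation step Michael describes could look roughly like the Python sketch below. This is my own illustration, not his script: it splits by row count rather than by file size to keep the code short, it assumes the source CSV has a header row and repeats that header in every chunk, and the function name and the `part-00000.csv.gz` naming scheme are made up.

```python
import csv
import gzip
import os


def split_and_gzip_csv(src_path, out_dir, rows_per_chunk):
    """Split a CSV into gzipped chunks, repeating the header in each chunk.

    Returns the list of chunk paths in the order they were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:
            return chunk_paths  # empty input file, nothing to write
        out = None
        writer = None
        rows_in_chunk = 0
        for row in reader:
            # Start a new chunk on the first row or when the current one is full.
            if writer is None or rows_in_chunk >= rows_per_chunk:
                if out is not None:
                    out.close()
                name = f"part-{len(chunk_paths):05d}.csv.gz"
                path = os.path.join(out_dir, name)
                out = gzip.open(path, "wt", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                chunk_paths.append(path)
                rows_in_chunk = 0
            writer.writerow(row)
            rows_in_chunk += 1
        if out is not None:
            out.close()
    return chunk_paths
```

Each resulting chunk can then be loaded separately with the bq command-line tool. If you load all chunks into one table, remember that every chunk carries its own header row under the assumption above, so the header has to be skipped at load time or left out of the chunks.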

[Update: Aug 20th 2012: If you are interested in the mechanics behind BigQuery, search for “Google Dremel Whitepaper”. It’s an amazing read.]