Posts Tagged ‘MapReduce’

BigData – A primer

August 29, 2012 Leave a comment

“Big Data” is not new to any of us. We see that everyday, every moment. We are contributing to Big Data every minute and are making it Bigger.

Yes, right now, before you ended up on this post, I’m sure enough you’ve clicked a couple of links that would have updated your IP address, geographic location, the website you were browsing on, and several other details on a server.

And your smart phone might have been conversing with it’s manufacturer about the current version of O.S or a recent crash report or updates for existing applications. So hundreds of millions of users like you and me, that means a vast amount of data generated every hour.

A Boeing 737 generates 240TB of data in a single cross country flight  – the speed/velocity of data generation is very high.

So what is happening to all the information being generated at this rate? This could easily measure a few hundreds of gigabytes to a few tera bytes or even peta bytes.

So, Big Data technologies are all about answering the following two questions

  1. How and where do you capture all this data?
  2. How do you organize and make meaningful business decisions based on this data?

Capturing Big Data

So the data being fed by sources like vast number of surveillance cameras, microphones, a wide variety of sensors, mobile phones, Internet Click Streams, tweets, facebook messages is actually, not structured. The velocity of this data demands very high write performance from the datastore – So much so that the ACID properties promised by RDBMS themselves could become a bottleneck for performance. A poor write performance means inability to capture the data as it comes. 

Availability of the store at all times necessitates that the data be distributed on multiple servers, and that brings in the problems of replication and consistency.

Simple Key Value stores have evolved in the recent times that work without conforming to ACID principles. These stores accept an application defined “Key” and some “Value” and persist the record as per a preset configuration. They offer a variety of SLAs for Replication, Consistency and speed of access. These are also called NoSQL databases.

NoSQL databases are Distributed Hash Tables that store “items” indexed by ‘keys”.

As per the CAP theorem, in a distributed environment it is impossible to guarantee all three of Consistency, Availability and Partition tolerance, you need to sacrifice one of them. All NoSQL databases are built to be operated in distributed environments (although they can be operated on lone hosts). They are optimized for very high write performance by conforming to BASE properties (RDBMS follow ACID properties, we all know that). 

BASE – Basically Available Soft-state Eventual consistency

Data from the application need not be normalized to multiple tables (as with RDBMS), so an object is written or read in one shot, into a single Key-Value table. All the data for an object resides at one place, and is co-located on the disk. So it is a sequential read which means very high disk through put.

Key Value stores are classified into four types based on the type of value they store.

  1. Simple KeyValue stores (Amazon Dynamo)
  2. Graph DB (Flock DB)
  3. Column families (Cassandra)
  4. Document databases (Mongo DB)

Making sense out of Big Data

Map Reduce is a programming model to process large data sets.

The idea is to send the code to where the data resides, because we are talking of large data sets and moving them around could be expensive and time consuming.


In this step, the actual problem is divided into sub-problems and are assigned to worker nodes (typically where the data resides).


All the results will be gathered from the worker nodes and will be merged to produce the final result.

Example: Apache Hadoop