Archive for the ‘Big Data’ Category

Big Data: NoSQL & the DBA

January 11, 2013 1 comment

Of late, “Big Data” has become one of the hottest topics in the worlds of business and computing.

I’ve done a presentation recently in an internal conference about Big Data, related technologies and how today’s Database Administrators are affected by it. Here’s a background/outline of the talk followed by the presentation itself (slimmed down version).


Customer feedback is very important for any business with out which there cannot be any improvement in the products/services it offers. So, over time, companies have followed several different approaches like in person surveys, random phone calls to customers to get the feedback.

Today, connected devices like mobile phones, tablet PCs, social networks with hundreds of millions of users and various sensors are generating enormous amounts of data (of the orders of hundreds of terabytes to a few petabytes), and at exceptional speeds (A single jet engine on the Boeing 737 generates as much as 240TB of data in a cross-country flight).

So businesses are calling this large volumes of data as “Big Data” and is learning to leverage it for surveys and feedbacks. Therefore it has become highly crucial for the success of any business.

Traditional database softwares are proving to be inefficient to capture and store this avalanche of data. With a little out-of-the-box thinking, the open source community has come up with several systems that can acquire, manage, and process such large volumes. Organizations must make a wise decision in picking a system.

Big Data is also creating several new roles and early birds can make a fortune. 🙂

In this talk, I discussed various technologies available, their working principles and the opportunities available for today’s DBA to explore in this new world of Big Data.

The presentation

Click here to view this presentation on




BigData – A primer

August 29, 2012 Leave a comment

“Big Data” is not new to any of us. We see that everyday, every moment. We are contributing to Big Data every minute and are making it Bigger.

Yes, right now, before you ended up on this post, I’m sure enough you’ve clicked a couple of links that would have updated your IP address, geographic location, the website you were browsing on, and several other details on a server.

And your smart phone might have been conversing with it’s manufacturer about the current version of O.S or a recent crash report or updates for existing applications. So hundreds of millions of users like you and me, that means a vast amount of data generated every hour.

A Boeing 737 generates 240TB of data in a single cross country flight  – the speed/velocity of data generation is very high.

So what is happening to all the information being generated at this rate? This could easily measure a few hundreds of gigabytes to a few tera bytes or even peta bytes.

So, Big Data technologies are all about answering the following two questions

  1. How and where do you capture all this data?
  2. How do you organize and make meaningful business decisions based on this data?

Capturing Big Data

So the data being fed by sources like vast number of surveillance cameras, microphones, a wide variety of sensors, mobile phones, Internet Click Streams, tweets, facebook messages is actually, not structured. The velocity of this data demands very high write performance from the datastore – So much so that the ACID properties promised by RDBMS themselves could become a bottleneck for performance. A poor write performance means inability to capture the data as it comes. 

Availability of the store at all times necessitates that the data be distributed on multiple servers, and that brings in the problems of replication and consistency.

Simple Key Value stores have evolved in the recent times that work without conforming to ACID principles. These stores accept an application defined “Key” and some “Value” and persist the record as per a preset configuration. They offer a variety of SLAs for Replication, Consistency and speed of access. These are also called NoSQL databases.

NoSQL databases are Distributed Hash Tables that store “items” indexed by ‘keys”.

As per the CAP theorem, in a distributed environment it is impossible to guarantee all three of Consistency, Availability and Partition tolerance, you need to sacrifice one of them. All NoSQL databases are built to be operated in distributed environments (although they can be operated on lone hosts). They are optimized for very high write performance by conforming to BASE properties (RDBMS follow ACID properties, we all know that). 

BASE – Basically Available Soft-state Eventual consistency

Data from the application need not be normalized to multiple tables (as with RDBMS), so an object is written or read in one shot, into a single Key-Value table. All the data for an object resides at one place, and is co-located on the disk. So it is a sequential read which means very high disk through put.

Key Value stores are classified into four types based on the type of value they store.

  1. Simple KeyValue stores (Amazon Dynamo)
  2. Graph DB (Flock DB)
  3. Column families (Cassandra)
  4. Document databases (Mongo DB)

Making sense out of Big Data

Map Reduce is a programming model to process large data sets.

The idea is to send the code to where the data resides, because we are talking of large data sets and moving them around could be expensive and time consuming.


In this step, the actual problem is divided into sub-problems and are assigned to worker nodes (typically where the data resides).


All the results will be gathered from the worker nodes and will be merged to produce the final result.

Example: Apache Hadoop