
Archive for the ‘Distributed computing’ Category

Big Data: NoSQL & the DBA

January 11, 2013 1 comment

Of late, “Big Data” has become one of the hottest topics in the worlds of business and computing.

I recently gave a presentation at an internal conference about Big Data, related technologies, and how they affect today’s Database Administrators. Here is the background and outline of the talk, followed by a slimmed-down version of the presentation itself.

Background/Outline


Customer feedback is very important for any business; without it, there can be no improvement in the products and services it offers. So, over time, companies have tried several approaches to gather feedback, such as in-person surveys and random phone calls to customers.

Today, connected devices like mobile phones and tablet PCs, social networks with hundreds of millions of users, and various sensors are generating enormous amounts of data (on the order of hundreds of terabytes to a few petabytes) at exceptional speeds (a single jet engine on the Boeing 737 generates as much as 240TB of data in a cross-country flight).

Businesses call these large volumes of data “Big Data” and are learning to leverage them for surveys and feedback. Big Data has therefore become crucial to the success of any business.

Traditional database software is proving inefficient at capturing and storing this avalanche of data. With a little out-of-the-box thinking, the open source community has come up with several systems that can acquire, manage, and process such large volumes. Organizations must choose wisely when picking a system.

Big Data is also creating several new roles and early birds can make a fortune. 🙂

In this talk, I discussed the technologies available, how they work, and the opportunities open to today’s DBA in this new world of Big Data.

The presentation


Click here to view this presentation on Slideshare.com

BigData – A primer

August 29, 2012 Leave a comment

“Big Data” is not new to any of us. We see it every day, every moment. We contribute to Big Data every minute and make it bigger.

Yes, right now, before you ended up on this post, I’m fairly sure you clicked a couple of links that recorded your IP address, geographic location, the website you were browsing, and several other details on a server.

And your smartphone might have been conversing with its manufacturer about the current version of the OS, a recent crash report, or updates for existing applications. With hundreds of millions of users like you and me, that adds up to a vast amount of data generated every hour.

A Boeing 737 generates 240TB of data in a single cross-country flight – the speed/velocity of data generation is very high.

So what is happening to all the information being generated at this rate? It could easily measure from a few hundred gigabytes to a few terabytes or even petabytes.

So, Big Data technologies are all about answering the following two questions:

  1. How and where do you capture all this data?
  2. How do you organize and make meaningful business decisions based on this data?

Capturing Big Data

The data fed by sources such as the vast number of surveillance cameras, microphones, sensors of all kinds, mobile phones, Internet click streams, tweets and Facebook messages is not structured. The velocity of this data demands very high write performance from the datastore, so much so that the ACID properties promised by an RDBMS can themselves become a performance bottleneck. Poor write performance means an inability to capture the data as it arrives.

Availability of the store at all times necessitates that the data be distributed on multiple servers, and that brings in the problems of replication and consistency.

Simple key-value stores that work without conforming to ACID principles have evolved in recent years. These stores accept an application-defined “Key” and some “Value” and persist the record according to a preset configuration. They offer a variety of SLAs for replication, consistency and speed of access. These are also called NoSQL databases.

Many NoSQL databases are, in essence, distributed hash tables that store “items” indexed by “keys”.
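To make the "distributed hash table" idea concrete, here is a minimal sketch in Python. It is a toy, not how any particular product works: the class name `TinyKeyValueCluster` is made up for illustration, each "node" is just an in-memory dictionary, and the node is picked by hashing the key and taking the result modulo the node count.

```python
import hashlib


class TinyKeyValueCluster:
    """Toy distributed hash table: every key is owned by exactly one node."""

    def __init__(self, node_count):
        # Assumption: each "node" is a plain dict standing in for a server.
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key):
        # Hash the key to a stable integer, then pick a node by modulo.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)


cluster = TinyKeyValueCluster(node_count=3)
cluster.put("user:42", {"name": "Asha", "city": "Hyderabad"})
print(cluster.get("user:42"))
```

Plain modulo placement is the simplest choice, but it reshuffles most keys whenever a node is added or removed; that is one reason real systems tend to use more elaborate partitioning schemes.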

As per the CAP theorem, a distributed system cannot guarantee all three of Consistency, Availability and Partition tolerance at once; you need to sacrifice one of them. All NoSQL databases are built to be operated in distributed environments (although they can also run on a single host). They are optimized for very high write performance by conforming to BASE properties (RDBMS follow ACID properties, as we all know).

BASE – Basically Available, Soft state, Eventual consistency
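A rough way to picture "eventual consistency" is a store that acknowledges a write as soon as one replica has it and copies it to the other replicas later. The sketch below is only an illustration under that assumption; the class `EventuallyConsistentStore` and its `sync` method are hypothetical names, and a real database would replicate in the background over the network.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy model: writes land on replica 0 first and reach the others later."""

    def __init__(self, replica_count):
        self.replicas = [dict() for _ in range(replica_count)]
        # Updates that have been accepted but not yet applied everywhere.
        self.pending = deque()

    def write(self, key, value):
        # The write is acknowledged after a single replica has stored it.
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica_index):
        # A read may hit a replica that has not seen the latest write yet.
        return self.replicas[replica_index].get(key)

    def sync(self):
        # Stand-in for background replication: drain all pending updates.
        while self.pending:
            index, key, value = self.pending.popleft()
            self.replicas[index][key] = value


store = EventuallyConsistentStore(replica_count=3)
store.write("cart:7", "2 items")
print(store.read("cart:7", replica_index=0))  # 2 items
print(store.read("cart:7", replica_index=2))  # None (stale replica)
store.sync()
print(store.read("cart:7", replica_index=2))  # 2 items
```

The write returns quickly because it does not wait for every replica, which is where the write performance comes from; the price is that a reader can briefly see old data.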

Data from the application need not be normalized into multiple tables (as with an RDBMS), so an object is written or read in one shot, in a single key-value table. All the data for an object resides in one place and is co-located on the disk, so reading it is a sequential read, which means very high disk throughput.
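The difference between the two layouts can be sketched in a few lines of Python. Everything here is made up for illustration (the `customers`, `orders` and `order_items` "tables", the `order:100` key format, and JSON as the value encoding); the point is only that the normalized read has to stitch rows together, while the key-value read is a single lookup.

```python
import json

# Normalized layout: one logical order is spread over three "tables".
customers = {1: {"name": "Asha"}}
orders = {100: {"customer_id": 1}}
order_items = [
    {"order_id": 100, "sku": "A1", "qty": 2},
    {"order_id": 100, "sku": "B7", "qty": 1},
]


def read_order_normalized(order_id):
    # Several lookups plus a scan are needed to rebuild the object.
    order = orders[order_id]
    return {
        "order_id": order_id,
        "customer": customers[order["customer_id"]]["name"],
        "items": [
            {"sku": item["sku"], "qty": item["qty"]}
            for item in order_items
            if item["order_id"] == order_id
        ],
    }


# Key-value layout: the whole object is stored as one value under one key.
kv_table = {"order:100": json.dumps(read_order_normalized(100))}


def read_order_kv(order_id):
    # One lookup returns the complete object.
    return json.loads(kv_table[f"order:{order_id}"])


print(read_order_kv(100))
```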

Key Value stores are classified into four types based on the type of value they store.

  1. Simple key-value stores (Amazon Dynamo)
  2. Graph databases (FlockDB)
  3. Column families (Cassandra)
  4. Document databases (MongoDB)

Making sense out of Big Data

MapReduce is a programming model for processing large data sets.

The idea is to send the code to where the data resides, because we are talking about large data sets, and moving them around could be expensive and time-consuming.

Map:

In this step, the problem is divided into sub-problems, which are assigned to worker nodes (typically the nodes where the data resides).

Reduce:

The results are gathered from the worker nodes and merged to produce the final result.

Example: Apache Hadoop
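The two steps above can be sketched with the classic word-count example. This is a single-process illustration in plain Python, not Hadoop code: the function names `map_phase`, `shuffle` and `reduce_phase` are my own, and the two strings in `chunks` stand in for pieces of data sitting on two different worker nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    # Each worker turns its own chunk of text into (word, 1) pairs.
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    # Group all values emitted for the same key together.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # Merge the grouped values into one final result per key.
    return {key: sum(values) for key, values in grouped.items()}


chunks = ["big data is big", "data is everywhere"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real cluster the map calls would run in parallel on the nodes holding each chunk, and only the small (word, count) pairs would travel over the network.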

MapReduce

January 3, 2011 Leave a comment

http://en.wikipedia.org/wiki/Mapreduce.

Google’s lecture series on MapReduce:

More to come…

FAQs on the cloud

January 3, 2011 1 comment

What are people thinking about cloud computing?

FAQ:


1. What is this thing called “Cloud computing”?

a. Cloud computing is a kind of generalized, automated and integrated virtualization.

2. Why is there a lot of hype all over?

a. There is a lot of hype because cloud computing is a revolution to come: http://bit.ly/g2z3su

3. Is that something fictional? Is that really possible?

a. It is not fictional. If you believe you can create more than a few virtual machines on a single physical computer, Cloud computing is very much possible – in fact, it is proven and current, and you cannot really question its practicality (http://aws.amazon.com, http://cloudpower.in, …).

4. How can a Virtual Machine (VM) be scaled to a capacity more than what the underlying host has?

a. There is not one but a whole lot of physical servers that serve as the cloud. Usually there is a master host which controls the behavior of the rest of the hosts. It places VMs whose workload peaks and valleys are uncorrelated on the same physical host, and all of this is transparent to the user.
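As a rough illustration of that placement idea, here is a small greedy sketch in Python. It is not how any real cloud scheduler works: the function `place_vms`, the VM names and the load numbers are all hypothetical, and each load list simply means "units of capacity needed in each time slot".

```python
def place_vms(vm_loads, host_capacity):
    """Greedy toy placement: put each VM on the host where the combined
    peak load stays lowest, opening a new host if none has room."""
    hosts = []  # each host: {"vms": [names], "load": [total load per slot]}
    for name, load in vm_loads.items():
        best = None
        for host in hosts:
            combined = [a + b for a, b in zip(host["load"], load)]
            peak = max(combined)
            # Only consider hosts whose combined peak fits the capacity.
            if peak <= host_capacity and (best is None or peak < best[1]):
                best = (host, peak, combined)
        if best is None:
            hosts.append({"vms": [name], "load": list(load)})
        else:
            best[0]["vms"].append(name)
            best[0]["load"] = best[2]
    return hosts


# Hypothetical load profiles over four time slots (day, day, night, night).
vm_loads = {
    "day_shop": [8, 8, 1, 1],
    "night_batch": [1, 1, 8, 8],
    "day_crm": [7, 7, 1, 1],
}
hosts = place_vms(vm_loads, host_capacity=10)
print([host["vms"] for host in hosts])  # [['day_shop', 'night_batch'], ['day_crm']]
```

The daytime shop and the nightly batch job share one host because their peaks never coincide, while the second daytime VM needs a host of its own.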

 

5. I don’t think a VM on the cloud is as efficient as a physical, dedicated server for my application. Shouldn’t I opt for a dedicated server instead?

a. Latencies may creep in, but there is always a trade-off. Evaluate the TCO (Total Cost of Ownership) for your business. Physical dedicated servers **may** do well, but if I were you, I would weigh maintenance costs like datacenter cooling, power, building lease and other miscellaneous maintenance, as well as overall server utilization, against the efficiency of the application. I would not want my servers to sit idle during non-peak hours, because I pay the same amount out of pocket during non-peak hours as during peak hours. I want my business to be **cost-effective**.

Test drive on the cloud using Amazon Web Services

September 1, 2009 Leave a comment

Hi!

I would like to write a few lines today on working with the Amazon Elastic Compute Cloud (EC2).
Of course, visiting the AWS website is enough to understand how things work there; this post is just a primer that outlines what is in store!

Getting a server up and running is as simple as 1..2..3…

1. Register.
2. Choose and submit payment info.
3. Create an instance (choose your server configuration) and launch it.

The server configuration may be chosen manually, or from a list of “Community instances” (pre-configured).

The server, with all the resources you’ve requested along with the OS and other application software like Oracle Database or SQL Server, is delivered in a jiffy – as fast as you click the launch button!

Amazon Web Services provides you with a private-key file with which you can log in to the instance. Other tools, such as those for monitoring resource usage, are readily available to choose and use.

Technical team that a cloud demands:

The cloud not only brings about a revolution for small and mid-sized businesses, but also sends a shock wave through the fields of infrastructure, networking and DBA manpower.

It happens this way…
Say a company, Xyz Pvt Ltd of India, wants to host its online sales and eCommerce website, with a transaction volume of about one million per day. The company cannot afford the million dollars it would cost to purchase a server, without which the database/app-server cannot possibly handle about a million transactions per day. A cloud enables the company to do this.
With the advent of Cloud computing, the company can now pay about 100 dollars per day and host its website on a sufficiently capable server.
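A quick back-of-the-envelope check of those two figures, as a Python snippet. It uses only the illustrative numbers from the example above (a one-million-dollar server versus about 100 dollars per day) and deliberately ignores power, cooling, staff and other running costs of an owned server, which would push the break-even point out even further.

```python
# Illustrative figures from the example above, not real price quotes.
server_purchase_usd = 1_000_000
cloud_cost_per_day_usd = 100

# Days of cloud usage that cost as much as buying the server outright.
break_even_days = server_purchase_usd / cloud_cost_per_day_usd
break_even_years = break_even_days / 365

print(f"{break_even_days:.0f} days, about {break_even_years:.1f} years")
# prints: 10000 days, about 27.4 years
```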

This is only a small illustration of how Cloud computing helps small and mid-sized companies. Now, if there are tens of thousands of such companies in a city like Hyderabad [India], at least thousands of them would come up with their own websites for eCommerce, online sales or customer surveys, where there were only a few earlier.

Now let us change our perspective and look at this from the side of a company that offers the cloud, say Abc Pvt Ltd, from India. Say there are about 50 to 100 companies using the cloud offered by Abc. Abc has to ensure that its cloud is up and running all the time. If it offers about 100 high-end servers to its customers, imagine the workforce it would need to look after them.

Cloud computing: A revolution to come

August 19, 2009 Leave a comment

(Moved from aswanikumarv.blogspot.com) Originally posted on: Wed, Aug 19, 2009

I would like to share my thoughts on Cloud computing. As an admirer of technology and the sciences, and more so as a professional in computer software engineering, I have been reading articles about cloud computing on the web.
The meaning, importance, power and almost everything else that relates to Cloud computing can be understood from the very name. It is self-explanatory.
Here, the objects (software/hardware) are available in the form of a cloud, which any authorized user can access. It is resizable on the go; you can upsize or downsize your server as easily as you post a message to a friend on a social networking site: no wires, no cables, no power and obviously no hassles, no worries.

Today, there are many Independent Software Vendors (we will call them ISVs hereafter) who cannot afford full-sized web servers and database servers for the applications they develop, because the costs of procurement and maintenance are too high, really far beyond imagination. Another reason they do not procure their own servers is scalability.

Now that this part is taken up by the cloud provider, the ISV pays only for the machine, and only for the time it is used!
Now suppose that there are about 100 such ISVs and that, with the advent of cloud computing, 50 of them will host their applications on servers of their own.
The purpose or domain of these applications is what brings competition among these 50 ISVs.
But let me interrupt the narration about these 50 ISVs and ask you a simple question: do you think there are only 100 such ISVs in the real world? No. There are millions. So imagine the competition and the revolution that is to come.

A picture is worth 1000 words… A video is worth a thousand pictures…