Introduction To Big Data And Hadoop

What is Big Data

Have you ever wondered how technologies evolve to fulfil emerging needs?

For example:

Earlier we had landline phones, but now we have shifted to smartphones. Similarly, how many of you remember the floppy drives that were used extensively back in the ’90s? Floppy drives were replaced by hard disks because they had very low storage capacity and transfer speed.

This makes floppy drives insufficient for handling the amount of data we deal with today. In fact, we can now store terabytes of data in the cloud without worrying about size constraints.

Need for Big Data

The following are the reasons why Big Data is needed.

By one estimate, around 90% of the world’s data has been created in the last two years alone. Moreover, 80% of that data is unstructured or available in widely varying structures, which makes it difficult to analyze.

As IT systems are being developed, it has been observed that structured formats like databases have some limitations with respect to handling large quantities of data.

It has also been observed that it is difficult to integrate information distributed across multiple systems.

Further, most business users don’t know what should be analyzed and discover requirements only during the development of IT systems. As data has grown, so have ‘data lakes’ within enterprises.

Potentially valuable data for varied systems such as Enterprise Resource Planning (ERP) and Supply Chain Management (SCM) is either dormant or discarded. It is often too expensive to integrate large volumes of unstructured data.

Information, like natural resources, has a short useful lifespan and is best used within a limited time span.

Moreover, information is best exploited for business value if context is added to it.

In the next section of this introduction to Big Data tutorial, we’ll concentrate on the characteristics of Big Data.

Introduction to Hadoop

Hadoop helps to leverage the opportunities provided by Big Data and to overcome the challenges it poses.

What is Hadoop?

Hadoop is an open-source, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is based on the Google File System (GFS).

Why Hadoop

Hadoop runs a variety of applications on distributed systems with thousands of nodes involving petabytes of data. It has a distributed file system, called the Hadoop Distributed File System (HDFS), which enables fast data transfer among the nodes.
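To make the idea of a distributed file system more concrete, here is a rough sketch of how a large file might be cut into blocks and spread across nodes. It is only an illustration, not HDFS code: the function name place_blocks and the node names are hypothetical, the 128 MB block size and replication factor of 3 are assumed values (they match common HDFS defaults), and the simple round-robin placement stands in for the rack-aware policy a real cluster would use.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size (a common HDFS default)
REPLICATION = 3      # assumed replication factor (a common HDFS default)


def place_blocks(file_size_mb, nodes,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_index: [nodes holding a copy of that block]}.

    Hypothetical helper: replicas are placed round-robin, which is a
    simplification of real HDFS block placement.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    # A file is stored as fixed-size blocks; the last one may be partial.
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        # Consecutive nodes (wrapping around) each get one replica.
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    layout = place_blocks(300, ["node1", "node2", "node3", "node4"])
    for block, holders in layout.items():
        print(block, holders)
```

Because every block lives on several nodes, different blocks of the same file can be read from different machines in parallel, and losing a single node does not lose any data.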

Further, it leverages a distributed computation framework called MapReduce.
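The MapReduce model is easiest to see with the classic word-count example. The sketch below runs in a single Python process and does not use Hadoop's Java API; the function names map_phase, shuffle, reduce_phase, and word_count are hypothetical and chosen only for illustration. On a real cluster, the map and reduce steps would run in parallel on many nodes, and the framework would perform the grouping step between them.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over the input lines in one process."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data needs hadoop", "hadoop stores big data"]
    print(word_count(sample))
```

The key design point is that each mapper only needs its own slice of the input and each reducer only needs the values for its own keys, which is what allows the work to be spread over many machines.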