Apache Hadoop — An Open-Source Framework for Beginners in Big Data
Big data is this, big data is that… Let’s start with simple words. Suppose we have a system that generates petabytes or even zettabytes of data. We cannot store that much data in traditional SQL databases, and if we need to run analytics on it, traditional approaches do not work either. This is the scenario where we enter the world of Big Data.

We hear different terms around this topic:
- Data Warehousing
- Data Lakes
- Analytical Databases
I try to explain all of these terms in my blogs, and I use open-source tools in my articles so that readers can easily access everything they need to start learning.
From a business perspective, running analytics on big data ultimately fuels better and faster decision-making, makes it possible to model and predict future outcomes, and enhances business intelligence. Apache Hadoop, Apache Spark, and the wider Hadoop ecosystem serve as cost-effective, flexible data processing and storage tools designed to handle the volume of data being generated today.
Hadoop started with two research papers published by Google:
- Google’s distributed file system, called GFS
- MapReduce
Doug Cutting commented on Google’s contribution to the development of the Hadoop framework:
“Google is living a few years in the future and sending the rest of us messages.”
At the time, Mike Cafarella and Doug Cutting were in the process of building a search engine system that could index a billion pages.
Hadoop is an open-source software framework used for storing and processing Big Data in a distributed manner on large clusters of commodity hardware. It is licensed under the Apache v2 license and written in the Java programming language.
We have three major problems in Big Data:
- The first problem is storing a huge amount of data.
- The second is storing a variety of data: structured, semi-structured, and unstructured.
- The third is processing the data fast enough.
Features of Hadoop
- Hadoop is open source.
- A Hadoop cluster is highly scalable: capacity grows simply by adding more commodity nodes.
- Hadoop provides fault tolerance: data blocks are replicated, so losing a node does not mean losing data.
- Hadoop provides high availability.
- Hadoop is very cost-effective because it runs on commodity hardware.
- Hadoop is fast at processing large volumes of data.
- Hadoop is based on the data locality concept: computation is moved to the nodes where the data already lives (see the sketch after this list).
- Hadoop is flexible: it is feasible to store and process any kind of data, whether structured, semi-structured, or unstructured.
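To make the data locality and fault tolerance points a little more concrete, here is a minimal Java sketch that asks the NameNode which DataNodes hold the blocks of a file. The NameNode address and the file path are placeholders for illustration only, not part of any real cluster.

```java
// Minimal sketch: list the block locations of an HDFS file.
// fs.defaultFS and the file path below are example values only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://localhost:9000"); // example NameNode address

    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/user/demo/big-input.txt"); // hypothetical file
      FileStatus status = fs.getFileStatus(file);

      // Each entry covers one block of the file and lists the DataNodes
      // that hold a replica of that block.
      BlockLocation[] blocks =
          fs.getFileBlockLocations(status, 0, status.getLen());
      for (BlockLocation block : blocks) {
        System.out.printf("offset=%d length=%d hosts=%s%n",
            block.getOffset(), block.getLength(),
            String.join(",", block.getHosts()));
      }
    }
  }
}
```

Each block normally appears on several hosts because of replication; the MapReduce scheduler tries to run a task on one of those hosts so the data does not have to travel over the network.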
Hadoop Ecosystem and Components
The Hadoop ecosystem is made up of a number of components built around the core framework.

Apache Hadoop consists of two sub-projects –
Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications that run on Hadoop. MapReduce programs can process enormous amounts of data in parallel on large clusters of compute nodes.
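To give a feel for what a MapReduce program looks like, here is a minimal word-count sketch using the classic Hadoop MapReduce Java API. It follows the standard WordCount example; the class names and the input/output paths passed on the command line are just placeholders.

```java
// Minimal word-count job: the mapper emits (word, 1) pairs,
// the reducer sums the counts for each word.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in every input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner is optional here
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this would be submitted with the hadoop jar command, with the input and output directories in HDFS passed as arguments.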
HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop applications; MapReduce applications consume data from HDFS. HDFS creates multiple replicas of each data block and distributes them across the compute nodes in a cluster, which enables reliable and extremely fast computation.
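As a small illustration of how an application talks to HDFS, the sketch below writes a file and reads it back through the Hadoop FileSystem API. The fs.defaultFS URI and the path are example values for a hypothetical single-node setup.

```java
// Minimal sketch: write a file to HDFS and read it back.
// The NameNode URI and the path below are examples only.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point the client at the NameNode; adjust host and port for your cluster.
    conf.set("fs.defaultFS", "hdfs://localhost:9000");

    try (FileSystem fs = FileSystem.get(conf)) {
      Path path = new Path("/user/demo/hello.txt");

      // Write: HDFS splits the stream into blocks and replicates
      // each block across DataNodes.
      try (FSDataOutputStream out = fs.create(path, true)) {
        out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
      }

      // Read the file back line by line.
      try (FSDataInputStream in = fs.open(path);
           BufferedReader reader = new BufferedReader(
               new InputStreamReader(in, StandardCharsets.UTF_8))) {
        String line;
        while ((line = reader.readLine()) != null) {
          System.out.println(line);
        }
      }
    }
  }
}
```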
Although Hadoop is best known for MapReduce and its distributed file system, HDFS, the term is also used for a family of related projects that fall under the umbrella of distributed computing and large-scale data processing. Other Hadoop-related projects at Apache include Hive, HBase, Mahout, Sqoop, Flume, and ZooKeeper.