Introduction to Hadoop


Hadoop is a platform well suited to dealing with semi-structured and unstructured data, as well as situations where a data discovery process is needed. That isn’t to say that Hadoop can’t be used for structured data that is readily available in raw format; it can.

Traditionally, data goes through a lot of rigor to make it into the warehouse. It is prepared via various cleansing, enrichment, modeling, master data management, and other services before it is ready for analysis, which is an expensive process. Because of that expense, data that lands in the warehouse is not just high value but broad purpose: it is used to generate reports and dashboards where accuracy is key.

In contrast, Big Data repositories very rarely undergo the full quality control applied to data injected into a warehouse. Hadoop is built for handling large volumes of data, so prepping and processing that data should not be cost prohibitive.

Think of Hadoop as a system designed for processing mind-boggling amounts of data.

Two main components of Hadoop:

1. MapReduce = computation
2. HDFS = storage

Hadoop Distributed File System (HDFS):

[Figure: HDFS architecture]

Let’s discuss the Hadoop cluster components before getting into the details of HDFS.

A typical Hadoop environment consists of a master node and worker nodes, each running specialized software components.

Master node: There are typically multiple master nodes to avoid a single point of failure in any environment. The elements of the master node are:

1. Job Tracker
2. Task Tracker
3. Name Node

Job Tracker: The Job Tracker interacts with client applications. It is mainly responsible for distributing MapReduce tasks to particular nodes within the cluster.

Task Tracker: This process receives tasks such as map, reduce, and shuffle from the Job Tracker.

Name Node: The Name Node is charged with storing a directory tree of all files in HDFS. It also keeps track of where the file data is kept within the cluster. Client applications contact the Name Node when they need to locate a file, or to add, copy, or delete a file.

Data Node: Data Nodes store data in HDFS and are responsible for replicating data across the cluster. They interact with client applications once the Name Node has supplied the Data Node’s address.

[Figure: HDFS and MapReduce data flow]

Worker Nodes: These are the commodity servers that process the incoming data. Each worker node includes a Data Node and a Task Tracker.

Scenario to better understand how “stuff” works:

1. Let’s say we have a 300 MB file.
2. By default, HDFS breaks it into 128 MB blocks (a small sketch after this list illustrates the arithmetic):
300 MB = 128 MB + 128 MB + 44 MB
3. So HDFS splits the 300 MB file into the blocks above.
4. HDFS keeps 3 copies of each block.
5. All these blocks are stored on Data Nodes.
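To make the arithmetic concrete, here is a minimal sketch in Python (not Hadoop code). The 128 MB block size and replication factor of 3 are the assumptions used in the scenario above; both are configurable in a real cluster.

    # Sketch only: how a 300 MB file maps to HDFS blocks and replicas.
    # Assumes a 128 MB block size and a replication factor of 3, as in the example.
    BLOCK_SIZE_MB = 128
    REPLICATION = 3

    def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
        """Return the block sizes HDFS would create for a file of this size."""
        blocks = []
        remaining = file_size_mb
        while remaining > 0:
            blocks.append(min(block_size_mb, remaining))
            remaining -= block_size_mb
        return blocks

    blocks = split_into_blocks(300)
    print(blocks)                      # [128, 128, 44]
    print(len(blocks) * REPLICATION)   # 9 block replicas spread across the Data Nodes

Note that the last block only occupies 44 MB on disk; HDFS does not pad it out to the full block size.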

Bottom line: the Name Node tracks the blocks and Data Nodes and keeps an eye on every node in the cluster. It does not store any file data itself, and no file data passes through it.

• When a Data Node (DN) fails, the Name Node makes sure the lost blocks are copied to other nodes; with three replicas, it can handle the failure of up to 2 DNs holding the same block.
• The Name Node (NN) is a single point of failure.
• DNs continuously run checksums; if a block is found to be corrupted, it is served and re-replicated from another DN’s replica.

There is a lot more to discuss, but let’s move on to MapReduce for now.

MapReduce (M-R)

MapReduce was invented by Google. Its main characteristics are:

1. Sort/merge is the primitive
2. Batch oriented
3. Ad hoc queries (no schema)
4. Distribution handled by the framework

To keep it simple: we receive terabytes and petabytes of data to process and analyze. To handle that, we use MapReduce, which has two major phases: map and reduce.

Map: MapReduce works on key/value pairs. Any data that comes in is split by HDFS into blocks and then processed by MapReduce, where we assign a value to every key.

Example: “Gurukulindia is the best site to learn big data”

<key, value>

<Gurukulindia, 1> <is, 1> <the, 1> <best, 1> <site, 1> <to, 1> <learn, 1> <big, 1> <data, 1>
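As a minimal sketch of that map phase, here is a word-count mapper in the Hadoop Streaming style (reading raw lines from stdin and emitting tab-separated key/value pairs). The file name mapper.py is just an illustrative choice, not something defined by this article.

    #!/usr/bin/env python
    # mapper.py -- word-count mapper sketch (Hadoop Streaming style).
    # Emits "<word>\t1" for every word, mirroring the <key, 1> pairs above.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

Piped locally, echo "Gurukulindia is the best site to learn big data" | python mapper.py produces one <word, 1> line per word.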

To put the network view and the logical view together, the overall flow looks like this:

1. Input step: Load data into HDFS by splitting it into blocks and loading them onto DNs. The blocks are replicated to overcome failures. The NN keeps track of the blocks and the DNs.
2. Job step: Submit the MapReduce job and its details to the Job Tracker.
3. Job init step: The Job Tracker interacts with the Task Tracker on each DN to schedule MapReduce tasks.
4. Map step: The mappers process the data blocks and generate lists of key/value pairs.
5. Sort step: Each mapper sorts its list of key/value pairs.
6. Shuffle step: The mapped output is transferred to the reducers in sorted fashion.
7. Reduce step: The reducers merge the lists of key/value pairs to generate the final result.

The results of the reducers are finally stored in HDFS, replicated as per the configuration, and clients can then read them from HDFS.
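Continuing the word-count sketch above, here is a matching reducer. Because the framework delivers the mapper output sorted by key, identical words arrive on consecutive lines and the reducer only has to sum their counts. Again, reducer.py is an illustrative name.

    #!/usr/bin/env python
    # reducer.py -- word-count reducer sketch (Hadoop Streaming style).
    # Input arrives sorted by key, so counts for the same word are adjacent.
    import sys

    current_word, current_count = None, 0

    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)

    if current_word is not None:
        print(f"{current_word}\t{current_count}")

The whole pipeline can be simulated locally with cat input.txt | python mapper.py | sort | python reducer.py; on a real cluster the same scripts are typically submitted through the Hadoop Streaming jar, whose exact path depends on the distribution.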

Ravikiran Paladugu

Ravi is pursuing his doctoral degree in the area of Data Science and currently works in the USA as a Storage and Hadoop consultant specializing in Apache Hadoop, EMC, and NetApp products. His Big Data experience includes the design, deployment, and administration of Hadoop and related components, providing end-to-end data protection and identifying performance bottlenecks in the infrastructure.

20 Responses

  1. Sukumar Enuguri says:

    I’d love to learn more about Hadoop…

  2. Ramdev Ramdev says:

    Sukumar, you will see more in this space.

  3. ramesh reddy says:

    Hi Ramdev,

    Currently I am working as a Sr. Unix admin (Solaris primary, Linux secondary).
    I would like to learn Hadoop, but which topics suit Unix admins: HDFS, Job Tracker, etc.?
    Could you please suggest where to start and what to learn.

  4. Ramdev Ramdev says:

    Ramesh, understanding HDFS, MapReduce, and Hadoop cluster maintenance will be an additional asset for a Unix admin, especially if they choose to work in a cloud-based environment.

  5. ramesh reddy says:

    Hi Ramdev,

    Please provide some websites with documentation on HDFS, MapReduce, and Hadoop clusters.

  6. Abhishek says:

    Hey Rama… super article. :)
    Can you please share the steps to install it? I am working on it with 200+ data nodes and started from scratch; right now I am scratching my own head, but I will keep updating my findings and glitches here. If you have anything apart from Cloudera, please send it. Many thanks, and hats off to your website; it’s awesome!

  7. Hi Abhishek,

    I am working on drafting an article to summarize the general steps of installing Hadoop. I bet it can be hectic when you are just starting with this technology, but as you get experienced it will be great fun. I would go for scripting when talking about 200+ nodes.

  8. Piper says:

    I personally thoroughly liked this topic, unixadminschool.com » Introduction to Hadoop; a very lovely read, much appreciated.

  9. Thanks for the feedback Piper!

  10. Rahul says:

    Hi Ramdev,

    Currently I have been working as a Solaris/Linux admin for 2 years, and now I am looking at Oracle DBA. Will it be good for me, especially for my future? Please suggest.

  11. Ramdev Ramdev says:

    Hi Rahul, personally I feel the career growth is the same for both DBA and sysadmin; in fact they grow in parallel in infrastructure management services. If you are looking at future technologies and are interested in database-related work, I would recommend Cassandra and Big Data skills.

  12. Kiran says:

    I am interested in knowing the Hadoop concepts for Unix SAs. As far as I know there are 2 tracks in Hadoop:
    1. Hadoop Development 2. Hadoop Admin. For Unix SAs, Hadoop Admin is a very good fit.
    In the current job market, how is Hadoop growing? Will it continue?

  13. david blowe says:

    Hadoop – very nice! I am broadening my knowledge greatly! Thanks! If you have more on Isilon and Hadoop, that would be good. Excellent! Did I understand correctly that Hadoop makes a triad/triangle of DN servers, so if one goes down the data can still be obtained from the other two? Or does it “just” keep 3 copies on one big server, like I think Isilon or Panasas do?

  14. Kiran MS says:

    Step 2: “By default we make it as 128 MB blocks.”
    Make me clear on this… I see that by default it should be 64 MB, right?

  15. Ramdev Ramdev says:

    @Kiran, I believe the values mentioned in this article are based on Cloudera Hadoop.
    Are you using the default Apache Hadoop version?

  16. Kiran MS says:

    Okay…
    I have an issue: I am trying to move files from Linux to Hadoop. There are no errors, but the problem is that we are unable to view the directory…

    Please find the following output:

    hadoop@hadoop-VirtualBox:~$ pwd
    /home/hadoop

    hadoop@hadoop-VirtualBox:~$ ls -l Hadoop_Files
    -rw-rw-r-- 1 hadoop hadoop 9 Feb 2 11:31 Hadoop_Files

    hadoop@hadoop-VirtualBox:~$ hadoop fs -put Hadoop_Files /user/hadoop

    hadoop@hadoop-VirtualBox:~$ hadoop fs -ls
    Found 4 items
    -rw-r--r-- 1 hadoop supergroup 22 2014-02-01 12:15 /user/hadoop/Apache_Hadoop
    -rw-r--r-- 1 hadoop supergroup 9 2014-02-02 11:31 /user/hadoop/Hadoop_Files
    -rw-r--r-- 1 hadoop supergroup 8445 2014-02-01 12:25 /user/hadoop/examples.desktop
    -rw-r--r-- 1 hadoop supergroup 22 2014-02-01 12:16 /user/hadoop/test

    hadoop@hadoop-VirtualBox:~$ hadoop fs -cat /user/hadoop/Hadoop_Files
    Test File

    hadoop@hadoop-VirtualBox:~$ hadoop fs -lsr
    -rw-r--r-- 1 hadoop supergroup 22 2014-02-01 12:15 /user/hadoop/Apache_Hadoop
    -rw-r--r-- 1 hadoop supergroup 9 2014-02-02 11:31 /user/hadoop/Hadoop_Files
    -rw-r--r-- 1 hadoop supergroup 8445 2014-02-01 12:25 /user/hadoop/examples.desktop
    -rw-r--r-- 1 hadoop supergroup 22 2014-02-01 12:16 /user/hadoop/test
    hadoop@hadoop-VirtualBox:~$

    hadoop@hadoop-VirtualBox:~$ jps
    2543 DataNode
    3780 Jps
    3085 TaskTracker
    2857 JobTracker
    2773 SecondaryNameNode
    2316 NameNode
    hadoop@hadoop-VirtualBox:~

    From localhost:50070 we are able to see the directory, but not from the command line…

  17. Kiran MS says:

    from localhost:50070

    Name             Type  Size     Replication  Block Size  Modification Time  Permission  Owner   Group
    tmp              dir                                     2014-02-01 12:14   rwxr-xr-x   hadoop  supergroup
    user             dir                                     2014-02-01 12:25   rwxr-xr-x   hadoop  supergroup

    Name             Type  Size     Replication  Block Size  Modification Time  Permission  Owner   Group
    hadoop           dir                                     2014-02-02 11:31   rwxr-xr-x   hadoop  supergroup

    Name             Type  Size     Replication  Block Size  Modification Time  Permission  Owner   Group
    Apache_Hadoop    file  0.02 KB  1            64 MB       2014-02-01 12:15   rw-r--r--   hadoop  supergroup
    Hadoop_Files     file  0.01 KB  1            64 MB       2014-02-02 11:31   rw-r--r--   hadoop  supergroup
    examples.desktop file  8.25 KB  1            64 MB       2014-02-01 12:25   rw-r--r--   hadoop  supergroup
    test             file  0.02 KB  1            64 MB       2014-02-01 12:16   rw-r--r--   hadoop  supergroup

  18. PRAFUL says:

    Thanks for showing interest in Hadoop. Please go for the Hadoop Admin part; it will be very helpful for you. RAJA 9818868839

  19. I am planning to get Hadoop administration training from you.

    If possible, please let me know the details
