Sample Case Study: Hadoop Configuration to Use EMC Isilon Storage
This article is for readers who already know what Hadoop is and why it is used. If you are not familiar with Hadoop, please skip this post; a separate article will cover the basics.
Because Hadoop is a flexible, open-source framework for large-scale distributed computation, it can be deployed in different ways. I would like to share my recent deployment of Hadoop on Isilon scale-out NAS.
As a very high-level introduction, the EMC Isilon scale-out NAS storage platform combines modular hardware with unified software to harness unstructured data. Powered by the distributed OneFS operating system, an EMC Isilon cluster delivers a scalable pool of storage with a global namespace.
The OneFS file system can be configured for native support of the Hadoop Distributed File System (HDFS) protocol, enabling your cluster to participate in a Hadoop system. The HDFS service, which is enabled by default after you activate an HDFS license, can be enabled or disabled by running the isi services command.
To enable the HDFS service, run the following command:
isi services isi_hdfs_d enable
An HDFS implementation adds HDFS to the list of protocols that can be used to access the OneFS file system. Implementing HDFS on an Isilon cluster does not create a separate HDFS file system. The cluster can continue to be accessed through NFS, SMB, FTP, and HTTP.
The HDFS implementation from Isilon is a lightweight protocol layer between the OneFS file system and HDFS clients. Unlike with a traditional HDFS implementation, files are stored in the standard POSIX-compatible file system on an Isilon cluster. This means files can be accessed by the standard protocols that OneFS supports, such as NFS, SMB, FTP, and HTTP as well as HDFS.
Files that will be processed by Hadoop can be loaded by using standard Hadoop methods, such as hadoop fs -put, or they can be copied by using an NFS or SMB mount and accessed by HDFS as though they were loaded by Hadoop methods. Also, files loaded by Hadoop methods can be read with an NFS or SMB mount.
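As a rough illustration of the two loading routes described above, the sketch below prints the command for each route instead of running it. The file name, the HDFS target directory, and the NFS mount point are hypothetical placeholders, not values from this deployment.

```shell
#!/bin/sh
# Hypothetical placeholders; adjust for your environment.
LOCAL_FILE="weblogs.csv"
HDFS_DIR="/input"                 # path as seen by Hadoop clients
NFS_MOUNT="/mnt/isilon/hadoop"    # assumed NFS mount of the cluster's HDFS root directory

# Route 1: load the file through Hadoop itself.
load_via_hadoop() {
    echo "hadoop fs -put $1 $HDFS_DIR/"
}

# Route 2: copy over an NFS (or SMB) mount; HDFS clients then see the same file.
load_via_nfs() {
    echo "cp $1 $NFS_MOUNT$HDFS_DIR/"
}

load_via_hadoop "$LOCAL_FILE"
load_via_nfs "$LOCAL_FILE"
```

Both routes end up writing to the same OneFS directory, which is why a file copied over NFS can be read by a Hadoop job without a separate import step.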
The supported versions of Hadoop are as follows:
- Apache Hadoop 0.20.203.0
- Apache Hadoop 0.20.205
- Cloudera (CDH3 Update 3)
- Greenplum HD 1.1
To enable native HDFS support in OneFS, you must integrate the Isilon cluster with a cluster of Hadoop compute nodes. This process requires configuration of the Isilon cluster as well as each Hadoop compute node that needs access to the cluster.
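On the compute-node side, Hadoop 0.20-era clients read the default file system from the fs.default.name property in core-site.xml. The sketch below prints a minimal file that points that property at the Isilon cluster. The host name is a hypothetical placeholder, and port 8020 is an assumption that should be confirmed for your cluster before use.

```shell
#!/bin/sh
# Assumptions: the host name is a placeholder and 8020 is the assumed HDFS port.
ISILON_HOST="isilon.example.com"
HDFS_PORT=8020

# Print a minimal core-site.xml for the given host ($1) and port ($2).
print_core_site() {
    cat <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://$1:$2</value>
  </property>
</configuration>
EOF
}

print_core_site "$ISILON_HOST" "$HDFS_PORT"
```

The output would typically be redirected into the conf/core-site.xml file of each compute node that needs access to the cluster.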
Create a local user:
To access files on OneFS by using the HDFS protocol, you must first create a local Hadoop user that maps to a user on a Hadoop client.
- Open an SSH connection to any node in the cluster and log in by using the root user account.
- At the command prompt, run the isi auth users create command to create a local user.
isi auth users create --name="user1"
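If several accounts run jobs on the Hadoop clients, each one needs a matching local user. The sketch below only prints the create command for each name so the list can be reviewed first; the account names in the loop are hypothetical examples.

```shell
#!/bin/sh
# Build the OneFS command that creates a local user with the given name.
create_user_cmd() {
    echo "isi auth users create --name=\"$1\""
}

# Hypothetical list of accounts that run jobs on the Hadoop clients.
for user in hadoop mapred user1; do
    create_user_cmd "$user"
done
```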
Configure the HDFS protocol
You can specify which HDFS distribution to use, and you can set the logging level, the root path, the Hadoop block size, and the number of available worker threads. You configure HDFS by running the isi hdfs command in the OneFS command-line interface.
1. Open an SSH connection to any node in the cluster and log in by using the root account.
2. To specify which distribution of the HDFS protocol to use, run the isi hdfs command with the --force-version option. Valid values are:
- AUTO: Attempts to match the distribution that is being used by the Hadoop compute node.
- APACHE_0_20_203: Uses the Apache Hadoop 0.20.203 release.
- APACHE_0_20_205: Uses the Apache Hadoop 0.20.205 release.
- CLOUDERA_CDH3: Uses version 3 of Cloudera’s distribution, which includes Apache Hadoop.
- GREENPLUM_HD_1_1: Uses the Greenplum HD 1.1 distribution.
For example, the following command forces OneFS to use version 0.20.203 of the Apache Hadoop distribution:
isi hdfs --force-version=APACHE_0_20_203
3. To set the default logging level for the Hadoop daemon across the cluster, run the isi hdfs command with the --log-level option. Valid values are:
- EMERG: A panic condition. This is normally broadcast to all users.
- ALERT: A condition that should be corrected immediately, such as a corrupted system database.
- CRIT: Critical conditions, such as hard device errors.
- ERR: Errors.
- WARNING: Warning messages.
- NOTICE: Conditions that are not error conditions, but may need special handling.
- INFO: Informational messages.
- DEBUG: Messages that contain information typically of use only when debugging a program.
For example, the following command sets the log level to WARNING:
isi hdfs --log-level=WARNING
4. To set the path on the cluster to present as the HDFS root directory, run the isi hdfs command with the --root-path option.
For example, the following command sets the root path to /ifs/hadoop:
isi hdfs --root-path=/ifs/hadoop
5. To set the Hadoop block size, run the isi hdfs command with the --block-size option.
Valid values are 4KB to 1GB. The default value is 64MB.
For example, the following command sets the block size to 32 MB:
isi hdfs --block-size=32MB
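Before changing the block size, it can help to check a candidate value against the 4KB to 1GB range. The following is a minimal sketch that assumes binary units (1KB = 1024 bytes); the helper names are hypothetical and are not part of the isi CLI.

```shell
#!/bin/sh
# Convert a size such as 4KB, 32MB, or 1GB to bytes (binary units assumed).
size_to_bytes() {
    num=${1%[KMG]B}                    # strip the two-letter unit suffix
    case "$num" in
        ''|*[!0-9]*) return 1 ;;       # reject empty or non-numeric values
    esac
    case "$1" in
        *KB) echo $(( $num * 1024 )) ;;
        *MB) echo $(( $num * 1024 * 1024 )) ;;
        *GB) echo $(( $num * 1024 * 1024 * 1024 )) ;;
        *)   return 1 ;;               # no recognized unit
    esac
}

# Succeed only if the size lies between 4KB and 1GB inclusive.
valid_block_size() {
    bytes=$(size_to_bytes "$1") || return 1
    [ "$bytes" -ge 4096 ] && [ "$bytes" -le 1073741824 ]
}

for size in 32MB 2KB 2GB; do
    if valid_block_size "$size"; then
        echo "$size: ok -> isi hdfs --block-size=$size"
    else
        echo "$size: outside 4KB..1GB"
    fi
done
```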
6. To tune the number of worker threads that HDFS uses, run the isi hdfs command with the --num-threads option.
Valid values are 1 to 256 or auto, which is calculated as twice the number of cores. The default value is auto.
For example, the following command specifies 8 worker threads:
isi hdfs --num-threads=8
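To make the auto rule concrete, the sketch below computes twice the core count and checks a candidate value against the 1 to 256 range. The function names are hypothetical, and getconf reports the cores of the machine running the script, which is not necessarily an Isilon node.

```shell
#!/bin/sh
# "auto" is described as twice the number of cores.
auto_threads() {
    echo $(( $1 * 2 ))
}

# Accept "auto" or an integer from 1 to 256.
valid_num_threads() {
    case "$1" in
        auto) return 0 ;;
        ''|*[!0-9]*) return 1 ;;       # reject empty or non-numeric values
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 256 ]
}

# Core count of the local machine; falls back to 4 if getconf is unavailable.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)
echo "auto on a ${cores}-core machine would mean $(auto_threads "$cores") threads"
```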
7. To allocate IP addresses from an IP address pool, run isi hdfs with the --add-ip-pool option.
For example, the following command allocates IP addresses from a pool named “pool2,” which is in the “subnet0” subnet:
isi hdfs --add-ip-pool=subnet0:pool2
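The configuration steps above can be collected into one script. This is only a sketch: it reuses the example values from each step, issues one isi hdfs call per option exactly as shown above, and defaults to a dry run that prints the commands. The run wrapper and the DRY_RUN variable are hypothetical helpers, not part of OneFS.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One call per setting, using the example values from the steps above.
configure_hdfs() {
    run isi hdfs --force-version=APACHE_0_20_203
    run isi hdfs --log-level=WARNING
    run isi hdfs --root-path=/ifs/hadoop
    run isi hdfs --block-size=32MB
    run isi hdfs --num-threads=8
    run isi hdfs --add-ip-pool=subnet0:pool2
}

configure_hdfs
```

Setting DRY_RUN=0 on a cluster node would execute the commands for real, so it is worth reviewing the printed list first.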
HDFS commands that can be used on Isilon OneFS:
Manages rack-local configuration
isi hdfs racks
Displays an HDFS rack object
isi hdfs racks view --name
Modifies an HDFS rack object
isi hdfs racks modify
Lists the existing HDFS racks
isi hdfs racks list
Deletes an HDFS rack
isi hdfs racks delete --name
Creates an HDFS rack
isi hdfs racks create --name