This post introduces the basic process and configuration for deploying an HDFS cluster, along with some error diagnosis.

Please see this post for a collection of topics on distributed computing.

Let's define /localscratch/hdfs as the root of all the packages; we will keep everything in this folder. A few variables below use paths with this prefix, and you may replace them with your own path.

Let's assume hdfs-host-placeholder is the domain name of our nodes. If we only use the cluster locally, we can replace it with localhost.

Environment Variables

First, let's create a file named hdfs-env.sh:

mkdir -p /localscratch/hdfs/etc
# quote 'EOF' so the $HADOOP_HOME references below are written literally
# instead of being expanded by the current shell
cat > /localscratch/hdfs/etc/hdfs-env.sh << 'EOF'
# how to get java home?
# java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home'
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 

export HADOOP_HOME=/localscratch/hdfs/lib/hadoop # define absolute path for hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="${HADOOP_OPTS} -Djava.library.path=$HADOOP_COMMON_LIB_NATIVE_DIR"
EOF

We can edit our $HOME/.bash_profile and add the following lines:

if [ -f /localscratch/hdfs/etc/hdfs-env.sh ]; then
    . /localscratch/hdfs/etc/hdfs-env.sh
    export PATH=$PATH:$HADOOP_HOME/bin
fi
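
To make sure the environment is picked up, a quick sanity check after re-logging in (or sourcing the profile) should show the paths defined above:

# reload the profile and confirm the variables are set
source $HOME/.bash_profile
echo $HADOOP_HOME                 # expect /localscratch/hdfs/lib/hadoop
$JAVA_HOME/bin/java -version      # expect a Java 8 runtime, per the path above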

Download Package

Download hadoop-2.9.2.tar.gz and decompress it into /localscratch/hdfs/lib:

mkdir -p /localscratch/hdfs/lib
cd /localscratch/hdfs/lib
wget http://mirror.olnevhost.net/pub/apache/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
tar -xzvf hadoop-2.9.2.tar.gz
ln -sf hadoop-2.9.2 hadoop
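
With the environment from the previous section loaded (so that $HADOOP_HOME/bin is on the PATH), the unpacked distribution can be verified right away:

# should print "Hadoop 2.9.2" plus build information
hadoop version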

Update Configuration Files

1. Edit $HADOOP_HOME/etc/hadoop/core-site.xml

Here is an example.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hdfs-host-placeholder:9000</value>
    </property>
    
    <!-- workaround for old hadoop-->
    <property>
       <name>fs.file.impl</name>
       <value>org.apache.hadoop.fs.LocalFileSystem</value>
       <description>The FileSystem for file: uris.</description>
    </property>

    <property>
       <name>fs.hdfs.impl</name>
       <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
       <description>The FileSystem for hdfs: uris.</description>
    </property>
</configuration>

fs.defaultFS defines the name of the default file system: a URI whose scheme and authority determine the FileSystem implementation. The URI's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class, and the URI's authority is used to determine the host, port, etc. for the filesystem.
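
In practice this means that paths without an explicit scheme resolve against fs.defaultFS. Once the cluster is up (see the start script below), the two commands in this sketch should be equivalent; the host and port are the ones configured above:

# path resolved against fs.defaultFS
hdfs dfs -ls /
# fully qualified URI, equivalent to the command above
hdfs dfs -ls hdfs://hdfs-host-placeholder:9000/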

Please see here for more configurable variables.

2. Modify $HADOOP_HOME/etc/hadoop/hdfs-site.xml as follows:

Here is an example.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///localscratch/hdfs/data/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///localscratch/hdfs/data/data</value>
    </property>

    <!-- ZooKeeper -->
    <!-- Please see https://hadoop.apache.org/docs/r2.9.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#Configuring_automatic_failover
         for further explanation of using ZooKeeper.
    -->
    <!--
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>hdfs-host-placeholder:2181</value>
    </property>
    -->

</configuration>

Please see here for more configurable variables.

3. Modify $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Add this snippet at the top:

if [ -f /localscratch/hdfs/etc/hdfs-env.sh ]; then
    . /localscratch/hdfs/etc/hdfs-env.sh
fi

4. Modify $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml

Change the property yarn.scheduler.capacity.maximum-am-resource-percent from 0.1 to 0.5. This prevents jobs from hanging forever on a small cluster, where the default 10% cap can leave the ApplicationMaster unable to obtain a container.
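
One quick way to apply the change is an in-place substitution. This is only a sketch that assumes the stock capacity-scheduler.xml, where 0.1 appears solely as the value of this property, so verify the result afterwards:

# in-place edit (assumes 0.1 occurs only for maximum-am-resource-percent)
sed -i 's|<value>0.1</value>|<value>0.5</value>|' $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml
# confirm the new value
grep -A 1 'maximum-am-resource-percent' $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml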

5. Edit $HADOOP_HOME/etc/hadoop/slaves

This file lists the cluster members (worker nodes), one hostname per line:

hdfs-host-placeholder

Each member goes on its own line; for a multi-node cluster, simply list every worker, as sketched below.
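
The extra hostnames in this sketch are purely illustrative placeholders:

cat > $HADOOP_HOME/etc/hadoop/slaves << 'EOF'
hdfs-host-placeholder
worker-2.example.com
worker-3.example.com
EOF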

6. Edit $HADOOP_HOME/etc/hadoop/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>mapreduce.tasktracker.map.tasks.maximum</name>
        <value>32</value>
    </property>
    <property>
        <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
        <value>32</value>
    </property>
</configuration>

Please see here for more configurable variables.

7. Edit $HADOOP_HOME/etc/hadoop/mapred-queues.xml

<?xml version="1.0"?>
<queues>
  <queue>
    <name>default</name>
    <properties>
    </properties>
    <state>running</state>
    <acl-submit-job> </acl-submit-job>
    <acl-administer-jobs> </acl-administer-jobs>
  </queue>
</queues>

8. Edit $HADOOP_HOME/etc/hadoop/yarn-site.xml

Here is an example.

<?xml version="1.0"?>
<configuration>
    <property>
        <name>yarn.web-proxy.address</name>
        <value>0.0.0.0:19987</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <!-- https://stackoverflow.com/questions/21005643/container-is-running-beyond-memory-limits  -->
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
        <description>Whether virtual memory limits will be enforced for containers</description>
    </property>

    <property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>4</value>
        <description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
    </property>

    <!-- https://community.cloudera.com/t5/Web-UI-Hue-Beeswax/Launcher-ERROR-reason-Main-class-org-apache-oozie-action-hadoop/td-p/27686 -->
    <!-- https://www.cloudera.com/documentation/enterprise/5-3-x/topics/cdh_ig_yarn_tuning.html -->
    <!-- as well as https://stackoverflow.com/questions/30828879 -->
    <!-- -->
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1024</value>
    </property>
    <property>
        <name>yarn.scheduler.increment-allocation-mb</name>
        <value>512</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>8192</value>
    </property>
    <!-- -->
    
    <!-- -->
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>2048</value>
    </property>
    <!-- -->

    <property>
        <name>yarn.scheduler.minimum-allocation-vcores</name>
        <value>1</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>32</value>
    </property>
    
    
    <!-- In case of the hard disk limitation -->
    <property>
      <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
      <value>100</value>
    </property>

    <!-- For Cluster -->
    <!-- ref: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml -->
    <!-- rest of the configurations: https://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide/ -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hdfs-host-placeholder</value>
    </property>
</configuration>

9. Initialize & format the HDFS name/data folders

# create the corresponding folders
mkdir -p /localscratch/hdfs/data/name
mkdir -p /localscratch/hdfs/data/data
# init name node
hdfs namenode -format
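
After formatting, the NameNode metadata directory should be populated; a quick check:

# expect a "current" folder containing fsimage and VERSION files
ls /localscratch/hdfs/data/name/current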

Create Management Scripts

1. Start script

mkdir -p /localscratch/hdfs/sbin
cat > /localscratch/hdfs/sbin/start-hdfs.sh << 'EOF'
#!/bin/bash
if [ -f /localscratch/hdfs/etc/hdfs-env.sh ]; then
    . /localscratch/hdfs/etc/hdfs-env.sh
    export PATH=$HADOOP_HOME/bin:$PATH
fi

echo "start: hadoop dfs"

$HADOOP_HOME/sbin/start-dfs.sh

echo "start: hadoop yarn"

$HADOOP_HOME/sbin/start-yarn.sh

EOF
chmod +x /localscratch/hdfs/sbin/start-hdfs.sh
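
Running the script should bring up all five daemons on a single node; jps gives a quick confirmation, and the NameNode and ResourceManager web UIs (ports 50070 and 8088 by default in Hadoop 2.x) should become reachable:

/localscratch/hdfs/sbin/start-hdfs.sh
# expect NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager
jps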

2. Stop script

mkdir -p /localscratch/hdfs/sbin
cat > /localscratch/hdfs/sbin/stop-hdfs.sh << 'EOF'
#!/bin/bash
if [ -f /localscratch/hdfs/etc/hdfs-env.sh ]; then
    . /localscratch/hdfs/etc/hdfs-env.sh
    export PATH=$HADOOP_HOME/bin:$PATH
fi

echo "stop: hadoop yarn"

$HADOOP_HOME/sbin/stop-yarn.sh

echo "stop: hadoop dfs"

$HADOOP_HOME/sbin/stop-dfs.sh

EOF
chmod +x /localscratch/hdfs/sbin/stop-hdfs.sh
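
Before shutting the cluster down, a small read/write round trip is a handy end-to-end smoke test; the file name here is arbitrary:

# simple smoke test while the cluster is running
echo "hello hdfs" > /tmp/hello.txt
hdfs dfs -mkdir -p /tmp
hdfs dfs -put -f /tmp/hello.txt /tmp/hello.txt
hdfs dfs -cat /tmp/hello.txt
# then stop everything
/localscratch/hdfs/sbin/stop-hdfs.sh
jps    # the Hadoop daemons should no longer be listed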
