This post introduces the basic process & configuration of deploying an HDFS cluster, along with some error diagnosis.
Please see this post for a collection of topics on distributed computing.
Let’s define /localscratch/hdfs as the root of all the packages, and we will keep everything in this folder. A few variables below use paths with this prefix; you may rename them to use your own path.
Let’s assume hdfs-host-placeholder is the domain of our nodes. If we only use the cluster locally, we can replace it with localhost.
Environment variables
First, let’s create a file named “hdfs-env.sh”:
mkdir -p /localscratch/hdfs/etc
cat > /localscratch/hdfs/etc/hdfs-env.sh << 'EOF'
# Hint: locate java.home with: java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home'
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/localscratch/hdfs/lib/hadoop # define absolute path for hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="${HADOOP_OPTS} -Djava.library.path=$HADOOP_COMMON_LIB_NATIVE_DIR"
EOF
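To sanity-check the environment file, we can source it and confirm that JAVA_HOME points at a real JDK (the path above is an example; adjust it to your distribution):
. /localscratch/hdfs/etc/hdfs-env.sh
"$JAVA_HOME/bin/java" -version   # should print the JDK version, e.g. openjdk 1.8.x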
We can edit our $HOME/.bash_profile and add a few lines as follows:
if [ -f /localscratch/hdfs/etc/hdfs-env.sh ]; then
. /localscratch/hdfs/etc/hdfs-env.sh
export PATH=$PATH:$HADOOP_HOME/bin
fi
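After opening a new shell (or sourcing the profile manually), the variables should be visible everywhere; a quick sketch of the check:
source $HOME/.bash_profile
echo $HADOOP_HOME   # expect /localscratch/hdfs/lib/hadoop
which hadoop        # resolves once the package is unpacked in the next step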
Download Package
Download hadoop-2.9.2.tar.gz and decompress it into /localscratch/hdfs/lib:
mkdir -p /localscratch/hdfs/lib
cd /localscratch/hdfs/lib
wget http://mirror.olnevhost.net/pub/apache/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
tar -xzvf hadoop-2.9.2.tar.gz
ln -sf hadoop-2.9.2 hadoop
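Before going further, it is worth verifying the unpacked distribution; assuming hdfs-env.sh has been sourced:
$HADOOP_HOME/bin/hadoop version   # should report Hadoop 2.9.2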
Update Configuration Files
1. Edit $HADOOP_HOME/etc/hadoop/core-site.xml
Here is an example:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hdfs-host-placeholder:9000</value>
  </property>
  <property>
    <name>fs.file.impl</name>
    <value>org.apache.hadoop.fs.LocalFileSystem</value>
    <description>The FileSystem for file: uris.</description>
  </property>
  <property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
    <description>The FileSystem for hdfs: uris.</description>
  </property>
</configuration>
fs.defaultFS defines the name of the default file system: a URI whose scheme and authority determine the FileSystem implementation. The URI’s scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class, and the URI’s authority is used to determine the host, port, etc. for a filesystem.
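For instance, with the fs.defaultFS above, a path without a scheme resolves against HDFS, while an explicit scheme overrides it. Once the cluster is running (started below):
hdfs dfs -ls /user         # resolves to hdfs://hdfs-host-placeholder:9000/user
hdfs dfs -ls file:///tmp   # the explicit file: scheme uses the local filesystem instead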
Please see here for more configurable variables.
2. Modify $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Here is an example:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///localscratch/hdfs/data/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///localscratch/hdfs/data/data</value>
  </property>
  <!--
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>hdfs-host-placeholder:2181</value>
  </property>
  -->
</configuration>
Please see here for more configurable variables.
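As a quick sanity check, hdfs getconf reads values straight from the configuration files, with no running cluster needed:
hdfs getconf -confKey dfs.replication   # expect 1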
3. Modify $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Add this snippet at the top:
if [ -f /localscratch/hdfs/etc/hdfs-env.sh ]; then
. /localscratch/hdfs/etc/hdfs-env.sh
fi
4. Modify $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml
Change the property yarn.scheduler.capacity.maximum-am-resource-percent from 0.1 to 0.5. This prevents jobs from hanging forever on a small cluster, where the default 10% cap can leave too little room for the ApplicationMasters to ever start. The updated property is shown below.
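For reference, the updated property in capacity-scheduler.xml would look like:
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
</property>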
5. Edit $HADOOP_HOME/etc/hadoop/slaves
This file lists the cluster members, written one hostname per line:
hdfs-host-placeholder
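One way to write the file, assuming the single-node layout used throughout this post:
cat > $HADOOP_HOME/etc/hadoop/slaves << EOF
hdfs-host-placeholder
EOF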
6. Edit $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.tasktracker.map.tasks.maximum</name>
    <value>32</value>
  </property>
  <property>
    <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
    <value>32</value>
  </property>
</configuration>
Please see here for more configurable variables.
7. Edit $HADOOP_HOME/etc/hadoop/mapred-queues.xml
<queues>
  <queue>
    <name>default</name>
    <state>running</state>
  </queue>
</queues>
8. Edit $HADOOP_HOME/etc/hadoop/yarn-site.xml
Here is an example:
<configuration>
  <property>
    <name>yarn.web-proxy.address</name>
    <value>0.0.0.0:19987</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
    <description>Whether virtual memory limits will be enforced for containers</description>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
    <description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.increment-allocation-mb</name>
    <value>512</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>32</value>
  </property>
  <property>
    <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hdfs-host-placeholder</value>
  </property>
</configuration>
9. Initialize & format the HDFS name/data folders
# create the corresponding folders
mkdir -p /localscratch/hdfs/data/name
mkdir -p /localscratch/hdfs/data/data
# init name node
hdfs namenode -format
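If formatting succeeds, the name directory should now contain a fresh image; a quick check against the dfs.name.dir set above:
ls /localscratch/hdfs/data/name/current   # expect VERSION and an fsimage_* file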
Create Management Scripts
1. Start scripts
mkdir -p /localscratch/hdfs/sbin
cat > /localscratch/hdfs/sbin/start-hdfs.sh << 'EOF'
#!/bin/bash
if [ -f /localscratch/hdfs/etc/hdfs-env.sh ]; then
. /localscratch/hdfs/etc/hdfs-env.sh
export PATH=$HADOOP_HOME/bin:$PATH
fi
echo "start: hadoop dfs"
$HADOOP_HOME/sbin/start-dfs.sh
echo "start: hadoop yarn"
$HADOOP_HOME/sbin/start-yarn.sh
EOF
chmod +x /localscratch/hdfs/sbin/start-hdfs.sh
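After running the start script, one way to confirm the daemons came up (jps lists running Java processes; the dfsadmin report shows live datanodes):
/localscratch/hdfs/sbin/start-hdfs.sh
jps                     # expect NameNode, DataNode, ResourceManager, NodeManager
hdfs dfsadmin -report   # expect one live datanode in this single-node setup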
2. Stop scripts
mkdir -p /localscratch/hdfs/sbin
cat > /localscratch/hdfs/sbin/stop-hdfs.sh << 'EOF'
#!/bin/bash
if [ -f /localscratch/hdfs/etc/hdfs-env.sh ]; then
. /localscratch/hdfs/etc/hdfs-env.sh
export PATH=$HADOOP_HOME/bin:$PATH
fi
echo "stop: hadoop yarn"
$HADOOP_HOME/sbin/stop-yarn.sh
echo "stop: hadoop dfs"
$HADOOP_HOME/sbin/stop-dfs.sh
EOF
chmod +x /localscratch/hdfs/sbin/stop-hdfs.sh
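Similarly, to bring the cluster down and confirm everything stopped:
/localscratch/hdfs/sbin/stop-hdfs.sh
jps   # the HDFS/YARN daemons should no longer be listed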