This post introduces the basic process and configuration for deploying an HDFS cluster, along with some error diagnosis.
Please see this post for a collection of topics on distributed computing.
Let’s define /localscratch/hdfs as the root of all the packages; we will keep everything in this folder. Several variables below use paths with this prefix, and you may replace them with your own path.
Let’s assume hdfs-host-placeholder is the hostname of our nodes. If we only run locally, we can replace it with localhost.
Environment variables
First, let’s create a file named hdfs-env.sh:
```bash
mkdir -p /localscratch/hdfs/etc
# quote 'EOF' so that $HADOOP_HOME below is written literally
# instead of being expanded (likely to empty) at creation time
cat > /localscratch/hdfs/etc/hdfs-env.sh << 'EOF'
# how to get java home?
# java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home'
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# define absolute path for hadoop
export HADOOP_HOME=/localscratch/hdfs/lib/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="${HADOOP_OPTS} -Djava.library.path=$HADOOP_COMMON_LIB_NATIVE_DIR"
EOF
```
We can edit our $HOME/.bash_profile and add a few lines as follows:
```bash
if [ -f /localscratch/hdfs/etc/hdfs-env.sh ]; then
    . /localscratch/hdfs/etc/hdfs-env.sh
    export PATH=$PATH:$HADOOP_HOME/bin
fi
```
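After re-logging in (or sourcing the profile by hand), the variables should resolve. A quick sanity check (the hadoop binary itself only becomes available after the download step below):

```bash
. $HOME/.bash_profile
echo $JAVA_HOME      # expect: /usr/lib/jvm/java-8-openjdk-amd64
echo $HADOOP_HOME    # expect: /localscratch/hdfs/lib/hadoop
```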
Download Package
Download hadoop-2.9.2.tar.gz and extract it into /localscratch/hdfs/lib:
```bash
mkdir -p /localscratch/hdfs/lib
cd /localscratch/hdfs/lib
wget http://mirror.olnevhost.net/pub/apache/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
tar -xzvf hadoop-2.9.2.tar.gz
ln -sf hadoop-2.9.2 hadoop
```
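A quick check that the package unpacked correctly (checknative may report missing optional codecs, which is harmless for this setup):

```bash
hadoop version          # expect: Hadoop 2.9.2
hadoop checknative -a   # shows which native libraries could be loaded
```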
Update Configuration Files
1. Edit $HADOOP_HOME/etc/hadoop/core-site.xml
Here is an example:
```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hdfs-host-placeholder:9000</value>
  </property>
  <property>
    <name>fs.file.impl</name>
    <value>org.apache.hadoop.fs.LocalFileSystem</value>
    <description>The FileSystem for file: uris.</description>
  </property>
  <property>
    <name>fs.hdfs.impl</name>
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
    <description>The FileSystem for hdfs: uris.</description>
  </property>
</configuration>
```
fs.defaultFS defines the name of the default file system: a URI whose scheme and authority determine the FileSystem implementation. The URI’s scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class, and the URI’s authority is used to determine the host, port, etc. for a filesystem.
Please see here for more configurable variables.
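As a quick illustration of the scheme-to-implementation mapping described above (this assumes the cluster is already up and running):

```bash
hdfs dfs -ls hdfs://hdfs-host-placeholder:9000/   # resolved via fs.hdfs.impl
hdfs dfs -ls file:///localscratch/hdfs            # resolved via fs.file.impl
hdfs dfs -ls /                                    # no scheme: falls back to fs.defaultFS
```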
2. Modify $HADOOP_HOME/etc/hadoop/hdfs-site.xml as follow:
Here is an example:
```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///localscratch/hdfs/data/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///localscratch/hdfs/data/data</value>
  </property>
</configuration>
```
Please see here for more configurable variables.
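Once core-site.xml and hdfs-site.xml are in place, the effective values can be sanity-checked from the shell (hdfs getconf reads the same configuration directory):

```bash
hdfs getconf -confKey fs.defaultFS      # expect: hdfs://hdfs-host-placeholder:9000
hdfs getconf -confKey dfs.replication   # expect: 1
```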
3. Modify $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Add this snippet at the top:
```bash
if [ -f /localscratch/hdfs/etc/hdfs-env.sh ]; then
    . /localscratch/hdfs/etc/hdfs-env.sh
fi
```
4. Modify $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml
Change the property yarn.scheduler.capacity.maximum-am-resource-percent from 0.1 to 0.5. On a small cluster, the default 0.1 leaves too little room for ApplicationMasters, which can cause submitted jobs to hang and never return; the edited property is shown below.
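For reference, the property should end up looking like this (0.5 is the suggested value, not the shipped default):

```xml
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
</property>
```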
5. Edit $HADOOP_HOME/etc/hadoop/slaves
This file lists the members of the cluster, written one hostname per line:

```
hdfs-host-placeholder
```
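The stock start-dfs.sh / start-yarn.sh helpers ssh into every host listed in this file, so passwordless SSH to each member (including localhost for a single-node setup) is typically required. A minimal sketch, assuming no key pair exists yet:

```bash
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa    # create a key without a passphrase
ssh-copy-id hdfs-host-placeholder           # append the public key to authorized_keys
ssh hdfs-host-placeholder true              # should return without a password prompt
```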
6. Edit $HADOOP_HOME/etc/hadoop/mapred-site.xml
```xml
<configuration>
  <property>
    <name>mapreduce.tasktracker.map.tasks.maximum</name>
    <value>32</value>
  </property>
  <property>
    <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
    <value>32</value>
  </property>
</configuration>
```
Please see here for more configurable variables.
7. Edit $HADOOP_HOME/etc/hadoop/mapred-queues.xml
```xml
<queues>
  <queue>
    <name>default</name>
    <state>running</state>
  </queue>
</queues>
```
8. Edit $HADOOP_HOME/etc/hadoop/yarn-site.xml
Here is an example:
```xml
<configuration>
  <property>
    <name>yarn.web-proxy.address</name>
    <value>0.0.0.0:19987</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
    <description>Whether virtual memory limits will be enforced for containers</description>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
    <description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.increment-allocation-mb</name>
    <value>512</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>32</value>
  </property>
  <property>
    <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hdfs-host-placeholder</value>
  </property>
</configuration>
```
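A quick sanity check on the numbers above: each NodeManager advertises yarn.nodemanager.resource.memory-mb = 2048 MB, and the scheduler allocates at least yarn.scheduler.minimum-allocation-mb = 1024 MB per container, so a node can host at most 2048 / 1024 = 2 minimum-size containers. Once the cluster is up, per-node resources can be inspected with:

```bash
yarn node -list -all          # node states and running-container counts
yarn node -status <node-id>   # detailed memory/vcore report for one node
```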
9. Initialize and format the HDFS name/data folders
```bash
# create the corresponding folders
mkdir -p /localscratch/hdfs/data/name
mkdir -p /localscratch/hdfs/data/data
# init name node
hdfs namenode -format
```
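Note: hdfs namenode -format generates a fresh clusterID. If you later re-format the NameNode while DataNodes still hold their old data directories, the DataNodes will refuse to start with an "Incompatible clusterIDs" error. A sketch of a full re-initialization (this destroys all HDFS data, so use it only to rebuild from scratch):

```bash
# wipe both metadata and block storage before re-formatting
rm -rf /localscratch/hdfs/data/name/* /localscratch/hdfs/data/data/*
hdfs namenode -format
```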
Create Management Scripts
1. Start script
```bash
mkdir -p /localscratch/hdfs/sbin
# quote 'EOF' so $HADOOP_HOME is resolved when the script runs,
# not when the script file is created
cat > /localscratch/hdfs/sbin/start-hdfs.sh << 'EOF'
#!/bin/bash
if [ -f /localscratch/hdfs/etc/hdfs-env.sh ]; then
    . /localscratch/hdfs/etc/hdfs-env.sh
    export PATH=$HADOOP_HOME/bin:$PATH
fi
echo "start: hadoop dfs"
$HADOOP_HOME/sbin/start-dfs.sh
echo "start: hadoop yarn"
$HADOOP_HOME/sbin/start-yarn.sh
EOF
chmod +x /localscratch/hdfs/sbin/start-hdfs.sh
```
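After running the start script, the running daemons can be listed with jps (shipped with the JDK); on a single-node setup you should see something like:

```bash
/localscratch/hdfs/sbin/start-hdfs.sh
jps
# typical output (PIDs will differ):
#   NameNode
#   SecondaryNameNode
#   DataNode
#   ResourceManager
#   NodeManager
```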
2. Stop script
```bash
mkdir -p /localscratch/hdfs/sbin
# same heredoc quoting as the start script
cat > /localscratch/hdfs/sbin/stop-hdfs.sh << 'EOF'
#!/bin/bash
if [ -f /localscratch/hdfs/etc/hdfs-env.sh ]; then
    . /localscratch/hdfs/etc/hdfs-env.sh
    export PATH=$HADOOP_HOME/bin:$PATH
fi
echo "stop: hadoop yarn"
$HADOOP_HOME/sbin/stop-yarn.sh
echo "stop: hadoop dfs"
$HADOOP_HOME/sbin/stop-dfs.sh
EOF
chmod +x /localscratch/hdfs/sbin/stop-hdfs.sh
```
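Finally, a small end-to-end smoke test tying the pieces together (the /tmp/smoke path is an arbitrary example):

```bash
/localscratch/hdfs/sbin/start-hdfs.sh
hdfs dfs -mkdir -p /tmp/smoke
hdfs dfs -put /etc/hostname /tmp/smoke/
hdfs dfs -ls /tmp/smoke      # the uploaded file should be listed
/localscratch/hdfs/sbin/stop-hdfs.sh
```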