This post introduces the basic process & configuration of deploying an HDFS cluster, along with some error diagnosis.
Please see this post for a collection of topics on distributed computing.
Let's define /localscratch/hdfs as the root of all the packages; we will keep everything in this folder. A few variables below use paths with this prefix; you may replace them with your own path.
Let's assume hdfs-host-placeholder is the hostname of our nodes. For a purely local setup, we can replace it with localhost.
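If hdfs-host-placeholder does not resolve on your machines yet, one option for a single-node setup (a sketch, requires root) is to map it to the loopback address:

# check whether the name already resolves
getent hosts hdfs-host-placeholder
# map it to loopback for a purely local deployment
echo "127.0.0.1 hdfs-host-placeholder" | sudo tee -a /etc/hosts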
Environment variables
First, let's create a file named hdfs-env.sh:
mkdir -p /localscratch/hdfs/etc
cat > /localscratch/hdfs/etc/hdfs-env.sh << 'EOF'
# How to get the java home?
#   java -XshowSettings:properties -version 2>&1 > /dev/null | grep 'java.home'
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# absolute path of the hadoop installation
export HADOOP_HOME=/localscratch/hdfs/lib/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="${HADOOP_OPTS} -Djava.library.path=$HADOOP_COMMON_LIB_NATIVE_DIR"
EOF
We can edit our $HOME/.bash_profile and add a few lines as follows:
if [ -f /localscratch/hdfs/etc/hdfs-env.sh ]; then
    . /localscratch/hdfs/etc/hdfs-env.sh
    export PATH=$PATH:$HADOOP_HOME/bin
fi
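To confirm the profile hook works, open a new shell (or source the profile) and check the variables; the hadoop binary will only resolve once the package from the next section is in place:

source $HOME/.bash_profile
echo $HADOOP_HOME   # expect: /localscratch/hdfs/lib/hadoop
which hadoop        # expect: /localscratch/hdfs/lib/hadoop/bin/hadoop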
Download Package
Download hadoop-2.9.2.tar.gz and decompress it into /localscratch/hdfs/lib:
mkdir -p /localscratch/hdfs/lib
cd /localscratch/hdfs/lib
wget http://mirror.olnevhost.net/pub/apache/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
tar -xzvf hadoop-2.9.2.tar.gz
ln -sf hadoop-2.9.2 hadoop
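As a sanity check, the unpacked distribution should identify itself as 2.9.2:

/localscratch/hdfs/lib/hadoop/bin/hadoop version
# expect the first line: Hadoop 2.9.2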
Update Configuration Files
1. Edit $HADOOP_HOME/etc/hadoop/core-site.xml
Here is an example.
<?xml version="1.0" encoding="UTF-8"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.defaultFS</name> <value>hdfs://hdfs-host-placeholder:9000</value> </property> <!-- workaround for old hadoop--> <property> <name>fs.file.impl</name> <value>org.apache.hadoop.fs.LocalFileSystem</value> <description>The FileSystem for file: uris.</description> </property> <property> <name>fs.hdfs.impl</name> <value>org.apache.hadoop.hdfs.DistributedFileSystem</value> <description>The FileSystem for hdfs: uris.</description> </property> </configuration>
fs.defaultFS defines the name of the default file system: a URI whose scheme and authority determine the FileSystem implementation. The URI's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class, and the URI's authority is used to determine the host, port, etc. for the filesystem.
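In practice this means the scheme and authority can be omitted from HDFS paths. Once the cluster is up, the following two commands should be equivalent:

hdfs dfs -ls /
hdfs dfs -ls hdfs://hdfs-host-placeholder:9000/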
Please see here for more configurable variables.
2. Modify $HADOOP_HOME/etc/hadoop/hdfs-site.xml as follows:
Here is an example.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///localscratch/hdfs/data/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///localscratch/hdfs/data/data</value>
  </property>
  <!-- ZooKeeper -->
  <!-- Please see https://hadoop.apache.org/docs/r2.9.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#Configuring_automatic_failover
       for a further explanation of using ZooKeeper. -->
  <!--
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>hdfs-host-placeholder:2181</value>
  </property>
  -->
</configuration>
Please see here for more configurable variables.
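With dfs.replication set to 1, each block keeps a single copy. Once the cluster is running, the effective replication factor can be checked or overridden per file; a small sketch (the path is hypothetical):

# print the replication factor of a file
hdfs dfs -stat %r /path/to/file
# raise it to 2 and wait for the copies to be created
hdfs dfs -setrep -w 2 /path/to/file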
3. Modify $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Add this snippet at the top:
if [ -f /localscratch/hdfs/etc/hdfs-env.sh ]; then
    . /localscratch/hdfs/etc/hdfs-env.sh
fi
4. Modify $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml
Change the property yarn.scheduler.capacity.maximum-am-resource-percent from 0.1 to 0.5. This prevents jobs from hanging forever on a small cluster, where the default 10% cap on ApplicationMaster resources may leave no room for any AM to launch.
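For reference, after the change the corresponding entry in capacity-scheduler.xml would look like:

<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
  <description>Maximum percent of resources in the cluster which can be used
  to run application masters.</description>
</property>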
5. Edit $HADOOP_HOME/etc/hadoop/slaves
This file lists the worker members of the cluster.
hdfs-host-placeholder
The members are listed one per line.
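For a multi-node cluster, list every worker host on its own line; the hostnames below are hypothetical:

hdfs-worker-01
hdfs-worker-02
hdfs-worker-03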
6. Edit $HADOOP_HOME/etc/hadoop/mapred-site.xml
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>mapreduce.tasktracker.map.tasks.maximum</name> <value>32</value> </property> <property> <name>mapreduce.tasktracker.reduce.tasks.maximum</name> <value>32</value> </property> </configuration>
Please see here for more configurable variables.
7. Edit $HADOOP_HOME/etc/hadoop/mapred-queues.xml
<?xml version="1.0"?> <queues> <queue> <name>default</name> <properties> </properties> <state>running</state> <acl-submit-job> </acl-submit-job> <acl-administer-jobs> </acl-administer-jobs> </queue> </queues>
8. Edit $HADOOP_HOME/etc/hadoop/yarn-site.xml
Here is an example.
<?xml version="1.0"?> <configuration> <property> <name>yarn.web-proxy.address</name> <value>0.0.0.0:19987</value> </property> <property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value> </property> <!-- https://stackoverflow.com/questions/21005643/container-is-running-beyond-memory-limits --> <property> <name>yarn.nodemanager.vmem-check-enabled</name> <value>false</value> <description>Whether virtual memory limits will be enforced for containers</description> </property> <property> <name>yarn.nodemanager.vmem-pmem-ratio</name> <value>4</value> <description>Ratio between virtual memory to physical memory when setting memory limits for containers</description> </property> <!-- https://community.cloudera.com/t5/Web-UI-Hue-Beeswax/Launcher-ERROR-reason-Main-class-org-apache-oozie-action-hadoop/td-p/27686 --> <!-- https://www.cloudera.com/documentation/enterprise/5-3-x/topics/cdh_ig_yarn_tuning.html --> <!-- as well as https://stackoverflow.com/questions/30828879 --> <!-- --> <property> <name>yarn.scheduler.minimum-allocation-mb</name> <value>1024</value> </property> <property> <name>yarn.scheduler.increment-allocation-mb</name> <value>512</value> </property> <property> <name>yarn.scheduler.maximum-allocation-mb</name> <value>8192</value> </property> <!-- --> <!-- --> <property> <name>yarn.nodemanager.resource.memory-mb</name> <value>2048</value> </property> <!-- --> <property> <name>yarn.scheduler.minimum-allocation-vcores</name> <value>1</value> </property> <property> <name>yarn.scheduler.maximum-allocation-vcores</name> <value>32</value> </property> <!-- In case of the hard disk limitation --> <property> <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name> <value>100</value> </property> <!-- For Cluster --> <!-- ref: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml --> <!-- rest of the configurations: https://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide/ --> <property> <name>yarn.resourcemanager.hostname</name> <value>hdfs-host-placeholder</value> </property> </configuration>
9. Initialize & format the HDFS name/data folders
# create the corresponding folders
mkdir -p /localscratch/hdfs/data/name
mkdir -p /localscratch/hdfs/data/data
# initialize the name node
hdfs namenode -format
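If the format step succeeded, the name directory should now contain metadata under current/, including a VERSION file recording the namespace and cluster IDs:

cat /localscratch/hdfs/data/name/current/VERSION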
Create Management Scripts
1. Start script
mkdir -p /localscratch/hdfs/sbin
cat > /localscratch/hdfs/sbin/start-hdfs.sh << 'EOF'
#!/bin/bash
if [ -f /localscratch/hdfs/etc/hdfs-env.sh ]; then
    . /localscratch/hdfs/etc/hdfs-env.sh
    export PATH=$HADOOP_HOME/bin:$PATH
fi
echo "start: hadoop dfs"
$HADOOP_HOME/sbin/start-dfs.sh
echo "start: hadoop yarn"
$HADOOP_HOME/sbin/start-yarn.sh
EOF
chmod +x /localscratch/hdfs/sbin/start-hdfs.sh
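After running start-hdfs.sh, jps (bundled with the JDK) gives a quick view of whether the expected daemons came up:

jps
# expected on a single-node setup (pids will differ):
#   NameNode, SecondaryNameNode, DataNode, ResourceManager, NodeManager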
2. Stop script
mkdir -p /localscratch/hdfs/sbin
cat > /localscratch/hdfs/sbin/stop-hdfs.sh << 'EOF'
#!/bin/bash
if [ -f /localscratch/hdfs/etc/hdfs-env.sh ]; then
    . /localscratch/hdfs/etc/hdfs-env.sh
    export PATH=$HADOOP_HOME/bin:$PATH
fi
echo "stop: hadoop yarn"
$HADOOP_HOME/sbin/stop-yarn.sh
echo "stop: hadoop dfs"
$HADOOP_HOME/sbin/stop-dfs.sh
EOF
chmod +x /localscratch/hdfs/sbin/stop-hdfs.sh
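Finally, with the cluster up, a minimal end-to-end smoke test (the file names here are just examples) could look like:

/localscratch/hdfs/sbin/start-hdfs.sh
echo "hello hdfs" > /tmp/hello.txt
hdfs dfs -mkdir -p /user/$USER
hdfs dfs -put -f /tmp/hello.txt /user/$USER/
hdfs dfs -cat /user/$USER/hello.txt   # expect: hello hdfs
/localscratch/hdfs/sbin/stop-hdfs.sh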