Set up a Hadoop cluster

# Create a differencing VHD from the prepared base image and a VM for each node.
$vm_names = ( "master", "slave1", "slave2" )

foreach ($name in $vm_names) {
    New-VHD -ParentPath .\base-images\ubuntu-18.04.1-jdk-8u191-hadoop-3.1.1.vhdx -Path .\hadoop-$name.vhdx -Differencing
    New-VM -Name "hadoop-$name" -SwitchName "NAT Switch" -MemoryStartupBytes 1GB -VHDPath .\hadoop-$name.vhdx
}
  • Configure the network
hadoop-master    192.168.100.100
hadoop-slave1    192.168.100.101
hadoop-slave2    192.168.100.102
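So that later steps can refer to the nodes by hostname (hadoop-master, hadoop-slave1, hadoop-slave2), each VM needs name resolution for these addresses. A minimal sketch using /etc/hosts, assuming no separate DNS server is used:
$ sudo vim /etc/hosts
192.168.100.100    hadoop-master
192.168.100.101    hadoop-slave1
192.168.100.102    hadoop-slave2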
  • Create the hadoop user
$ sudo useradd -m hadoop -s /bin/bash
$ sudo passwd hadoop
$ sudo adduser hadoop sudo
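As an optional check (not part of the original steps), confirm the account exists and belongs to the sudo group:
$ id hadoop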
  • Configure a static IP and DNS on each VM

Using hadoop-slave1 as an example:

$ vim /etc/netplan/50-cloud-init.yaml
network:
    ethernets:
        eth0:
            addresses: [192.168.100.101/24]
            dhcp4: no
            nameservers:
                addresses: [8.8.8.8, 8.8.4.4]
    version: 2
  • Apply the changes
$ sudo netplan apply
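A quick way to confirm the static address and default route took effect (eth0 matches the interface name used in the netplan file above):
$ ip addr show eth0
$ ip route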
  • Configure the Java PATH
$ sudo vim /etc/profile.d/java-path.sh
export JAVA_HOME="/usr/local/share/jdk1.8.0_191"
if [ -d "$JAVA_HOME" ] ; then
    PATH="$JAVA_HOME/bin${PATH:+:${PATH}}"
fi
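After re-logging in (or sourcing the file in the current shell), java should resolve from the new location; a quick sanity check:
$ source /etc/profile.d/java-path.sh
$ java -version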
  • Configure the Hadoop PATH
$ sudo vim ~/.profile
export HADOOP_HOME="$HOME/hadoop-3.1.1"
if [ -d "$HADOOP_HOME" ]; then
    PATH="$HADOOP_HOME/bin:$HADOOP_HOME/sbin${PATH:+:${PATH}}"
fi
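Likewise, after re-login or sourcing ~/.profile, the Hadoop binaries should be on the PATH:
$ source ~/.profile
$ hadoop version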
  • Distribute SSH keys
$ ssh hadoop@hadoop-master
$ ssh-keygen -b 4096
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@hadoop-master
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@hadoop-slave1
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@hadoop-slave2
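If the keys were copied correctly, logging in to each node from hadoop-master should no longer prompt for a password; a quick test:
$ ssh hadoop@hadoop-slave1 hostname
$ ssh hadoop@hadoop-slave2 hostname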
  • Set JAVA_HOME for Hadoop
Although JAVA_HOME is already set as a system-wide environment variable, it is lost when the Hadoop processes start remote daemons over SSH, so it must be specified here explicitly.
$ sudo vim ~/hadoop-3.1.1/etc/hadoop/hadoop-env.sh
export JAVA_HOME="/usr/local/share/jdk1.8.0_191"
  • Configure core-site.xml
$ sudo vim ~/hadoop-3.1.1/etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://hadoop-master:9000</value>
    </property>
</configuration>
  • Configure hdfs-site.xml
$ sudo vim ~/hadoop-3.1.1/etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/hadoop-3.1.1/data/nameNode</value>
    </property>
     
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/hadoop-3.1.1/data/dataNode</value>
    </property>

    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
</configuration>
  • Configure mapred-site.xml
$ sudo vim ~/hadoop-3.1.1/etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
  • Configure yarn-site.xml
$ sudo vim ~/hadoop-3.1.1/etc/hadoop/yarn-site.xml
<configuration>
    <property>
        <name>yarn.acl.enable</name>
        <value>0</value>
    </property>
 
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop-master</value>
    </property>
 
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
  • Configure the workers file
$ sudo vim ~/hadoop-3.1.1/etc/hadoop/workers
hadoop-slave1
hadoop-slave2
  • Distribute the Hadoop configuration files to the slaves
$ vim ~/sync-slaves.sh
#!/bin/bash

# Copy the local Hadoop configuration to every worker node.
for node in hadoop-slave1 hadoop-slave2; do
    scp $HADOOP_HOME/etc/hadoop/* $node:$HADOOP_HOME/etc/hadoop/;
done
$ chmod +x ~/sync-slaves.sh
$ ./sync-slaves.sh
  • Format the filesystem
$ hdfs namenode -format
  • Start DFS
$ start-dfs.sh
Starting namenodes on [hadoop-master]
Starting datanodes
Starting secondary namenodes [ubuntu-1804]
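Note that start-dfs.sh only brings up the HDFS daemons, which is why no YARN processes appear in the jps output below; YARN can be started separately when needed (the start-all.sh call later in this guide covers both):
$ start-yarn.sh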
  • Check the processes on hadoop-master
$ jps
55572 Jps
45754 SecondaryNameNode
45455 NameNode
  • Check the processes on hadoop-slave1/2
$ jps
3344 DataNode
3537 Jps
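The same information can also be pulled from the command line; with both slaves running it should report two live DataNodes (optional check):
$ hdfs dfsadmin -report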
  • Check the cluster summary on the NameNode web UI (http://hadoop-master:9870)
Configured Capacity:	23.49 GB
Configured Remote Capacity:	0 B
DFS Used:	72 KB (0%)
Non DFS Used:	10.35 GB
DFS Remaining:	11.91 GB (50.69%)
Block Pool Used:	72 KB (0%)
DataNodes usages% (Min/Median/Max/stdDev):	0.00% / 0.00% / 0.00% / 0.00%
Live Nodes	2 (Decommissioned: 0, In Maintenance: 0)
Dead Nodes	0 (Decommissioned: 0, In Maintenance: 0)
Decommissioning Nodes	0
Entering Maintenance Nodes	0
Total Datanode Volume Failures	0 (0 B)
Number of Under-Replicated Blocks	0
Number of Blocks Pending Deletion	0
Block Deletion Start Time	Fri Jan 04 14:24:19 +0800 2019
Last Checkpoint Time	Fri Jan 04 14:25:28 +0800 2019
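Before moving on, a small smoke test that HDFS accepts writes; the paths below are illustrative, and the README.txt shipped at the top level of the Hadoop tarball is used as a sample file:
$ hdfs dfs -mkdir -p /user/hadoop
$ hdfs dfs -put $HADOOP_HOME/README.txt /
$ hdfs dfs -ls /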
  • Going further: configure the HDFS NFS gateway service

For the detailed official procedure, see HDFS NFS Gateway; here we take the most minimal approach and only adjust core-site.xml and hdfs-site.xml.

  • Install NFS and rpcbind
$ sudo apt install rpcbind
$ sudo apt install nfs-kernel-server
  • Configure Hadoop

Regarding the NFS gateway's proxy user, the official Apache documentation gives the following explanation:

The NFS-gateway uses proxy user to proxy all the users accessing the NFS mounts. In non-secure mode, the user running the gateway is the proxy user, while in secure mode the user in Kerberos keytab is the proxy user. Suppose the proxy user is ‘nfsserver’ and users belonging to the groups ‘users-group1’ and ‘users-group2’ use the NFS mounts, then in core-site.xml of the NameNode, the following two properities must be set and only NameNode needs restart after the configuration change (NOTE: replace the string ‘nfsserver’ with the proxy user name in your cluster):

The nfsserver in the documentation stands for the user that runs the current Hadoop cluster. Many posts on the web simply use root; if that is not the actual user on your system, the mount step later on will fail.
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://hadoop-master:9000</value>
    </property>
    
    <property>
        <name>hadoop.proxyuser.hadoop.groups</name>
        <value>*</value>
    </property>

    <property>
        <name>hadoop.proxyuser.hadoop.hosts</name>
        <value>*</value>
    </property>
</configuration>
  • Configure hdfs-site.xml
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/hadoop-3.1.1/data/nameNode</value>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/hadoop-3.1.1/data/dataNode</value>
    </property>
    
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>

    <property>
        <name>dfs.blocksize</name>
        <value>1m</value>
    </property>
    
    <property>
        <name>nfs.exports.allowed.hosts</name>
        <value>* rw</value>
    </property>

    <property>
        <name>nfs.superuser</name>
        <value>hadoop</value>
    </property>

    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
    </property>
</configuration>
The configuration above sets dfs.blocksize to 1 MB (the default is 128 MB), which reduces the space overhead for small files. dfs.permissions.enabled defaults to true, in which case uploading files through the web UI fails because of the permission checks; if you want web users to be able to upload files, either configure the appropriate user or set this property to false so that permissions are not enforced.
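To double-check which values are actually picked up after a restart, hdfs getconf can print individual keys (optional check):
$ hdfs getconf -confKey dfs.blocksize
$ hdfs getconf -confKey dfs.replication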
  • Create a script file (start-nfs-gw.sh) like the one below, which is responsible for starting the NFS gateway
#!/bin/bash

set -xe

# Stop the Hadoop daemons, the gateway's portmap/nfs3 daemons, and the
# system NFS/rpcbind services that would otherwise hold the NFS/portmap ports.
stop-all.sh

sudo $HADOOP_HOME/bin/hdfs --daemon stop portmap
$HADOOP_HOME/bin/hdfs --daemon stop nfs3

sudo service nfs-kernel-server stop
sudo service rpcbind stop

# Bring the cluster back up, then start the gateway's own portmap as root
# (it binds the privileged port 111) and nfs3 as the hadoop proxy user.
start-all.sh

sudo $HADOOP_HOME/bin/hdfs --daemon start portmap
$HADOOP_HOME/bin/hdfs --daemon start nfs3

# Show the running Java processes and the registered RPC services.
jps

rpcinfo -p localhost
  • Run the script
$ chmod +x ./start-nfs-gw.sh
$ ./start-nfs-gw.sh
++ stop-all.sh
WARNING: Stopping all Apache Hadoop daemons as hadoop in 10 seconds.
WARNING: Use CTRL-C to abort.
Stopping namenodes on [hadoop-master]
Stopping datanodes
Stopping secondary namenodes [ubuntu-1804]
Stopping nodemanagers
Stopping resourcemanager
++ sudo /home/hadoop/hadoop-3.1.1/bin/hdfs --daemon stop portmap
++ /home/hadoop/hadoop-3.1.1/bin/hdfs --daemon stop nfs3
++ sudo service nfs-kernel-server stop
++ sudo service rpcbind stop
++ start-all.sh
WARNING: Attempting to start all Apache Hadoop daemons as hadoop in 10 seconds.
WARNING: This is not a recommended production deployment configuration.
WARNING: Use CTRL-C to abort.
Starting namenodes on [hadoop-master]
Starting datanodes
Starting secondary namenodes [ubuntu-1804]
Starting resourcemanager
Starting nodemanagers
++ sudo /home/hadoop/hadoop-3.1.1/bin/hdfs --daemon start portmap
++ /home/hadoop/hadoop-3.1.1/bin/hdfs --daemon start nfs3
++ jps
12278 NameNode
12791 ResourceManager
13193 Nfs3
13228 Jps
12574 SecondaryNameNode
++ rpcinfo -p localhost
   program vers proto   port  service
    100005    3   udp   4242  mountd
    100005    1   tcp   4242  mountd
    100000    2   udp    111  portmapper
    100000    2   tcp    111  portmapper
    100005    3   tcp   4242  mountd
    100005    2   tcp   4242  mountd
    100003    3   tcp   2049  nfs
    100005    2   udp   4242  mountd
    100005    1   udp   4242  mountd
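Before mounting, the export can be verified from a client (the gateway exports the HDFS root):
$ showmount -e hadoop-master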
  • Try mounting HDFS
$ sudo mkdir -p /mnt/hdfs
$ sudo mount -v -t nfs -o vers=3,proto=tcp,nolock,noacl,sync hadoop-master:/ /mnt/hdfs
mount.nfs: timeout set for Tue Oct 15 08:08:37 2019
mount.nfs: trying text-based options 'vers=3,proto=tcp,nolock,noacl,addr=192.168.100.100'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: trying 192.168.100.100 prog 100003 vers 3 prot TCP port 2049
mount.nfs: prog 100005, trying vers=3, prot=6
mount.nfs: trying 192.168.100.100 prog 100005 vers 3 prot TCP port 4242
  • Try file operations: copying files and directories, deleting files, etc.
/mnt/hdfs$ ll
total 7
drwxr-xr-x 5 hadoop hadoop      160 Oct 16 08:46 ./
drwxr-xr-x 3 root   root       4096 Oct 16 08:12 ../
drwxrwxr-x 6 hadoop hadoop      192 Oct 16 08:34 Case_1/
drwxrwxr-x 7 hadoop hadoop      224 Oct 16 08:40 Case_2/
-rw-r--r-- 1 hadoop hadoop     1366 Jan  4  2019 README.txt
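As an illustrative check (the file names here are arbitrary), a file copied in through the mount should also be visible through the regular HDFS client, and can be removed again through the mount:
$ cp /etc/hostname /mnt/hdfs/hostname.txt
$ hdfs dfs -ls /
$ rm /mnt/hdfs/hostname.txt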