CuSO4_Deposit's Electrolytic Infodump

Big Data System Lab Note

Set up Hadoop environment

Apache Hadoop 3.3.6 – Hadoop: Setting up a Single Node Cluster.

set up java environment

java version requirements:

https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions

% curl -sL "https://builds.openlogic.com/downloadJDK/openlogic-openjdk/8u382-b05/openlogic-openjdk-8u382-b05-linux-x64.tar.gz" -o ./openjdk-8u382.tar.gz
% tar -xvf ./openjdk-8u382.tar.gz
% sudo mv ./openlogic-openjdk-8u382-b05-linux-x64 /usr/bin/openjdk-8u382
% vim ~/.zshrc

add java to path:

# ~/.zshrc

#...
export JAVA_PATH="/usr/bin/openjdk-8u382/bin"
export PATH=$PATH:$JAVA_PATH

test path:

% source ~/.zshrc
% java -version
openjdk version "1.8.0_382-382"
OpenJDK Runtime Environment (build 1.8.0_382-382-b05)
OpenJDK 64-Bit Server VM (build 25.382-b05, mixed mode)

requisite software

ssh is required; pdsh is recommended for better ssh resource management.

% sudo apt install ssh
% sudo apt install pdsh
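
To confirm both are installed and that the ssh daemon is running (a quick sanity check; the Ubuntu service name is assumed to be ssh):

% which ssh pdsh
% sudo systemctl status ssh --no-pager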

download hadoop

(switch to a mirror if download is too slow)

% curl -LO "https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz"
% tar -xvf ./hadoop-3.3.6.tar.gz
% sudo mv ./hadoop-3.3.6 /usr/local/bin/hadoop-3.3.6
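
Optionally, verify the tarball: Apache publishes a .sha512 checksum next to each release artifact on dlcdn, so, assuming the archive is still in the working directory:

% curl -LO "https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz.sha512"
% sha512sum -c ./hadoop-3.3.6.tar.gz.sha512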

prepare to start the Hadoop Cluster

edit hadoop-3.3.6/etc/hadoop/hadoop-env.sh. Hadoop will try to execute $JAVA_HOME/bin/java.

# etc/hadoop/hadoop-env.sh

# export JAVA_HOME=$JAVA_PATH/.. 
# The above works when testing bin/hadoop, but fails in start-dfs.sh,
# presumably because the daemons are launched over ssh in a non-interactive
# shell where ~/.zshrc (and hence $JAVA_PATH) is never sourced.
# related: https://stackoverflow.com/questions/40831151/hadoop-cannot-start-start-dfs-sh

export JAVA_HOME=/usr/bin/openjdk-8u382

Then try the following command; it should print the usage documentation of the hadoop script.

% bin/hadoop

Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
where CLASSNAME is a user-provided Java class

OPTIONS is none or any of:
...

Source hadoop-env.sh so that $JAVA_HOME is exported into the current shell.
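
A minimal check, assuming the shell is sitting in the Hadoop install directory:

% source etc/hadoop/hadoop-env.sh
% echo $JAVA_HOME
/usr/bin/openjdk-8u382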

Pseudo-Distributed Operation

Use the following:

etc/hadoop/core-site.xml:

<configuration>
  <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
  </property>
</configuration>

etc/hadoop/hdfs-site.xml:

<configuration>
  <property>
      <name>dfs.replication</name>
      <value>1</value>
  </property>
</configuration>
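
To confirm Hadoop actually picks these values up, hdfs getconf can query individual keys (a sanity check, not part of the official steps):

% bin/hdfs getconf -confKey fs.defaultFS
hdfs://localhost:9000
% bin/hdfs getconf -confKey dfs.replication
1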

set up passphraseless ssh

% ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:

% ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
% chmod 0600 ~/.ssh/authorized_keys
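
To verify that key-based login now works without any prompt, force ssh into non-interactive mode; BatchMode makes it fail instead of asking for a password:

% ssh -o BatchMode=yes localhost echo ok
ok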

yarn on single node

Format the filesystem

% bin/hdfs namenode -format
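
The NameNode metadata lands under dfs.namenode.name.dir, which defaults to ${hadoop.tmp.dir}/dfs/name, i.e. /tmp/hadoop-${user.name}/dfs/name; the path below assumes those defaults and our hadoop user:

% ls /tmp/hadoop-hadoop/dfs/name/current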

Try to start the NameNode and DataNode daemons; errors will be raised:

[hadoop@VM7617-BDS-Lab:/usr/local/bin/hadoop-3.3.6]
% sbin/start-dfs.sh
Starting namenodes on [localhost]
pdsh@VM7617-BDS-Lab: localhost: rcmd: socket: Permission denied
Starting datanodes
pdsh@VM7617-BDS-Lab: localhost: rcmd: socket: Permission denied
Starting secondary namenodes [VM7617-BDS-Lab]
pdsh@VM7617-BDS-Lab: VM7617-BDS-Lab: rcmd: socket: Permission denied
[hadoop@VM7617-BDS-Lab:/usr/local/bin/hadoop-3.3.6]
% sudo sbin/start-dfs.sh
Starting namenodes on [localhost]
ERROR: Attempting to operate on hdfs namenode as root
ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation.
Starting datanodes
ERROR: Attempting to operate on hdfs datanode as root
ERROR: but there is no HDFS_DATANODE_USER defined. Aborting operation.
Starting secondary namenodes [VM7617-BDS-Lab]
ERROR: Attempting to operate on hdfs secondarynamenode as root
ERROR: but there is no HDFS_SECONDARYNAMENODE_USER defined. Aborting operation.

According to this post on Stack Overflow, add these lines to etc/hadoop/hadoop-env.sh:

#
# To prevent accidents, shell commands can be (superficially) locked
# to only allow certain users to execute certain subcommands.
# It uses the format of (command)_(subcommand)_USER.
#
# For example, to limit who can execute the namenode command,
export HDFS_NAMENODE_USER=hadoop
export HDFS_DATANODE_USER=hadoop
export HDFS_SECONDARYNAMENODE_USER=hadoop
export YARN_RESOURCEMANAGER_USER=hadoop
export YARN_NODEMANAGER_USER=hadoop

sockets - Permission Denied error while running start-dfs.sh - Stack Overflow: this post explains how pdsh causes the first permission-denied problem (pdsh defaults to rsh as its rcmd module, and rsh fails here). Thus we should add an environment variable to zshrc / bashrc / …:

# ~/.zshrc

export PDSH_RCMD_TYPE=ssh
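
After re-sourcing the rc file, a quick way to confirm pdsh now goes through ssh (-w selects the target host):

% source ~/.zshrc
% pdsh -w localhost hostname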

Now it works:

% sbin/start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [VM7617-BDS-Lab]
VM7617-BDS-Lab: Warning: Permanently added 'vm7617-bds-lab' (ED25519) to the list of known hosts.
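
jps, which ships with the JDK, should now list a NameNode, a DataNode, and a SecondaryNameNode process (each line prefixed with its PID):

% jps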

Check the NameNode web interface at http://localhost:9870.
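
Without a browser, a header request is enough to see that the UI is up; expect an HTTP 200:

% curl -sI http://localhost:9870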

Make the HDFS directory required to execute MapReduce jobs.

% bin/hdfs dfs -mkdir -p /user/hadoop
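
Then confirm it exists; note that hdfs dfs -ls lists HDFS paths, not the local filesystem:

% bin/hdfs dfs -ls /user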

#Hadoop #java