Detailed tutorial on deploying hadoop clusters using docker

  • 2021-09-05 01:17:25
  • OfStack

Recently we needed to set up a hadoop test cluster at our company, so I used docker to deploy the cluster quickly.

0. Preface

There are already many tutorials online, but many of them have pitfalls, so here is a record of my installation process.

Goal: use docker to build a hadoop 2.7.7 cluster of 3 machines, with 1 master and 2 slaves.

Preparation:

First of all, you need a centos7 machine with more than 8G of memory. I used an Alibaba Cloud host.

Next, upload the jdk and hadoop packages to the server.

I installed hadoop 2.7.7, and I have prepared the package for you. Link: https://pan.baidu.com/s/15n_W-1rqOd2cUzhfvbkH4g Extraction code: vmzw.
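
If the Baidu netdisk link is awkward to use, the hadoop tarball can normally be fetched from the Apache archive as well (the URL below assumes the standard archive layout); the jdk tarball still has to be downloaded from Oracle manually:


# Optional alternative: download hadoop 2.7.7 directly from the Apache archive
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz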

1. Steps

It is roughly divided into the following steps:

  • Install docker
  • Prepare the basic environment
  • Configure the network and start the docker containers
  • Configure hosts and passwordless ssh login
  • Install and configure hadoop

1.1 Installing docker

Perform the following steps in turn to install docker. If you already have a docker environment, this can be skipped.


yum update

yum install -y yum-utils device-mapper-persistent-data lvm2

yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo

yum install -y docker-ce
 
systemctl start docker

docker -v
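
As an optional sanity check (not part of the original steps), you can also make docker start on boot and run the hello-world image to confirm the daemon can run containers:


# Optional: enable docker at boot and verify the installation
systemctl enable docker
docker run --rm hello-world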

1.2 Basic Environment Preparation

1.2.1 Create the base centos7 image

Pull the official centos7 image:


docker pull centos

Next, build a centos image with ssh enabled from a Dockerfile.

Create a Dockerfile:


vi Dockerfile

Write the following into the Dockerfile:


FROM centos
MAINTAINER mwf

RUN yum install -y openssh-server sudo
RUN sed -i 's/UsePAM yes/UsePAM no/g' /etc/ssh/sshd_config
RUN yum install -y openssh-clients

RUN echo "root:qwe123" | chpasswd
RUN echo "root  ALL=(ALL)    ALL" >> /etc/sudoers
RUN ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key
RUN ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key

RUN mkdir /var/run/sshd
EXPOSE 22
CMD ["/usr/sbin/sshd", "-D"]

Based on the centos image, this sets the root password to qwe123, installs the ssh service, and starts it.

Build the image from the Dockerfile:


docker build -t="centos7-ssh" .

This generates an image named centos7-ssh, which you can view with docker images.

1.2.2 Build an image with hadoop and jdk

Place the prepared packages, hadoop-2.7.7.tar.gz and jdk-8u202-linux-x64.tar.gz, in the current directory, then build a centos image with hadoop and jdk from a new Dockerfile.

A Dockerfile was already created above, so move it out of the way first: mv Dockerfile Dockerfile.bak

Create a new Dockerfile:


vi Dockerfile

Write the following:


FROM centos7-ssh
ADD jdk-8u202-linux-x64.tar.gz /usr/local/
RUN mv /usr/local/jdk1.8.0_202 /usr/local/jdk1.8
ENV JAVA_HOME /usr/local/jdk1.8
ENV PATH $JAVA_HOME/bin:$PATH

ADD hadoop-2.7.7.tar.gz /usr/local
RUN mv /usr/local/hadoop-2.7.7 /usr/local/hadoop
ENV HADOOP_HOME /usr/local/hadoop
ENV PATH $HADOOP_HOME/bin:$PATH

RUN yum install -y which sudo

In short, the above builds on the centos7-ssh image generated earlier, adds the hadoop and jdk packages, and configures the environment variables.

Build the image from the Dockerfile:


docker build -t="hadoop" .

An image named hadoop will be generated.
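
At this point both custom images should exist locally; a quick, optional look at docker images confirms it (IDs and sizes will differ on your machine):


# Both centos7-ssh and hadoop should show up in the list
docker images | grep -E 'centos7-ssh|hadoop'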

1.3 Configure the network and start the docker container

Because the machines in the cluster must be able to reach each other, configure the network first.

Create a network


docker network create --driver bridge hadoop-br

The above command creates a bridge network named hadoop-br.

Specify the network when starting docker


docker run -itd --network hadoop-br --name hadoop1 -p 50070:50070 -p 8088:8088 hadoop
docker run -itd --network hadoop-br --name hadoop2 hadoop
docker run -itd --network hadoop-br --name hadoop3 hadoop

The above commands start 3 containers on the hadoop-br network; hadoop1 additionally maps ports 50070 and 8088 to the host.
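
Before going further it is worth checking, optionally, that all three containers are actually running; docker ps should show hadoop1 with its port mappings plus hadoop2 and hadoop3:


# Verify the three containers are up and hadoop1 has its ports mapped
docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'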

View the network


docker network inspect hadoop-br

Execute the above command to see the corresponding network information:


[
    {
        "Name": "hadoop-br",
        "Id": "88b7839f412a140462b87a353769e8091e92b5451c47b5c6e7b44a1879bc7c9a",
        "Containers": {
            "86e52eb15351114d45fdad4462cc2050c05202554849bedb8702822945268631": {
                "Name": "hadoop1",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": ""
            },
            "9baa1ff183f557f180da2b7af8366759a0d70834f43d6b60fba2e64f340e0558": {
                "Name": "hadoop2",
                "IPv4Address": "172.18.0.3/16",
                "IPv6Address": ""
            },
            "e18a3166e965a81d28b4fe5168d1f0c3df1cb9f7e0cbe0673864779b224c8a7f": {
                "Name": "hadoop3",
                "IPv4Address": "172.18.0.4/16",
                "IPv6Address": ""
            }
        }
    }
]

From this we can see the ip of each of the three machines:


172.18.0.2 hadoop1 
172.18.0.3 hadoop2 
172.18.0.4 hadoop3 
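
If you only need the IP of a single container, the Go-template form of docker inspect prints it directly (an optional shortcut, not required by the tutorial):


# Print just the IP address of hadoop1
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' hadoop1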

Log in to the docker containers and they should be able to ping each other.


docker exec -it hadoop1 bash
ping 172.18.0.3
ping 172.18.0.4

1.4 Configure hosts and passwordless ssh login

1.4.1 Configure hosts

Modify /etc/hosts on each machine separately:


vi /etc/hosts

Write the following (note: the ip addresses assigned by docker may differ from person to person, so fill in your own):


172.18.0.2 hadoop1
172.18.0.3 hadoop2
172.18.0.4 hadoop3

1.4.2 Passwordless ssh login

Because the ssh service has already been installed in the image, generate a key on each machine and copy it to all three nodes; enter the root password qwe123 (set in the Dockerfile) when prompted:


ssh-keygen -t rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop1
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop2
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop3

1.4.3 Test that the configuration works


ping hadoop1 
ping hadoop2
ping hadoop3
ssh hadoop1
ssh hadoop2
ssh hadoop3

1.5 Installing and Configuring hadoop

1.5.1 Operations on hadoop1

Enter hadoop1


docker exec -it hadoop1 bash

Create a few folders; they will be used in the configuration below:


mkdir /home/hadoop
mkdir /home/hadoop/tmp /home/hadoop/hdfs_name /home/hadoop/hdfs_data

Switch to the hadoop configuration directory:


cd $HADOOP_HOME/etc/hadoop/

Edit core-site.xml (the properties below go inside the <configuration> element):


<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop1:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/home/hadoop/tmp</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131702</value>
  </property>

Edit hdfs-site.xml:


 <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/hadoop/hdfs_name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/hadoop/hdfs_data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop1:9001</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>

Edit mapred-site.xml

mapred-site.xml does not exist by default, so create it from the template first: cp mapred-site.xml.template mapred-site.xml


 <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop1:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop1:19888</value>
  </property>

Edit yarn-site.xml:


 <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>hadoop1:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>hadoop1:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>hadoop1:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>hadoop1:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>hadoop1:8088</value>
  </property>

Edit slaves

Here hadoop1 is the master node and hadoop2 and hadoop3 are the slave nodes:


hadoop2
hadoop3

Copy files to hadoop2 and hadoop3

Execute the following commands in turn:


scp -r $HADOOP_HOME/ hadoop2:/usr/local/
scp -r $HADOOP_HOME/ hadoop3:/usr/local/

scp -r /home/hadoop hadoop2:/home/
scp -r /home/hadoop hadoop3:/home/
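
As an optional check that the copies landed where expected (the paths below are simply the scp targets used above):


# Confirm the hadoop install and the /home/hadoop directories exist on the slaves
ssh hadoop2 "ls /usr/local/hadoop/etc/hadoop/slaves && ls -d /home/hadoop"
ssh hadoop3 "ls /usr/local/hadoop/etc/hadoop/slaves && ls -d /home/hadoop"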

1.5.2 Operations on each machine

Connect to each machine separately:


docker exec -it hadoop1 bash
docker exec -it hadoop2 bash
docker exec -it hadoop3 bash

Configure environment variables for the hadoop sbin directory

The hadoop bin directory was added to PATH when the image was built, but the sbin directory was not, so configure it separately on each machine:


vi ~/.bashrc

Add the following:


export PATH=$PATH:$HADOOP_HOME/sbin

Then apply it:


source ~/.bashrc
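
To confirm the sbin scripts are now on the PATH, a small optional check:


# start-all.sh should resolve to $HADOOP_HOME/sbin/start-all.sh
which start-all.sh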

1.5.3 Start hadoop

Execute the following commands on hadoop1.

Format hdfs:


hdfs namenode -format

One-command start:


start-all.sh
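
start-all.sh is deprecated in hadoop 2.x but still works; if you prefer, the same thing can be done in two steps, which makes it easier to see whether HDFS or YARN is the part that fails:


# Equivalent to start-all.sh: start HDFS first, then YARN
start-dfs.sh
start-yarn.sh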

If nothing goes wrong, you can celebrate. If something does go wrong, keep at it.

1.6 Test hadoop

Run jps on each machine to check the processes:


# hadoop1
1748 Jps
490 NameNode
846 ResourceManager
686 SecondaryNameNode

# hadoop2
400 DataNode
721 Jps
509 NodeManager

# hadoop3
425 NodeManager
316 DataNode
591 Jps

Upload a file


hdfs dfs -mkdir /mwf

echo hello > a.txt
hdfs dfs -put a.txt /mwf

hdfs dfs -ls /mwf

Found 1 items
drwxr-xr-x  - root supergroup     0 2020-09-04 11:14 /mwf
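
To exercise YARN and MapReduce as well, you can optionally run the bundled wordcount example on the file just uploaded (the jar path assumes the default hadoop 2.7.7 layout, and /mwf-out is an arbitrary output directory chosen here):


# Run the sample wordcount job and print its output
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount /mwf /mwf-out
hdfs dfs -cat /mwf-out/part-r-00000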

Because this is a cloud server and I didn't want to open more ports, I didn't check the web UI.
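
The cluster can still be checked from the command line without the web UI; these two commands roughly correspond to the 50070 and 8088 pages:


# Report live datanodes and YARN node managers from the command line
hdfs dfsadmin -report
yarn node -list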

2. Finally

The above is the process I summarized after a successful installation. It should work as described, though there may be omissions.

3. Reference

https://cloud.tencent.com/developer/article/1084166

https://cloud.tencent.com/developer/article/1084157?from=10680

https://blog.csdn.net/ifenggege/article/details/108396249

