Detailed tutorial on deploying a Hadoop cluster with Docker
- 2021-09-05 01:17:25
- OfStack
Recently, our company wanted to set up a Hadoop test cluster, so I used Docker to deploy one quickly.
0. Preface
There are already many tutorials online, but many of them have pitfalls, so I am recording my installation process here.
Goal: use Docker to build a Hadoop 2.7.7 cluster with 1 master and 2 slaves, 3 machines in total.
Preparation:
First, you need a CentOS 7 machine with more than 8 GB of memory. I use an Alibaba Cloud host.
Next, upload the JDK and Hadoop packages to the server.
I installed Hadoop 2.7.7 and have prepared the packages for you. Link: https://pan.baidu.com/s/15n_W-1rqOd2cUzhfvbkH4g Extraction code: vmzw.
1. Steps
It is roughly divided into the following steps:
Installing Docker
Basic environment preparation
Configuring the network and starting the Docker containers
Configuring hosts and passwordless SSH login
Installing and configuring Hadoop
1.1 Installing Docker
Run the following commands in turn to install Docker. If you already have a Docker environment, you can skip this step.
yum update
yum install -y yum-utils device-mapper-persistent-data lvm2
yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo
yum install -y docker-ce
systemctl start docker
docker -v
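If you want to be extra sure the daemon works end to end, the standard hello-world test container is a harmless optional check:
docker run --rm hello-world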
1.2 Basic Environment Preparation
1.2.1 Creating the base CentOS 7 image
Pull the official centos7 image:
docker pull centos
Build a CentOS image with SSH enabled from a Dockerfile.
Create a Dockerfile:
vi Dockerfile
Write the following into the Dockerfile:
FROM centos
MAINTAINER mwf
RUN yum install -y openssh-server sudo
RUN sed -i 's/UsePAM yes/UsePAM no/g' /etc/ssh/sshd_config
RUN yum install -y openssh-clients
RUN echo "root:qwe123" | chpasswd
RUN echo "root ALL=(ALL) ALL" >> /etc/sudoers
RUN ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key
RUN ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key
RUN mkdir /var/run/sshd
EXPOSE 22
CMD ["/usr/sbin/sshd", "-D"]
Based on the centos image, this sets the root password to qwe123, installs the SSH service, and starts it.
Build the Dockerfile:
docker build -t="centos7-ssh" .
This generates an image named centos7-ssh, which you can view with:
docker images
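Optionally, you can sanity-check that SSH really works in the image before building on it. The container name ssh-test and host port 2222 below are just throwaway values for this test:
docker run -d --name ssh-test -p 2222:22 centos7-ssh
ssh -p 2222 root@localhost    # log in with the password qwe123
docker rm -f ssh-test    # clean up the test container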
1.2.2 Building an image with the Hadoop and JDK environments
Place the prepared packages hadoop-2.7.7.tar.gz and jdk-8u202-linux-x64.tar.gz in the current directory, then build a CentOS image with the Hadoop and JDK environments from a Dockerfile.
A Dockerfile was just created above, so move it out of the way first:
mv Dockerfile Dockerfile.bak
Create a new Dockerfile:
vi Dockerfile
Write the following:
FROM centos7-ssh
ADD jdk-8u202-linux-x64.tar.gz /usr/local/
RUN mv /usr/local/jdk1.8.0_202 /usr/local/jdk1.8
ENV JAVA_HOME /usr/local/jdk1.8
ENV PATH $JAVA_HOME/bin:$PATH
ADD hadoop-2.7.7.tar.gz /usr/local
RUN mv /usr/local/hadoop-2.7.7 /usr/local/hadoop
ENV HADOOP_HOME /usr/local/hadoop
ENV PATH $HADOOP_HOME/bin:$PATH
RUN yum install -y which sudo
The above roughly means: based on the centos7-ssh image generated earlier, add the Hadoop and JDK packages and configure the environment variables.
Build the Dockerfile:
docker build -t="hadoop" .
This generates an image named hadoop.
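As a quick sanity check of the new image, you can confirm that both tools are on the PATH; these are just the standard version commands, not part of the original steps:
docker run --rm hadoop java -version
docker run --rm hadoop hadoop version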
1.3 Configure the network and start the docker container
Because the cluster machines must be able to reach each other, configure the network first.
Create a network:
docker network create --driver bridge hadoop-br
The above command creates a bridge network named hadoop-br.
Specify the network when starting the Docker containers:
docker run -itd --network hadoop-br --name hadoop1 -p 50070:50070 -p 8088:8088 hadoop
docker run -itd --network hadoop-br --name hadoop2 hadoop
docker run -itd --network hadoop-br --name hadoop3 hadoop
The above commands start 3 containers on the hadoop-br network; hadoop1 additionally maps ports 50070 and 8088 to the host.
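You can confirm the network was created and the three containers are running:
docker network ls | grep hadoop-br
docker ps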
View the network:
docker network inspect hadoop-br
Execute the above command to see the corresponding network information:
[
    {
        "Name": "hadoop-br",
        "Id": "88b7839f412a140462b87a353769e8091e92b5451c47b5c6e7b44a1879bc7c9a",
        "Containers": {
            "86e52eb15351114d45fdad4462cc2050c05202554849bedb8702822945268631": {
                "Name": "hadoop1",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": ""
            },
            "9baa1ff183f557f180da2b7af8366759a0d70834f43d6b60fba2e64f340e0558": {
                "Name": "hadoop2",
                "IPv4Address": "172.18.0.3/16",
                "IPv6Address": ""
            },
            "e18a3166e965a81d28b4fe5168d1f0c3df1cb9f7e0cbe0673864779b224c8a7f": {
                "Name": "hadoop3",
                "IPv4Address": "172.18.0.4/16",
                "IPv6Address": ""
            }
        }
    }
]
From this we can see the IPs of the three machines:
172.18.0.2 hadoop1
172.18.0.3 hadoop2
172.18.0.4 hadoop3
Log in to a Docker container and verify that the machines can ping each other:
docker exec -it hadoop1 bash
ping 172.18.0.3
1.4 Configuring hosts and Passwordless SSH Login
1.4.1 Configuring hosts
Modify /etc/hosts on each machine separately:
vi /etc/hosts
Write the following (note: the IPs assigned by Docker may differ from person to person; fill in your own):
172.18.0.2 hadoop1
172.18.0.3 hadoop2
172.18.0.4 hadoop3
1.4.2 Passwordless SSH login
Because the SSH service was already installed in the image, execute the following commands directly on each machine (press Enter through the ssh-keygen prompts, and enter the password qwe123 when ssh-copy-id asks):
ssh-keygen
ssh-copy-id hadoop1
ssh-copy-id hadoop2
ssh-copy-id hadoop3
1.4.3 Testing the configuration
ping hadoop1
ping hadoop2
ping hadoop3
ssh hadoop1
ssh hadoop2
ssh hadoop3
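If the passwordless login works, each host should answer without a password prompt. A small loop like this one (my own check, not part of the original steps) verifies all three from any node:
for h in hadoop1 hadoop2 hadoop3; do ssh $h hostname; done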
1.5 Installing and Configuring Hadoop
1.5.1 Operating on hadoop1
Enter hadoop1:
docker exec -it hadoop1 bash
Create some folders; they will be used in the configuration shortly:
mkdir /home/hadoop
mkdir /home/hadoop/tmp /home/hadoop/hdfs_name /home/hadoop/hdfs_data
Switch to the Hadoop configuration directory:
cd $HADOOP_HOME/etc/hadoop/
Edit core-site.xml:
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop1:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/home/hadoop/tmp</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131702</value>
</property>
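Once the cluster is up, one way to confirm this setting was picked up is hdfs getconf, which prints a key from the loaded configuration (shown here for reference; run it after starting the cluster):
hdfs getconf -confKey fs.defaultFS
# should print hdfs://hadoop1:9000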
Edit hdfs-site.xml:
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hadoop/hdfs_name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hadoop/hdfs_data</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop1:9001</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
Edit mapred-site.xml
mapred-site.xml does not exist by default, so copy it from the template first:
cp mapred-site.xml.template mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop1:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop1:19888</value>
</property>
Edit yarn-site.xml:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop1:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoop1:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoop1:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hadoop1:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoop1:8088</value>
</property>
Edit slaves
Here I use hadoop1 as the master node and hadoop2 and hadoop3 as the slave nodes:
hadoop2
hadoop3
Copy files to hadoop2 and hadoop3
Execute the following commands in turn:
scp -r $HADOOP_HOME/ hadoop2:/usr/local/
scp -r $HADOOP_HOME/ hadoop3:/usr/local/
scp -r /home/hadoop hadoop2:/home/
scp -r /home/hadoop hadoop3:/home/
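As a spot check (my own addition, not in the original steps), confirm the files landed where the configuration expects them:
ssh hadoop2 ls /usr/local/hadoop/etc/hadoop/core-site.xml /home/hadoop
ssh hadoop3 ls /usr/local/hadoop/etc/hadoop/core-site.xml /home/hadoop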
1.5.2 Operating on each machine
Connect to each machine separately:
docker exec -it hadoop1 bash
docker exec -it hadoop2 bash
docker exec -it hadoop3 bash
Configure environment variables for the Hadoop sbin directory
The bin directory was added to the PATH when the image was built, but the sbin directory was not, so it must be configured separately. On each machine:
vi ~/.bashrc
Add the following:
export PATH=$PATH:$HADOOP_HOME/sbin
Execution:
source ~/.bashrc
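To confirm sbin is now on the PATH, a quick check:
which start-all.sh
# should print /usr/local/hadoop/sbin/start-all.sh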
1.5.3 Starting Hadoop
Execute the following commands on hadoop1:
Format HDFS:
hdfs namenode -format
Start everything with one command:
start-all.sh
If there are no errors, you can celebrate. If something went wrong, keep at it.
1.6 Testing Hadoop
Run jps on each machine and check the processes:
# hadoop1
1748 Jps
490 NameNode
846 ResourceManager
686 SecondaryNameNode
# hadoop2
400 DataNode
721 Jps
509 NodeManager
# hadoop3
425 NodeManager
316 DataNode
591 Jps
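Besides jps, you can ask the NameNode how many DataNodes have registered; hdfs dfsadmin -report is the standard command for this:
hdfs dfsadmin -report | grep "Live datanodes"
# with both slaves up, this should show: Live datanodes (2):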
Upload a file:
hdfs dfs -mkdir /mwf
echo hello > a.txt
hdfs dfs -put a.txt /mwf
hdfs dfs -ls /
Found 1 items
drwxr-xr-x - root supergroup 0 2020-09-04 11:14 /mwf
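You can also read the file back to confirm the round trip:
hdfs dfs -cat /mwf/a.txt
# prints: hello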
Since this is a cloud server, I don't want to open up the ports, so I won't look at the web UI.
2. Finally
The above is the process I summarized after a successful installation. It should work as described, though there may be omissions.
3. Reference
https://cloud.tencent.com/developer/article/1084166
https://cloud.tencent.com/developer/article/1084157?from=10680
https://blog.csdn.net/ifenggege/article/details/108396249