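While reviewing the Hadoop HDFS code, I needed a test cluster. I could have used the Hadoop cluster already installed on our development machines, but I wondered what it would be like to run Hadoop on Docker, so I went through the available material and wrote up what I found.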

Building a Hadoop Docker Image

Let's build a Docker image to run Hadoop on. (Approaches based on a Dockerfile or docker-compose will be covered in a separate post later.)

First, start a container from the CentOS image.

docker run -it --name hadoop-base centos

Update the yum packages and install the required libraries. This step can take a while.

yum update
yum install wget -y
yum install vim -y
yum install openssh-server openssh-clients openssh-askpass -y
yum install java-1.8.0-openjdk-devel.x86_64 -y

Generate a key pair so that the containers can connect to each other over ssh without a login prompt.

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Create the following directory as well so that the 'Missing privilege separation directory: /run/sshd' error does not occur.

mkdir /var/run/sshd

Create the Hadoop home directory, then download and extract the Hadoop binary.

mkdir /hadoop_home
cd /hadoop_home
wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
tar xvzf hadoop-2.7.7.tar.gz

Open bashrc and set the environment variables.

# vi ~/.bashrc

...
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export HADOOP_HOME=/hadoop_home/hadoop-2.7.7
export HADOOP_CONFIG_HOME=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

# run sshd
/usr/sbin/sshd

Apply the modified environment variables.

$ source ~/.bashrc
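The JAVA_HOME path depends on the distribution and on the installed OpenJDK build, so it may be safer to derive it from the javac binary than to hard-code it. The following is a minimal sketch, assuming GNU readlink is available; the function name java_home_from_javac is hypothetical and not part of Hadoop or the JDK.

```shell
#!/usr/bin/env bash
# Derive a JAVA_HOME candidate from the full path of a javac binary.
# java_home_from_javac is a hypothetical helper, not part of Hadoop or the JDK.
java_home_from_javac() {
    local javac_path="$1"
    # Strip the trailing /bin/javac to get the JDK root directory.
    printf '%s\n' "${javac_path%/bin/javac}"
}

# Only resolve the real binary when javac is actually installed.
if command -v javac >/dev/null 2>&1; then
    # readlink -f follows symlinks (for example /etc/alternatives) to the real JDK path.
    JAVA_HOME="$(java_home_from_javac "$(readlink -f "$(command -v javac)")")"
    export JAVA_HOME
    echo "JAVA_HOME=$JAVA_HOME"
fi
```

If the printed path looks right, the same value can be placed in ~/.bashrc and hadoop-env.sh.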

Create the directories that each daemon will use as its home.

# mkdir /hadoop_home/temp
# mkdir /hadoop_home/namenode_home
# mkdir /hadoop_home/datanode_home
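The three directories can also be created in one idempotent step, which helps when the setup is rerun. This is only a sketch; the function name make_hadoop_dirs is hypothetical, and the directory names are the ones used above.

```shell
#!/usr/bin/env bash
# Create the working directories for the Hadoop daemons under a base directory.
# make_hadoop_dirs is a hypothetical helper; the directory names come from the steps above.
make_hadoop_dirs() {
    local base="$1"
    # -p creates missing parents and does not fail if a directory already exists.
    mkdir -p "$base/temp" "$base/namenode_home" "$base/datanode_home"
}

# Example: make_hadoop_dirs /hadoop_home
```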

Now set up the Hadoop configuration files for each daemon.

Create mapred-site.xml from the mapred-site.xml.template file.

# cd $HADOOP_CONFIG_HOME
# cp mapred-site.xml.template mapred-site.xml

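Edit the settings in 'core-site.xml', 'hdfs-site.xml', and 'mapred-site.xml'. From here on, the configuration differs between 'Pseudo-Distributed' mode, which runs every daemon in a single container, and 'Distributed' mode, which runs the daemons across several containers.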

Pseudo-Distributed Mode

Edit each configuration file as follows.

core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/hadoop_home/temp</value>
    </property>

    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
        <final>true</final>
    </property>
</configuration>

hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
        <final>true</final>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/hadoop_home/namenode_home</value>
        <final>true</final>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/hadoop_home/datanode_home</value>
        <final>true</final>
    </property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
</configuration>

Add the JAVA_HOME environment variable at the end of hadoop-env.sh.

...
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk

Now format the NameNode.

# hadoop namenode -format

The container for Hadoop Pseudo-Distributed mode is now ready. Committing it as a Docker image lets you reuse it later. Open another terminal window and commit the container you have built so far.

$ docker commit hadoop-base centos:hadoop

Now let's start the Hadoop cluster. Running start-all.sh starts it.

# start-all.sh

The startup script asks several times whether to continue (these are ssh host key confirmations); enter 'yes' each time to proceed.

Once startup finishes, use the jps command to check that everything is running.

# jps
6896 DataNode
6387 ResourceManager
7285 NodeManager
7063 SecondaryNameNode
7384 Jps
6766 NameNode
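Checking the jps listing by eye is easy to get wrong, so the comparison can be scripted. The sketch below reads jps output on stdin and prints any expected daemon that is absent; the function name missing_daemons is hypothetical.

```shell
#!/usr/bin/env bash
# Print the names of expected daemons that are absent from jps output read on stdin.
# missing_daemons is a hypothetical helper; pass the expected daemon names as arguments.
missing_daemons() {
    local jps_output daemon
    jps_output="$(cat)"
    for daemon in "$@"; do
        # jps prints "<pid> <class name>"; compare the second column exactly
        # so that NameNode does not match SecondaryNameNode.
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            printf '%s\n' "$daemon"
        fi
    done
}

# Example (inside the container):
#   jps | missing_daemons NameNode DataNode SecondaryNameNode ResourceManager NodeManager
```

Empty output means every listed daemon was found.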

Let's create a directory in HDFS with the following commands.

# hadoop fs -mkdir -p /user/dave
# hadoop fs -ls /
Found 1 items
drwxr-xr-x - root supergroup 0 2019-11-16 14:16 /user

You can see that the directory was created. An HDFS cluster for testing is now running in 'Pseudo-Distributed' mode.

A Simple Test - WordCount

Let's run the WordCount example on the LICENSE.txt file that ships with Hadoop.

# cd $HADOOP_HOME
# ls
LICENSE.txt NOTICE.txt README.txt bin etc include lib libexec sbin share

Create a test directory in HDFS and upload LICENSE.txt to the cluster with the hadoop fs command.

# hadoop fs -mkdir -p /test
# hadoop fs -put LICENSE.txt /test
# hadoop fs -ls /test
Found 1 items
-rw-r--r-- 2 root supergroup 86424 2019-11-16 07:05 /test/LICENSE.txt

Let's run the wordcount example using the jar file bundled with the Hadoop package.

# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount /test /test_out

The MapReduce job is submitted and finishes after a short run. Now let's look at the WordCount result.

# hadoop fs -cat /test_out/*
...
wherever 1
whether 9
which 32
which, 2
which: 1
who 2
whole, 2
whom 8
will 7
with 99
withdraw 2
within 20
without 42
work 11
work, 3
work. 2
works 3
works, 1
world-wide, 4
worldwide, 4
would 1
writing 2
writing, 3
written 11
xmlenc 1
year 1
you 9
your 4
252.227-7014(a)(1)) 1
§ 1
“AS 1
“Contributor 1
“Contributor” 1
“Covered 1
“Executable” 1
“Initial 1
“Larger 1
“Licensable” 1
“License” 1
“Modifications” 1
“Original 1
“Participant”) 1
“Patent 1
“Source 1
“Your”) 1
“You” 2
“commercial 3
“control” 1

This is the result.
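The listing shows that tokens are split on whitespace only, which is why 'which', 'which,' and 'which:' are counted separately. A rough local approximation of that counting can be sketched with standard tools; the function name local_wordcount is hypothetical, and the sort order is not guaranteed to match the job output exactly.

```shell
#!/usr/bin/env bash
# Approximate the WordCount example locally: split stdin on whitespace and count each token.
# local_wordcount is a hypothetical helper for illustration only.
local_wordcount() {
    # tr turns every run of whitespace into one newline, so each token gets its own line.
    tr -s '[:space:]' '\n' |
        grep -v '^$' |
        LC_ALL=C sort |
        uniq -c |
        awk '{ print $2 "\t" $1 }'   # print the token first, then its count
}

# Example: local_wordcount < LICENSE.txt | tail
```

Comparing this output with the HDFS result is a quick sanity check on a small input.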

Listing the output files:

# hadoop fs -ls /test_out
Found 2 items
-rw-r--r-- 2 root supergroup 0 2019-11-15 07:08 /test_out/_SUCCESS
-rw-r--r-- 2 root supergroup 22239 2019-11-15 07:08 /test_out/part-r-00000

You can see that the result file produced by the reducer was written to the HDFS cluster.

Distributed Mode

Now let's build an image for 'Distributed' mode, which runs the daemons across several containers, and start a cluster. Edit core-site.xml, hdfs-site.xml, and mapred-site.xml as follows.

core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/hadoop_home/temp</value>
    </property>

    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
        <final>true</final>
    </property>
</configuration>

hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <final>true</final>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/hadoop_home/namenode_home</value>
        <final>true</final>
    </property>

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/hadoop_home/datanode_home</value>
        <final>true</final>
    </property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>master:9001</value>
    </property>
</configuration>
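Compared with the pseudo-distributed files, only the host name in core-site.xml and mapred-site.xml and the replication factor in hdfs-site.xml change. One way to avoid editing the files by hand is to generate them from a parameter. The sketch below does this for core-site.xml only; the function name write_core_site is hypothetical, and the license header is omitted for brevity.

```shell
#!/usr/bin/env bash
# Write a minimal core-site.xml for the given NameNode host into a config directory.
# write_core_site is a hypothetical helper; property names and values mirror the files above.
write_core_site() {
    local conf_dir="$1" namenode_host="$2"
    cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/hadoop_home/temp</value>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://${namenode_host}:9000</value>
        <final>true</final>
    </property>
</configuration>
EOF
}

# Pseudo-distributed: write_core_site "$HADOOP_CONFIG_HOME" localhost
# Distributed:        write_core_site "$HADOOP_CONFIG_HOME" master
```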

Set the JAVA_HOME environment variable at the end of hadoop-env.sh.

...

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk

Now format the NameNode.

# hadoop namenode -format

A message like 'SHUTDOWN_MSG: Shutting down NameNode at ??? ' is printed and the format completes successfully. If you look in the '/hadoop_home/namenode_home' directory, you will find the newly created files.

The Hadoop container is now complete.

To start this container once for each daemon, commit it as a Docker image. Open another terminal and commit the container from the host operating system.

$ docker commit hadoop-base centos:hadoop

Starting the Hadoop Cluster

Now let's use the image to start one master node and two slave nodes. Open three terminals and run one of the following commands in each.

$ docker run -it -h master --name master -p {HostMachinePort for admin page}:50070 centos:hadoop

$ docker run -it -h slave1 --name slave1 --link master:master centos:hadoop

$ docker run -it -h slave2 --name slave2 --link master:master centos:hadoop

When starting the master, replace {HostMachinePort for admin page} with the host port you want to use to reach the NameNode admin web page. Docker's -p option forwards that host port to port 50070 on the master node.

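The three containers are running, but the Hadoop daemons have not been started yet. A few steps are needed first. Start by checking the virtual IP addresses of the two slave nodes. From the host operating system, Docker's inspect command shows this information.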

$ docker inspect slave1 | grep IPAddress

$ docker inspect slave2 | grep IPAddress

Since the master, slave 1, and slave 2 were the only containers running in Docker at the time, the assigned virtual IPs were as follows.

Master : 172.17.0.2
Slave1 : 172.17.0.3
Slave2 : 172.17.0.4

The exact values are only known after running inspect and may differ depending on which containers are running on your machine, so check them yourself.

Now attach to the master container and edit the hosts file.

$ docker attach master
root@master:~# vi /etc/hosts

Add the virtual IP addresses of the slave nodes to /etc/hosts.

/etc/hosts

...
172.17.0.2 master

172.17.0.3 slave1
172.17.0.4 slave2
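Editing /etc/hosts by hand is fine for two slaves, but the same entries can also be appended from a script without creating duplicates when it is rerun. This is a minimal sketch; the function name add_host_entry is hypothetical, and the IP addresses in the usage comment are the example values from above.

```shell
#!/usr/bin/env bash
# Append "<ip> <name>" to a hosts file unless an entry for that name already exists.
# add_host_entry is a hypothetical helper.
add_host_entry() {
    local hosts_file="$1" ip="$2" name="$3"
    # Match the host name as the last field of a line so "slave1" does not match "slave10".
    if ! grep -Eq "[[:space:]]${name}[[:space:]]*\$" "$hosts_file"; then
        printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
    fi
}

# Example (inside the master container):
#   add_host_entry /etc/hosts 172.17.0.3 slave1
#   add_host_entry /etc/hosts 172.17.0.4 slave2
```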

Move to the $HADOOP_CONFIG_HOME directory and edit the slaves file.

# cd $HADOOP_CONFIG_HOME
# vi slaves

This file lists the containers that will run as DataNodes. Enter the following.

slave1
slave2
master

This configures all three containers to run as DataNodes.

Now start the daemons on the Hadoop cluster with the 'start-all.sh' script.

# start-all.sh

Because Hadoop's sbin directory is on the PATH, the script runs as is. While running, it repeatedly asks you to choose 'yes/no'; answer yes every time. (Messages from stdout and stderr are interleaved and can be confusing; whenever the output seems to have stalled, type 'yes' again.)

Run the jps command:

# jps
704 NodeManager
273 DataNode
440 SecondaryNameNode
138 NameNode
990 Jps
591 ResourceManager

The Hadoop daemons are running normally.

Attach to a slave node and run jps:

# jps
245 Jps
149 NodeManager
39 DataNode

The Hadoop daemons are running there as well. To see the current state of the Hadoop cluster, run the following command.

# hdfs dfsadmin -report

The 'Live datanodes (3): ' line in its output confirms that three DataNodes are running. Next, let's create a directory in HDFS with the following commands.

# hadoop fs -mkdir -p /user/dave
# hadoop fs -ls /

The ls command shows:

# hadoop fs -ls /
Found 1 items
drwxr-xr-x - root supergroup 0 2019-11-16 14:16 /user

You can see that the directory was created.
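The 'Live datanodes (3): ' line from the dfsadmin report can also be checked from a script, for example to wait until every DataNode has registered. The sketch below extracts the number from report text read on stdin; the function name live_datanode_count is hypothetical, and it assumes the line has exactly the form quoted above.

```shell
#!/usr/bin/env bash
# Extract the number from the "Live datanodes (N):" line of a dfsadmin report read on stdin.
# live_datanode_count is a hypothetical helper.
live_datanode_count() {
    # -n suppresses default output; p prints only the line where the substitution matched.
    sed -n 's/^Live datanodes (\([0-9][0-9]*\)):.*/\1/p'
}

# Example (on the master container):
#   [ "$(hdfs dfsadmin -report | live_datanode_count)" -eq 3 ] && echo "all DataNodes are live"
```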

Admin Page

Open a web browser and go to 'localhost:{Host Machine Port}' to reach the admin web page served from the master container and view the current state of the Hadoop cluster. (For Host Machine Port, use the port number you entered when starting the master container.)

The 'Utilities' tab on this page also lets you browse the HDFS directory tree from the web, which is very handy.

Running a Simple Test

Let's run the 'WordCount' example again. To connect to the cluster built above, start one more container as a client.

$ docker run -it -h client --name client --link master:master centos:hadoop

Let's run the WordCount example on the LICENSE.txt file that ships with Hadoop.

# cd $HADOOP_HOME
# ls
LICENSE.txt NOTICE.txt README.txt bin etc include lib libexec sbin share

Create a test directory in HDFS and upload LICENSE.txt to the cluster with the hadoop fs command.

# hadoop fs -mkdir -p /test
# hadoop fs -put LICENSE.txt /test
# hadoop fs -ls /test
Found 1 items
-rw-r--r-- 2 root supergroup 86424 2019-11-16 07:05 /test/LICENSE.txt

Let's run the wordcount example using the jar file bundled with the Hadoop package.

# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.7.jar wordcount /test /test_out

The MapReduce job is submitted and finishes after a short run. Now let's look at the WordCount result.

# hadoop fs -cat /test_out/*
...
wherever 1
whether 9
which 32
which, 2
which: 1
who 2
whole, 2
whom 8
will 7
with 99
withdraw 2
within 20
without 42
work 11
work, 3
work. 2
works 3
works, 1
world-wide, 4
worldwide, 4
would 1
writing 2
writing, 3
written 11
xmlenc 1
year 1
you 9
your 4
252.227-7014(a)(1)) 1
§ 1
“AS 1
“Contributor 1
“Contributor” 1
“Covered 1
“Executable” 1
“Initial 1
“Larger 1
“Licensable” 1
“License” 1
“Modifications” 1
“Original 1
“Participant”) 1
“Patent 1
“Source 1
“Your”) 1
“You” 2
“commercial 3
“control” 1