Info
Build a Hadoop cluster in OpenStack with Terraform.
The primary goal of this project is to build a Hadoop cluster, but most of it is generic: the Hadoop deployment can be skipped or replaced by a different deployment type (see the deployments directory).
Requirements
Locally installed:
- Terraform
- Python (for the orchestration script orchestrate.py)
- Ansible (for the software deployment)
Configuration (see the sketch below):
- public part of the SSH key uploaded to OpenStack
- ssh-agent running with the SSH key added
- configured access to OpenStack (for example, a downloaded clouds.yaml file or the environment variables set)
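For example, a minimal sketch of the configuration steps ('mykey', 'mycloud', and the key path are placeholders to adjust):
# upload the public part of the SSH key to OpenStack
openstack keypair create --public-key ~/.ssh/id_rsa.pub mykey
# add the private key to the ssh-agent
ssh-add ~/.ssh/id_rsa
# point the OpenStack clients to the cloud entry in clouds.yaml
export OS_CLOUD=mycloud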
Hadoop image
To set up Hadoop on a single machine, launch:
/usr/local/sbin/hadoop-setup.sh
The Hadoop image can also be used to build a Hadoop cluster. It contains pre-downloaded and pre-installed Hadoop packages and dependencies, which speeds things up.
Single machine
See above, when the Hadoop image is available.
It is also possible to build Hadoop on a single machine using Terraform + these orchestration scripts (set type=hadoop-single and n=0).
For example (check also the other values used in variables.tf):
cat <<EOF > mycluster.auto.tfvars
n = 0
type = "hadoop-single"
flavor = "standard.large" # >4GB memory needed
EOF
./launch.sh
Build cluster
#
# 1. check *variables.tf*
#
# It is possible to override default values using *\*.auto.tfvars* files.
#
cat <<EOF > mycluster.auto.tfvars
domain = "mydomain"
n = 3
security_trusted_cidr = [
"0.0.0.0/0",
"::/0",
]
ssh = "mykey"
EOF
#
# 2. add ssh key to ssh agent
#
# It must be the ssh key used in the *ssh* parameter in *variables.tf* or *\*.auto.tfvars*.
#
ssh-add
#
# 3. launch the setup script
#
./launch.sh
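After launch.sh finishes, a quick sanity check can be run against the generated inventory with Ansible (a sketch; it relies on the SSH key added to the agent in step 2):
# verify that all machines are reachable over SSH
ansible -i ./inventory -m ping all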
Destroy cluster
terraform destroy
Usage
Hadoop can be used from the "master" node (the frontend machine). Its hostname can be configured by master_hostname in variables.tf. This machine is configured with a floating public IP address.
Before accessing Hadoop services, a Kerberos ticket must be obtained:
kinit
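With a valid ticket, the usual Hadoop commands should work, for example:
# list the HDFS root directory
hdfs dfs -ls /
# show the YARN cluster nodes
yarn node -list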
Password
Look for the generated password of the created Hadoop user in the output or in the password.txt file in the home directory (/home/debian/password.txt).
It is possible to set a new password on the master server using ('debian' is the user name):
sudo kadmin.local cpw debian
Public IP
The public IP is in the public_hosts file or inventory file.
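For example, assuming the public_hosts file uses the /etc/hosts layout (IP address followed by host names) and the master is the first entry, the master machine can be reached with:
# ssh to the master node as the 'debian' user
ssh debian@$(awk 'NR==1 {print $1}' public_hosts)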
Advanced usage
Add Hadoop node
On the terraform client machine:
# increase number of nodes in terraform
vim *.auto.tfvars
# check the output
./terraform plan
# perform the changes
./launch.sh
# refresh configuration
yellowmanager refresh
# (with proper credentials, this calls: 1) hdfs dfsadmin -refreshNodes, 2) yarn rmadmin -refreshNodes)
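To verify that the new node has registered, run the HDFS report on the master machine (the same commands as in the decommissioning section below):
sudo -u hdfs kinit -k -t /etc/security/keytab/nn.service.keytab nn/`hostname -f`
sudo -u hdfs hdfs dfsadmin -report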
Remove Hadoop node
Data must be migrated away from the nodes being removed first. Theoretically this isn't needed when removing only one node, thanks to the HDFS replication policy; in that case the steps are the same as for adding a node.
- update Hadoop cluster
On the master machine:
# add the nodes to remove (they must be the nodes with the highest numbers), for example:
echo node3.terra >> /etc/hadoop/conf/excludes
# refresh configuration
yellowmanager refresh
# (with proper credentials, this calls: 1) hdfs dfsadmin -refreshNodes, 2) yarn rmadmin -refreshNodes)
# wait for the decommissioning to finish (via CLI or check http://PUBLIC_IP:50070)
sudo -u hdfs kinit -k -t /etc/security/keytab/nn.service.keytab nn/`hostname -f`
sudo -u hdfs hdfs dfsadmin -report
...
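The report prints a "Decommission Status" line for every datanode; a quick way to watch the progress:
# show only the decommissioning state of each node
sudo -u hdfs hdfs dfsadmin -report | grep 'Decommission Status'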
- update infrastructure + SW configuration
On the terraform client machine:
# decrease number of nodes in terraform
vim *.auto.tfvars
# check the output
./terraform plan
# perform the changes
./launch.sh
- cleanups
On the master machine:
echo > /etc/hadoop/conf/excludes
sudo -u hdfs hdfs dfsadmin -refreshNodes
Add user
Launch /usr/local/sbin/hadoop-adduser.sh USER_NAME on all nodes of the cluster.
For example using Ansible (replace $USER_NAME by the user name):
ansible -i ./inventory -m command -a "/usr/local/sbin/hadoop-adduser.sh $USER_NAME" all
The generated password is written to the output and stored in the home directory.
Internals
The launch.sh script does roughly the following:
terraform init
terraform apply
terraform output -json > config.json
./orchestrate.py
Terraform builds the infrastructure; orchestrate.py finishes the missing pieces (waiting for the machines to come up, proper DNS setup, ...) and then deploys and configures the software. The information about the infrastructure exported by Terraform is used for the orchestration.
The orchestration script has multiple steps and a dry-run option. See ./orchestrate.py --help.
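For example, a dry run of the whole orchestration might look like this (the exact flag name is an assumption; check the --help output):
# show what would be done without changing anything (hypothetical flag)
./orchestrate.py --dry-run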