'Terraform: wait till the instance is "reachable"
I have some Terraform code with an aws_instance
and a null_resource
:
resource "aws_instance" "example" {
ami = data.aws_ami.server.id
instance_type = "t2.medium"
key_name = aws_key_pair.deployer.key_name
tags = {
name = "example"
}
vpc_security_group_ids = [aws_security_group.main.id]
}
resource "null_resource" "example" {
provisioner "local-exec" {
command = "ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -T 300 -i ${aws_instance.example.public_dns}, --user centos --private-key files/id_rsa playbook.yml"
}
}
It kinda works, but sometimes there is a bug (probably when the instance in a pending state). When I rerun Terraform - it works as expected.
Question: How can I run local-exec only when the instance is running and accepting an SSH connection?
Solution 1:[1]
The null_resource
is currently only going to wait until the aws_instance
resource has completed which in turn only waits until the AWS API returns that it is in the Running
state. There's a long gap from there to the instance starting the OS and then being able to accept SSH connections before your local-exec
provisioner can connect.
One way to handle this is to use the remote-exec
provisioner on the instance first as that has the ability to wait for the instance to be ready. Changing your existing code to handle this would look like this:
resource "aws_instance" "example" {
ami = data.aws_ami.server.id
instance_type = "t2.medium"
key_name = aws_key_pair.deployer.key_name
tags = {
name = "example"
}
vpc_security_group_ids = [aws_security_group.main.id]
}
resource "null_resource" "example" {
provisioner "remote-exec" {
connection {
host = aws_instance.example.public_dns
user = "centos"
file = file("files/id_rsa")
}
inline = ["echo 'connected!'"]
}
provisioner "local-exec" {
command = "ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -T 300 -i ${aws_instance.example.public_dns}, --user centos --private-key files/id_rsa playbook.yml"
}
}
This will first attempt to connect to the instance's public DNS address as the centos
user with the files/id_rsa
private key. Once it is connected it will then run echo 'connected!'
as a simple command before moving on to your existing local-exec
provisioner that runs Ansible against the instance.
Note that just being able to connect over SSH may not actually be enough for you to then provision the instance. If your Ansible script tries to interact with your package manager then you may find that it is locked from the instance's user data script running. If this is the case you will need to remotely execute a script that waits for cloud-init
to be complete first. An example script looks like this:
#!/bin/bash
while [ ! -f /var/lib/cloud/instance/boot-finished ]; do
echo -e "\033[1;36mWaiting for cloud-init..."
sleep 1
done
Solution 2:[2]
There is an ansible specific solution for this problem. Add this code to you playbook(there is all so pre_task clause if you use roles)
- name: will wait till reachable
hosts: all
gather_facts: no # important
tasks:
- name: Wait for system to become reachable
wait_for_connection:
- name: Gather facts for the first time
setup:
Solution 3:[3]
For cases where instances are not externally exposed (About 90% of the time in most of my projects), and SSM agent is installed on the target instance (newer AWS AMIs come pre-loaded with it), you can leverage SSM to probe the instance. Here's some sample code:
instanceId=$1
echo "Waiting for instance to bootstrap ..."
tries=0
responseCode=1
while [[ $responseCode != 0 && $tries -le 10 ]]
do
echo "Try # $tries"
cmdId=$(aws ssm send-command --document-name AWS-RunShellScript --instance-ids $instanceId --parameters commands="cat /tmp/job-done.txt # or some other validation logic" --query Command.CommandId --output text)
sleep 5
responseCode=$(aws ssm get-command-invocation --command-id $cmdId --instance-id $instanceId --query ResponseCode --output text)
echo "ResponseCode: $responseCode"
if [ $responseCode != 0 ]; then
echo "Sleeping ..."
sleep 60
fi
(( tries++ ))
done
echo "Wait time over. ResponseCode: $responseCode"
Assuming you have AWS CLI installed locally, you can have this null_resource required before you act on the instance. In my case, I was building an AMI.
resource "null_resource" "wait_for_instance" {
depends_on = [
aws_instance.my_instance
]
triggers = {
always_run = "${timestamp()}"
}
provisioner "local-exec" {
command = "${path.module}/scripts/check-instance-state.sh ${aws_instance.my_instance.id}"
}
}
Solution 4:[4]
Have a look at depends_on
https://www.terraform.io/docs/configuration/resources.html#depends_on-explicit-resource-dependencies
It shouldn't be used but if you do, always write a comment!! This should solve your problem. If you really want it to run after ssh is running which is the case for ansible, then you could create a health check for the instance and depend on that.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | ydaetskcoR |
Solution 2 | kharandziuk |
Solution 3 | peter n |
Solution 4 | frathert |