Terraform: wait till the instance is "reachable"

I have some Terraform code with an aws_instance and a null_resource:

resource "aws_instance" "example" {
  ami           = data.aws_ami.server.id
  instance_type = "t2.medium"
  key_name      = aws_key_pair.deployer.key_name

  tags = {
    name = "example"
  }

  vpc_security_group_ids = [aws_security_group.main.id]
}

resource "null_resource" "example" {
  provisioner "local-exec" {
    command = "ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -T 300 -i ${aws_instance.example.public_dns}, --user centos --private-key files/id_rsa playbook.yml"
  }
}

It kinda works, but sometimes it fails (probably while the instance is still in the pending state). When I rerun Terraform, it works as expected.

Question: How can I run local-exec only when the instance is running and accepting an SSH connection?

ansible terraform

Solution 1:^[1]

The null_resource currently only waits until the aws_instance resource has completed, which in turn only waits until the AWS API reports the instance as being in the Running state. There is a long gap between that point and the OS having booted far enough to accept SSH connections, and your local-exec provisioner can start inside that gap.

One way to handle this is to use the remote-exec provisioner on the instance first as that has the ability to wait for the instance to be ready. Changing your existing code to handle this would look like this:

resource "aws_instance" "example" {
  ami           = data.aws_ami.server.id
  instance_type = "t2.medium"
  key_name      = aws_key_pair.deployer.key_name

  tags = {
    name = "example"
  }

  vpc_security_group_ids = [aws_security_group.main.id]
}

resource "null_resource" "example" {
  provisioner "remote-exec" {
    connection {
      host = aws_instance.example.public_dns
      user = "centos"
      private_key = file("files/id_rsa")
    }

    inline = ["echo 'connected!'"]
  }

  provisioner "local-exec" {
    command = "ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -T 300 -i ${aws_instance.example.public_dns}, --user centos --private-key files/id_rsa playbook.yml"
  }
}

This will first attempt to connect to the instance's public DNS address as the centos user with the files/id_rsa private key. Once it is connected it will then run echo 'connected!' as a simple command before moving on to your existing local-exec provisioner that runs Ansible against the instance.
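If you would rather keep the wait inside local-exec, a rough alternative is to poll port 22 from the machine running Terraform before starting Ansible. The following is only a sketch: wait_for_port is a hypothetical helper name, the retry count and delay defaults are assumptions, and it relies on bash's /dev/tcp feature and the coreutils timeout command. An open port only shows that something is listening, not that sshd will accept your key.

```shell
#!/usr/bin/env bash
# Sketch: poll a TCP port until it accepts connections.
# wait_for_port is a hypothetical helper, not part of Terraform or Ansible.
# Assumes bash (for /dev/tcp) and the coreutils `timeout` command.

wait_for_port() {
  local host="$1" port="$2" max_tries="${3:-30}" delay="${4:-10}"
  local try
  for (( try = 1; try <= max_tries; try++ )); do
    # Opening /dev/tcp/HOST/PORT succeeds only if the TCP handshake completes.
    if timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
      echo "port $port on $host is open (try $try)"
      return 0
    fi
    echo "port $port on $host not reachable yet (try $try/$max_tries)"
    # Only sleep when another attempt will follow.
    if (( try < max_tries )); then
      sleep "$delay"
    fi
  done
  return 1
}
```

Calling wait_for_port with the instance's public DNS name and 22 ahead of ansible-playbook in the local-exec command would hold Ansible back until the port answers, and fail the provisioner if it never does.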

Note that just being able to connect over SSH may not actually be enough for you to then provision the instance. If your Ansible playbook tries to interact with your package manager, you may find that it is locked by the instance's user data script, which can still be running. If this is the case, you will need to remotely execute a script that waits for cloud-init to complete first. An example script looks like this:

#!/bin/bash

while [ ! -f /var/lib/cloud/instance/boot-finished ]; do
  echo -e "\033[1;36mWaiting for cloud-init..."
  sleep 1
done
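The loop above waits forever if cloud-init never finishes. A variant with a deadline might look like the sketch below. wait_for_file is a hypothetical helper name and the 300 second default is an assumption; only the boot-finished path comes from the script above.

```shell
#!/usr/bin/env bash
# Sketch: wait for a marker file, but give up after a deadline.
# wait_for_file is a hypothetical helper name; the 300 second default is an assumption.

wait_for_file() {
  local path="$1" timeout_s="${2:-300}"
  local waited=0
  while [ ! -f "$path" ]; do
    if (( waited >= timeout_s )); then
      echo "timed out after ${waited}s waiting for $path" >&2
      return 1
    fi
    sleep 1
    (( waited += 1 ))
  done
  return 0
}
```

Running wait_for_file /var/lib/cloud/instance/boot-finished 600 through remote-exec would then fail the provisioner after roughly ten minutes instead of hanging.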

Solution 2:^[2]

There is an Ansible-specific solution for this problem. Add this play to your playbook (there is also a pre_tasks clause if you use roles):

- name: will wait till reachable
  hosts: all
  gather_facts: no # important
  tasks:
    - name: Wait for system to become reachable
      wait_for_connection:

    - name: Gather facts for the first time
      setup:

Solution 3:^[3]

For cases where instances are not externally exposed (about 90% of the time in most of my projects) and the SSM agent is installed on the target instance (newer AWS AMIs come pre-loaded with it), you can use SSM to probe the instance. Here's some sample code:

#!/bin/bash
# Usage: check-instance-state.sh <instance-id>
instanceId=$1
echo "Waiting for instance to bootstrap ..."
tries=0
responseCode=1
while [[ $responseCode != 0 && $tries -le 10 ]]
do
  echo "Try # $tries"
  # Replace the cat with whatever validation logic fits your instance.
  cmdId=$(aws ssm send-command --document-name AWS-RunShellScript --instance-ids "$instanceId" --parameters commands="cat /tmp/job-done.txt" --query Command.CommandId --output text)
  sleep 5
  responseCode=$(aws ssm get-command-invocation --command-id "$cmdId" --instance-id "$instanceId" --query ResponseCode --output text)
  echo "ResponseCode: $responseCode"
  if [[ $responseCode != 0 ]]; then
    echo "Sleeping ..."
    sleep 60
  fi
  (( tries++ ))
done
echo "Wait time over. ResponseCode: $responseCode"
# Exit non-zero if the check never passed, so the local-exec provisioner fails.
[[ $responseCode == 0 ]] || exit 1

Assuming you have AWS CLI installed locally, you can have this null_resource required before you act on the instance. In my case, I was building an AMI.

resource "null_resource" "wait_for_instance" {
  depends_on = [
    aws_instance.my_instance
  ]
  triggers = {
    always_run = timestamp()
  }
  provisioner "local-exec" {
    command = "${path.module}/scripts/check-instance-state.sh ${aws_instance.my_instance.id}"
  }
}

Solution 4:^[4]

Have a look at depends_on
https://www.terraform.io/docs/configuration/resources.html#depends_on-explicit-resource-dependencies

It should be used sparingly, and if you do use it, always write a comment explaining why! It makes the null_resource wait until the instance resource has been created. If you really need it to run only once SSH is up, which is the case for Ansible, then you could create a health check for the instance and depend on that.
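One possible form of such a health check, sketched here as an assumption rather than as this answer's own method, is the AWS CLI waiter aws ec2 wait instance-status-ok, run from a local-exec provisioner that other resources depend on. wait_for_status_ok and the AWS_CLI override are hypothetical names used for illustration, and passing status checks still does not guarantee that sshd accepts logins.

```shell
#!/usr/bin/env bash
# Sketch: block until EC2 status checks pass for one instance.
# wait_for_status_ok and the AWS_CLI override are hypothetical names used for illustration.

wait_for_status_ok() {
  local instance_id="$1"
  if [ -z "$instance_id" ]; then
    echo "usage: wait_for_status_ok <instance-id>" >&2
    return 2
  fi
  # `aws ec2 wait instance-status-ok` polls until the instance status reports ok,
  # and exits non-zero if that does not happen within the CLI's own retry limit.
  "${AWS_CLI:-aws}" ec2 wait instance-status-ok --instance-ids "$instance_id"
}
```

The AWS_CLI variable exists only so the function can be exercised without real credentials. In normal use it is left unset and the aws binary on the PATH is called.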

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	ydaetskcoR
Solution 2	kharandziuk
Solution 3	peter n
Solution 4	frathert
