Run a Tailscale VPN relay on ECS/Fargate

2022-Feb-22 • by David Norton

This is the first of our client logs, describing a problem encountered by a client, and a solution we helped design and deliver.

Today we'll describe how we used Tailscale and ECS to help our client build an inexpensive, simple VPN solution.

Background

The client is a small company transitioning from startup mode into "scale-up mode". Their infrastructure is on AWS, and they needed to move certain resources, such as databases, into private VPC subnets so they were not internet-accessible. However, developers and administrators at the company needed to retain access to the databases from their workstations for troubleshooting and maintenance tasks.

We could easily have set up a bastion or jumphost, but we didn't want the additional management overhead of maintaining and protecting the SSH server, managing users and SSH keys, and so on. Plus, there were some non-technical users who needed access to internal resources. So we looked at VPN options. AWS Client VPN was an obvious option, but the pricing seemed higher than necessary.

Tailscale

Tailscale is a VPN service and client based on the WireGuard protocol. It was recommended both by many of my own peers and by some developers at the client. We decided to give it a shot, and were pleased with the experience!

In our experience, it's simple and easy to use. Workstations can join using a native UI client, and server machines can be added with a CLI client running as a "sidecar" on the same server or Kubernetes pod. You can also connect to entire private networks with a subnet router or relay. Finally, you can send all of your workstation's traffic through a Tailscale exit node.

On the downside, there is no official Terraform provider (there is an unofficial provider), and OIDC authentication is only available under an "Enterprise"-level license. Additionally, as you'll see in this tutorial, the published Docker image either needs to be extended to run configuration commands such as tailscale up, or you need to run the commands outside of the docker run lifecycle.

They offer a free version for personal use, and plans for businesses start at $5/user/month.

The solution: Running Tailscale on ECS

We decided to launch Tailscale on ECS. We will use the Fargate capacity provider since the client doesn't manage any EC2 instances and would like to keep it that way -- but this could just as easily run on EC2-backed ECS container instances.

We will run Tailscale in relay mode, but this could also be used to run an exit node or to run Tailscale as a sidecar to your other applications. We will run Tailscale using userspace networking so that we don't have to provide low-level permissions to the container.
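Relay (subnet router) mode and exit node mode are just different flags to tailscale up -- we'll pass these through later via the TAILSCALE_UP_ARGS environment variable. Roughly (the auth key and CIDRs are placeholders):

# subnet router: advertise routes to your private subnets
tailscale up --authkey=tskey-... --advertise-routes=10.1.0.0/16,10.2.0.0/16

# exit node: offer to route all of a client's traffic
tailscale up --authkey=tskey-... --advertise-exit-node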

Step 1: Generate an auth key

Auth keys allow you to log in to Tailscale without a UI. I'd recommend a non-reusable (one-off), ephemeral (devices are cleaned up after they go offline) key.
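You can generate one in the Tailscale admin console's keys settings, or via their API -- something like the following should work (the API key and tailnet name are placeholders; check the API docs for the current request shape):

curl -u "tskey-api-xxxxx:" \
  -H "Content-Type: application/json" \
  -d '{"capabilities": {"devices": {"create": {"reusable": false, "ephemeral": true}}}}' \
  "https://api.tailscale.com/api/v2/tailnet/example.com/keys"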

Step 2: Build Docker image

We can either launch Tailscale using a custom Docker image, or we can use Terraform's null_resource and provisioning feature. If you're OK with the funky smell of a time_sleep and null_resource provisioner, skip this step and see the examples in the associated GitHub repo.

The standard Tailscale Docker image can be given the tailscaled command to run in daemon mode, but the first time it launches, it needs to be configured by running tailscale up and passing parameters such as the auth key, subnets, etc. Ideally we could pass these as environment variables, but that is not supported by the off-the-shelf image. Many of their instructions (including for AWS App Runner, which is based on Fargate) suggest building your own bootstrap script.

I built a Docker image similar to theirs, which contains a bootstrap script:

#!/bin/sh

# Retry `tailscale up` until tailscaled (started below) is ready to accept it.
# TAILSCALE_UP_ARGS is deliberately unquoted so it splits into multiple arguments.
up() {
    until tailscale up --authkey="${TAILSCALE_AUTHKEY}" ${TAILSCALE_UP_ARGS}
    do
        sleep 0.1
    done
}

# send this function into the background; tailscaled must be running
# before `tailscale up` can succeed
up &

exec tailscaled --tun=userspace-networking --state="${TAILSCALE_STATE_PARAMETER_ARN}"

And a Dockerfile:

# never use :latest. It will only cause you pain.
FROM tailscale/tailscale:v1.21.26

COPY ./bootstrap.sh /bootstrap.sh
RUN chmod +x /bootstrap.sh
CMD ["/bootstrap.sh"]
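Build the image and push it somewhere your ECS tasks can pull from, such as ECR (the account ID, region, and repository name below are placeholders):

aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

docker build -t 123456789012.dkr.ecr.us-east-1.amazonaws.com/tailscale-vpn:v1.21.26 .
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/tailscale-vpn:v1.21.26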

I decided against publishing this Docker image because we can't commit to the maintenance burden of keeping it up to date with security patches and Tailscale's releases. If Tailscale offered similar functionality in their official Docker image, we'd much appreciate it.

Step 3: Create AWS resources with Terraform

This Terraform configuration will take a few variables:

variable "image_name" {
  type = string
}
variable "container_command" {
  type    = list(string)
  default = null
}
variable "subnets" {
  type = list(string)
}
variable "authkey" {
  type = string
}
variable "tailscale_up_args" {
  type    = string
  default = "--advertise-routes 10.1.0.0/16,10.2.0.0/16"
}

Create a task definition and service:

data "aws_region" "current" {}
data "aws_caller_identity" "current" {}

locals {
  name = "tailscale-vpn"
}
resource "aws_ecs_cluster" "default" {
  name = local.name
}
resource "aws_ecs_service" "default" {
  name            = local.name
  cluster         = aws_ecs_cluster.default.name
  task_definition = aws_ecs_task_definition.default.arn
  launch_type     = "FARGATE"

  desired_count          = 1
  enable_execute_command = true

  network_configuration {
    assign_public_ip = false

    subnets = var.subnets

    security_groups = [aws_security_group.default.id]
  }

  wait_for_steady_state = true
}

data "aws_subnet" "subnet" {
  id = var.subnets[0]
}

resource "aws_security_group" "default" {
  name   = local.name
  vpc_id = data.aws_subnet.subnet.vpc_id

  egress {
    from_port        = 0
    to_port          = 0
    protocol         = "-1"
    cidr_blocks      = ["0.0.0.0/0"]
    ipv6_cidr_blocks = ["::/0"]
  }
}

resource "aws_ecs_task_definition" "default" {
  family = local.name
  container_definitions = jsonencode([
    {
      name      = "tailscale"
      image     = var.image_name
      essential = true
      linuxParameters = {
        initProcessEnabled = true
      }
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = aws_cloudwatch_log_group.default.name,
          "awslogs-region"        = data.aws_region.current.name,
          "awslogs-stream-prefix" = local.name
        }
      }
      environment = var.container_environment
      command     = var.container_command
    }
  ])
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"

  cpu    = 256
  memory = 512

  execution_role_arn = aws_iam_role.execution.arn
  task_role_arn      = aws_iam_role.task.arn
}

resource "aws_cloudwatch_log_group" "default" {
  name              = "ecs/${local.name}"
  retention_in_days = 30
}

I've configured Tailscale to store its state in an SSM parameter (that's the --state flag in the bootstrap script), so the task role needs permission to read and write that parameter:

data "aws_iam_policy_document" "ecs_tasks_service" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type = "Service"

      identifiers = [
        "ecs-tasks.amazonaws.com",
      ]
    }
  }
}

resource "aws_iam_role" "execution" {
  name               = "${local.name}-execution"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_service.json
}

resource "aws_iam_role_policy_attachment" "execution" {
  role       = aws_iam_role.execution.id
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

resource "aws_iam_role" "task" {
  name               = "${local.name}-task"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_service.json
}

resource "aws_iam_role_policy" "task" {
  policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameter",
        "ssm:PutParameter"
      ],
      "Resource": [
        "arn:aws:ssm:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:parameter/tailscale-vpn-state"
      ]
    }
  ]
}
POLICY

  role = aws_iam_role.task.id
}

resource "aws_iam_role_policy_attachment" "ssm" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
  role       = aws_iam_role.task.name
}

Add this module to your Terraform project, supply variables such as the following, run terraform plan, and apply.

module "tailscale" {
  source = "../.."

  subnets = data.aws_subnets.subnets.ids

  image_name = local.image

  container_environment = [
    { name = "TAILSCALE_UP_ARGS", value = local.up_args },
    { name = "TAILSCALE_STATE_PARAMETER_ARN", value = local.state_parameter_arn },
    { name = "TAILSCALE_AUTHKEY", value = local.authkey },
  ]
}
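Then, from the project directory:

terraform init
terraform plan
terraform apply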

Step 4: Approve in Tailscale

Depending on which arguments you passed to tailscale up, approve the advertised subnet routes or the exit node in the Tailscale admin console. This is a one-time step.

Step 5: Enjoy!

If you configured a subnet router, try to navigate to a private IP in that subnet. If you configured an exit node, select the Exit Node from the Tailscale menu, and check your IP address -- it should be an AWS IP!
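A few quick checks from a workstation on the tailnet (the private IP and port below are examples):

# list peers and confirm the relay is connected
tailscale status

# test reachability of a private resource through the relay
nc -vz 10.1.12.34 5432

# with the exit node selected, your public IP should belong to AWS
curl https://checkip.amazonaws.com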

If you tear this down, you may want to delete the machine state from the configured SSM parameter:

$ aws ssm delete-parameters --names "tailscale-vpn-state" 

Conclusion

The source code referenced here is all available in the GitHub repository.

I hope this was a helpful tutorial. If anything is unclear, please reach out and I will do my best to answer any questions you may have!