Automating stable FQDNs for public Amazon ECS tasks (virtual part 1)

In an effort to dive deeper into event-driven architectures, I have lately been experimenting with AWS Step Functions, as documented in this blog post, where I refactored the application logic of my demo application Yelb into a set of state machines. As I also wanted to dive deeper into Amazon EventBridge, I was looking for a proper project to get my hands dirty.

I decided to take on a challenge after looking at this ECS feature request on the public AWS containers roadmap, where users are asking for a way to have a reliable public DNS name for a single Amazon ECS task without needing a load balancer in front of it (for cost reasons). It is indeed possible to have an ECS task with a public IP exposed on the Internet, but 1) it is not possible to associate an Elastic IP address with an ECS task, which makes discovery complicated, and 2) there is no out-of-the-box workflow in the product that registers such tasks in DNS on the fly.

Inspired by this request I wanted to create a prototype that would extend the behaviour of ECS to introduce this capability in a way that was the least intrusive possible and that would have the best MTBM (Mean Time Between Maintenance) possible.

If you want to read more about "code as a liability" and the concept of "MTBM" please read the Background section of the blog I linked above.

There is something that excites me about the opportunity of using this approach to create "service extensions" that introduce new features and behaviours into a product with almost zero maintenance. I like to think about this approach as "patching a product" (ECS in this case) to make it do things I need it to do... but that it doesn't do out of the box. Of course this would not be a substitute for AWS engineering introducing new features but, because the number of feature requests is always going to be higher than the number of features that can possibly be delivered, there is a gap, or an opportunity, that could be filled with this approach.

The prototype user experience

So how does this "patch" work?

Let's go through the end-user experience and what an ECS user would do and see:

  • a user creates an ECS service with 1 task in it. The user adds two tags and configures them to propagate to the task:
    • PUBLICHOSTEDZONE: this is the hosted zone as registered in Route53
    • HOSTEDZONEID: this is the id of said zone
  • when the task moves into the RUNNING state, a new A record is created in the zone that maps <ECS servicename>.<$PUBLICHOSTEDZONE> to the public IP of the ECS task
  • if the user manually kills the task, the automation removes the record from Route53 and, when a new task is provisioned by ECS, it adds the record back (through a new UPSERT) with the new task IP (and the same FQDN)
  • if the user deletes the service, the task will be stopped and the automation will just delete the A record in Route53
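
To make the user side of this concrete, here is a minimal Python sketch of the request a user might send to create such a service. It only builds the keyword arguments for the ECS CreateService call and does not call AWS. The Fargate launch type, the helper names and every value in the usage example are assumptions for illustration; only the two tag keys come from the prototype.

```python
def build_create_service_kwargs(cluster, service_name, task_definition,
                                subnets, security_groups,
                                hosted_zone_name, hosted_zone_id):
    """Return keyword arguments for the ECS CreateService API call."""
    return {
        "cluster": cluster,
        "serviceName": service_name,
        "taskDefinition": task_definition,
        "desiredCount": 1,  # the prototype only supports one task per service
        "launchType": "FARGATE",  # assumption: any awsvpc task with a public IP
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                "assignPublicIp": "ENABLED",
            }
        },
        # The two tags the automation looks for.
        "tags": [
            {"key": "PUBLICHOSTEDZONE", "value": hosted_zone_name},
            {"key": "HOSTEDZONEID", "value": hosted_zone_id},
        ],
        # Copy the service tags onto the tasks the service launches.
        "propagateTags": "SERVICE",
    }


def expected_fqdn(service_name, hosted_zone_name):
    """Name the automation is expected to register for the task."""
    return f"{service_name}.{hosted_zone_name}"


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    kwargs = build_create_service_kwargs(
        "demo-cluster", "myservice", "mytask:1",
        ["subnet-0abc"], ["sg-0abc"], "example.com", "Z0000000EXAMPLE")
    print(expected_fqdn(kwargs["serviceName"], "example.com"))
    # With boto3 installed and credentials configured, the request could be
    # sent with: boto3.client("ecs").create_service(**kwargs)
```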

How about a 3-minute demo?

Sure! In this short video I show the user experience I have described above. Note that, for simplicity, the ECS service has already been created, and I am only playing with setting the desired task count from 0 to 1 and from 1 to 0 to show what happens.

The prototype implementation

First and foremost, there is no "application code" involved here other than IaC that configures EventBridge and Step Functions.

This has been implemented through a couple of rules in EventBridge that track when a task is RUNNING and when a task has been requested to be STOPPED.

This rule is triggered when a task has been requested to stop (and the task should be removed from service discovery as soon as possible):

{
  "source": ["aws.ecs"],
  "detail-type": ["ECS Task State Change"],
  "detail": {
    "lastStatus": ["RUNNING"],
    "desiredStatus": ["STOPPED"]
  }
}

This rule is triggered when a task has transitioned into the RUNNING state (and the task can be added to service discovery):

{
  "source": ["aws.ecs"],
  "detail-type": ["ECS Task State Change"],
  "detail": {
    "lastStatus": ["RUNNING"],
    "desiredStatus": ["RUNNING"]
  }
}
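
To see which task transitions each rule reacts to, the two patterns can be exercised with a deliberately simplified matcher. This is a sketch only: real EventBridge matching supports many more operators and array handling, and the `matches` and `task_event` helpers below are hypothetical names, not part of any AWS SDK.

```python
# Event patterns copied from the two rules above.
STOP_REQUESTED = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {"lastStatus": ["RUNNING"], "desiredStatus": ["STOPPED"]},
}
TASK_RUNNING = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {"lastStatus": ["RUNNING"], "desiredStatus": ["RUNNING"]},
}


def matches(pattern, event):
    """Simplified matching: every key in the pattern must exist in the event,
    nested objects are matched recursively, and a list in the pattern means
    "the event value must be one of these"."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf: the pattern lists the accepted values.
            return False
    return True


def task_event(last_status, desired_status):
    """Build a minimal, hypothetical ECS Task State Change event."""
    return {
        "source": "aws.ecs",
        "detail-type": "ECS Task State Change",
        "detail": {"lastStatus": last_status, "desiredStatus": desired_status},
    }


if __name__ == "__main__":
    for last, desired in [("PENDING", "RUNNING"),
                          ("RUNNING", "RUNNING"),
                          ("RUNNING", "STOPPED")]:
        event = task_event(last, desired)
        print(last, desired,
              "running-rule:", matches(TASK_RUNNING, event),
              "stop-rule:", matches(STOP_REQUESTED, event))
```

A task that is still PENDING matches neither rule, so nothing is registered until the task is actually RUNNING.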

Both rules trigger a single Step Functions state machine that does the following:

  • it describes the ENI of the task (this is required to read the task tags and its public IP address)
  • it lists the Route53 record sets (not strictly required for this implementation)
  • it determines what action to set for the R53 API call depending on the event type
  • it runs the ChangeResourceRecordSets API call
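
The first and third steps above can be sketched in a few lines of Python. This is an illustration of the logic, not the prototype's code: it assumes the event carries the ENI ID as the attachment detail named `networkInterfaceId` (as ECS Task State Change events do), and the function names and sample values are hypothetical.

```python
def network_interface_ids(event):
    """Pick the ENI ID(s) out of the first attachment of the task event."""
    details = event["detail"]["attachments"][0]["details"]
    return [d["value"] for d in details if d["name"] == "networkInterfaceId"]


def record_action(event):
    """UPSERT when the task should be running, DELETE in every other case."""
    desired = event["detail"]["desiredStatus"]
    return "UPSERT" if desired == "RUNNING" else "DELETE"


# Trimmed-down, hypothetical event with made-up identifiers.
SAMPLE_EVENT = {
    "detail": {
        "lastStatus": "RUNNING",
        "desiredStatus": "RUNNING",
        "attachments": [
            {
                "type": "eni",
                "details": [
                    {"name": "subnetId", "value": "subnet-0abc"},
                    {"name": "networkInterfaceId",
                     "value": "eni-0123456789abcdef0"},
                ],
            }
        ],
    }
}

if __name__ == "__main__":
    print(network_interface_ids(SAMPLE_EVENT), record_action(SAMPLE_EVENT))
```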

This is the layout of the Step Functions workflow as seen in Step Functions Workflow Studio:

This is what an invocation of the Step Functions workflow looks like:

This is the Step Functions workflow as implemented in this prototype:

{
    "Comment": "State machine to create/update a Route53 record",
    "StartAt": "DescribeNetworkInterfaces",
    "States": {
        "ChangeResourceRecordSets": {
            "End": true,
            "Parameters": {
                "ChangeBatch": {
                    "Changes": [
                        {
                            "Action.$": "$.recordActionOutput.recordAction",
                            "ResourceRecordSet": {
                                "Name.$": "States.Format('{}.{}', States.ArrayGetItem($.NetworkInterfaceDescription.NetworkInterfaces[0].TagSet[?(@.Key==aws:ecs:serviceName)].Value, 0),States.ArrayGetItem($.NetworkInterfaceDescription.NetworkInterfaces[0].TagSet[?(@.Key==PUBLICHOSTEDZONE)].Value, 0))",
                                "ResourceRecords": [
                                    {
                                        "Value.$": "$.NetworkInterfaceDescription.NetworkInterfaces[0].Association.PublicIp"
                                    }
                                ],
                                "Ttl": 60,
                                "Type": "A"
                            }
                        }
                    ]
                },
                "HostedZoneId.$": "States.ArrayGetItem($.NetworkInterfaceDescription.NetworkInterfaces[0].TagSet[?(@.Key==HOSTEDZONEID)].Value, 0)"
            },
            "Resource": "arn:aws:states:::aws-sdk:route53:changeResourceRecordSets",
            "Type": "Task"
        },
        "DeleteAction": {
            "Next": "ChangeResourceRecordSets",
            "Result": {
                "recordAction": "DELETE"
            },
            "ResultPath": "$.recordActionOutput",
            "Type": "Pass"
        },
        "DescribeNetworkInterfaces": {
            "Next": "ListResourceRecordSets",
            "Parameters": {
                "NetworkInterfaceIds.$": "$.detail.attachments[0].details[?(@.name==networkInterfaceId)].value"
            },
            "Resource": "arn:aws:states:::aws-sdk:ec2:describeNetworkInterfaces",
            "ResultPath": "$.NetworkInterfaceDescription",
            "Type": "Task"
        },
        "ListResourceRecordSets": {
            "Next": "RunningOrStopped",
            "Parameters": {
                "HostedZoneId.$": "States.ArrayGetItem($.NetworkInterfaceDescription.NetworkInterfaces[0].TagSet[?(@.Key==HOSTEDZONEID)].Value, 0)"
            },
            "Resource": "arn:aws:states:::aws-sdk:route53:listResourceRecordSets",
            "ResultPath": "$.ResourceRecordSetsOutput",
            "Type": "Task"
        },
        "RunningOrStopped": {
            "Choices": [
                {
                    "Next": "UpsertAction",
                    "StringMatches": "RUNNING",
                    "Variable": "$.detail.desiredStatus"
                },
                {
                    "Next": "DeleteAction",
                    "StringMatches": "STOPPED",
                    "Variable": "$.detail.desiredStatus"
                }
            ],
            "Default": "DeleteAction",
            "Type": "Choice"
        },
        "UpsertAction": {
            "Next": "ChangeResourceRecordSets",
            "Result": {
                "recordAction": "UPSERT"
            },
            "ResultPath": "$.recordActionOutput",
            "Type": "Pass"
        }
    }
}
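
For readers who find the JSONPath expressions in the final state hard to parse, here is roughly what they compute, written as a plain Python function. It is a sketch for reading along, not a replacement for the workflow: the helper names and the sample data are made up, and the output uses the boto3 spelling `TTL` where the state machine uses `Ttl`.

```python
def tag_value(tag_set, key):
    """Return the value of the first tag with the given key."""
    for tag in tag_set:
        if tag["Key"] == key:
            return tag["Value"]
    raise KeyError(f"tag {key!r} not found")


def change_request(eni_description, action):
    """Build ChangeResourceRecordSets parameters from a
    DescribeNetworkInterfaces response and an UPSERT/DELETE action."""
    eni = eni_description["NetworkInterfaces"][0]
    tags = eni["TagSet"]
    # <service name>.<hosted zone>, like the States.Format call above.
    name = "{}.{}".format(tag_value(tags, "aws:ecs:serviceName"),
                          tag_value(tags, "PUBLICHOSTEDZONE"))
    return {
        "HostedZoneId": tag_value(tags, "HOSTEDZONEID"),
        "ChangeBatch": {
            "Changes": [
                {
                    "Action": action,
                    "ResourceRecordSet": {
                        "Name": name,
                        "Type": "A",
                        "TTL": 60,  # boto3 spelling of the "Ttl" field
                        "ResourceRecords": [
                            {"Value": eni["Association"]["PublicIp"]}
                        ],
                    },
                }
            ]
        },
    }


# Hypothetical DescribeNetworkInterfaces response, trimmed to the used fields.
SAMPLE_ENI = {
    "NetworkInterfaces": [
        {
            "Association": {"PublicIp": "203.0.113.10"},
            "TagSet": [
                {"Key": "aws:ecs:serviceName", "Value": "myservice"},
                {"Key": "PUBLICHOSTEDZONE", "Value": "example.com"},
                {"Key": "HOSTEDZONEID", "Value": "Z0000000EXAMPLE"},
            ],
        }
    ]
}

if __name__ == "__main__":
    print(change_request(SAMPLE_ENI, "UPSERT"))
```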

Fun fact: I have since found out that Ray, a colleague of mine, has created a very similar solution where the update logic is performed in a Lambda function instead of a Step Functions workflow. In the spirit of building something that is close to zero maintenance I still lean towards the Step Functions implementation, but Lambda is another option!

Fun fact #2: Aaron has since shared he also built something similar with Lambda, and it's available in the CDK Construct Hub. This even works for services with multiple tasks!

Things to keep in mind

Note that this prototype only works with a single task in an ECS service. If you don't want to use an ECS service, it would be trivial to apply the tags to the task using an alternative mechanism (e.g., inheriting them via the task definition) and to add another tag (e.g., "HOSTNAME") to use as the record name instead of the service name. Also note that the prototype does not have many controls and the Step Functions Amazon States Language (ASL) definition assumes that tasks have those tags.
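
As a rough idea of what those extra controls could look like, the sketch below checks that the expected tags exist and prefers a "HOSTNAME" tag over the service name when building the record name. None of this is in the prototype: the "HOSTNAME" tag is the hypothetical one mentioned above, and the function names are invented for the example.

```python
REQUIRED_TAGS = ("PUBLICHOSTEDZONE", "HOSTEDZONEID")


def tags_to_dict(tag_set):
    """Turn a list of {"Key": ..., "Value": ...} tags into a dict."""
    return {tag["Key"]: tag["Value"] for tag in tag_set}


def missing_tags(tag_set):
    """Return the required tag keys that are absent or empty."""
    tags = tags_to_dict(tag_set)
    return [key for key in REQUIRED_TAGS if not tags.get(key)]


def record_name(tag_set):
    """Use the HOSTNAME tag if present, else fall back to the service name."""
    tags = tags_to_dict(tag_set)
    absent = missing_tags(tag_set)
    if absent:
        raise ValueError(f"missing required tags: {absent}")
    host = tags.get("HOSTNAME") or tags.get("aws:ecs:serviceName")
    if not host:
        raise ValueError("neither HOSTNAME nor aws:ecs:serviceName is set")
    return f"{host}.{tags['PUBLICHOSTEDZONE']}"


if __name__ == "__main__":
    # Hypothetical tags of a standalone task (no ECS service involved).
    standalone = [
        {"Key": "HOSTNAME", "Value": "batchjob"},
        {"Key": "PUBLICHOSTEDZONE", "Value": "example.com"},
        {"Key": "HOSTEDZONEID", "Value": "Z0000000EXAMPLE"},
    ]
    print(record_name(standalone))
```

In the workflow itself, a check like this would most likely become an extra Choice state in front of the Route53 call.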

With a bit more logic (a lot more probably) one could also think about creating a workflow that adds multiple IPs to the same A record, thus creating a sort of "poor man's load balancer" solution (which gives me shudders, because I fear what the Internet would say about using DNS for load balancing, but it would be an interesting academic experiment nonetheless).
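
Purely as a thought experiment, the record set for such a multi-IP setup could be built as below. This is not something the prototype does: the function name is made up, and the hard part (tracking which tasks are alive and merging their IPs safely) is left out entirely.

```python
def multi_ip_record_set(name, public_ips, ttl=60):
    """Build a single A record set that lists several task IPs."""
    unique_ips = sorted(set(public_ips))  # drop duplicates, stable ordering
    if not unique_ips:
        raise ValueError("at least one IP address is required")
    return {
        "Name": name,
        "Type": "A",
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip} for ip in unique_ips],
    }


if __name__ == "__main__":
    # Documentation-range addresses, for illustration only.
    print(multi_ip_record_set("myservice.example.com",
                              ["203.0.113.10", "198.51.100.7"]))
```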

Of course, since this is a DNS-based solution, all concerns related to client caching etc. apply here. This prototype sets a 60-second TTL in the DNS record, but it could be set to something else, or even parametrized with another tag, if need be.
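
Parametrizing the TTL with a tag could look something like the sketch below. The "TTL" tag key and the fallback behaviour are assumptions made for this example; the prototype hardcodes 60 seconds.

```python
DEFAULT_TTL = 60  # the value the prototype hardcodes


def ttl_from_tags(tag_set, key="TTL"):
    """Read the TTL from a (hypothetical) tag, falling back to the default
    when the tag is missing or is not a positive integer."""
    for tag in tag_set:
        if tag["Key"] == key:
            try:
                ttl = int(tag["Value"])
            except ValueError:
                return DEFAULT_TTL
            return ttl if ttl > 0 else DEFAULT_TTL
    return DEFAULT_TTL


if __name__ == "__main__":
    print(ttl_from_tags([{"Key": "TTL", "Value": "300"}]))
    print(ttl_from_tags([]))
```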

Nice! But these seem to be random configuration pieces... where is the "patch" you promised us?

Fair! I have described the (few) pieces of configuration that I need to implement but how do I tie them together in an "artifact" that I can apply? How do I tie an EventBridge rule to a state machine target? How do I define the role and policy of what the Step Functions state machine can do? How do I define the role and policy of what the EventBridge rules can target?
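
To give a feel for what that wiring involves, here is a rough Python sketch of the pieces as plain dictionaries: the trust policies, the permissions each role would need, and the parameters of the EventBridge PutTargets call. It is only an illustration under assumptions. The function names, target ID and ARNs are placeholders, and the permissions are the minimum I would expect for the API calls used in the workflow.

```python
import json


def trust_policy(service_principal):
    """Trust policy letting an AWS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": service_principal},
            "Action": "sts:AssumeRole",
        }],
    }


def eventbridge_role_policy(state_machine_arn):
    """What the EventBridge rules may do: start the state machine."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "states:StartExecution",
            "Resource": state_machine_arn,
        }],
    }


def state_machine_role_policy(hosted_zone_id):
    """What the state machine may do: read ENIs, manage records in one zone."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "ec2:DescribeNetworkInterfaces",
                "Resource": "*",  # Describe* calls are not resource-scoped
            },
            {
                "Effect": "Allow",
                "Action": ["route53:ListResourceRecordSets",
                           "route53:ChangeResourceRecordSets"],
                "Resource": f"arn:aws:route53:::hostedzone/{hosted_zone_id}",
            },
        ],
    }


def put_targets_kwargs(rule_name, state_machine_arn, role_arn):
    """Parameters tying one EventBridge rule to the state machine."""
    return {
        "Rule": rule_name,
        "Targets": [{
            "Id": "fqdn-state-machine",  # hypothetical target ID
            "Arn": state_machine_arn,
            "RoleArn": role_arn,
        }],
    }


if __name__ == "__main__":
    # Placeholder ARN, for illustration only.
    sm_arn = "arn:aws:states:us-east-1:111122223333:stateMachine:fqdn"
    print(json.dumps(trust_policy("events.amazonaws.com"), indent=2))
    print(json.dumps(eventbridge_role_policy(sm_arn), indent=2))
    print(json.dumps(trust_policy("states.amazonaws.com"), indent=2))
    print(json.dumps(state_machine_role_policy("Z0000000EXAMPLE"), indent=2))
```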

One option would be to define this solution in CDK as a pattern. But, in the spirit of exploring and learning more about new AWS products, I have decided to use AWS Application Composer to build this "patch". Follow me to (virtual) part 2 of this post to see how easy it is.