Configuring a timeout for Amazon ECS tasks
I know, I am boring. Some people relax doing Sudoku. I relax writing Step Functions state machines. Not that I am any good, I just enjoy doing it (I am that weird).
As I was searching for my next Sudoku state machine challenge, I bumped into this Amazon ECS roadmap request to introduce support for task timeouts. The request is actually fairly legit. You may need to make sure jobs launched via the runTask API do not go rogue, and you want to be able to configure the infrastructure so that, no matter what happens, a given task can't run for more than a configurable amount of time.
A tangential use case, not necessarily limited to Amazon ECS tasks, is avoiding bill surprises. I have not investigated this further but imagine, for example, setting up temporary ephemeral environments that get purged automatically after a certain period of time using the technique described below.
Following the theme I started with the Automating stable FQDNs for public Amazon ECS tasks (virtual part 1) blog post, I wanted to write a "patch" for the ECS service leveraging AWS Step Functions and Amazon EventBridge to implement this feature request.
The flow I built is fairly simple: if you run a task with a specific tag (TIMEOUT), the state machine will Wait for the amount of time specified in the value of the tag, and then it will call stopTask. If no TIMEOUT tag has been specified, the state machine will just exit. This is the state machine flow as represented in the Step Functions Workflow Studio canvas:
The state machine is triggered by an EventBridge rule that matches ECS Task State Change events with "lastStatus": ["RUNNING"] and "desiredStatus": ["RUNNING"].
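To make the rule's behaviour a bit more concrete, here is a minimal Python sketch of how such a pattern filters events. This is only an illustration: the `matches` helper and the two sample events are hypothetical, and real EventBridge matching supports many more operators than the exact-value matching shown here.

```python
# Simplified sketch of EventBridge-style matching: every key in the pattern
# must exist in the event, and the event's value must be one of the values
# listed in the pattern. Nested dicts are matched recursively.
RULE_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {"lastStatus": ["RUNNING"], "desiredStatus": ["RUNNING"]},
}


def matches(pattern, event):
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            # Nested pattern: the event must carry a nested object that matches too.
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Hypothetical, trimmed-down events (real ones carry many more fields).
running_event = {
    "source": "aws.ecs",
    "detail-type": "ECS Task State Change",
    "detail": {"lastStatus": "RUNNING", "desiredStatus": "RUNNING"},
}
pending_event = {
    "source": "aws.ecs",
    "detail-type": "ECS Task State Change",
    "detail": {"lastStatus": "PENDING", "desiredStatus": "RUNNING"},
}

print(matches(RULE_PATTERN, running_event))
print(matches(RULE_PATTERN, pending_event))
```

Only the first event would start the state machine; a task that is still PENDING is ignored until it actually reaches RUNNING.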
Using the same process I used in the Using AWS Application Composer to build a serviceful application (virtual part 2) blog post, I have produced the following CloudFormation template:
Resources:
  # EventBridge rule: fires when an ECS task reaches the RUNNING state
  ecstaskrunning:
    Type: AWS::Events::Rule
    Properties:
      EventPattern:
        source:
          - aws.ecs
        detail-type:
          - ECS Task State Change
        detail:
          lastStatus:
            - RUNNING
          desiredStatus:
            - RUNNING
      Targets:
        - Id: !GetAtt tasktimeoutstatemachine.Name
          Arn: !Ref tasktimeoutstatemachine
          RoleArn: !GetAtt ecstaskrunningTotasktimeoutstatemachine.Arn
  tasktimeoutstatemachine:
    Type: AWS::Serverless::StateMachine
    Properties:
      Definition:
        Comment: State machine to stop an ECS task after the time set in its TIMEOUT tag
        StartAt: ListTagsForResource
        States:
          # Read the tags of the task that triggered the event
          ListTagsForResource:
            Type: Task
            Next: CheckTimeout
            Parameters:
              ResourceArn.$: $.resources[0]
            ResultPath: $.listTagsForResource
            Resource: arn:aws:states:::aws-sdk:ecs:listTagsForResource
          # Count how many tags have the key TIMEOUT (0 or 1)
          CheckTimeout:
            Type: Pass
            Parameters:
              timeoutexists.$: States.ArrayLength($.listTagsForResource.Tags[?(@.Key == TIMEOUT)])
            ResultPath: $.timeoutconfiguration
            Next: IsTimoutSet
          IsTimoutSet:
            Type: Choice
            Choices:
              - Variable: $.timeoutconfiguration.timeoutexists
                NumericEquals: 1
                Next: GetTimeoutValue
            Default: Success
          # Extract the value of the TIMEOUT tag (in seconds)
          GetTimeoutValue:
            Type: Pass
            Parameters:
              timeoutvalue.$: States.ArrayGetItem($.listTagsForResource.Tags[?(@.Key == TIMEOUT)].Value, 0)
            ResultPath: $.timeoutconfiguration
            Next: Wait
          Success:
            Type: Succeed
          Wait:
            Type: Wait
            Next: StopTask
            SecondsPath: $.timeoutconfiguration.timeoutvalue
          StopTask:
            Type: Task
            Parameters:
              Task.$: $.resources[0]
              Cluster.$: $.detail.clusterArn
            Resource: arn:aws:states:::aws-sdk:ecs:stopTask
            End: true
      Logging:
        Level: ALL
        IncludeExecutionData: true
        Destinations:
          - CloudWatchLogsLogGroup:
              LogGroupArn: !GetAtt tasktimeoutstatemachineLogGroup.Arn
      Policies:
        - AWSXrayWriteOnlyAccess
        - Statement:
            - Effect: Allow
              Action:
                - ecs:ListTagsForResource
                - ecs:StopTask
                - logs:CreateLogDelivery
                - logs:GetLogDelivery
                - logs:UpdateLogDelivery
                - logs:DeleteLogDelivery
                - logs:ListLogDeliveries
                - logs:PutResourcePolicy
                - logs:DescribeResourcePolicies
                - logs:DescribeLogGroups
              Resource: '*'
      Tracing:
        Enabled: true
      Type: STANDARD
  tasktimeoutstatemachineLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub
        - /aws/vendedlogs/states/${AWS::StackName}-${ResourceId}-Logs
        - ResourceId: tasktimeoutstatemachine
  # Role that lets the EventBridge rule start the state machine
  ecstaskrunningTotasktimeoutstatemachine:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          Effect: Allow
          Principal:
            Service: !Sub events.${AWS::URLSuffix}
          Action: sts:AssumeRole
          Condition:
            ArnLike:
              aws:SourceArn: !Sub
                - arn:${AWS::Partition}:events:${AWS::Region}:${AWS::AccountId}:rule/${AWS::StackName}-${ResourceId}-*
                - ResourceId: ecstaskrunning
      Policies:
        - PolicyName: StartExecutionPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action: states:StartExecution
                Resource: !Ref tasktimeoutstatemachine
Transform: AWS::Serverless-2016-10-31
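If ASL is not your cup of tea, the decision logic encoded in the states of the template above can be sketched in a few lines of plain Python. This is just a mental model, not what Step Functions runs: `resolve_timeout` and `plan` are hypothetical names, the tag list mimics the shape returned by ListTagsForResource in the state machine (`Key` / `Value`), and converting the tag value to an integer is my own choice for the sketch.

```python
def resolve_timeout(tags):
    """Mirror CheckTimeout, IsTimoutSet and GetTimeoutValue from the template."""
    # CheckTimeout: collect the values of every tag whose key is TIMEOUT.
    values = [tag["Value"] for tag in tags if tag["Key"] == "TIMEOUT"]
    # IsTimoutSet: only a count of exactly 1 continues; anything else exits.
    if len(values) != 1:
        return None
    # GetTimeoutValue: take the first (and only) value, expressed in seconds.
    return int(values[0])


def plan(tags):
    """Return the remaining steps the state machine would go through."""
    timeout = resolve_timeout(tags)
    if timeout is None:
        return ["Success"]
    return [f"Wait {timeout}s", "StopTask"]


print(plan([{"Key": "TIMEOUT", "Value": "60"}]))
print(plan([{"Key": "team", "Value": "payments"}]))
```

The first call describes a task that gets stopped after 60 seconds; the second has no TIMEOUT tag, so the execution ends without touching the task.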
If you instantiate this template in a CloudFormation stack, you change the behaviour of ECS: whenever you set the tag TIMEOUT (expressed in seconds) on an ECS task, the AWS infrastructure will stop it after the timeout value has expired. As always, given this is just a few lines of IaC (Infrastructure as Code) along with a short snippet of ASL (Amazon States Language), there is no need to maintain it. There is no "code" and there is no traditional language framework or runtime to update. Fire and forget.
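For completeness, this is roughly how a task could be launched with the tag set. Treat it as a sketch under a few assumptions: the helper names, the cluster name and the task definition are made up for illustration, and a real call usually needs more parameters (for example a launch type and network configuration for Fargate tasks).

```python
def build_run_task_request(cluster, task_definition, timeout_seconds):
    """Build the arguments for the ECS RunTask call, including the TIMEOUT tag."""
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be a positive number of seconds")
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        # Tag values are strings; the state machine reads this value as seconds.
        "tags": [{"key": "TIMEOUT", "value": str(timeout_seconds)}],
    }


def launch_with_timeout(cluster, task_definition, timeout_seconds):
    """Call RunTask through boto3 (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    ecs = boto3.client("ecs")
    return ecs.run_task(
        **build_run_task_request(cluster, task_definition, timeout_seconds)
    )


# Hypothetical names: a task that should be stopped after 5 minutes.
request = build_run_task_request("demo-cluster", "demo-task:1", 300)
print(request["tags"])
```

Nothing else is required on the caller's side: once the task reaches RUNNING, the rule and the state machine take it from there.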
This is not a replacement for a native ECS feature, but I continue to find it interesting how AWS services can be "patched" so rapidly and effectively in a way that is much closer to an actual "built-in" capability than it is to traditional "glue code" (to be maintained).
There are a couple of additional considerations worth noting. First, because this template uses Standard Step Functions workflows, the user is not charged for execution time but rather for state transitions (a blessing when you need to wait for a timeout, I guess). Second, there is no timeout for the Wait state itself, so the limit for your TIMEOUT is the maximum execution duration of a Standard workflow, which is... a year (or roughly 31,500,000 seconds). Which I assume is enough for most use cases.
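The back-of-the-envelope math behind that number, plus a small sanity check you could run on a TIMEOUT value before tagging a task, might look like the sketch below. The `is_valid_timeout` helper is hypothetical, and it treats a year as 365 days; in practice you would want to stay a little under the limit, since the other states in the execution also take some time.

```python
SECONDS_PER_DAY = 24 * 60 * 60
# One year of 365 days, the ceiling discussed above for a Standard workflow.
MAX_TIMEOUT_SECONDS = 365 * SECONDS_PER_DAY


def is_valid_timeout(seconds):
    """Return True if the value is positive and fits within one year."""
    return 0 < seconds <= MAX_TIMEOUT_SECONDS


print(MAX_TIMEOUT_SECONDS)
print(is_valid_timeout(3600))
print(is_valid_timeout(2 * MAX_TIMEOUT_SECONDS))
```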
Have fun.
Massimo.
Update: some users have reported an issue where the Step Functions state machine can't be invoked by the rule. This is due to a limit on the length of EventBridge rule names that I called out in my previous blog post and that I am repeating below for convenience.
Tip: make sure not to use a --stack-name that is too long, because EventBridge rule names are limited to 64 characters, and it's easy to get close to that limit when Application Composer builds the rule name from the stack name, the logical ID of the rule and some random characters.
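If you want a quick way to check a stack name before deploying, something like the following rough estimate can help. It is a sketch with one big assumption: the length of the random suffix (13 characters here) is my guess and not something stated in this post, so treat the result as an approximation. The function names are hypothetical; the logical ID is the one used in the template above.

```python
MAX_RULE_NAME_LENGTH = 64   # limit quoted in the tip above
ASSUMED_SUFFIX_LENGTH = 13  # assumption: length of the random suffix


def estimated_rule_name_length(stack_name, logical_id="ecstaskrunning"):
    """Estimate the length of '<stack name>-<logical ID>-<random suffix>'."""
    return len(stack_name) + 1 + len(logical_id) + 1 + ASSUMED_SUFFIX_LENGTH


def stack_name_fits(stack_name, logical_id="ecstaskrunning"):
    """Return True if the estimated rule name stays within the limit."""
    return estimated_rule_name_length(stack_name, logical_id) <= MAX_RULE_NAME_LENGTH


print(estimated_rule_name_length("ecs-timeout"))
print(stack_name_fits("ecs-timeout"))
print(stack_name_fits("a" * 40))
```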