Configuring a timeout for Amazon ECS tasks

I know, I am boring. Some people relax doing Sudoku. I relax writing Step Functions state machines. Not that I am any good, I just enjoy doing it (I am that weird).

As I was searching for my next Sudoku state machine challenge, I bumped into this Amazon ECS roadmap request to introduce support for task timeouts. The request is actually fairly legit. You may need to make sure jobs launched via the RunTask API do not go rogue, and you want to be able to configure the infrastructure so that, no matter what happens, a given task can't run for more than a configurable amount of time.

A tangential use case, not necessarily limited to Amazon ECS tasks, is avoiding bill surprises. I have not investigated this further, but imagine, for example, setting up ephemeral environments that get purged automatically after a certain period of time using the technique described below.

Following the theme I started with the Automating stable FQDNs for public Amazon ECS tasks (virtual part 1) blog post, I wanted to write a "patch" for the ECS service leveraging AWS Step Functions and Amazon EventBridge to implement this feature request.

The flow I built is fairly simple: if you run a task with a specific tag (TIMEOUT), the state machine will wait for the number of seconds specified in the value of the tag, and then it will call StopTask. If no TIMEOUT tag has been specified, the state machine will just exit. This is the state machine flow as represented in the Step Functions Workflow Studio canvas:

The state machine is triggered by an EventBridge rule that matches ECS Task State Change events with "lastStatus": ["RUNNING"] and "desiredStatus": ["RUNNING"].
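To make the matching concrete, here is a minimal Python sketch of what the rule's pattern accepts. It is not how EventBridge evaluates patterns internally, just an equivalent check for this one rule. The sample event is made up and heavily trimmed (real ECS Task State Change events carry many more fields), and the ARNs and account ID are placeholders.

```python
def matches_running_rule(event: dict) -> bool:
    """Return True when the event would be accepted by the rule's pattern."""
    detail = event.get("detail", {})
    return (
        event.get("source") == "aws.ecs"
        and event.get("detail-type") == "ECS Task State Change"
        and detail.get("lastStatus") == "RUNNING"
        and detail.get("desiredStatus") == "RUNNING"
    )


# Hypothetical, trimmed-down event. The state machine reads resources[0]
# (the task ARN) and detail.clusterArn from it.
sample_event = {
    "source": "aws.ecs",
    "detail-type": "ECS Task State Change",
    "resources": [
        "arn:aws:ecs:us-east-1:111122223333:task/demo-cluster/0123456789abcdef"
    ],
    "detail": {
        "clusterArn": "arn:aws:ecs:us-east-1:111122223333:cluster/demo-cluster",
        "lastStatus": "RUNNING",
        "desiredStatus": "RUNNING",
    },
}

# A task that is on its way out: still RUNNING, but already asked to stop.
stopping_event = {
    **sample_event,
    "detail": {**sample_event["detail"], "desiredStatus": "STOPPED"},
}

print(matches_running_rule(sample_event))
print(matches_running_rule(stopping_event))
```

Requiring both fields to be RUNNING presumably keeps the state machine from starting again for a task that has already been asked to stop while its last reported status is still RUNNING.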

Using the same process I used in the Using AWS Application Composer to build a serviceful application (virtual part 2) blog post, I have produced the following CloudFormation template:

Resources:
  # EventBridge rule: fires when a task reaches RUNNING and is meant to keep running
  ecstaskrunning:
    Type: AWS::Events::Rule
    Properties:
      EventPattern:
        source:
          - aws.ecs
        detail-type:
          - ECS Task State Change
        detail:
          lastStatus:
            - RUNNING
          desiredStatus:
            - RUNNING
      Targets:
        - Id: !GetAtt tasktimeoutstatemachine.Name
          Arn: !Ref tasktimeoutstatemachine
          RoleArn: !GetAtt ecstaskrunningTotasktimeoutstatemachine.Arn
  tasktimeoutstatemachine:
    Type: AWS::Serverless::StateMachine
    Properties:
      Definition:
        Comment: State machine to stop an ECS task once its TIMEOUT tag expires
        StartAt: ListTagsForResource
        States:
          # resources[0] in the event is the ARN of the task that just started
          ListTagsForResource:
            Type: Task
            Next: CheckTimeout
            Parameters:
              ResourceArn.$: $.resources[0]
            ResultPath: $.listTagsForResource
            Resource: arn:aws:states:::aws-sdk:ecs:listTagsForResource
          # Count the tags whose key is TIMEOUT (0 or 1)
          CheckTimeout:
            Type: Pass
            Parameters:
              timeoutexists.$: States.ArrayLength($.listTagsForResource.Tags[?(@.Key == 'TIMEOUT')])
            ResultPath: $.timeoutconfiguration
            Next: IsTimoutSet
          IsTimoutSet:
            Type: Choice
            Choices:
              - Variable: $.timeoutconfiguration.timeoutexists
                NumericEquals: 1
                Next: GetTimeoutValue
            Default: Success
          # Tag values are strings; StringToJson turns "300" into the number 300,
          # because SecondsPath in the Wait state must reference a number
          GetTimeoutValue:
            Type: Pass
            Parameters:
              timeoutvalue.$: States.StringToJson(States.ArrayGetItem($.listTagsForResource.Tags[?(@.Key == 'TIMEOUT')].Value, 0))
            ResultPath: $.timeoutconfiguration
            Next: Wait
          Success:
            Type: Succeed
          Wait:
            Type: Wait
            Next: StopTask
            SecondsPath: $.timeoutconfiguration.timeoutvalue
          StopTask:
            Type: Task
            Parameters:
              Task.$: $.resources[0]
              Cluster.$: $.detail.clusterArn
            Resource: arn:aws:states:::aws-sdk:ecs:stopTask
            End: true
      Logging:
        Level: ALL
        IncludeExecutionData: true
        Destinations:
          - CloudWatchLogsLogGroup:
              LogGroupArn: !GetAtt tasktimeoutstatemachineLogGroup.Arn
      Policies:
        - AWSXrayWriteOnlyAccess
        - Statement:
            - Effect: Allow
              Action:
                - ecs:ListTagsForResource
                - ecs:StopTask
                - logs:CreateLogDelivery
                - logs:GetLogDelivery
                - logs:UpdateLogDelivery
                - logs:DeleteLogDelivery
                - logs:ListLogDeliveries
                - logs:PutResourcePolicy
                - logs:DescribeResourcePolicies
                - logs:DescribeLogGroups
              Resource: '*'
      Tracing:
        Enabled: true
      Type: STANDARD
  tasktimeoutstatemachineLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub
        - /aws/vendedlogs/states/${AWS::StackName}-${ResourceId}-Logs
        - ResourceId: tasktimeoutstatemachine
  # Role that lets the EventBridge rule start the state machine
  ecstaskrunningTotasktimeoutstatemachine:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          Effect: Allow
          Principal:
            Service: !Sub events.${AWS::URLSuffix}
          Action: sts:AssumeRole
          Condition:
            ArnLike:
              aws:SourceArn: !Sub
                - arn:${AWS::Partition}:events:${AWS::Region}:${AWS::AccountId}:rule/${AWS::StackName}-${ResourceId}-*
                - ResourceId: ecstaskrunning
      Policies:
        - PolicyName: StartExecutionPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action: states:StartExecution
                Resource: !Ref tasktimeoutstatemachine
Transform: AWS::Serverless-2016-10-31
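If the intrinsic functions in the CheckTimeout and GetTimeoutValue states look cryptic, the following Python sketch mirrors what those two Pass states and the Choice state between them are meant to do. It is only an illustration of the logic, not something the template runs. The function name is made up, and the input is assumed to have the shape of `$.listTagsForResource.Tags` (a list of `Key`/`Value` pairs).

```python
from typing import Dict, List, Optional


def timeout_from_tags(tags: List[Dict[str, str]]) -> Optional[int]:
    """Return the timeout in seconds, or None when no TIMEOUT tag is set."""
    # CheckTimeout: keep only the tags whose key is exactly TIMEOUT.
    values = [tag["Value"] for tag in tags if tag.get("Key") == "TIMEOUT"]

    # IsTimoutSet: anything other than exactly one match goes to Success.
    if len(values) != 1:
        return None

    # GetTimeoutValue: take the first match. Tag values are strings, so the
    # sketch converts it to an integer number of seconds for the Wait step.
    return int(values[0])


print(timeout_from_tags([{"Key": "TIMEOUT", "Value": "300"}]))
print(timeout_from_tags([{"Key": "team", "Value": "platform"}]))
```

A non-numeric tag value makes `int()` raise here. In the state machine, I would expect the equivalent situation to surface as a failed execution rather than a stopped task.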

If you instantiate this template in a CloudFormation stack, you change the behaviour of ECS: whenever you set the tag TIMEOUT (expressed in seconds) on an ECS task, the AWS infrastructure will stop it after the timeout value has expired. As always, given this is just a few lines of IaC (Infrastructure as Code) along with a short snippet of ASL (Amazon States Language), there is no need to maintain it. There is no "code" and there is no traditional language framework or runtime to update. Fire and forget.
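For completeness, this is roughly how a task could be launched with the tag from Python. Treat it as a sketch under assumptions: the cluster and task definition names are placeholders, boto3 has to be installed and configured separately, and a real RunTask call usually needs more parameters (launch type, network configuration and so on) than shown here.

```python
def build_run_task_request(cluster: str, task_definition: str, timeout_seconds: int) -> dict:
    """Build the keyword arguments for an ECS RunTask call with a TIMEOUT tag."""
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be a positive number of seconds")
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        # ECS tag values are strings, so the number of seconds is stringified.
        "tags": [{"key": "TIMEOUT", "value": str(timeout_seconds)}],
    }


def launch(request: dict) -> dict:
    """Send the request to ECS. Needs boto3 and valid AWS credentials."""
    import boto3  # third-party, imported lazily so the builder works without it

    return boto3.client("ecs").run_task(**request)


# Placeholder names: replace them with your own cluster and task definition.
request = build_run_task_request("demo-cluster", "demo-task:1", 300)
print(request["tags"])
```

Calling `launch(request)` would start the task, and with the stack deployed the state machine should stop it about five minutes after it reaches RUNNING.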

This is not a replacement for a native ECS feature, but I continue to find it interesting how AWS services can be "patched" so rapidly and effectively, in a way that is much closer to an actual "built-in" capability than it is to traditional "glue code" (to be maintained).

There are a couple of additional considerations worth noting. First, because this template uses Standard Step Functions workflows, the user is not charged for execution time but rather for state transitions (a blessing when you need to wait for a timeout, I guess). Second, there is no timeout for the Wait state itself, so the upper limit for your TIMEOUT is the maximum execution time of a Standard workflow, which is... a year (roughly 31,500,000 seconds). I assume that is enough for most use cases.
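The arithmetic behind that ceiling is quick to check. The sketch below assumes a 365-day year, and the helper name is made up. It also treats the limit as strict, because the other states in the workflow need a little time to run as well.

```python
SECONDS_PER_DAY = 24 * 60 * 60
# One year of execution time, assuming 365 days.
MAX_STANDARD_EXECUTION_SECONDS = 365 * SECONDS_PER_DAY


def is_usable_timeout(seconds: int) -> bool:
    """Return True when the value fits within a one-year execution."""
    return 0 < seconds < MAX_STANDARD_EXECUTION_SECONDS


print(MAX_STANDARD_EXECUTION_SECONDS)
print(is_usable_timeout(300))
```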

Have fun.

Massimo.

Update: some users have reported an issue where the Step Functions state machine can't be invoked by the rule. This is due to a limit on the length of EventBridge rule names that I called out in my previous blog post and that I am repeating below for convenience.

Tip: make sure to not use a --stack-name that is too long because EventBridge rule names are limited to 64 characters, and it's easy to get too close to 64 characters when Application Composer builds the rule name adding the stack name, logical ID of the rule and more random characters.
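A quick way to sanity-check a stack name before deploying is to estimate the length of the generated rule name. This sketch assumes the name looks like the stack name, the logical ID and a random suffix joined by hyphens, and it guesses 13 characters for the suffix. Both the format and the suffix length are assumptions on my part, so leave some margin.

```python
RULE_NAME_LIMIT = 64        # EventBridge limit on rule name length
ASSUMED_SUFFIX_LENGTH = 13  # guess at the random suffix length, not a documented value


def estimated_rule_name_length(stack_name: str, logical_id: str) -> int:
    """Estimate the length of '<stack>-<logical id>-<random suffix>'."""
    return len(stack_name) + 1 + len(logical_id) + 1 + ASSUMED_SUFFIX_LENGTH


def stack_name_fits(stack_name: str, logical_id: str = "ecstaskrunning") -> bool:
    """Return True when the estimated rule name stays within the limit."""
    return estimated_rule_name_length(stack_name, logical_id) <= RULE_NAME_LIMIT


print(stack_name_fits("ecs-task-timeout"))
print(stack_name_fits("my-very-long-and-descriptive-stack-name-for-ecs-timeouts"))
```

With the logical ID used in this template, the estimate leaves roughly 35 characters for the stack name.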