AWS CloudFormation and Generative AI

The CloudFormation team has been on a roll over the last 12 months. Among many releases, they introduced Git stack management, up to 40% faster deployments, stack visualization with Infrastructure Composer, adjustable timeouts and, last week, the timeline view for deployments.

Being a very visual person, the last one piqued my curiosity, so I tried it with my Yelb test application. I really liked how they have been able to turn a very dry (and honestly cryptic) list of events into an intuitive graphical view that makes it easy to get a sense of the time it takes to deploy each resource, as well as of the dependencies among them.

This is the live view of my Yelb deployment on ECS/Fargate using CloudFormation. In the spirit of "a picture is worth 1000 words", nothing beats a great diagram to communicate to a human being:

Looking at this diagram got me thinking, though. Why are these lines the way they are? Have I configured the dependencies properly? Is there anything that I could do to optimize some of these deployment times by re-organizing the resources? It didn't occur to me to ask these questions until I actually visualized the... timeline of the events.

Can we use generative AI to answer some of those questions? Maybe.

Leveraging Generative AI to make CloudFormation better

Before getting into some initial experimentation, let's discuss the setup. For these exercises, I want to use a playground that is able to read diagrams as an input and generate diagrams as an output (something that I would like to experiment with). For this reason, I will use claude.ai.

The experiments you will see below will include these 3 artifacts as context for the various conversations:

  1. The Yelb CloudFormation template to deploy the application to Amazon ECS. This is the link to it on GitHub
  2. The deployment timeline layout (the screenshot above as a PNG file)
  3. The raw list of the deployment events as obtained using the standard aws cloudformation describe-stack-events CLI command

On the third point specifically, this is the command that I have used to generate the list of events:

aws cloudformation describe-stack-events --stack-name yelb-ecs --query 'StackEvents[].[ResourceType,LogicalResourceId,ResourceStatus,ResourceStatusReason,Timestamp,EventId]' --output text --no-cli-pager

And this is the head of the output file that contains the events generated with the command above:

 1[cloudshell-user@ip-10-136-50-183 ~]$ head -10 yelb-ecs-events.txt 
 2------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 3|                                                                                                                             DescribeStackEvents                                                                                                                            |
 4+--------------------------------------------+---------------------------------------+---------------------+---------------------------------------+-----------------------------------+-------------------------------------------------------------------------------------+
 5|  AWS::CloudFormation::Stack                |  yelb-ecs                             |  CREATE_COMPLETE    |  None                                 |  2024-11-14T09:19:49.559000+00:00 |  991af040-a269-11ef-881a-0ee756e81a1d                                               |
 6|  AWS::ECS::Service                         |  ServiceYelbUi                        |  CREATE_COMPLETE    |  None                                 |  2024-11-14T09:19:48.372000+00:00 |  ServiceYelbUi-CREATE_COMPLETE-2024-11-14T09:19:48.372Z                             |
 7|  AWS::ECS::Service                         |  ServiceYelbAppserver                 |  CREATE_COMPLETE    |  None                                 |  2024-11-14T09:18:30.233000+00:00 |  ServiceYelbAppserver-CREATE_COMPLETE-2024-11-14T09:18:30.233Z                      |
 8|  AWS::ECS::Service                         |  ServiceYelbDb                        |  CREATE_COMPLETE    |  None                                 |  2024-11-14T09:18:29.814000+00:00 |  ServiceYelbDb-CREATE_COMPLETE-2024-11-14T09:18:29.814Z                             |
 9|  AWS::CloudFormation::Stack                |  yelb-ecs                             |  CREATE_IN_PROGRESS |  Eventual consistency check initiated |  2024-11-14T09:18:18.410000+00:00 |  62c4b8a0-a269-11ef-93f8-0e298ff799c7                                               |
10|  AWS::ECS::Service                         |  ServiceYelbUi                        |  CREATE_IN_PROGRESS |  Eventual consistency check initiated |  2024-11-14T09:18:18.373000+00:00 |  ServiceYelbUi-012aa4e5-eeb3-481f-8669-216650ed4a1f                                 |
11|  AWS::ECS::Service                         |  ServiceYelbUi                        |  CREATE_IN_PROGRESS |  Resource creation Initiated          |  2024-11-14T09:18:17.576000+00:00 |  ServiceYelbUi-CREATE_IN_PROGRESS-2024-11-14T09:18:17.576Z                          |                                          |

Note: when I consulted with the CloudFormation team, they suggested that the layout of these raw events could be optimized to help an LLM reason about them better. I haven't yet implemented those suggestions in my experiments. All this to say that the results can only improve from what you will see below.
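To make the idea of a more compact layout concrete, here is a minimal Python sketch of one possible way to condense the raw events into a single duration per resource. This is only an illustration: it is not the layout the CloudFormation team suggested and it is not part of the experiments below. It assumes tab-separated input (what `--output text` produces) with the same six fields as the `--query` above. The function name `summarize_events` is made up, and the sample rows are the `ServiceYelbUi` and `ServiceYelbDb` rows from the excerpt, retyped with tabs.

```python
from datetime import datetime

# Rows taken from the excerpt above, retyped as tab-separated text (assumed layout).
SAMPLE = "\n".join([
    "AWS::ECS::Service\tServiceYelbUi\tCREATE_COMPLETE\tNone\t"
    "2024-11-14T09:19:48.372000+00:00\tServiceYelbUi-CREATE_COMPLETE-2024-11-14T09:19:48.372Z",
    "AWS::ECS::Service\tServiceYelbDb\tCREATE_COMPLETE\tNone\t"
    "2024-11-14T09:18:29.814000+00:00\tServiceYelbDb-CREATE_COMPLETE-2024-11-14T09:18:29.814Z",
    "AWS::ECS::Service\tServiceYelbUi\tCREATE_IN_PROGRESS\tEventual consistency check initiated\t"
    "2024-11-14T09:18:18.373000+00:00\tServiceYelbUi-012aa4e5-eeb3-481f-8669-216650ed4a1f",
    "AWS::ECS::Service\tServiceYelbUi\tCREATE_IN_PROGRESS\tResource creation Initiated\t"
    "2024-11-14T09:18:17.576000+00:00\tServiceYelbUi-CREATE_IN_PROGRESS-2024-11-14T09:18:17.576Z",
])


def summarize_events(text):
    """Map each logical ID to the seconds between its earliest
    CREATE_IN_PROGRESS event and its CREATE_COMPLETE event."""
    started, completed = {}, {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        logical_id, status, stamp = fields[1], fields[2], fields[4]
        when = datetime.fromisoformat(stamp)
        if status == "CREATE_IN_PROGRESS":
            # Events are listed newest first, so keep the earliest timestamp seen.
            if logical_id not in started or when < started[logical_id]:
                started[logical_id] = when
        elif status == "CREATE_COMPLETE":
            completed[logical_id] = when
    # Only resources with both a start and an end event get a duration.
    return {
        logical_id: (completed[logical_id] - started[logical_id]).total_seconds()
        for logical_id in completed
        if logical_id in started
    }


for resource, seconds in summarize_events(SAMPLE).items():
    print(f"{resource}: {seconds:.1f}s")  # prints "ServiceYelbUi: 90.8s"
```

ServiceYelbDb is skipped because the sample contains no start event for it. On the full event list, the earliest CREATE_IN_PROGRESS event of a resource may be older than the ones visible in the excerpt, so the numbers would differ.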

Applying the basics of Generative AI to a CloudFormation stack

The most obvious, and perhaps boring, thing you could do is get a summary of this CloudFormation stack. I know this app inside-out, but imagine yourself landing on the CFN console, seeing a stack you have never seen before and wondering "what on earth is this thing?".

Give me a detailed summary of the resources in this deployment and their deployment sequence.

What does this application do?

Surprisingly, with just these 3 artifacts in the context the level of information you can extract is not trivial (considering there is no access to the source code - except for the fact that claude.ai seems to be getting access to the "public repository for these images").

Prompting to explain cryptic details of the timeline view

As I said, while I know this application inside-out (or I used to, given that I haven't looked at these details for ages), when checking the timeline view I couldn't wrap my head around why some of the Security Groups would not start deploying along with the others. There shouldn't be an (obvious) reason why that would happen. So I decided to ask:

Why does the creation of the YelbDbSecurityGroup and the YelbRedisServerSecurityGroup is not starting at the same time of all the other Security Groups. It looks like there is a dependency but they should not be dependent on anything.

Stupid me! Of course! I forgot about it. However, this simple interaction saved me from having to go check the source code of this template. Yes, I am lazy, but imagine trying to understand some of these nuances for a (complex) stack that you don't know anything about. It also gave me some hints about how to optimize the CloudFormation code (more on this later).

Similarly, you can ask clarifying questions about why a given resource took so long to start. This could be particularly useful when you are dealing with AWS resources for services you are not intimately familiar with, and you may not have the level of understanding required to properly interpret their deployment times:

Why are the ecs services starting so late?

This technique allows you to extract knowledge about the behavior of AWS services, starting from questions related to the timeline view.

Playing with visual explorations

This is where I had some fun. I love the timeline view, where you can kind of infer the dependencies among the resources, but they are not explicit. So I wanted to prompt claude.ai to generate a more explicit diagram of the dependencies among the resources in the template:

Give me a very clear graphical representation of the dependency tree.

This prompt gave me a very detailed (yet hard to read) Mermaid diagram, so I followed up with:

Make it more readable

This produced a grouped and more readable diagram (which isn't yet super easy to read so you may want to click to enlarge):

Note that this is not the same view you would get with AWS Infrastructure Composer, which is more of a logical view of how the resources map to each other. The views I am playing with are built around the deployment dependencies that exist among these resources. Different tools for different goals.
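For readers who have never seen Mermaid, here is a minimal Python sketch of how a list of deployment dependencies can be rendered as Mermaid flowchart text. It is not the diagram claude.ai produced. The edges are the load balancer chain discussed later in this post; `YelbLoadBalancer` and `YelbLoadBalancerListener` are placeholder logical IDs I am assuming for the example, and `to_mermaid` is a made-up helper name.

```python
def to_mermaid(edges):
    """Render (dependency, dependent) pairs as a left-to-right Mermaid flowchart."""
    lines = ["graph LR"]
    # Deduplicate and sort so the output is stable between runs.
    for dependency, dependent in sorted(set(edges)):
        lines.append(f"    {dependency} --> {dependent}")
    return "\n".join(lines)


# Placeholder logical IDs, except YelbLBSecurityGroup and ServiceYelbUi.
EDGES = [
    ("YelbLBSecurityGroup", "YelbLoadBalancer"),
    ("YelbLoadBalancer", "YelbLoadBalancerListener"),
    ("YelbLoadBalancerListener", "ServiceYelbUi"),
]

print(to_mermaid(EDGES))
```

The printed text can be pasted into any Mermaid renderer. An arrow reads "must exist before", which is the deployment-dependency view rather than the logical view.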

Even more interestingly, I could start navigating through the data asking to build ad-hoc graphical explorations for pieces of the infrastructure I want to dig into. For example, I could ask for a visual representation of the northbound and southbound dependencies of a given resource:

Based on the deployment events, what are the resources that the yelb-ui task definition depends on and what are the resources that depend on the yelb-ui task definition?

The attempt to label these resources with times and time gaps is an interesting angle (which I completely ignored and did not explore further for now in my experiments).
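The northbound/southbound question boils down to a graph traversal in both directions. The following Python sketch shows the idea on a tiny, made-up edge list. Apart from `ServiceYelbUi`, every logical ID here is a placeholder I am assuming for illustration, not a name taken from the Yelb template, and `neighbours` is a hypothetical helper.

```python
from collections import defaultdict


def reachable(start, adjacency):
    """Return every node reachable from start by following adjacency."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def neighbours(edges, resource):
    """Split the graph into what a resource depends on and what depends on it."""
    downstream, upstream = defaultdict(set), defaultdict(set)
    for dependency, dependent in edges:
        downstream[dependency].add(dependent)
        upstream[dependent].add(dependency)
    return {
        "depends_on": reachable(resource, upstream),
        "required_by": reachable(resource, downstream),
    }


# Placeholder edges: (dependency, dependent).
EDGES = [
    ("ExampleTaskExecutionRole", "TaskDefinitionYelbUi"),
    ("TaskDefinitionYelbUi", "ServiceYelbUi"),
    ("ExampleListener", "ServiceYelbUi"),
]

print(neighbours(EDGES, "TaskDefinitionYelbUi"))
```

With these edges, the task definition depends on `ExampleTaskExecutionRole` and is required by `ServiceYelbUi`, while `ExampleListener` shows up in neither set because it sits on a different branch.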

Optimizing the CloudFormation template

And last but not least, probably the most interesting (and challenging) use case. Given all that can be learned from the three artifacts passed as context, how can the CloudFormation template be made better and optimized from a deployment time perspective?

Anecdotal evidence based on a limited set of experiments seems to reveal that a big bang approach doesn't give the result you'd expect. For example, using the following "fix it all" prompt produces inaccurate results:

Looking at the timeline of the deployment, the dependencies and how long it takes to deploy them, how can the CloudFormation template be optimized for speed of deployment? Consider parallelizing tasks or perhaps change it so resources can be deployed faster. Splitting the template into nested stacks is not an option. Pre-deploying some of the resources is not an option either.

It suggests removing "Unnecessary DependsOn" but there is only one in the template (and it's required). It also suggests multiple times to "move up [ resources ] in template to start earlier", which is clearly a hallucination: CloudFormation orders resource creation based on dependencies, not on where a resource sits in the template. All in all, a pretty bad answer in my opinion.

Given this, I am going to try to break down this optimization experiment into pieces and focus on 2 specific optimizations (based on the timeline view, which makes it so obvious to spot the areas you want to attack):

  1. I want to try to pull the load balancer start time in (because the load balancer is an implicit dependency of the load balancer listener which is in turn a dependency of the UI service, the last resource to come online)
  2. I want to try to reduce the consistency check for the ECS services (which takes very long and accounts for the vast majority of the time they need to come online)
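The reasoning behind the first item is a critical path argument: the stack cannot finish before the longest chain of dependent resources has finished. Here is a minimal Python sketch of that calculation. The durations are illustrative only: the `ServiceYelbUi` figure is loosely based on the event excerpt, the others are invented for the example, and `YelbLoadBalancer`, `YelbLoadBalancerListener` and `TaskDefinitionYelbUi` are placeholder logical IDs.

```python
# Illustrative durations in seconds (mostly invented, see the text above).
DURATION = {
    "YelbLBSecurityGroup": 5,
    "YelbLoadBalancer": 180,        # placeholder logical ID
    "YelbLoadBalancerListener": 2,  # placeholder logical ID
    "TaskDefinitionYelbUi": 3,      # placeholder logical ID
    "ServiceYelbUi": 90,
}
DEPENDS_ON = {
    "YelbLoadBalancer": ["YelbLBSecurityGroup"],
    "YelbLoadBalancerListener": ["YelbLoadBalancer"],
    "ServiceYelbUi": ["YelbLoadBalancerListener", "TaskDefinitionYelbUi"],
}


def finish_times(duration, depends_on):
    """Earliest finish time per resource if everything starts as soon as it can."""
    finish = {}

    def visit(name):
        if name not in finish:
            # A resource starts once its slowest dependency has finished.
            start = max((visit(dep) for dep in depends_on.get(name, [])), default=0)
            finish[name] = start + duration[name]
        return finish[name]

    for name in duration:
        visit(name)
    return finish


def critical_path(duration, depends_on, target):
    """Walk back from target, always following the dependency that finishes last."""
    finish = finish_times(duration, depends_on)
    path = [target]
    while depends_on.get(path[-1]):
        path.append(max(depends_on[path[-1]], key=lambda dep: finish[dep]))
    path.reverse()
    return path


print(critical_path(DURATION, DEPENDS_ON, "ServiceYelbUi"))
```

With these made-up numbers the load balancer sits on the critical path, so starting it earlier would shorten the whole deployment, while speeding up the task definition would change nothing.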

With #1, I went down a rat-hole. I was misled by the fact that I was able to fix the start time of the two Security Groups (YelbDbSecurityGroup and YelbRedisServerSecurityGroup) by not having them reference each other, but rather creating them independently and then linking them with AWS::EC2::SecurityGroupIngress. This is what the model suggested above when I asked why these two Security Groups did not start deploying at the same time as the others. I am not showing the entire conversation, but it worked (even though it didn't buy me anything, because those resources were not slowing down the creation of downstream resources). I then tried to explore (and push?) the model to do the same with the Load Balancer, but this is clearly not technically possible, so it ended up in hallucinations (it simply wasn't able to tell me that this configuration cannot be achieved). See the workflow that started with "How can I make the YelbLBSecurityGroup resource and the Load Balancer resource start at the same time?" and that I pushed (too far) with a leading "Is it possible to remove the reference to the YelbLBSecurityGroup from the Load Balancer and add later the Ingress rule?":

I am concluding (the hard way) that optimizing the start time of the load balancer is not possible.
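Going back to the Security Group change that did work, the Python sketch below shows why it changes the start time. The two dictionaries are simplified, hypothetical template fragments written as Python data. They are not copied from the Yelb template: `ClientSecurityGroup` and `YelbDbIngress` are names I made up, and the port is an assumption. The helper `implicit_dependencies` (also made up) only looks at `Ref` and the list form of `Fn::GetAtt`.

```python
def implicit_dependencies(resources):
    """Map each resource to the other resources it references via Ref or Fn::GetAtt."""

    def refs(node):
        found = set()
        if isinstance(node, dict):
            if "Ref" in node:
                found.add(node["Ref"])
            if "Fn::GetAtt" in node:
                found.add(node["Fn::GetAtt"][0])  # list form: [LogicalId, Attribute]
            for value in node.values():
                found |= refs(value)
        elif isinstance(node, list):
            for item in node:
                found |= refs(item)
        return found

    names = set(resources)
    # Keep only references to resources in the template, and drop self-references.
    return {name: (refs(body) & names) - {name} for name, body in resources.items()}


RULE = {"IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432}  # assumed port

# Before: the ingress rule is inline, so the group must wait for the group it references.
BEFORE = {
    "ClientSecurityGroup": {"Type": "AWS::EC2::SecurityGroup",
                            "Properties": {"GroupDescription": "client"}},
    "YelbDbSecurityGroup": {"Type": "AWS::EC2::SecurityGroup", "Properties": {
        "GroupDescription": "db",
        "SecurityGroupIngress": [
            {**RULE, "SourceSecurityGroupId": {"Ref": "ClientSecurityGroup"}}],
    }},
}

# After: both groups are independent and a separate resource links them.
AFTER = {
    "ClientSecurityGroup": BEFORE["ClientSecurityGroup"],
    "YelbDbSecurityGroup": {"Type": "AWS::EC2::SecurityGroup",
                            "Properties": {"GroupDescription": "db"}},
    "YelbDbIngress": {"Type": "AWS::EC2::SecurityGroupIngress", "Properties": {
        **RULE,
        "GroupId": {"Fn::GetAtt": ["YelbDbSecurityGroup", "GroupId"]},
        "SourceSecurityGroupId": {"Fn::GetAtt": ["ClientSecurityGroup", "GroupId"]},
    }},
}

print(implicit_dependencies(BEFORE)["YelbDbSecurityGroup"])  # one dependency
print(implicit_dependencies(AFTER)["YelbDbSecurityGroup"])   # no dependencies
```

In the second layout the dependency moves to the small ingress resource, so both Security Groups can start right away. As noted above, this only helps the overall deployment time if those groups were holding up something downstream.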

Let's move to item #2 and let's try to reduce the time of the consistency checks of the ECS services. Below is the LLM conversation triggered with the following prompt:

All the ECS services have a very long consistency check time as can be depicted from the timeline view. Since Yelb is mostly deployed for test and there is no need for production ready configurations, are there ways to reduce that consistency check time?:

This seems like a promising answer (including what you see and what you don't see from the partial screenshot) but the reality is that it made up a lot of the suggestions. It made up values ("2 validation errors detected: Value '1' at 'healthyThresholdCount' failed to satisfy constraint: Member must have value greater than or equal to 2; Value '3' at 'healthCheckIntervalSeconds' failed to satisfy constraint: Member must have value greater than or equal to 5") and it also made up resource parameters ("Model validation failed (#: extraneous key [StabilityTimeout] is not permitted)"). In the very limited amount of time I dedicated to this experiment, I did not find a meaningful way (suggested by claude.ai) to optimize this template for deployment times.
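Errors like these only surface at deployment time, which makes each failed attempt expensive. A cheap guard is to check model-suggested values against known limits before deploying. The Python sketch below is a minimal illustration that encodes only the two minimums quoted in the error messages above; the real service enforces more constraints than these. The field names are the ones that appear in those errors, and `violations` is a made-up helper name.

```python
# Minimums taken from the validation errors quoted above.
MINIMUMS = {
    "healthyThresholdCount": 2,
    "healthCheckIntervalSeconds": 5,
}


def violations(suggested):
    """List every suggested value that falls below a known minimum."""
    return [
        f"{name}={value} is below the minimum of {MINIMUMS[name]}"
        for name, value in suggested.items()
        if name in MINIMUMS and value < MINIMUMS[name]
    ]


# The two values the model proposed, as reported in the errors above.
SUGGESTED = {"healthyThresholdCount": 1, "healthCheckIntervalSeconds": 3}

for problem in violations(SUGGESTED):
    print(problem)
```

Fields that are not in `MINIMUMS` are ignored, so a check like this cannot catch made-up parameters such as StabilityTimeout; those still need schema validation.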

Conclusions

This is the end of my completely unstructured ramblings on how generative AI could help with CloudFormation operations (inspired by the new timeline view). My takeaway from these quick experiments is that there is value to be extracted today when using these tools to explore and explain CloudFormation operations. Yes, the outcome can be optimized but, as you think about what you have seen in this blog, please remain focused on the moon (i.e. what can potentially be done) and not on the finger that points to it (the 5 experimental prompts that I ran from the sofa on a lazy weekend). In terms of optimizing the operations, the results I got are less remarkable. This somewhat maps to my belief that generative AI is still better at code-to-english than it is at english-to-code. Nevertheless, this is an area where the technology can improve (and will improve) drastically, in my opinion, in the months and years to come.

Massimo.