Automating AWS Cost Reduction: A Startup's Boto3 Scripts for Idle Resource Cleanup



Key Takeaways

  • Automating cloud infrastructure cleanup can lead to significant annual cost savings, with some companies seeing reductions of up to 40%.
  • The primary targets for cost-cutting automation are idle EC2 instances, unattached EBS volumes, and databases running after hours.
  • You can build a powerful, automated cleanup system with Python's Boto3 library, using AWS Lambda and Amazon EventBridge to schedule the scripts.

Here’s a shocking number for you: up to 40%. That’s the annual cost savings some companies see just by automating their cloud infrastructure cleanup. I saw that stat and it hit me hard.

I thought back to my last startup, where our AWS bill started to look like a phone number. We were shipping code, breaking things, and moving fast. Every developer would spin up a "temporary" EC2 instance for a quick test or a "short-lived" EBS volume for a data experiment.

The problem? "Temporary" has a funny way of becoming permanent in the chaos of a startup.

That mounting bill wasn't from our production traffic; it was from a graveyard of forgotten resources, silently draining our runway month after month. We were manually cleaning things up, but it was like playing whack-a-mole. For every instance we terminated, two more would pop up.

That’s when I realized we weren’t fighting a resource problem; we were fighting a process problem. We needed to stop playing catch-up and build a system—an automated, relentless, cost-cutting machine. We built it with Python and Boto3.

The Problem: How 'Temporary' Resources Become Permanent Costs

We’ve all been there. You need to test a new feature branch, so you spin up a t3.medium instance. You tell yourself you'll terminate it when you're done.

Then a critical bug report comes in, you switch branches, and by the end of the day, you've completely forgotten about that running instance. Now, multiply that by a team of ten engineers over three months. The costs are insidious.

Why manual cleanup is a losing battle.

In a startup, speed is everything. The pressure is to build and ship, not to meticulously audit your cloud spending.

Manual cleanup relies on human memory and discipline, two things that are in short supply when you're trying to find product-market fit. It’s inefficient, prone to error, and frankly, a soul-crushing task nobody wants to do. You need automation to enforce the discipline you don't have time for.

Identifying the top 3 cost culprits.

When we audited our bill, the villains were obvious:

  1. Idle EC2 Instances: These were the biggest offenders. Test servers, forgotten dev boxes, and old PoCs running 24/7 with less than 5% CPU utilization.
  2. Unattached EBS Volumes: When you terminate an EC2 instance, the attached storage volume often doesn't get deleted with it. We had a digital boneyard of these "orphaned" volumes.
  3. After-hours RDS Databases: Our dev and staging databases were running all night and on weekends, even when nobody was working.

The Toolkit: Setting Up Your Boto3 Environment

Before you can start slashing costs, you need the right tools. Boto3 is the AWS SDK for Python, and it’s your key to programmatically controlling your entire AWS universe.

Prerequisites: Python, Boto3, and AWS CLI.

This is the easy part. If you’re reading this, you probably have Python and pip installed. Getting Boto3 is as simple as: pip install boto3.

You'll also want the AWS CLI configured, as Boto3 can use its credentials. Run aws configure and plug in your keys.
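To confirm that Boto3 can actually see those credentials, a quick sanity check like the following (assuming the default profile you just configured) will print the account and identity your scripts will act as:

```python
import boto3

# Boto3 picks up credentials from the AWS CLI config (~/.aws/credentials),
# environment variables, or an attached IAM role.
sts = boto3.client("sts")
identity = sts.get_caller_identity()

print(f"Account:    {identity['Account']}")
print(f"Caller ARN: {identity['Arn']}")
```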

Creating an IAM User/Role with the right permissions.

Do not use your root account or an admin user for this. I can't stress this enough. Create a dedicated IAM Role or User for your scripts.

Give it only the permissions it absolutely needs (the Principle of Least Privilege). For our cleanup scripts, this would be permissions like ec2:DescribeInstances, ec2:StopInstances, ec2:TerminateInstances, ec2:DeleteVolume, and cloudwatch:GetMetricStatistics.

Start with read-only permissions (Describe*) to test, then add the destructive permissions (Terminate*, Delete*) once you're confident.
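For illustration, here is roughly what that least-privilege policy could look like, created with Boto3's IAM client. The policy name is hypothetical, and the destructive actions are commented out until your read-only testing checks out:

```python
import json

import boto3

# Illustrative least-privilege policy for the cleanup scripts.
cleanup_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances",
                "ec2:DescribeVolumes",
                "cloudwatch:GetMetricStatistics",
                # Uncomment once the dry runs look right:
                # "ec2:StopInstances",
                # "ec2:TerminateInstances",
                # "ec2:DeleteVolume",
            ],
            "Resource": "*",
        }
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="cost-cleanup-scripts",          # hypothetical name
    PolicyDocument=json.dumps(cleanup_policy),
)
```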

Script 1: The Zombie EC2 Hunter

This script is your front-line soldier. Its mission: find EC2 instances that are just wasting money and take them out.

Defining 'idle' using CloudWatch metrics.

First, you need a clear definition of "idle." We settled on this: an average CPU utilization of less than 5% over a 14-day period. This threshold is aggressive enough to catch true waste but lenient enough to avoid killing a machine that has occasional, important jobs.
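Pulling that number out of CloudWatch is a single GetMetricStatistics call. Here is a minimal sketch, assuming one datapoint per day over the 14-day window (the instance ID is a placeholder):

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=now - timedelta(days=14),
    EndTime=now,
    Period=86400,                  # one datapoint per day
    Statistics=["Average"],
)

points = stats["Datapoints"]
avg_cpu = sum(p["Average"] for p in points) / len(points) if points else 0.0
print(f"14-day average CPU: {avg_cpu:.2f}%")
```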

The Python Boto3 script to find and terminate low-utilization instances.

The logic is straightforward:

  1. Use Boto3 to list all running EC2 instances.
  2. For each instance, query CloudWatch for its CPUUtilization metric over the last 14 days.
  3. If the average is below your threshold and the instance isn't protected by a specific tag, add it to a "to-be-terminated" list.
  4. Iterate through the termination list and shut them down.
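Here is a minimal sketch of that flow, assuming the 5% / 14-day definition above. It repeats the CloudWatch helper from the previous snippet so it runs on its own, and the terminate call at the end is the destructive part, so comment it out (or review the printed list) before letting it fire:

```python
from datetime import datetime, timedelta, timezone

import boto3

CPU_THRESHOLD = 5.0     # percent
LOOKBACK_DAYS = 14

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")


def average_cpu(instance_id):
    """Average CPUUtilization (%) over the lookback window."""
    now = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=now - timedelta(days=LOOKBACK_DAYS),
        EndTime=now,
        Period=86400,
        Statistics=["Average"],
    )
    points = stats["Datapoints"]
    return sum(p["Average"] for p in points) / len(points) if points else 0.0


to_terminate = []

# 1. List all running instances (paginated).
paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            instance_id = instance["InstanceId"]
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}

            # Skip instances protected by the exemption tag.
            if tags.get("cost_cleanup_exempt") == "true":
                continue

            # 2-3. Compare the 14-day average against the threshold.
            if average_cpu(instance_id) < CPU_THRESHOLD:
                to_terminate.append(instance_id)

# 4. Terminate the zombies (comment this out for a dry run).
if to_terminate:
    print(f"Terminating: {to_terminate}")
    ec2.terminate_instances(InstanceIds=to_terminate)
```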

Implementing a 'dry run' feature and essential tagging.

Before you unleash a script that can terminate instances, build in a safety switch. A "dry run" mode is crucial. When enabled, the script should print out which instances it would have terminated without actually doing it.

Also, use tags! We implemented a cost_cleanup_exempt tag. Our script was written to ignore any instance with this tag set to true. This is our get-out-of-jail-free card for critical, low-traffic machines.
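Applying that exemption tag is itself a one-liner with Boto3. Something like this (the instance ID is a placeholder) keeps a critical box out of the script's crosshairs:

```python
import boto3

ec2 = boto3.client("ec2")

# Mark a critical but low-traffic instance so the cleanup script skips it.
ec2.create_tags(
    Resources=["i-0123456789abcdef0"],              # placeholder instance ID
    Tags=[{"Key": "cost_cleanup_exempt", "Value": "true"}],
)
```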

Script 2: Cleaning Up Orphaned EBS Volumes

These are sneakier than idle EC2s. They don't "run," so they don't show up in CPU metrics, but you pay for every single gigabyte, every single month.

The hidden cost of unattached storage.

An orphaned EBS volume is one that isn't attached to any EC2 instance. This usually happens when an engineer terminates an instance but forgets to check the "delete on termination" box for its volume. It's pure, dead weight.

The Boto3 script to find and delete 'available' EBS volumes.

This script is even simpler than the EC2 hunter:

  1. Use the Boto3 EC2 client to call describe_volumes.
  2. Filter the results for volumes with a State of 'available'.
  3. For each "available" volume, call delete_volume.

Again, start with a dry run that just prints the Volume IDs it would delete. You might be shocked at what you find.
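A minimal sketch of that, with the dry-run flag built in (flip it to False only after you've reviewed the output):

```python
import boto3

DRY_RUN = True          # set to False once you've reviewed the output

ec2 = boto3.client("ec2")

# 'available' volumes are the orphans: storage not attached to any instance.
paginator = ec2.get_paginator("describe_volumes")
for page in paginator.paginate(
    Filters=[{"Name": "status", "Values": ["available"]}]
):
    for volume in page["Volumes"]:
        volume_id = volume["VolumeId"]
        size_gib = volume["Size"]
        if DRY_RUN:
            print(f"Would delete {volume_id} ({size_gib} GiB)")
        else:
            print(f"Deleting {volume_id} ({size_gib} GiB)")
            ec2.delete_volume(VolumeId=volume_id)
```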

Putting it on Autopilot: Scheduling with AWS Lambda and EventBridge

Writing the scripts is half the battle. Making them run consistently without human intervention is how you win the war.

Packaging your Python scripts with dependencies.

AWS Lambda is perfect for this. It's a serverless compute service that can run your Python code in response to an event. You'll need to package your script and any dependencies into a .zip file.

Creating a Lambda function to execute the cleanup.

Upload your .zip file to a new Lambda function. Configure the handler (the function Lambda should run) and, most importantly, assign it the IAM Role you created earlier with the necessary permissions. You can test it directly in the Lambda console.
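The entry point itself can stay thin. A handler might look like the sketch below; the module and routine names are illustrative, not anything Lambda requires beyond the handler signature:

```python
# lambda_function.py — with the handler configured as "lambda_function.lambda_handler".
def lambda_handler(event, context):
    # Let a test event request a dry run, e.g. {"dry_run": true}.
    dry_run = event.get("dry_run", False)

    # Call into the cleanup routines from your packaged module here, e.g.:
    # terminate_idle_instances(dry_run=dry_run)
    # delete_orphaned_volumes(dry_run=dry_run)

    return {"status": "ok", "dry_run": dry_run}
```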

Setting up an EventBridge (CloudWatch Events) cron job.

This is the final step. Go to Amazon EventBridge and create a new rule. Set the schedule to run on a cron expression—we chose to run our scripts every night at 2 AM. For the rule's target, select your Lambda function. That's it.
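If you'd rather script the rule than click through the console, roughly the same thing looks like this in Boto3. The rule name and function ARN are placeholders, and the add_permission call is what lets EventBridge invoke the function:

```python
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:cost-cleanup"  # placeholder

# Run every night at 02:00 (EventBridge cron expressions are in UTC).
rule = events.put_rule(
    Name="nightly-cost-cleanup",
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
)

# Allow EventBridge to invoke the function...
lambda_client.add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId="allow-nightly-cost-cleanup",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

# ...and point the rule at it.
events.put_targets(
    Rule="nightly-cost-cleanup",
    Targets=[{"Id": "cleanup-lambda", "Arn": FUNCTION_ARN}],
)
```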

The Results: Real Savings and Next Steps

Within the first month, our AWS bill dropped by over 30%. It wasn't just the one-time cleanup; it was the cultural shift. Knowing that a script would automatically terminate a forgotten instance forced everyone to be more disciplined.

How we measured the impact on our bill.

We used AWS Cost Explorer to track our spending. Before running the scripts, we took a baseline of our daily EC2 and EBS costs. After activating the Lambda functions, we could see a clear, immediate, and permanent drop in that baseline. The data was undeniable.
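Cost Explorer has a Boto3 client too, so you can pull the same daily numbers programmatically. A rough sketch, where the date range is a placeholder and the exact SERVICE values are worth checking against your own bill:

```python
import boto3

ce = boto3.client("ce")

# Daily EC2 compute spend for a sample window.
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-05-01", "End": "2024-06-01"},   # placeholder dates
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    Filter={
        "Dimensions": {
            "Key": "SERVICE",
            "Values": ["Amazon Elastic Compute Cloud - Compute"],
        }
    },
)

for day in response["ResultsByTime"]:
    cost = day["Total"]["UnblendedCost"]
    print(day["TimePeriod"]["Start"], cost["Amount"], cost["Unit"])
```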

Ideas for extending the scripts.

This is just the beginning. The same principles can be applied to other AWS services:

  • Old EBS Snapshots: Find and delete EBS snapshots older than 90 days.
  • Unused Elastic IPs: Find EIPs that aren't associated with an instance and release them.
  • Idle RDS Instances: Use CloudWatch metrics to find databases with no connections and flag them for review.
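As a taste of the first idea, here is a hedged sketch for flagging EBS snapshots older than 90 days. It only prints candidates; you'd swap in the delete call once you trust the list:

```python
from datetime import datetime, timedelta, timezone

import boto3

RETENTION_DAYS = 90

ec2 = boto3.client("ec2")
cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)

paginator = ec2.get_paginator("describe_snapshots")
for page in paginator.paginate(OwnerIds=["self"]):      # only snapshots you own
    for snapshot in page["Snapshots"]:
        if snapshot["StartTime"] < cutoff:
            # Print first; swap in ec2.delete_snapshot(SnapshotId=...) once reviewed.
            print(f"Candidate: {snapshot['SnapshotId']} "
                  f"(created {snapshot['StartTime']:%Y-%m-%d})")
```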

Automating your cloud cost management isn't a one-off project; it's a continuous process. By investing a little time in a few Boto3 scripts, you can build a system that pays for itself over and over again, letting you focus on what actually matters: building your product.


