Automating AWS Cost Reduction: A Startup's Boto3-Powered Resource Cleanup Case Study

Key Takeaways
- Unchecked Cloud Waste: "Temporary" development resources like idle EC2 instances and orphaned EBS volumes often become a permanent, silent drain on your budget, with costs sometimes growing faster than revenue.
- Automate with Boto3 & Lambda: Manual cleanups are unsustainable. Using Python's Boto3 library and AWS Lambda, you can create a serverless "janitor" to automatically find, tag, and safely delete unused resources on a schedule.
- Beyond Savings: This approach not only slashed our dev environment costs by over 30% but also improved our security posture and fostered a culture of cost-awareness and infrastructure hygiene across the team.
I once spoke to a founder who said their startup’s AWS bill grew faster than their revenue. In one month, their bill for "temporary" development resources was enough to hire a junior engineer for a full year. That's not just a budget overrun; that's a silent killer for a bootstrapped company.
It's a story I hear all too often. It’s the reason I became obsessed with automating our way out of cloud cost chaos.
The Problem: How 'Temporary' Resources Became a Permanent Drain
When you're building fast, you're spinning up resources left and right. An EC2 instance for a quick test, an EBS volume for a database experiment, a NAT gateway for a new VPC configuration. The mantra is "move fast and break things," not "move fast and meticulously clean up after yourself."
The Shock of the Month-End Bill
It started small, with a few extra dollars here and there. But then came the bill—the one that makes your stomach drop. Our AWS costs had spiked 40% month-over-month with no corresponding increase in customer usage.
It was all coming from our dev and staging environments. The "temporary" infrastructure had become a permanent, and expensive, part of our architecture.
Identifying the Culprits: Orphaned EBS Volumes, Idle EC2s, and Unattached EIPs
We dug in using AWS Cost Explorer and found the culprits. They were the ghosts of projects past:
- Orphaned EBS Volumes: Dozens of them, unattached to any EC2 instance, silently racking up storage costs.
- Idle EC2 Instances: t2.micro instances spun up for a hotfix weeks ago and then forgotten, still running.
- Unattached Elastic IPs: Free when attached, but AWS charges you for them when they're just sitting in your account.
Why Manual Cleanup Wasn't a Sustainable Solution
Our first reaction was a manual cleanup spree. We spent an afternoon clicking through the AWS console, cross-referencing IDs, and cautiously hitting "terminate." We saved a few hundred dollars, but we knew it wasn't a real solution.
It was tedious, prone to human error, and didn't solve the underlying problem. Our team was still moving too fast to remember to clean up, so we needed a janitor who never slept.
Our Weapon of Choice: Automating Cleanup with Python and Boto3
If a human can click through the console, a script can do it better, faster, and more reliably. We turned to Boto3, the AWS SDK for Python.
Why Boto3? The Power of Scripting Your Infrastructure
Boto3 is one of the most powerful tools in a cloud engineer's arsenal. It turns the entire AWS API into a set of Python functions. Instead of manually searching, I can write a few lines of code to get a definitive list in seconds.
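For a feel of what that looks like, here is a minimal sketch that lists unattached Elastic IPs, one of our biggest culprits. The region is a placeholder, and this is an illustration rather than our exact script:

```python
import boto3

# A minimal sketch: list every Elastic IP that isn't associated with anything.
# The region is a placeholder; point it at wherever your dev resources live.
ec2 = boto3.client("ec2", region_name="us-east-1")

addresses = ec2.describe_addresses()["Addresses"]
# Unassociated addresses have no AssociationId, so they're costing money for nothing.
unattached = [a for a in addresses if "AssociationId" not in a]

for addr in unattached:
    print(addr["AllocationId"], addr["PublicIp"])
```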
Step 1: Setting Up a Secure IAM Role for Our Cleanup Script
First things first: security. We created a specific IAM role with the minimum required permissions like ec2:DescribeVolumes and ec2:DeleteVolume. This principle of least privilege is non-negotiable.
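Here is a sketch of what that scoped policy might look like when created via Boto3. The policy name is illustrative, and the action list should match whatever your own script actually touches:

```python
import json
import boto3

# A sketch of a least-privilege policy for the cleanup role.
# The policy name is hypothetical, and the action list is illustrative:
# trim it to exactly what your script needs.
cleanup_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeVolumes",
                "ec2:DescribeInstances",
                "ec2:DescribeAddresses",
                "ec2:CreateTags",
                "ec2:DeleteVolume",
                "cloudwatch:GetMetricStatistics",
            ],
            "Resource": "*",
        }
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="dev-cleanup-janitor",  # hypothetical name
    PolicyDocument=json.dumps(cleanup_policy),
)
```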
Step 2: The Logic - How to Define 'Unused' Resources Safely
This was the most critical part. We started with a very conservative definition of "waste" to avoid accidentally deleting something important.
- An EBS Volume is 'unused' if: It is in the available state (not attached) AND does not have a specific tag, like "backup": "true".
- An EC2 Instance is 'idle' if: Its CPU utilization has been below 2% for the last 7 days AND it's in a non-production environment. (A sketch of that CloudWatch check follows this list.)
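Here is a rough sketch of how that idle check can be implemented with CloudWatch. The function name and hourly granularity are illustrative choices; the 2% threshold and 7-day window are the ones described above:

```python
from datetime import datetime, timedelta, timezone
import boto3

# Region is a placeholder; use the one your instances run in.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def is_idle(instance_id, threshold=2.0, days=7):
    """Return True if average CPU stayed below `threshold`% for the last `days` days."""
    end = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=3600,  # one datapoint per hour
        Statistics=["Average"],
    )
    datapoints = stats["Datapoints"]
    if not datapoints:
        return False  # no data usually means the instance was stopped, not idle
    return max(dp["Average"] for dp in datapoints) < threshold
```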
Step 3: The Code - A Walkthrough of Our Boto3 Script for Finding and Tagging Waste
The core of our script was surprisingly simple. For EBS volumes, it followed a basic loop: get all volumes, check if a volume's state is available, and if so, check for a "do-not-delete" tag.
Initially, the script just tagged resources for deletion with a "cleanup-candidate" tag. This gave us a chance to review its choices before giving it the power to actually delete anything.
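A condensed sketch of that tag-first pass might look like the following. The protected tag keys are examples of our conventions, not a requirement, and the cleanup tag matches the grace-period scheme described later:

```python
from datetime import date
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is a placeholder

# Tag keys we treat as "hands off" (illustrative; adapt to your own conventions).
PROTECTED_TAGS = {"do-not-delete", "backup"}

def tag_cleanup_candidates():
    """Find available (unattached) volumes and mark them as cleanup candidates."""
    paginator = ec2.get_paginator("describe_volumes")
    for page in paginator.paginate(Filters=[{"Name": "status", "Values": ["available"]}]):
        for vol in page["Volumes"]:
            tags = {t["Key"]: t["Value"] for t in vol.get("Tags", [])}
            if PROTECTED_TAGS & tags.keys():
                continue  # someone explicitly asked us to keep this one
            # Tag instead of delete, so a human can review the candidates first.
            ec2.create_tags(
                Resources=[vol["VolumeId"]],
                Tags=[{"Key": "cleanup-candidate-date", "Value": date.today().isoformat()}],
            )
            print(f"Tagged {vol['VolumeId']} as a cleanup candidate")
```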
Putting it into Production: From Cron Job to Serverless Lambda
A script on a laptop is an experiment. A Lambda function running on a schedule is a production tool.
Scheduling the Automation for Daily Sweeps
We packaged our Python script into a Lambda function and used Amazon EventBridge to trigger it every night at 2 AM. This serverless approach meant we weren't paying for an idle server to run a cron job. It was the epitome of cost-effective automation.
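Wiring up the schedule is a handful of API calls. Here is a sketch using Boto3; the rule name, function name, and ARN are all placeholders:

```python
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Nightly trigger at 02:00 UTC. Rule name is hypothetical.
rule = events.put_rule(
    Name="nightly-aws-janitor",
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
)

# Allow EventBridge to invoke the function, then attach it as the rule's target.
# Function name and ARN below are placeholders for your own deployment.
lambda_client.add_permission(
    FunctionName="aws-janitor",
    StatementId="allow-eventbridge-nightly",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
events.put_targets(
    Rule="nightly-aws-janitor",
    Targets=[{"Id": "janitor-lambda",
              "Arn": "arn:aws:lambda:us-east-1:123456789012:function:aws-janitor"}],
)
```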
Implementing a 'Grace Period' with Tags to Prevent Accidents
We built in a safety net. The first time the script finds an unused resource, it doesn't delete it. Instead, it applies a tag: cleanup-candidate-date: YYYY-MM-DD. The script only deletes a resource if that tag is more than 7 days old, giving the team a grace period to intervene.
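In code, the grace-period check is just a date comparison on that tag. A sketch, assuming the same cleanup-candidate-date tag format as above:

```python
from datetime import date, timedelta
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is a placeholder
GRACE_PERIOD = timedelta(days=7)

def delete_expired_candidates():
    """Delete available volumes whose cleanup tag is more than 7 days old."""
    paginator = ec2.get_paginator("describe_volumes")
    filters = [
        {"Name": "status", "Values": ["available"]},
        {"Name": "tag-key", "Values": ["cleanup-candidate-date"]},
    ]
    for page in paginator.paginate(Filters=filters):
        for vol in page["Volumes"]:
            tags = {t["Key"]: t["Value"] for t in vol.get("Tags", [])}
            tagged_on = date.fromisoformat(tags["cleanup-candidate-date"])
            if date.today() - tagged_on > GRACE_PERIOD:
                ec2.delete_volume(VolumeId=vol["VolumeId"])
                print(f"Deleted {vol['VolumeId']} (tagged {tagged_on})")
```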
Adding Slack Notifications for Full Transparency
To close the loop, we integrated a webhook to post a daily summary to our team's Slack channel. This created visibility and built trust in the automation.
"AWS Janitor Report: Tagged 5 EBS volumes and 2 EIPs for cleanup. Deleted 3 resources that passed their grace period. Total estimated savings: $45/month."
The Results: Measurable Savings and Peace of Mind
The impact was immediate and dramatic.
By the Numbers: Charting Our 30% Reduction in Dev Environment Costs
Within the first month, our development and staging environment costs dropped by over 30%. This wasn't from complex architectural changes or expensive Savings Plans. This was purely from eliminating waste.
We improved our cost efficiency by tackling idle resources head-on. Even a healthy Effective Savings Rate (ESR) from commitments means little if you're paying for resources you aren't using at all.
Beyond Cost: Fostering a Culture of Infrastructure Hygiene
The daily Slack notifications had a fascinating side effect: they subtly fostered a culture of ownership and infrastructure hygiene. Nobody wanted their "temporary" test instance showing up on the cleanup list for a week straight.
The Unexpected Benefit: Improved Security Posture
Fewer running, unmonitored resources mean a smaller attack surface. By cleaning up old instances that weren't being patched, our automated janitor inadvertently became part of our security team.
Conclusion: How You Can Implement Your Own AWS Janitor
This journey from bill shock to automated control was transformative. We didn't just save money; we built a smarter, more efficient way to operate.
Key Lessons Learned on Our Journey
- Start with visibility, not deletion. Tag resources first to let your team get comfortable with the script's logic.
- Automate safely. A grace period and notifications are essential for building trust in the automation.
- Serverless is your friend. Lambda and EventBridge are the perfect, low-cost tools for this kind of scheduled task.
A Call to Action: Start with One Resource Type and Expand
Don't try to boil the ocean. Start with the easiest, most obvious source of waste in your account, which for most is unattached EBS volumes. Get that working, prove the value, and then expand to other resources.
Link to our GitHub Repo with the full script
To help you get started, I’ve cleaned up our scripts and posted them to a public GitHub repository. You can find the full code, along with deployment instructions, here: [Link to Your GitHub Repo Here]