Co-founder and CEO, Ofofo.io
I’m the co-founder and oversee product at a relatively established tech startup (a few thousand clients, multiple millions in revenue, VC-backed). Without going into too much detail, one of the core functions of our software is sending transactional emails to our clients when their customers update information or reach out with inquiries. We’ve processed millions of emails with a bounce rate under 1% and a complaint/spam rate of 0.001%. In the email-sending world, at this sort of scale, we are a very safe sender, and we’ve only had deliverability issues once, with a few customers (more on that to come!). On top of that, no customer has ever mentioned our emails going to spam. To send email, we use Amazon SES (Simple Email Service), since we already use AWS for other services. For context, we’d used SES for almost seven years, so it was not on my radar as a potential tech risk…
The morning of our breach was relatively normal. Our leadership team had a few meetings to make sure everyone was aligned on new strategies, and we were breaking out to work on our tasks for the day. When I checked my email, I noticed Amazon had opened a ticket letting us know our account was under review because our bounce rate had exceeded 10%. It was alarming for sure, but no changes had been made to our account, and we had 30 days to improve the issue before another analysis took place. After reviewing the email, I also noticed our main sending domain had received 50 or so auto-reply messages in a 15-minute span…anything from “out of office” to “your support ticket has been received”…At this point, I knew something was up and had our engineering and DevOps teams start digging into what could have caused it. Again, I thought we had 30 days to correct the issue, so I felt it had been passed on to the right team members and we could figure out a solution.

Unfortunately, we were too late to reverse what had already happened. Fifteen minutes after Amazon’s original email about the 10% bounce rate, I received another letting us know our account was temporarily suspended because our bounce rate had reached 26%…for that to happen, hundreds of thousands of emails (if not more than a million) had to have gone out, and most of them never reached an inbox. With such a critical component of our software down, we immediately alerted customers to enable SMS alerts due to a temporary email-sending issue. As expected, some customers had questions and some were annoyed, but there was relatively little backlash since they knew the issue was being investigated.
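The painful part of this timeline is that the provider’s own review threshold was the first alarm we saw. One mitigation (a minimal sketch, not our actual implementation — the class name, window size, and 5% threshold are all illustrative assumptions) is to track your own rolling bounce rate from send/bounce events and alert well before the provider’s ~10% trigger:

```python
from collections import deque
from datetime import datetime, timedelta

class BounceRateMonitor:
    """Tracks send/bounce events in a sliding time window and flags
    when the bounce rate crosses an internal threshold set well
    below the provider's review trigger."""

    def __init__(self, window_minutes=60, alert_threshold=0.05):
        self.window = timedelta(minutes=window_minutes)
        self.alert_threshold = alert_threshold
        self.events = deque()  # (timestamp, is_bounce) pairs, oldest first

    def record(self, timestamp, is_bounce):
        """Record one delivery attempt and drop events outside the window."""
        self.events.append((timestamp, is_bounce))
        while self.events and timestamp - self.events[0][0] > self.window:
            self.events.popleft()

    def bounce_rate(self):
        """Fraction of events in the current window that bounced."""
        if not self.events:
            return 0.0
        bounces = sum(1 for _, bounced in self.events if bounced)
        return bounces / len(self.events)

    def should_alert(self):
        return self.bounce_rate() >= self.alert_threshold
```

Fed from the sending provider’s bounce notifications, a monitor like this would have paged us at 5% instead of letting the account coast to 26% — and a sudden spike in the window is also a decent tripwire for a compromised key.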
With SES and other email APIs, you receive an access key in order to use the API. Our access key appears to have been compromised, which allowed someone outside our organization to send mass emails via the API…we’re now putting in safeguards to add even more security around those keys. When our account was paused, we rotated all access keys and changed all passwords as a precaution. Even with all of that done, we still had to wait for Amazon to go through their review process and reinstate our account, which can take WEEKS…

Here’s the good news and the takeaway…Remember at the beginning of this post, when I mentioned we’d once had a few users with deliverability issues? The root of those issues was actually Microsoft Outlook, but at the time, we thought it was our problem. Because of this, we had built in a backup transactional email platform so we could manually move affected users from Amazon SES to an alternative sending provider. Since we’d done this before, we were able to move a large chunk of our high-volume customers over to the alternative within the hour, avoiding any interruption for them. Over the next 10 hours, our engineering team moved every email-sending scenario away from AWS SES to the alternative provider. Every customer was receiving email again, and virtually no one was majorly affected. If we hadn’t built in an alternative sending provider, we would likely have spent a few days configuring everything for a new provider, or hoped and prayed that Amazon got back to us to reinstate our account…Amazon ended up taking 5 days, so that wouldn’t have been an option.
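The design that saved us boils down to putting an abstraction between your application and any one sending provider. A minimal sketch of that idea (the class and method names are hypothetical, not our actual code — a real version would log each failure and queue retries):

```python
class SendError(Exception):
    """Raised when a provider cannot accept a message
    (suspended account, revoked key, API outage, etc.)."""

class EmailProvider:
    """Minimal interface every sending backend implements."""
    def send(self, message: dict) -> str:
        raise NotImplementedError

class FailoverSender:
    """Tries providers in order; when the primary raises SendError,
    the next provider in the list handles the message."""

    def __init__(self, providers):
        self.providers = providers

    def send(self, message: dict) -> str:
        last_err = None
        for provider in self.providers:
            try:
                return provider.send(message)
            except SendError as err:
                last_err = err  # fall through to the next backend
        raise last_err or SendError("no providers configured")
```

Because the application only ever talks to `FailoverSender`, swapping SES out mid-incident becomes a configuration change rather than a days-long migration — which is exactly the difference between our 10-hour recovery and waiting 5 days on a reinstatement review.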
The key lesson here for any technical founder: look at your software and see where you might have vulnerabilities. There are always risks with tech products, but there are certainly ways to mitigate the impact when a system does have an issue. A few examples: server infrastructure that isn’t built to scale, using multiple servers for platform-based products, third-party software or APIs having their own issues (like in our case), API version updates that could affect components of your software, keeping SSL certificates up to date, etc.

I hope these lessons are helpful and you can learn something from our experience. At the very least, hopefully you got some level of entertainment out of our very stressful situation :).