One fine morning a huge chunk of our users in India stopped receiving important SMS alerts. The worst part was that we were not tracking this metric on our monitoring dashboard and, to be honest, we found out about it from the rising number of customer support tickets. This is indeed very embarrassing, but I believe it's a good lesson for us as a whole.
A huge chunk of Geedesk users receive important alerts and notifications through SMS and act on them. When the SMS stopped getting delivered to their mobile phones, things came to a standstill. Their guests could not be serviced and managers could not monitor what was going on in their departments or properties, not to mention the assumption made by end users, which is: No SMS = No Complaints.
Our first point of check was the Google Cloud Platform monitoring dashboard. We looked for errors but could not find any relating to the current issue. We then checked the code deployment logs to see if there had been any recent deployment, but nothing of that sort had happened.
The next point of check was our internal monitoring logs, to see if our software was sending out SMS as expected. Once we were sure that Geedesk was indeed sending out SMS, we turned to our database to see if there was any abnormality in the related table structure or data. Everything was fine; there was absolutely nothing wrong.
At this point we were thoroughly confused: everything in our system seemed fine, but the issue still persisted. Our earlier tests had shown that even our own phone numbers (which happened to be Vodafone) faced the issue. Our last point of reference now was the Twilio SMS logs (we used Twilio for outbound SMS at the time of the issue).
The Twilio logs gave us the first damn clue. The SMS were not getting delivered to the users from Twilio. I honestly could not believe it took us so long to check the Twilio logs. Why didn't we check them first? Why did we go deep into our application and not check our service provider's logs? Maybe we had more trust in Twilio's infrastructure than in our own 🙂.
Not wasting much time, we logged a support ticket with Twilio and went back to google.com and Stack Overflow to see if others were facing similar issues. There was one post on Stack Overflow which had in fact been downvoted for being off topic.
Despite being off topic, the post gave us a vital clue. The Stack Overflow user had mentioned that the issue was being faced by Vodafone cell phone numbers in India. Based on this input we ran another round of tests with our user base, and this time we could ascertain that only users with Vodafone numbers had not been receiving SMS from Geedesk, whereas users with Airtel and other carriers did not have any issues. This was the first positive progress since the issue cropped up at 10:00 am (IST). We shared this observation with Twilio, hoping it might help them resolve our issue faster.
With enough troubleshooting done and the Twilio support case updated with the latest information, we assumed that our part was done and that Twilio would soon fix the issue, as this was definitely a business-critical scenario. When we had not heard from Twilio even after a few hours, I decided to get in touch with our account manager, who made us realise that we were on free support and could not expect Twilio to intervene immediately. There were limits to what we could expect from them.
When I read this email it felt like the world was crashing down on me. Geedesk is the primary source of income for me and for our small and dedicated team. We had a good number of customers using Geedesk and an equally good number in trial. This could be devastating for us on both fronts: customers could leave Geedesk, and trial users might never become paying customers.
By this time we had asked our customers to use the web interface and not depend on SMS alerts until further updates from our side. This was not an ideal suggestion, but we did not have any other option.
As I was sitting and staring at the monitor, a Trello board came to my mind. It was about finding an Indian service provider for our Indian customers while retaining Twilio for our customers in the US, UK etc.
It seemed to me that this was probably the only option available to us. We had already been using an Indian service provider for inbound SMS for our Indian customers, so we decided to get in touch with them to see if they could help us solve this problem. It did not take long to find out that moving our outbound SMS to them could take more than 24 hours, not for technical reasons but for procedural ones. But after listening to our sad story they decided to make an exception and help us.
We soon got down to work and, for just this one time, started making changes directly on the production branch of the codebase, as we could not afford to waste more time. The integration was very simple. But unlike with Twilio, we had to share the SMS templates with this service provider and wait for them to approve them. In the meantime we hoped that the integration would go through without any errors.
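For readers who want to picture the change, here is a minimal sketch of routing outbound SMS by country prefix. It is not our actual code: the `SmsRouter` class and the two sender functions are hypothetical stand-ins that only record the call, and it assumes phone numbers are stored in E.164 form (for example `+91…` for India).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical sender signature: (to_number, body) -> provider message id.
Sender = Callable[[str, str], str]


@dataclass
class SmsRouter:
    """Pick an SMS provider based on the destination number's country prefix."""
    default_sender: Sender
    prefix_senders: Dict[str, Sender]

    def pick(self, to_number: str) -> Sender:
        # Check longer prefixes first so the most specific match wins.
        for prefix in sorted(self.prefix_senders, key=len, reverse=True):
            if to_number.startswith(prefix):
                return self.prefix_senders[prefix]
        return self.default_sender

    def send(self, to_number: str, body: str) -> str:
        return self.pick(to_number)(to_number, body)


# Stand-in senders: real ones would call each provider's HTTP API.
sent: List[Tuple[str, str]] = []


def send_via_twilio(to_number: str, body: str) -> str:
    sent.append(("twilio", to_number))
    return "twilio-message-id"


def send_via_indian_provider(to_number: str, body: str) -> str:
    sent.append(("indian_provider", to_number))
    return "indian-provider-message-id"


router = SmsRouter(
    default_sender=send_via_twilio,
    prefix_senders={"+91": send_via_indian_provider},
)
router.send("+919800000000", "New guest request")  # Indian number
router.send("+14155550100", "New guest request")   # US number
```

Keeping Twilio as the default means only the Indian prefix changes behaviour, which keeps the blast radius of a hurried change small.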
Though we were very careful while editing the production codebase, we still wanted to be on the safe side and hence deployed the changes as a new version in Google App Engine. This way we could always roll back if things did not work as expected. In the best case, our users would start receiving SMS again; in the worst case, we could migrate the traffic back to the previous version so that only the SMS module remained affected.
Once we had deployed the changes under the new version, we migrated all traffic to it. We divided ourselves into two teams: one would test the new version and the other would monitor the SMS logs in addition to the other logs.
Slowly the SMS logs in the new system started getting populated, and the status was “Delivered”. With another round of testing we could confirm that things had got back to normal. We randomly picked a few users with Indian Vodafone numbers and called them to check if they were receiving SMS, and all of them replied in the affirmative. We rejoiced as the issue was now resolved, but we decided to wait for a few hours before sending out an update to our users.
Once we were satisfied that all our users in India were receiving SMS alerts again, irrespective of their network carrier, we sent out an official announcement to all our users through Intercom.
We could not be happier that we managed to fix this issue. On the flip side, it also taught us a few things which have only strengthened our resolve and made us wiser. Not to mention, getting a custom dedicated short code (GEEDSK) was a jackpot for us.
We have realised that our systems need more metrics and logs, which we might need in the future. We are also putting together a new monitoring dashboard which will help us keep an eye on all our systems and services, in addition to notifying us well in advance if something goes wrong. This, I believe, will prepare us to resolve issues well before the support tickets roll in.
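As an illustration of the kind of metric we were missing, here is a rough sketch of a per-carrier delivery-rate check. The function names, the 90% threshold, the minimum sample size and the status strings are all assumptions made for this example, and it presumes each log record already carries a carrier label.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str]  # (carrier, delivery status) -- assumed log shape


def delivery_stats(records: Iterable[Record]) -> Dict[str, Tuple[int, int]]:
    """Return {carrier: (delivered_count, total_count)}."""
    counts = defaultdict(lambda: [0, 0])
    for carrier, status in records:
        counts[carrier][1] += 1
        if status == "delivered":
            counts[carrier][0] += 1
    return {carrier: (d, t) for carrier, (d, t) in counts.items()}


def carriers_to_alert(
    records: Iterable[Record],
    threshold: float = 0.9,   # assumed: alert below 90% delivered
    min_messages: int = 5,    # assumed: ignore carriers with too few messages
) -> List[str]:
    """List carriers whose delivery rate has dropped below the threshold."""
    stats = delivery_stats(records)
    return sorted(
        carrier
        for carrier, (delivered, total) in stats.items()
        if total >= min_messages and delivered / total < threshold
    )


# Made-up sample resembling the incident: one carrier silently failing.
sample = (
    [("vodafone", "undelivered")] * 6
    + [("airtel", "delivered")] * 9
    + [("airtel", "failed")]
    + [("other", "failed")] * 2
)
alerts = carriers_to_alert(sample)
```

A check like this, run every few minutes over recent provider logs, would have flagged the carrier-specific failure long before the first support ticket.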