NDA Viewpoints: Why do the world’s biggest social media sites keep crashing?

By Dr. John Bates, CEO, Eggplant

Last week the world stopped communicating. Admittedly, it was only on digital channels and only for a short period, but judging by the way the internet and world media reacted, you would think it was a darn sight more than that.

When Facebook, which has more than 2.3 billion monthly active users, and Instagram, which has one billion, crashed, it was a big problem. The irony is that in the immediate aftermath of the crash, the hashtags #InstagramDown, #FacebookDown and #WhatsApp started trending on Twitter, a result of widespread social conversation surrounding the service interruption.

What happened

So, what happened? In the middle of last week, thousands of Instagram users began reporting that they were having problems. Data from the Downdetector website showed that all three services began experiencing problems around 2pm UK time, with 91% of the Instagram issues pertaining to users unable to access their news feeds.

A much smaller proportion (2%) also reported login issues. In Facebook’s case, users encountered problems accessing photos (91%), while a smaller proportion also reported issues logging in (4%) or said they had encountered a total blackout (3%) when trying to access the site. WhatsApp users, meanwhile, flagged issues with receiving messages (69%), connectivity (28%) and logging in (1%). The location data also showed that users across the UK, parts of Europe and in North and South America appeared to bear the brunt of the issues.

A silver lining to the outage was that it made some typically hidden parts of Facebook briefly visible. Most notably, it showed how photo uploads are analysed and data extracted, using machine learning to tag the images and to read a description to blind users.

Particular commendation for restraint goes to the person who uploaded a photo of their baby only for Facebook to categorise it as “Image may contain: dog”.

Prevention is better than cure

Instagram and Facebook were both quick to tweet that the companies were “working to get things back to normal as quickly as possible”. At the time, Facebook said some of its users globally were facing issues while sending media files over its social media platforms, including WhatsApp and Instagram, and that it was working to fix the problem, which it did later that day.

While questions remain over the cause of the outage, Facebook has since released a statement confirming that, “During one of our routine maintenance operations, we triggered an issue that is making it difficult for some people to upload or send photos and videos”. But this isn’t a one-off.

It’s the second time in a year Facebook and WhatsApp have faced major technological troubles. In March, the three networks had an outage that lasted 15 hours, which was the worst outage in Facebook’s history. Facebook said in a tweet that the reason for that particular outage was a server configuration change. 

The latest problem followed earlier disruption when Cloudflare – a company that provides internet security to website operators – suffered a fault of its own that caused thousands of websites to display “502 errors” when visited. The US firm has since published a blog blaming a flawed software deployment and said, “Our testing processes were insufficient in this case and we are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future”.

Finding answers

It’s the last part of that comment that is particularly telling, especially as these outages keep occurring, and not just to the social networking sites, I might add. But why?

The increase in the frequency of outages may be down to the pressure put on brands. In this age of ‘always-on’, brands are under scrutiny to deliver peak performance week in, week out, 24 hours a day. This means there’s no longer any quiet time to test, repair or patch.

While it is absolutely possible to do these things while a site or platform is operational, the obvious risk is that users are more exposed to issues. At a time when consumers expect brands to deliver unprecedented levels of customer service excellence, managing this digital risk is critical. In fact, those that struggle with this demand are almost certainly going to suffer severe customer dissatisfaction and are at risk of losing customers and users.

More analysis needs to be put in place by the companies. Systems should be tested both pre- and post-production for load, performance issues and the unknown unknowns that might occur. This goes way beyond simply testing the happy path: it means combining the best of technology and AI with human skills and intuition to test platforms continually, using the learnings to inform subsequent improvements.

This is what will stop these outages happening so frequently. Huge social platforms may be able to get away with it. After all, outage or no outage, we’re all still going to use WhatsApp to communicate. But other brands must take heed, especially those that are charging for services or products. When it comes to delighting customers, no downtime is acceptable, and organisations need to have the right processes in place both to prevent outages and to rectify them as quickly as possible.
