Friday, October 14, 2011

BlackBerry outage points out faults in cloud computing model

The worst BlackBerry outage in RIM's history took place earlier this week, with disruptions and delays experienced worldwide. Though initially declared "fixed" early on Tuesday morning, the problem came back with a vengeance by lunchtime, spreading to more regions as engineers struggled to deal with a huge backlog of data waiting to be delivered.
The BlackBerry service was declared "fully restored" on Thursday, though it is obvious that the outage could not have come in a worse week for the beleaguered smartphone maker. The iPhone 4S was scheduled to be available today, while iOS 5, with its competing iMessage feature, was released for download on Wednesday.
Before we wonder about whether the outage could have been avoided, let us first examine RIM's infrastructure.
The NOC
RIM makes use of a Network Operations Center, or NOC, to shuffle traffic to and from BlackBerry smartphones around the globe. Though such a design may seem somewhat convoluted, this model yields a range of benefits, such as timely delivery of messages with significantly higher data efficiency than competing devices. Moreover, this scheme allows the BlackBerry Enterprise Servers (BES) used by business users to be safely ensconced within corporate firewalls, while keeping battery usage of smartphones to a minimum. In current parlance, it can be argued that RIM's NOC is really a private cloud that was created ahead of its time.
Of course, these competitive advantages have declined markedly with the rise of high-speed data networks, a decline further hastened by cheap tariffs for mobile data. Modern smartphones and tablets now come with far more advanced processors that make light work of quickly connecting to the Internet to slurp down the desired data before snapping back into a low-power state.
Despite the winnowed-down list of advantages, RIM's tens of millions of BlackBerry smartphone users mean that the company will likely stick with the cloud for some time.
The complexity of the cloud
As I wrote earlier this year, reacting to an extended outage at Amazon's EC2, the problem is that cloud computing is unavoidably complex. For reasons best known to marketers, the perception that cloud computing is infallible by default has been left uncorrected. As the BlackBerry outage shows, however, not only can the cloud fail, but it often goes out "with a bang" that defies efforts to remedy problems quickly.
On one hand, it can be argued that RIM should have engineered its infrastructure to be far more resilient. Yet it is also evident that the company has not been completely negligent in putting secondary and failover systems in place. Ignoring the debacle with the core switch, the "cascading" failures appear to have resulted from an attempt by RIM to spread out the workload among its data centers around the world.
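To see how spreading the workload can itself trigger cascading failures, here is a toy sketch in Python. It is not a model of RIM's actual infrastructure: the site names, load figures, capacities and the redistribute function are all hypothetical, and it simply assumes that a failed site's load is split evenly among the survivors.

```python
def redistribute(loads, capacities, failed):
    """Toy model: split a failed site's load evenly among the survivors.

    All names and numbers are hypothetical. Returns the sites in the
    order in which they went down.
    """
    loads = dict(loads)          # work on a copy
    down = []
    pending = [failed]
    while pending:
        site = pending.pop(0)
        down.append(site)
        orphaned = loads.pop(site)
        if not loads:            # nobody left to absorb the load
            break
        share = orphaned / len(loads)
        for name in loads:
            loads[name] += share
        # Any survivor pushed past its capacity is queued to fail next.
        for name in loads:
            if loads[name] > capacities[name] and name not in pending:
                pending.append(name)
    return down


if __name__ == "__main__":
    loads = {"A": 60, "B": 60, "C": 60}
    roomy = {"A": 100, "B": 100, "C": 100}
    tight = {"A": 80, "B": 80, "C": 80}
    print(redistribute(loads, roomy, "A"))   # only A goes down
    print(redistribute(loads, tight, "A"))   # A takes B and C with it
```

With plenty of headroom, the surviving sites absorb the extra load. With too little, the very act of failing over is what takes the remaining sites down.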
My take on cloud computing: Leverage it if it makes sense to, but don't try to do it without preparing appropriate backup and failover.

