April 2011 – TrackResults

April 22, 2011 2:28pm MST

Dear TrackResults Clients, Users, Investors, and Staff,
If you experienced an interruption of service, we want to inform you of the current status and our strategy regarding TrackResults’ unavailability to a majority of clients on April 21st. First of all, TrackResults is 100% back online, and although the delay and interruption made for a heavy recovery effort, all historical data has been recovered with no loss.

Amazon’s Elastic Compute Cloud (EC2) platform suffered an unprecedented outage in its Virginia datacenter at 1:48am April 20th, causing mass outages across the internet. This triggered the EC2 systems to enter a mass rebuild and recovery mode. The systems were not able to replicate, rebuild, or heal themselves, which resulted in substantial failures that impacted some of our servers. Amazon experienced what might be compared to a rolling blackout, in which surges overwhelmed its ability to keep up.

Amazon could not provide us a concrete timetable for total service restoration, so we were forced to make a command decision. Amazon’s inability to restore volumes fast enough initially compromised our ability to restore some customers’ data for the period between 3:00am April 19th and 1:48am April 20th.

Therefore, TrackResults restored some client sites using the backups from 3:00am April 19th. Any data entered during the following 22-hour window, including bookings, tours, etc., would not be in your database.

TrackResults backs up every client’s entire database every night at 3:00am to an offsite redundant location as a disaster recovery and business continuity precaution. AWS systems failed at 1:48am, 72 minutes before the nightly backup was scheduled. That 72-minute window was crucial. Fortunately, we have recovered 100% of Tuesday’s data, and can make that single missing day available to each client via an Excel spreadsheet for reentry.
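For readers who want to check the timing, here is a minimal sketch of the arithmetic behind the two windows mentioned above. The timestamps come from this letter; the variable names and the `minutes` helper are illustrative only and are not part of any TrackResults system. Time zones are ignored on the assumption that all times are in the same local zone.

```python
from datetime import datetime, timedelta

# Timestamps as stated in the letter (same local time zone assumed).
last_backup = datetime(2011, 4, 19, 3, 0)   # last completed nightly backup
failure = datetime(2011, 4, 20, 1, 48)      # time the AWS systems failed
next_backup = datetime(2011, 4, 20, 3, 0)   # nightly backup that could not run


def minutes(delta: timedelta) -> int:
    """Return a time difference as whole minutes."""
    return int(delta.total_seconds() // 60)


# Data entered in this window was not in the restored April 19th backup.
unprotected = failure - last_backup

# How close the failure came to the next scheduled backup.
missed_by = next_backup - failure

print(f"Window since last backup: {minutes(unprotected) // 60} h {minutes(unprotected) % 60} min")
print(f"Failure preceded next backup by: {minutes(missed_by)} min")
```

The first figure works out to 22 hours 48 minutes, which the letter rounds to a 22-hour window, and the second is the 72 minutes cited above.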

Within the next few days we are implementing additional redundancy strategies to ensure reliability in the event of catastrophic failure of a primary provider.

We are escalating secondary tour data backup to the S3 storage sites, running on a 2-hour cycle instead of a 24-hour cycle. This will guarantee a maximum 4-hour integrity model in case of interruption during the backup process.
We are investing in replicated RAID servers with a third-party cloud service, always online remotely (on standby), as a third failover in the unlikely event of a complete, unrecoverable AWS failure.
We are increasing your database server snapshot frequency to every 12 hours instead of every 24 hours. This is a complete image of the server and database state.
We already have offsite DNS operating for failover events.
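
One way to read the 4-hour figure above: if a single backup run is interrupted, the newest good copy is one cycle older than usual, so the worst-case exposure is two cycles. The sketch below illustrates that reasoning only. The function name and the "one interrupted run" interpretation are assumptions, not a description of the actual TrackResults backup software.

```python
def worst_case_loss_hours(interval_hours: float, interrupted_runs: int = 0) -> float:
    """Upper bound, in hours, on the age of the newest good backup.

    interval_hours: time between scheduled backups.
    interrupted_runs: consecutive scheduled runs assumed to have failed.
    """
    if interval_hours <= 0:
        raise ValueError("interval_hours must be positive")
    if interrupted_runs < 0:
        raise ValueError("interrupted_runs cannot be negative")
    # Each failed run pushes the newest good backup back by one full cycle.
    return interval_hours * (interrupted_runs + 1)


# Previous nightly cycle with no interruption.
print(worst_case_loss_hours(24))
# New 2-hour cycle with one interrupted run.
print(worst_case_loss_hours(2, interrupted_runs=1))
```

Under these assumptions, the old nightly schedule could expose up to 24 hours of data, while the 2-hour cycle stays within 4 hours even when one run is interrupted.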

Clients still retain the ability to make an instant backup at any time, using the TrackResults software’s “export” feature in the administrative panel. It is our responsibility to ensure clients are aware of this feature and trained on how to use it.

The cloud computing industry experienced its largest growing pain, and will become even stronger because of it. TrackResults’ business intelligence division, like many others including The NY Times, the European Space Agency, and Netflix, relies on Amazon to deliver our applications. We apologize for the impact this unforeseen event has had on your operations. We worked around the clock with Amazon to get everyone completely restored. TrackResults will hold Amazon to higher standards if they expect to retain our trust and patronage. TrackResults selected the EC2 platform for its scalability, reliability, and redundancy, and we remain confident that Amazon’s product is the best on the market today.

We acknowledge that you put trust in our hands to prepare for intelligence continuity on your behalf, and to protect your data’s availability and security. They say every cloud has a silver lining, and TrackResults has defined an improved redundancy and failover strategy. Your data’s availability, reliability, and integrity are a critical part of our mission and will remain our commitment to you. We expect a few intermittent interruptions while AWS fixes some bugs over the next few days. Hundreds of AWS clients are rehydrating servers, and periodic connectivity issues (10 to 20 minutes) are likely to happen. We invite you to contact us directly with any concerns, ideas, or questions.

Sincerely,
Drew and Todd (Founders)
