Every day in the IT consulting business, at any given moment, you can be tossed into a crucible, set on fire, and put under relentless pressure by customers who find themselves in a situation that is dire for their business and that they simply cannot get themselves out of.
It is in these moments, when your plans, systems, and processes are tested by fire, that you get to see what your company is truly made of. You get to see your triumphs and your warts, and in many cases so do your customers. Late last night the pressure arrived with a phone call.
10:02 PM Last night I got a call from a customer. I asked him how he was doing, knowing full well that it could not be good if he was calling me at 10 PM. Sure enough, there was a problem. Their Microsoft Exchange servers were completely non-responsive. No email, period.
The customer was a little upset because they thought we had not set up their DAG (Database Availability Group) correctly. A working DAG should have allowed their Exchange server to fail over to a secondary Exchange server instead of bringing Exchange to an immediate halt. They were convinced this failure on our part had to be the cause of their outage.
I think we have a great culture of taking care of the customer baked into the Whitehat DNA, but we were about to find out if real life met the expectation.
10:04 PM I contacted the Whitehat operations side of the house to get this incident into our formal IT support process.
Will they answer the phone? Check.
Will they take ownership of the issue like they would their own environment? Check.
Does the customer have an IT support contract? No.
Does the customer have ANY LEVEL of IT support agreement with us? No.
There is no support contract in place of any kind to assist with rebuilding Exchange or deal with the likely data recovery. How would our support group respond?
They took immediate ownership of the problem. It was never even a consideration to leave the customer alone on Data Recovery Island because they did not buy a support agreement.
10:06 PM Operations contacted our primary Exchange resource. Not available.
10:08 PM Our secondary Exchange resource was contacted. He was on the east coast cleaning out the house he just sold, picking up the truck he left behind and loading up the last few items in the house before driving his truck back to Texas.
How did he respond when he got the call? He responded with “Let me get my laptop.” That makes me smile just thinking about it again.
10:15 PM I got the update and called back the customer contact who had originally called me. I gave him a quick status update on where his ticket was and let him know that he would be contacted shortly by our IT support group to resolve the issue.
10:20 PM Our Whitehat resource engaged with the customer. 18 minutes from ticket to tackling the issue. I am loving it.
11:42 PM I got a call from our Operations VP making sure we were all on the same page and all pulling on the same end of the rope to get the customer taken care of.
1:16 AM Engineer confirms we are engaging Microsoft to help unwind some of the damage done during an update before going into full recovery mode. This was not a data recovery issue, as end users could see their mail but could not send or receive. With the integrity of the data confirmed, and data recovery therefore off the table, the hope was that Microsoft could use PowerShell to reverse some of the steps that led to the outage in the first place.
The ultimate root cause of the Exchange problems was the customer's decision to upgrade their new Exchange server without first preparing the Schema. Realizing the mistake, they upgraded the Schema and began the Exchange update again. At this point Exchange ground to a halt for most of their end users.
Digging into their concern about our failure to set up the DAG properly revealed that they were indeed running on their secondary Exchange server. The issue here was that all of their clients were pointed at a single Exchange server instead of a DNS round robin. The clients had no way to get to the second Exchange server even though it was up and running.
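If you want a quick sanity check for the same problem in your own environment, here is a minimal sketch in Python. It only asks DNS how many distinct IPv4 addresses a client-facing name resolves to. The function names and the hostname `mail.example.com` are hypothetical and not anything from this customer's setup. Keep in mind that a single address does not always mean a single server (a load balancer can sit behind one IP), and that round robin by itself does not check server health.

```python
import socket


def distinct_addresses(hostname, port=443, resolver=socket.getaddrinfo):
    """Return the sorted list of distinct IPv4 addresses for hostname.

    resolver is injectable so the logic can be exercised without a network.
    """
    results = resolver(
        hostname, port, family=socket.AF_INET, proto=socket.IPPROTO_TCP
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is (ip_address, port).
    return sorted({entry[4][0] for entry in results})


def has_round_robin(hostname, **kwargs):
    """True when the name resolves to more than one IPv4 address."""
    return len(distinct_addresses(hostname, **kwargs)) > 1
```

Calling `has_round_robin("mail.example.com")` against the name your mail clients actually use, and getting `False`, is a hint that every client may be pinned to one box, which is worth a closer look before the next outage.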
6:47 AM Issue resolved. Microsoft was able to unwind the steps through PowerShell. We helped get the DNS issues addressed to ensure this would not be a problem in the future. Customer sends us a glowing thank you.
If you find yourself needing to upgrade Microsoft Exchange Server 2013, keep these steps in mind so you do not find yourself in the same position.
Before you install Microsoft Exchange Server 2013 or any cumulative updates on any servers in your organization, the domains and Active Directory must be prepared. To see the specific steps necessary, review the Prepare Active Directory and Domains article on TechNet.
In general, if you have a multi-domain AD forest topology, you can apply the Exchange 2010 SP3 Schema update from a machine in the same AD domain and site as the Schema Master, assuming of course that you have the appropriate access rights (Schema Admins and Enterprise Admins).
There is a great example of the Active Directory Schema upgrade procedure over on the TechNet blogs.
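To make the ordering concrete, here is a small sketch of a pre-flight checklist in Python. It does not run Exchange setup or touch Active Directory. It only tells you which preparation step comes next given what you have already completed. The step names mirror the Exchange setup switches as I understand them (/PrepareSchema, /PrepareAD, /PrepareAllDomains), so treat them as an assumption and confirm them against the TechNet article for your version. The function name and the "InstallOrUpdate" label are made up for illustration.

```python
# Preparation steps in the order they are expected to run.
# Names are assumed to match the Exchange setup switches; verify before use.
PREP_STEPS = ["PrepareSchema", "PrepareAD", "PrepareAllDomains"]


def next_action(completed):
    """Return the first preparation step not yet completed.

    completed is any collection of step names already done.
    Returns "InstallOrUpdate" only when every preparation step is done.
    """
    for step in PREP_STEPS:
        if step not in completed:
            return step
    return "InstallOrUpdate"
```

In the outage described above, the equivalent of `next_action([])` would have answered "PrepareSchema" rather than letting the upgrade start, which is the whole point of writing the checklist down before you begin.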
One last note worth mentioning is on the topic of disaster preparedness. One flaw we discovered in their Disaster Recovery plan was how they notified end users of an outage. Their solution: an email thread. This, of course, works great unless the system with the outage is email.
I am sure that is an issue that will be solved by the end of today, but you might check your own procedures so you can learn from their mistakes, not your own.