EMC Avamar – Epic Fail.

Terrible initial implementation. High-downtime expansion. Unreliable backups. Absentee support. That’s EMC Avamar.

On the tiny upside, deduplication works great…when backups work.

In September 2011, our tragedy began. We’re a 99% VMware-virtualized shop and bought into EMC Avamar on the promise that its VMware readiness and design orientation would make for low-maintenance, high-reliability backups. In our minds, this was a sort of near-warm redundancy with backup sets that could restore mission-critical systems to another site in under six hours. Sales even pitched that we could take backups every four to six hours and thus reduce our RPO. Not to be.

Before continuing, I should qualify all that gloom and woe by saying that we have had a few stretches of uneventful reliability, but only when we avoided changing anything. And during one of those supposedly quiet stretches, a bug in the core functionality rendered critical backups unusable. But I digress…

It probably should have struck us as inconsistent to invest in so much hardware when our goal was to reduce it, but we believed this would step us out of our monolithic tape architecture and toward a greater HA ecosystem in the future. EMC has made much progress with its storage array usability, so maybe we thought that extended to Avamar. Perhaps someday it will.

On the personnel front, we started off on the wrong foot when the rack-and-stack crew didn’t know what a Gen4 grid was. That’s important, because EMC changed a lot of things between Gen3 and Gen4: networking, racking, etc. So we (the customer) had to actually help the outsourced (non-EMC) resources get the nodes properly connected. EMC then tried a remote configuration of the Avamar grid software, but the engineer (overseas, we think) had nothing to offer on best practices, backup plans, etc. Now, you might say that such items were our responsibility, but in Avamar, these things need to be aligned to deduplication policies, datasets, and daily windows of backup, blackout, and maintenance. So we appealed for a better resource and they gave us a local engineer who laid a better starting foundation.

Three weeks later, our grids were full. It just so happens that six months before we purchased Avamar, we had some talks about Data Domain and scoped storage there using certain figures. Those same figures, which counted mere files, databases, and mailboxes rather than full VMs, left our small deployment badly undersized. Time for another $150,000. You always keep that much spare change in your IT budget, right?

So we bit the bullet and added on three more storage nodes to each array. That’s when we learned that when you add storage nodes, the array has to be taken offline for one or more days while the data rebalances across the nodes. Not exactly what we had in mind. Our backups were already suspended due to space, and now they had to stay offline until the maintenance task completed. Oh, and the larger the grid, the longer the downtime when nodes are added.

That took us into November and December, I think. In December/January, we watched as utilization across the nodes started drifting apart. It’s supposed to stay balanced (see previous paragraph and required downtime for this balancing act). After several support requests and our insistence that this was not normal, EMC Support finally agreed and we had a heart-to-heart on these issues (February 2012). The consensus was that the entire grid needed to be re-initialized (read: wipe out and start over). Nothing like going back to square one…

The re-initialization fixed the balancing issue and we had relative peace for about a month and a half. Then, at the beginning of April, we had a situation where we needed to restore a critical VM. We tried, but it failed. We tried again in every way possible, but it still failed. So we opened a support request (which is always left for days in an “unassigned” state until you raise a ruckus) and were told that a bug caused white space in VMs to get dropped, rendering the backups invalid and unrecoverable. Apparently this was a time bomb of sorts that deserved a red alert that was never sounded. It didn’t affect every VM (yet), but of course it did affect the one we needed. We were able to go back about five days and get a pre-broken backup (not ideal, but better than nothing), but had we not discovered this when we did, our mission-critical infrastructure would have remained unrecoverable until who knows when.

Then we asked for the 6.1 software upgrade, which went surprisingly well. It was at this point (in June, maybe?) that we discovered our Avamar Data Switches (used by the grid backend for data balancing, etc.) were never properly configured. Apparently that was a part of the job that the initial rack-and-stack crew didn’t understand. It only took nine or so months to get that right.

All that covers the reliability and support issues, but doesn’t touch the usability (or lack thereof). Restoring a single VM takes roughly a dozen clicks through a slow, Java-based, archaic GUI. We planned to run monthly restores of our production environment to a DR array. You’d think that we could batch that process or grab several VMs and tell them all to restore. Nope. It’s a one-by-one process. We hoped that would change for the better between 6.0 and 6.1, but it didn’t. A few things were rearranged, but one-by-one remains.

At VMworld 2012, we heard that Avamar was being rolled into vSphere and our excitement rose as we dreamed of better features, integration, and management. Again, not to be. It’s Avamar Virtual Edition repackaged. Not scalable. No replication. Supposedly it’s a teaser to get you to buy the full-blown thing. I hope no one fell for that.

For now, we’re hanging in there (what else can we do after throwing so much at it?). We have our little nightly backup windows and our daily blackout and maintenance windows (the grid has to have its alone time for at least 8 hours a day). And every time something changes, we wait for the failures to start. One day we will be free. But not yet. Not yet…
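For the arithmetic-minded, here is a back-of-the-envelope sketch of why the "backups every four to six hours" pitch never squared with those windows. It assumes (my simplification, not anything from EMC documentation) that the 8 hours of blackout and maintenance form one contiguous block during which no backup can start. The function names are made up for illustration.

```python
HOURS_PER_DAY = 24


def available_backup_hours(reserved_hours_per_day):
    """Hours left in a day for backups once the grid's reserved time is removed."""
    if not 0 <= reserved_hours_per_day < HOURS_PER_DAY:
        raise ValueError("reserved hours must be in the range [0, 24)")
    return HOURS_PER_DAY - reserved_hours_per_day


def min_worst_case_gap_hours(backup_interval_hours, reserved_hours_per_day):
    """Lower bound on the longest gap between two backup starts.

    Assumption: the reserved time is one contiguous block in which no backup
    may start, so the gap spanning that block is at least as long as the block.
    """
    if backup_interval_hours <= 0:
        raise ValueError("backup interval must be positive")
    available_backup_hours(reserved_hours_per_day)  # validates the reserved hours
    return max(backup_interval_hours, reserved_hours_per_day)


if __name__ == "__main__":
    reserved = 8  # the grid's daily "alone time" from this post
    for interval in (4, 6):  # the intervals Sales pitched
        gap = min_worst_case_gap_hours(interval, reserved)
        print(f"every {interval}h pitched, worst-case gap is at least {gap}h")
```

Under that assumption, the longest stretch without a fresh backup is set by the grid's alone time rather than by the schedule you wanted, so the worst-case RPO can be no better than 8 hours either way.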
