Upgrading a Cisco MDS 9500


We recently performed an upgrade on our Cisco MDS 9509 and thought we’d share the steps with you. You’re welcome to hop on Cisco.com as well and grab the user guide, but if you’re running a 9500 with redundant Sup-2s, this should be all you need to hop between SAN-OS 3.x versions and all the way up to NX-OS 5.x…

Review upgrade guide from Cisco.com
Download new system and kickstart software from Cisco.com
Upload the new software to the MDS via TFTP
Start SolarWinds TFTP Server on a system the MDS can reach (make sure it isn’t firewalled!)
Configure the TFTP Server to look in the local directory where you copied the software
Make sure the TFTP Server is started (Configure > Start)
Open a Telnet (PuTTY) session to the MDS (or SSH, if you prefer/use that)
Upload the system software, e.g.: # copy tftp://192.168.1.10/m9500-sf2ek9-mz.4.2.7d.bin bootflash:
Upload the kickstart software, e.g.: # copy tftp://192.168.1.10/m9500-sf2ek9-kickstart-mz.4.2.7d.bin bootflash:
Assess the impact of the upgrade on the MDS

show install all impact system bootflash:m9500-sf2ek9-mz.4.2.7d.bin kickstart bootflash:m9500-sf2ek9-kickstart-mz.4.2.7d.bin

If all “verifying” steps return “SUCCESS” and the modules listed show “non-disruptive”, proceed with installation
Install software (example below)

install all system bootflash:m9500-sf2ek9-mz.4.2.7d.bin kickstart bootflash:m9500-sf2ek9-kickstart-mz.4.2.7d.bin

The software upgrade will reboot the supervisor modules, which will require reconnecting via PuTTY to monitor the remainder of the upgrade
After the upgrade is complete:
Save the running config to startup: # copy run start
Back up the running config via TFTP, e.g.: # copy run tftp://192.168.1.10/mds9509-config-20110127-4.2.7d.cfg
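
Once you can log back in, a couple of read-only commands make for a quick post-upgrade sanity check (a minimal sketch; adjust to taste):

show version
show module
dir bootflash:

show version should report the new system and kickstart images, show module should list every module as ok with one Sup active and the other ha-standby, and dir bootflash: confirms the images you uploaded are still there if you ever need them again.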

IPv6: Cisco IOS

Addressing. Routing. DHCP. EIGRP. HSRP. Mobility. After consuming Cisco’s 706-page IOS IPv6 Configuration Guide, these are just a few of the areas we’re processing as the deployment plan starts coming together. If you’re running something other than Cisco, some of the commands below, and of course EIGRP, may not directly apply, but perhaps you can abstract the concepts and use them in your own network.

Here’s a rundown of the IOS commands we’ll be utilizing as we begin to implement (a sample configuration sketch follows the list):

ipv6 address: (Interface) Apply to VLAN interfaces, routing interfaces, etc. (e.g. vlan20, g1/10, g2/0/23)
ipv6 general-prefix: (Global) Defines a named prefix for your IPv6 address space (e.g. 2001:d8:91B5::/48)
ipv6 unicast-routing: (Global) Enables IPv6 routing on the switch/router
ip name-server: (Global) Not specific to IPv4 or v6, but necessary to add IPv6 name server addresses
ipv6 dhcp relay destination: (Interface) Configure on all interfaces that need DHCP relaying
ipv6 eigrp: (Interface) Unlike EIGRP for IPv4, EIGRP for IPv6 is enabled per interface (no “network” statements); apply to routing interfaces
ipv6 router eigrp: (Global) Creates the EIGRP router process on the switch
ipv6 hello-interval eigrp: (Interface) Configured on interfaces running EIGRP to set the frequency of hello packets to adjacent routers
ipv6 hold-time eigrp: (Interface) Configured on interfaces running EIGRP to tell neighbors how long to consider the sender valid before declaring it down
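
To make that concrete, here’s a rough sketch of how those commands might fit together on one of our switches; the prefix name, addresses, VLAN number, and EIGRP AS number are made up for illustration:

! Global configuration (hypothetical prefix name, addresses, and AS 10)
ipv6 unicast-routing
ipv6 general-prefix CAMPUS-PFX 2001:DB8:91B5::/48
ip name-server 2001:DB8:91B5::53
!
! EIGRP for IPv6 process (starts in shutdown on the IOS releases we’ve read about;
! a 32-bit router ID is also required if the box has no IPv4 address configured)
ipv6 router eigrp 10
 no shutdown
!
! Example routed VLAN interface
interface Vlan20
 ipv6 address CAMPUS-PFX 0:0:0:20::1/64
 ipv6 eigrp 10
 ipv6 hello-interval eigrp 10 5
 ipv6 hold-time eigrp 10 15
 ipv6 dhcp relay destination 2001:DB8:91B5:100::25

The hello and hold timers shown (5/15 seconds) happen to be the LAN defaults; they’re included only to illustrate the syntax.
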
Coming next: a consolidated IPv6 deployment plan, derived from NIST Guidelines for the Secure Deployment of IPv6…

EMC Avamar – Epic Fail.


Terrible initial implementation. High-downtime expansion. Unreliable backups. Absentee support. That’s EMC Avamar.

On the tiny upside, deduplication works great…when backups work.

In September 2011, our tragedy began. We’re a 99% VMware-virtualized shop and bought into EMC Avamar on the promise that its VMware readiness and design orientation would make for low-maintenance, high-reliability backups. In our minds, this was a sort of near-warm redundancy with backup sets that could restore mission critical systems to another site in <6 hours. Sales even pitched that we could take backups every four to six hours and thus reduce our RPO. Not to be.

Before continuing, I should qualify all that gloom and woe by saying that we have had a few stretches of uneventful reliability, but only when we avoided changing anything. And during one of those supposedly quiet stretches, a bug in core functionality rendered critical backups unusable. But I digress…

It probably should have struck us as inconsistent to invest in so much hardware when our goal is to reduce it, but we believed this would step us out of our monolithic tape architecture and toward a greater HA ecosystem in the future. EMC has made much progress with its storage array usability, so maybe we thought that extended to Avamar. Perhaps someday it will.

On the personnel front, we started off on the wrong foot when the rack-and-stack crew didn’t know what a Gen4 grid was. That’s important, because EMC changed a lot of things between Gen3 and Gen4 (networking, racking, etc.), so we (the customer) had to actually help the outsourced (non-EMC) resources get the nodes properly connected. EMC then tried a remote configuration of the Avamar grid software, but the engineer (overseas, we think) didn’t have any ideas on best practices, backup plans, etc. Now, you might say that such items were our responsibility, but in Avamar, these things need to be aligned to deduplication policies, datasets, and daily windows of backup, blackout, and maintenance. So we appealed for a better resource, and they gave us a local engineer who laid a better starting foundation.

Three weeks later, our grids were full. It just so happens that six months before we purchased Avamar, we had some talks about Data Domain and scoped storage there using certain figures. Those same figures, which accounted for mere files, databases, and mailboxes rather than full VMs, promptly overwhelmed our small deployment. Time for another $150,000. You always keep that much spare change in your IT budget, right?

So we bit the bullet and added on three more storage nodes to each array. That’s when we learned that when you add storage nodes, the array has to be taken offline for one or more days while the data rebalances across the nodes. Not exactly what we had in mind. Our backups were already suspended due to space, and now they had to stay offline until the maintenance task completed. Oh, and the larger the grid, the longer the downtime when nodes are added.

That took us into November and December, I think. In December/January, we watched the utilization across the nodes drift out of balance. It’s supposed to stay balanced (see the previous paragraph and the downtime required for that balancing act). After several support requests and our insistence that this was not normal, EMC Support finally agreed, and we had a heart-to-heart on these issues (February 2012). The consensus was that the entire grid needed to be re-initialized (read: wiped out and started over). Nothing like going back to square one…

The re-initialization fixed the balancing issue and we had relative peace for about a month and a half. Then, at the beginning of April, we had a situation where we needed to restore a critical VM. We tried, but it failed. We tried again in every way possible, but it still failed. So we opened a support request (which is always left for days in an “unassigned” state until you raise a ruckus) and were told that this was a bug that caused white space in VMs to get dropped, rendering the backups invalid and unrecoverable. Apparently this was a time bomb of sorts that deserved a red alert that was never sounded. It didn’t affect every VM (yet), but of course it did affect the one we needed. We were able to go back about five days and get a pre-broken backup (not ideal, but better than nothing), but had we not discovered this when we did, our mission-critical infrastructure would have remained unrecoverable until who knows when.

Then we asked for the 6.1 software upgrade, which went surprisingly well. It was at this point (in June, maybe?) that we discovered our Avamar Data Switches (used by the grid backend for data balancing, etc.) were never properly configured. Apparently that was part of the initial rack-and-stack job that they didn’t understand. It only took nine or so months to get that right.

All that covers the reliability and support issues, but doesn’t touch the usability (or lack thereof). Restoring a single VM takes roughly a dozen clicks through a slow, Java-based, archaic GUI. We planned to run monthly restores of our production environment to a DR array. You’d think that we could batch that process or grab several VMs and tell them all to restore. Nope. It’s a one-by-one process. We hoped that would change for the better between 6.0 and 6.1, but it didn’t. A few things were rearranged, but one-by-one remains.

At VMworld 2012, we heard that Avamar was being rolled into vSphere, and our excitement rose as we dreamed of better features, integration, and management. Again, not to be. It’s Avamar Virtual Edition repackaged. Not scalable. No replication. It mostly seems designed to tease you into buying the full-blown thing. I hope no one fell for that.

For now, we’re hanging in there (what else can we do after throwing so much at it?). We have our little nightly backup windows and our daily blackout and maintenance windows (the grid has to have its alone time for at least 8 hours a day). And every time something changes, we wait for the failures to start. One day we will be free. But not yet. Not yet…

SQL max worker threads Problem When Using VSS to Back Up Numerous Databases


In our ongoing (sort-of pilot) migration from VMware vSphere 5.5 to Microsoft Hyper-V 2012 R2, we encountered a very concerning and puzzling issue with backups. The transition had been smooth for the most part, and we used the project to bring aging Windows/SQL 2008 servers up to 2012 R2 and 2014, respectively. Two of our SQL environments had moved over just fine and were being backed up successfully with Microsoft Data Protection Manager 2012 R2 for the time being (other products are being considered, including Veeam). The third SQL environment ran into a host of VSS errors once its data was populated and a backup was attempted.

[Image: DPM 2012 R2 – Job Failed]

Background (before/after):

Hypervisor: vSphere 5.5 to Hyper-V 2012 R2
Guest OS: Windows Server 2008 to 2012 R2 (SQL Server 2008 to 2014)
Backup product: EMC Avamar 7.0.1 to MS DPM 2012 R2
Backup method: Crash-consistent image to VSS-quiesced image

We had seen an occasional VSS-related backup failure from time to time in DPM, but most were tied to available disk space for the protection group (DPM doesn’t do so well with deduplication of images, so growth has been near-continual). Retrying didn’t make a difference this time, though. We restarted the VSS writers and even took downtime to restart the VM. Still the same failure.
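
For the curious, “restarting the VSS writers” in our case mostly meant checking writer state and bouncing the service hosting the failed writer, roughly like this from an elevated prompt in the guest (SQLWriter is the service behind the SqlServerWriter entry):

vssadmin list writers
net stop SQLWriter
net start SQLWriter
vssadmin list writers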

Digging further into the event logs, we accumulated these:

Hyper-V host: Event 10172, Hyper-V-VMMS
SQL guest: Event 8229, VSS
SQL guest: Event 18210, MSSQLSERVER
SQL guest: Event 24583, SQLWRITER
SQL guest: Event 1, SQLVDI

[Image: vssadmin list writers]
In search of the answer, we came across MS KB 2615182, a Social TechNet thread about SQLVDI errors, and finally a blog about “volume shadow barfs with 0x80040e14 code”. The KB would have helped us narrow the cause down to SQL, but this is a production server (so no stopping the SQL instance), and we already knew SQL was the issue thanks to “vssadmin list writers”. The TechNet thread seemed promising at first, but wasn’t exactly on track for us and also involved stopping SQL (to re-register SQLVDI.dll). Our third (more like 50th) find was the colorfully named blog with one of our error codes.

We had most of the errors he described, but didn’t quite have the glaring message about “Cannot create worker thread” in our events (that we could find). Of course, it could be buried in there somewhere, but the SQL server in question happens to have 276 databases attached, so the errors are prolific.

Hopping over to MSDN, we looked up the automatically configured value for max worker threads in SQL 2014 on a 2-CPU 64-bit server. That value is 512. We were largely speculating at this point, but given that 276 databases were attached, SQL background and Agent processes were also running, and VSS would need to grab at least 276 threads of its own for the backup, we took a stab in the dark that we might be blowing through that 512 cap. Plus, changing the value posed no real risk and required no service restart (woot!).
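
In hindsight, a quick look at the scheduler DMVs would have been a less speculative way to gauge worker-thread pressure; something along these lines (a rough sketch; a sustained non-zero queued_tasks value means tasks are waiting on worker threads):

SELECT SUM(current_workers_count) AS current_workers,
       SUM(active_workers_count) AS active_workers,
       SUM(work_queue_count) AS queued_tasks
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE';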

[Image: SQL Server max worker threads property]
Being conservative, as we try to be most of the time, we raised it to 1024 and clicked OK. Then we clicked “Resume backup” on the critical event in the DPM Monitoring console. Finally, we rejoiced to see VSS snaps succeeding, SQL backups happening, and data transferring into DPM.
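
For reference, here’s a rough T-SQL equivalent of the GUI change we made (the 1024 value simply mirrors our change and isn’t a general recommendation):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max worker threads';        -- view the current setting (0 = auto-configure)
EXEC sp_configure 'max worker threads', 1024;  -- the value we settled on
RECONFIGURE;

Either way, GUI or T-SQL, the change took effect for us without a service restart.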