[OPLINTECH] OPLIN Outage Summary

Karl Jendretzky via OPLINTECH oplintech at lists.oplin.org
Wed Sep 20 09:54:34 EDT 2017


OPLIN would like to apologize for yesterday's extended internet outage that
affected so many of you. We are very aware of the importance of the online
services you provide to your communities and regret that those critical
services were interrupted for so long yesterday.

*Outage timeframe*:  2017/9/19 13:45 - 20:43, 6 hours 57 minutes

*Short answer*:

  Yesterday at 1:45pm the OPLIN core routers stopped routing traffic
between them, which cut the Spectrum-serviced libraries off from the
network. The secondary core has been bypassed to restore connectivity while
we determine exactly what happened to destabilize the router pair.

*Long answer*:

   For the past eight weeks OPLIN has been operating two Juniper MX480
routers in a Virtual Chassis pair, instead of a single MX480 core router.
This change is part of a larger project for the OPLIN core as we add
redundancy, and also physically move the core into the two primary network
rooms of the SOCC. The first live Spectrum connections moved onto the new
router in NR2 (Network Room 2) about 4 weeks ago, utilizing a new 20Gb
Aggregate Ethernet handoff from the vendor. Two weeks after that the
remaining 150 Spectrum circuits were migrated onto the new trunk.

  Yesterday at 1:45pm the two OPLIN cores stopped routing traffic between
them, which cut the Spectrum circuits off from the rest of the network.
Attempts to reestablish communications between the two cores destabilized
the live core and disrupted traffic to the entire OPLIN network at ~2:40pm.
Rather than risk further disruption to the rest of the network, we decided
to bypass the secondary core and focus efforts on piping the Spectrum
circuits directly to the functioning core.

  Our new problem then became that the Spectrum handoff is multiple
rooms/floors away from the primary core, with the only direct path between
the two being the links for the Virtual Chassis, which were the wrong type
of fiber. We attempted to cannibalize the links and use a switch to convert
the media and pass the trunk through, but ran into configuration troubles
that kept the Aggregate Ethernet interface from coming up
cleanly. In the end we resolved the issue by identifying a path of jumps
through fiber panels that allowed us to jumper up to the live core,
restoring connectivity for all Spectrum-serviced locations.

*Moving forward*:

  Today I'll be sitting down with OIT and Juniper to determine exactly what
went wrong with the Virtual Chassis link yesterday. If the issue can be
isolated and corrected, then utilizing Virtual Chassis will make things
more reliable and easier to manage. If there's any remaining question as to
the reliability of the technology, then we'll simply fall back to an older,
more labor-intensive redundancy technology.

Either way, I'm sure we're going to have some after-hours maintenance work
to announce in the near future. :)

Sorry again for all the hassle. I know how quickly bad days for us turn
into bad days for you.

Karl Jendretzky
IT Manager - Ohio Public Library Information Network
(614) 728-5252
karl at oplin.ohio.gov