March 4, 2015

Box’s “Troubleshooting at Scale” event

This past Monday I left my HFOSS class early to go to a Box event on campus. It was listed on RIT’s job zone page as a “networking” event, but as I found out when I got there, it was actually a presentation on troubleshooting at scale.

Although it wasn’t really my part of the stack and I don’t have much expertise in it, it was very interesting to see the kinds of issues a cloud-based company faces at that scale, and the processes it has for fixing them. Box uses analytics tools like Splunk and test automation tools like ThousandEyes. I had actually heard of Splunk at Hack Upstate in Syracuse earlier this year, when I worked with some of the engineers from the company on an Arduino project.

Box has “on-call” engineers, who function as troubleshooters whenever an issue comes up, even if it’s 3 a.m. While that doesn’t sound like an appealing job, I had never thought about having to deal with system issues where a few minutes of downtime translates directly into money lost. The internet can be a crazy place, and even if the loss of traffic isn’t Box’s fault (the example was an ISP’s failure), Box is still responsible for isolating the issue on its end and following up with whoever is hurting its traffic (Comcast, Verizon, etc.).

So while it may not be directly applicable to the part of the stack I currently work in, Box’s presentation still gave me a valuable perspective, and I enjoyed the talk.