ShareBackup could keep data in the fast lane

Rice University computer scientist Eugene Ng led the development of ShareBackup, a hardware and software solution to help data centers recover from failures without slowing applications. Credit: Jeff Fitlow/Rice University

Anyone who has ever cursed a computer network as it slowed to a crawl will appreciate the remedy offered by scientists at Rice University.

Rice computer scientist Eugene Ng and his team say their solution will keep data on the fast track when failures inevitably happen.

Ng introduced ShareBackup, a strategy that would allow shared backup switches in data centers to take on network traffic within a fraction of a second after a software or hardware failure.

He will present a peer-reviewed paper on the work this week at the SIGCOMM 2018 conference in Budapest, Hungary. The paper is online and available for download.

Ng said the idea would solve a common annoyance among data professionals, scientists and everyone who relies on a network to deliver results day in and day out.

"A data network consists of servers and network switches," said Ng, a professor of computer science and electrical and computer engineering. "Switches move data packets to where they need to go. But things fail, especially in large-scale data centers with thousands of pieces of hardware."

The usual response to a failed switch is to shunt the flow of data to another line. "Generally, the network has multiple paths for connecting servers so, just like if there's a closure on the highway, we'd drive around it. This is a conventional, natural approach that makes a lot of sense: You reroute around the failure to get where you need to go."

But sometimes that other road is congested and everything slows down. "Data centers aren't the internet; they're not about people surfing websites," Ng said. "They're about supporting data-intensive applications like data mining or machine learning. And a lot of these applications have stringent performance deadlines, so blindly rerouting traffic could be the wrong thing to do in a data center."

Rather than the expensive option of installing redundant switches throughout a network, the Ng lab's strategy would put fast switches and software in strategic locations that could pick up the traffic from a failed switch in less than a millisecond. When the problem is resolved, the team's software makes the backup switch available to handle another failure.
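The idea of a shared backup can be pictured with a small Python sketch. It is only an illustration of the borrow-and-return pattern described above, not the Rice team's software: the `BackupPool` class, its method names and the switch names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BackupPool:
    """Hypothetical model: a few backup switches shared by a group of switches."""

    free_backups: List[str]
    # Maps each failed switch to the backup currently standing in for it.
    in_use: Dict[str, str] = field(default_factory=dict)

    def fail(self, switch: str) -> Optional[str]:
        """Hand a free backup to a failed switch; return None if none is left."""
        if switch in self.in_use:
            return self.in_use[switch]
        if not self.free_backups:
            return None
        backup = self.free_backups.pop()
        self.in_use[switch] = backup
        return backup

    def repair(self, switch: str) -> None:
        """Return the backup to the pool once the failed switch is fixed."""
        backup = self.in_use.pop(switch, None)
        if backup is not None:
            self.free_backups.append(backup)


pool = BackupPool(free_backups=["backup-1"])
print(pool.fail("edge-3"))   # backup-1 takes over for edge-3
print(pool.fail("edge-7"))   # None: the only backup is busy
pool.repair("edge-3")        # edge-3 is fixed, backup-1 is released
print(pool.fail("edge-7"))   # backup-1 is free to cover the next failure
```

The point of the sketch is the last line: because the backup is released after a repair, one spare device can cover many switches over time, as long as they do not all fail at once.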

The switch is fast enough—the failure-recovery time is 0.73 milliseconds, including latency from hardware and control systems—that most users would never know that part of the system had failed.

"The reality is that the fraction of devices that fail at any given time is very small, and most of these failures can be addressed by things like rebooting the device," Ng said. "Sometimes the software gets screwed up and a simple power cycle will bring it back. These failures may also not last long.

"These are the characteristics we're trying to exploit," he said. "Because of that, we can get away with having very few devices back up a large number of devices."
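A back-of-the-envelope calculation shows why a handful of backups can go a long way when failures are rare. The Python sketch below assumes each switch in a group is down independently with the same small probability at any given moment; that independence assumption, the function name and the numbers are illustrative and do not come from the paper.

```python
from math import comb


def prob_backups_exhausted(n_switches: int, p_down: float, n_backups: int) -> float:
    """Chance that more than n_backups of n_switches are down at the same time.

    Assumes independent failures with equal probability p_down (a simplification).
    """
    covered = sum(
        comb(n_switches, i) * p_down**i * (1 - p_down) ** (n_switches - i)
        for i in range(n_backups + 1)
    )
    return max(0.0, 1.0 - covered)


# Made-up example: 48 switches, each down 0.1 percent of the time.
for backups in range(3):
    print(backups, prob_backups_exhausted(48, 0.001, backups))
```

Under these assumed numbers, each added backup sharply cuts the chance that the pool runs dry, which is the effect Ng describes.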

Ng said ShareBackup could save data centers time and money not only by maintaining full bandwidth but by also helping to analyze problems, including misconfigurations that commonly lead to network failure.

"Part of our work is to help data centers figure out what went wrong in the network," he said. "Once the backup is activated, you can take the failed device out of production and test it to identify which component caused the problem.

"Now, if we take two devices out and can't figure out which went bad, both need to be replaced," he said. "It's very likely only one of the devices is having the problem. Our software can diagnose these devices in a semiautomatic manner, and if one of the parts is good, it can be reinstated."

More information: Dingming Wu et al, Masking failures from application performance in data center networks with shareable backup, Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication - SIGCOMM '18 (2018). DOI: 10.1145/3230543.3230577

Provided by Rice University
Citation: ShareBackup could keep data in the fast lane (2018, August 16) retrieved 14 April 2024 from