Commit 616de7d

Merge pull request #19 from agember/talk-to-an-operator
Talk to a network operator today
2 parents 9d6393c + 5003384 commit 616de7d

7 files changed

+187
-0
lines changed

_config.yml

Lines changed: 12 additions & 0 deletions
@@ -81,6 +81,18 @@ authors:
8181
avatar: /assets/images/tim-thijm.jpg
8282
bio: "Tim is a PhD candidate at Princeton University, where he is currently advised by Prof. Aarti Gupta. Tim completed his BSc at the University of Toronto in 2018 in Computer Science and English."
8383

84+
akella:
85+
name: Aditya Akella
86+
site: http://pages.cs.wisc.edu/~akella/
87+
avatar: /assets/images/aditya-akella.jpg
88+
bio: "Aditya Akella is a Professor of Computer Sciences at the University of Wisconsin-Madison, where he leads the Wisconsin Internet and Systems Research (WISR) Lab, and a Visiting Scientist at Google. Aditya has also been a Visiting Researcher at Microsoft (2014), a Visiting Associate Professor at the University of Washington (2013), and a Postdoc at Stanford University (2005-2006). Aditya has received numerous awards, including a Finalist for the Blavatnik National Award for Young Scientists (2020), an H.I. Romnes Faculty Fellowship (2018), SACM Student Choice Professor of the Year (COW) Award (2017, 2019), the Internet Engineering Task Force (IETF) Applied Networking Research Prize (2015), and the ACM SIGCOMM Rising Star Award (2014). Aditya received a Ph.D. in Computer Science from Carnegie Mellon University (2005) and a B.Tech. in Computer Science from Indian Institute of Technology, Madras (2000)."
89+
90+
gemberjacobson:
91+
name: Aaron Gember-Jacobson
92+
site: https://aaron.gember-jacobson.com
93+
avatar: /assets/images/aaron-gember-jacobson.jpg
94+
bio: "Aaron Gember-Jacobson is an Assistant Professor of Computer Science at Colgate University. Aaron's research focuses on network configuration verification and synthesis. Aaron enjoys teaching Introduction to Computing, Operating Systems, Computer Networks, and a First Year Seminar entitled 'The Unreliable Internet.' Aaron received a Ph.D. and Master of Science in Computer Science from the University of Wisconsin-Madison and a Bachelor of Science in Computer Science from Marquette University. During his Ph.D., Aaron was awarded the Internet Engineering Task Force (IETF) Applied Networking Research Prize (2015) and an IBM Ph.D. Fellowship."
95+
8496
# Defaults
8597
defaults:
8698

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
1+
---
2+
title: "Aaron Gember-Jacobson"
3+
layout: default
4+
permalink: "/author-aaron-gember-jacobson.html"
5+
---
6+
<div class="container">
7+
<div class="row justify-content-center">
8+
<div class="col-md-8">
9+
<div class="row align-items-center mb-5">
10+
<div class="col-md-9">
11+
<h2 class="font-weight-bold">{{page.title}} <span class="small btn btn-outline-success btn-sm btn-round"><a href="{{ site.authors.gemberjacobson.site }}">View</a></span></h2>
12+
<p class="excerpt">{{ site.authors.gemberjacobson.bio }}</p>
13+
</div>
14+
<div class="col-md-3 text-right">
15+
<img alt="{{ site.authors.gemberjacobson.name }}" src="{{site.url}}/{{ site.authors.gemberjacobson.avatar }}" class="rounded-circle" height="100" width="100">
16+
</div>
17+
</div>
18+
<h4 class="font-weight-bold spanborder"><span>Posts by {{page.title}}</span></h4>
19+
{% for post in site.posts %}
20+
{% if post.authors contains "gemberjacobson" %}
21+
{% include main-loop-card.html %}
22+
{% endif %}
23+
{% endfor %}
24+
</div>
25+
</div>
26+
</div>

_pages/author-aditya-akella.html

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
1+
---
2+
title: "Aditya Akella"
3+
layout: default
4+
permalink: "/author-aditya-akella.html"
5+
---
6+
<div class="container">
7+
<div class="row justify-content-center">
8+
<div class="col-md-8">
9+
<div class="row align-items-center mb-5">
10+
<div class="col-md-9">
11+
<h2 class="font-weight-bold">{{page.title}} <span class="small btn btn-outline-success btn-sm btn-round"><a href="{{ site.authors.akella.site }}">View</a></span></h2>
12+
<p class="excerpt">{{ site.authors.akella.bio }}</p>
13+
</div>
14+
<div class="col-md-3 text-right">
15+
<img alt="{{ site.authors.akella.name }}" src="{{site.url}}/{{ site.authors.akella.avatar }}" class="rounded-circle" height="100" width="100">
16+
</div>
17+
</div>
18+
<h4 class="font-weight-bold spanborder"><span>Posts by {{page.title}}</span></h4>
19+
{% for post in site.posts %}
20+
{% if post.authors contains "akella" %}
21+
{% include main-loop-card.html %}
22+
{% endif %}
23+
{% endfor %}
24+
</div>
25+
</div>
26+
</div>
Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
1+
---
2+
layout: post
3+
title: "Talk to a network operator today!"
4+
authors: [akella, gemberjacobson]
5+
categories: [overview, research, network, verification, operators]
6+
image: assets/images/management-practices-survey.png
7+
tags: []
8+
---
9+
10+
Recently, we have seen significant advances in tools that bring formal methods to networking. These tools verify if a given network satisfies important properties, automatically repair broken networks, and synthesize networks that provably satisfy properties. The hope is that these tools will help turn the practice of networking from a complex, potentially error-prone process to a principled discipline that is rooted in sound science and engineering, resulting in networks that are less bug-prone and, ideally, bug-free.
11+
12+
While we have made big intellectual strides in developing these tools, it still feels like the beginning of a long journey. In our view, the research community as a whole has a ways to go before the tools are useful and widely applicable. And a major reason is the lack of close interaction with the network operator community.
13+
14+
This article is a call to arms to those researching in this space to have a
15+
"network operator-buddy", actually, several operator-buddies. Talk to these
16+
buddies about their needs and pain-points; talk to them about obtaining access
17+
to operational data, and let the data guide your research as much as possible;
18+
talk to them about your ideas, and get them to use your tools! Our experience
19+
has been that this opens up not only interesting new research problems but
20+
also a direct path to real-world impact. Such conversations can be especially
21+
valuable for newcomers to this space.
22+
23+
Below we outline some highlights from our own exploration of this research topic, and how it was, and continues to be, largely informed by feedback and help from network operators and by insights derived from their invaluable data.
24+
25+
## How data, and conversations with operators, opened the door for us
26+
27+
Early work in this space focussed on network dataplane verification; notable
28+
examples include [Anteater](http://conferences.sigcomm.org/sigcomm/2011/papers/sigcomm/p290.pdf) and [HSA](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final8.pdf). These works aimed to validate whether the current forwarding state of the network satisfies important properties, such as ensuring that a given pair of hosts is blocked, and that there are no routing blackholes or forwarding loops.
29+
30+
Around roughly that time, we were engaged in [an empirical project on
31+
understanding operational management
32+
planes](https://conferences2.sigcomm.org/imc/2015/papers/p395.pdf) -- how
33+
real-world networks are designed, and operated. To this end, we collected
34+
operational data (longitudinal network configurations, trouble tickets,
35+
topologies) from various networks around the US, including cloud data centers,
36+
enterprise networks, and university campus networks. (Data is also openly
37+
available from various research and education networks, including
38+
[Internet2](https://noc.net.internet2.edu/i2network/live-network-status.html)
39+
and [Purdue University](https://engineering.purdue.edu/~isl/network-config/).)
40+
A key component of this study was the frequency of configuration changes. Our
41+
study shed light on a surprisingly high frequency of change events, where each
42+
change event includes updates to configurations of multiple network devices
43+
close to each other in time. We found that change events can be of various
44+
scopes, some spanning an entire network, and others touching just a couple of
45+
devices.
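
As a rough illustration of what "change event" means here, the sketch below groups per-device configuration changes that occur close together in time. The function name `group_change_events` and the five-minute window are assumptions made for this example; the study's actual grouping threshold is not stated in this post.

```python
from datetime import datetime, timedelta


def group_change_events(changes, window=timedelta(minutes=5)):
    """Group (timestamp, device) changes into change events.

    A change joins the current event when it occurs within `window` of the
    previous change; otherwise it starts a new event. The window size is a
    hypothetical default, not a value taken from the study.
    """
    events = []
    for ts, device in sorted(changes):
        if events and ts - events[-1]["end"] <= window:
            events[-1]["devices"].add(device)
            events[-1]["end"] = ts
        else:
            events.append({"start": ts, "end": ts, "devices": {device}})
    return events


if __name__ == "__main__":
    t0 = datetime(2020, 1, 1, 9, 0)
    log = [
        (t0, "core-1"),
        (t0 + timedelta(minutes=2), "edge-7"),
        (t0 + timedelta(minutes=30), "core-1"),
    ]
    for event in group_change_events(log):
        print(event["start"], sorted(event["devices"]))
```

The number of devices in an event gives a simple proxy for its scope, and the number of events per month gives a proxy for change frequency.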
46+
47+
Crucially, we found critical empirical evidence that networks with a greater degree of change -- either frequent changes, or changes that are big in scope, or both -- were more susceptible to failures (as indicated by higher incidence of trouble tickets). This in particular hinted at a potentially important use case that the state-of-the-art dataplane verification tools may be missing, namely, proactive verification. Dataplane verification tools verify just the current snapshot of the network's forwarding state. Given the frequency of changes, and the relationship between changes and outages, we felt that network operators may also be interested in whether the modifications might result in violations of properties.
48+
49+
Our informal conversations with operators from [US Big-10](https://bigten.org) schools (of which [UW-Madison](https://wisc.edu) is one) confirmed this for us: operators indicated that they would much rather analyze configuration changes before deployment than identify bugs when they manifest after deployment, by which time it is often already too late! About the same time, a group of researchers presented [Batfish](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-fogel.pdf) -- the first control plane verifier to "derive the actual data plane that would emerge given a configuration and environment." Using Batfish, the researchers uncovered a variety of bugs in two real university networks.
50+
51+
Our own and others' analysis of operational data and discussions with operators thus opened the door to our work on
52+
[ARC](https://aaron.gember-jacobson.com/docs/gember-jacobson2016arc.pdf) and
53+
subsequently, [Tiramisu](https://www.usenix.org/system/files/nsdi20-paper-abhashkumar.pdf), tools for proactive verification of network control planes.
54+
55+
## More conversations helped us home in on our approach
56+
57+
Early on when developing a framework for proactive control plane verification,
58+
our focus was largely on "one-shot" verification of properties, such as verifying if a given control plane configuration allows two hosts or subnets to reach each other, or that the configuration results in no routing blackholes. However, UW-Madison operators anecdotally mentioned several incidents where their network's configurations appeared safe (i.e., passed various simplistic tests) prior to rollout, but the network nevertheless experienced major service outages under unexpected but routine events such as a link or a switch failure. These anecdotes highlighted for us the importance of analyzing a control plane's ability to satisfy key properties under various possible failure scenarios, i.e., being able to answer questions such as: does this configuration change continue to ensure reachability between a given pair of subnets even when an arbitrary set of k links fails unexpectedly?
59+
60+
This observation led to a major change in our approach to verification: we
61+
were initially exploring a simulation-based approach (similar to [Batfish](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-fogel.pdf)) where we would execute a control plane's processing to determine what paths it might end up generating, but such an approach quickly becomes untenable when exploring the large number of failure scenarios -- running a simulation (which is already slow) per failure scenario is prohibitively expensive. We then pivoted to develop a control plane model that is amenable to rapid simultaneous exploration of many failure scenarios. The result was [ARC](https://aaron.gember-jacobson.com/docs/gember-jacobson2016arc.pdf), a first-of-a-kind graph-based control plane encoding. ARC allows mapping the problem of proactively verifying if a control plane satisfies a given property under failures (e.g., "k-reachability" as exemplified above) to running appropriate fast, polynomial-time graph algorithms (e.g., determining if the graph min-cut > k).
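
To make the min-cut idea concrete, here is a minimal sketch that checks k-reachability on a plain undirected link graph using unit-capacity max-flow. It is not ARC's actual encoding, which models the control plane rather than the raw topology; the function names `min_cut` and `survives_k_failures` and the toy topology are assumptions for illustration only.

```python
from collections import defaultdict, deque


def min_cut(links, src, dst):
    """Minimum number of links whose removal disconnects src from dst.

    Computed as unit-capacity max-flow with BFS augmenting paths. Each
    undirected link is modeled as two directed arcs of capacity 1.
    """
    if src == dst:
        raise ValueError("src and dst must differ")
    cap = defaultdict(int)
    adj = defaultdict(set)
    for u, v in links:
        cap[(u, v)] += 1
        cap[(v, u)] += 1
        adj[u].add(v)
        adj[v].add(u)
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {src: None}
        queue = deque([src])
        while queue and dst not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if dst not in parent:
            return flow
        # Push one unit of flow back along the path that was found.
        v = dst
        while parent[v] is not None:
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1


def survives_k_failures(links, src, dst, k):
    """True if src still reaches dst after any k link failures."""
    return min_cut(links, src, dst) > k


if __name__ == "__main__":
    square = [("A", "B"), ("B", "D"), ("A", "C"), ("C", "D")]
    print(survives_k_failures(square, "A", "D", 1))
    print(survives_k_failures(square, "A", "D", 2))
```

A single min-cut computation answers the question for every failure set of size k at once, which is the contrast with running one simulation per failure scenario.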
62+
63+
In trying to enable fast analysis under failures, ARC makes some assumptions
64+
about the control plane's design. For example, it assumes that routing
65+
protocols such as BGP are used in specific ways (such as, no community
66+
attributes, and no iBGP), and layer-2 (e.g., VLAN) configurations are correct.
67+
Talking to operators from [NYSERNET](https://www.nysernet.org),
68+
[ESnet](http://es.net), UW-Madison, and [Colgate
69+
University](https://www.colgate.edu), showed us that these assumptions are
70+
quite limiting in practice and hinder ARC's applicability. Analyzing the
71+
configuration data from some of these networks further showed the broad extent
72+
to which the protocol features we "assumed away" were in use. This analysis
73+
led us to [Tiramisu](https://www.usenix.org/system/files/nsdi20-paper-abhashkumar.pdf), our most recent verification tool; Tiramisu's control
74+
plane embedding builds off of ARC's simple graph model. Notably, Tiramisu uses
75+
a multi-layer graph to encode inter-protocol dependencies (e.g., dependencies
76+
between BGP and IGPs, and routing protocols' dependence on layer-2
77+
technologies like VLANs), and multi-dimensional edge attributes to
78+
encode the various metrics routing protocols use in their decision-making.
79+
(Stay tuned for a future article where we hope to cover Tiramisu's technical
80+
underpinnings in more detail!)
81+
82+
## Identifying new directions
83+
84+
Operators continue to offer us valuable insights on our verification tools
85+
that have steered our work in an interesting new direction, and helped us
86+
identify which problems are the most important to work on.
87+
88+
Notably, at a presentation on ARC and Tiramisu at a large online service
89+
provider, a network operator asked us if the graph-based control plane
90+
abstractions underlying these tools can be used to validate if a given network
91+
is susceptible to overload: more precisely, does there exist a combination of
92+
a group of links failing and a network traffic matrix that causes the
93+
network's control plane to select paths that may overload certain links? This
94+
is an example of a "quantitative verification question" that existing
95+
proactive verification tools cannot answer because of their exclusive focus on
96+
qualitative path properties (such as path existence, number of paths, etc).
97+
Interestingly, the same question arose in a discussion we had with operators at UW-Madison. This led to our work on [QARC](http://wisr.cs.wisc.edu/papers/pldi20qarc.pdf), a tool for exhaustively checking networks for potential susceptibilities to overload. We hope to talk about QARC at length in a future article.
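
The operator's question can be illustrated with a brute-force sketch: enumerate every set of up to k failed links, route a fixed traffic matrix over shortest hop-count paths, and report the first scenario in which some link carries more than its capacity. This is not how QARC works, and it simplifies the question by fixing the traffic matrix rather than searching over matrices; `shortest_path`, `find_overload`, and the example topology are hypothetical names and data for this example.

```python
from collections import deque
from itertools import combinations


def shortest_path(links, src, dst):
    """Return the links (as frozensets) on one shortest hop-count path, or None."""
    adj = {}
    for u, v in links:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in adj.get(u, []):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    if dst not in parent:
        return None
    path = []
    v = dst
    while parent[v] is not None:
        path.append(frozenset((parent[v], v)))
        v = parent[v]
    return path


def find_overload(links, capacity, demands, k):
    """Return (failed_links, overloaded_link) for the first scenario with at
    most k failures that overloads a link, or None if no scenario does."""
    for n in range(k + 1):
        for failed in combinations(links, n):
            alive = [link for link in links if link not in failed]
            load = {}
            for (src, dst), volume in demands.items():
                # Demands with no surviving path contribute no load.
                for link in shortest_path(alive, src, dst) or []:
                    load[link] = load.get(link, 0) + volume
            for link, used in load.items():
                if used > capacity[link]:
                    return failed, tuple(sorted(link))
    return None


if __name__ == "__main__":
    links = [("A", "B"), ("B", "D"), ("A", "C"), ("C", "D")]
    capacity = {frozenset(link): 10 for link in links}
    demands = {("A", "D"): 6, ("C", "D"): 6}
    print(find_overload(links, capacity, demands, k=0))
    print(find_overload(links, capacity, demands, k=1))
```

The enumeration grows combinatorially with k, which is one reason an exhaustive checker needs a smarter encoding than this loop.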
98+
99+
Our work on control plane
100+
[repair](https://aaron.gember-jacobson.com/docs/gember-jacobson2017cpr.pdf)
101+
and [synthesis](http://wisr.cs.wisc.edu/papers/sigmetrics18zeppelin.pdf) -- yet another topic we hope to explore in a future article -- was also heavily influenced by surveys of network operators, analysis of operational data, and, of course, conversations and interviews with operators.
102+
103+
## How to make operator-buddies?
104+
105+
As our experience illustrates, operator-buddies can be an invaluable resource. They can help you understand what problems to focus on, identify how to refine candidate approaches, and also suggest new directions altogether. But how does one (especially someone in an academic setting) go about making operator-buddies?
106+
107+
Luckily, in our experience, there are plenty of avenues for building relationships with network operators.
108+
109+
First, consider setting up a coffee meeting or two with your campus operators.
110+
You will be surprised at how readily willing they are to meet, talk about
111+
their operational practices and pain points, and also share data! (If they are
112+
hesitant to share raw data with you, you can ask if they could share
113+
configurations that have been anonymized using tools like [Netconan](https://github.com/intentionet/netconan).) Once you have built a sufficiently close relationship, your campus operators may also be willing to introduce you to their friends from other regional schools or networks who themselves may contribute more discussion or even data.
114+
115+
Also, consider spending time in industry, e.g., as part of a year-long sabbatical or summer internship, and work closely with network operators there. You will likely obtain access to data while employed there and you may lose access when you leave, but the insights you gather will stay with you and inform your work for years to come.
116+
117+
Finally, consider attending network operator events such as
118+
[NANOG](https://nanog.org) and [CHI-NOG](http://chinog.org), or conferences
119+
co-located with events such as the [Open Networking
120+
Summit](https://events.linuxfoundation.org/open-networking-edge-summit-north-america/)
121+
and [IETF meetings](https://www.ietf.org), and setting up conversations with network operators there. These events attract people running a very broad variety of networks, including campuses, enterprises, ISPs of various sizes, and cloud operators, giving you perhaps the broadest exposure for your ideas and the broadest possible net for you to rope in operator collaboration. Being a passive observer of the discussions within these groups can also be insightful, so sign up for their mailing lists.
122+
123+
Or, drop us a note! We would be happy to put you in touch with our local operator-buddies and you can take it from there!

assets/images/aditya-akella.jpg

