
Commit 21dfef8

Draft: Multipath Will Fix This
This doesn't really explain the title, probably should be a different title. I'm also not really sure who the audience is. It's written from a very high level, hinting at the mechanisms involved without going into technical details. It's also a first draft. But still, please comment.
1 parent 889a027 commit 21dfef8

File tree: 1 file changed (+316, -0)
  • src/app/blog/multipath-will-fix-this

import { BlogPostLayout } from '@/components/BlogPostLayout'
import {ThemeImage} from '@/components/ThemeImage'

export const post = {
  draft: false,
  author: 'Floris Bruynooghe',
  date: '2025-07-29',
  title: 'Multipath Will Fix This',
  description:
    "Moving iroh's holepunching into QUIC to make connections more robust",
}

export const metadata = {
  title: post.title,
  description: post.description,
  openGraph: {
    title: post.title,
    description: post.description,
    images: [{
      url: `/api/og?title=Blog&subtitle=${post.title}`,
      width: 1200,
      height: 630,
      alt: post.title,
      type: 'image/png',
    }],
    type: 'article'
  }
}

export default (props) => <BlogPostLayout article={post} {...props} />


Iroh is a library to establish direct peer-to-peer QUIC connections.
This means iroh does NAT traversal, colloquially known as holepunching.

The basic idea is that two endpoints, both behind a NAT, establish a
connection via a relay server. Once the connection is established
they can do two things:

- Exchange QUIC datagrams via the relay connection.
- Coordinate holepunching to establish a direct connection.

And once you have holepunched, you can move the QUIC datagrams to the
direct connection and stop relying on the relay server. Simple.

Note:

This post is generally going to simplify the world a lot. Of course
there are many more network situations than two endpoints both
connected to the internet via a NAT router. And iroh has to work
with all of them. But you would get bored reading this and I would
get lost writing it. So I'm keeping this narrative simple.


# Relay Servers

An iroh relay server is a classical piece of server software, running
in a datacenter. It exists even though we want p2p connections,
because in today's internet we cannot have direct connections without
holepunching. And you cannot have holepunching without being able to
coordinate. Thus the relay server.

Because we would like this relay server to essentially *always* work,
it uses the most common protocol on the internet: HTTP/1.1 inside a TLS
stream. Endpoints establish an entirely normal HTTPS connection to
the relay server and then upgrade it to a WebSocket connection
[^websocket]. This works even in many places where the TLS connection
is Machine-In-The-Middled by inserting new "trusted" root certs
because of "security". As long as an endpoint keeps this WebSocket
connection open it can use the relay server.

[^websocket]: What's that? You're still using iroh < 0.91? Ok fine,
    maybe your relay server still uses a custom upgrade protocol
    instead of WebSockets.

The relay server itself is the simplest thing we can get away with.
It forwards UDP datagrams from one endpoint to another. Since iroh
endpoints are identified by a [`NodeId`], you send it a destination
`NodeId` together with a datagram. The relay server will now either:

- Drop the datagram on the floor, because the destination endpoint is
  not connected to this relay server.

- Forward the datagram to the destination.

The relay server does not need to know what is in the datagram. In
fact iroh makes sure it **does not** know what is inside: the payload
is always encrypted to the destination endpoint [^1].
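To make that concrete, here is a minimal sketch of the forwarding decision.
The `NodeId` alias and the map of connected endpoints are made up for
illustration; the real relay server is of course more involved.

```rust
use std::collections::HashMap;

use tokio::sync::mpsc::Sender;

/// Stand-in for iroh's `NodeId`; purely illustrative.
type NodeId = [u8; 32];

/// Sketch of the relay's forwarding decision. The payload stays
/// opaque bytes: the relay only looks at the destination `NodeId`.
async fn relay_datagram(
    conns: &HashMap<NodeId, Sender<Vec<u8>>>,
    dst: NodeId,
    payload: Vec<u8>,
) {
    match conns.get(&dst) {
        // The destination keeps a WebSocket connection open to this
        // relay server: forward the encrypted payload unchanged.
        Some(conn) => {
            let _ = conn.send(payload).await;
        }
        // Not connected here: drop the datagram on the floor.
        None => drop(payload),
    }
}
```
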

[^1]: Almost, QUIC's handshake has to establish a TLS connection.
    This means it has to send the TLS `ClientHello` message in clear
    text like any other TLS connection on the internet.


# Holepunching

UDP holepunching is simple really [^simplehp]. All you need is for each
endpoint to send a UDP datagram to the other at the same time. The
NAT routers will think the incoming datagrams are a response to the
outgoing ones and treat them as part of a connection. Now you have a
holepunched, direct connection.

[^simplehp]: Of course it isn't. But as already said, the word count
    of this post is finite.

To do this an endpoint needs to:

- Know which IP addresses it might be reachable on. Some day we'll
  write this up in its own blog post, for now I'll just assume the
  endpoints know.

- Send these IP address candidates to the remote endpoint via the
  relay server.

- Once both endpoints have the peer's candidate addresses, send "ping"
  datagrams to each candidate address of the peer. Both at the same
  time.

- If a "ping" datagram is received, respond with "yay, we
  holepunched!". Typically this will only succeed on one IP path out of
  all the candidates. Or, more and more these days, it'll succeed for
  both an IPv4 and an IPv6 path.

If you followed carefully you'll have counted 3 special messages that
need to be sent to the peer endpoint:

1. IP address candidates. These are sent via the relay server.

2. Pings. These are sent on the non-relayed IP paths.

3. Pongs. These are also sent on the non-relayed IP paths.

They need to be sent as UDP datagrams, over the same *paths* as the
QUIC datagrams are being sent: the **relay path** and any
**direct paths**.
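Seen from the code, these three messages could be as simple as the
following enum. The names and fields are made up for illustration;
iroh's actual holepunching messages carry more detail.

```rust
use std::net::SocketAddr;

/// Illustrative only: the three holepunching messages described
/// above, with hypothetical names and fields.
enum HolepunchMessage {
    /// Sent via the relay path: the addresses this endpoint thinks
    /// it might be reachable on.
    AddrCandidates(Vec<SocketAddr>),
    /// Sent on every candidate direct path, from both sides at once.
    Ping { tx_id: [u8; 12] },
    /// "Yay, we holepunched!", sent back on the path a ping arrived on.
    Pong { tx_id: [u8; 12] },
}
```
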

# Multiplexing UDP datagrams

Iroh stands on the shoulders of giants, and it looked carefully at
ZeroTier and Tailscale. In particular it borrowed a lot from
Tailscale's DERP design. From the above holepunching description we get
two kinds of packets:

- Application payload. For iroh these are QUIC datagrams.
- Holepunching datagrams.

When an iroh endpoint receives a packet it first needs to figure out
which kind of packet this is: a QUIC datagram, or a DERP datagram? If
it is a QUIC packet it is passed on to the QUIC stack [^quicstack]. If
it is a DERP datagram it needs to be handled by iroh itself, by a
component we call the *magic socket*. This is done using the "QUIC
bit", a bit in the UDP datagram which QUIC version 1 defines as always
set to 1 [^greasing].
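In QUIC version 1 this is the second-most-significant bit of the first
byte (`0x40`), so the demultiplexing boils down to a one-byte check.
A minimal sketch, not iroh's actual magic socket code:

```rust
/// Which stack an incoming UDP datagram belongs to.
enum Incoming<'a> {
    /// QUIC bit set: hand the datagram to the QUIC stack.
    Quic(&'a [u8]),
    /// QUIC bit clear: a holepunching datagram for the magic socket.
    Holepunch(&'a [u8]),
}

fn demux(datagram: &[u8]) -> Option<Incoming<'_>> {
    // The "QUIC bit" is 0x40 in the first byte of every QUIC v1 packet.
    let first = *datagram.first()?;
    if first & 0x40 != 0 {
        Some(Incoming::Quic(datagram))
    } else {
        Some(Incoming::Holepunch(datagram))
    }
}
```
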

[^quicstack]: iroh uses Quinn for the QUIC stack, an excellent project.

[^greasing]: Since then RFC 9287 has been published, which advocates
    "greasing" this bit: effectively toggling it randomly. This is
    an attempt to stop middleboxes from ossifying the protocol by
    starting to recognise this bit. Iroh not being able to grease
    this bit right now is not ideal either.


# IP Congestion Control

This system works great and is what powers iroh today. However it
also has its limitations. One interesting aspect of the internet is
*congestion control*. Basically IP packets get sent around the
internet from router to router, and each hop has its own speed and
capacity. If you send too many packets the pipes will clog up and
start to slow down. If you send yet more packets routers will start
dropping them.

Congestion control is tasked with walking the fine line of sending
as many packets as fast as possible between two endpoints, without
adversely affecting latency and packet loss. This is difficult
because there are many independent endpoints using all those links
between routers at the same time. But it has also had a few decades
of research by now, so we achieve reasonably decent results.

Each TCP connection has its own congestion controllers, one per
endpoint. The same goes for each QUIC connection. Unfortunately
our holepunching packets live outside of the QUIC connection, so they
do not. What is worse: when holepunching succeeds an iroh endpoint will
route the QUIC datagrams via a different path than before: they will
stop flowing over the relay connection and start using the direct
path. This is not great for the congestion controller, so iroh
effectively tells it to restart.

# Multiple Paths

By now I've talked several times about a "relay path" and a "direct
path". A typical iroh connection probably has quite a few possible
paths available between the two endpoints. A typical set would be:

- The path via the relay server [^relaypath].
- An IPv4 path over the WiFi interface.
- An IPv6 path over the WiFi interface.
- An IPv4 path over the mobile data interface.
- An IPv6 path over the mobile data interface.

[^relaypath]: While this is currently a single relay path, you can
    easily imagine how you could expand this to a number of relay
    server paths. Patience. The future.

The entire point of the relay path is to be able to start
communicating without needing holepunching. So that path just works.
But generally you'd expect the other four paths to need holepunching.
And currently iroh chooses the one with the lowest latency after
holepunching. But what if iroh were just aware of all those paths all
the time?


# QUIC Multipath

Let's forget holepunching for a minute, and assume we can establish
all those paths without any firewall getting in the way. Would it not
be great if our QUIC stack were aware of these multiple paths? For
example, it could keep a separate congestion controller for each path.
Each path would also have its own Round Trip Time (RTT).
So you can make an educated guess at which path you'd like to send new
packets on without them being blocked, dropped or slowed down
[^mpcongestion].
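To give a feel for what per-path state could look like, here is a
hypothetical sketch; this is not Quinn's actual multipath API, just the
shape of the idea.

```rust
use std::net::SocketAddr;
use std::time::Duration;

/// Hypothetical per-path state, one instance per open path.
struct PathState {
    remote: SocketAddr,
    /// Smoothed round trip time measured on this path only.
    rtt: Duration,
    /// Congestion window for this path, in bytes.
    cwnd: u64,
    /// Bytes currently in flight on this path.
    in_flight: u64,
}

/// Pick a path that still has congestion window to spare,
/// preferring the lowest RTT.
fn pick_path(paths: &[PathState]) -> Option<&PathState> {
    paths
        .iter()
        .filter(|p| p.in_flight < p.cwnd)
        .min_by_key(|p| p.rtt)
}
```
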

This is exactly what the [QUIC-MULTIPATH] IETF draft has been figuring
out: allow QUIC endpoints to use multiple paths at the same time. And
we totally want to use this in iroh. We can have a world where we
have several possible paths, select one as primary and the others as
backup paths, and seamlessly transition between them as your endpoint
moves through the network and paths appear and disappear [^irohmove].

There are **a lot** of details in QUIC-MULTIPATH about how to make it
work. And adding this functionality to Quinn has been a major
undertaking. But the branch is becoming functional at last.

[^mpcongestion]: But hey! Some of these paths share at least the
    first and last hop. So they are not independent! Indeed, they
    are not. Congestion control is still a research area,
    especially for multiple paths with shared bottlenecks. Though you
    should note that this already happens a lot on the internet: your
    laptop or phone probably has many TCP and/or QUIC connections to
    several servers right now. And these definitely share hops. Yet
    the congestion controllers do somehow figure out how to make this
    work, at least to some degree.

[^irohmove]: Wait, doesn't iroh already say it can do this? Indeed,
    indeed. Though if you've tried this you'd have noticed your
    application did experience some hiccups for a few seconds as iroh
    was figuring out where traffic needs to go. In theory we can do
    better with multipath, though it'll take some tweaking and
    tuning.

[QUIC-MULTIPATH]: https://datatracker.ietf.org/doc/draft-ietf-quic-multipath/

# Multipath Holepunching

If you've paid attention you'll have noticed that so far this still
doesn't solve some of our issues: the holepunching datagrams still
live outside of the QUIC stack. This means we send them at whatever
time, not paying attention to the congestion controller. That's fine
under light load, but under heavy load often results in lost packets.
That in turn leads to having to retry sending those. But preferably
without accidentally DoSing an innocent UDP socket just quietly
hanging out on the internet, accidentally using an IP address that you
thought might belong to the remote endpoint.

So the next step we would like to take with the iroh multipath project
is to move the holepunching logic itself into QUIC. We're also not the
first to consider this: Marten Seemann and Christian Huitema have been
thinking about this as well and wrote down [some thoughts in a blog
post]. More importantly they started the [QUIC-NAT-TRAVERSAL] draft
which conceptually does a simple thing: move the holepunching packets
*into* QUIC packets.

[some thoughts in a blog post]: https://seemann.io/posts/2024-10-26---p2p-quic/
[QUIC-NAT-TRAVERSAL]: https://datatracker.ietf.org/doc/draft-seemann-quic-nat-traversal/
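Conceptually the coordination messages from earlier become QUIC frames.
A purely illustrative sketch of what such frames could carry, with
hypothetical names; the draft defines the real frame layouts:

```rust
use std::net::SocketAddr;

/// Hypothetical frame payloads for holepunching inside QUIC; the
/// QUIC-NAT-TRAVERSAL draft defines the actual frames.
enum NatTraversalFrame {
    /// Advertise an address candidate to the peer, carried over the
    /// already-working (relayed) path.
    AddAddress { seq: u64, addr: SocketAddr },
    /// Ask the peer to start punching towards a candidate, so both
    /// sides send at the same time.
    PunchNow { round: u64, addr: SocketAddr },
    /// Withdraw a candidate that is no longer usable.
    RemoveAddress { seq: u64 },
}
```
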

While QUIC-NAT-TRAVERSAL is highly experimental and we don't expect to
follow it exactly as of the time of writing, this does have a number
of benefits:

- The QUIC packets are already encrypted, so we no longer need to manage
  our own encryption layer separately.

- QUIC already has very advanced packet acknowledgement and loss
  recovery mechanisms, including the congestion control mechanisms.
  Essentially QUIC is a reliable transport, which this gets to benefit
  from.

- QUIC already has robust protection against sending too much data to
  unsuspecting hosts on the internet.

- In combination with QUIC-MULTIPATH we get a very robust and flexible
  system.

Another consideration is that QUIC is already extensible. Notice that
both QUIC-MULTIPATH and QUIC-NAT-TRAVERSAL are negotiated at
connection setup. This is a robust mechanism that allows us to be
confident that in the future we'll be able to improve on these
mechanisms.


# Work In Progress

This would all change the iroh wire protocol. This is part of the
reason we want this done before our 1.0 release: once we release 1.0
we promise to keep our wire protocol the same. Right now we're hard
at work building all the pieces needed for this. And sometime
soon-ish they will start landing in the 0.9x releases.

We aim for iroh to become even more reliable for folks who push the
limits, thanks to moving all the holepunching logic right into the
QUIC stack.
