import { BlogPostLayout } from '@/components/BlogPostLayout'
import { ThemeImage } from '@/components/ThemeImage'

export const post = {
  draft: false,
  author: 'Floris Bruynooghe',
  date: '2025-07-29',
  title: 'Multipath Will Fix This',
  description:
    "Moving iroh's holepunching into the QUIC stack with multipath",
}

export const metadata = {
  title: post.title,
  description: post.description,
  openGraph: {
    title: post.title,
    description: post.description,
    images: [{
      url: `/api/og?title=Blog&subtitle=${post.title}`,
      width: 1200,
      height: 630,
      alt: post.title,
      type: 'image/png',
    }],
    type: 'article'
  }
}

export default (props) => <BlogPostLayout article={post} {...props} />

Iroh is a library to establish direct peer-to-peer QUIC connections.
This means iroh does NAT traversal, colloquially known as
holepunching.

The basic idea is that two endpoints, both behind a NAT, establish a
connection via a relay server. Once the connection is established
they can do two things:

- Exchange QUIC datagrams via the relay connection.
- Coordinate holepunching to establish a direct connection.

And once you have holepunched, you can move the QUIC datagrams to the
direct connection and stop relying on the relay server. Simple.

Note:

    This post is generally going to simplify the world a lot. Of
    course there are many more network situations than just two
    endpoints both connected to the internet via a NAT router. And
    iroh has to work with all of them. But you would get bored
    reading this and I would get lost writing it. So I'm keeping
    this narrative simple.


# Relay Servers

An iroh relay server is a classical piece of server software, running
in a datacenter. It exists even though we want p2p connections,
because in today's internet we cannot have direct connections without
holepunching. And you cannot have holepunching without being able to
coordinate. Thus the relay server.

Because we would like this relay server to essentially *always* work,
it uses the most common protocol on the internet: HTTP/1.1 inside a
TLS stream. Endpoints establish an entirely normal HTTPS connection
to the relay server and then upgrade it to a WebSocket connection
[^websocket]. This works even in many places where the TLS connection
is Machine-In-The-Middled by inserting new "trusted" root certs
because of "security". As long as an endpoint keeps this WebSocket
connection open it can use the relay server.

[^websocket]: What's that? You're still using iroh < 0.91? Ok fine,
    maybe your relay server still uses a custom upgrade protocol
    instead of WebSockets.

The relay server itself is the simplest thing we can get away with.
It forwards UDP datagrams from one endpoint to another. Since iroh
endpoints are identified by a [`NodeId`], you send it a destination
`NodeId` together with a datagram. The relay server will then either:

- Drop the datagram on the floor, because the destination endpoint is
  not connected to this relay server.

- Forward the datagram to the destination.

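To make that concrete, here is a minimal sketch of the forwarding
decision. The types are illustrative, not the actual iroh-relay
implementation:

```rust
use std::collections::HashMap;

use tokio::sync::mpsc;

/// An iroh `NodeId` is an ed25519 public key; raw bytes here for brevity.
type NodeId = [u8; 32];

/// Hypothetical relay state: one outgoing queue per endpoint currently
/// connected to this relay server.
struct Relay {
    clients: HashMap<NodeId, mpsc::Sender<Vec<u8>>>,
}

impl Relay {
    /// Forward `datagram` to `dest`, or drop it on the floor if the
    /// destination endpoint is not connected here.
    fn forward(&self, dest: &NodeId, datagram: Vec<u8>) {
        if let Some(client) = self.clients.get(dest) {
            // A full queue is treated like packet loss: UDP semantics,
            // no backpressure to the sender.
            let _ = client.try_send(datagram);
        }
    }
}
```
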
The relay server does not need to know what is in the datagram. In
fact iroh makes sure it **does not** know what is inside: the payload
is always encrypted to the destination endpoint [^1].

[^1]: Almost: QUIC's handshake has to establish a TLS connection.
    This means it has to send the TLS `ClientHello` message in clear
    text, like any other TLS connection on the internet.

# Holepunching

UDP holepunching is simple, really [^simplehp]. All you need is for
each endpoint to send a UDP datagram to the other at the same time.
The NAT routers will think the incoming datagrams are a response to
the outgoing ones and treat them as a connection. Now you have a
holepunched, direct connection.

[^simplehp]: Of course it isn't. But as already said, the word count
    of this post is finite.

To do this an endpoint needs to:

- Know which IP addresses it might be reachable on. Some day we'll
  write this up in its own blog post; for now I'll just assume the
  endpoints know.

- Send these IP address candidates to the remote endpoint via the
  relay server.

- Once both endpoints have the peer's candidate addresses, send "ping"
  datagrams to each candidate address of the peer. Both at the same
  time.

- If a "ping" datagram is received, respond with "yay, we
  holepunched!". Typically this will succeed on only 1 IP path out of
  all the candidates. Or, more and more these days, it'll succeed for
  both an IPv4 and an IPv6 path.

If you followed carefully you'll have counted 3 special messages that
need to be sent to the peer endpoint:

1. IP address candidates. These are sent via the relay server.

2. Pings. These are sent on the non-relayed IP paths.

3. Pongs. These are also sent on the non-relayed IP paths.

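As a sketch, those three messages could look something like this
(hypothetical types; iroh's real wire format differs):

```rust
use std::net::SocketAddr;

/// The three holepunching messages exchanged between endpoints.
enum HolepunchMsg {
    /// Sent via the relay path: the addresses we think we might be
    /// reachable on.
    Candidates(Vec<SocketAddr>),
    /// Sent directly to each candidate address of the peer.
    Ping { tx_id: u64 },
    /// Sent back on whichever path a ping arrived on: "yay, we
    /// holepunched!"
    Pong { tx_id: u64 },
}
```
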
They need to be sent as UDP datagrams, over the same *paths* as the
QUIC datagrams are also being sent: the **relay path** and any
**direct paths**.


# Multiplexing UDP datagrams

Iroh stands on the shoulders of giants, and it looked carefully at
ZeroTier and Tailscale. In particular it borrowed a lot from the DERP
design from Tailscale. From the above holepunching description we get
two kinds of packets:

- Application payload. For iroh these are QUIC datagrams.
- Holepunching datagrams.

When an iroh endpoint receives a packet it needs to first figure out
which kind of packet this is: a QUIC datagram, or a DERP datagram? If
it is a QUIC packet it is passed on to the QUIC stack [^quicstack].
If it is a DERP datagram it needs to be handled by iroh itself, by a
component we call the *magic socket*. This is done using the "QUIC
bit", a bit in the UDP datagram defined as always set to 1 in QUIC
version 1 [^greasing].
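
In code, that dispatch can be as small as checking a single bit of the
first byte. A minimal sketch, with a hypothetical function name rather
than iroh's actual API:

```rust
/// Returns true if an incoming UDP datagram should go to the QUIC
/// stack rather than to the magic socket. QUIC version 1 requires
/// the "fixed bit" (0x40 of the first byte) to be set, so a cleared
/// bit marks the datagram as one of iroh's own.
fn is_quic_datagram(datagram: &[u8]) -> bool {
    match datagram.first() {
        Some(first_byte) => first_byte & 0x40 != 0,
        // An empty datagram is not a valid QUIC packet.
        None => false,
    }
}
```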


[^quicstack]: iroh uses Quinn for the QUIC stack, an excellent
    project.

[^greasing]: Since then RFC 9287 has been released, which advocates
    "greasing" this bit: effectively toggling it randomly. This is
    an attempt to stop middleboxes from ossifying the protocol by
    starting to recognise this bit. Iroh not being able to grease
    this bit right now is not ideal either.


# IP Congestion Control

This system works great and is what powers iroh today. However it
also has its limitations. One interesting aspect of the internet is
*congestion control*. Basically IP packets get sent around the
internet from router to router, and each hop has its own speed and
capacity. If you send too many packets the pipes will clog up and
start to slow down. If you send yet more packets routers will start
dropping them.

Congestion control is tasked with walking the fine line of sending as
many packets as fast as possible between two endpoints, without
adversely affecting latency and packet loss. This is difficult
because there are many independent endpoints using all those links
between routers at the same time. But it has also had a few decades
of research by now, so we achieve reasonably decent results.

Each TCP connection has its own congestion controllers, one per
endpoint. The same goes for each QUIC connection. Unfortunately our
holepunching packets live outside of the QUIC connection, so they do
not. What is worse: when holepunching succeeds an iroh endpoint will
route the QUIC datagrams via a different path than before: they will
stop flowing over the relay connection and start using the direct
path. This is not great for the congestion controller, so iroh
effectively tells it to restart.
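
Conceptually that restart just throws away everything learned about
the old path. An illustrative sketch, not Quinn's actual congestion
control API:

```rust
use std::time::Duration;

/// Hypothetical congestion controller state, reduced to the basics.
struct CongestionController {
    /// How many bytes we may currently have in flight.
    cwnd: usize,
    /// Smoothed round trip time estimate for the current path.
    srtt: Duration,
}

impl CongestionController {
    /// Roughly ten full-size packets, in the spirit of RFC 9002.
    const INITIAL_WINDOW: usize = 10 * 1200;

    /// Called when the QUIC datagrams switch from the relay path to a
    /// freshly holepunched direct path: everything we learned about
    /// the old path is useless on the new one.
    fn reset_for_new_path(&mut self) {
        self.cwnd = Self::INITIAL_WINDOW;
        self.srtt = Duration::from_millis(333); // RFC 9002's initial RTT
    }
}
```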


# Multiple Paths

By now I've talked several times about a "relay path" and a "direct
path". A typical iroh connection probably has quite a few possible
paths available between the two endpoints. A typical set would be:

- The path via the relay server [^relaypath].
- An IPv4 path over the WiFi interface.
- An IPv6 path over the WiFi interface.
- An IPv4 path over the mobile data interface.
- An IPv6 path over the mobile data interface.

[^relaypath]: While this is currently a single relay path, you can
    easily imagine how you could expand this to a number of relay
    server paths. Patience. The future.

The entire point of the relay path is to be able to start
communicating without needing holepunching. So that path just works.
But generally you'd expect the bottom 4 paths to need holepunching.
And currently iroh chooses the one with the lowest latency after
holepunching. But what if iroh were just aware of all those paths,
all the time?


# QUIC Multipath

Let's forget holepunching for a minute, and assume we can establish
all those paths without any firewall getting in the way. Would it not
be great if our QUIC stack was aware of these multiple paths? For
example, it could keep a separate congestion controller for each
path. Each path would also have its own Round Trip Time (RTT). So
you could make an educated guess about which path to send new packets
on, without them being blocked, dropped or slowed down
[^mpcongestion].
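
For instance, a naive sender could pick the lowest-RTT path that still
has congestion window to spare. A hypothetical sketch; a real
multipath scheduler is considerably smarter:

```rust
use std::time::Duration;

/// Illustrative per-path state a multipath QUIC stack might keep.
struct PathState {
    /// Smoothed round trip time measured on this path.
    rtt: Duration,
    /// Bytes this path's congestion controller still allows in flight.
    cwnd_available: usize,
}

/// Pick the path for the next packet: the lowest-RTT path whose
/// congestion window still has room for it.
fn select_path(paths: &[PathState], packet_len: usize) -> Option<usize> {
    paths
        .iter()
        .enumerate()
        .filter(|(_, path)| path.cwnd_available >= packet_len)
        .min_by_key(|(_, path)| path.rtt)
        .map(|(index, _)| index)
}
```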

This is exactly what the [QUIC-MULTIPATH] IETF draft has been figuring
out: allow QUIC endpoints to use multiple paths at the same time. And
we totally want to use this in iroh. We can have a world where we
have several possible paths, select one as primary and the others as
backup paths, and seamlessly transition between them as your endpoint
moves through the network and paths appear and disappear [^irohmove].

There are **a lot** of details in QUIC-MULTIPATH on how to make it
work. And adding this functionality to Quinn has been a major
undertaking. But the branch is becoming functional at last.

[^mpcongestion]: But hey! Some of these paths share at least the
    first and last hop. So they are not independent! Indeed, they
    are not. Congestion control is still a research area, especially
    for multiple paths with shared bottlenecks. Though you should
    note that this already happens a lot on the internet: your laptop
    or phone probably has many TCP and/or QUIC connections to several
    servers right now. And these definitely share hops. Yet the
    congestion controllers do somehow figure out how to make this
    work, at least to some degree.

[^irohmove]: Wait, doesn't iroh already say it can do this? Indeed,
    indeed. Though if you've tried this you'd have noticed your
    application did experience some hiccups for a few seconds as iroh
    was figuring out where traffic needs to go. In theory we can do
    better with multipath, though it'll take some tweaking and
    tuning.

[QUIC-MULTIPATH]: https://datatracker.ietf.org/doc/draft-ietf-quic-multipath/

# Multipath Holepunching

If you've paid attention you'll have noticed that so far this still
doesn't solve some of our issues: the holepunching datagrams still
live outside of the QUIC stack. This means we send them at whatever
time, not paying attention to the congestion controller. That's fine
under light load, but under heavy load it often results in lost
packets. That in turn leads to having to retry sending those. But
preferably without accidentally DOSing an innocent UDP socket just
quietly hanging out on the internet, using an IP address that you
thought might belong to the remote endpoint.

So the next step we would like to take with the iroh multipath
project is to move the holepunching logic itself into QUIC. We're
also not the first to consider this: Marten Seemann and Christian
Huitema have been thinking about this as well and wrote down [some
thoughts in a blog post]. More importantly they started the
[QUIC-NAT-TRAVERSAL] draft, which conceptually does a simple thing:
move the holepunching packets *into* QUIC packets.

[some thoughts in a blog post]: https://seemann.io/posts/2024-10-26---p2p-quic/
[QUIC-NAT-TRAVERSAL]: https://datatracker.ietf.org/doc/draft-seemann-quic-nat-traversal/

While QUIC-NAT-TRAVERSAL is highly experimental and, as of the time
of writing, we don't expect to follow it exactly, this approach does
have a number of benefits:

- The QUIC packets are already encrypted, so we no longer need to
  manage our own encryption layer separately.

- QUIC already has very advanced packet acknowledgement and loss
  recovery mechanisms, including the congestion control mechanisms.
  Essentially QUIC is a reliable transport, which this gets to
  benefit from.

- QUIC already has robust protection against sending too much data to
  unsuspecting hosts on the internet.

- In combination with QUIC-MULTIPATH we get a very robust and flexible
  system.

Another consideration is that QUIC is already extensible. Notice that
both QUIC-MULTIPATH and QUIC-NAT-TRAVERSAL are negotiated at
connection setup. This is a robust mechanism that allows us to be
confident that in the future we'll be able to improve on these
mechanisms.
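
The negotiation itself is ordinary transport parameter handling: an
extension only takes effect if both peers advertised it during the
handshake. A conceptual sketch, with a placeholder parameter ID
rather than the drafts' actual codepoints:

```rust
/// Placeholder transport parameter ID; the real codepoints are
/// assigned by the QUIC-MULTIPATH and QUIC-NAT-TRAVERSAL drafts.
const TP_HYPOTHETICAL_EXTENSION: u64 = 0xdead_beef;

/// An extension is in effect only if both we and the peer sent its
/// transport parameter during the handshake; otherwise both sides
/// silently fall back to plain QUIC.
fn extension_enabled(our_params: &[u64], peer_params: &[u64]) -> bool {
    our_params.contains(&TP_HYPOTHETICAL_EXTENSION)
        && peer_params.contains(&TP_HYPOTHETICAL_EXTENSION)
}
```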


# Work In Progress

This would all change the iroh wire-protocol. That is part of the
reason we want this done before our 1.0 release: once we release 1.0
we promise to keep our wire-protocol the same. Right now we're hard
at work building all the pieces needed for this. And sometime
soon-ish they will start landing in the 0.9x releases.

We aim for iroh to become even more reliable for folks who push the
limits, thanks to moving all the holepunching logic right into the
QUIC stack.