Title: Storage migration

- [Thought experiments on an alternative design](#thought-experiments-on-an-alternative-design)
- [Design](#design)
- [SMAPIv1 migration](#smapiv1-migration)
  - [Preparation](#preparation)
  - [Establishing mirror](#establishing-mirror)
  - [Mirror](#mirror)
  - [Snapshot](#snapshot)
  - [Copy and compose](#copy-and-compose)
  - [Finish](#finish)
- [SMAPIv3 migration](#smapiv3-migration)
  - [Preparation](#preparation-1)
  - [Establishing mirror](#establishing-mirror-1)
  - [Limitations](#limitations)
  - [Finish](#finish-1)
- [Error Handling](#error-handling)
  - [Preparation (SMAPIv1 and SMAPIv3)](#preparation-smapiv1-and-smapiv3)
  - [Snapshot and mirror failure (SMAPIv1)](#snapshot-and-mirror-failure-smapiv1)
it will be handled just as before.

## SMAPIv1 migration

This section is about migration from SMAPIv1 SRs to either SMAPIv1 or SMAPIv3 SRs.
Since the migration is driven by the source host, it is usually the source host that
determines most of the logic during a storage migration.

First we take a look at an overview diagram of what happens during SMAPIv1 SXM:
the diagram is labelled with S1, S2 ..., which indicate the different stages of
the migration. We will talk about each stage in more detail below.


### Preparation

Before we can start the migration, a number of preparations are needed to set up
the mirror. For SMAPIv1 this involves:

1. Create a new VDI (called the leaf) that will be used as the receiving VDI for all the new writes
2. Create a dummy snapshot of the VDI above, to make sure it is a differencing disk that can be composed later on
3. Create a VDI (called the parent) that will be used to receive the existing content of the disk (the snapshot)

Note that the leaf VDI needs to be attached and activated on the destination host
(to a non-existing `mirror_vm`), since it will later accept writes to mirror what
is written on the source host.

The parent VDI may be created in two different ways: 1. if there is a "similar VDI",
clone it on the destination host and use it as the parent VDI; 2. if there is no
such VDI, create a new blank VDI. Similarity here is defined by the distance
between VDIs in the VHD tree, which exploits the internal representation of the
storage layer, hence we will not go into too much detail about it here.

Once these preparations are done, a `mirror_receive_result` data structure is
passed back to the source host, containing all the necessary information about
these new VDIs, etc.
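
To make this more concrete, below is a rough sketch of the destination-side
preparation. The helper and field names here are illustrative only; the real
logic lives in `Storage_smapiv1_migrate.receive_start`, and the real result
type carries more information:

```ocaml
(* Illustrative sketch of the destination-side preparation; the callbacks and
   record fields are assumptions, not the actual xapi API. *)
type mirror_receive_result = {
  mirror_leaf : string;    (* leaf VDI that accepts the mirrored writes *)
  dummy_snapshot : string; (* dummy snapshot making the leaf a differencing disk *)
  copy_target : string;    (* parent VDI that receives the existing content *)
}

let receive_start ~create_vdi ~snapshot_vdi ~clone_vdi ~attach_and_activate
    ~similar =
  (* 1. leaf VDI: receives all new writes forwarded by the mirror *)
  let leaf = create_vdi () in
  (* 2. dummy snapshot: turns the leaf into a differencing disk so that it
     can be composed with the copied content later *)
  let dummy = snapshot_vdi leaf in
  (* 3. parent VDI: clone a "similar" VDI if one exists, else start blank *)
  let parent = match similar with
    | Some s -> clone_vdi s
    | None -> create_vdi ()
  in
  (* the leaf must be attached and activated (to the non-existing mirror_vm)
     before it can accept mirrored writes *)
  attach_and_activate leaf;
  { mirror_leaf = leaf; dummy_snapshot = dummy; copy_target = parent }
```
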
### Establishing mirror
At a high level, mirror establishment for SMAPIv1 works as follows:

1. Take a snapshot of a VDI that is attached to VM1. This gives us an immutable
copy of the current state of the VDI, with all the data up until the point we took
the snapshot. This is illustrated in the diagram as a VDI and its snapshot connecting
to a shared parent, which stores the shared content for the snapshot and the writable
VDI from which we took the snapshot (snapshot)
client VDI will also be written to the mirrored VDI on the remote host (mirror)
4. Compose the mirror and the snapshot to form a single VDI
5. Destroy the snapshot on the local host (cleanup)
#### Mirror

The mirroring process for SMAPIv1 is rather unconventional, so it is worth
documenting how it works. Instead of a conventional client-server architecture,
where the source client connects to the destination server directly through the
NBD protocol in tapdisk, the connection is established in xapi and then passed
on to tapdisk. It was done in this rather unusual way mainly due to authentication
issues. Because it is xapi that creates the connection, tapdisk does not need to
be concerned with authenticating it, thus simplifying the storage component. This
is reasonable, as the storage component should focus on handling storage requests
rather than worrying about network security.

The diagram below illustrates this process. First, xapi on the source host
initiates an https request to the remote xapi. This request contains the necessary
information about the VDI to be mirrored, the SR that contains it, etc. This
information is then passed on to the https handler on the destination host (called
`nbd_handler`), which processes it. Now the unusual step is that both the source
and the destination xapi will pass this connection on to tapdisk, by sending the
fd representing the socket connection to the tapdisk process. On the source this
is the nbd client process of tapdisk, and on the destination it is the nbd server
process of tapdisk. After this step, we can consider a client-server connection
to be established between the two tapdisks, as if the tapdisk on the source host
had made a request to the tapdisk on the destination host and initiated the
connection. On the diagram, this is indicated by the dashed lines between the
tapdisk processes. Logically, we can view this as xapi creating the connection
and then passing it down into tapdisk.
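
The essence of this trick is sending an already-open socket fd over a unix
domain socket. A minimal sketch of the idea is below; the `send_fd` helper and
the control socket path are assumptions (xapi uses an SCM_RIGHTS-style
fd-passing helper), not the actual interface:

```ocaml
(* Minimal sketch, not the actual xapi code: hand an established, already
   authenticated connection over to tapdisk. [send_fd] stands in for an
   SCM_RIGHTS-based helper such as the one in the fd-send-recv library;
   the control socket path is hypothetical. *)
let pass_connection_to_tapdisk ~send_fd ~connection_fd =
  let control_path = "/var/run/tapdisk-control" (* hypothetical *) in
  let sock = Unix.socket Unix.PF_UNIX Unix.SOCK_STREAM 0 in
  Unix.connect sock (Unix.ADDR_UNIX control_path);
  (* tapdisk receives the fd and from then on speaks NBD over it directly,
     without ever having to authenticate the peer itself *)
  send_fd sock connection_fd;
  Unix.close sock
```
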

#### Snapshot

The next step is to create a snapshot of the VDI. This is easily done as a
`VDI.snapshot` operation. If the VDI was in VHD format, then internally this
creates two children: one for the snapshot, which contains only the metadata
information and therefore tends to be small, and the other for the writable VDI,
where all the new writes will go. The shared base copy contains the shared blocks.


#### Copy and compose

Once the snapshot is created, we can copy it from the source to the destination.
This step is done by `sparse_dd` using the nbd protocol, and it is also the step
that takes the most time to complete.

`sparse_dd` is a process forked by xapi that copies the disk blocks. `sparse_dd`
supports a number of protocols, including nbd. In this case, `sparse_dd` will
initiate an https put request to the destination host, with a url of the form
`<address>/services/SM/nbdproxy/<sr>/<vdi>`. This https request gets handled by
the https handler on the destination host B, which spawns a handler thread. This
handler finds the "generic" nbd server[^2] of either tapdisk or qemu-dp, depending
on the destination SR type, and then starts proxying data between the https
connection socket and the socket connected to the nbd server.

[^2]: The server is "generic" because it does not accept fd passing; I call the
ones that do "special" nbd servers/fd receivers.



Once copying is done, the snapshot and the mirrored VDI can be composed into a
single VDI.
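
Conceptually, this is a single compose call. A hedged sketch, where the
`compose` callback stands in for the SMAPI `VDI.compose` operation:

```ocaml
(* Sketch: once sparse_dd has streamed the snapshot's blocks into the parent
   VDI, a single compose layers the mirrored writes (in the leaf) on top of
   it. [compose] stands in for SMAPI's VDI.compose. *)
let finish_copy ~compose ~parent ~leaf =
  (* after this, the leaf is backed by the fully populated parent, so the
     destination holds one complete, self-contained VDI *)
  compose ~vdi1:parent ~vdi2:leaf
```
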
#### Finish

At this point the VDI is synchronised to the new host! The mirror keeps working
at this point though, because it will not be destroyed until the VM itself has
been migrated as well. Some cleanups are done at this point, such as deleting the
snapshot taken on the source, destroying the mirror datapath, etc.

The end result looks like the following. Note that VM2 is shown with a dashed
line as it has not been created yet. The next step would be to migrate VM1 itself
to the destination as well, but this is part of the VM migration process and will
not be covered here.


## SMAPIv3 migration

This section covers the mechanism of migrations *from* SRs using SMAPIv3 (to
either SMAPIv1 or SMAPIv3). Although the core ideas are the same, SMAPIv3 has a
rather different mechanism for mirroring: 1. it no longer requires xapi to take
a snapshot of the VDI, since the mirror itself takes care of replicating the
existing data to the destination; 2. there is no fd passing for connection
establishment anymore; instead, proxies are used for connection setup.
### Preparation

The preparation work for SMAPIv3 is greatly simplified by the fact that the mirror
at the storage layer will copy the existing data in the VDI to the destination.
This means that a snapshot of the source VDI is no longer required. So we are left
with only one thing:

1. Create a VDI used for mirroring the data of the source VDI

For this reason, the implementation logic for SMAPIv3 preparation is also shorter,
as the complexity is now handled by the storage layer, which is where it is
supposed to be handled.
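
A sketch of how small this becomes (the `create_vdi` callback stands in for the
storage client's VDI creation call; it is an assumption, not the real wrapper):

```ocaml
(* Sketch: SMAPIv3 receive preparation is essentially a single VDI creation;
   no snapshot, no dummy, since the storage-layer mirror replicates the
   existing data by itself. *)
let receive_start_v3 ~create_vdi ~source_vdi_info =
  create_vdi ~vdi_info:source_vdi_info
```
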
### Establishing mirror

The other significant difference is that the storage backend for SMAPIv3
(`qemu-dp`) no longer accepts fds, so xapi needs to proxy the data between the
nbd client and the nbd server.

SMAPIv3 provides `Data.mirror uri domain remote`, which needs three parameters:
`uri` for accessing the local disk, `domain` for the domain slice on which
mirroring should happen, and most importantly for this design, a `remote` url
which represents the remote nbd server to which the blocks of data can be sent.

This function, when called by xapi and forwarded to the storage layer's qemu-dp
nbd client, will initiate an nbd connection to the nbd server pointed to by
`remote`. This works fine when the storage migration happens entirely within a
single host, where qemu-dp's nbd client and nbd server can communicate over unix
domain sockets. However, it does not work for inter-host migrations, as qemu-dp's
nbd server is not exposed publicly over the network (just as tapdisk's is not).
Therefore a proxying service on the source host is needed to forward the nbd
connection from the source host to the destination host, and it is the
responsibility of xapi to manage this proxy service.

The following diagram illustrates the mirroring process of a single VDI:



The first step for xapi is to set up an nbd proxy thread that will be listening
on a local unix domain socket with path `/var/run/nbdproxy/export/<domain>`,
where domain is the `domain` parameter mentioned above in `Data.mirror`. The nbd
proxy thread will accept nbd connections (or rather any connections; it neither
speaks nor cares about the nbd protocol at all) and send an https put request to
the remote xapi. The proxy then forwards the data exactly as it is to the remote
side through the https connection.
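
A bare-bones sketch of such a forwarder is shown below. It is illustrative
only: the real proxy tunnels the bytes over https, which is elided here, and
all the names are made up:

```ocaml
(* Illustrative only: a bare-bones forwarder between a local unix domain
   socket and an already-established connection to the remote side. The real
   proxy in xapi tunnels over https; that handshake is elided here. *)
let proxy_bytes src dst =
  let buf = Bytes.create 65536 in
  let rec loop () =
    let n = Unix.read src buf 0 (Bytes.length buf) in
    if n > 0 then begin
      let rec write_all off remaining =
        if remaining > 0 then
          let w = Unix.write dst buf off remaining in
          write_all (off + w) (remaining - w)
      in
      write_all 0 n;
      loop ()
    end
  in
  loop ()

let run_proxy ~domain ~remote_fd =
  let path = Printf.sprintf "/var/run/nbdproxy/export/%s" domain in
  let listener = Unix.socket Unix.PF_UNIX Unix.SOCK_STREAM 0 in
  Unix.bind listener (Unix.ADDR_UNIX path);
  Unix.listen listener 1;
  (* qemu-dp's nbd client connects here as if this were the remote server *)
  let local_fd, _ = Unix.accept listener in
  (* shuttle bytes in both directions until either side closes *)
  let t1 = Thread.create (fun () -> proxy_bytes local_fd remote_fd) () in
  let t2 = Thread.create (fun () -> proxy_bytes remote_fd local_fd) () in
  Thread.join t1;
  Thread.join t2
```
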

Once the proxy is set up, xapi will call `Data.mirror`, which will be forwarded
to the xapi-storage-script and further forwarded to qemu-dp. This call contains,
among other parameters, the destination NBD server url (`remote`) to connect to.
In this case the destination nbd server is exactly the unix domain socket on
which the proxy thread is listening. Therefore the `remote` parameter will be of
the form `nbd+unix:///<export>?socket=<socket>`, where the export is provided by
the destination nbd server and represents the VDI prepared on the destination
host, and the socket is the path of the unix domain socket where the proxy
thread (which we just created) is listening.
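
Putting the pieces together, the call might look roughly like the following
sketch, where the `mirror` callback stands in for `Data.mirror` and `export` is
the export name provided by the destination:

```ocaml
(* Sketch of wiring the mirror call to the proxy socket; [mirror] stands in
   for Data.mirror as exposed through the storage client, and [export] is
   whatever the destination's nbd server named the prepared VDI. *)
let start_mirror ~mirror ~uri ~domain ~export =
  let socket = Printf.sprintf "/var/run/nbdproxy/export/%s" domain in
  (* qemu-dp's nbd client treats the proxy's unix socket as the remote server *)
  let remote = Printf.sprintf "nbd+unix:///%s?socket=%s" export socket in
  mirror ~uri ~domain ~remote
```
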

When this connection is set up, the proxy process will talk to the remote xapi
via https requests, and on the remote side, an https handler will proxy the
request to the appropriate nbd server of either tapdisk or qemu-dp, using exactly
the same [import proxy](#copy-and-compose) as mentioned before.

Note that this proxying service is tightly integrated with outbound SXM of
SMAPIv3 SRs. This is to make it simple to focus on the migration itself.

Although there is no need to explicitly copy the VDI anymore, we still need to
transfer the data and wait for the transfer to finish. For this we use the
`Data.stat` call provided by the storage backend to query the status of the
mirror, and wait for it to finish as needed.
#### Limitations

This way of establishing the connection simplifies the implementation of the
migration for SMAPIv3, but it also has limitations: one proxy is needed per live
VDI migration, which can potentially consume lots of resources in dom0. We should
measure the impact of this before switching to more resource-efficient
alternatives such as WireGuard, which allows establishing a single connection
between multiple hosts.
### Finish

As there is no need to copy a VDI, there is also no need to compose or delete a
snapshot. The cleanup procedure therefore just involves destroying the datapath
that was used for receiving writes for the mirrored VDI.

## Error Handling

helps separate the error handling logic into the `with` part of a `try with` block,
which is where it is supposed to be. Since we need to accommodate the existing
SMAPIv1 migration (which has more stages than SMAPIv3), the following stages are
introduced: preparation (v1, v3), snapshot (v1), mirror (v1, v3), copy (v1). Note
that each stage also roughly corresponds to a helper function that is called
within `Storage_migrate.start`, which is the wrapper function that initiates
storage migration. Each helper function also has error handling logic within
itself as needed (e.g. see `Storage_smapiv1_migrate.receive_start`) to deal with
exceptions that happen within it.
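
A sketch of this staged `try with` shape is below; the stage names follow the
list above, while everything else is illustrative rather than the actual xapi
code:

```ocaml
(* Illustrative shape of staged error handling, not the actual xapi code. *)
type stage = Preparation | Snapshot | Mirror | Copy

exception Stage_failed of stage * exn

let run_stage stage f = try f () with e -> raise (Stage_failed (stage, e))

let start ~prepare ~snapshot ~mirror ~copy ~cleanup_prepared ~cleanup_snapshot =
  try
    run_stage Preparation prepare;
    run_stage Snapshot snapshot;
    run_stage Mirror mirror;
    run_stage Copy copy
  with Stage_failed (stage, e) ->
    (* undo exactly what has been set up so far, then re-raise *)
    (match stage with
    | Preparation | Snapshot -> cleanup_prepared ()
    | Mirror | Copy -> cleanup_snapshot (); cleanup_prepared ());
    raise e
```
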
### Preparation (SMAPIv1 and SMAPIv3)
are migrating from.

### Mirror failure (SMAPIv3)

The `Data.stat` call in SMAPIv3 returns a data structure that includes the
current progress of the mirror job, whether it has completed syncing the existing
data, and whether the mirror has failed. Similar to how it is done in SMAPIv1,
once we issue the `Data.mirror` call we wait for the sync to complete by
repeatedly polling the status of the mirror using the `Data.stat` call. During
this process, the status of the mirror is also checked, and if a failure is
detected, a `Migration_mirror_failure` will be raised, which then gets handled
by the code in `storage_migrate.ml` by calling
`Storage_smapiv3_migrate.receive_cancel2`, which will clean up the mirror
datapath and destroy the mirror VDI, similar to what is done in SMAPIv1.
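
A rough sketch of this polling loop is shown below; the record fields and the
`stat`/`cancel` callbacks are assumptions standing in for `Data.stat` and
`Storage_smapiv3_migrate.receive_cancel2`:

```ocaml
(* Rough sketch of the polling loop; the record fields and the [stat]/[cancel]
   callbacks are assumptions standing in for Data.stat and
   Storage_smapiv3_migrate.receive_cancel2. *)
exception Migration_mirror_failure of string

type mirror_stat = { complete : bool; failed : bool }

let wait_for_mirror ~(stat : unit -> mirror_stat) ~(cancel : unit -> unit) =
  let rec poll () =
    let s = stat () in
    if s.failed then raise (Migration_mirror_failure "mirror job failed");
    if not s.complete then begin
      Thread.delay 1.0; (* back off between polls *)
      poll ()
    end
  in
  try poll ()
  with Migration_mirror_failure _ as e ->
    cancel (); (* clean up the mirror datapath and destroy the mirror VDI *)
    raise e
```
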
### Copy failure (SMAPIv1)
failure during copying.

## SMAPIv1 Migration implementation detail

{{% notice info %}}
The following doc refers to a [version](https://github.com/xapi-project/xen-api/blob/v24.37.0/ocaml/xapi/storage_migrate.ml)
of xapi before 24.37, after which point this code structure has undergone
many changes as part of adding support for SMAPIv3 SXM. Therefore the following
tutorial might be less relevant in terms of the implementation detail. Although