Add support for anyraid vdevs #17567
Conversation
vdev_t *vd = mg->mg_vd;
if (B_FALSE) {
	weight = 2 * weight - (msp->ms_id * weight) / vd->vdev_ms_count;
	weight = MIN(weight, METASLAB_MAX_WEIGHT);
I am suspicious about this math:
- Unlike the seemingly linear space-based weight, segment-based weight is exponential. So by doubling the weight of the first metaslabs, you are actually increasing it exponentially. I.e., if the first metaslab has only one free segment of only 1MB, and so a weight with INDEX=20 and COUNT=1, doubling that will give INDEX=40 and COUNT=2, which would make it absolutely unbeatable for most metaslabs, since beating it would require a free segment of up to 1TB.
- All free metaslabs are identical and selected based on their offset, which is OK if that is expected, but they also don't go through this path, so they have very little chance of ever being used until all the earlier metaslabs are filled to the brim.
Hm, yeah, doubling is probably overkill for this. But we do need something that will interpolate nicely later into the vdev. Perhaps what we do is something like: add `3 - ((ms_id * 4) / vdev_ms_count)` to the index and add `20 - (((ms_id * 20) / (vdev_ms_count / 5)) % 20)` to the count. So for the first quarter we would add 3 to the index, then 2, 1, and 0. And within each quarter, we would add 19 to the first metaslab, 18 to the second, etc. It's not ideal, since larger vdevs will have large plateaus where the modifications are the same, but that's probably alright. We just want to try to concentrate writes generally earlier; a little mixing in adjacent metaslabs on large pools is probably fine. And with the largest index difference being 3, the last metaslab only needs to have segments 8x as large as the first one's to compete.
For purposes of spinning-disk performance we don't really need a smooth curve. Having just several "speed zones" would be enough to get most of the performance. I propose leaving the first zone as it is; for the second zone, subtracting 1 from the index while doubling the count to keep it equivalent; for the third zone, subtracting another 1 from the index and again doubling the count; etc. This logic may not be great beyond 3-4 zones, but IMO that should be enough to get most of the speed, and I think it could actually be applied independently of anyraid to any rotating vdevs. We may not want to apply this to metaslabs with an index below some threshold, since sequential speed there should not be worth much, and we should focus instead on lower fragmentation.
For purposes of anyraid's tile allocation, I think you only need to prefer already-used metaslabs over empty ones to a certain degree. Once some metaslabs are used, I am not sure (yet?) why you would really prefer one used metaslab to another, since a single used block would be enough for the tile to stay allocated no matter what. So I think it may be enough to just account free metaslabs not as one free segment (an index of the full size and a count of 1, which makes them unbeatable now), but to split it a few times in the way I described above for HDDs, down to a level just below (or equal to) the last speed zone of the used metaslabs. Free metaslabs do not need zones, since the current code already sorts identical metaslabs by offset.
I agree we probably don't need gradations beyond the few top-level speed zones. I think the simplest algorithm that satisfies our goals is to just add `N - 1` to the index in the first zone, `N - 2` in the second, etc., for `N` zones. That will prefer earlier metaslabs, and since untouched metaslabs won't hit this code, they will naturally end up with nothing added to the index and end up sorting with everything in the final zone (and then earlier ones will sort first, as you said). I see the idea behind decreasing the index and doubling the count to match, but I don't think that's actually important; we don't multiply the segment size and count together to get the available space anywhere, so we don't need to preserve that relationship.
This prefers earlier metaslabs for the general rotational performance bonus, and prefers already-used metaslabs for anyraid. We can also disable this logic if the index is below a critical value (24?) so that if we're very fragmented, we abandon this and focus only on the actual free space.
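A minimal sketch of that zone-based adjustment (hypothetical helper and names, not the actual metaslab weighting code; only the boost calculation is shown):
```c
#include <stdint.h>

/*
 * Hypothetical sketch: compute how much to add to the segment-size
 * index of a loaded metaslab based on its "speed zone".  ms_id,
 * ms_count, and nzones stand in for the real metaslab fields; the
 * caller would skip the boost entirely when the index is already
 * below the fragmentation threshold discussed above.
 */
static uint64_t
zone_index_boost(uint64_t ms_id, uint64_t ms_count, uint64_t nzones)
{
	/* Zone 0 holds the earliest metaslabs, zone nzones - 1 the last. */
	uint64_t zone = (ms_id * nzones) / ms_count;

	/* First zone gets +(N - 1), second +(N - 2), ..., last +0. */
	return (nzones - 1 - zone);
}
```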
We can also disable this logic if the index is below a critical value (24?) so that if we're very fragmented, we abandon this and only focus on the actual free space.
Yea. In an unrelated context I was also recently thinking that we could do more when we reach some fragmentation or capacity threshold.
One annoying caveat about changing the weighting algorithm dynamically is that the assertions that the weight not decrease while a metaslab is unloaded mean we have to be a little careful to design/test it with those constraints in mind.
if (vd->vdev_parent->vdev_ops == &vdev_anyraid_ops) {
	vdev_anyraid_write_map_sync(vd, zio, txg, good_writes, flags,
	    status);
I suppose this will write all the maps every time? In addition to the 2 one-sector uberblock writes per leaf vdev, we now write up to 64MB per one?
Yes, we now write out the whole map every TXG. In theory that could be 64MiB, but in practice it's usually on the order of kilobytes; an anyraid vdev with 32 disks, an average of 256 tiles each, and 80% of them mapped would use about 26KB. We have 64MiB here because that's the maximum the mapping could possibly reach, not because we expect it to be anywhere close to that in practice.
We could add an optimization that doesn't update the map if nothing changed in a given txg; we'd still want to update the header, but the mapping itself could be left unmodified.
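A quick sanity check on that estimate, assuming each mapped tile costs roughly 4 bytes in the map (which is what 64 MiB divided by the 2^24 maximum tiles works out to; the real on-disk entry format may differ):
```c
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/* Assumed entry size: 64 MiB map / 2^24 maximum tiles = 4 bytes. */
	uint64_t entry_bytes = (64ULL << 20) >> 24;
	uint64_t disks = 32, tiles_per_disk = 256;
	uint64_t mapped = disks * tiles_per_disk * 8 / 10;	/* ~80% mapped */

	printf("%llu entries x %llu B = ~%llu KB per map write\n",
	    (unsigned long long)mapped, (unsigned long long)entry_bytes,
	    (unsigned long long)(mapped * entry_bytes / 1024));
	return (0);
}
```
That lands in the mid-20s of kilobytes, consistent with the 26KB figure above.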
void *buf = abd_borrow_buf(map_abd, SPA_MAXBLOCKSIZE);

rw_enter(&var->vd_lock, RW_READER);
anyraid_tile_t *cur = avl_first(&var->vd_tile_map);
As I understand it, you have only one copy of the tile map, which covers all TXGs. That may be OK as long as you don't need precise accounting and are never going to free tiles. But what is not OK, I suppose, is that for writes done in open context (such as Direct I/O and ZIL), the maps will not be written until the end of the next committed TXG. Direct I/O might not care, but it may be impossible to replay the ZIL after a crash if its blocks, or blocks they reference, were written to new tiles that are not yet synced.
That's not correct; there are 4 copies per vdev, which rotate per TXG. So there are plenty of copies of the map for each TXG, and there are 4 TXGs' worth of maps.
As for the ZIL issue, that is a good point. We need to prevent ZIL blocks from ending up in unmapped tiles. While this probably wouldn't actually cause a problem in practice (when you import the pool again the tile would get mapped in the same way as before, since the vdev geometry hasn't changed), it definitely could in theory (if you had multiple new tiles in the same TXG, and they got remapped in a different order than they were originally).
The best fix I came up with is to prevent ZIL writes from being allocated to unmapped tiles in the first place. I also considered trying to stall those writes until the TXG synced, but that's slow and also technically annoying. I also considered having a journal of tile mappings that we would update immediately, but that adds a lot of complexity. Preventing the allocation nicely solves the problem, and if they can't find a place to allocate in all the mapped tiles, we already have logic to force the ZIL to fall back to txg_wait_synced.
Overall this is a really nice feature! I haven't looked at the code yet, but did kick the tires a little and have some comments/questions.
How did you arrive at the 16GiB min tile size? (Forgive me if this is mentioned in the code comments.) I ask since it would be nice to have a smaller tile size to accommodate smaller vdevs (and give more free space, since it's rounded to tile-sized boundaries).
That way there's no ambiguity if the anyraid TLD is a mirror or raidz flavor. It also opens the path to anyraidz1, which was mentioned in the Anyraid announcement: "With ZFS AnyRaid, we will see at least two new layouts added: AnyRaid-Mirror and AnyRaid-Z1. The AnyRaid-Mirror feature will come first, and will allow users to have a pool of more than two disks of varying sizes while ensuring all data is written to two different disks. The AnyRaid-Z1 feature will apply the same concepts of ZFS RAID-Z1, but while supporting mixed size disks." https://hexos.com/blog/introducing-zfs-anyraid-sponsored-by-eshtek
I also noticed that the anyraid TLD names don't include the parity level. They all just say "anyraid":
We should have it match the raidz TLD convention where the parity level is included:
16GiB was selected mostly because that would make the minimum line up with the standard fraction (1/64th) at a 1TiB disk. That's a nice round number, and a pretty reasonable size for "a normal size disk" these days; anything less than 1TiB is definitely on the smaller side. The other effect of this value is that with this tile size, you can have any disk up to 1PiB in size and still be able to use all the space; any disk that's more than 2^16 tiles can't all be used. It is possible to have smaller tile sizes; we do it in the test suite a bunch. There is a tunable,
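Just to spell out the arithmetic behind that (only the numbers from this comment and the PR description, shown as a quick check):
```c
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint64_t tib = 1ULL << 40;
	uint64_t tile = tib / 64;			/* 1 TiB / 64 = 16 GiB */
	uint64_t max_disk = tile * (1ULL << 16);	/* 2^16 tiles per disk */

	printf("tile = %llu GiB, fully usable disk limit = %llu TiB\n",
	    (unsigned long long)(tile >> 30),
	    (unsigned long long)(max_disk >> 40));
	return (0);
}
```
This prints a 16 GiB tile and a 1024 TiB (1 PiB) per-disk limit.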
That's fair, we could have a better error message for this case. I can work on that.
I think autoexpand works like normal? It doesn't affect the tile size or anything, because the tile size is locked in immediately when the vdev is created, but it should affect the disk sizes like normal. Maybe the tile capacity doesn't change automatically? But that's probably a bug, if so. Did you run into this in your testing?
Interesting, I will investigate why that happened. Those should be able to mix for sure.
I'm open to new naming options. My vague plan was to use
Good point, I will fix that too.
This is a nit, but in the description I believe there is a typo: "Anyraid works by diving each of the disks that makes up the vdev..." I believe the intent was for dividing.
I prefer the
I'm guessing the first two vdevs report a
static int
log_10(uint64_t v) {
	char buf[32];
	snprintf(buf, sizeof (buf), "%llu", (u_longlong_t)v);
	return (strlen(buf));
}
Could you use `log10()` from math.h?
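For reference, a digit-count helper along those lines might look like this (a sketch only; it needs -lm, and double precision near exact powers of ten for very large 64-bit values would be worth double-checking, which may be a reason to keep the snprintf version):
```c
#include <math.h>
#include <stdint.h>

/* Sketch: count decimal digits of v using log10() instead of snprintf(). */
static int
log_10(uint64_t v)
{
	if (v == 0)
		return (1);
	return ((int)floor(log10((double)v)) + 1);
}
```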
#define ZPOOL_CONFIG_DRAID_NGROUPS "draid_ngroups"

/* ANYRAID configuration */
#define ZPOOL_CONFIG_ANYRAID_PARITY_TYPE "parity_type"
nit: should the string be "anyraid_parity_type", just so it's clear if someone is printing the config?
}

zfeature_register(SPA_FEATURE_ANYRAID,
    "com.klarasystems:anyraid", "anyraid", "Support for anyraid VDEV",
I like that the overarching name for anymirror+anyraid is called "anyraid". For this feature flag though, do we need to be more specific and call it "anymirror", and then add another feature flag for the actual raid part follow-on ("anyraid")? Or can we use the same feature flag for both anymirror and the future anyraid?
 * Initialize private VDEV specific fields from the nvlist.
 */
static int
vdev_anyraid_init(spa_t *spa, nvlist_t *nv, void **tsd)
Should this be?:
- vdev_anyraid_init(spa_t *spa, nvlist_t *nv, void **tsd)
+ vdev_anyraid_init(spa_t *spa, nvlist_t *nv, vdev_anyraid_t **tsd)
{
	echo $(zpool iostat -v $1 | awk '(NR > 4) {print $1}' | \
	grep -vEe '^-----' -e "^(mirror|raidz[1-3]|draid[1-3]|spare|log|cache|special|dedup)|\-[0-9]$")
	grep -vEe '^-----' -e "^(mirror|raidz[1-3]|anyraid|draid[1-3]|spare|log|cache|special|dedup)|\-[0-9]$")
Should be anymirror[0-3]?
log_must zpool create -f $TESTPOOL anymirror$parity $disks special mirror $sdisks
log_must poolexists $TESTPOOL

log_must dd if=/dev/urandom of=/$TESTPOOL/file.bin bs=1M count=128
You'll want to write two files, with one file's writes being small enough to land on special, and the other's large enough to land on anymirror. You can set `special_small_blocks` to threshold the size of writes for the special device. You can also write much less than 128MB here to make the test run faster.
# Verify a variety of AnyRAID pools with a special VDEV AnyRAID.
#
# STRATEGY:
# 1. Create an AnyRAID pool with a special VDEV AnyRAID.
This is very similar to anyraid_special_vdev_001_pos.ksh; you could probably combine the two tests. Also, you could try with a dedup device as well.
if [[ $vdev != "" && \
	$vdev != "mirror" && \
	$vdev != "raidz" && \
	$vdev != "anyraid" && \
"anymirror" ?
log_mustnot zpool add -f $TESTPOOL $disk0

for type in "" "mirror" "raidz" "draid" "spare" "log" "dedup" "special" "cache"
for type in "" "mirror" "raidz" "anyraid" "draid" "spare" "log" "dedup" "special" "cache"
"anymirror"?
"feature@redaction_list_spill" | ||
"feature@dynamic_gang_header" | ||
"feature@physical_rewrite" | ||
"feature@anyraid" |
"feature@anymirror"?
for type in "mirror" "anymirror1"; do
	log_must zpool create -f $TESTPOOL $type $DISK1 $DISK2
	if [[ "$type" == "anymirror1" ]]; then
		log_must dd if=/dev/urandom of=/$TESTPOOL/f1 bs=1M count=2k
Is this `dd` necessary? 1:52 of the total 1:55 test time is the `dd`:
19:33:02.54 SUCCESS: zpool create -f testpool anymirror1 loop0 loop1
19:34:54.98 SUCCESS: dd if=/dev/urandom of=/testpool/f1 bs=1M count=2k
progress="$(initialize_progress $TESTPOOL $DISK1)" | ||
[[ -z "$progress" ]] && log_fail "Initializing did not start" | ||
log_mustnot eval "initialize_prog_line $TESTPOOL $DISK1 | grep suspended" | ||
if [[ "$type" =~ "anyraid" ]]; then |
if [[ "$type" =~ "anymirror" ]]; then
	log_must zpool create -f $TESTPOOL $type $DISK1 $DISK2
else
	log_must zpool create -f $TESTPOOL $type $DISK1 $DISK2 $DISK3
	log_must dd if=/dev/urandom of=/$TESTPOOL/f1 bs=1M count=400
/dev/urandom isn't that fast:
19:35:40.29 SUCCESS: zpool create -f testpool anymirror1 loop0 loop1 loop2
19:35:51.26 SUCCESS: dd if=/dev/urandom of=/testpool/f1 bs=1M count=400
Consider using `file_write -d R ...` instead.
log_must zpool list -v
log_must zpool create -f $TESTPOOL $type $DISK1 $DISK2 $DISK3
if [[ "$type" == "anymirror2" ]]; then
	log_must dd if=/dev/urandom of=/$TESTPOOL/f1 bs=1M count=2k
`dd` here is slow, consider alternatives.
19:36:10.46 SUCCESS: zpool create -f testpool anymirror2 loop0 loop1 loop2
19:38:23.81 SUCCESS: dd if=/dev/urandom of=/testpool/f1 bs=1M count=2k
log_must zpool create -f $TESTPOOL $type $DISK1 $DISK2 $DISK3
status_check_all $TESTPOOL "uninitialized"
if [[ "$type" == "anymirror1" ]]; then
	log_must dd if=/dev/urandom of=/$TESTPOOL/f1 bs=1M count=2k
Slow `dd`, consider alternatives.
log_must zfs set primarycache=none $TESTPOOL

# Write initial data
log_must dd if=/dev/urandom of=/$TESTPOOL/file1.bin bs=1M count=$(( DEVSIZE / 2 / 1024 / 1024 ))
slow dd
18:45:08.30 SUCCESS: zfs set primarycache=none testpool
18:45:33.54 SUCCESS: dd if=/dev/urandom of=/testpool/file1.bin bs=1M count=2048

# Read initial data, write new data
dd if=/$TESTPOOL/file1.bin of=/dev/null bs=1M count=$(( DEVSIZE / 2 / 1024 / 1024 ))
log_must dd if=/dev/urandom of=/$TESTPOOL/file1.bin bs=1M count=$(( DEVSIZE / 2 / 1024 / 1024 ))
slow dd
Sponsored by: Eshtek, creators of HexOS; Klara, Inc.
Motivation and Context
For industry/commercial use cases, the existing redundancy solutions in ZFS (mirrors and RAIDZ) work great. They provide high performance, reliable, efficient storage options. For enthusiast users, however, they have a drawback. RAIDZ and mirrors will use the size of the smallest drive that is part of the vdev as the size of every drive, so they can provide their reliability guarantees. If you can afford to buy a new box of drives for your pool, like large-scale enterprise users, that's fine. But if you already have a mix of hard drives, of various sizes, and you want to use all of the space they have available while still benefiting from ZFS's reliability and featureset, there isn't currently a great solution for that problem.
Description
The goal of Anyraid is to fill that niche. Anyraid allows devices of mismatched sizes to be combined together into a single top-level vdev. In the current version, Anyraid only supports mirror-type parity, but raidz-type parity is planned for the near future.
Anyraid works by dividing each of the disks that make up the vdev into tiles. These tiles are the same size across all disks within a given anyraid vdev. The size of a tile is 1/64th of the size of the smallest disk present at creation time, or 16GiB, whichever is larger. These tiles are then combined to form the logical vdev that anyraid presents, with sets of tiles from different disks acting as mini-mirrors, allowing the reliability guarantees to be preserved. Tiles are allocated on demand; when a write comes into a part of the logical vdev that doesn't have backing tiles yet, the Anyraid logic picks the `nparity + 1` disks with the most unallocated tiles, and allocates one tile from each of them. These physical tiles are combined into one logical tile, which is used to store data for that section of the logical vdev.
One important note with this design is that we need to understand this mapping from logical offset to tiles (and therefore to actual physical disk locations) in order to read anything from the pool. As a result, we cannot store the mapping in the MOS, since that would result in a bootstrap problem. To solve this issue, we allocate a region at the start of each disk where we store the Anyraid tile map. The tile map is made up of 4 copies of all the data necessary to reconstruct it. These copies are updated in rotating order, like uberblocks. In addition, each disk has a full copy of all 4 maps, ensuring that as long as any drive's copy survives, the tile map for a given TXG can be read successfully. The size of one copy of the tile map is 64MiB; that size determines the maximum number of tiles an anyraid vdev can have, which is 2^24. This is made up of up to 2^8 disks and up to 2^16 tiles per disk. This does mean that the largest device that can be fully used by an anyraid vdev is 1024 times the size of the smallest disk that was present at vdev creation time. This was considered an acceptable tradeoff, though it is a limit that could be alleviated in the future if needed; the primary difficulty is that either the tile map needs to grow substantially, or logic needs to be added to handle/prevent the tile map filling up.
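As a rough illustration of the on-demand allocation rule described above (a self-contained sketch, not the actual vdev_anyraid code; the array-based bookkeeping and function name are hypothetical):
```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch of the rule described above: pick the nparity + 1
 * disks with the most unallocated tiles and take one tile from each.
 * free_tiles[] stands in for the real per-child accounting; assumes
 * ndisks > nparity.
 */
static void
allocate_logical_tile(uint64_t *free_tiles, int ndisks, int nparity,
    int *chosen)
{
	for (int copy = 0; copy < nparity + 1; copy++) {
		int best = -1;
		for (int d = 0; d < ndisks; d++) {
			bool already_chosen = false;
			for (int c = 0; c < copy; c++) {
				if (chosen[c] == d)
					already_chosen = true;
			}
			if (already_chosen)
				continue;
			if (best == -1 || free_tiles[d] > free_tiles[best])
				best = d;
		}
		/* One physical tile comes from each selected disk. */
		chosen[copy] = best;
		free_tiles[best]--;
	}
}
```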
Anyraid vdevs support all the operations that normal vdevs do. They can be resilvered, removed, and scrubbed. They also support expansion; new drives can be attached to the anyraid vdev, and their tiles will be used in future allocations. There is currently no support for rebalancing tiles onto new devices, although that is also planned. VDEV Contraction is also planned for the future.
New ZDB functionality was added to print out information about the anyraid mapping, to aid in debugging and understanding. A number of tests were also added, and ztest support for the new type of vdev was implemented.
How Has This Been Tested?
In addition to the tests added to the test suite and zloop runs, I also ran many manual tests of unusual configurations to verify that the tile layout behaves correctly. There was also some basic performance testing to verify that nothing was obviously wrong. Performance is not the primary design goal of anyraid, however, so in-depth analysis was not performed.