[Xet] Basic shard creation #1633

Open

coyotte508 wants to merge 44 commits into main from shard-creation

Commits (44)
3b5e2b9
createXorbs also outputs file hash, sha256 and representation
coyotte508 Jul 16, 2025
3127a01
createXorbs handles a stream of blobs
coyotte508 Jul 16, 2025
099cc40
basic shard creation
coyotte508 Jul 16, 2025
8b35869
shard magic tag
coyotte508 Jul 16, 2025
6676431
remove shard key expiry
coyotte508 Jul 16, 2025
3fc715b
actually make API calls to xet backend to upload shards/xorbs
coyotte508 Jul 18, 2025
7a38fc7
fix prefix for shard upload
coyotte508 Jul 18, 2025
947a926
fix verification data
coyotte508 Jul 21, 2025
9cd2e66
update wasm bindings
coyotte508 Jul 21, 2025
bf8ae1c
fixup! update wasm bindings
coyotte508 Jul 21, 2025
70f0a0e
no need to compute shard hash client-side
coyotte508 Jul 21, 2025
431949b
no need for shard hash
coyotte508 Jul 23, 2025
515cae6
Merge remote-tracking branch 'origin/main' into shard-creation
coyotte508 Jul 24, 2025
adbe363
add va prefix to xorb too
coyotte508 Jul 24, 2025
2f9d4a0
progress events for uploading xorbs
coyotte508 Jul 24, 2025
a13da79
commit leftover xorb after all chunks have been processed
coyotte508 Jul 24, 2025
3cbde5e
integrate xet upload in commit function
coyotte508 Jul 24, 2025
651f6c0
add local dedup for xet uploads
coyotte508 Jul 24, 2025
f3e190f
move chunk caching to its own class/file
coyotte508 Jul 24, 2025
06dcd6d
Make sure to not OOB when writing shards
coyotte508 Jul 25, 2025
c093c64
dedup boolean when loading chunks from wasm
coyotte508 Jul 29, 2025
2e27699
global dedup! (just need hmac algorithm)
coyotte508 Jul 29, 2025
7c397dd
delay file events until matching xorb is emitted
coyotte508 Jul 29, 2025
5f3a61b
add dedup ratio to information
coyotte508 Jul 29, 2025
66299a3
fixup! add dedup ratio to information
coyotte508 Jul 29, 2025
988d85b
use hmac function from wasm
coyotte508 Jul 29, 2025
433284e
add bench script
coyotte508 Jul 31, 2025
74dfb71
fix wasm instantiation
coyotte508 Jul 31, 2025
b19209f
fix api calls
coyotte508 Jul 31, 2025
6d03446
use custom fetch for chunk call
coyotte508 Aug 1, 2025
627f024
correct global dedup call
coyotte508 Aug 1, 2025
e2fd1c1
fixes and add a shard file for tests
coyotte508 Aug 1, 2025
29eaef7
more recent target for tsup
coyotte508 Aug 1, 2025
c6f922f
fixes with xet protocol
coyotte508 Aug 1, 2025
9dbd3af
endianness matters when writing/reading hashes
coyotte508 Aug 1, 2025
072d889
shard parser works
coyotte508 Aug 1, 2025
82dffa6
fix OOBs
coyotte508 Aug 1, 2025
63d2e07
top-level comment
coyotte508 Aug 1, 2025
ad697ae
update wasm
coyotte508 Aug 1, 2025
c85003f
improve stats in bench script
coyotte508 Aug 1, 2025
846de2b
fix data intake for createXorb
coyotte508 Aug 1, 2025
23afed8
add commit option to bench script
coyotte508 Aug 1, 2025
235ce44
fix PUT => POST calls
coyotte508 Aug 1, 2025
ab9ced2
error in dedup
coyotte508 Aug 1, 2025
3 changes: 2 additions & 1 deletion packages/hub/package.json
@@ -42,7 +42,8 @@
"test": "vitest run",
"test:browser": "vitest run --browser.name=chrome --browser.headless --config vitest-browser.config.mts",
"check": "tsc",
"build:xet-wasm": "./scripts/build-xet-wasm.sh -t bundler -c -b hoytak/250714-eliminate-mdb-v1"
"build:xet-wasm": "./scripts/build-xet-wasm.sh -t bundler --clean",
"bench": "tsx scripts/bench.ts"
},
"files": [
"src",
299 changes: 299 additions & 0 deletions packages/hub/scripts/bench.ts
@@ -0,0 +1,299 @@
import { uploadShards } from "../src/utils/uploadShards.js";
import { sha256 } from "../src/utils/sha256.js";
import { parseArgs } from "node:util";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { writeFile, readFile, stat, mkdir } from "node:fs/promises";
import type { RepoId } from "../src/types/public.js";
import { toRepoId } from "../src/utils/toRepoId.js";
import { commitIter } from "../src/index.js";
import { pathToFileURL } from "node:url";

/**
* This script downloads the files from openai-community/gpt2 and simulates an upload to a xet repo.
* It prints the dedup % and the statistics
*
* Usage:
*
* pnpm --filter hub bench -t <write token> -r <xet repo>
* pnpm --filter hub bench -t <write token> -r <xet repo> --commit # Actually upload files
*/

const FILES_TO_DOWNLOAD = [
{
url: "https://huggingface.co/openai-community/gpt2/resolve/main/64-8bits.tflite?download=true",
filename: "64-8bits.tflite",
},
{
url: "https://huggingface.co/openai-community/gpt2/resolve/main/64-fp16.tflite?download=true",
filename: "64-fp16.tflite",
},
];

async function downloadFileIfNotExists(url: string, filepath: string): Promise<void> {
try {
await stat(filepath);
console.log(`File ${filepath} already exists, skipping download`);
return;
} catch {
// File doesn't exist, proceed with download
}

console.log(`Downloading ${url} to ${filepath}...`);
const response = await fetch(url);
if (!response.ok) {
throw new Error(`Failed to download ${url}: ${response.status} ${response.statusText}`);
}

const buffer = await response.arrayBuffer();
await writeFile(filepath, new Uint8Array(buffer));
console.log(`Downloaded ${filepath} (${buffer.byteLength} bytes)`);
}

async function* createFileSource(
files: Array<{ filepath: string; filename: string }>
): AsyncGenerator<{ content: Blob; path: string; sha256: string }> {
for (const file of files) {
console.log(`Processing ${file.filename}...`);
const buffer = await readFile(file.filepath);
const blob = new Blob([buffer]);

// Calculate sha256
console.log(`Calculating SHA256 for ${file.filename}...`);
const sha256Iterator = sha256(blob, { useWebWorker: false });
let res: IteratorResult<number, string>;
do {
res = await sha256Iterator.next();
} while (!res.done);
const sha256Hash = res.value;

console.log(`SHA256 for ${file.filename}: ${sha256Hash}`);
yield {
content: blob,
path: file.filename,
sha256: sha256Hash,
};
}
}

function getBodySize(body: RequestInit["body"]): string {
if (!body) {
return "no body";
}
if (body instanceof ArrayBuffer) {
return body.byteLength.toString();
}
if (body instanceof Blob) {
return "blob";
}
if (body instanceof Uint8Array) {
return body.byteLength.toString();
}
return "unknown size";
}

function createMockFetch(): {
fetch: typeof fetch;
getStats: () => { xorbCount: number; shardCount: number; xorbBytes: number; shardBytes: number };
} {
let xorbCount = 0;
let shardCount = 0;
let xorbBytes = 0;
let shardBytes = 0;

const mockFetch = async function (input: string | URL | Request, init?: RequestInit): Promise<Response> {
const url = typeof input === "string" ? input : input.toString();

// Mock successful responses for xorb and shard uploads
if (url.includes("/xorb/")) {
xorbCount++;
const bodySize = getBodySize(init?.body);
xorbBytes += parseInt(bodySize);
console.log(`[MOCK] Xorb upload ${xorbCount}: ${init?.method || "GET"} ${url} (${bodySize})`);

return new Response(null, {
status: 200,
statusText: "OK",
});
}

if (url.includes("/shard/")) {
shardCount++;
const bodySize = getBodySize(init?.body);
shardBytes += parseInt(bodySize);
console.log(`[MOCK] Shard upload ${shardCount}: ${init?.method || "GET"} ${url} (${bodySize})`);

return new Response(null, {
status: 200,
statusText: "OK",
});
}

// For other requests, use real fetch
return fetch(input, init).then((res) => {
console.log(`[real] ${res.status} ${res.statusText} ${url} ${res.headers.get("content-length")}`);
return res;
});
};

return {
fetch: mockFetch,
getStats: () => ({ xorbCount, shardCount, xorbBytes, shardBytes }),
};
}

async function main() {
const { values: args } = parseArgs({
options: {
token: {
type: "string",
short: "t",
},
repo: {
type: "string",
short: "r",
},
commit: {
type: "boolean",
short: "c",
default: false,
},
},
});

if (!args.token || !args.repo) {
console.error("Usage: pnpm --filter hub bench -t <write token> -r <xet repo>");
console.error("Example: pnpm --filter hub bench -t hf_... -r myuser/myrepo");
process.exit(1);
}

// Setup temp directory
const tempDir = tmpdir();
const downloadDir = join(tempDir, "hf-bench-downloads");

// Ensure download directory exists
await mkdir(downloadDir, { recursive: true });

// Download files
const files: Array<{ filepath: string; filename: string }> = [];

for (const fileInfo of FILES_TO_DOWNLOAD) {
const filepath = join(downloadDir, fileInfo.filename);
await downloadFileIfNotExists(fileInfo.url, filepath);
files.push({ filepath, filename: fileInfo.filename });
}

// Parse repo
const repoName = args.repo;

const repo: RepoId = toRepoId(repoName);

// Create mock fetch
const mockFetchObj = createMockFetch();

// Setup upload parameters
const uploadParams = {
accessToken: args.token,
hubUrl: "https://huggingface.co",
customFetch: mockFetchObj.fetch,
repo,
rev: "main",
};

// Track statistics
const stats: Array<{
filename: string;
size: number;
dedupRatio: number;
}> = [];

console.log("\n=== Starting upload simulation ===");

// Process files through uploadShards
const fileSource = createFileSource(files);

for await (const event of uploadShards(fileSource, uploadParams)) {
switch (event.event) {
case "file": {
console.log(`\n📁 Processed file: ${event.path}`);
console.log(` SHA256: ${event.sha256}`);
console.log(` Dedup ratio: ${(event.dedupRatio * 100).toFixed(2)}%`);

// Find the file size
const file = files.find((f) => f.filename === event.path);
if (file) {
const fileStats = await stat(file.filepath);

stats.push({
filename: event.path,
size: fileStats.size,
dedupRatio: event.dedupRatio,
});
}
break;
}

case "fileProgress": {
const progress = (event.progress * 100).toFixed(1);
console.log(` 📈 Progress for ${event.path}: ${progress}%`);
break;
}
}
}

// Get actual upload counts from the mock fetch
const uploadStats = mockFetchObj.getStats();
console.log(`\n📊 Actual upload counts: ${uploadStats.xorbCount} xorbs, ${uploadStats.shardCount} shards`);

// Output final statistics
console.log("\n=== BENCHMARK RESULTS ===");
console.log("File Statistics:");
console.log("================");

for (const stat of stats) {
console.log(`\n📄 ${stat.filename}:`);
console.log(` Size: ${(stat.size / 1024 / 1024).toFixed(2)} MB`);
console.log(` Deduplication: ${(stat.dedupRatio * 100).toFixed(2)}%`);
}

console.log("\n=== SUMMARY ===");
const totalSize = stats.reduce((sum, s) => sum + s.size, 0);
const avgDedup = stats.reduce((sum, s) => sum + s.dedupRatio, 0) / stats.length;

console.log(`Total files: ${stats.length}`);
console.log(`Total size: ${(totalSize / 1024 / 1024).toFixed(2)} MB`);
console.log(`Total xorbs: ${uploadStats.xorbCount}`);
console.log(`Total shards: ${uploadStats.shardCount}`);
console.log(`Total xorb bytes: ${uploadStats.xorbBytes.toLocaleString("fr")} bytes`);
console.log(`Total shard bytes: ${uploadStats.shardBytes.toLocaleString("fr")} bytes`);
console.log(`Average deduplication: ${(avgDedup * 100).toFixed(2)}%`);

if (args.commit) {
console.log("\n=== Committing files ===");
const iterator = commitIter({
repo,
operations: files.map((file) => ({
operation: "addOrUpdate",
content: pathToFileURL(file.filepath),
path: file.filename,
})),
accessToken: args.token,
title: "Upload xet files with JS lib",
xet: true,
});
for await (const event of iterator) {
if (event.event === "fileProgress" && event.state === "hashing") {
// We don't care about the hashing progress
} else {
console.log(event);
}
}

console.log("Done committing");
}
}

main().catch((error) => {
console.error("Error:", error);
process.exit(1);
});
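
For reference, here is a condensed sketch of how uploadShards might be driven outside the bench harness — a single in-memory Blob and the real fetch instead of the mock — assuming the same parameter and event shapes the script above uses. The helper names (singleFile, uploadOneBlob) are purely illustrative:

import { uploadShards } from "../src/utils/uploadShards.js";
import { sha256 } from "../src/utils/sha256.js";
import { toRepoId } from "../src/utils/toRepoId.js";

// Yield one { content, path, sha256 } entry, draining the sha256 iterator
// (it yields progress numbers and returns the hex digest).
async function* singleFile(blob: Blob, path: string) {
  const iterator = sha256(blob, { useWebWorker: false });
  let res: IteratorResult<number, string>;
  do {
    res = await iterator.next();
  } while (!res.done);
  yield { content: blob, path, sha256: res.value };
}

export async function uploadOneBlob(blob: Blob, path: string, accessToken: string, repoName: string): Promise<void> {
  for await (const event of uploadShards(singleFile(blob, path), {
    accessToken,
    hubUrl: "https://huggingface.co",
    customFetch: fetch,
    repo: toRepoId(repoName),
    rev: "main",
  })) {
    if (event.event === "file") {
      console.log(`${event.path}: sha256 ${event.sha256}, dedup ${(event.dedupRatio * 100).toFixed(2)}%`);
    } else if (event.event === "fileProgress") {
      console.log(`${event.path}: ${(event.progress * 100).toFixed(1)}%`);
    }
  }
}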
8 changes: 5 additions & 3 deletions packages/hub/scripts/build-xet-wasm.sh
@@ -224,13 +224,15 @@ fi

# copy the generated hf_xet_thin_wasm_bg.js to the hub package and hf_xet_thin_wasm_bg.wasm to the hub package
cp "$CLONE_DIR/$PACKAGE/pkg/hf_xet_thin_wasm_bg.js" "./src/vendor/xet-chunk/chunker_wasm_bg.js"
cp "$CLONE_DIR/$PACKAGE/pkg/hf_xet_thin_wasm_bg.wasm.d.ts" "./src/vendor/xet-chunk/chunker_wasm_bg.wasm.d.ts"
echo "// Generated by build-xet-wasm.sh" > "./src/vendor/xet-chunk/chunker_wasm_bg.wasm.base64.ts"
echo "export const wasmBase64 = atob(\`" >> "./src/vendor/xet-chunk/chunker_wasm_bg.wasm.base64.ts"
base64 "$CLONE_DIR/$PACKAGE/pkg/hf_xet_thin_wasm_bg.wasm" | fold -w 100 >> "./src/vendor/xet-chunk/chunker_wasm_bg.wasm.base64.ts"
cat << 'EOF' >> "./src/vendor/xet-chunk/chunker_wasm_bg.wasm.base64.ts"
- `)
- .trim()
- .replaceAll("\n", "");
+ `
+ .trim()
+ .replaceAll("\n", "")
+ );
const wasmBinary = new Uint8Array(wasmBase64.length);
for (let i = 0; i < wasmBase64.length; i++) {
wasmBinary[i] = wasmBase64.charCodeAt(i);
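
With the closing parenthesis moved, .trim().replaceAll("\n", "") now cleans the base64 template literal before atob() decodes it, rather than being applied to the decoded binary string (where removing "\n" would strip 0x0A bytes out of the wasm binary). A sketch of the generated src/vendor/xet-chunk/chunker_wasm_bg.wasm.base64.ts after this change — the stand-in payload here is just the 8-byte wasm magic and version; the real file embeds the full binary folded to 100-character lines:

// Generated by build-xet-wasm.sh
// "AGFzbQEAAAA=" is a stand-in payload, not the real module.
export const wasmBase64 = atob(`
AGFzbQEAAAA=
`
  .trim()
  .replaceAll("\n", "")
);

// atob() returns a binary string; copy its char codes into bytes for wasm instantiation.
const wasmBinary = new Uint8Array(wasmBase64.length);
for (let i = 0; i < wasmBase64.length; i++) {
  wasmBinary[i] = wasmBase64.charCodeAt(i);
}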