Commit 710425a

fix: update readme, unit test issue (#308)
* fix: update readme, unit test issue
* fix: add gen one crd script
* fix: optimize connection prefix
* fix: unit test issue
* fix: add gen crd script
* fix: unit test issue
1 parent d4cf358 commit 710425a

7 files changed (+79 -23 lines changed)

Makefile

Lines changed: 4 additions & 0 deletions
@@ -60,6 +60,10 @@ fmt: ## Run go fmt against code.
 vet: ## Run go vet against code.
 	go vet ./...
 
+.PHONY: one-crd
+one-crd:
+	bash scripts/generate-crd.sh
+
 .PHONY: test
 test: manifests generate fmt vet envtest ## Run tests.
 	KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(ENVTEST_K8S_VERSION) --bin-dir $(LOCALBIN) -p path)" GO_TESTING=true go run github.com/onsi/ginkgo/v2/ginkgo -p -timeout 0 -cover -coverprofile cover.out -r --skip-file ./test/e2e

README.md

Lines changed: 17 additions & 20 deletions
@@ -1,7 +1,7 @@
-<p align="center"><a href="javascript:void(0);" target="_blank" rel="noreferrer"><img width="200" src="https://cdn.tensor-fusion.ai/logo.svg" alt="Logo"></a></p>
+<p align="center"><a href="javascript:void(0);" target="_blank" rel="noreferrer"><img width="100%" src="https://cdn.tensor-fusion.ai/logo-banner.png" alt="Logo"></a></p>
 
 <p align="center">
-<strong><a href="https://tensor-fusion.ai" target="_blank">TensorFusion.AI</a></strong><br/>Next-Generation GPU Virtualization and Pooling for Enterprises<br><b>Less GPUs, More AI Apps.</b>
+<br /><strong><a href="https://tensor-fusion.ai" target="_blank">TensorFusion.AI</a></strong><br/><b>Less GPUs, More AI Apps.</b>
 <br />
 <a href="https://tensor-fusion.ai/guide/overview"><strong>Explore the docs »</strong></a>
 <br />
@@ -13,12 +13,9 @@
 </p>
 
 
-# ♾️ Tensor Fusion
-
 [![Contributors][contributors-shield]][contributors-url]
 [![Forks][forks-shield]][forks-url]
 [![Stargazers][stars-shield]][stars-url]
-[![Issues][issues-shield]][issues-url]
 [![MIT License][license-shield]][license-url]
 [![LinkedIn][linkedin-shield]][linkedin-url]
 [![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/NexusGPU/tensor-fusion)
@@ -27,11 +24,11 @@ Tensor Fusion is a state-of-the-art **GPU virtualization and pooling solution**
 
 ## 🌟 Highlights
 
-#### 📐 Fractional GPU with Single TFlops/MiB Precision
-#### 🔄 Battle-tested GPU-over-IP Remote GPU Sharing
+#### 📐 Fractional Virtual GPU
+#### 🔄 Remote GPU Sharing over Ethernet/InfiniBand
 #### ⚖️ GPU-first Scheduling and Auto-scaling
-#### 📊 Computing Oversubscription and GPU VRAM Expansion
-#### 🛫 GPU Pooling, Monitoring, Live Migration, AI Model Preloading and more
+#### 📊 GPU Oversubscription and VRAM Expansion
+#### 🛫 GPU Pooling, Monitoring, Live Migration, Model Preloading and more
 
 ## 🎬 Demo

@@ -88,27 +85,26 @@ https://cdn.tensor-fusion.ai/GPU_Content_Migration.mp4
 - [x] GPU compaction/bin-packing
 - [x] Seamless onboarding experience for Pytorch, TensorFlow, llama.cpp, vLLM, Tensor-RT, SGlang and all popular AI training/serving frameworks
 - [x] Centralized Dashboard & Control Plane
-- [ ] GPU-first autoscaling policies, auto set requests/limits/replicas
-- [ ] Request multiple vGPUs with group scheduling for large models
-- [ ] Support different QoS levels
+- [x] GPU-first autoscaling policies, auto set requests/limits/replicas
+- [x] Request multiple vGPUs with group scheduling for large models
+- [x] Support different QoS levels
 
 ### Enterprise Features
 
-- [x] GPU live-migration, snapshot/distribute/restore GPU context cross cluster, fastest in the world
+- [x] GPU live-migration, snapshot and restore GPU context cross cluster
 - [ ] AI model registry and preloading, build your own private MaaS(Model-as-a-Service)
 - [ ] Advanced auto-scaling policies, scale to zero, rebalance of hot GPUs
 - [ ] Advanced observability features, detailed metrics & tracing/profiling of CUDA calls
 - [ ] Monetize your GPU cluster by multi-tenancy usage measurement & billing report
 - [ ] Enterprise level high availability and resilience, support topology aware scheduling, GPU node auto failover etc.
-- [ ] Enterprise level security, complete on-premise deployment support, encryption in-transit & at-rest
+- [ ] Enterprise level security, complete on-premise deployment support
 - [ ] Enterprise level compliance, SSO/SAML support, advanced audit, ReBAC control, SOC2 and other compliance reports available
 
 ### 🗳️ Platform Support
 
 - [x] Run on Linux Kubernetes clusters
 - [x] Run on Linux VMs or Bare Metal (one-click onboarding to Edge K3S)
-- [x] Run on Windows (Docs not ready, contact us for support)
-- [ ] Run on MacOS (Imagining mount a virtual NVIDIA GPU device on MacOS!)
+- [x] Run on Windows (Not open sourced, contact us for support)
 
 See the [open issues](https://github.com/NexusGPU/tensor-fusion/issues) for a full list of proposed features (and known issues).

@@ -131,12 +127,13 @@ Don't forget to give the project a star! Thanks again!
 <img src="https://contrib.rocks/image?repo=NexusGPU/tensor-fusion" alt="contrib.rocks image" />
 </a>
 
-<!-- LICENSE -->
 ## 🔷 License
 
-1. This repo is open sourced with [Apache 2.0 License](./LICENSE), which includes **GPU pooling, scheduling, management features**, you can use it for free and modify it.
-2. **GPU virtualization and GPU-over-IP features** are also free to use as the part of **Community Plan**, the implementation is not fully open sourced
-3. Features mentioned in "**Enterprise Features**" above are paid, **licensed users can automatically unlock these features**.
+1. [TensorFusion main repo](https://github.com/NexusGPU/tensor-fusion) is open sourced with [Apache 2.0 License](./LICENSE), which includes **GPU pooling, scheduling, management features**, you can use it for free and customize it as you want.
+2. [vgpu.rs repo](https://github.com/NexusGPU/vgpu.rs) is open sourced with [Apache 2.0 License](./LICENSE), which includes **Fractional GPU** and **vGPU hypervisor features**, you can use it for free and customize it as you want.
+3. **Advanced GPU virtualization and GPU-over-IP sharing features** are also free to use when **GPU total number of your organization is less than 10**, but the implementation is not fully open sourced, please [contact us](mailto:[email protected]) for more details.
+4. Features mentioned in "**Enterprise Features**" above are paid, **licensed users can use these features in [TensorFusion Console](https://app.tensor-fusion.ai)**.
+5. For large scale deployment that involves non-free features of #3 and #4, please [contact us](mailto:[email protected]), pricing details are available [here](https://tensor-fusion.ai/pricing)
 
 [![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2FNexusGPU%2Ftensor-fusion.svg?type=large&issueType=license)](https://app.fossa.com/projects/git%2Bgithub.com%2FNexusGPU%2Ftensor-fusion?ref=badge_large&issueType=license)

internal/constants/env.go

Lines changed: 3 additions & 2 deletions
@@ -69,8 +69,9 @@ const (
 	LdPreloadFileName = "ld.so.preload"
 	LdPreloadFile     = "/etc/ld.so.preload"
 
-	TFLibsVolumeName      = "tf-libs"
-	TFLibsVolumeMountPath = "/tensor-fusion"
+	TFLibsVolumeName       = "tf-libs"
+	TFLibsVolumeMountPath  = "/tensor-fusion"
+	TFConnectionNamePrefix = "tf-vgpu-"
 
 	HostIPFieldRef   = "status.hostIP"
 	NodeNameFieldRef = "spec.nodeName"

internal/controller/pod_controller.go

Lines changed: 1 addition & 0 deletions
@@ -90,6 +90,7 @@ func (r *PodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.R
 			return ctrl.Result{}, err
 		}
 		delete(pod.Annotations, constants.SetPendingOwnedWorkloadAnnotation)
+		log.Info("Pending owned workload set", "pod", pod.Name, "ownedWorkload", ownedWorkloadName)
 		if err := r.Update(ctx, pod); err != nil {
 			return ctrl.Result{}, err
 		}

internal/controller/pod_controller_test.go

Lines changed: 37 additions & 0 deletions
@@ -31,6 +31,7 @@ import (
 	"k8s.io/apimachinery/pkg/api/resource"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/apimachinery/pkg/types"
+	"k8s.io/utils/ptr"
 	"sigs.k8s.io/controller-runtime/pkg/client"
 	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
 )
@@ -158,6 +159,14 @@ var _ = Describe("Pod Controller", func() {
 			},
 		}
 		Expect(k8sClient.Create(ctx, workload)).To(Succeed())
+		Eventually(func() error {
+			updatedWorkload := &tfv1.TensorFusionWorkload{}
+			err := k8sClient.Get(ctx, client.ObjectKeyFromObject(workload), updatedWorkload)
+			if err != nil {
+				return err
+			}
+			return nil
+		}).Should(Succeed())
 
 		clientPod = &corev1.Pod{
 			ObjectMeta: metav1.ObjectMeta{
@@ -191,17 +200,35 @@ var _ = Describe("Pod Controller", func() {
 						},
 					},
 				},
+				TerminationGracePeriodSeconds: ptr.To(int64(0)),
 			},
 		}
 	})
 
 	AfterEach(func() {
 		if workload != nil {
 			_ = k8sClient.Delete(ctx, workload)
+			Eventually(func() error {
+				return k8sClient.Get(ctx, client.ObjectKeyFromObject(workload), workload)
+			}).Should(Satisfy(errors.IsNotFound))
 		}
 		if clientPod != nil {
 			_ = k8sClient.Delete(ctx, clientPod)
+			Eventually(func() error {
+				return k8sClient.Get(ctx, client.ObjectKeyFromObject(clientPod), clientPod)
+			}).Should(Satisfy(errors.IsNotFound))
 		}
+
+		connection := &tfv1.TensorFusionConnection{
+			ObjectMeta: metav1.ObjectMeta{
+				Name:      "test-connection-pod-controller",
+				Namespace: "default",
+			},
+		}
+		_ = k8sClient.Delete(ctx, connection)
+		Eventually(func() error {
+			return k8sClient.Get(ctx, client.ObjectKeyFromObject(connection), connection)
+		}).Should(Satisfy(errors.IsNotFound))
 	})
 
 	It("should successfully create TensorFusion connection for client pod", func() {
@@ -331,6 +358,14 @@ var _ = Describe("Pod Controller", func() {
 			},
 		}
 		Expect(k8sClient.Create(ctx, workload)).To(Succeed())
+		Eventually(func() error {
+			updatedWorkload := &tfv1.TensorFusionWorkload{}
+			err := k8sClient.Get(ctx, client.ObjectKeyFromObject(workload), updatedWorkload)
+			if err != nil {
+				return err
+			}
+			return nil
+		}).Should(Succeed())
 
 		pod = &corev1.Pod{
 			ObjectMeta: metav1.ObjectMeta{
@@ -351,6 +386,7 @@ var _ = Describe("Pod Controller", func() {
 						Image: "test-image",
 					},
 				},
+				TerminationGracePeriodSeconds: ptr.To(int64(0)),
 			},
 		}
 	})
@@ -426,6 +462,7 @@ var _ = Describe("Pod Controller", func() {
 						},
 					},
 				},
+				TerminationGracePeriodSeconds: ptr.To(int64(0)),
 			},
 		}

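The test changes above delete each object and then poll `Get` until it reports not-found, so the next spec does not start while an old object still exists. Setting `TerminationGracePeriodSeconds` to zero presumably lets the pods disappear promptly. The polling pattern can be sketched with the standard library alone; `waitFor`, `deletedAfter`, `isNotFound`, and `errNotFound` are hypothetical names standing in for Gomega's `Eventually`, the Kubernetes client, and `errors.IsNotFound`, and are not part of this commit.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotFound is a stand-in for the API server's not-found error (assumption).
var errNotFound = errors.New("not found")

// isNotFound plays the role of errors.IsNotFound in the test above.
func isNotFound(err error) bool { return errors.Is(err, errNotFound) }

// deletedAfter returns a fake Get call that succeeds n times (the object
// still exists) and reports errNotFound on every call after that.
func deletedAfter(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return nil
		}
		return errNotFound
	}
}

// waitFor calls get repeatedly until done(err) is true or the timeout
// elapses. It returns true if the condition was met in time.
func waitFor(get func() error, done func(error) bool, timeout, interval time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if done(get()) {
			return true
		}
		if !time.Now().Before(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

func main() {
	ok := waitFor(deletedAfter(2), isNotFound, time.Second, 10*time.Millisecond)
	fmt.Println("object gone:", ok)
}
```

In the real test, Gomega handles the timeout and interval, and the `Satisfy(errors.IsNotFound)` matcher replaces the `done` callback.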
internal/webhook/v1/pod_webhook.go

Lines changed: 5 additions & 1 deletion
@@ -388,7 +388,11 @@ func assignPodLabelsAndAnnotations(isLocalGPU bool, pod *corev1.Pod, pool *tfv1.
 }
 
 func addConnectionForRemoteFixedReplicaVirtualGPU(pod *corev1.Pod, container *corev1.Container, clientConfig *tfv1.ClientConfig) {
-	connectionName := fmt.Sprintf("%s%s", pod.GenerateName, utils.NewShortID(10))
+	prefix := pod.GenerateName
+	if pod.GenerateName == "" {
+		prefix = pod.Name + constants.TFConnectionNamePrefix
+	}
+	connectionName := fmt.Sprintf("%s%s", prefix, utils.NewShortID(10))
 	connectionNamespace := pod.Namespace
 
 	// metadata TF_POD_NAME and TF_CONNECTION_NAMESPACE

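The webhook change above covers pods created without `GenerateName` (bare pods, for example), where the old code would have produced a connection name consisting only of the random ID. A minimal standalone sketch of the new naming logic follows. `newShortID` is a hypothetical stand-in for `utils.NewShortID`, whose implementation is not shown in this commit, and `connectionName` is an illustrative wrapper, not a function in the repository.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// tfConnectionNamePrefix mirrors constants.TFConnectionNamePrefix from the diff.
const tfConnectionNamePrefix = "tf-vgpu-"

// newShortID is a hypothetical replacement for utils.NewShortID: it returns
// n random lowercase alphanumeric characters.
func newShortID(n int) string {
	const alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
	out := make([]byte, n)
	for i := range out {
		idx, err := rand.Int(rand.Reader, big.NewInt(int64(len(alphabet))))
		if err != nil {
			panic(err)
		}
		out[i] = alphabet[idx.Int64()]
	}
	return string(out)
}

// connectionName follows the branching in the hunk above: prefer the pod's
// GenerateName, otherwise fall back to the pod name plus the fixed prefix.
func connectionName(generateName, name string) string {
	prefix := generateName
	if generateName == "" {
		prefix = name + tfConnectionNamePrefix
	}
	return fmt.Sprintf("%s%s", prefix, newShortID(10))
}

func main() {
	fmt.Println(connectionName("web-7d9f-", "")) // pod owned by a controller
	fmt.Println(connectionName("", "standalone")) // bare pod without GenerateName
}
```

Because the constant has no leading hyphen, a bare pod named `standalone` gets a connection name beginning with `standalonetf-vgpu-`.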
scripts/generate-crd.sh

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+#!/bin/bash
+
+CRD_DIR="./charts/tensor-fusion/crds"
+OUTPUT_FILE="./tmp.tensor-fusion-crds.yaml"
+
+echo "Generating combined CRD file..."
+> "$OUTPUT_FILE"
+for file in "$CRD_DIR"/*.yaml; do
+    [ -s "$OUTPUT_FILE" ]
+    cat "$file" >> "$OUTPUT_FILE"
+done
+echo "Generated: $OUTPUT_FILE"

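As shown, the loop in the script tests `[ -s "$OUTPUT_FILE" ]` but does nothing with the result, so the CRD files are concatenated back to back. If the individual files do not start with their own `---` marker (an assumption, since the files are not part of this diff), the combined output needs an explicit document separator to be valid multi-document YAML. The sketch below shows one way to do that in Go. `joinDocs` and `combineYAML` are hypothetical helpers, and the paths are copied from the script.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// joinDocs concatenates YAML documents, writing a "---" separator line
// before every document except the first non-empty one.
func joinDocs(docs [][]byte) []byte {
	var out bytes.Buffer
	for _, doc := range docs {
		if out.Len() > 0 {
			// Make sure the separator starts on its own line.
			if !bytes.HasSuffix(out.Bytes(), []byte("\n")) {
				out.WriteByte('\n')
			}
			out.WriteString("---\n")
		}
		out.Write(doc)
	}
	return out.Bytes()
}

// combineYAML reads every *.yaml file in dir, in sorted order, and joins them.
func combineYAML(dir string) ([]byte, error) {
	files, err := filepath.Glob(filepath.Join(dir, "*.yaml"))
	if err != nil {
		return nil, err
	}
	sort.Strings(files)
	docs := make([][]byte, 0, len(files))
	for _, f := range files {
		data, err := os.ReadFile(f)
		if err != nil {
			return nil, err
		}
		docs = append(docs, data)
	}
	return joinDocs(docs), nil
}

func main() {
	const output = "./tmp.tensor-fusion-crds.yaml"
	combined, err := combineYAML("./charts/tensor-fusion/crds")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := os.WriteFile(output, combined, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("Generated:", output)
}
```

The separator is skipped while the output is still empty, which matches the apparent intent of the script's `-s` (non-empty file) test.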