A minimalist, dependency-free implementation of the EGGROLL (Evolution Guided General Optimization via Low-rank Learning) algorithm family.
Mission: To build the most capable cross-platform implementations of the EGGROLL family of algorithms and squeeze every possible bit of performance out of the hardware, implementing everything required from scratch in a hardware-optimized fashion.
This project demonstrates integer-only training of a language model, completely bypassing the need for standard floating-point arithmetic or heavy ML frameworks like PyTorch or JAX.
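For intuition, here is a minimal sketch of the low-rank trick the name refers to (floats and naming are illustrative here; the repo's actual kernels are integer): rather than materializing a full `d_out × d_in` noise matrix for each population member, sample thin factors `A` (`d_out × r`) and `B` (`d_in × r`) and evaluate the model with `W + s·A·Bᵀ`, which a matvec can absorb at only `O(r·(d_in + d_out))` extra cost.

```
/* Hedged sketch: compute y = (W + s * A * B^T) x without ever forming
 * the d_out x d_in noise matrix. Shapes and scaling are illustrative. */
#include <stdlib.h>

void perturbed_matvec(const float *W,  /* d_out x d_in, row-major */
                      const float *A,  /* d_out x r,    row-major */
                      const float *B,  /* d_in  x r,    row-major */
                      const float *x, float *y,
                      int d_out, int d_in, int r, float s) {
    /* t = B^T x : only r dot products. */
    float *t = malloc((size_t)r * sizeof *t);
    for (int j = 0; j < r; j++) {
        float acc = 0.0f;
        for (int i = 0; i < d_in; i++) acc += B[i * r + j] * x[i];
        t[j] = acc;
    }
    /* y = W x + s * A t : the low-rank noise costs O(r) per output. */
    for (int o = 0; o < d_out; o++) {
        float acc = 0.0f;
        for (int i = 0; i < d_in; i++) acc += W[o * d_in + i] * x[i];
        for (int j = 0; j < r; j++) acc += s * A[o * r + j] * t[j];
        y[o] = acc;
    }
    free(t);
}
```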
- Pure C / Bare Metal: Old-school and goal-oriented. Zero external dependencies on the CPU side, keeping it close to the metal.
- Apple Silicon Optimized: Vectorized operations using ARM NEON intrinsics and parallelized via Grand Central Dispatch (GCD).
- NVIDIA CUDA Optimized: Custom GPU kernels utilizing Warp-level primitives, Shared Memory, and CUB/Thrust for maximum throughput.
- Integer Only: Operates primarily on `int8` weights/activations with `int32` (CPU) or `int64` (GPU) accumulation, sticking to integer math as long as it yields the best performance for the hardware (see the NEON sketch after this list).
- Gradient Free: Uses Evolution Strategies (ES) with low-rank perturbations instead of backpropagation. It's both wisdom and freedom!
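As a flavor of the integer path, a hedged sketch (not the repo's actual kernel) of an `int8` dot product with `int32` accumulation using ARM NEON intrinsics:

```
#include <arm_neon.h>
#include <stdint.h>

/* int8 x int8 dot product with int32 accumulation (AArch64). */
int32_t dot_i8(const int8_t *a, const int8_t *b, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        int8x16_t va = vld1q_s8(a + i);
        int8x16_t vb = vld1q_s8(b + i);
        /* Widen to int16 products, then pairwise-accumulate into int32. */
        int16x8_t lo = vmull_s8(vget_low_s8(va),  vget_low_s8(vb));
        int16x8_t hi = vmull_s8(vget_high_s8(va), vget_high_s8(vb));
        acc = vpadalq_s16(acc, lo);
        acc = vpadalq_s16(acc, hi);
    }
    int32_t sum = vaddvq_s32(acc);           /* horizontal sum */
    for (; i < n; i++) sum += (int32_t)a[i] * b[i];  /* scalar tail */
    return sum;
}
```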
Ensure you have a text dataset named `input.txt` in the current directory.

```
clang -O3 full_trained_egg.c -o egg
./egg
```

```
nvcc -O3 full_cuda_train_egg.cu -o egg_cuda
./egg_cuda
```

An `int8` model.
- Native `int8` Architecture: Operates on raw bytes with a compact `N`-layer, `H`-dim topology.
- Quantized Sigmoid Self-Attention: An `int32`/`int64` accumulation scheme and quantized weighting.
- Auto-Norm & Entropy Monitoring: Adaptive normalization layers.
- EGG DEBUG: A debug-printing tool for monitoring entropy flow through the network, along with weight distribution and saturation.
- Information-Regulated Optimizer: A hybrid ES-AdamW approach where the optimizer (`float32`) regulates how much of each update is applied to the integer weights, ensuring stable learning (a sketch follows this list).
- Performance: Achieves ~300k tokens/second with a population of 40,000+ (8192×5) on a single 4090 GPU, reaching a loss of ~1.45 bits/byte.
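A hedged sketch of the information-regulated idea (names are illustrative and bias correction is omitted; this is one plausible reading, not the repo's exact code): a `float32` AdamW state smooths the ES estimate and decides the rounded, clamped step actually committed to each `int8` weight.

```
#include <stdint.h>
#include <math.h>

typedef struct { float m, v; } AdamState;

/* g_es: the ES estimate for one weight (fitness-weighted noise sum).
 * The float32 optimizer state regulates how much of it reaches the
 * int8 weight; the step is rounded and saturated. */
int8_t es_adamw_step(int8_t w, float g_es, AdamState *s,
                     float lr, float beta1, float beta2,
                     float eps, float wd) {
    s->m = beta1 * s->m + (1.0f - beta1) * g_es;
    s->v = beta2 * s->v + (1.0f - beta2) * g_es * g_es;
    float step = lr * (s->m / (sqrtf(s->v) + eps) + wd * (float)w);
    int32_t nw = (int32_t)w - (int32_t)lroundf(step);
    if (nw > 127)  nw = 127;    /* saturate to the int8 range */
    if (nw < -128) nw = -128;
    return (int8_t)nw;
}
```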
The system employs a Synchronous Replicated Model with Sharded Evaluation:
- Sharded Evaluation: The population is split across GPUs, with each evaluating a subset of perturbations in parallel.
- Implicit Synchronization: Instead of exchanging gradients (All-Reduce), GPUs receive only fitness scores. Since noise is deterministic, each GPU independently reconstructs the update, keeping replicas synchronized with negligible bandwidth.
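A minimal sketch of why this works (plain C, dense noise instead of EGGROLL's low-rank factors, illustrative PRNG and seeding): because each member's noise is a pure function of a shared seed, every replica regenerates it locally and applies the identical fitness-weighted update, so only one float per population member ever crosses the wire.

```
#include <stdint.h>
#include <stdlib.h>
#include <math.h>

/* splitmix64: a tiny deterministic PRNG shared by all replicas. */
static uint64_t splitmix64(uint64_t *s) {
    uint64_t z = (*s += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Every replica calls this with the same fitness[] and epoch_seed and
 * therefore lands on bit-identical weights without any gradient
 * exchange. */
void reconstruct_update(int8_t *w, int n, const float *fitness,
                        int pop, uint64_t epoch_seed, float lr) {
    float *g = calloc((size_t)n, sizeof *g);
    for (int k = 0; k < pop; k++) {
        uint64_t s = epoch_seed ^ (0x100000001ULL * (uint64_t)(k + 1));
        for (int i = 0; i < n; i++) {
            /* Member k's noise, identical on every replica. */
            int8_t eps = (int8_t)(splitmix64(&s) & 0xFFu);
            g[i] += fitness[k] * (float)eps;
        }
    }
    for (int i = 0; i < n; i++) {
        int32_t nw = (int32_t)w[i]
                   + (int32_t)lroundf(lr * g[i] / (float)pop);
        w[i] = (int8_t)(nw > 127 ? 127 : (nw < -128 ? -128 : nw));
    }
    free(g);
}
```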
```
nvcc -O3 -arch=native full_cuda_train_transformer_adam_mgpu.cu -o egg_transformer_mgpu
./egg_transformer_mgpu
```

A lightweight header-only tool for monitoring integer model stability and detecting saturation or mode collapse.
- Metrics: Tracks Mean, StdDev, bit-level Entropy (0.00-8.00), and Saturation percentages per layer (the entropy metric is sketched below).
- Usage: Define `EGG_DEBUG` during compilation to enable ANSI-colored logs for activations and attention scores.
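A hedged sketch of the entropy metric (assumption: Shannon entropy of the byte histogram, the natural reading of the 0.00-8.00 range): healthy `int8` activations sit near 8 bits, while saturation or mode collapse pulls the value down.

```
#include <math.h>
#include <stdint.h>
#include <stddef.h>

/* Shannon entropy of the byte histogram of an int8 tensor, in bits.
 * A uniform distribution yields 8.0; a collapsed layer approaches 0.0. */
double byte_entropy(const int8_t *x, size_t n) {
    size_t hist[256] = {0};
    for (size_t i = 0; i < n; i++) hist[(uint8_t)x[i]]++;
    double h = 0.0;
    for (int b = 0; b < 256; b++) {
        if (hist[b] == 0) continue;
        double p = (double)hist[b] / (double)n;
        h -= p * log2(p);
    }
    return h;
}
```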
We are friendly and welcome all sorts of contributions!
- Testers: Open issues with a description of your available compute, or join existing issues if you can test on the platforms described there.
- Moderators: To help keep all of this under control.
- Creatives: Even a nice creative idea for the README design is welcome.
- Original JAX Implementation: ESHyperscale/nano-egg
- Original Paper & Project: EGGROLL Website


