High-performance inter-process communication for Python using memory-mapped files. Sidestep the GIL by using separate processes with fast shared memory.
Python's GIL prevents parallel execution in threads. This library enables true parallelism by:
- Using separate processes (each with its own GIL)
- Sharing memory via mmap (no pickling overhead)
- Providing 10-50x better throughput than `multiprocessing.Queue` (depending on workload)
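The core mechanism can be sketched with nothing but the standard library: map the same file into two processes and copy raw bytes through it, skipping pickling entirely. This is an illustrative sketch only — the actual library adds ring-buffer framing, blocking, and locking on top.

```python
import mmap
import os
import struct
import tempfile
from multiprocessing import Process

HEADER = struct.Struct("<Q")  # 8-byte length prefix

def write_message(mm, payload):
    """Length-prefix a message at offset 0 -- raw bytes, no pickling."""
    HEADER.pack_into(mm, 0, len(payload))
    mm[HEADER.size:HEADER.size + len(payload)] = payload

def read_message(mm):
    """Read the message back; works from any process mapping the same file."""
    (n,) = HEADER.unpack_from(mm, 0)
    return mm[HEADER.size:HEADER.size + n]

def reader(path):
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        assert read_message(mm) == b"hello from the parent"

if __name__ == "__main__":
    path = tempfile.mktemp()
    with open(path, "wb") as f:
        f.truncate(4096)  # reserve the shared region on disk
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        write_message(mm, b"hello from the parent")
    p = Process(target=reader, args=(path,))
    p.start()
    p.join()
    os.unlink(path)
```

Because the kernel backs both mappings with the same physical pages, the child reads exactly the bytes the parent wrote, with no serialization step in between.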
Installation:

```shell
uv venv && source .venv/bin/activate
uv pip install -e .
```

Ring buffer (streaming):

```python
from mmapipc import SharedMemoryRing
import multiprocessing as mp

def producer():
    with SharedMemoryRing.open("/tmp/ring") as ring:
        for i in range(1000):
            ring.write(f"Message {i}".encode(), block=True)

def consumer():
    with SharedMemoryRing.open("/tmp/ring") as ring:
        while True:
            data = ring.read(block=True, timeout=1.0)
            if data:
                process(data)

# Create buffer
with SharedMemoryRing.create("/tmp/ring", capacity=1024*1024):
    pass

# Start processes
mp.Process(target=consumer).start()
mp.Process(target=producer).start()
```

Queue (worker pool):

```python
from mmapipc import SharedMemoryQueue
import multiprocessing as mp

def worker(worker_id):
    with SharedMemoryQueue.open("/tmp/queue") as queue:
        while True:
            task = queue.get(block=True, timeout=5.0)
            if task:
                process_task(task)

# Create and populate queue
with SharedMemoryQueue.create("/tmp/queue", capacity=1024*1024) as queue:
    for i in range(100):
        queue.put(f"Task {i}".encode())

# Start workers
for i in range(4):
    mp.Process(target=worker, args=(i,)).start()
```

Shared memory (raw access):

```python
from mmapipc import SharedMemory

# Process 1: Write
with SharedMemory.create("/tmp/shm", size=1024) as shm:
    shm.write(0, b"Hello")
    shm.write_uint(100, 42)

# Process 2: Read
with SharedMemory.open("/tmp/shm") as shm:
    data = shm.read(0, 5)
    num = shm.read_uint(100)
```

SharedMemoryRing: lock-free ring buffer for streaming data (single producer, single consumer).
- `create(path, capacity)` / `open(path)` - Create or open buffer
- `write(data, block=False, timeout=None)` - Write bytes (returns bool)
- `read(block=False, timeout=None)` - Read bytes (returns bytes or None)
- `available_write()` / `available_read()` - Get available space/data
- `clear()` - Reset buffer
- `close()` / `unlink()` - Close or delete
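The single-producer/single-consumer design can be illustrated with a toy stdlib ring: the producer only ever advances `head`, the consumer only ever advances `tail`, so neither side needs a lock. This is a sketch, not the library's implementation — the real `SharedMemoryRing` adds message framing, blocking semantics, and proper memory ordering.

```python
import mmap
import struct

class ToyRing:
    """Toy SPSC byte ring over an mmap region (first 16 bytes: head, tail)."""
    BASE = 16  # data starts after the two 8-byte indices

    def __init__(self, mm, capacity):
        self.mm, self.cap = mm, capacity

    def _head(self):
        return struct.unpack_from("<Q", self.mm, 0)[0]

    def _tail(self):
        return struct.unpack_from("<Q", self.mm, 8)[0]

    def write(self, data):  # producer side: only ever writes head
        head, tail = self._head(), self._tail()
        if self.cap - (head - tail) < len(data):
            return False  # full -- the library would block here
        pos = head % self.cap
        first = min(len(data), self.cap - pos)  # bytes before wrap-around
        self.mm[self.BASE + pos:self.BASE + pos + first] = data[:first]
        self.mm[self.BASE:self.BASE + len(data) - first] = data[first:]
        struct.pack_into("<Q", self.mm, 0, head + len(data))
        return True

    def read(self, n):  # consumer side: only ever writes tail
        head, tail = self._head(), self._tail()
        n = min(n, head - tail)
        pos = tail % self.cap
        first = min(n, self.cap - pos)
        out = (self.mm[self.BASE + pos:self.BASE + pos + first]
               + self.mm[self.BASE:self.BASE + (n - first)])
        struct.pack_into("<Q", self.mm, 8, tail + n)
        return out
```

The slice assignments copy whole chunks at a time, which is the same trick the library's notes credit for its ring-buffer throughput.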
This library is designed to bypass the Global Interpreter Lock (GIL) for high-throughput data transfer between Python processes.
Why it's useful:
- True Parallelism: Processes run independently, utilizing multiple CPU cores.
- Zero Serialization: Unlike `multiprocessing.Queue` (which pickles objects), this library copies raw bytes (or avoids copying entirely with `get_view`), resulting in massive throughput gains.
- Zero-Copy Reads: With `get_view()`, you can process data directly from shared memory without creating intermediate Python `bytes` objects.
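The principle behind zero-copy reads can be shown with a plain `memoryview` over an mmap region: slicing the view allocates nothing, and a copy happens only if you explicitly ask for one. (Stdlib demo of the mechanism, not the library's `get_view()` itself.)

```python
import mmap

mm = mmap.mmap(-1, 1 << 20)   # anonymous 1 MiB mapping for the demo
mm[0:11] = b"hello world"

view = memoryview(mm)         # zero-copy window over the mapping
frame = view[0:11]            # slicing a memoryview copies nothing
assert frame.tobytes() == b"hello world"   # copy happens only on request

# Writes through the view land directly in shared memory:
frame[0:5] = b"HELLO"
assert mm[0:11] == b"HELLO world"

frame.release()               # release all views before closing the mmap
view.release()
mm.close()
```

This is why zero-copy reads dominate the benchmarks: the consumer parses data where it already sits instead of materializing a `bytes` object per message.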
Benchmarks (MacBook Pro M3):
- Throughput: Up to 19 GB/s (Zero-Copy Read).
- Message Rate: Up to 800,000 msg/sec (Batch Write).
- Latency: Sub-millisecond overhead.
Best For:
- Video processing pipelines (passing raw frames).
- High-frequency trading / financial data feeds.
- Machine Learning data loading (passing tensors/arrays).
SharedMemoryQueue: thread-safe queue for discrete messages (multiple producers/consumers).
- `create(path, capacity)` / `open(path)` - Create or open queue
- `put(data, block=True, timeout=None)` - Enqueue message (returns bool)
- `put_batch(messages, block=True, timeout=None)` - Enqueue multiple messages (returns count)
- `get(block=True, timeout=None)` - Dequeue message (returns bytes or None)
- `get_batch(max_count, block=True, timeout=None)` - Dequeue multiple messages (returns list[bytes])
- `get_view(block=True, timeout=None)` - Zero-copy dequeue (returns list[memoryview] or None)
- `qsize()` / `empty()` - Get size or check if empty
- `clear()` - Clear queue
- `close()` / `unlink()` - Close or delete
SharedMemory: raw memory-mapped region.
- `create(path, size)` / `open(path)` - Create or open mapping
- `write(offset, data)` / `read(offset, length)` - Read/write bytes
- `write_int(offset, val)` / `read_int(offset)` - Read/write signed int64
- `write_uint(offset, val)` / `read_uint(offset)` - Read/write unsigned int64
- `get_view(offset, size)` - Get zero-copy memoryview
- `close()` / `unlink()` - Close or delete
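Typed accessors like `write_uint()`/`read_uint()` reduce to `struct` packing at a fixed offset. A stdlib sketch of the same pattern (the little-endian byte order here is an assumption, not documented by the library):

```python
import mmap
import struct

I64 = struct.Struct("<q")   # signed 64-bit, little-endian (assumed)
U64 = struct.Struct("<Q")   # unsigned 64-bit, little-endian (assumed)

def write_uint(mm, offset, value):
    U64.pack_into(mm, offset, value)

def read_uint(mm, offset):
    return U64.unpack_from(mm, offset)[0]

def write_int(mm, offset, value):
    I64.pack_into(mm, offset, value)

def read_int(mm, offset):
    return I64.unpack_from(mm, offset)[0]

mm = mmap.mmap(-1, 1024)    # anonymous mapping for the demo
write_uint(mm, 100, 42)
write_int(mm, 108, -7)
assert read_uint(mm, 100) == 42
assert read_int(mm, 108) == -7
mm.close()
```

Keeping integers at known offsets like this is how two processes can share counters or headers without framing overhead.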
Examples:

```shell
python examples/ring_buffer_example.py   # Streaming demo
python examples/queue_example.py         # Worker pool demo
python examples/benchmark.py             # Basic performance comparison
python examples/benchmark_advanced.py    # Batching & throughput benchmark
```

Development install and tests:

```shell
uv pip install -e ".[dev]"
pytest tests/ -v
```

Use mmap-ipc when:

- ✅ Same-machine Python processes
- ✅ High throughput required (>1 GB/s)
- ✅ Low latency critical (<100μs)
- ✅ Unix/Linux environment
| Solution | Throughput | Use Case |
|---|---|---|
| mmap-ipc | 19 GB/s | High-performance local IPC |
| `multiprocessing.shared_memory` | ~19 GB/s* | Cross-platform, DIY sync |
| NumPy + shared memory | ~19 GB/s* | Numerical arrays only |
| Redis (localhost) | 1-2 GB/s | Distributed systems |
*Theoretical maximum if you implement synchronization correctly
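"DIY sync" means the stdlib segment is just raw bytes: nothing coordinates concurrent access for you, so you must pair it with your own lock to avoid torn reads and writes. A minimal stdlib sketch of what that entails:

```python
from multiprocessing import Lock, shared_memory

lock = Lock()  # you supply the synchronization yourself
shm = shared_memory.SharedMemory(create=True, size=1024)
try:
    with lock:                       # guard every write...
        shm.buf[0:5] = b"hello"
    with lock:                       # ...and every read
        data = bytes(shm.buf[0:5])
    assert data == b"hello"
finally:
    shm.close()
    shm.unlink()
```

The throughput ceiling is the same because the bytes move through the same kernel pages; the difference is how much correctness machinery you have to build around them.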
Bottom line: For high-performance, same-machine Python IPC, mmap-ipc provides the best balance of performance and ease-of-use.
- Version: 1.0.0 (Beta)
- Test Coverage: 27 comprehensive tests
- CI/CD: Automated testing on Python 3.8-3.12, Ubuntu/macOS
- Stability: Built on battle-tested primitives (mmap, fcntl)
- ✅ Functional correctness (all operations)
- ✅ Cross-process communication
- ✅ Crash safety (process dies while holding lock)
- ✅ Performance benchmarks
- ✅ Edge cases (wrap-around, buffer full, etc.)
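The crash-safety property rests on `fcntl`-style locks being kernel-managed: when the holding process dies, the kernel releases the lock automatically, so survivors are never wedged. A Unix-only sketch of that behavior using `flock` from the stdlib `fcntl` module:

```python
import fcntl
import os
import tempfile
from multiprocessing import Process

def crash_while_locked(path):
    f = open(path, "r+b")
    fcntl.flock(f, fcntl.LOCK_EX)   # take the lock...
    os._exit(1)                     # ...then die without releasing it

if __name__ == "__main__":
    path = tempfile.mktemp()
    open(path, "wb").close()
    p = Process(target=crash_while_locked, args=(path,))
    p.start()
    p.join()
    with open(path, "r+b") as f:
        # Non-blocking acquire succeeds: the kernel freed the dead
        # process's lock on exit.
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.flock(f, fcntl.LOCK_UN)
    os.unlink(path)
```

A user-space lock (e.g. a flag in shared memory) would stay stuck forever in the same scenario, which is why file locks are the safer primitive here.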
- Unix-only: Requires `fcntl` (Linux, macOS). Windows not supported.
- File-based: Uses `/tmp/` for shared memory files.
- Development: Use freely, report issues
- Production: Thoroughly test your specific use case first
- Fallback: Have a plan to fall back to `multiprocessing.Queue` if needed
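One way to keep such a fallback is to select the transport at startup and code against the common put/get surface. A hedged sketch — the `make_queue` helper is hypothetical, though `SharedMemoryQueue.create` matches the API above:

```python
import multiprocessing as mp

def make_queue(prefer_shared_memory=True, path="/tmp/queue", capacity=1 << 20):
    """Return a shared-memory queue if available, else a stdlib queue."""
    if prefer_shared_memory:
        try:
            from mmapipc import SharedMemoryQueue  # fast path, Unix only
            return SharedMemoryQueue.create(path, capacity=capacity)
        except (ImportError, OSError):
            pass  # fall through to the portable stdlib queue
    return mp.Queue()  # slower (pickles objects) but always available
```

Both backends accept `put(data)` and `get(timeout=...)`, so the rest of the application doesn't need to know which one it got.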
- All primitives support context managers (`with` statements)
- Use `close()` to release resources (automatic with context managers)
- Use `unlink()` to delete the file (only call once, when completely done)
- Files are typically in `/tmp/` and cached in RAM by the OS
- Ring buffer: lock-free, ~48K msg/sec (optimized with slice assignment)
- Queue: robust `fcntl` locking (crash-safe), ~38K msg/sec
- Works on Unix-like systems (Linux, macOS) - requires `fcntl`
License: MIT