
Commit 2e922b9

Merge branch 'master' into master
2 parents e3d0e0f + 8504fdf

10 files changed: +166 -0 lines changed

.github/actions/spelling/allow/names.txt

Lines changed: 2 additions & 0 deletions
@@ -74,6 +74,7 @@ Svrin
 Tadel
 Taras
 Thessaloniki
+Timmaraju
 Universitat
 Unveristy
 Uppili
@@ -196,6 +197,7 @@ tapaswenipathak
 tfransham
 thakkar
 tharun
+timmaraju
 tlattner
 vaibhav
 vassil

.github/actions/spelling/allow/terms.txt

Lines changed: 5 additions & 0 deletions
@@ -4,12 +4,15 @@ CINT
 CMSSW
 Cppyy
 Debian
+EPC
 GPGPU
+GPT
 GSo
 GSoC
 HSF
 JIT'd
 Jacobians
+LLMs
 LLVM
 NVIDIA
 NVMe
@@ -30,12 +33,14 @@ gitlab
 gridlay
 gsoc
 gpu
+llm
 llvm
 pushforward
 linkedin
 microenvironments
 pythonized
 ramview
+reoptimize
 samtools
 sitemap
 softsusy

_data/contributors.yml

Lines changed: 28 additions & 0 deletions
@@ -334,6 +334,34 @@
       proposal: /assets/docs/de_la_torre_gonzalez_salvador_proposal_gsoc_2025.pdf
       mentors: Vassil Vassilev, Lukas Breitwieser
 
+- name: Rohan Timmaraju
+  photo: Rohan_Timmaraju.jpg
+  info: "Google Summer of Code 2025 Contributor"
+
+  education: "B.S. Computer Science, Columbia University"
+  github: "https://github.com/Rohan-T144"
+  active: 1
+  linkedin: "https://www.linkedin.com/in/rohan-timmaraju-650ba3221/"
+  projects:
+    - title: "Enhancing LLM Training Efficiency with Clad for Automatic Differentiation"
+      status: Ongoing
+      description: |
+        Training Large Language Models is computationally expensive, often
+        limited by the performance limitations of Python-based frameworks. This
+        project addresses this challenge by enhancing LLM training efficiency
+        within a C++ environment through the integration of Clad, a Clang/LLVM
+        compiler plugin for automatic differentiation (AD). We will develop a
+        custom C++ tensor library specifically designed for optimal interaction
+        with Clad. The core objective is to replace traditional runtime or
+        manual gradient computations with Clad's efficient compile-time
+        differentiation for key LLM operations within a GPT-2 training pipeline.
+        This involves investigating effective strategies to bridge Clad's static
+        analysis with dynamic neural network computations, benchmarking the
+        resulting performance gains in speed and memory usage against a non-Clad
+        baseline, and leveraging OpenMP for further parallelization.
+      proposal: /assets/docs/Rohan_Timmaraju_Proposal_2025.pdf
+      mentors: Vassil Vassilev, David Lange, Jonas Rembser, Christina Koutsou
+
 - name: Abdelrhman Elrawy
   photo: Abdelrhman.jpg
   info: "Google Summer of Code 2025 Contributor"

_pages/team/rohan-timmaraju.md

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
---
title: "Compiler Research - Team - Rohan Timmaraju"
layout: gridlay
excerpt: "Compiler Research: Team members"
sitemap: false
permalink: /team/RohanTimmaraju

---

{% include team-profile.html %}
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
---
title: "Advanced symbol resolution and re-optimization for Clang-Repl"
layout: post
excerpt: "Advanced symbol resolution and re-optimization for Clang-Repl is a Google Summer of Code 2025 project. It aims to improve Clang-Repl and ORC JIT by adding support for automatically loading dynamic libraries when symbols are missing, removing the need for users to load libraries manually."
sitemap: false
author: Sahil Patidar
permalink: blogs/gsoc25_sahil_introduction_blog/
banner_image: /images/blog/gsoc_clang_repl.jpeg
date: 2025-05-18
tags: gsoc LLVM clang-repl ORC-JIT auto-loading
---

### Introduction

I am Sahil Patidar, a student participating in Google Summer of Code 2025. I will be
working on the project "Advanced symbol resolution and re-optimization for Clang-Repl".

**Mentors**: Vassil Vassilev

### Overview of the Project

[Clang-Repl](https://clang.llvm.org/docs/ClangRepl.html) is a powerful interactive C++ interpreter that leverages LLVM's ORC JIT to support incremental compilation and execution. Currently, users must manually load dynamic libraries when their code references external symbols, as Clang-Repl lacks the ability to automatically resolve symbols from dynamic libraries.
To address this limitation, we propose a solution to enable **auto-loading of dynamic libraries for unresolved symbols** within ORC JIT, which is central to Clang-Repl's runtime infrastructure.

Another part of this project is to add **re-optimization support** to Clang-Repl. Currently, Clang-Repl cannot optimize hot functions at runtime. With this feature, Clang-Repl will be able to detect frequently called functions and re-optimize them once a runtime call threshold is exceeded.

### Objectives

* Implement **auto-loading** of dynamic libraries in ORC JIT.
* Add **re-optimization support** to Clang-Repl for hot functions.

### Implementation Details and Plans

The primary objective of this project is to enable **automatic loading of dynamic libraries for unresolved symbols** in Clang-Repl. Since Clang-Repl relies heavily on LLVM's **ORC JIT** for incremental compilation and execution, our work focuses on extending ORC JIT to support this capability in an out-of-process execution environment.

Currently, ORC JIT handles dynamic library symbol resolution through the `DynamicLibrarySearchGenerator`, which is registered for each loaded dynamic library. This generator is responsible for symbol lookup and interacts with the **Executor Process Control** layer to resolve symbols during execution. Specifically, it uses a `DylibHandle` to identify which dynamic library to search for the unresolved symbol. On the executor side, the `SimpleExecutorDylibManager` API performs the actual lookup using this handle.

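For context, here is a rough sketch of what that per-library registration looks like today with ORC's in-process `LLJIT` APIs; the library name is illustrative (Linux-specific) and error handling is reduced to `ExitOnError`. This is exactly the manual step the project aims to make unnecessary.

```cpp
// Sketch only: registering one DynamicLibrarySearchGenerator per library by hand.
#include "llvm/ExecutionEngine/Orc/ExecutionUtils.h"
#include "llvm/ExecutionEngine/Orc/LLJIT.h"
#include "llvm/Support/TargetSelect.h"

using namespace llvm;
using namespace llvm::orc;

int main() {
  InitializeNativeTarget();
  InitializeNativeTargetAsmPrinter();
  ExitOnError ExitOnErr;

  auto J = ExitOnErr(LLJITBuilder().create());
  auto &JD = J->getMainJITDylib();

  // One generator per dynamic library, added explicitly by the user.
  // "libm.so.6" is just an example; any shared library works the same way.
  JD.addGenerator(ExitOnErr(DynamicLibrarySearchGenerator::Load(
      "libm.so.6", J->getDataLayout().getGlobalPrefix())));

  // Lookups in this JITDylib can now fall back to symbols exported by libm.
  auto Sym = ExitOnErr(J->lookup("cos"));
  (void)Sym;
  return 0;
}
```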
To support **auto-loading in out-of-process execution**, Lang Hames proposed a design involving two new components:

* **`ExecutorResolver` API**: This is an abstract interface for resolving symbols on the executor side (see the sketch after this list). It can be implemented in different ways, for example:

  * `PerDylibResolver`, which wraps a native handle for a specific library.
  * `AutoLoadDylibResolver`, which attempts to load libraries automatically when a symbol is unresolved.

  The `SimpleExecutorDylibManager` will be responsible for creating and managing these resolvers, returning a `ResolverHandle` instead of the traditional `DylibHandle`.

* **`ExecutorSymbolResolutionGenerator`**: This generator replaces the existing `EPCDynamicLibrarySearchGenerator` for out-of-process execution. Unlike the previous design, which relied on `DylibHandle`, this generator will use the new `ResolverHandle` to resolve symbols via the `ResolverHandle->resolve()` interface.

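Since these components are still at the design stage, the snippet below is only a rough sketch of what the executor-side interface could look like; the result type, callback shape, and member details are assumptions made for illustration, not the actual ORC code.

```cpp
// Hypothetical executor-side resolver interface (names beyond ExecutorResolver,
// PerDylibResolver, and AutoLoadDylibResolver are invented for this sketch).
#include <cstdint>
#include <dlfcn.h>
#include <functional>
#include <string>
#include <vector>

// One address per requested symbol; 0 means "not found".
using ResolveResult = std::vector<uint64_t>;

// Abstract executor-side resolver: map symbol names to addresses.
class ExecutorResolver {
public:
  virtual ~ExecutorResolver() = default;
  virtual void resolveAsync(const std::vector<std::string> &Symbols,
                            std::function<void(ResolveResult)> OnResolved) = 0;
};

// Resolver bound to one already-loaded library (wraps its native handle).
class PerDylibResolver : public ExecutorResolver {
  void *Handle; // e.g. obtained from dlopen
public:
  explicit PerDylibResolver(void *H) : Handle(H) {}
  void resolveAsync(const std::vector<std::string> &Symbols,
                    std::function<void(ResolveResult)> OnResolved) override {
    ResolveResult R;
    for (const auto &S : Symbols)
      R.push_back(reinterpret_cast<uintptr_t>(dlsym(Handle, S.c_str())));
    OnResolved(std::move(R));
  }
};

// An AutoLoadDylibResolver would instead search candidate libraries, dlopen
// the one that exports the missing symbol, and then resolve it.
```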
In out-of-process execution, **per-library lookup** requires an RPC call for each dynamic library when resolving a symbol. If the symbol is in the **(N-1)th** library, **N-1 RPC calls** are made, introducing significant overhead.
In **auto-loading mode**, only one RPC call is made, but it scans all libraries, which is also inefficient if the symbol is missing.

To reduce this overhead, we propose using a **Bloom filter** to quickly check for symbol presence in both modes before making costly lookups. The main challenge lies in designing an efficient and accurate filtering approach.

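To make the idea concrete, here is a minimal, self-contained sketch of such a filter, built from a library's exported symbol names and consulted before any expensive lookup; the bit-array size and hash functions are arbitrary choices for illustration, not what ORC would necessarily use.

```cpp
// Minimal Bloom-filter sketch for "might this library export this symbol?".
#include <bitset>
#include <cstdint>
#include <functional>
#include <string>

class SymbolBloomFilter {
  static constexpr std::size_t NumBits = 1 << 16; // 64 Kbit, illustrative
  std::bitset<NumBits> Bits;

  static std::size_t hash1(const std::string &S) {
    return std::hash<std::string>{}(S) % NumBits;
  }
  static std::size_t hash2(const std::string &S) {
    // FNV-1a, used here only to get a second independent hash.
    std::uint64_t H = 1469598103934665603ull;
    for (unsigned char C : S) { H ^= C; H *= 1099511628211ull; }
    return static_cast<std::size_t>(H % NumBits);
  }

public:
  void add(const std::string &Sym) { Bits.set(hash1(Sym)); Bits.set(hash2(Sym)); }

  // false means "definitely not exported"; true means "maybe", so only then
  // do we pay for the RPC / dlsym lookup.
  bool mayContain(const std::string &Sym) const {
    return Bits.test(hash1(Sym)) && Bits.test(hash2(Sym));
  }
};
```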
The second goal of this project is to add **re-optimization support** for Clang-Repl. Since ORC JIT is the core component used by Clang-Repl for runtime compilation and execution, we will build on its existing capabilities. ORC JIT supports runtime re-optimization using the `ReOptimizeLayer` and `RedirectableManager`.

At a high level, the `ReOptimizeLayer` emits boilerplate "sugar" code into the IR module. This code triggers a call to `__orc_rt_reoptimize_tag` when a threshold count is exceeded. This call is handled by `ReOptimizeLayer::rt_reoptimize`, which is triggered by the ORC runtime to generate an optimized version of a "hot" function. The `RedirectableManager` then updates the function's stub pointer to point to the new optimized version. To achieve this, we will implement a custom `ReOptFunc`. If runtime profiling is needed to detect hot functions, we may also need to make small changes to the ORC runtime to collect this data.

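The following stand-alone C++ sketch mimics that counter-and-redirect flow without using the real `ReOptimizeLayer` or ORC runtime APIs; the function names and the threshold value are invented purely for illustration.

```cpp
// Conceptual model: a per-function call counter plus a redirectable stub.
#include <atomic>
#include <cstdio>

using FnPtr = int (*)(int);

int slowVersion(int X) { return X * 2; }  // initial, unoptimized body
int fastVersion(int X) { return X << 1; } // stand-in for the re-optimized body

static std::atomic<FnPtr> Stub{slowVersion};   // callers always go through the stub
static std::atomic<unsigned> CallCount{0};
constexpr unsigned ReoptThreshold = 1000;

void reoptimize() {            // plays the role of the rt_reoptimize handler
  Stub.store(fastVersion);     // redirect the stub to the optimized version
  std::puts("function re-optimized");
}

int instrumentedEntry(int X) { // the emitted "sugar" around the function body
  if (CallCount.fetch_add(1) + 1 == ReoptThreshold)
    reoptimize();
  return Stub.load()(X);
}

int main() {
  long Sum = 0;
  for (int I = 0; I < 2000; ++I)
    Sum += instrumentedEntry(I);
  std::printf("sum = %ld\n", Sum);
}
```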
### Conclusion

Upon completion of this project, ORC JIT will gain the ability to **automatically load dynamic libraries** to resolve previously unresolved symbols. Additionally, the integration of **filter-based optimizations** on the controller side will significantly reduce the overhead of unnecessary RPC calls.
Overall, this work enhances the flexibility and performance of ORC JIT and improves the user experience in tools like Clang-Repl that rely on it.

### Related Links

- [LLVM Repository](https://github.com/llvm/llvm-project)
- [Project Description](https://discourse.llvm.org/t/gsoc2025-advanced-symbol-resolution-and-reoptimization-for-clang-repl/84624/3)
- [My GitHub Profile](https://github.com/SahilPatidar)

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
---
title: "Enhancing LLM Training Efficiency with Clad for Automatic Differentiation"
layout: post
excerpt: "This GSoC project leverages Clad to optimize LLM training in C++, aiming to boost efficiency by developing a custom tensor library and integrating Clad for compiler-level gradient calculations."
sitemap: true
author: Rohan Timmaraju
permalink: blogs/gsoc25_rohan_introduction_blog/
banner_image: /images/blog/LLM_project_banner.jpg
date: 2025-05-21
tags: gsoc c++ clang clad llm
---

### Introduction

I am Rohan Timmaraju, a Computer Science student at Columbia University. During Google Summer of Code 2025, I will be working on the "Enhancing LLM Training Efficiency with Clad for Automatic Differentiation" project with the Compiler Research group.

**Mentors**: Vassil Vassilev, David Lange, Jonas Rembser, Christina Koutsou

### About LLM Training

Large Language Models (LLMs) like ChatGPT have revolutionized AI, but their training is incredibly computationally intensive. Currently, Python-based frameworks such as PyTorch and TensorFlow are the go-to tools. While they offer excellent flexibility and a rich ecosystem, their reliance on interpreted execution and dynamic computation graphs can lead to performance bottlenecks and high memory consumption. This is particularly noticeable when we consider deploying or training these models in resource-constrained environments or within C++-centric high-performance computing (HPC) setups, which are common in scientific research.

While C++ provides the tools for fine-grained control over system resources and has proven its capabilities in efficient LLM inference (as seen with projects like [llama.cpp](https://github.com/ggml-org/llama.cpp)), the critical component for *training*, namely flexible and efficient Automatic Differentiation (AD), remains an ongoing challenge for C++ solutions.

### Why Use Clad?

This project proposes to tackle this challenge by integrating Clad, an Automatic Differentiation plugin for the Clang compiler. Unlike traditional AD libraries that often operate at runtime, Clad performs source-to-source transformation. It analyzes the C++ Abstract Syntax Tree (AST) at compile time and generates optimized C++ code for computing derivatives. This compiler-level approach has the potential to reduce runtime overhead and improve memory efficiency compared to dynamic methods.

To facilitate this integration, I am developing a custom C++ tensor library to be used in neural network training. Inspired by the powerful approaches of libraries such as [llm.c](https://github.com/karpathy/llm.c) and [pytorch](https://docs.pytorch.org/cppdocs/), this library is being designed from the ground up with Clad compatibility in mind. The core idea is to replace manual or internally managed gradient computations with Clad's reverse-mode AD (as in `clad::gradient`) for key LLM operations like matrix multiplications, activation functions, normalization layers, and the final loss function.

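As a small, self-contained illustration of the kind of call the tensor library is being designed around, the toy example below differentiates a scalar squared-error loss with `clad::gradient`. It assumes a Clang invocation with the Clad plugin enabled, and the function is a stand-in for real LLM operations rather than anything from the project code.

```cpp
// Toy example: reverse-mode AD of a squared-error "loss" with Clad.
// Build with Clang and the Clad plugin, e.g.:
//   clang++ -fplugin=/path/to/libclad.so -I<clad>/include example.cpp
#include "clad/Differentiator/Differentiator.h"
#include <cstdio>

double loss(double w, double b, double x, double y) {
  double pred = w * x + b; // a one-parameter "linear layer"
  double diff = pred - y;
  return diff * diff;      // squared error
}

int main() {
  // Generate d(loss)/dw and d(loss)/db at compile time.
  auto dloss = clad::gradient(loss, "w, b");

  double dw = 0.0, db = 0.0;
  dloss.execute(/*w=*/1.0, /*b=*/0.0, /*x=*/2.0, /*y=*/3.0, &dw, &db);

  // Expected: dw = 2*(w*x+b-y)*x = -4, db = 2*(w*x+b-y) = -2.
  std::printf("dw = %f, db = %f\n", dw, db);
}
```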
### Implementation Plan

1. **Foundation & Baseline:** We will start by implementing a complete GPT-2 training loop in C++ *without* Clad. This will serve as our performance baseline. GPT-2 is chosen here as a relatively simple open-source LLM architecture capable of being trained on local devices, and the approach could later be extended to other architectures like Llama or Mistral.
2. **Core Clad Integration Strategy:** We will investigate and evaluate different strategies for applying Clad to tensor network gradient calculations, also identifying areas where Clad itself could be enhanced for deep learning workloads.
3. **Expanding Integration:** Once a promising strategy is identified and validated on simpler operations, we'll systematically integrate Clad into more complex components of the GPT-2 architecture.
4. **Benchmarking & Optimization:** Benchmarking against our baseline will be crucial to quantify the performance gains (speed, memory). We'll also use profiling tools to identify bottlenecks and optimize the tensor library with Clad. OpenMP may be employed for parallelization to further boost performance.
5. **Documentation & Potential Extensions:** Thorough documentation of the tensor library, the Clad integration process, and our findings will also be a primary focus. Time permitting, we'll explore extending this work to other LLM architectures like Llama.

### Conclusion

By successfully integrating Clad into a C++ LLM training pipeline, we aim to:
* **Demonstrate Performance Gains:** Show tangible improvements in training speed and memory efficiency.
* **Clad for ML:** Provide a significant real-world use case, potentially identifying areas for Clad's improvement in supporting ML tasks.
* **Offer a C++ Alternative:** Provide a foundation for more efficient, compiler-driven LLM training within the C++ ecosystem.
* **Learn and Share:** Gain insights into the practicalities of applying compiler-based AD to complex ML problems and share these learnings with the community.

I believe this project has the potential to make a valuable contribution to both the compiler research field and the ongoing efforts to make powerful AI models more accessible and efficient to train.

### Related Links

- [Project Description](https://hepsoftwarefoundation.org/gsoc/2025/proposal_Clad-LLM.html)
- [Clad Repository](https://github.com/vgvassilev/clad)
- [My GitHub Profile](https://github.com/Rohan-T144)

187 KB
Binary file not shown.

images/blog/LLM_project_banner.jpg

354 KB

images/blog/gsoc_clang_repl.jpeg

99 KB

images/team/Rohan_Timmaraju.jpg

294 KB
