Jank Lisp Compiler IR: How Custom Intermediate Representation Works in 2025
What Is Jank and Why It Needed a Custom IR
Jank's Position as a Clojure-Compatible Language on LLVM
Jank is a Clojure-dialect language that compiles to native binaries via LLVM. Maintained by Jeaye Wilkerson, jank targets full Clojure compatibility while delivering native performance — no JVM, no startup lag, real C++ interop. If you've wanted to write idiomatic (reduce + coll) and ship a 5ms-startup CLI binary, jank is the project to watch. As of 2025, it's in active alpha development, with the compiler architecture undergoing significant changes — the most important of which is the introduction of a custom intermediate representation (IR).
The Problem With Relying Solely on Clang/LLVM IR Too Early
Before its custom IR, jank lowered Clojure semantics directly into a Clang C++ AST using libClang. This worked, but it was a dead end for optimization. Clang's AST has no concept of persistent vectors, var indirection, or lazy sequences. Every optimization had to happen after the Clojure-level semantics were already destroyed. You couldn't look at a Clang AST node and say "this branch is dead because the condition is a Clojure literal" — the information was gone. Lowering directly to C++ also meant that the compiler couldn't apply Lisp-aware transformations: constant folding of keyword literals, inline caching at var call sites, or escape analysis of persistent data structures.
What a Compiler IR Actually Does and Why It Matters
A compiler IR is the internal language a compiler uses between parsing your source and generating machine code. Think of it as a structured, queryable representation of your program that sits at a comfortable distance from both "human-readable source" and "CPU instructions." The key property of a good IR is that it's lowered enough to enable mechanical transformation but high-level enough to preserve the semantics that make your optimizations meaningful. For jank, that means an IR that knows what a defn, a var deref, and a persistent map are — so optimization passes can reason about them directly rather than reverse-engineering them from C++ pointer arithmetic.
Core Concepts: Jank's IR Design Philosophy
The Layered Architecture
Jank's compilation pipeline has six stages:
| Stage | Input | Output |
|---|---|---|
| Reader | Source text | S-expressions |
| Macro expander | S-expressions | Expanded S-expressions |
| Semantic analyzer | Expanded forms | Typed AST |
| Jank IR lowering | Typed AST | Jank IR node tree |
| LLVM codegen | Jank IR node tree | LLVM IR |
| LLVM backend | LLVM IR | Native binary |
The jank IR layer is where all Clojure-aware optimization passes run. Once you leave that layer and enter LLVM IR, you're in generic SSA territory and all knowledge of PersistentVector semantics is gone.
SSA Form and Why Jank's IR Is Not SSA (Yet)
Static Single Assignment (SSA) form means every variable is assigned exactly once and every use refers to a unique definition — a prerequisite for many classical optimizations like value numbering and copy propagation. Jank's IR is currently not in SSA form. This is a deliberate deferral: SSA construction requires dominator-tree analysis and phi-node insertion, which is non-trivial infrastructure to build correctly. Instead, jank relies on LLVM's own SSA-based optimization passes for lower-level transforms, while keeping the jank IR simpler and easier to extend during the current alpha phase. SSA is listed as a planned future enhancement.
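To make the "assigned exactly once" property concrete, here is a minimal sketch of SSA renaming for straight-line code only. It is not taken from jank: the `instr` struct and `to_ssa` function are hypothetical names, the operator is omitted for brevity, and control flow (and therefore dominator analysis and phi nodes, the expensive part mentioned above) is deliberately left out.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical three-address instruction: dst = lhs <op> rhs (operator omitted).
struct instr {
  std::string dst, lhs, rhs;
};

// Rename a straight-line sequence so every destination is assigned exactly once.
// Branches are not handled, so no phi nodes are ever needed here.
std::vector<instr> to_ssa(std::vector<instr> const& code) {
  std::map<std::string, int> version; // latest version number per variable

  auto use = [&](std::string const& name) {
    auto it = version.find(name);
    // Names never assigned in this sequence (parameters, literals) stay as-is.
    return it == version.end() ? name : name + "." + std::to_string(it->second);
  };

  std::vector<instr> out;
  for (auto const& i : code) {
    assert(!i.dst.empty() && "every instruction needs a destination");
    // Operands are renamed before the destination receives its new version.
    std::string lhs = use(i.lhs);
    std::string rhs = use(i.rhs);
    int v = ++version[i.dst];
    out.push_back({i.dst + "." + std::to_string(v), lhs, rhs});
  }
  return out;
}
```

Feeding it `x = a, b`, then `x = x, c`, then `y = x, x` yields destinations `x.1`, `x.2`, and `y.1`, with the last instruction reading `x.2` twice. Once branches enter the picture, two versions of `x` can reach the same use, which is exactly where phi-node insertion becomes necessary.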
The Expression-Based Node Hierarchy
Every construct in jank IR is an expression node — there are no statements. This mirrors Clojure's semantics, where everything returns a value. The primary node types currently include:
- expression_node: base type for all IR nodes
- def_node: top-level var definition
- fn_node: function with arity cases
- do_node: sequential expression list, returns last
- if_node: conditional with test, then, and else branches
- let_node: local bindings
- invoke_node: function/var call site
- var_deref_node: var lookup with namespace metadata
- native_raw_node: escape hatch for inline C++ interop
- primitive_node: typed literal (integer, float, boolean, keyword, nil)
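An expression-only hierarchy like this can be modeled as a variant of node structs. The following is a toy sketch with three of the node kinds. The field names, `make_expr`, and `dump` are assumptions made for illustration and may not match jank's actual definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <variant>
#include <vector>

struct expression; // forward declaration so nodes can hold child pointers
using expression_ptr = std::shared_ptr<expression const>;

// Three toy node kinds; every one of them is an expression.
struct primitive_node { std::string type, value; };
struct if_node { expression_ptr test, then_branch, else_branch; };
struct do_node { std::vector<expression_ptr> expressions; };

struct expression {
  std::variant<primitive_node, if_node, do_node> data;
};

template <typename Node>
expression_ptr make_expr(Node node) {
  return std::make_shared<expression>(expression{std::move(node)});
}

std::string dump(expression const& e);

// One overload per node kind; std::visit picks the right one.
struct dumper {
  std::string operator()(primitive_node const& n) const {
    return "primitive_node { type: " + n.type + ", value: " + n.value + " }";
  }
  std::string operator()(if_node const& n) const {
    assert(n.test && n.then_branch && n.else_branch);
    return "if_node { test: " + dump(*n.test) + ", then: " + dump(*n.then_branch)
         + ", else: " + dump(*n.else_branch) + " }";
  }
  std::string operator()(do_node const& n) const {
    std::string out = "do_node { expressions: [";
    for (std::size_t i = 0; i < n.expressions.size(); ++i) {
      out += (i == 0 ? " " : ", ") + dump(*n.expressions[i]);
    }
    return out + " ] }";
  }
};

std::string dump(expression const& e) { return std::visit(dumper{}, e.data); }
```

Because there is no separate statement type, a pass only ever needs one visitor shape: every overload takes a node and every child is another expression.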
Immutability and Persistent Data Structures in IR
Because Clojure's data structures are persistent and immutable, jank's IR can make stronger aliasing assumptions than a C++ compiler could. An IR pass that sees two references to the same PersistentHashMap node can freely share them without defensive copies. This is currently used conservatively, but it's the foundation for future escape analysis: if a map created in a function never escapes to a var or closure, it can be stack-allocated.
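The sharing argument is easiest to see on a much simpler persistent structure than a hash map. The sketch below uses an immutable cons list. The names `cell`, `cons`, `first`, and `sum` are illustrative only and have nothing to do with jank's runtime types.

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct cell;
using list = std::shared_ptr<cell const>; // const: cells never change after creation

struct cell {
  int head;
  list tail;
};

// "Adding" an element builds one new cell that points at the existing tail.
list cons(int head, list tail) {
  return std::make_shared<cell>(cell{head, std::move(tail)});
}

int first(list const& l) {
  assert(l && "first of an empty list");
  return l->head;
}

int sum(list const& l) {
  int total = 0;
  for (cell const* p = l.get(); p != nullptr; p = p->tail.get()) {
    total += p->head;
  }
  return total;
}
```

Two lists built with `cons(1, base)` and `cons(10, base)` point at the very same `base` cells. Since nothing can mutate those cells, neither list can observe the other, which is the property that lets a pass share references without defensive copies.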
Quick Start: Reading and Understanding Jank IR Output
Building Jank From Source With IR Debug Output Enabled
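Jank's build system uses CMake. To enable IR dump output, configure the build with the jank_ir_debug option turned on. The command below also turns on the jank_tests option so the test suite is built alongside the compiler: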
# Clone the repository
git clone https://github.com/jank-lang/jank.git
cd jank
# Configure with debug IR output enabled
cmake -B build \
-DCMAKE_BUILD_TYPE=Debug \
-Djank_ir_debug=ON \
-Djank_tests=ON
cmake --build build -- -j$(nproc)
Once built, you can invoke the jank compiler with the IR dump flag:
./build/jank compile --dump-ir path/to/your_file.jank
Dumping IR for a Simple Clojure Expression
Create a minimal source file:
;; add.jank
(defn add [a b]
(+ a b))
Run the compiler with IR dump:
./build/jank compile --dump-ir add.jank
Navigating the IR Node Tree in Debug Output
Expected console output for add.jank looks like this:
[jank IR dump]
def_node {
name: user/add
value: fn_node {
arities: [
fn_arity_node {
params: [
local_node { name: a, type: any }
local_node { name: b, type: any }
]
body: do_node {
expressions: [
invoke_node {
fn: var_deref_node { var: clojure.core/+ }
args: [
local_ref_node { name: a }
local_ref_node { name: b }
]
call_type: dynamic_var
}
]
}
}
]
}
}
Read this top-down: a def_node wraps a fn_node, which has one arity case. The arity's body is a do_node (implicit in Clojure function bodies) containing a single invoke_node. Notice call_type: dynamic_var, which means the call to + still goes through var indirection at this stage. The type inference and planned inline caching passes are meant to annotate this as a direct call later, eliminating the runtime lookup.
Use Case 1: Constant Folding and Dead Code Elimination via the IR
How the IR Enables Compile-Time Constant Evaluation
When a conditional's test expression is a compile-time constant, the IR optimization pass can eliminate the dead branch entirely before LLVM ever sees the code. This was impossible in the previous architecture because by the time jank emitted Clang AST, literal true had already been converted into a clang::CXXBoolLiteralExpr node buried inside condition-checking boilerplate, and Clang's optimization passes don't interpret that context as "this if is always taken."
Dead Code Elimination: Before and After
Consider:
;; constant-if.jank
(defn always-a []
(if true :a :b))
Unoptimized IR (before constant folding pass):
fn_node {
arities: [
fn_arity_node {
params: []
body: do_node {
expressions: [
if_node {
test: primitive_node { type: boolean, value: true }
then: primitive_node { type: keyword, value: :a }
else: primitive_node { type: keyword, value: :b } ; <-- dead branch
}
]
}
}
]
}
Optimized IR (after constant-folding pass eliminates dead branch):
fn_node {
arities: [
fn_arity_node {
params: []
body: do_node {
expressions: [
primitive_node { type: keyword, value: :a } ; if_node replaced entirely
]
}
}
]
}
The entire if_node is replaced with just the then branch. LLVM never has to reason about a conditional at all — the codegen emits a single keyword literal. This kind of optimization is only possible when you have an IR layer that understands Clojure's true as a primitive constant and not as a C++ bool produced by a runtime call to RT::booleanCast().
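The transformation above can be sketched as a small recursive pass. This is a toy model under stated assumptions, not jank's pass: `literal`, `local_ref`, `if_expr`, `make`, and `fold` are hypothetical names, and the only semantic rule borrowed from Clojure is that nil and false are the sole falsey values.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <variant>

struct expr;
using expr_ptr = std::shared_ptr<expr const>;

struct literal { std::string type, value; }; // e.g. {"boolean", "true"}
struct local_ref { std::string name; };
struct if_expr { expr_ptr test, then_branch, else_branch; };

struct expr {
  std::variant<literal, local_ref, if_expr> data;
};

template <typename Node>
expr_ptr make(Node n) {
  return std::make_shared<expr>(expr{std::move(n)});
}

// Clojure truthiness: only nil and false are falsey.
bool is_truthy(literal const& l) {
  return !(l.type == "nil" || (l.type == "boolean" && l.value == "false"));
}

// Replace every if whose test folds to a literal with the branch that is taken.
expr_ptr fold(expr_ptr const& e) {
  assert(e && "fold expects a non-null expression");
  auto const* branch = std::get_if<if_expr>(&e->data);
  if (branch == nullptr) {
    return e; // literals and locals are left untouched
  }
  // Fold children first so nested constant conditionals collapse too.
  expr_ptr test = fold(branch->test);
  expr_ptr then_b = fold(branch->then_branch);
  expr_ptr else_b = fold(branch->else_branch);
  if (auto const* lit = std::get_if<literal>(&test->data)) {
    return is_truthy(*lit) ? then_b : else_b;
  }
  return make(if_expr{test, then_b, else_b});
}
```

Folding `(if true :a :b)` returns the `:a` node itself, while a test that is a local reference leaves the conditional in place. Note that a keyword literal in test position also folds to the then branch, because any non-nil, non-false literal is truthy.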
Use Case 2: Inline Caching and Call Site Optimization
How Dynamic Dispatch Creates IR-Level Challenges
In standard Clojure on the JVM, calling a var like (+ a b) involves a var lookup, a deref to get the current value, a cast to IFn, and then an interface call to IFn.invoke (an invokeinterface instruction, not invokedynamic). The JVM's JIT handles this efficiently after warmup via inline caches. In jank, without a JIT, every dynamic var call has full indirection overhead at runtime unless the compiler can resolve and hardcode the call target at compile time. The IR is the place to do this, because it still has namespace and var metadata attached to each call site.
Representing Polymorphic Call Sites in the Custom IR
Here is how the same conceptual call — (+ a b) — appears in IR before and after type inference annotates the call site:
Unoptimized var-call (full runtime dispatch):
invoke_node {
call_type: dynamic_var
fn: var_deref_node {
var: clojure.core/+
ns: clojure.core
resolved: false
}
args: [ local_ref_node { name: a }, local_ref_node { name: b } ]
cache_slot: null
monomorphic: false
}
After type inference + inline cache annotation:
invoke_node {
call_type: direct_call
fn: fn_ref_node {
target: clojure.core/+__2 ; direct arity-2 fn pointer
resolved: true
}
args: [ local_ref_node { name: a }, local_ref_node { name: b } ]
cache_slot: ic_slot_3 ; inline cache slot allocated
monomorphic: true ; type inference proved single target
}
The cache_slot and monomorphic annotations are the hooks a future inline caching pass will use to emit a fast-path check at the call site: "if the var still holds the same function pointer I saw at compile time, jump directly; otherwise fall back to full dispatch." The IR preserves this metadata all the way to LLVM codegen, which is where the conditional branch would be emitted.
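The fast-path check described in that quote can be sketched in plain C++. Everything here is hypothetical: `var`, `call_site`, and `invoke` are stand-ins invented for illustration, and the hit counters exist only so the two paths can be observed.

```cpp
#include <cassert>

using fn_ptr = long (*)(long, long);

// Hypothetical stand-in for a var: a mutable slot holding the current function.
struct var {
  fn_ptr root;
};

// Hypothetical per-call-site cache: the target observed at compile time.
struct call_site {
  var const* callee;
  fn_ptr expected;
  int fast_hits = 0;
  int slow_hits = 0;
};

long invoke(call_site& site, long a, long b) {
  assert(site.callee != nullptr && site.callee->root != nullptr);
  fn_ptr current = site.callee->root;
  if (current == site.expected) {
    // Fast path: the var still holds the function seen at compile time.
    // A real compiler would emit a direct (and possibly inlined) call here.
    ++site.fast_hits;
    return site.expected(a, b);
  }
  // Slow path: the var was rebound, so fall back to calling whatever it holds.
  ++site.slow_hits;
  return current(a, b);
}

long add(long a, long b) { return a + b; }
long mul(long a, long b) { return a * b; }
```

With a var initialised to `add` and a call site expecting `add`, a call takes the fast path. Rebinding the var's `root` to `mul` makes the next call take the slow path and still produce the correct result, which is the safety property the cache check exists to guarantee.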
Use Case 3: Interop with C++ and LLVM Through the IR Boundary
How Native Interop Calls Are Represented as IR Nodes
Jank's C++ interop uses a native_raw_node — an escape hatch that embeds a string of C++ code or a typed call descriptor directly into the IR. This is structurally different from a regular invoke_node. The IR must carry explicit C++ type annotations because LLVM codegen can't infer them from Clojure's dynamic type system.
Here is a side-by-side comparison:
Pure-jank function call node:
invoke_node {
call_type: direct_call
fn: fn_ref_node {
target: myns/square
resolved: true
}
args: [ primitive_node { type: integer, value: 4 } ]
return_type: jank::object_ptr ; always object_ptr for jank fns
cache_slot: ic_slot_7
}
Native C++ interop call node:
;; jank source: call C++ sqrt
(defn native-sqrt [x]
(native/raw "std::sqrt(~{x})"))
native_raw_node {
source: "std::sqrt(~{x})"
interpolations: [
interp_binding { name: x, cpp_type: double }
]
return_type: cpp::double ; explicit C++ type, NOT object_ptr
requires_box: true ; result must be boxed to jank::object_ptr
inlineable: false ; cannot inline across boundary yet
}
Ensuring Type Safety at the IR Level
The requires_box flag tells LLVM codegen to emit a jank::box<double>() wrapper around the C++ return value before it's usable in jank code. The IR validates that every native_raw_node has explicit return_type and requires_box set — a missing annotation is a compile error at the IR validation pass, not a runtime crash. This is one of the places where having an IR layer pays an immediate safety dividend: the codegen never sees an untyped native return.
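A validation step of this kind can be sketched as follows. The `native_raw` struct, `validate`, and `emit_call` are hypothetical names, and `emit_call` only builds a string to show where the boxing wrapper would go. It is not how jank's LLVM codegen works internally.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Toy version of a native interop node; unset annotations are std::nullopt.
struct native_raw {
  std::string source;
  std::optional<std::string> return_type; // e.g. "double"
  std::optional<bool> requires_box;
};

// Collect one error per missing annotation instead of stopping at the first.
std::vector<std::string> validate(native_raw const& n) {
  std::vector<std::string> errors;
  if (!n.return_type.has_value()) {
    errors.push_back("native_raw_node is missing return_type");
  }
  if (!n.requires_box.has_value()) {
    errors.push_back("native_raw_node is missing requires_box");
  }
  return errors;
}

// Textual stand-in for codegen: wrap the call when the result must be boxed.
std::string emit_call(native_raw const& n) {
  assert(validate(n).empty() && "codegen must only see validated nodes");
  return *n.requires_box ? "jank::box<" + *n.return_type + ">(" + n.source + ")"
                         : n.source;
}
```

Running `validate` before `emit_call` turns a forgotten annotation into a list of compile-time diagnostics, while the assertion documents that codegen relies on the validation pass having run first.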
Current Limitations Across the Interop Boundary
Cross-boundary inlining is not implemented. If a jank function calls a native function that calls back into jank, the IR cannot currently represent the re-entrant type flow. The inlineable: false flag stops the inliner from attempting it, since an attempt would silently produce incorrect code. Escape analysis also stops at native_raw_node boundaries: the compiler conservatively assumes any pointer passed to native code escapes, which prevents stack allocation of objects that touch interop calls.
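The conservative rule for escape analysis can be written down in a few lines. This is a deliberately simplified sketch: the `use_kind` categories and both function names are assumptions, and a real analysis would work over the IR graph rather than a flat list of uses.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical ways an allocated object can be used inside one function.
enum class use_kind { local_read, stored_in_var, captured_by_closure, passed_to_native };

// True when a use lets the object outlive or leave the current function.
bool escapes_through(use_kind u) {
  switch (u) {
    case use_kind::local_read:
      return false;
    case use_kind::stored_in_var:
    case use_kind::captured_by_closure:
    case use_kind::passed_to_native: // conservative: native code is opaque
      return true;
  }
  assert(false && "unhandled use_kind");
  return true;
}

// Stack allocation is only safe when no use lets the object escape.
bool may_stack_allocate(std::vector<use_kind> const& uses) {
  return std::none_of(uses.begin(), uses.end(), escapes_through);
}
```

An object that is only read locally qualifies for stack allocation, but a single `passed_to_native` use disqualifies it, even if the native code never actually retains the pointer. That lost precision is the cost of treating the interop boundary as opaque.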
Limitations and Current Constraints of Jank's IR
Being direct: jank's IR is young. It's already valuable, but developers using or contributing to jank need a clear picture of what's there and what isn't.
No SSA Form: What That Means Practically
Without SSA, classical dataflow optimizations — copy propagation, global value numbering, loop-invariant code motion — cannot be applied at the jank IR level. These happen in LLVM instead, but LLVM operates on already-lowered code and lacks the Clojure-semantic context to make Lisp-aware decisions. For example, LLVM cannot tell that two var dereferences to an unmodified var are equivalent; without SSA-based alias analysis in jank IR, that optimization is missed entirely.
Optimization Pass Status
| Optimization | Supported in Jank IR Now | Planned | Handled by LLVM Instead |
|---|---|---|---|
| Constant folding | ✅ (basic literals) | SSA-based full pass | ✅ LLVM does post-lowering |
| Dead code elimination | ✅ (constant branches) | Full DCE with SSA | ✅ LLVM -O2 |
| Inline caching | ❌ (hooks only) | Q3 2025 | ❌ Not possible in LLVM |
| Escape analysis | ❌ | 2025/2026 | Partial (LLVM alloca promo) |
| Loop unrolling | ❌ | Post-SSA | ✅ LLVM -O3 |
| Recur/tail call opt | ✅ (trampoline) | Direct jump TCO | Partial |
| Type-based dispatch | ❌ | Post inline cache | ❌ Not possible in LLVM |
| Cross-interop inlining | ❌ | Undecided | ❌ Not possible |
Tooling Gaps
There is currently no interactive IR explorer. You get a textual dump and that's it — no graphical node viewer, no diff tool for before/after optimization passes. When an optimization pass produces incorrect IR, you're reading raw text output and mentally diffing it. GDB/LLDB work for debugging the compiler itself, but there's no equivalent of LLVM's opt -print-after-all pipeline visualization for jank IR yet.
Contributing to Jank's IR: Where to Start
Repository Structure: Where IR Code Lives
The jank repository is at https://github.com/jank-lang/jank. The IR-relevant code lives in:
jank/
├── compiler+runtime/
│ ├── include/jank/analyze/
│ │ └── expression.hpp # IR node type definitions
│ ├── src/jank/analyze/
│ │ ├── processor.cpp # Semantic analysis → IR lowering
│ │ └── expression.cpp # Node constructors and visitors
│ ├── src/jank/codegen/
│ │ └── llvm_processor.cpp # Jank IR → LLVM IR
│ └── test/jank/analyze/
│ └── *_test.cpp # IR-level tests
How to Add a New IR Node Type
Adding a node type follows a four-step pattern. Here's a sketch for adding a hypothetical loop_node:
Step 1 — Define the struct in expression.hpp:
// include/jank/analyze/expression.hpp
struct loop_node : expression_base {
native_vector<local_binding> bindings;
expression_ptr body;
// loop_node is always a tail position
static constexpr bool is_tail_recursive = true;
};
// Add to the expression variant:
using expression = std::variant<
  def_node,
  fn_node,
  do_node,
  if_node,
  let_node,
  loop_node, // <-- new entry
  invoke_node
  /* , ... remaining node types (no trailing comma before '>') */
>;
Step 2 — Add the visitor case in expression.cpp:
// src/jank/analyze/expression.cpp
template <typename Fn>
auto visit(expression const& expr, Fn&& fn) {
return std::visit(std::forward<Fn>(fn), expr);
}
// In the IR dump visitor:
// Returns void: this overload only prints, so it has nothing to return.
void operator()(loop_node const& n) const {
  fmt::print("loop_node {{\n");
  for (auto const& b : n.bindings) {
    fmt::print(" binding: {} = ...\n", b.name);
  }
  fmt::print(" body: {}\n", dump(n.body));
  fmt::print("}}\n");
}
Step 3 — Hook into the analysis pass in processor.cpp:
// src/jank/analyze/processor.cpp
expression_ptr processor::analyze_loop(list_ptr const& form, ctx& c) {
auto node = make_node<loop_node>();
node->bindings = analyze_bindings(form->data[1], c); // (loop [x 0] ...)
node->body = analyze_body(form->rest(2), c);
return node;
}
// Register in the special-form dispatch table:
special_forms["loop*"] = &processor::analyze_loop;
Step 4 — Add the codegen case in llvm_processor.cpp:
// src/jank/codegen/llvm_processor.cpp
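llvm::Value* llvm_processor::codegen(loop_node const& n) {
  auto* loop_bb = llvm::BasicBlock::Create(ctx, "loop", current_fn);
  auto* after_bb = llvm::BasicBlock::Create(ctx, "after_loop", current_fn);
  // Jump from the current block into the loop header
  builder.CreateBr(loop_bb);
  builder.SetInsertPoint(loop_bb);
  // ... emit bindings and the conditional branch back to loop_bb for recur
  auto* result = codegen(n.body);
  // Leave the loop so that later code is emitted into after_bb
  builder.CreateBr(after_bb);
  builder.SetInsertPoint(after_bb);
  return result;
}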
Running the IR Test Suite
# Run all IR/analyze tests
ctest --test-dir build -R analyze --output-on-failure
# Run a specific IR test file
./build/compiler+runtime/test/jank_test \
--gtest_filter="analyze.constant_folding.*"
New IR-level tests belong in compiler+runtime/test/jank/analyze/. Each test should construct IR nodes directly (bypassing the reader/analyzer pipeline), run an optimization pass, and assert on the resulting node tree structure. Look at existing if_node tests for the pattern — they're the most complete examples of IR-level unit testing in the codebase right now.
If you're a Clojure developer who wants to contribute without writing C++, the best entry point is writing jank source test cases that exercise edge cases in existing optimization passes — these live in compiler+runtime/test/jank/ as .jank files and get compiled and checked by the test runner automatically.