Jank Lisp Compiler IR: How Custom Intermediate Representation Works in 2025

What Is Jank and Why It Needed a Custom IR

Jank's Position as a Clojure-Compatible Language on LLVM

Jank is a Clojure-dialect language that compiles to native binaries via LLVM. Maintained by Jeaye Wilkerson, jank targets full Clojure compatibility while delivering native performance — no JVM, no startup lag, real C++ interop. If you've wanted to write idiomatic (reduce + coll) and ship a 5ms-startup CLI binary, jank is the project to watch. As of 2025, it's in active alpha development, with the compiler architecture undergoing significant changes — the most important of which is the introduction of a custom intermediate representation (IR).

The Problem With Relying Solely on Clang/LLVM IR Too Early

Before its custom IR, jank lowered Clojure semantics directly into a Clang C++ AST using libClang. This worked, but it was a dead end for optimization. Clang's AST has no concept of persistent vectors, var indirection, or lazy sequences. Every optimization had to happen after the Clojure-level semantics were already destroyed. You couldn't look at a Clang AST node and say "this branch is dead because the condition is a Clojure literal" — the information was gone. Lowering directly to C++ also meant that the compiler couldn't apply Lisp-aware transformations: constant folding of keyword literals, inline caching at var call sites, or escape analysis of persistent data structures.

What a Compiler IR Actually Does and Why It Matters

A compiler IR is the internal language a compiler uses between parsing your source and generating machine code. Think of it as a structured, queryable representation of your program that sits at a comfortable distance from both "human-readable source" and "CPU instructions." The key property of a good IR is that it's lowered enough to enable mechanical transformation but high-level enough to preserve the semantics that make your optimizations meaningful. For jank, that means an IR that knows what a defn, a var deref, and a persistent map are — so optimization passes can reason about them directly rather than reverse-engineering them from C++ pointer arithmetic.


Core Concepts: Jank's IR Design Philosophy

The Layered Architecture

Jank's compilation pipeline has six stages:

| Stage | Input | Output |
|---|---|---|
| Reader | Source text | S-expressions |
| Macro expander | S-expressions | Expanded S-expressions |
| Semantic analyzer | Expanded forms | Typed AST |
| Jank IR lowering | Typed AST | Jank IR node tree |
| LLVM codegen | Jank IR node tree | LLVM IR |
| LLVM backend | LLVM IR | Native binary |

The jank IR layer is where all Clojure-aware optimization passes run. Once you leave that layer and enter LLVM IR, you're in generic SSA territory and all knowledge of PersistentVector semantics is gone.

SSA Form and Why Jank's IR Is Not SSA (Yet)

Static Single Assignment (SSA) form means every variable is assigned exactly once and every use refers to a unique definition, which greatly simplifies classical optimizations like value numbering and copy propagation. Jank's IR is currently not in SSA form. This is a deliberate deferral: SSA construction requires dominator-tree analysis and phi-node insertion, which is non-trivial infrastructure to build correctly. Instead, jank relies on LLVM's own SSA-based optimization passes for lower-level transforms, while keeping the jank IR simpler and easier to extend during the current alpha phase. SSA is listed as a planned future enhancement.
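To make the "assigned exactly once" property concrete, here is a minimal C++ sketch of SSA renaming for straight-line code only. The `assign` struct and `to_ssa` function are hypothetical names invented for this illustration and are not part of jank. The sketch deliberately ignores branches, which is exactly where the dominator-tree analysis and phi-node insertion mentioned above would come in.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy instruction: target = f(operands...). Not a jank type.
struct assign
{
  std::string target;
  std::vector<std::string> operands;
};

// Rename straight-line code so that every target is assigned exactly once.
// Names that are read before any assignment get version 0.
std::vector<assign> to_ssa(std::vector<assign> const &code)
{
  std::map<std::string, int> version;
  auto const current = [&](std::string const &name) {
    return name + std::to_string(version[name]);
  };

  std::vector<assign> out;
  for(auto const &a : code)
  {
    assign renamed;
    // Uses refer to the most recent definition of each operand.
    for(auto const &op : a.operands)
    {
      renamed.operands.push_back(current(op));
    }
    // Each assignment creates a fresh version of its target.
    ++version[a.target];
    renamed.target = current(a.target);
    out.push_back(std::move(renamed));
  }
  return out;
}
```

Feeding it `x = a` followed by `x = x + b` yields `x1 = a0` and `x2 = x1 + b0`: the second use of `x` now points unambiguously at the first definition.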

The Expression-Based Node Hierarchy

Every construct in jank IR is an expression node — there are no statements. This mirrors Clojure's semantics, where everything returns a value. The primary node types currently include:

  • expression_node — base type for all IR nodes
  • def_node — top-level var definition
  • fn_node — function with arity cases
  • do_node — sequential expression list, returns last
  • if_node — conditional with test, then, and else branches
  • let_node — local bindings
  • invoke_node — function/var call site
  • var_deref_node — var lookup with namespace metadata
  • native_raw_node — escape hatch for inline C++ interop
  • primitive_node — typed literal (integer, float, boolean, keyword, nil)
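The "everything is an expression" idea can be sketched in C++ with a variant over node structs. This is a simplified illustration covering only three node kinds. The `expr`, `expr_ptr`, `make`, and `dump` names are assumptions made for the example and do not mirror jank's actual definitions.

```cpp
#include <memory>
#include <string>
#include <type_traits>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical, simplified stand-ins for three IR node kinds.
struct expr;
using expr_ptr = std::shared_ptr<expr const>;

struct primitive_node { std::string type, value; };
struct do_node { std::vector<expr_ptr> expressions; };
struct if_node { expr_ptr test, then_branch, else_branch; };

struct expr { std::variant<primitive_node, do_node, if_node> node; };

template <typename T>
expr_ptr make(T n)
{
  return std::make_shared<expr>(expr{ std::move(n) });
}

// Every node kind renders through the same visitor; there is no separate
// "statement" case to handle.
std::string dump(expr_ptr const &e)
{
  return std::visit(
    [](auto const &n) -> std::string {
      using T = std::decay_t<decltype(n)>;
      if constexpr(std::is_same_v<T, primitive_node>)
      {
        return "(" + n.type + " " + n.value + ")";
      }
      else if constexpr(std::is_same_v<T, do_node>)
      {
        std::string out{ "(do" };
        for(auto const &child : n.expressions)
        {
          out += " " + dump(child);
        }
        return out + ")";
      }
      else
      {
        return "(if " + dump(n.test) + " " + dump(n.then_branch) + " "
          + dump(n.else_branch) + ")";
      }
    },
    e->node);
}
```

Because an `if_node` is itself an `expr`, it can sit anywhere a value is expected, including inside another node's `test` slot.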

Immutability and Persistent Data Structures in IR

Because Clojure's data structures are persistent and immutable, jank's IR can make stronger aliasing assumptions than a C++ compiler could. An IR pass that sees two references to the same PersistentHashMap node can freely share them without defensive copies. This is currently used conservatively, but it's the foundation for future escape analysis: if a map created in a function never escapes to a var or closure, it can be stack-allocated.
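The aliasing argument is easiest to see on a toy persistent list. The sketch below is not jank's PersistentHashMap; `cell`, `plist`, and `cons` are hypothetical names used only to show why two owners can share an immutable tail without defensive copies.

```cpp
#include <memory>
#include <utility>

// Hypothetical immutable cons cell: once built, neither field ever changes.
struct cell;
using plist = std::shared_ptr<cell const>;

struct cell
{
  int head;
  plist tail;
};

// Prepending allocates one new cell and shares the existing tail as-is.
plist cons(int head, plist tail)
{
  return std::make_shared<cell>(cell{ head, std::move(tail) });
}
```

Two lists built with `cons(1, base)` and `cons(0, base)` point at the very same `base` cells. Since nothing can mutate those cells, a compiler pass that knows this may treat both references as interchangeable.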


Quick Start: Reading and Understanding Jank IR Output

Building Jank From Source With IR Debug Output Enabled

Jank's build system uses CMake. To enable IR dump output, configure the build with the jank_ir_debug option turned on, alongside jank_tests so the test targets are built:

# Clone the repository
git clone https://github.com/jank-lang/jank.git
cd jank

# Configure with debug IR output enabled
cmake -B build \
  -DCMAKE_BUILD_TYPE=Debug \
  -Djank_ir_debug=ON \
  -Djank_tests=ON

cmake --build build -- -j$(nproc)

Once built, you can invoke the jank compiler with the IR dump flag:

./build/jank compile --dump-ir path/to/your_file.jank

Dumping IR for a Simple Clojure Expression

Create a minimal source file:

;; add.jank
(defn add [a b]
  (+ a b))

Run the compiler with IR dump:

./build/jank compile --dump-ir add.jank

Navigating the IR Node Tree in Debug Output

Expected console output for add.jank looks like this:

[jank IR dump]
def_node {
  name: user/add
  value: fn_node {
    arities: [
      fn_arity_node {
        params: [
          local_node { name: a, type: any }
          local_node { name: b, type: any }
        ]
        body: do_node {
          expressions: [
            invoke_node {
              fn:  var_deref_node { var: clojure.core/+ }
              args: [
                local_ref_node { name: a }
                local_ref_node { name: b }
              ]
              call_type: dynamic_var
            }
          ]
        }
      }
    ]
  }
}

Read this top-down: a def_node wraps a fn_node, which has one arity case. The arity's body is a do_node (implicit in Clojure function bodies) containing a single invoke_node. Notice call_type: dynamic_var, which means the call to + goes through var indirection at this stage. The planned type inference and inline caching passes are meant to rewrite this into a direct call, eliminating the runtime lookup.


Use Case 1: Constant Folding and Dead Code Elimination via the IR

How the IR Enables Compile-Time Constant Evaluation

When a conditional's test expression is a compile-time constant, the IR optimization pass can eliminate the dead branch entirely before LLVM ever sees the code. This was impossible in the previous architecture because by the time jank emitted Clang AST, literal true had already been converted into a clang::CXXBoolLiteralExpr node buried inside condition-checking boilerplate, context that Clang's optimization passes don't interpret as "this if is always taken."

Dead Code Elimination: Before and After

Consider:

;; constant-if.jank
(defn always-a []
  (if true :a :b))

Unoptimized IR (before constant folding pass):

fn_node {
  arities: [
    fn_arity_node {
      params: []
      body: do_node {
        expressions: [
          if_node {
            test:  primitive_node { type: boolean, value: true }
            then:  primitive_node { type: keyword, value: :a }
            else:  primitive_node { type: keyword, value: :b }  ; <-- dead branch
          }
        ]
      }
    }
  ]
}

Optimized IR (after constant-folding pass eliminates dead branch):

fn_node {
  arities: [
    fn_arity_node {
      params: []
      body: do_node {
        expressions: [
          primitive_node { type: keyword, value: :a }  ; if_node replaced entirely
        ]
      }
    }
  ]
}

The entire if_node is replaced with just the then branch. LLVM never has to reason about a conditional at all — the codegen emits a single keyword literal. This kind of optimization is only possible when you have an IR layer that understands Clojure's true as a primitive constant and not as a C++ bool produced by a runtime call to RT::booleanCast().
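A pass of this shape can be sketched in a few lines of C++. The node structs below are simplified, hypothetical stand-ins rather than jank's real types, and `fold` is an illustrative name. The sketch assumes every if_node carries an else branch and that literal tests have no side effects, so dropping them is safe.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical, simplified stand-ins for three IR node kinds.
struct expr;
using expr_ptr = std::shared_ptr<expr const>;

struct primitive_node { std::string type, value; };
struct do_node { std::vector<expr_ptr> expressions; };
struct if_node { expr_ptr test, then_branch, else_branch; };

struct expr { std::variant<primitive_node, do_node, if_node> node; };

template <typename T>
expr_ptr make(T n)
{
  return std::make_shared<expr>(expr{ std::move(n) });
}

// Clojure truthiness: only nil and false are falsey.
bool is_truthy(primitive_node const &p)
{
  return !(p.type == "nil" || (p.type == "boolean" && p.value == "false"));
}

// Rebuild the tree bottom-up, replacing any if_node whose test folds to a
// literal with the branch that would be taken.
expr_ptr fold(expr_ptr const &e)
{
  if(auto const *d = std::get_if<do_node>(&e->node))
  {
    do_node out;
    for(auto const &child : d->expressions)
    {
      out.expressions.push_back(fold(child));
    }
    return make(std::move(out));
  }
  if(auto const *i = std::get_if<if_node>(&e->node))
  {
    auto const test = fold(i->test);
    if(auto const *p = std::get_if<primitive_node>(&test->node))
    {
      return fold(is_truthy(*p) ? i->then_branch : i->else_branch);
    }
    return make(if_node{ test, fold(i->then_branch), fold(i->else_branch) });
  }
  return e; // primitives are already as folded as they can be
}
```

The truthiness check is the Clojure-specific part: a test of `0` or `""` still selects the then branch, which a pass working on C++ booleans would get wrong.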


Use Case 2: Inline Caching and Call Site Optimization

How Dynamic Dispatch Creates IR-Level Challenges

In standard Clojure on the JVM, calling a var like (+ a b) involves a var lookup, a deref to get the current value, a cast to IFn, and then an interface call to IFn.invoke. The JVM's JIT handles this efficiently after warmup via inline caches. In jank, without a JIT, every dynamic var call has full indirection overhead at runtime unless the compiler can resolve and hardcode the call target at compile time. The IR is the place to do this, because it still has namespace and var metadata attached to each call site.

Representing Polymorphic Call Sites in the Custom IR

Here is how the same conceptual call — (+ a b) — appears in IR before and after type inference annotates the call site:

Unoptimized var-call (full runtime dispatch):

invoke_node {
  call_type:   dynamic_var
  fn:          var_deref_node {
                 var:       clojure.core/+
                 ns:        clojure.core
                 resolved:  false
               }
  args:        [ local_ref_node { name: a }, local_ref_node { name: b } ]
  cache_slot:  null
  monomorphic: false
}

After type inference + inline cache annotation:

invoke_node {
  call_type:   direct_call
  fn:          fn_ref_node {
                 target:    clojure.core/+__2  ; direct arity-2 fn pointer
                 resolved:  true
               }
  args:        [ local_ref_node { name: a }, local_ref_node { name: b } ]
  cache_slot:  ic_slot_3   ; inline cache slot allocated
  monomorphic: true        ; type inference proved single target
}

The cache_slot and monomorphic annotations are the hooks a future inline caching pass will use to emit a fast-path check at the call site: "if the var still holds the same function pointer I saw at compile time, jump directly; otherwise fall back to full dispatch." The IR is designed to preserve this metadata all the way to LLVM codegen, which would then emit the conditional branch.
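The guard described in that quote might look roughly like the following at runtime. This is a hand-written C++ sketch of the general inline-cache technique, not code jank emits. The `var`, `call_site`, and `invoke_cached` names are hypothetical, and the hit counters exist only to make the two paths observable.

```cpp
#include <cstdint>

using fn_ptr = std::int64_t (*)(std::int64_t, std::int64_t);

// Hypothetical stand-in for a var: a mutable slot holding the current fn.
struct var
{
  fn_ptr root;
};

// Hypothetical per-call-site cache, playing the role of cache_slot.
struct call_site
{
  fn_ptr cached; // target observed when the site was compiled
  int fast_hits{};
  int slow_hits{};
};

std::int64_t invoke_cached(call_site &site, var const &v, std::int64_t a, std::int64_t b)
{
  // Fast path: the var still holds the function seen at compile time.
  if(v.root == site.cached)
  {
    ++site.fast_hits;
    return site.cached(a, b);
  }
  // Slow path: the var was redefined, so go through it on every call.
  ++site.slow_hits;
  return v.root(a, b);
}

std::int64_t add_ints(std::int64_t a, std::int64_t b) { return a + b; }
std::int64_t sub_ints(std::int64_t a, std::int64_t b) { return a - b; }
```

With a site cached on `add_ints`, calls take the fast path until the var's root is swapped to `sub_ints`, after which the guard fails and the call falls back to the var's current value.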


Use Case 3: Interop with C++ and LLVM Through the IR Boundary

How Native Interop Calls Are Represented as IR Nodes

Jank's C++ interop uses a native_raw_node — an escape hatch that embeds a string of C++ code or a typed call descriptor directly into the IR. This is structurally different from a regular invoke_node. The IR must carry explicit C++ type annotations because LLVM codegen can't infer them from Clojure's dynamic type system.

Here is a side-by-side comparison:

Pure-jank function call node:

invoke_node {
  call_type:   direct_call
  fn:          fn_ref_node {
                 target:    myns/square
                 resolved:  true
               }
  args:        [ primitive_node { type: integer, value: 4 } ]
  return_type: jank::object_ptr   ; always object_ptr for jank fns
  cache_slot:  ic_slot_7
}

Native C++ interop call node:

;; jank source: call C++ sqrt
(defn native-sqrt [x]
  (native/raw "std::sqrt(~{x})"))

native_raw_node {
  source:      "std::sqrt(~{x})"
  interpolations: [
    interp_binding { name: x, cpp_type: double }
  ]
  return_type: cpp::double          ; explicit C++ type, NOT object_ptr
  requires_box: true                ; result must be boxed to jank::object_ptr
  inlineable:   false               ; cannot inline across boundary yet
}

Ensuring Type Safety at the IR Level

The requires_box flag tells LLVM codegen to emit a jank::box<double>() wrapper around the C++ return value before it's usable in jank code. The IR validates that every native_raw_node has explicit return_type and requires_box set — a missing annotation is a compile error at the IR validation pass, not a runtime crash. This is one of the places where having an IR layer pays an immediate safety dividend: the codegen never sees an untyped native return.
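A validation pass of this kind can be sketched as a function that collects errors instead of letting an incomplete node reach codegen. The struct below is a hypothetical reduction of native_raw_node that models "annotation missing" with std::optional. The `validate` name and the error strings are assumptions made for the example.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical reduction of native_raw_node: an empty optional models a
// missing annotation.
struct native_raw_node
{
  std::string source;
  std::optional<std::string> return_type;
  std::optional<bool> requires_box;
};

// Report every missing annotation so it surfaces as a compile error.
std::vector<std::string> validate(native_raw_node const &n)
{
  std::vector<std::string> errors;
  if(!n.return_type)
  {
    errors.push_back("missing return_type on native_raw_node: " + n.source);
  }
  if(!n.requires_box)
  {
    errors.push_back("missing requires_box on native_raw_node: " + n.source);
  }
  return errors;
}
```

Codegen would then run only when `validate` returns an empty list for every native node in the tree.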

Current Limitations Across the Interop Boundary

Cross-boundary inlining is not implemented. If a jank function calls a native function that calls back into jank, the IR cannot currently represent the re-entrant type flow. The inlineable: false flag keeps the inliner from attempting it, since an attempt could silently produce incorrect code. Escape analysis, which is still planned, would also stop at native_raw_node boundaries: the compiler conservatively assumes any pointer passed to native code escapes, preventing stack allocation of objects that touch interop calls.


Limitations and Current Constraints of Jank's IR

To be direct: jank's IR is young. It's already valuable, but developers using or contributing to jank need a clear picture of what's there and what isn't.

No SSA Form: What That Means Practically

Without SSA, jank IR does not attempt classical dataflow optimizations such as copy propagation, global value numbering, and loop-invariant code motion, all of which are far easier to implement on SSA. These happen in LLVM instead, but LLVM operates on already-lowered code and lacks the Clojure-semantic context to make Lisp-aware decisions. For example, LLVM cannot tell that two dereferences of an unmodified var are equivalent; until jank IR can do that analysis itself, the optimization is missed entirely.
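To illustrate the var-dereference example, here is a small C++ sketch of the analysis over a straight-line sequence of operations. It is a toy, not a jank pass: `op`, `op_kind`, and `reusable_derefs` are hypothetical names, and it assumes the only thing that can change a var is an explicit redefine in the same sequence. A real pass would also have to treat arbitrary calls as potential redefinitions.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

enum class op_kind
{
  deref,
  redefine
};

// Hypothetical straight-line operation on a named var.
struct op
{
  op_kind kind;
  std::string var;
};

// For each op, return the index of an earlier deref of the same var that it
// could reuse, or -1 if it must perform the lookup itself.
std::vector<int> reusable_derefs(std::vector<op> const &ops)
{
  std::map<std::string, int> last_deref;
  std::vector<int> reuse(ops.size(), -1);
  for(std::size_t i = 0; i < ops.size(); ++i)
  {
    auto const &o = ops[i];
    if(o.kind == op_kind::redefine)
    {
      // A write invalidates any remembered value of this var.
      last_deref.erase(o.var);
      continue;
    }
    auto const found = last_deref.find(o.var);
    if(found != last_deref.end())
    {
      reuse[i] = found->second;
    }
    else
    {
      last_deref[o.var] = static_cast<int>(i);
    }
  }
  return reuse;
}
```

For the sequence deref a, deref a, redefine a, deref a, only the second deref can reuse an earlier lookup; the one after the redefine has to go back to the var.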

Optimization Pass Status

| Optimization | Supported in Jank IR Now | Planned | Handled by LLVM Instead |
|---|---|---|---|
| Constant folding | ✅ (basic literals) | SSA-based full pass | ✅ LLVM does post-lowering |
| Dead code elimination | ✅ (constant branches) | Full DCE with SSA | ✅ LLVM -O2 |
| Inline caching | ❌ (hooks only) | Q3 2025 | ❌ Not possible in LLVM |
| Escape analysis | ❌ | 2025/2026 | Partial (LLVM alloca promo) |
| Loop unrolling | ❌ | Post-SSA | ✅ LLVM -O3 |
| Recur/tail call opt | ✅ (trampoline) | Direct jump TCO | Partial |
| Type-based dispatch | ❌ | Post inline cache | ❌ Not possible in LLVM |
| Cross-interop inlining | ❌ | Undecided | ❌ Not possible |

Tooling Gaps

There is currently no interactive IR explorer. You get a textual dump and that's it — no graphical node viewer, no diff tool for before/after optimization passes. When an optimization pass produces incorrect IR, you're reading raw text output and mentally diffing it. GDB/LLDB work for debugging the compiler itself, but there's no equivalent of LLVM's opt -print-after-all pipeline visualization for jank IR yet.


Contributing to Jank's IR: Where to Start

Repository Structure: Where IR Code Lives

The jank repository is at https://github.com/jank-lang/jank. The IR-relevant code lives in:

jank/
├── compiler+runtime/
│   ├── include/jank/analyze/
│   │   └── expression.hpp        # IR node type definitions
│   ├── src/jank/analyze/
│   │   ├── processor.cpp         # Semantic analysis → IR lowering
│   │   └── expression.cpp        # Node constructors and visitors
│   ├── src/jank/codegen/
│   │   └── llvm_processor.cpp    # Jank IR → LLVM IR
│   └── test/jank/analyze/
│       └── *_test.cpp            # IR-level tests

How to Add a New IR Node Type

Adding a node type follows a four-step pattern. Here's a sketch for adding a hypothetical loop_node:

Step 1 — Define the struct in expression.hpp:

// include/jank/analyze/expression.hpp
struct loop_node : expression_base {
  native_vector<local_binding>  bindings;
  expression_ptr                body;
  // loop_node is a recur target: a recur in tail position of the body jumps back here
  static constexpr bool is_tail_recursive = true;
};

// Add to the expression variant:
using expression = std::variant<
  def_node,
  fn_node,
  do_node,
  if_node,
  let_node,
  loop_node,     // <-- new entry
  invoke_node,
  /* ... */
>;

Step 2 — Add the visitor case in expression.cpp:

// src/jank/analyze/expression.cpp
template <typename Fn>
auto visit(expression const& expr, Fn&& fn) {
  return std::visit(std::forward<Fn>(fn), expr);
}

// In the IR dump visitor (returns void because it only prints):
void operator()(loop_node const& n) const {
  fmt::print("loop_node {{\n");
  for (auto const& b : n.bindings) {
    fmt::print("  binding: {} = ...\n", b.name);
  }
  fmt::print("  body: {}\n", dump(n.body));
  fmt::print("}}\n");
}

Step 3 — Hook into the analysis pass in processor.cpp:

// src/jank/analyze/processor.cpp
expression_ptr processor::analyze_loop(list_ptr const& form, ctx& c) {
  auto node    = make_node<loop_node>();
  node->bindings = analyze_bindings(form->data[1], c);  // (loop [x 0] ...)
  node->body     = analyze_body(form->rest(2), c);
  return node;
}

// Register in the special-form dispatch table:
special_forms["loop*"] = &processor::analyze_loop;

Step 4 — Add the codegen case in llvm_processor.cpp:

// src/jank/codegen/llvm_processor.cpp
llvm::Value* llvm_processor::codegen(loop_node const& n) {
  auto* loop_bb   = llvm::BasicBlock::Create(ctx, "loop", current_fn);
  auto* after_bb  = llvm::BasicBlock::Create(ctx, "after_loop", current_fn);
  builder.CreateBr(loop_bb);
  builder.SetInsertPoint(loop_bb);
  // ... emit bindings; recur sites inside the body branch back to loop_bb
  auto* result = codegen(n.body);
  // Falling out of the body exits the loop, so later code is emitted in after_bb
  builder.CreateBr(after_bb);
  builder.SetInsertPoint(after_bb);
  return result;
}

Running the IR Test Suite

# Run all IR/analyze tests
ctest --test-dir build -R analyze --output-on-failure

# Run a specific IR test file
./build/compiler+runtime/test/jank_test \
  --gtest_filter="analyze.constant_folding.*"

New IR-level tests belong in compiler+runtime/test/jank/analyze/. Each test should construct IR nodes directly (bypassing the reader/analyzer pipeline), run an optimization pass, and assert on the resulting node tree structure. Look at existing if_node tests for the pattern — they're the most complete examples of IR-level unit testing in the codebase right now.

If you're a Clojure developer who wants to contribute without writing C++, the best entry point is writing jank source test cases that exercise edge cases in existing optimization passes — these live in compiler+runtime/test/jank/ as .jank files and get compiled and checked by the test runner automatically.

Recommended Tools