Performance and Optimization

Applies primarily to UTLXe. Performance tuning targets the UTLXe production engine. The CLI processes one message at a time, so its performance is dominated by startup time (under 10ms with the native binary), not throughput.

UTLXe achieves 86,000+ messages per second on a single container with 8 workers. This chapter explains where that number comes from, how to tune for your workload, and what to monitor.

Execution Strategies

The strategy choice is the single biggest performance lever. It determines how the transformation executes at runtime:

| Strategy | How it works | Throughput | When to use |
| --- | --- | --- | --- |
| TEMPLATE | Walk AST, interpret each expression | 1,000-5,000 msg/s | Development, simple transforms, low volume |
| COPY | Clone pre-built skeleton, fill values | 5,000-20,000 msg/s | Schema-driven, predictable output structure |
| COMPILED | Execute JVM bytecode generated from AST | 20,000-86,000 msg/s | Maximum throughput, complex logic |
| COPY+COMPILED | Clone skeleton + compiled fill logic | 50,000-86,000+ msg/s | Ultimate throughput |
| AUTO | Schema present → COPY, else → TEMPLATE | Varies | Production default; the engine chooses |

The numbers above are for a single container with 8 workers on 1 vCPU, processing typical JSON-to-JSON transformations (~1KB messages). Larger messages, more complex transformations, and XML parsing all reduce throughput.

Why COMPILED Is Fast

The COMPILED strategy compiles UTL-X expressions to JVM bytecode using the ASM library, the same bytecode-generation library the JDK has bundled for its own internal use. The generated bytecode:

  • Runs at native JVM speed (no interpretation overhead)

  • Benefits from HotSpot JIT compilation (further optimized at runtime)

  • Eliminates the AST tree-walking overhead (~10x faster than TEMPLATE for complex expressions)

  • Uses typed operations (no boxing/unboxing for arithmetic)

The compilation happens once at init-time — while the bundle loads, before the engine reports ready — and is cached by SHA-256 hash of the source. The first message therefore hits already-compiled bytecode; there is no first-message latency penalty. Subsequent starts with the same transformation skip compilation entirely (cache hit).
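The idea of keying a cache on a content hash can be sketched in a few lines of shell. This is only an illustration of the principle stated above (the key is the SHA-256 digest of the source). The helper name `source_key` and the sample transformation snippets are hypothetical, and UTLXe's actual cache layout is not shown here.

```bash
#!/usr/bin/env bash
# Illustration of content-addressed caching: the cache key is the SHA-256
# digest of the transformation source, so identical source always maps to
# the same key and any edit produces a new one.
# source_key is a hypothetical helper, not a UTLXe command.
source_key() {
  # Reads source text on stdin, prints the 64-character hex digest.
  sha256sum | cut -d' ' -f1
}

key_v1=$(printf '%s' '{ total: $input.total }' | source_key)
key_v1_again=$(printf '%s' '{ total: $input.total }' | source_key)
key_v2=$(printf '%s' '{ total: $input.total * 2 }' | source_key)

[ "$key_v1" = "$key_v1_again" ] && echo "unchanged source: cache hit"
[ "$key_v1" != "$key_v2" ] && echo "edited source: cache miss, recompile"
```

Because the key depends only on the source text, renaming or moving a transformation file would not invalidate a cache built this way, while even a one-character edit would.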

Memory Model

UDM Expansion Factor

When UTL-X parses a message, it creates a UDM tree in memory. This tree is larger than the original message because:

  • Every string becomes a Java String object (24+ bytes overhead per string)

  • Every number becomes a boxed Double or Long (24 bytes)

  • Every object becomes a HashMap with entry overhead

  • Every array becomes an ArrayList with element pointers

The expansion factor depends on the format:

| Format | Typical expansion | Example |
| --- | --- | --- |
| JSON | 5-10x | 1KB JSON → 5-10KB heap |
| XML | 10-20x | 1KB XML → 10-20KB heap (attributes, namespaces add overhead) |
| CSV | 3-5x | 1KB CSV → 3-5KB heap (simpler structure) |
| YAML | 5-10x | Similar to JSON |

This means a 100KB XML document consumes 1-2MB of JVM heap during transformation. With 32 concurrent workers, that's 32-64MB just for input messages — plus the output UDM, plus intermediate values.
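The arithmetic above generalizes to message size × expansion factor × concurrent workers. A minimal sketch, assuming the expansion factors from the table and using a hypothetical helper named `estimate_input_heap_kb` (it covers input UDM trees only, not output or intermediate values):

```bash
#!/usr/bin/env bash
# Rough input-side heap estimate: message size x expansion factor x workers.
# estimate_input_heap_kb is a hypothetical helper, not part of UTLXe.
estimate_input_heap_kb() {
  local msg_kb=$1 factor=$2 workers=$3
  echo $(( msg_kb * factor * workers ))
}

# 100KB XML document, 32 workers, XML expansion of 10x (low) and 20x (high)
low=$(estimate_input_heap_kb 100 10 32)
high=$(estimate_input_heap_kb 100 20 32)
echo "input UDM heap: ${low}KB to ${high}KB"
```

For the example in the text this gives 32,000KB to 64,000KB, which matches the 32-64MB figure.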

Heap Sizing

Rule of thumb: set JVM heap to 75% of container memory.

bash
# Container with 512MB memory
docker run -e JAVA_OPTS="-Xmx384m -Xms384m" ...

# Container with 2GB memory
docker run -e JAVA_OPTS="-Xmx1536m -Xms1536m" ...

Set -Xms equal to -Xmx to avoid heap resizing during operation. The remaining 25% is for JVM metaspace, thread stacks, and native memory.
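The 75% rule is easy to script so that the heap flags follow the container limit. The sketch below assumes the limit is passed in megabytes; `heap_opts` is a hypothetical helper name, not something UTLXe provides.

```bash
#!/usr/bin/env bash
# Derive heap flags from the container memory limit using the 75% rule.
# heap_opts is a hypothetical helper; it takes the limit in MB.
heap_opts() {
  local container_mb=$1
  local heap_mb=$(( container_mb * 75 / 100 ))
  # Same value for -Xmx and -Xms so the heap never resizes.
  echo "-Xmx${heap_mb}m -Xms${heap_mb}m"
}

heap_opts 512    # -Xmx384m -Xms384m
heap_opts 2048   # -Xmx1536m -Xms1536m
```

The output can be passed straight through, for example `docker run -e JAVA_OPTS="$(heap_opts 512)" ...`.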

Sizing Guide

| Workload | Message size | Workers | Recommended memory |
| --- | --- | --- | --- |
| Small messages | Under 10KB | 8-16 | 256-512MB |
| Medium messages | 10-100KB | 8-16 | 512MB-1GB |
| Large messages | 100KB-1MB | 4-8 | 1-2GB |
| Very large messages | 1MB+ | 2-4 | 2-4GB |

For large messages, reduce worker count — fewer concurrent messages means less simultaneous heap usage. For small messages, increase workers to maximize throughput.

GraalVM Native Image

CLI: Native Binary

The CLI (utlx) is distributed as a GraalVM native binary:

  • Startup: under 10ms (vs ~250ms for JVM JAR)

  • Memory: ~40MB resident (vs ~150MB for JVM)

  • Distribution: single binary, no JVM required

This makes the CLI practical for shell scripts, CI/CD pipelines, and interactive use — where startup time dominates.

Engine: JVM Only

UTLXe runs on the JVM, not as a native image. The JVM is actually faster for long-running engine workloads because:

  • HotSpot JIT compiles hot paths to native code at runtime

  • The COMPILED strategy generates JVM bytecode that benefits from JIT

  • G1GC handles large heaps efficiently for sustained throughput

  • Reflection (used by SnakeYAML, Jackson) works without configuration

Native image would save startup time (~250ms → ~10ms), but UTLXe starts once and runs for days, so startup time is irrelevant. Runtime throughput matters, and the JVM wins there.

Benchmarking

Quick Benchmark

Measure throughput with the HTTP API:

bash
# Start UTLXe with a test transformation
utlxe --bundle test-transforms/ --mode http --port 8080 --workers 8

# Benchmark with hey (HTTP load generator)
hey -n 10000 -c 32 -m POST \
  -H "Content-Type: application/json" \
  -d '{"orderId": "ORD-001", "total": 299.99}' \
  http://localhost:8080/api/execute/test-transform

This gives you requests/second, latency percentiles (p50, p95, p99), and error rate.

What to Measure

| Metric | Target | Action if target is missed |
| --- | --- | --- |
| p50 latency | Under 5ms | Check transformation complexity |
| p99 latency | Under 50ms | Check for GC pauses, large messages |
| Throughput | 1,000+ msg/s per worker | Switch to COMPILED strategy |
| Error rate | Under 0.1% | Check validation, input format issues |
| Heap usage | Under 75% of max | Increase memory or reduce workers |
| GC pause time | Under 10ms | Tune G1GC, reduce heap pressure |

Conformance Suite as Benchmark

The conformance suite (500+ tests) can be run in throughput mode to establish a baseline:

bash
python3 conformance-suite/run_tests.py --benchmark

This runs all tests repeatedly and reports average execution time per test — useful for detecting performance regressions after code changes.

Production Tuning

Worker Count

Start with 8 workers per vCPU:

bash
# 1 vCPU container
utlxe --workers 8

# 4 vCPU container
utlxe --workers 32

Adjust based on monitoring:

  • Workers always at capacity, queue building up: add more workers or more instances

  • Workers mostly idle: reduce to save memory

  • High GC pause times: reduce workers (less concurrent heap pressure)

Horizontal vs Vertical Scaling

| Approach | When | How |
| --- | --- | --- |
| More workers (vertical) | Workers are saturated, memory is available | Increase --workers |
| More instances (horizontal) | Memory is saturated, or need fault tolerance | Add container replicas |
| Bigger container | Large messages need more heap | Increase memory limit |
| COMPILED strategy | TEMPLATE throughput is insufficient | Set strategy in transform.yaml |

Horizontal scaling (more instances) is generally preferred — it adds fault tolerance and distributes load across nodes. Vertical scaling (more workers in one instance) is simpler but has a ceiling (one JVM, one heap).

GC Tuning

For containerized UTLXe, G1GC with these settings works well:

bash
JAVA_OPTS="-Xmx1536m -Xms1536m \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=10 \
  -XX:G1HeapRegionSize=4m \
  -XX:+UseStringDeduplication"

  • MaxGCPauseMillis=10: sets a 10ms pause-time goal (acceptable for transformation latency); G1 treats this as a target, not a guarantee

  • G1HeapRegionSize=4m: good for mixed small and medium objects (UDM trees)

  • UseStringDeduplication: reduces heap for messages with repeated string values (field names, enum values)

Back-Pressure

UTLXe uses a bounded ArrayBlockingQueue as its work queue, with CallerRunsPolicy as the rejection handler:

  • When the work queue is full (all workers busy): the calling thread executes the transformation itself

  • This naturally slows the producer when the engine is saturated

  • No risk of out-of-memory from unbounded queue growth

  • No messages are dropped — they're executed by the caller thread instead

Monitor queue depth via Prometheus (utlxe_queue_depth). Sustained high queue depth means you need more capacity.

Performance Anti-Patterns

Avoid: filter() Inside map()

utlx
// SLOW — O(N × M): scans all lines for every order
map($input.orders, (order) ->
  filter($input.lines, (l) -> l.orderId == order.id)
)

// FAST — O(N + M): build index once, look up per order
let linesByOrder = groupBy($input.lines, (l) -> l.orderId)
map($input.orders, (order) ->
  linesByOrder[order.id] ?? []
)

nestBy() handles the indexing automatically.

Avoid: Repeated Computation

utlx
// SLOW — computes the same filter twice
{
  activeCount: count(filter($input.users, (u) -> u.active)),
  activeNames: map(filter($input.users, (u) -> u.active), (u) -> u.name)
}

// FAST — compute once, reuse
let activeUsers = filter($input.users, (u) -> u.active)
{
  activeCount: count(activeUsers),
  activeNames: map(activeUsers, (u) -> u.name)
}

Bind intermediate results to let variables. The cost of the binding is negligible; the cost of recomputation is proportional to the data size.

Avoid: Large String Concatenation in Loops

utlx
// SLOW — creates N intermediate strings
reduce($input.items, "", (acc, item) ->
  concat(acc, item.name, ", ")
)

// FAST — use join()
join(map($input.items, (item) -> item.name), ", ")

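join() builds the result in one pass. concat() in a reduce() creates a new string on every iteration, and each one copies the whole accumulator so far, so the total copying grows quadratically with the number of items.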

Released under AGPL-3.0.