Performance and Optimization

Primarily UTLXe. Performance tuning applies to the UTLXe production engine. The CLI processes one message at a time, so its performance is dominated by startup time (under 10ms with the native binary), not throughput.

UTLXe achieves 86,000+ messages per second on a single container with 8 workers. This chapter explains where that number comes from, how to tune for your workload, and what to monitor.

Execution Strategies

The strategy choice is the single biggest performance lever. It determines how the transformation executes at runtime:


Strategy	How it works	Throughput	When to use
TEMPLATE	Walk AST, interpret each expression	1,000-5,000 msg/s	Development, simple transforms, low volume
COPY	Clone pre-built skeleton, fill values	5,000-20,000 msg/s	Schema-driven, predictable output structure
COMPILED	Execute JVM bytecode generated from AST	20,000-86,000 msg/s	Maximum throughput, complex logic
COPY+COMPILED	Clone skeleton + compiled fill logic	50,000-86,000+ msg/s	Ultimate throughput
AUTO	Schema present → COPY, else → TEMPLATE	Varies	Production default — engine chooses

The numbers above are for a single container with 8 workers on 1 vCPU, processing typical JSON-to-JSON transformations (roughly 1KB messages). Larger messages, more complex transformations, and XML parsing all reduce throughput.

Why COMPILED Is Fast

The COMPILED strategy compiles UTL-X expressions to JVM bytecode using the ASM library — the same bytecode generation technology that Java itself uses. The generated bytecode:

Runs at native JVM speed (no interpretation overhead)
Benefits from HotSpot JIT compilation (further optimized at runtime)
Eliminates the AST tree-walking overhead (roughly 10x faster than TEMPLATE for complex expressions)
Uses typed operations (no boxing/unboxing for arithmetic)

The compilation happens once at init-time — while the bundle loads, before the engine reports ready — and is cached by SHA-256 hash of the source. The first message therefore hits already-compiled bytecode; there is no first-message latency penalty. Subsequent starts with the same transformation skip compilation entirely (cache hit).
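
The cache-by-hash idea can be sketched in a few lines of Python. This is an illustration only, not UTLXe's implementation: `compile_source`, `CompileCache`, and the in-memory dictionary are hypothetical stand-ins. Only the use of a SHA-256 digest of the source as the cache key comes from the description above.

```python
import hashlib


def compile_source(source: str) -> str:
    """Hypothetical stand-in for the expensive compile step."""
    return f"<bytecode for {len(source)} chars>"


class CompileCache:
    """Caches compiled artifacts keyed by the SHA-256 hash of the source text."""

    def __init__(self):
        self._entries = {}
        self.compilations = 0  # how many times the compile step actually ran

    def get(self, source: str) -> str:
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if key not in self._entries:
            # Cache miss: compile once and remember the result under the hash.
            self._entries[key] = compile_source(source)
            self.compilations += 1
        return self._entries[key]
```

Because the key is derived from the source text itself, any edit to a transformation produces a different key and forces a fresh compile, while an unchanged transformation always hits the cache.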

Memory Model

UDM Expansion Factor

When UTL-X parses a message, it creates a UDM tree in memory. This tree is larger than the original message because:

Every string becomes a Java String object (24+ bytes overhead per string)
Every number becomes a boxed Double or Long (24 bytes)
Every object becomes a HashMap with entry overhead
Every array becomes an ArrayList with element pointers

The expansion factor depends on the format:


Format	Typical expansion	Example
JSON	5-10x	1KB JSON → 5-10KB heap
XML	10-20x	1KB XML → 10-20KB heap (attributes, namespaces add overhead)
CSV	3-5x	1KB CSV → 3-5KB heap (simpler structure)
YAML	5-10x	Similar to JSON

This means a 100KB XML document consumes 1-2MB of JVM heap during transformation. With 32 concurrent workers, that's 32-64MB just for input messages — plus the output UDM, plus intermediate values.
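
The arithmetic above is easy to repeat for other formats and worker counts. The sketch below is a back-of-the-envelope estimator, assuming the expansion ranges from the table and that every worker holds one input message at a time; the function name `input_heap_kb` is made up for this example. It covers input UDM trees only, not the output UDM or intermediate values.

```python
# Expansion factors (low, high) copied from the table above.
EXPANSION = {
    "json": (5, 10),
    "xml": (10, 20),
    "csv": (3, 5),
    "yaml": (5, 10),
}


def input_heap_kb(fmt: str, message_kb: float, workers: int) -> tuple[float, float]:
    """Estimate (low, high) heap in KB held by input UDM trees when all workers are busy."""
    low, high = EXPANSION[fmt.lower()]
    return (message_kb * low * workers, message_kb * high * workers)


low_kb, high_kb = input_heap_kb("xml", 100, 32)
print(f"{low_kb / 1000:.0f}-{high_kb / 1000:.0f} MB")  # prints "32-64 MB"
```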

Heap Sizing

Rule of thumb: set JVM heap to 75% of container memory.

bash

# Container with 512MB memory
docker run -e JAVA_OPTS="-Xmx384m -Xms384m" ...

# Container with 2GB memory
docker run -e JAVA_OPTS="-Xmx1536m -Xms1536m" ...

Set -Xms equal to -Xmx to avoid heap resizing during operation. The remaining 25% is for JVM metaspace, thread stacks, and native memory.
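
For deployment scripts that compute the flags from the container limit, the rule of thumb can be written as a small helper. This is a sketch: the function name `java_opts` and the `heap_fraction` parameter are illustrative, and the 75% default is the rule of thumb stated above rather than a hard requirement.

```python
def java_opts(container_mb: int, heap_fraction: float = 0.75) -> str:
    """Build -Xmx/-Xms flags sized as a fraction of the container memory limit."""
    heap_mb = int(container_mb * heap_fraction)
    # -Xms equals -Xmx so the heap is never resized during operation.
    return f"-Xmx{heap_mb}m -Xms{heap_mb}m"


print(java_opts(512))   # prints "-Xmx384m -Xms384m"
print(java_opts(2048))  # prints "-Xmx1536m -Xms1536m"
```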

Sizing Guide


Workload	Message size	Workers	Recommended memory
Small messages	Under 10KB	8-16	256-512MB
Medium messages	10-100KB	8-16	512MB-1GB
Large messages	100KB-1MB	4-8	1-2GB
Very large messages	1MB+	2-4	2-4GB

For large messages, reduce worker count — fewer concurrent messages means less simultaneous heap usage. For small messages, increase workers to maximize throughput.
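
If you provision containers from a script, the sizing guide can be encoded as a lookup. The sketch below simply restates the table. The name `recommend` is hypothetical, and assigning a boundary value such as exactly 100KB to the larger tier is an assumption, since the table does not say which row it belongs to.

```python
# One entry per row of the sizing guide:
# (exclusive upper bound on message size in KB, workers, memory).
SIZING_GUIDE = [
    (10, "8-16", "256-512MB"),
    (100, "8-16", "512MB-1GB"),
    (1024, "4-8", "1-2GB"),
    (float("inf"), "2-4", "2-4GB"),
]


def recommend(message_kb: float) -> tuple[str, str]:
    """Return (workers, memory) from the sizing guide for a typical message size."""
    for limit_kb, workers, memory in SIZING_GUIDE:
        if message_kb < limit_kb:
            return workers, memory
    raise ValueError("message size must be a number")  # only reachable for NaN
```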

GraalVM Native Image

CLI: Native Binary

The CLI (utlx) is distributed as a GraalVM native binary:

Startup: under 10ms (vs 250ms for JVM JAR)
Memory: 40MB resident (vs 150MB for JVM)
Distribution: single binary, no JVM required

This makes the CLI practical for shell scripts, CI/CD pipelines, and interactive use — where startup time dominates.

Engine: JVM Only

UTLXe runs on the JVM, not as a native image. The JVM is actually faster for long-running engine workloads because:

HotSpot JIT compiles hot paths to native code at runtime
The COMPILED strategy generates JVM bytecode that benefits from JIT
G1GC handles large heaps efficiently for sustained throughput
Reflection (used by SnakeYAML, Jackson) works without configuration

Native image would cut startup time (roughly 250ms → 10ms), but UTLXe starts once and runs for days, so startup time is irrelevant. Runtime throughput matters, and the JVM wins there.

Benchmarking

Quick Benchmark

Measure throughput with the HTTP API:

bash

# Start UTLXe with a test transformation
utlxe --bundle test-transforms/ --mode http --port 8080 --workers 8

# Benchmark with hey (HTTP load generator)
hey -n 10000 -c 32 -m POST \
  -H "Content-Type: application/json" \
  -d '{"orderId": "ORD-001", "total": 299.99}' \
  http://localhost:8080/api/execute/test-transform

This gives you requests/second, latency percentiles (p50, p95, p99), and error rate.

What to Measure


Metric	Target	Action if target is missed
p50 latency	Under 5ms	Check transformation complexity
p99 latency	Under 50ms	Check for GC pauses, large messages
Throughput	1,000+ msg/s per worker	Switch to COMPILED strategy
Error rate	Under 0.1%	Check validation, input format issues
Heap usage	Under 75% of max	Increase memory or reduce workers
GC pause time	Under 10ms	Tune G1GC, reduce heap pressure
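
One way to use these targets is as an automated check after a benchmark run. The sketch below assumes you have already collected the measurements into a dictionary; the metric keys (`p50_ms`, `throughput_per_worker`, and so on) and the function name `missed_targets` are invented for this example and are not names that UTLXe or hey emit.

```python
# Targets from the table above: metric key -> (kind, threshold).
# "max" means the value must stay under the threshold,
# "min" means the value must reach at least the threshold.
TARGETS = {
    "p50_ms": ("max", 5),
    "p99_ms": ("max", 50),
    "throughput_per_worker": ("min", 1000),
    "error_rate_pct": ("max", 0.1),
    "heap_used_pct": ("max", 75),
    "gc_pause_ms": ("max", 10),
}


def missed_targets(measured: dict) -> list[str]:
    """Return the metric keys whose measured value misses its target."""
    missed = []
    for name, (kind, threshold) in TARGETS.items():
        value = measured.get(name)
        if value is None:
            continue  # metric was not measured in this run
        if kind == "max" and value >= threshold:
            missed.append(name)
        elif kind == "min" and value < threshold:
            missed.append(name)
    return missed
```

A non-empty result points you at the matching row of the table for the next step to try.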

Conformance Suite as Benchmark

The conformance suite (500+ tests) can be run in throughput mode to establish a baseline:

bash

python3 conformance-suite/run_tests.py --benchmark

This runs all tests repeatedly and reports average execution time per test — useful for detecting performance regressions after code changes.

Production Tuning

Worker Count

Start with 8 workers per vCPU:

bash

# 1 vCPU container
utlxe --workers 8

# 4 vCPU container
utlxe --workers 32

Adjust based on monitoring:

Workers always at capacity, queue building up: add more workers or more instances
Workers mostly idle: reduce to save memory
High GC pause times: reduce workers (less concurrent heap pressure)

Horizontal vs Vertical Scaling


Approach	When	How
More workers (vertical)	Workers are saturated, memory is available	Increase `--workers`
More instances (horizontal)	Memory is saturated, or need fault tolerance	Add container replicas
Bigger container	Large messages need more heap	Increase memory limit
COMPILED strategy	TEMPLATE throughput is insufficient	Set strategy in transform.yaml

Horizontal scaling (more instances) is generally preferred — it adds fault tolerance and distributes load across nodes. Vertical scaling (more workers in one instance) is simpler but has a ceiling (one JVM, one heap).

GC Tuning

For containerized UTLXe, G1GC with these settings works well:

bash

JAVA_OPTS="-Xmx1536m -Xms1536m \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=10 \
  -XX:G1HeapRegionSize=4m \
  -XX:+UseStringDeduplication"

MaxGCPauseMillis=10 — target 10ms GC pauses (acceptable for transformation latency)
G1HeapRegionSize=4m — good for mixed small and medium objects (UDM trees)
UseStringDeduplication — reduces heap for messages with repeated string values (field names, enum values)

Back-Pressure

UTLXe uses ArrayBlockingQueue with CallerRunsPolicy:

When the work queue is full (all workers busy): the calling thread executes the transformation itself
This naturally slows the producer when the engine is saturated
No risk of out-of-memory from unbounded queue growth
No messages are dropped — they're executed by the caller thread instead

Monitor queue depth via Prometheus (utlxe_queue_depth). Sustained high queue depth means you need more capacity.
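
The caller-runs behaviour is easiest to see in a stripped-down model. The sketch below is a Python analogue, not the engine's Java code: `CallerRunsPool` is a hypothetical class that uses a bounded `queue.Queue` in place of ArrayBlockingQueue and runs the task on the submitting thread when the queue is full.

```python
import queue
import threading


class CallerRunsPool:
    """Bounded work queue that falls back to running tasks on the caller's thread."""

    def __init__(self, workers: int, capacity: int):
        self._tasks = queue.Queue(maxsize=capacity)  # bounded, so it cannot grow without limit
        for _ in range(workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            task = self._tasks.get()
            try:
                task()
            finally:
                self._tasks.task_done()

    def submit(self, task) -> str:
        """Queue the task, or run it on the calling thread if the queue is full."""
        try:
            self._tasks.put_nowait(task)
            return "queued"
        except queue.Full:
            # Back-pressure: the producer is busy running this task,
            # so it cannot submit more work until the task finishes.
            task()
            return "caller"

    def join(self):
        """Block until every queued task has been processed."""
        self._tasks.join()
```

The task is never dropped in either branch. The only thing that changes under saturation is which thread pays for it.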

Performance Anti-Patterns

Avoid: filter() Inside map()

utlx

// SLOW — O(N × M): scans all lines for every order
map($input.orders, (order) ->
  filter($input.lines, (l) -> l.orderId == order.id)
)

// FAST — O(N + M): build index once, look up per order
let linesByOrder = groupBy($input.lines, (l) -> l.orderId)
map($input.orders, (order) ->
  linesByOrder[order.id] ?? []
)

nestBy() handles the indexing automatically.

Avoid: Repeated Computation

utlx

// SLOW — computes the same filter twice
{
  activeCount: count(filter($input.users, (u) -> u.active)),
  activeNames: map(filter($input.users, (u) -> u.active), (u) -> u.name)
}

// FAST — compute once, reuse
let activeUsers = filter($input.users, (u) -> u.active)
{
  activeCount: count(activeUsers),
  activeNames: map(activeUsers, (u) -> u.name)
}

Bind intermediate results to let variables. The binding itself costs next to nothing; recomputation costs time proportional to the data size.

Avoid: Large String Concatenation in Loops

utlx

// SLOW — creates N intermediate strings
reduce($input.items, "", (acc, item) ->
  concat(acc, item.name, ", ")
)

// FAST — use join()
join(map($input.items, (item) -> item.name), ", ")

join() builds the result in one pass. concat() in a reduce() creates a new string on every iteration, so the total amount of copying grows quadratically with the number of items.
