Performance and Optimization
Primarily UTLXe. Performance tuning applies to the UTLXe production engine. The CLI processes one message at a time — its performance is dominated by startup time (under 10ms with native binary), not throughput.
UTLXe achieves 86,000+ messages per second on a single container with 8 workers. This chapter explains where that number comes from, how to tune for your workload, and what to monitor.
Execution Strategies
The strategy choice is the single biggest performance lever. It determines how the transformation executes at runtime:
| Strategy | How it works | Throughput | When to use |
| TEMPLATE | Walk AST, interpret each expression | 1,000-5,000 msg/s | Development, simple transforms, low volume |
| COPY | Clone pre-built skeleton, fill values | 5,000-20,000 msg/s | Schema-driven, predictable output structure |
| COMPILED | Execute JVM bytecode generated from AST | 20,000-86,000 msg/s | Maximum throughput, complex logic |
| COPY+COMPILED | Clone skeleton + compiled fill logic | 50,000-86,000+ msg/s | Ultimate throughput |
| AUTO | Schema present → COPY, else → TEMPLATE | Varies | Production default — engine chooses |
The numbers above are for a single container with 8 workers on 1 vCPU, processing typical JSON-to-JSON transformations (about 1KB messages). Larger messages, more complex transformations, and XML parsing all reduce throughput.
Why COMPILED Is Fast
The COMPILED strategy compiles UTL-X expressions to JVM bytecode using the ASM library, a bytecode generation library widely used across the JVM ecosystem, including inside the JDK itself. The generated bytecode:
Runs at native JVM speed (no interpretation overhead)
Benefits from HotSpot JIT compilation (further optimized at runtime)
Eliminates the AST tree-walking overhead (about 10x faster than TEMPLATE for complex expressions)
Uses typed operations (no boxing/unboxing for arithmetic)
The compilation happens once at init-time — while the bundle loads, before the engine reports ready — and is cached by SHA-256 hash of the source. The first message therefore hits already-compiled bytecode; there is no first-message latency penalty. Subsequent starts with the same transformation skip compilation entirely (cache hit).
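The hash-keyed caching described above can be illustrated with a small Python sketch. This is not UTLXe's implementation: `CompileCache` and `compile_fn` are hypothetical names, Python's built-in `compile()` stands in for ASM bytecode generation, and the cache here lives only in memory, whereas a cache that survives restarts would also need persistent storage.

```python
import hashlib


class CompileCache:
    """Cache compiled artifacts keyed by the SHA-256 hash of their source."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn   # hypothetical compiler callback
        self._entries = {}              # hex digest -> compiled artifact
        self.misses = 0                 # number of real compilations performed

    def get(self, source: str):
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if key not in self._entries:
            self.misses += 1
            self._entries[key] = self._compile_fn(source)
        return self._entries[key]


# Python's own compile() stands in for bytecode generation (assumption).
cache = CompileCache(lambda src: compile(src, "<transform>", "eval"))

code = cache.get("price * quantity")   # compiled once, at load time
same = cache.get("price * quantity")   # identical source: cache hit
result = eval(code, {"price": 2.5, "quantity": 4})
```

Keying on a content hash rather than a file name means an unchanged transformation is never recompiled, while any edit to the source produces a new key and forces a fresh compilation.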
Memory Model
UDM Expansion Factor
When UTL-X parses a message, it creates a UDM tree in memory. This tree is larger than the original message because:
Every string becomes a Java String object (24+ bytes overhead per string)
Every number becomes a boxed Double or Long (24 bytes)
Every object becomes a HashMap with entry overhead
Every array becomes an ArrayList with element pointers
The expansion factor depends on the format:
| Format | Typical expansion | Example |
| JSON | 5-10x | 1KB JSON → 5-10KB heap |
| XML | 10-20x | 1KB XML → 10-20KB heap (attributes, namespaces add overhead) |
| CSV | 3-5x | 1KB CSV → 3-5KB heap (simpler structure) |
| YAML | 5-10x | Similar to JSON |
This means a 100KB XML document consumes 1-2MB of JVM heap during transformation. With 32 concurrent workers, that's 32-64MB just for input messages — plus the output UDM, plus intermediate values.
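The arithmetic above can be reproduced with a short estimator. This is a sketch only: `input_heap_kb` is a hypothetical helper, the ranges are copied from the expansion table, and the result covers input UDM trees only (not output trees or intermediate values).

```python
# Typical expansion ranges from the table above, as (low, high) multipliers.
EXPANSION = {
    "json": (5, 10),
    "xml": (10, 20),
    "csv": (3, 5),
    "yaml": (5, 10),
}


def input_heap_kb(fmt: str, message_kb: float, workers: int):
    """Return a (low, high) heap estimate in KB for in-flight input messages."""
    low, high = EXPANSION[fmt]
    return (message_kb * low * workers, message_kb * high * workers)


# 100KB XML documents with 32 concurrent workers.
low_kb, high_kb = input_heap_kb("xml", 100, 32)
print(f"{low_kb / 1000:.0f}-{high_kb / 1000:.0f} MB")  # prints 32-64 MB
```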
Heap Sizing
Rule of thumb: set JVM heap to 75% of container memory.
# Container with 512MB memory
docker run -e JAVA_OPTS="-Xmx384m -Xms384m" ...
# Container with 2GB memory
docker run -e JAVA_OPTS="-Xmx1536m -Xms1536m" ...
Set -Xms equal to -Xmx to avoid heap resizing during operation. The remaining 25% is for JVM metaspace, thread stacks, and native memory.
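The 75% rule can be expressed as a small helper that derives the flags from a container memory limit. `java_opts_for` is a hypothetical name used for illustration, and the fraction is the rule of thumb from this section rather than a fixed requirement.

```python
def java_opts_for(container_mb: int, heap_fraction: float = 0.75) -> str:
    """Build matching -Xmx/-Xms flags from a container memory limit in MB."""
    heap_mb = int(container_mb * heap_fraction)
    # Equal -Xmx and -Xms avoids heap resizing during operation.
    return f"-Xmx{heap_mb}m -Xms{heap_mb}m"


small = java_opts_for(512)    # 512MB container
large = java_opts_for(2048)   # 2GB container
```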
Sizing Guide
| Workload | Message size | Workers | Recommended memory |
| Small messages | Under 10KB | 8-16 | 256-512MB |
| Medium messages | 10-100KB | 8-16 | 512MB-1GB |
| Large messages | 100KB-1MB | 4-8 | 1-2GB |
| Very large messages | 1MB+ | 2-4 | 2-4GB |
For large messages, reduce worker count — fewer concurrent messages means less simultaneous heap usage. For small messages, increase workers to maximize throughput.
GraalVM Native Image
CLI: Native Binary
The CLI (utlx) is distributed as a GraalVM native binary:
Startup: under 10ms (vs 250ms for JVM JAR)
Memory: 40MB resident (vs 150MB for JVM)
Distribution: single binary, no JVM required
This makes the CLI practical for shell scripts, CI/CD pipelines, and interactive use — where startup time dominates.
Engine: JVM Only
UTLXe runs on the JVM, not as a native image. The JVM is actually faster for long-running engine workloads because:
HotSpot JIT compiles hot paths to native code at runtime
The COMPILED strategy generates JVM bytecode that benefits from JIT
G1GC handles large heaps efficiently for sustained throughput
Reflection (used by SnakeYAML, Jackson) works without configuration
Native image would cut startup time (about 250ms down to under 10ms), but UTLXe starts once and runs for days, so startup time is irrelevant. Runtime throughput matters, and the JVM wins there.
Benchmarking
Quick Benchmark
Measure throughput with the HTTP API:
# Start UTLXe with a test transformation
utlxe --bundle test-transforms/ --mode http --port 8080 --workers 8
# Benchmark with hey (HTTP load generator)
hey -n 10000 -c 32 -m POST \
-H "Content-Type: application/json" \
-d '{"orderId": "ORD-001", "total": 299.99}' \
http://localhost:8080/api/execute/test-transform
This gives you requests/second, latency percentiles (p50, p95, p99), and error rate.
What to Measure
| Metric | Target | Action if target is missed |
| p50 latency | Under 5ms | Check transformation complexity |
| p99 latency | Under 50ms | Check for GC pauses, large messages |
| Throughput | 1,000+ msg/s per worker | Switch to COMPILED strategy |
| Error rate | Under 0.1% | Check validation, input format issues |
| Heap usage | Under 75% of max | Increase memory or reduce workers |
| GC pause time | Under 10ms | Tune G1GC, reduce heap pressure |
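If you collect raw latency samples yourself rather than relying on a load generator's summary, the percentile targets above can be checked with a few lines of Python. This sketch uses the nearest-rank definition of a percentile; the sample values are made up for illustration, and other tools may interpolate and report slightly different numbers.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Illustrative latencies in milliseconds (one slow outlier, e.g. a GC pause).
latencies_ms = [2.1, 2.4, 1.9, 3.0, 2.2, 48.0, 2.3, 2.0, 2.6, 2.5]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

# Compare against the targets from the table above.
meets_p50 = p50 < 5
meets_p99 = p99 < 50
```

A single outlier barely moves p50 but dominates p99, which is why the two are tracked separately.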
Conformance Suite as Benchmark
The conformance suite (500+ tests) can be run in throughput mode to establish a baseline:
python3 conformance-suite/run_tests.py --benchmark
This runs all tests repeatedly and reports average execution time per test, which is useful for detecting performance regressions after code changes.
Production Tuning
Worker Count
Start with 8 workers per vCPU:
# 1 vCPU container
utlxe --workers 8
# 4 vCPU container
utlxe --workers 32
Adjust based on monitoring:
Workers always at capacity, queue building up: add more workers or more instances
Workers mostly idle: reduce to save memory
High GC pause times: reduce workers (less concurrent heap pressure)
Horizontal vs Vertical Scaling
| Approach | When | How |
| More workers (vertical) | Workers are saturated, memory is available | Increase --workers |
| More instances (horizontal) | Memory is saturated, or need fault tolerance | Add container replicas |
| Bigger container | Large messages need more heap | Increase memory limit |
| COMPILED strategy | TEMPLATE throughput is insufficient | Set strategy in transform.yaml |
Horizontal scaling (more instances) is generally preferred — it adds fault tolerance and distributes load across nodes. Vertical scaling (more workers in one instance) is simpler but has a ceiling (one JVM, one heap).
GC Tuning
For containerized UTLXe, G1GC with these settings works well:
JAVA_OPTS="-Xmx1536m -Xms1536m \
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=10 \
-XX:G1HeapRegionSize=4m \
-XX:+UseStringDeduplication"
MaxGCPauseMillis=10: target 10ms GC pauses (acceptable for transformation latency)
G1HeapRegionSize=4m: good for mixed small and medium objects (UDM trees)
UseStringDeduplication: reduces heap for messages with repeated string values (field names, enum values)
Back-Pressure
UTLXe uses a bounded ArrayBlockingQueue as its work queue, with CallerRunsPolicy as the rejection policy:
When the work queue is full (all workers busy): the calling thread executes the transformation itself
This naturally slows the producer when the engine is saturated
No risk of out-of-memory from unbounded queue growth
No messages are dropped — they're executed by the caller thread instead
Monitor queue depth via Prometheus (utlxe_queue_depth). Sustained high queue depth means you need more capacity.
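The caller-runs behavior can be demonstrated with a Python analogue. This is a sketch of the general technique, not UTLXe code: `CallerRunsPool` is a hypothetical class, and `queue.Queue(maxsize=...)` plays the role of the bounded ArrayBlockingQueue.

```python
import queue
import threading


class CallerRunsPool:
    """Bounded work queue; when it is full, the submitting thread runs the task."""

    def __init__(self, workers, capacity):
        self._tasks = queue.Queue(maxsize=capacity)  # bounded, never grows
        self.ran_in_caller = 0
        for _ in range(workers):
            threading.Thread(target=self._work, daemon=True).start()

    def _work(self):
        while True:
            task = self._tasks.get()
            try:
                task()
            finally:
                self._tasks.task_done()

    def submit(self, task):
        try:
            self._tasks.put_nowait(task)
        except queue.Full:
            # Saturated: run on the caller's thread instead of dropping the
            # task. The producer is slowed for as long as the task takes.
            self.ran_in_caller += 1
            task()

    def join(self):
        self._tasks.join()


started, gate = threading.Event(), threading.Event()
results, lock = [], threading.Lock()


def record(n):
    with lock:
        results.append(n)


def slow_task():
    started.set()   # signal that the single worker is now busy
    gate.wait()     # hold the worker until the demo releases it
    record(0)


pool = CallerRunsPool(workers=1, capacity=1)
pool.submit(slow_task)
started.wait()                   # worker has taken slow_task; queue is empty
pool.submit(lambda: record(1))   # fits in the queue
pool.submit(lambda: record(2))   # queue full: runs on this thread
gate.set()
pool.join()
```

Task 2 completes first because the submitting thread ran it directly; nothing was dropped, and the queue never held more than one item.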
Performance Anti-Patterns
Avoid: filter() Inside map()
// SLOW, O(N × M): scans all lines for every order
map($input.orders, (order) ->
  filter($input.lines, (l) -> l.orderId == order.id)
)
// FAST, O(N + M): build index once, look up per order
let linesByOrder = groupBy($input.lines, (l) -> l.orderId)
map($input.orders, (order) ->
  linesByOrder[order.id] ?? []
)
nestBy() handles the indexing automatically.
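To make the complexity difference concrete, the following Python analogue counts the work done by each approach. The data is synthetic and the counters are only for illustration; the code mirrors the shape of the two UTL-X versions above and says nothing about how UTLXe implements groupBy() internally.

```python
from collections import defaultdict

orders = [{"id": i} for i in range(100)]
lines = [{"orderId": i % 100, "n": i} for i in range(1000)]

# Nested scan: every order rescans every line (N x M comparisons).
comparisons = 0
slow = []
for o in orders:
    matched = []
    for l in lines:
        comparisons += 1
        if l["orderId"] == o["id"]:
            matched.append(l)
    slow.append(matched)

# Indexed lookup: one pass to build the index, one lookup per order (M + N).
index_ops = 0
lines_by_order = defaultdict(list)
for l in lines:
    index_ops += 1
    lines_by_order[l["orderId"]].append(l)
fast = []
for o in orders:
    index_ops += 1
    fast.append(lines_by_order.get(o["id"], []))
```

With 100 orders and 1,000 lines, the nested scan performs 100,000 comparisons while the indexed version performs 1,100 operations, and both produce the same result.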
Avoid: Repeated Computation
// SLOW: computes the same filter twice
{
  activeCount: count(filter($input.users, (u) -> u.active)),
  activeNames: map(filter($input.users, (u) -> u.active), (u) -> u.name)
}
// FAST: compute once, reuse
let activeUsers = filter($input.users, (u) -> u.active)
{
  activeCount: count(activeUsers),
  activeNames: map(activeUsers, (u) -> u.name)
}
Bind intermediate results to let variables. The cost of the binding is zero; the cost of recomputation is proportional to the data size.
Avoid: Large String Concatenation in Loops
// SLOW: creates N intermediate strings
reduce($input.items, "", (acc, item) ->
  concat(acc, item.name, ", ")
)
// FAST: use join()
join(map($input.items, (item) -> item.name), ", ")
join() builds the result in one pass. concat() in a reduce() creates a new string for every iteration.
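The cost of the intermediate strings can be estimated with a Python analogue that sums the length of every intermediate result. This is a sketch with synthetic data: it models immutable strings being rebuilt on each iteration and is not a measurement of UTLXe itself.

```python
names = [f"item{i}" for i in range(1000)]

# Accumulating version: each step conceptually builds a brand-new string.
copied = 0
acc = ""
for name in names:
    acc = acc + name + ", "
    copied += len(acc)   # characters written for this intermediate string

# Single-pass version: the result is assembled once.
joined = ", ".join(names)

ratio = copied / len(joined)
```

For 1,000 items the accumulating version writes several hundred times more characters than the final string contains. Note also that it leaves a trailing ", " that the joined version does not.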