The Universal Data Model (UDM)

The Universal Data Model is the single most important concept in UTL-X. It's the abstraction that makes format-agnosticism possible — the internal representation that lets your transformation work identically whether the input is XML, JSON, CSV, YAML, or OData.

This chapter explains what UDM is, how each format maps to it, and why certain design decisions were made. If you want to understand why UTL-X works the way it does — not just how — this is the chapter.

Why UDM Exists

Consider the problem: XML has elements, attributes, namespaces, and mixed content. JSON has objects, arrays, strings, numbers, booleans, and null. CSV has rows and columns. YAML has all of JSON's types plus anchors, aliases, and tags.

Without an intermediate representation, a transformation engine would need separate logic for every format combination:

XML → JSON: handle attributes, namespaces, text nodes
JSON → XML: handle types (number → string), arrays (→ repeated elements)
CSV → JSON: handle headers, type detection
XML → CSV: handle hierarchy flattening
... and so on for every pair

With N formats, that's N × (N-1) conversion paths — each with its own edge cases and bugs.

UDM eliminates this by providing one common tree structure:

Tier 1 data transformation — all input formats (XML, JSON, CSV, YAML, OData) are parsed into UDM. The transformation operates on UDM. The result is rendered to the chosen output format.

Every input format is parsed into UDM. Every transformation operates on UDM. Every output format is serialized from UDM. The transformation never touches raw XML tags or JSON braces — it works with the meaning of the data.
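The saving can be put in numbers. The sketch below is only an illustration of the arithmetic, assuming every format needs exactly one parser and one serializer in the hub design; the function names are made up for this example.

```python
def direct_converters(n: int) -> int:
    """Pairwise design: one converter per ordered (source, target) pair."""
    return n * (n - 1)


def hub_components(n: int) -> int:
    """Hub design: one parser into the common model plus one serializer out of it, per format."""
    return 2 * n


formats = ["XML", "JSON", "CSV", "YAML", "OData"]
n = len(formats)

print(direct_converters(n), hub_components(n))  # prints "20 10"
```

Adding a sixth format costs ten new converters in the pairwise design but only one parser and one serializer in the hub design.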

UDM Types

UDM has eight types. Every piece of data from any format maps to one of these:


UDM Type	What it holds	Examples
Scalar	A single value: string, number, boolean, or null	"Alice", 42, 3.14, true, null
Object	Named properties + optional XML attributes + metadata
Array	Ordered list of UDM values	[1, 2, 3], [{...}, {...}]
DateTime	Timestamp with timezone (ISO 8601)	2026-04-28T14:30:00Z
Date	Date only, no time	2026-04-28
LocalDateTime	Date and time without timezone	2026-04-28T14:30:00
Time	Time only, no date	14:30:00
Binary	Raw byte array	Binary file content

There's also a Lambda type for function values, but you rarely encounter it directly — it's used internally for map, filter, and other higher-order functions.

The first three types (Scalar, Object, and Array) account for the overwhelming majority of real-world data. The temporal types (DateTime, Date, LocalDateTime, Time) provide first-class date handling without string parsing at every access.
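To make the type list concrete, here is a minimal Python sketch of the eight types as plain data classes. It is a reading aid, not the UTL-X implementation: the class and field names are assumptions modelled on the table above, and the Lambda type is left out.

```python
import datetime as dt
from dataclasses import dataclass, field
from typing import Union


@dataclass
class Scalar:
    value: Union[str, int, float, bool, None]


@dataclass
class Array:
    elements: list = field(default_factory=list)


@dataclass
class Object:
    properties: dict = field(default_factory=dict)   # named child values
    attributes: dict = field(default_factory=dict)   # XML attributes, kept separate
    metadata: dict = field(default_factory=dict)     # e.g. namespace information


@dataclass
class DateTime:
    value: dt.datetime        # expected to be timezone-aware


@dataclass
class Date:
    value: dt.date


@dataclass
class LocalDateTime:
    value: dt.datetime        # expected to be naive (no timezone)


@dataclass
class Time:
    value: dt.time


@dataclass
class Binary:
    value: bytes


# Hypothetical sample tree combining several of the types.
order = Object(
    properties={
        "placed": DateTime(dt.datetime(2026, 4, 28, 14, 30, tzinfo=dt.timezone.utc)),
        "items": Array([Scalar(1), Scalar(2)]),
    },
    attributes={"id": "123"},
)
```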

How Formats Map to UDM

JSON → UDM

JSON maps to UDM almost directly — JSON was the inspiration for UDM's structure:


JSON	UDM
String: "Alice"	Scalar(string: "Alice")
Number: 42	Scalar(number: 42)
Boolean: true	Scalar(boolean: true)
Null: null	Scalar(null)
Object: {"name": "Alice"}	Object(properties: {name → Scalar("Alice")})
Array: [1, 2, 3]	Array(elements: [Scalar(1), Scalar(2), Scalar(3)])

JSON-to-UDM is lossless, and a UDM tree that originated from JSON serializes back to JSON without loss, so round-trip fidelity is guaranteed for JSON data.
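The mapping table can be sketched as a pair of small recursive functions. This is an illustrative model only: the tagged tuples stand in for UDM nodes, and json_to_udm and udm_to_json are hypothetical names, not UTL-X APIs.

```python
import json


def json_to_udm(value):
    """Map a parsed JSON value to a tagged tuple standing in for a UDM node."""
    if value is None:
        return ("Scalar", "null", None)
    if isinstance(value, bool):            # test bool first: bool is a subclass of int
        return ("Scalar", "boolean", value)
    if isinstance(value, (int, float)):
        return ("Scalar", "number", value)
    if isinstance(value, str):
        return ("Scalar", "string", value)
    if isinstance(value, list):
        return ("Array", [json_to_udm(v) for v in value])
    if isinstance(value, dict):
        return ("Object", {k: json_to_udm(v) for k, v in value.items()})
    raise TypeError(f"not a JSON value: {value!r}")


def udm_to_json(node):
    """Inverse mapping: rebuild the plain JSON value from the tagged tuples."""
    tag = node[0]
    if tag == "Scalar":
        return node[2]
    if tag == "Array":
        return [udm_to_json(e) for e in node[1]]
    if tag == "Object":
        return {k: udm_to_json(v) for k, v in node[1].items()}
    raise ValueError(f"unknown node tag: {tag!r}")


doc = json.loads('{"name": "Alice", "scores": [1, 2.5, true, null]}')
udm = json_to_udm(doc)
round_tripped = udm_to_json(udm)
```

Every JSON value has exactly one node kind to land in, which is why the round trip returns the original document.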

XML → UDM

XML is more complex because it has features that JSON lacks: attributes, namespaces, mixed content, and the distinction between elements and text.


XML	UDM
Element with children: <Order><Customer>...</Customer></Order>	Object(properties: {Customer → ...})
Element with text: <Name>Alice</Name>	Object(properties: {_text → Scalar("Alice")})
Attribute: id="123"	Stored in Object.attributes map
Repeated elements: <Item/><Item/>	Array of Objects
Namespace: xmlns="..."	Stored in Object.metadata

The key design decisions for XML:

Text content uses _text. When an XML element contains only text — like <Name>Alice</Name> — the text is stored as a property called _text inside the UDM Object. This is an internal convention that you never see in output (the serializers unwrap it automatically). It exists because UDM Objects store properties as a map, and the text content needs a key.

For example, given <Order><Customer>Alice</Customer></Order>, writing $input.Order.Customer returns "Alice", not the internal {_text: "Alice"} wrapper: UTL-X unwraps the _text automatically. This unwrapping is described in detail in Chapter 23.

Attributes are separate. XML attributes are stored in a separate attributes map on the UDM Object, not mixed with child element properties. This prevents name collisions — an element could theoretically have both a child element and an attribute with the same name. Access attributes with the @ prefix: $input.Order.@id.

Repeated elements become arrays. When an XML element appears multiple times with the same name, they're automatically grouped into a UDM Array. <Item/><Item/><Item/> becomes an Array of three Objects. Single elements stay as Objects (not wrapped in an array) unless array hints are provided.
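The three design decisions can be seen together in a small Python sketch built on the standard library's ElementTree. It is a simplified model under stated assumptions: dicts stand in for UDM Objects, lists for UDM Arrays, and namespaces, mixed content, and array hints are ignored. The element_to_udm name is hypothetical.

```python
import xml.etree.ElementTree as ET


def element_to_udm(elem):
    """Convert an element into a dict standing in for a UDM Object."""
    node = {"properties": {}, "attributes": dict(elem.attrib)}  # attributes kept separate
    children = list(elem)
    if not children:
        text = (elem.text or "").strip()
        if text:
            node["properties"]["_text"] = text       # text content gets the _text key
        return node
    props = node["properties"]
    for child in children:
        value = element_to_udm(child)
        if child.tag not in props:
            props[child.tag] = value                 # single element stays an Object
        elif isinstance(props[child.tag], list):
            props[child.tag].append(value)           # third and later occurrences
        else:
            props[child.tag] = [props[child.tag], value]  # second occurrence: promote to a list
    return node


root = ET.fromstring(
    '<Order id="123"><Customer>Alice</Customer><Item/><Item/><Item/></Order>'
)
order = element_to_udm(root)
```

In this model, a document with exactly one <Item/> yields an Object where the three-item document yields a list, which is the situation array hints exist to address.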

CSV → UDM

CSV maps to an Array of Objects, where each row becomes an Object with column headers as property names:

name,age,city
Alice,30,Amsterdam
Bob,25,Rotterdam

Becomes:

Array([
  Object({name: "Alice", age: 30, city: "Amsterdam"}),
  Object({name: "Bob", age: 25, city: "Rotterdam"})
])

Numbers are auto-detected: "30" becomes Scalar(number: 30), not Scalar(string: "30"). Booleans too: "true" becomes Scalar(boolean: true).
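A rough Python model of this row-to-Object mapping with type detection follows. The detection order (boolean, then integer, then float, then string) is an assumption made for the sketch; the actual UTL-X rules may differ, and detect_scalar and csv_to_udm are hypothetical names.

```python
import csv
import io


def detect_scalar(cell: str):
    """Guess a scalar type for one CSV cell; fall back to the original string."""
    lowered = cell.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        return int(cell)
    except ValueError:
        pass
    try:
        return float(cell)
    except ValueError:
        return cell


def csv_to_udm(text: str):
    """Each row becomes a dict keyed by the header row; the file becomes a list of rows."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: detect_scalar(value) for key, value in row.items()} for row in reader]


rows = csv_to_udm("name,age,city\nAlice,30,Amsterdam\nBob,25,Rotterdam\n")
```

One consequence worth knowing about any auto-detection scheme: in this sketch a cell such as 007 becomes the number 7, so identifier-like columns may need to be kept as strings explicitly.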

YAML → UDM

YAML is a superset of JSON, so the mapping is the same as JSON. YAML-specific features (anchors, aliases, tags) are resolved during parsing — the resulting UDM is identical to what JSON would produce for the same data.

OData → UDM

OData JSON is standard JSON with metadata conventions (@odata.context, @odata.type). These metadata properties are parsed as regular Object properties — accessible via $input["@odata.context"].

UDM Navigation

Dot Notation

The primary way to access data:

utlx

$input.Order.Customer.Name          // nested property access
$input.Order.Items.Item[0].Price    // array element access
$input.Order.@id                     // XML attribute access

Safe Navigation

Handle missing properties gracefully with ?.:

utlx

$input.Order?.Customer?.Name        // returns null if any step is missing

Without ?., a missing intermediate property causes an error. With ?., you get null — which you can then handle with ?? (nullish coalescing):

utlx

$input.Order?.Customer?.Name ?? "Unknown Customer"
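The difference between strict access, safe access, and the fallback operator can be modelled in a few lines of Python. This is a behavioral sketch under the assumption that a missing step yields None; the helper names are made up and do not correspond to UTL-X internals.

```python
class MissingProperty(Exception):
    """Raised by strict access when a step does not exist."""


def get(node, key):
    """Models `.`: fail loudly when the property is missing."""
    if not isinstance(node, dict) or key not in node:
        raise MissingProperty(key)
    return node[key]


def safe_get(node, key):
    """Models `?.`: a missing step yields None instead of an error."""
    if not isinstance(node, dict):
        return None
    return node.get(key)


def coalesce(value, default):
    """Models `??`: fall back only when the value is None."""
    return default if value is None else value


data = {"Order": {}}  # an order with no Customer

name = safe_get(safe_get(safe_get(data, "Order"), "Customer"), "Name")
label = coalesce(name, "Unknown Customer")
```

Note that coalesce only reacts to None: an empty string or a zero is a real value and is passed through unchanged.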

Recursive Descent

Find a property name at any depth with ..:

utlx

$input..ProductCode                 // finds ProductCode anywhere in the tree

Returns an array of all matches. Useful for deeply nested or variable-depth structures.
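Conceptually, recursive descent is a depth-first walk that collects every value stored under the requested name. The Python sketch below illustrates the idea on plain dicts and lists; the descend helper is hypothetical, and the document-order result is an assumption about ordering rather than a documented guarantee.

```python
def descend(node, name):
    """Collect every value stored under `name`, at any depth, in document order."""
    matches = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == name:
                matches.append(value)
            matches.extend(descend(value, name))   # keep searching below this value
    elif isinstance(node, list):
        for item in node:
            matches.extend(descend(item, name))
    return matches


data = {
    "Order": {
        "ProductCode": "A1",
        "Items": [
            {"ProductCode": "B2"},
            {"Bundle": {"ProductCode": "C3"}},
        ],
    }
}

codes = descend(data, "ProductCode")
print(codes)  # prints "['A1', 'B2', 'C3']"
```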

UDM and Type Coercion

UTL-X provides both automatic and explicit type coercion.

Automatic Coercion

In certain contexts, UTL-X converts types automatically:

String concatenation: concat("Total: ", 42) — number 42 becomes string "42"
XML text unwrapping: $input.Customer — the internal _text Object unwraps to the scalar value

Explicit Coercion

Use stdlib functions for explicit conversion:

utlx

toString(42)           // "42"
toNumber("42")         // 42
toBoolean("true")      // true
toNumber("not-a-num")  // 0 (returns 0 for unparseable input)

XML Text Node Unwrapping

This is the most important coercion in UTL-X. When XML is parsed:

xml

<Name>Alice</Name>

The UDM contains: Object(properties: {_text: Scalar("Alice")}).

But when you write $input.Name, you get "Alice" — not the wrapper object. UTL-X automatically unwraps the _text node during property access. This is handled by the interpreter (for TEMPLATE strategy) and RuntimeOps (for COMPILED strategy).

The unwrapping rules:

If an Object has only a _text property and no real attributes → unwrap to the scalar
If an Object has _text AND attributes → keep as Object (attributes would be lost)
If an Object has child elements → no unwrapping (it's a real object)

Chapter 23 covers the design decisions behind these rules in detail.
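The three rules reduce to a single check, sketched here in Python. The dict layout with properties and attributes keys is an assumption carried over for illustration, and the sketch treats every attribute as a "real" attribute; it is not the interpreter's actual code.

```python
def unwrap(node):
    """Apply the unwrapping rules to one node produced by XML parsing."""
    if not isinstance(node, dict):
        return node                       # already a scalar
    props = node.get("properties", {})
    attrs = node.get("attributes", {})
    if set(props) == {"_text"} and not attrs:
        return props["_text"]             # rule 1: text only, so unwrap to the scalar
    return node                           # rules 2 and 3: attributes or children, so keep the Object


text_only = {"properties": {"_text": "Alice"}, "attributes": {}}
with_attribute = {"properties": {"_text": "Alice"}, "attributes": {"id": "7"}}
with_children = {"properties": {"First": text_only}, "attributes": {}}
```

Keeping the Object in the second case is what preserves the id attribute; unwrapping there would silently discard it.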

UDM is Format-Agnostic, Transformations are Format-Agnostic

The key insight: because UDM normalizes all formats into one tree, your transformation expression is identical regardless of input format.

The expression $input.Order.Customer works whether the input is:

XML: <Order><Customer>Alice</Customer></Order>
JSON: {"Order": {"Customer": "Alice"}}
YAML: Order:\n Customer: Alice

Change input xml to input json in the header — the body stays the same. This is what "format-agnostic" means in practice: the transformation logic is decoupled from the serialization format.
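The same idea can be demonstrated outside UTL-X. In this Python sketch, two hypothetical parsers produce the same tree, and one accessor function serves both. The XML parser is deliberately minimal (no attributes, no repeated elements) and returns text-only elements already unwrapped.

```python
import json
import xml.etree.ElementTree as ET


def from_json(text):
    return json.loads(text)


def from_xml(text):
    def convert(elem):
        children = list(elem)
        if not children:
            return (elem.text or "").strip()      # text-only element, already unwrapped
        return {child.tag: convert(child) for child in children}

    root = ET.fromstring(text)
    return {root.tag: convert(root)}


def customer(tree):
    """The 'transformation body': identical for both inputs."""
    return tree["Order"]["Customer"]


xml_tree = from_xml("<Order><Customer>Alice</Customer></Order>")
json_tree = from_json('{"Order": {"Customer": "Alice"}}')
```

Only the parser choice differs between the two inputs; the accessor never learns which format the data came from.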

UDM and Flat/Relational Data

UDM is inherently hierarchical — Objects contain Objects which contain Objects. This mirrors XML and JSON naturally. But not all data is hierarchical.

The Flat Data Challenge

Consider SAP IDoc segments exported as XML. Orders and order lines are siblings, not nested:

xml

<IDOC>
  <E1EDK01><BELNR>ORD-001</BELNR><CURRENCY>EUR</CURRENCY></E1EDK01>
  <E1EDP01><BELNR>ORD-001</BELNR><MATNR>Widget</MATNR><MENGE>2</MENGE></E1EDP01>
  <E1EDP01><BELNR>ORD-001</BELNR><MATNR>Gadget</MATNR><MENGE>1</MENGE></E1EDP01>
  <E1EDK01><BELNR>ORD-002</BELNR><CURRENCY>USD</CURRENCY></E1EDK01>
  <E1EDP01><BELNR>ORD-002</BELNR><MATNR>Gizmo</MATNR><MENGE>5</MENGE></E1EDP01>
</IDOC>

The order lines belong to orders, but the relationship is in the BELNR field value, not in the XML structure. UDM parses the segments as sibling properties of IDOC (one array of E1EDK01 objects and one array of E1EDP01 objects), with no awareness that BELNR is a join key.

This pattern appears in SAP IDocs, EDI/EDIFACT segments, database exports, and any flat file where parent-child relationships are expressed through reference keys rather than nesting.

How to Handle It in UTL-X

Use groupBy to build a lookup, then map to construct the hierarchy:

utlx

let headers = $input.IDOC.E1EDK01
let linesByOrder = groupBy($input.IDOC.E1EDP01, (line) -> line.BELNR)

map(headers, (header) -> {
  orderId: header.BELNR,
  currency: header.CURRENCY,
  lines: map(linesByOrder[header.BELNR] ?? [], (line) -> {
    product: line.MATNR,
    quantity: toNumber(line.MENGE)
  })
})

The groupBy creates an indexed map (O(N) to build), and each header does an O(1) key lookup — efficient even with thousands of records.
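For readers who want to see the mechanics, the same index-then-lookup pattern is sketched below in Python, using the sample IDoc values from above as plain dicts. The group_by helper is a stand-in written for this example, not the UTL-X function.

```python
from collections import defaultdict

headers = [
    {"BELNR": "ORD-001", "CURRENCY": "EUR"},
    {"BELNR": "ORD-002", "CURRENCY": "USD"},
]
lines = [
    {"BELNR": "ORD-001", "MATNR": "Widget", "MENGE": "2"},
    {"BELNR": "ORD-001", "MATNR": "Gadget", "MENGE": "1"},
    {"BELNR": "ORD-002", "MATNR": "Gizmo", "MENGE": "5"},
]


def group_by(items, key_fn):
    """One pass over the items: build a dict from key to the list of matching items."""
    groups = defaultdict(list)
    for item in items:
        groups[key_fn(item)].append(item)
    return dict(groups)


lines_by_order = group_by(lines, lambda line: line["BELNR"])

orders = [
    {
        "orderId": header["BELNR"],
        "currency": header["CURRENCY"],
        # Dictionary lookup per header; an order without lines gets an empty list.
        "lines": [
            {"product": line["MATNR"], "quantity": int(line["MENGE"])}
            for line in lines_by_order.get(header["BELNR"], [])
        ],
    }
    for header in headers
]
```

The alternative, filtering the full line list once per header, rescans every line for every order and grows with the product of the two list sizes.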

For multi-level nesting (header → line → schedule), chain multiple groupBy calls:

utlx

let linesByOrder = groupBy($input.lines, (l) -> l.orderId)
let schedulesByLine = groupBy($input.schedules, (s) -> s.lineId)

map($input.headers, (h) -> {
  ...h,
  lines: map(linesByOrder[h.orderId] ?? [], (l) -> {
    ...l,
    schedules: schedulesByLine[l.lineId] ?? []
  })
})

Hierarchical vs Relational: A Design Trade-Off

Tools like Mercator (later IBM WebSphere Transformation Extender, WTX) solved this with an "intermediate card": a declarative mapping that specifies join keys and cardinalities. The engine then automatically restructures flat data into hierarchies.

UTL-X takes a different approach: the groupBy + map pattern uses existing language constructs instead of a separate mapping artifact. This is more verbose but more flexible — you can add conditions, transformations, and error handling at each level.

The nestBy() stdlib function simplifies the common case:

utlx

let enrichedOrders = nestBy(
  $input.IDOC.E1EDK01,           // parent records
  $input.IDOC.E1EDP01,           // child records
  (header) -> header.BELNR,       // parent key extractor
  (line) -> line.BELNR,           // child key extractor
  "lines"                         // name for the new property on each parent
)

The name reads naturally: "nest E1EDP01 records by BELNR into a property called lines." The 5th parameter ("lines") is a string that becomes a new property name on each parent object. After nestBy(), every E1EDK01 gains a .lines property containing its matched E1EDP01 records — you access it with dot notation like any other property:

utlx

// After nestBy(), .lines is a regular property.
// Note: PRICE is assumed to exist on each line; the sample IDoc above does not include it.
map(enrichedOrders, (order) -> {
  orderId: order.BELNR,
  lineCount: count(order.lines),
  total: sum(map(order.lines, (l) -> toNumber(l.MENGE) * toNumber(l.PRICE)))
})
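What nestBy() does can be approximated in a few lines of Python. This is a sketch of the described behavior, not the stdlib implementation: it assumes parents without matching children receive an empty list and that the inputs are not modified. The PRICE values and the ORD-003 header are invented for the example.

```python
def nest_by(parents, children, parent_key, child_key, property_name):
    """Attach to each parent the children whose key matches, under `property_name`."""
    index = {}
    for child in children:
        index.setdefault(child_key(child), []).append(child)
    # Build new parent dicts so the input records stay untouched.
    return [
        {**parent, property_name: index.get(parent_key(parent), [])}
        for parent in parents
    ]


parents = [
    {"BELNR": "ORD-001", "CURRENCY": "EUR"},
    {"BELNR": "ORD-002", "CURRENCY": "USD"},
    {"BELNR": "ORD-003", "CURRENCY": "EUR"},   # hypothetical order with no lines
]
children = [
    {"BELNR": "ORD-001", "MATNR": "Widget", "MENGE": "2", "PRICE": "10.00"},
    {"BELNR": "ORD-001", "MATNR": "Gadget", "MENGE": "1", "PRICE": "5.50"},
    {"BELNR": "ORD-002", "MATNR": "Gizmo", "MENGE": "5", "PRICE": "1.00"},
]

enriched = nest_by(parents, children, lambda h: h["BELNR"], lambda l: l["BELNR"], "lines")

summary = [
    {
        "orderId": order["BELNR"],
        "lineCount": len(order["lines"]),
        "total": sum(float(l["MENGE"]) * float(l["PRICE"]) for l in order["lines"]),
    }
    for order in enriched
]
```

Internally this is the same index-and-lookup idea as the groupBy pattern; the function only packages it behind one call.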

See Chapter 21 (Data Restructuring) for the full analysis, performance characteristics, and multi-level nesting patterns.

UDM and Schema Formats (Tier 2)

The diagrams above show Tier 1 data transformations — instance documents (JSON, XML, CSV, YAML, OData) parsed into UDM. But UTL-X also handles Tier 2 schema formats: XSD, JSON Schema, Avro, Protobuf, OData Schema (EDMX), and Table Schema.

When a Tier 2 format is the input, the parser produces UDM enriched with USDL directives (%types, %fields, %kind) — see Chapter 12. The transformation can then work with both the raw schema structure and the normalized USDL properties.

Tier 2 schema transformation — schema formats (JSON Schema, XSD, Table Schema, OData Schema, Protobuf, Avro) are parsed into UDM + USDL. The transformation can convert between any schema format pair.

Schema-to-schema conversion (e.g., input xsd / output jsch) works because USDL normalizes all schema concepts — types, fields, constraints — into a common vocabulary, just as UDM normalizes all data formats into a common tree.

Combined Transformations: Tier 1 + Tier 2

In some scenarios, a transformation receives both instance data (Tier 1) and schema metadata (Tier 2) as inputs. For example, a multi-input transformation might read a JSON message alongside its JSON Schema to produce validated, enriched output — or the output itself might be a schema format.

Combined transformation — Tier 1 instance data and Tier 2 schema metadata are parsed into UDM (with USDL for Tier 2). The output can be either a Tier 1 data format or a Tier 2 schema format.

This flexibility means UTL-X handles not just data transformation but also schema generation (CSV metadata → XSD), schema inspection (XSD → JSON report), and data-plus-schema pipelines — all using the same language and the same UDM foundation.

When UDM Matters

Most of the time, you don't think about UDM — you just write $input.name and it works. UDM matters when:

You're debugging unexpected behavior (especially XML attributes and _text)
You're working with flat/relational data (IDoc segments, CSV with foreign keys)
You're optimizing memory usage (UDM expansion factors — see Chapter 37)
You're working with the COMPILED strategy (which generates bytecode against UDM types)
You're writing a wrapper library (which serializes UDM to/from protobuf)
You're contributing to UTL-X itself (all core logic operates on UDM)

For day-to-day transformation development, UDM is invisible — which is exactly the point. It does its job and gets out of the way.
