Preserves: Semantic Serialization of Node-labelled Data

   _________
  <_________>   Tony Garnock-Jones <tonyg@leastfixedpoint.com>
  |  FRμIT  |   September 2018
  |Preserves|   Version 0.0.2
  \_________/
 

Introduction

Most data serialization formats used on the web represent edge-labelled semi-structured data.

This document proposes a data model and serialization format that takes a node-labelled approach.

This makes it both extensible and much more like S-expressions, making it easily able to represent the labelled sums of products as seen in Rust, Haskell, OCaml, and other functional programming languages.

Starting with Semantics

Taking inspiration from functional programming, we start with a definition of the values that we want to work with and give them meaning independent of their syntax. We will treat syntax separately, later in this document.

                      Value = Atom
                            | Compound

                       Atom = SignedInteger
                            | String
                            | ByteString
                            | Symbol
                            | Boolean
                            | Float
                            | Double

                   Compound = Record
                            | Sequence
                            | Set
                            | Dictionary

Our Values fall into two broad categories: atomic and compound data.1

Total order. As we go, we will incrementally specify a total order over Values. Two values of the same kind are compared using kind-specific rules. The ordering among values of different kinds is essentially arbitrary, but having a total order is convenient for many tasks, so we define it as follows:2

        (Values)        Compound < Atom

        (Compounds)     Record < Sequence < Set < Dictionary

        (Atoms)         SignedInteger < String < ByteString < Symbol
                          < Boolean < Float < Double

Equivalence. Two Values are equal if neither is less than the other according to the total order.
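
The ordering among kinds can be pictured as a simple rank table. The following Python sketch is only an illustration: the kind names are taken from the grammar above, while KIND_ORDER, KIND_RANK and compare_kinds are hypothetical names introduced here, and comparison within a kind is left to the kind-specific rules given in the following sections.

```python
# Kinds listed from least to greatest, following the ordering above.
KIND_ORDER = [
    "Record", "Sequence", "Set", "Dictionary",          # Compounds
    "SignedInteger", "String", "ByteString", "Symbol",  # Atoms
    "Boolean", "Float", "Double",
]
KIND_RANK = {kind: rank for rank, kind in enumerate(KIND_ORDER)}

def compare_kinds(kind_a, kind_b):
    """Return -1, 0 or 1 according to the ordering among kinds."""
    rank_a, rank_b = KIND_RANK[kind_a], KIND_RANK[kind_b]
    return (rank_a > rank_b) - (rank_a < rank_b)
```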

Signed integers.

A SignedInteger is a signed integer of arbitrary width. SignedIntegers are compared as mathematical integers. We will write examples of SignedIntegers using standard mathematical notation.

Examples. 10; -6; 0.

Non-examples. NaN (the clue is in the name!); ∞ (not finite); 0.2 (not an integer); 1/7 (likewise); 2+i3 (likewise); √2 (likewise).

Unicode strings.

A String is a sequence of Unicode code-points. Two Strings are compared lexicographically, code-point by code-point.3 We will write examples of Strings as text surrounded by double-quotes “"” using a monospace font.

Examples. "Hello world", an eleven-code-point string; "z水𝄞", the string containing the three Unicode code-points z (0x7A), 水 (0x6C34) and 𝄞 (0x1D11E); "", the empty string.
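
As an illustration of code-point ordering, the following Python sketch sorts the example strings twice: once using Python's built-in string comparison, which compares code-point by code-point, and once by the bytes of their UTF-8 encodings. The variable names are hypothetical; the point is only that both orders coincide for these samples.

```python
# z (0x7A), the code-point 0x6C34, the code-point 0x1D11E, "" and "Hello world".
samples = ["z", "\u6c34", "\U0001d11e", "", "Hello world"]

# Python compares str values code-point by code-point.
by_code_point = sorted(samples)

# Comparing the UTF-8 encodings byte by byte gives the same order.
by_utf8_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))

assert by_code_point == by_utf8_bytes
```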

Normalization forms. Unicode defines multiple normalization forms for text. No particular normalization form is required for Strings; see below.

Binary data.

A ByteString is an ordered sequence of zero or more integers in the inclusive range [0..255]. ByteStrings are compared lexicographically, byte by byte. We will only write examples of ByteStrings that contain bytes mapping to printable ASCII characters, using “#"” as an opening quote mark and “"” as a closing quote mark.

Examples. The ByteString containing the integers 65, 66 and 67 (corresponding to ASCII characters A, B and C) is written as #"ABC". The empty ByteString is written as #"". N.B. Despite appearances, these are binary data.

Symbols or identifiers.

Programming languages like Lisp and Prolog frequently use string-like values called symbols. Here, a Symbol is, like a String, a sequence of Unicode code-points, intended to represent an identifier of some kind. Symbols are also compared lexicographically by code-point. We will write examples including only non-empty sequences of non-whitespace characters, using a monospace font without quotation marks.

Examples. hello-world; utf8-string; exact-integer?.

Booleans.

There are exactly two Boolean values, “false” and “true”. The “false” value compares less-than the “true” value. We write #f for “false”, and #t for “true”.

Examples. #f; #t.

IEEE floating-point values.

A Float is a single-precision IEEE 754 floating-point value; a Double is a double-precision IEEE 754 floating-point value. Floats, Doubles and SignedIntegers are considered disjoint, and so by the rules above, every Float is less than every Double, and every SignedInteger is less than both. Two Floats or two Doubles are to be ordered by the totalOrder predicate defined in section 5.10 of IEEE Std 754-2008. We write examples using standard mathematical notation, avoiding NaN and infinities, using a suffix f or d to indicate Float or Double, respectively.

Examples. 10f; -6d; 0f; 0.5d; -1.202e300d.

Non-examples. 10, -6, and 0, because writing them this way indicates SignedIntegers, not Floats or Doubles.
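
One common way to obtain an ordering of this shape for Doubles is to compare their bit patterns after a small adjustment. The Python sketch below assumes that technique; total_order_key is a hypothetical helper, not part of this specification. It places negative zero below positive zero and orders the infinities at the extremes; Floats could be handled analogously with 32-bit patterns.

```python
import struct

def total_order_key(x):
    """Map a Double to an integer whose natural order follows the bit-pattern trick."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    if bits >> 63:
        # Sign bit set: invert every bit so larger magnitudes sort lower.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Sign bit clear: set it so that these values sort above all negatives.
    return bits | 0x8000000000000000

values = [1.0, float("inf"), 0.0, -0.0, -1.0, float("-inf"), 0.5]
ordered = sorted(values, key=total_order_key)
```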

Records.

A Record is a labelled tuple of zero or more Values, called the record's fields. A record's label is, itself, a Value, though it will usually be a Symbol.4 5 Records are compared lexicographically as if they were just tuples; that is, first by their labels, and then by the remainder of their fields. We will only write examples of Records having labels that are Symbols entirely composed of ASCII characters. Such Records will be written as a parenthesised, space-separated sequence of their label followed by their fields.

Examples. The Record with label foo and fields 1, 2 and 3 is written (foo 1 2 3); the Record with label void and no fields is written (void).

Sequences.

A Sequence is a general-purpose, variable-length ordered sequence of zero or more Values. Sequences are compared lexicographically, appealing to the ordering on Values for comparisons at each position in the Sequences. We write examples space-separated, surrounded with square brackets.

Examples. [], the empty sequence; [1 2 3], the sequence of SignedIntegers 1, 2 and 3.

Sets.

A Set is an unordered finite set of Values. It contains no duplicate values, following the equivalence relation induced by the total order on Values. Two Sets are compared by sorting their elements using the total order and comparing the resulting sequences as Sequences. We write examples space-separated, surrounded with curly braces, prefixed by #set.

Examples. #set{}, the empty set; #set{#set{}}, the set containing only the empty set; #set{4 "hello" (void) 9.0f}, the set containing 4, the string "hello", the record with label void and no fields, and the Float denoting the number 9.0; #set{1 1.0f}, the set containing a SignedInteger and a Float, both denoting the number 1; #set{(mime application/xml #"<x/>") (mime application/xml #"<x />")}, a set containing two different type-labelled byte arrays.6

Non-examples. #set{1 1 1}, because it contains multiple equivalent Values.
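
The distinction between #set{1 1.0f} (two elements) and #set{1 1 1} (not a Set) depends on treating SignedInteger and Float as disjoint kinds. Host languages do not always agree: Python, for instance, considers 1 and 1.0 equal, so a plain Python set merges them. The sketch below shows the effect and one hypothetical workaround, pairing each payload with its kind name.

```python
# Python treats 1 and 1.0 as equal, so a plain set keeps only one of them.
naive = {1, 1.0}

# Pairing each payload with its kind (a convention assumed for this sketch)
# keeps the SignedInteger 1 and the Float 1.0 apart.
tagged = {("SignedInteger", 1), ("Float", 1.0)}

# Equivalent Values still collapse, as the non-example #set{1 1 1} requires.
repeated = {("SignedInteger", 1), ("SignedInteger", 1), ("SignedInteger", 1)}
```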

Dictionaries, hash-tables or maps.

A Dictionary is an unordered finite collection of zero or more pairs of Values. Each pair comprises a key and a value. Keys in a Dictionary must be pairwise distinct. Instances of Dictionary are compared by lexicographic comparison of the sequences resulting from ordering each Dictionary's pairs in ascending order by key. Examples are written as a #dict-prefixed, curly-brace-surrounded sequence of space-separated key-value pairs, each written with a colon between the key and value.

Examples. #dict{}, the empty dictionary; #dict{a:1}, the dictionary mapping the Symbol a to the SignedInteger 1; #dict{1:a}, mapping 1 to a; #dict{"hi":0 hi:0 there:[]}, having a String and two Symbol keys, and SignedInteger and Sequence values.

Non-examples. #dict{a:1 b:2 a:3}, because it contains duplicate keys; #dict{[]:[] []:99}, for the same reason.

Syntax

Now that we have discussed Values and their meanings, we may turn to techniques for representing Values for communication or storage.

The syntax we have used for the examples so far is inadequate in many ways, not least of which is that it cannot represent every Value.

Separation of the meaning of a piece of syntax from the syntax itself opens the door to domain-specific syntaxes, all equivalent and interconvertible.7 With a robust semantic foundation, connections to other data languages can also be made.

Binary syntax

For now, we limit our attention to an easily-parsed, easily-produced machine-readable syntax.

Every Value is represented as one or more bytes describing first its kind and its length, and then its specific contents.

For a value v, we write [[v]] for the encoding of v.

The following figure summarises the definitions below:

tt nn mmmm  varint(m)  contents
-------------------------------

00 00 mmmm  ...        application-specific Record
00 01 mmmm  ...        application-specific Record
00 10 mmmm  ...        application-specific Record
00 11 mmmm  ...        Record

01 00 mmmm  ...        Sequence
01 01 mmmm  ...        Set
01 10 mmmm  ...        Dictionary

10 00 mmmm  ...        SignedInteger, big-endian binary
10 01 mmmm  ...        String, UTF-8 binary
10 10 mmmm  ...        Bytes
10 11 mmmm  ...        Symbol, UTF-8 binary

11 00 0000             False
11 00 0001             True
11 00 0010             Float, 32 bits big-endian binary
11 00 0011             Double, 64 bits big-endian binary

If mmmm = 1111, varint(m) is present; otherwise, m is the length

Type and Length representation

A Value's type and length are represented by use of a function header(t,n,m) that yields a sequence of bytes when t, n and m are appropriate non-negative integers.

header(t,n,m) =    leadbyte(t,n,m)                 when m < 15
                or leadbyte(t,n,15) ++ varint(m)   otherwise

The lead byte in a Value's representation is constructed by a function

leadbyte(t,n,m) = [t*64 + n*16 + m]

The lead byte describes the rest of the representation as follows:8

leadbyte(0,-,-) represents a Record
leadbyte(1,-,-) represents a Sequence, Set or Dictionary
leadbyte(2,-,-) represents an Atom with variable-length binary representation
leadbyte(3,0,-) represents an Atom with fixed-length binary representation

Variable-length representations use the value of m to encode their lengths:

  • Lengths between 0 and 14 are represented using leadbyte with m values 0 through 14.
  • Lengths of 15 or greater are represented by m value 15, and additional "length bytes" describing the length then follow the lead byte.

These additional length bytes are formatted as base 128 varints. Quoting the Google Protocol Buffers definition,

Each byte in a varint, except the last byte, has the most significant bit (msb) set - this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two's complement representation of the number in groups of 7 bits, least significant group first.

Examples.

  • The varint representation of 15 is just the byte 15.
  • 300 (binary, grouped into 7-bit chunks, 10 0101100) varint-encodes to the two bytes 172 and 2.
  • 1000000000 (binary 11 1011100 1101011 0010100 0000000) varint-encodes to bytes 128, 148, 235, 220, and 3.

We write varint(m) for the varint-encoding of m.
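
The definitions of varint, leadbyte and header can be sketched in Python as follows. The function names mirror those used in the text; rejecting negative inputs is an assumption of this sketch, since lengths are never negative.

```python
def varint(m):
    """Base 128 varint of a non-negative integer, least significant group first."""
    if m < 0:
        raise ValueError("this sketch only encodes non-negative integers")
    out = bytearray()
    while True:
        group = m & 0x7F
        m >>= 7
        if m:
            out.append(group | 0x80)  # more groups follow: set the msb
        else:
            out.append(group)         # last group: msb clear
            return bytes(out)

def leadbyte(t, n, m):
    """Pack the two-bit t, two-bit n and four-bit m into a single byte."""
    return bytes([t * 64 + n * 16 + m])

def header(t, n, m):
    """Lead byte alone when m < 15; otherwise lead byte with m = 15, then varint(m)."""
    if m < 15:
        return leadbyte(t, n, m)
    return leadbyte(t, n, 15) + varint(m)
```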

Records

[[ (L F_1 ... F_m) ]] = header(0,3,m+1) ++ [[L]] ++ [[F_1]] ++ ... ++ [[F_m]]

For m fields, m+1 is supplied to header, to account for the encoding of the record label.

Application-specific short form for labels

Any given protocol using Preserves may additionally define an interpretation for n ∈ {0,1,2}, mapping each short form label number n to a specific record label. When encoding m fields with short form label number n, the header is header(0,n,m) (rather than m+1) since the label is implicit.

Examples. A protocol may choose to map records labelled void to n=0, making

[[(void)]] = header(0,0,0) = [0x00]

or it may map records labelled person to short form label number 1, making

[[(person "Dr" "Elizabeth" "Blackwell")]]
    = header(0,1,3) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
    =        [0x13] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]

Sequences, Sets and Dictionaries

[[ [X_1 ... X_m] ]] = header(1,0,m) ++ [[X_1]] ++ ... ++ [[X_m]]

[[ #set{X_1 ... X_m} ]] = header(1,1,m) ++ [[X_1]] ++ ... ++ [[X_m]]

[[ #dict{K_1:V_1 ... K_m:V_m} ]]
  = header(1,2,m) ++ [[K_1]] ++ [[V_1]] ++ ... ++ [[K_m]] ++ [[V_m]]

There is no ordering requirement on the X_i elements or K_i/V_i pairs.9 They may appear in any order.

Note that n=3 is unused and reserved.

Variable-length Atoms

SignedInteger
[[ x ]] when x ∈ SignedInteger = header(2,0,m) ++ intbytes(x)
  where           m = |intbytes(x)|
    and intbytes(x) = a big-endian two's-complement representation
                      of the signed integer x, taking exactly as
                      many whole bytes as needed to unambiguously
                      identify the value

The value 0 needs zero bytes to identify the value, so intbytes(0) is the empty byte string. Non-zero values need at least one byte; the most-significant bit in the first byte in intbytes(x) for x≠0 is the sign bit.
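
A minimal Python sketch of intbytes and of the SignedInteger encoding follows. The helper name encode_signed_integer is hypothetical, and the sketch deliberately stops at 14 content bytes, where the lead byte alone carries the length; longer integers would need the varint length form described earlier.

```python
def intbytes(x):
    """Shortest big-endian two's-complement representation; empty for zero."""
    if x == 0:
        return b""
    n = 1
    # Grow until the n-byte two's-complement range contains x.
    while not (-(1 << (8 * n - 1)) <= x < (1 << (8 * n - 1))):
        n += 1
    return x.to_bytes(n, "big", signed=True)

def encode_signed_integer(x):
    body = intbytes(x)
    m = len(body)
    if m >= 15:
        raise NotImplementedError("varint length form omitted from this sketch")
    # header(2,0,m) with m < 15 is the single byte 2*64 + 0*16 + m.
    return bytes([0x80 + m]) + body
```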

For example,

[[   -257 ]] = [0x82, 0xFE, 0xFF]
[[   -256 ]] = [0x82, 0xFF, 0x00]
[[   -255 ]] = [0x82, 0xFF, 0x01]
[[   -254 ]] = [0x82, 0xFF, 0x02]
[[   -129 ]] = [0x82, 0xFF, 0x7F]
[[   -128 ]] = [0x81, 0x80]
[[   -127 ]] = [0x81, 0x81]
[[     -2 ]] = [0x81, 0xFE]
[[     -1 ]] = [0x81, 0xFF]
[[      0 ]] = [0x80]
[[      1 ]] = [0x81, 0x01]
[[    127 ]] = [0x81, 0x7F]
[[    128 ]] = [0x82, 0x00, 0x80]
[[    255 ]] = [0x82, 0x00, 0xFF]
[[    256 ]] = [0x82, 0x01, 0x00]
[[  32767 ]] = [0x82, 0x7F, 0xFF]
[[  32768 ]] = [0x83, 0x00, 0x80, 0x00]
[[  65535 ]] = [0x83, 0x00, 0xFF, 0xFF]
[[  65536 ]] = [0x83, 0x01, 0x00, 0x00]
[[ 131072 ]] = [0x83, 0x02, 0x00, 0x00]

String
[[ S ]] when S ∈ String = header(2,1,m) ++ utf8(S)
  where       m = |utf8(S)|
    and utf8(S) = the UTF-8 encoding of S

ByteString
[[ B ]] when B ∈ ByteString = header(2,2,m) ++ B
                    where m = |B|

Symbol
[[ S ]] when S ∈ Symbol = header(2,3,m) ++ utf8(S)
  where       m = |utf8(S)|
    and utf8(S) = the UTF-8 encoding of S

Fixed-length Atoms

Booleans
[[ #f ]] = header(3,0,0) = [0xC0]
[[ #t ]] = header(3,0,1) = [0xC1]
Floats and Doubles
[[ F ]] when F ∈ Float  = header(3,0,2) ++ binary32(F)
[[ D ]] when D ∈ Double = header(3,0,3) ++ binary64(D)
  where binary32(F) and binary64(D) are big-endian 4- and 8-byte
        IEEE 754 binary representations
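
The fixed-length encodings map directly onto Python's struct module, whose ">f" and ">d" formats produce big-endian IEEE 754 binary32 and binary64 representations. The function names below are hypothetical.

```python
import struct

def encode_boolean(b):
    # header(3,0,0) for false, header(3,0,1) for true
    return b"\xC1" if b else b"\xC0"

def encode_float(f):
    # header(3,0,2) followed by the 4-byte big-endian binary32 form
    return b"\xC2" + struct.pack(">f", f)

def encode_double(d):
    # header(3,0,3) followed by the 8-byte big-endian binary64 form
    return b"\xC3" + struct.pack(">d", d)
```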

Examples

For the following examples, imagine an application that maps Record short form label number 0 to label discard, 1 to capture, and 2 to observe.

| Value | Encoded hexadecimal byte sequence |
|---|---|
| (capture (discard)) | 11 00 |
| (observe (speak (discard) (capture (discard)))) | 21 33 B5 73 70 65 61 6B 00 11 00 |
| [1 2 3 4] | 44 81 01 81 02 81 03 81 04 |
| [-2 -1 0 1] | 44 81 FE 81 FF 80 81 01 |
| ["hello" there #"world" [] #set{} #t #f] | 47 95 68 65 6C 6C 6F B5 74 68 65 72 65 A5 77 6F 72 6C 64 40 50 C1 C0 |
| -257 | 82 FE FF |
| -1 | 81 FF |
| 0 | 80 |
| 1 | 81 01 |
| 255 | 82 00 FF |
| 1f | C2 3F 80 00 00 |
| 1d | C3 3F F0 00 00 00 00 00 00 |
| -1.202e300d | C3 FE 3C B7 B7 59 BF 04 26 |

Finally, a larger example, using a non-Symbol label for a record.10 The Value

([titled person 2 thing 1]
   101
   "Blackwell"
   (date 1821 2 3)
   "Dr")

encodes to

35                              ;; Record, generic, 4+1
  45                              ;; Sequence, 5
    B6 74 69 74 6C 65 64            ;; Symbol, "titled"
    B6 70 65 72 73 6F 6E            ;; Symbol, "person"
    81 02                           ;; SignedInteger, "2"
    B5 74 68 69 6E 67               ;; Symbol, "thing"
    81 01                           ;; SignedInteger, "1"
  81 65                           ;; SignedInteger, "101"
  99 42 6C 61 63 6B 77 65 6C 6C   ;; String, "Blackwell"
  34                              ;; Record, generic, 3+1
    B4 64 61 74 65                  ;; Symbol, "date"
    82 07 1D                        ;; SignedInteger, "1821"
    81 02                           ;; SignedInteger, "2"
    81 03                           ;; SignedInteger, "3"
  92 44 72                        ;; String, "Dr"
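
The encoding rules can be gathered into a small encoder that reproduces byte sequences such as the one above. This is a minimal sketch, not a reference implementation: the Symbol and Record classes, the SHORT_FORMS table, and the use of Python list, set, dict, bytes and float (encoded as Double) are all assumptions made for illustration, and Float is omitted.

```python
import struct
from dataclasses import dataclass

@dataclass(frozen=True)
class Symbol:   # hypothetical wrapper distinguishing Symbols from Strings
    name: str

@dataclass(frozen=True)
class Record:   # hypothetical Record representation: a label plus a tuple of fields
    label: object
    fields: tuple

# Assumed application-specific short forms, matching the examples in this section.
SHORT_FORMS = [(Symbol("discard"), 0), (Symbol("capture"), 1), (Symbol("observe"), 2)]

def varint(m):
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)

def header(t, n, m):
    if m < 15:
        return bytes([t * 64 + n * 16 + m])
    return bytes([t * 64 + n * 16 + 15]) + varint(m)

def intbytes(x):
    if x == 0:
        return b""
    bits = (x if x >= 0 else ~x).bit_length() + 1  # magnitude bits plus a sign bit
    return x.to_bytes((bits + 7) // 8, "big", signed=True)

def encode(v):
    if isinstance(v, bool):  # checked before int, because bool is an int subtype
        return b"\xC1" if v else b"\xC0"
    if isinstance(v, int):
        body = intbytes(v)
        return header(2, 0, len(body)) + body
    if isinstance(v, str):
        body = v.encode("utf-8")
        return header(2, 1, len(body)) + body
    if isinstance(v, bytes):
        return header(2, 2, len(v)) + v
    if isinstance(v, Symbol):
        body = v.name.encode("utf-8")
        return header(2, 3, len(body)) + body
    if isinstance(v, float):
        return b"\xC3" + struct.pack(">d", v)
    if isinstance(v, Record):
        body = b"".join(encode(f) for f in v.fields)
        for known, n in SHORT_FORMS:
            if known == v.label:  # short form: the label is implicit
                return header(0, n, len(v.fields)) + body
        return header(0, 3, len(v.fields) + 1) + encode(v.label) + body
    if isinstance(v, list):
        return header(1, 0, len(v)) + b"".join(encode(x) for x in v)
    if isinstance(v, (set, frozenset)):
        return header(1, 1, len(v)) + b"".join(encode(x) for x in v)
    if isinstance(v, dict):
        return header(1, 2, len(v)) + b"".join(encode(k) + encode(x) for k, x in v.items())
    raise TypeError("no encoding defined in this sketch for %r" % (v,))

date = Record(Symbol("date"), (1821, 2, 3))
label = [Symbol("titled"), Symbol("person"), 2, Symbol("thing"), 1]
blackwell = Record(label, (101, "Blackwell", date, "Dr"))
assert encode(blackwell).hex().startswith("3545b6")
```

Note that Python's own set and dict types treat 1, 1.0 and True as the same key, so a faithful mapping would need a more careful representation of Set and Dictionary than this sketch uses.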

Conventions for Common Data Types

The Value data type is essentially an S-Expression, able to represent semi-structured data over ByteString, String, SignedInteger atoms and so on.

However, users need a wide variety of data types for representing domain-specific values such as various kinds of encoded and normalized text, calendrical values, machine words, and so on.

We use appropriately-labelled Records to denote these domain-specific data types.

All of these conventions are optional. They form a layer atop the core Value structure. Non-domain-specific tools do not in general need to treat them specially.

Validity. Many of the labels we will describe in this section come with side-conditions on the contents of labelled Records. It is possible to construct an instance of Value that violates these side-conditions without ceasing to be a Value or becoming unrepresentable. However, we say that such a Value is invalid because it fails to honour the necessary side-conditions. Implementations SHOULD allow two modes of working: one which treats all Values identically, without regard for side-conditions, and one which enforces validity (i.e. side-conditions) when reading, writing, or constructing Values.

MIME-type tagged binary data

Many internet protocols use media types (a.k.a. MIME types) to indicate the format of some associated binary data. For this purpose, we define MIMEData to be a record labelled mime with two fields, the first being a Symbol, the media type, and the second being a ByteString, the binary data.

While each media type may define its own rules for comparing documents, we define ordering among MIMEData representations of such media types lexicographically over the (Symbol, ByteString) pair.

Examples.

| Value | Encoded hexadecimal byte sequence |
|---|---|
| (mime application/octet-stream #"abcde") | 33 B4 6D 69 6D 65 BF 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D A5 61 62 63 64 65 |
| (mime text/plain "ABC") | 33 B4 6D 69 6D 65 BA 74 65 78 74 2F 70 6C 61 69 6E 93 41 42 43 |
| (mime application/xml "<xhtml/>") | 33 B4 6D 69 6D 65 BF 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 98 3C 78 68 74 6D 6C 2F 3E |
| (mime text/csv "123,234,345") | 33 B4 6D 69 6D 65 B8 74 65 78 74 2F 63 73 76 9B 31 32 33 2C 32 33 34 2C 33 34 35 |

Applications making heavy use of mime records may choose to use a short form label number for the record type. For example, if short form label number 1 were chosen, the second example above, (mime text/plain "ABC"), would be encoded with "12" in place of "33 B4 6D 69 6D 65".

Text

Normalization forms

In order for users to unambiguously signal or require a particular normalization form, we define a NormalizedString, which is a Record labelled with unicode-normalization and having two fields, the first of which is a Symbol specifying the normalization form used (e.g. nfc, nfd, nfkc, nfkd), and the second of which is a String whose underlying code point representation MUST be normalized according to the named normalization form.
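
The side-condition on a NormalizedString can be checked with Python's unicodedata module, as in the sketch below. The function name and the FORMS table are hypothetical, and treating any form name outside the four listed examples as invalid is an assumption of this sketch rather than something the text requires.

```python
import unicodedata

# Symbols named in the text, mapped to the names unicodedata.normalize expects.
FORMS = {"nfc": "NFC", "nfd": "NFD", "nfkc": "NFKC", "nfkd": "NFKD"}

def is_valid_normalized_string(form, text):
    """Check the side-condition for (unicode-normalization form text)."""
    if form not in FORMS:
        return False  # assumption: unknown form names are rejected
    # The String is valid when normalizing it again leaves it unchanged.
    return unicodedata.normalize(FORMS[form], text) == text
```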

IRIs (URIs, URLs, URNs, etc.)

An IRI is a Record labelled with iri and having one field, a String which is the IRI itself and which MUST be a valid absolute or relative IRI.

Machine words

The definition of SignedInteger captures all integers. However, in certain circumstances it can be valuable to assert that a number inhabits a particular range, such as a fixed-width machine word.

A family of labels of the form in and un, that is, the letter i or u followed by a bit width n ∈ {16,32,64} (for example i16 or u64), denotes n-bit-wide signed and unsigned range restrictions, respectively. Records with these labels MUST have one field, a SignedInteger, which MUST fall within the appropriate range. That is, to be valid,

  • in (i16 x), -32768 <= x <= 32767.
  • in (u16 x), 0 <= x <= 65535.
  • in (i32 x), -2147483648 <= x <= 2147483647.
  • etc.
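
The range restrictions follow the usual two's-complement bounds, which the following Python sketch computes from the label. The function machine_word_in_range is a hypothetical helper that takes the label as a plain string such as "i16".

```python
def machine_word_in_range(label, x):
    """Check whether x satisfies the side-condition for a record (label x)."""
    kind, width = label[0], int(label[1:])
    if kind not in ("i", "u") or width not in (16, 32, 64):
        raise ValueError("not a machine-word label: %r" % (label,))
    if kind == "i":
        # Signed: -2^(n-1) <= x <= 2^(n-1) - 1
        return -(1 << (width - 1)) <= x <= (1 << (width - 1)) - 1
    # Unsigned: 0 <= x <= 2^n - 1
    return 0 <= x <= (1 << width) - 1
```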

Anonymous Tuples and Unit

A Tuple is a Record with label tuple and zero or more fields, denoting an anonymous tuple of values.

The 0-ary tuple, (tuple), denotes the empty tuple, sometimes called "unit" or "void" (but not e.g. JavaScript's "undefined" value).

Null and Undefined

Tony Hoare's "billion-dollar mistake" can be represented with the 0-ary Record (null). An "undefined" value can be represented as (undefined).

Dates and Times

Dates, times, moments, and timestamps can be represented with a Record with label rfc3339 having a single field, a String, which MUST conform to one of the full-date, partial-time, full-time, or date-time productions of section 5.6 of RFC 3339.

Representing Values in Programming Languages

We have given a definition of Value and its semantics, and proposed a concrete syntax for communicating and storing Values. We now turn to suggested representations of Values as programming-language values for various programming languages.

When designing a language mapping, an important consideration is roundtripping: serialization after deserialization, and vice versa, should both be identities.

JavaScript

  • SignedInteger ↔ numbers or BigInt
  • String ↔ strings
  • ByteString ↔ Uint8Array
  • Symbol ↔ Symbol.for(...)
  • Boolean ↔ Boolean
  • Float and Double ↔ numbers
  • Record ↔ { "_label": theLabel, "_fields": [field0, ..., fieldN] }, plus convenience accessors
    • (undefined) ↔ the undefined value
    • (rfc3339 F) ↔ Date, if F matches the date-time RFC 3339 production
  • Sequence ↔ Array
  • Set ↔ { "_set": M } where M is a Map from the elements of the set to true
  • Dictionary ↔ a Map

Scheme/Racket

  • SignedInteger ↔ exact numbers
  • String ↔ strings
  • ByteString ↔ byte vector (Racket: "Bytes")
  • Symbol ↔ symbols
  • Boolean ↔ booleans
  • Float and Double ↔ inexact numbers (Racket: single- and double-precision floats)
  • Record ↔ structures (Racket: prefab struct)
  • Sequence ↔ lists
  • Set ↔ Racket: sets
  • Dictionary ↔ Racket: hash-table

Java

  • SignedInteger ↔ Integer, Long, BigInteger
  • String ↔ String
  • ByteString ↔ byte[]
  • Symbol ↔ a simple data class wrapping a String
  • Boolean ↔ Boolean
  • Float and Double ↔ Float and Double
  • Record ↔ in a simple implementation, a generic Record class; else perhaps a bean mapping?
  • Sequence ↔ an implementation of java.util.List
  • Set ↔ an implementation of java.util.Set
  • Dictionary ↔ an implementation of java.util.Map

Erlang

  • SignedInteger ↔ integers
  • String ↔ tuple of utf8 and a binary
  • ByteString ↔ a binary
  • Symbol ↔ the underlying string converted to an Erlang atom, if some kind of an "unsafe" mode is set on the decoder (because Erlang atoms are not GC'd); otherwise perhaps a tuple of symbol and a binary of the utf-8
  • Boolean ↔ true and false
  • Float and Double ↔ floats (unsure how Erlang deals with single-precision)
  • Record ↔ a tuple with the label in the first position, and the fields in subsequent positions
  • Sequence ↔ a list
  • Set ↔ a sets set (is this unambiguous? Maybe a map from elements to true?)
  • Dictionary ↔ a map (new in Erlang/OTP R17)

Appendix. Table of lead byte values

 0x - short form Record label index 0
 1x - short form Record label index 1
 2x - short form Record label index 2
 3x - Record
 4x - Sequence
 5x - Set
 6x - Dictionary
(7x)  RESERVED
 8x - SignedInteger
 9x - String
 Ax - Bytes
 Bx - Symbol
 C0 - False
 C1 - True
 C2 - Float
 C3 - Double
(Cx)  RESERVED C4-CF
(Dx)  RESERVED
(Ex)  RESERVED
(Fx)  RESERVED

Appendix. Why not Just Use JSON?

JSON offers syntax for numbers, strings, booleans, null, arrays and string-keyed maps. However, it suffers from two major problems. First, it offers no semantics for the syntax: it is left to each implementation to determine how to treat each JSON term. This causes interoperability and even security issues. Second, JSON's lack of support for type tags leads to awkward and incompatible encodings of type information in terms of the fixed suite of constructors on offer.

There are other minor problems with JSON having to do with its syntax. Examples include its relative verbosity and its lack of support for binary data.

JSON syntax doesn't mean anything

When are two JSON values the same? When are they different?

The specifications are largely silent on these questions. Different JSON implementations give different answers.

Specifically, JSON does not:

  • assign any meaning to numbers,11
  • determine how strings are to be compared,12
  • determine whether object key ordering is significant,13 or
  • determine whether duplicate object keys are permitted, what it would mean if they were, or how to determine a duplicate in the first place.14

In short, JSON syntax doesn't denote anything.15 16

Some examples:

  • are the JSON values 1, 1.0, and 1e0 the same or different?
  • are the JSON values 1.0 and 1.0000000000000001 the same or different?
  • are the JSON strings "päron" (UTF-8 70c3a4726f6e) and "päron" (UTF-8 7061cc88726f6e) the same or different?
  • are the JSON objects {"a":1, "b":2} and {"b":2, "a":1} the same or different?
  • which, if any, of {"a":1, "a":2}, {"a":1} and {"a":2} are the same? Are all three legal?
  • are {"päron":1} and {"päron":1} the same or different?
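
As one data point, the sketch below records how Python's standard json module happens to answer some of these questions. It illustrates a single implementation's choices; other implementations may answer differently, which is exactly the problem. The variable names are hypothetical.

```python
import json
import unicodedata

one, one_point_zero, one_e_zero = json.loads("1"), json.loads("1.0"), json.loads("1e0")
# The three compare equal, yet "1" parses to an int and the others to floats.
numbers_equal = one == one_point_zero == one_e_zero
numbers_same_type = type(one) is type(one_point_zero)

# Both texts parse to the same double-precision value.
close_floats_equal = json.loads("1.0") == json.loads("1.0000000000000001")

# Precomposed versus decomposed spellings of the same word.
composed = json.loads('"p\\u00e4ron"')
decomposed = json.loads('"pa\\u0308ron"')
strings_equal = composed == decomposed
strings_equal_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Key order is not significant here, and the last duplicate key silently wins.
key_order_ignored = json.loads('{"a":1, "b":2}') == json.loads('{"b":2, "a":1}')
duplicate_keys = json.loads('{"a":1, "a":2}')
```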

JSON can multiply nicely, but it can't add very well

JSON includes a fixed set of types: numbers, strings, booleans, null, arrays and string-keyed maps. Domain-specific data must be encoded into these types. For example, dates and email addresses are often represented as strings with an implicit internal structure.

There is no convention for labelling a value as belonging to a particular category. This makes it difficult to extract, say, all email addresses, or all URLs, from an arbitrary JSON document.

Instead, JSON-encoded data are often labelled in an ad-hoc way. Multiple incompatible approaches exist. For example, a "money" structure containing a currency field and an amount may be represented in any number of ways:

{ "_type": "money", "currency": "EUR", "amount": 10 }
{ "type": "money", "value": { "currency": "EUR", "amount": 10 } }
[ "money", { "currency": "EUR", "amount": 10 } ]
{ "@money": { "currency": "EUR", "amount": 10 } }

This causes particular problems when JSON is used to represent sum or union types, such as "either a value or an error, but not both". Again, multiple incompatible approaches exist.

For example, imagine an API for depositing money in an account. The response might be either a "success" response indicating the new balance, or one of a set of possible errors.

Sometimes, a pair of values is used, with null marking the option not taken.17

{ "ok": { "balance": 210 }, "error": null }
{ "ok": null, "error": "Unauthorized" }

The branch not chosen is sometimes present, sometimes omitted as if it were an optional field:

{ "ok": { "balance": 210 } }
{ "error": "Unauthorized" }

Sometimes, an array of a label and a value is used:

[ "ok", { "balance": 210 } ]
[ "error", "Unauthorized" ]

Sometimes, the shape of the data is sufficient to distinguish among the alternatives, and the label is left implicit:

{ "balance": 210 }
"Unauthorized"

JSON itself does not offer any guidance for which of these options to choose. In many real cases on the web, poor choices have led to encodings that are irrecoverably ambiguous.



Open questions

Q. Should "symbols" instead be URIs? Relative, usually; relative to what? Some domain-specific base URI?

Q. What about general rationals, subsuming integers and IEEE floats (except NaN and the Infinities)?

Q. Should I map to SPKI SEXP or is that nonsense / for later?18

Q. Should MIMEData be a special syntax for Records with a single ByteString field?

A. Not even. It should probably just be moved to the "conventions" section. Compare:

D5 BA text/plain    hello   -- using special MIMEData encoding
32 BA text/plain A5 hello   -- using bog standard type-labelled Record

Q. Should Symbol be a special syntax for a Record with a Symbol label (recursive!?) and a single String field?

Q. Should String be a special syntax for (utf8 Bytes)? Again, recursiveness problems...?

Q. Should Dictionary be a special syntax for etc etc.? Set? Float? Double?

--> Rule of thumb: if there's a special equivalence predicate for it, it needs to be built-in syntax. Otherwise it can be a regular record. (So: Boolean might not make the cut for special treatment?? Likewise String...? Ugh those are psychologically important perhaps)

Q. Are the language mappings reasonable? How about one for Python?


Streaming: needed for variable-sized structures. Tricky to design syntax for this that isn't gratuitously warty. End byte value.

SIGH. Streaming for text/bytes too I SUPPOSE. Chunks, like CBOR

Literal small integers: could be nice? Not absolutely necessary.

Maybe reorder: fixed-length atoms first, then variable-length atoms, then fixed-length compounds, then variable-length compounds? Reason being that then maybe can put the streaming forms of the variable-length ones very last.



  1. This design was loosely inspired by S-expressions, as seen in Lisp, Scheme, SPKI/SDSI, and many others, and by the ML type system, as seen in languages such as SML, OCaml, Haskell, Rust, and many others. It is also related to Zephyr ASDL (h/t Darius Bacon), which doesn't offer much in the way of atoms, but offers general-purpose labelled sums and products. See D. C. Wang, A. W. Appel, J. L. Korn, and C. S. Serra, “The Zephyr Abstract Syntax Description Language,” in USENIX Conference on Domain-Specific Languages, 1997, pp. 213-228. ↩︎

  2. The observant reader may note that the ordering here is the same as that implied by the tagging scheme used in the concrete binary syntax for Values. ↩︎

  3. Happily, the design of UTF-8 is such that this gives the same result as a lexicographic byte-by-byte comparison of the UTF-8 encoding of a string! ↩︎

  4. The Racket programming language defines “prefab” structure types, which map well to our Records. Racket supports record extensibility by encoding record supertypes into record labels as specially-formatted lists. ↩︎

  5. It is occasionally (but seldom) necessary to interpret such Symbol labels as UTF-8 encoded IRIs. Where a label can be read as a relative IRI, it is notionally interpreted with respect to the IRI urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34; where a label can be read as an absolute IRI, it stands for that IRI; and otherwise, it cannot be read as an IRI at all, and so the label simply stands for itself - for its own Value. ↩︎

  6. The two XML documents <x/> and <x /> differ by bytewise comparison, and thus yield different record values, even though under the semantics of XML they denote the same XML infoset. ↩︎

  7. Those who remember ASN.1 will recall BER, DER, PER, CER, XER and so on, each appropriate to a different setting. Similarly, Rivest's S-Expression design offers a human-friendly syntax, a syntax robust to network-induced message corruption, and an unambiguous, simple and easily-parsed machine-friendly syntax for the same underlying values. ↩︎

  8. Some encodings are unused. All such encodings are reserved for future versions of this specification. ↩︎

  9. In the BitTorrent encoding format, bencoding, dictionary key/value pairs must be sorted by key. This is a necessary step for ensuring serialization of Values is canonical. We do not require that key/value pairs (or set elements) be in sorted order for serialized Values, because (a) where canonicalization is used for cryptographic signatures, it is more reliable to simply retain the exact binary form of the signed document than to depend on canonical de- and re-serialization, and (b) sorting keys or elements makes no sense in streaming serialization formats. ↩︎

  10. It happens to line up with Racket's representation of a record label for an inheritance hierarchy where titled extends person extends thing:

    (struct date (year month day) #:prefab)
    (struct thing (id) #:prefab)
    (struct person thing (name date-of-birth) #:prefab)
    (struct titled person (title) #:prefab)
    
    ↩︎
  11. Section 6 of RFC 7159 does go so far as to indicate “good interoperability can be achieved” by imagining that parsers are able reliably to understand the syntax of numbers as denoting an IEEE 754 double-precision floating-point value. ↩︎

  12. Section 8.3 of RFC 7159 suggests that if an implementation compares strings used as object keys “code unit by code unit”, then it will interoperate with other such implementations, but neither requires this behaviour nor discusses comparisons of strings used in other contexts. ↩︎

  13. Section 4 of RFC 7159 remarks that “[implementations] differ as to whether or not they make the ordering of object members visible to calling software.” ↩︎

  14. Section 4 of RFC 7159 is the only place in the specification that mentions the issue. It explicitly sanctions implementations supporting duplicate keys, noting only that “when the names within an object are not unique, the behavior of software that receives such an object is unpredictable.” Implementations are free to choose any behaviour at all in this situation, including signalling an error, or discarding all but one of a set of duplicates. ↩︎

  15. The XML world has the concept of XML infoset. Loosely speaking, XML infoset is the denotation of an XML document; the meaning of the document. ↩︎

  16. Most other recent data languages are like JSON in specifying only a syntax with no associated semantics. While some do make a sketch of a semantics, the result is often underspecified (e.g. in terms of how strings are to be compared), overly machine-oriented (e.g. treating 32-bit integers as fundamentally distinct from 64-bit integers and from floating-point numbers), overly fine (e.g. giving visibility to the order in which map entries are written), or all three. ↩︎

  17. What is the meaning of a document where both ok and error are non-null? What might happen when a program is presented with such a document? ↩︎

  18. Why not just use Rivest's S-Expressions as they are? While they include binary data and sequences, and an obvious equivalence for them exists, they lack numbers per se as well as any kind of unordered structure such as sets or maps. In addition, while "display hints" allow labelling of binary data with an intended interpretation, they cannot be attached to any other kind of structure, and the "hint" itself can only be a binary blob. ↩︎