Preserves: an Expressive Data Language

# Preserves: an Expressive Data Language

Tony Garnock-Jones
September 2018. Version 0.0.3.

  [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
  [spki]: http://world.std.com/~cme/html/spki.html
  [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
  [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
  [abnf]: https://tools.ietf.org/html/rfc7405

This document proposes a data model and serialization format called *Preserves*.

Preserves supports *records* with user-defined *labels*. This makes it more expressive[^macro-expressiveness] than most data languages in use on the web and allows it to easily represent the *labelled sums of products* as seen in many functional programming languages.

Preserves also supports the usual suite of atomic and compound data types, in particular including *binary* data as a distinct type from text strings.

Finally, Preserves defines precisely how to *compare* two values. Comparison is based on the data model, not on syntax or on data structures of any particular implementation language.

  [^macro-expressiveness]: By "expressive" I mean *macro-expressive* in the sense of Felleisen's 1991 paper, "On the Expressive Power of Programming Languages". Roughly speaking, there's no way in a JSON document to introduce a new kind of information (such as binary data, or a date-stamp, or a "person" object) in an *unambiguous way* without *global agreement* from every potential consumer of the document. With an extensible labelled record type, there is.

    Felleisen, Matthias. “On the Expressive Power of Programming Languages.” Science of Computer Programming 17, no. 1--3 (1991): 35–75.

## Starting with Semantics

Taking inspiration from functional programming, we start with a definition of the *values* that we want to work with and give them meaning independent of their syntax.
When we write examples of values, we will do so using the [textual syntax](#textual-syntax) defined later in this document. Our `Value`s fall into two broad categories: *atomic* and *compound* data. Value = Atom | Compound Atom = Boolean | Float | Double | SignedInteger | String | ByteString | Symbol Compound = Record | Sequence | Set | Dictionary **Total order.** As we go, we will incrementally specify a total order over `Value`s. Two values of the same kind are compared using kind-specific rules. The ordering among values of different kinds is essentially arbitrary, but having a total order is convenient for many tasks, so we define it as follows:[^ordering-by-syntax] (Values) Atom < Compound (Compounds) Record < Sequence < Set < Dictionary (Atoms) Boolean < Float < Double < SignedInteger < String < ByteString < Symbol [^ordering-by-syntax]: The observant reader may note that the ordering here is (almost) the same as that implied by the tagging scheme used in the concrete binary syntax for `Value`s. (The exception is the syntax for small integers near zero.) **Equivalence.** Two `Value`s are equal if neither is less than the other according to the total order. ### Signed integers. A `SignedInteger` is a signed integer of arbitrary width. `SignedInteger`s are compared as mathematical integers. **Examples.** 10; -6; 0. **Non-examples.** NaN (the clue is in the name!); ∞ (not finite); 0.2 (not an integer); 1/7 (likewise); 2+*i*3 (likewise); √2 (likewise). ### Unicode strings. A `String` is a sequence of Unicode [code-point](http://www.unicode.org/glossary/#code_point)s. `String`s are compared lexicographically, code-point by code-point.[^utf8-is-awesome] [^utf8-is-awesome]: Happily, the design of UTF-8 is such that this gives the same result as a lexicographic byte-by-byte comparison of the UTF-8 encoding of a string! 
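The footnote's claim about UTF-8 can be checked empirically. The following is a small sketch in Python (chosen here only for illustration; the helper name `same_order` and the sample strings are assumptions, not part of Preserves). Python compares `str` values code point by code point, which matches the rule for `String`s given above. UTF-16 is included purely as a contrast, since its surrogate pairs do not preserve code-point order.

```python
# Illustrative sketch: code-point order versus byte order of two encodings.
samples = ["", "z", "水", "𝄞", "z水𝄞", "Hello world", "\uff5e"]

def same_order(strings, encoding):
    """True if sorting by code point equals sorting by encoded bytes."""
    by_code_point = sorted(strings)  # Python str comparison is by code point
    by_bytes = sorted(strings, key=lambda s: s.encode(encoding))
    return by_code_point == by_bytes

utf8_agrees = same_order(samples, "utf-8")
utf16_agrees = same_order(samples, "utf-16-be")
print(utf8_agrees, utf16_agrees)  # True False
```

The UTF-16 result differs because U+1D11E is encoded with a leading surrogate (0xD834), which sorts below the single code unit 0xFF5E even though the code point itself is larger.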
**Examples.** `"Hello world"`, an eleven-code-point string; `"z水𝄞"`, the string containing the three Unicode code-points `z` (0x7A), `水` (0x6C34) and `𝄞` (0x1D11E); `""`, the empty string. ### Binary data. A `ByteString` is an ordered sequence of zero or more eight-bit bytes. `ByteString`s are compared lexicographically. **Examples.** `#""`, the empty `ByteString`; `#"ABC"`, the `ByteString` containing the integers 65, 66 and 67 (corresponding to ASCII characters `A`, `B` and `C`). **N.B.** Despite appearances, these are *binary* data. ### Symbols. Programming languages like Lisp and Prolog frequently use string-like values called *symbols*. Here, a `Symbol` is, like a `String`, a sequence of Unicode code-points representing an identifier of some kind. `Symbol`s are also compared lexicographically by code-point. **Examples.** `hello-world`; `utf8-string`; `exact-integer?`. ### Booleans. There are exactly two `Boolean` values, “false” and “true”. The “false” value compares less-than the “true” value. We write `#false` for “false”, and `#true` for “true”. ### IEEE floating-point values. A `Float` is a single-precision IEEE 754 floating-point value; a `Double` is a double-precision IEEE 754 floating-point value. `Float`s, `Double`s and `SignedInteger`s are considered disjoint, and so by the rules [above](#total-order), every `Float` is less than every `Double`, and every `SignedInteger` is greater than both. Two `Float`s or two `Double`s are to be ordered by the `totalOrder` predicate defined in section 5.10 of [IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935). We write examples using a fractional part and/or an exponent to distinguish them from `SignedInteger`s. An additional suffix `f` distinguishes `Float`s from `Double`s. **Examples.** 10.0f; -6.0; 0.0f; 0.5; -1.202e300. **Non-examples.** 10, -6, and 0, because writing them this way indicates `SignedInteger`s, not `Float`s or `Double`s. ### Records. 
A `Record` is a *labelled* tuple of zero or more `Value`s, called the record's *fields*. A record's label is itself a `Value`, though it will usually be a `Symbol`.[^extensibility] [^iri-labels] `Record`s are compared lexicographically as if they were just tuples; that is, first by their labels, and then by the remainder of their fields. [^extensibility]: The [Racket](https://racket-lang.org/) programming language defines “[prefab](http://docs.racket-lang.org/guide/define-struct.html#(part._prefab-struct))” structure types, which map well to our `Record`s. Racket supports record extensibility by encoding record supertypes into record labels as specially-formatted lists. [^iri-labels]: It is occasionally (but seldom) necessary to interpret such `Symbol` labels as UTF-8 encoded IRIs. Where a label can be read as a relative IRI, it is notionally interpreted with respect to the IRI `urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34`; where a label can be read as an absolute IRI, it stands for that IRI; and otherwise, it cannot be read as an IRI at all, and so the label simply stands for itself—for its own `Value`. **Examples.** `foo(1 2 3)`, a `Record` with label `foo` and fields 1, 2 and 3; `void()`, a `Record` with label `void` and no fields. **Non-examples.** `()`, because it lacks a label; `void`, because it lacks even an empty tuple of fields. ### Sequences. A `Sequence` is a general-purpose, variable-length ordered sequence of zero or more `Value`s. `Sequence`s are compared lexicographically. **Examples.** `[]`, the empty sequence; `[1 2 3]`, the sequence of `SignedInteger`s 1, 2 and 3. ### Sets. A `Set` is an unordered finite set of `Value`s. It contains no duplicate values, following the [equivalence relation](#equivalence) induced by the total order on `Value`s. Two `Set`s are compared by sorting their elements ascending using the [total order](#total-order) and comparing the resulting `Sequence`s. 
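To make the ordering rules so far concrete, here is a minimal sketch of a comparison function in Python. The mapping from Python types to `Value` kinds is an assumption made for this example only: `Symbol` and `Record` are hypothetical stand-in classes, `tuple` plays the role of `Sequence`, and `frozenset` plays the role of `Set`. `Float` and `Dictionary` are left out for brevity. The `double_key` helper uses the well-known trick of reinterpreting the binary64 bit pattern so that unsigned integer order matches `totalOrder`.

```python
import struct
from dataclasses import dataclass
from functools import cmp_to_key

class Symbol(str):
    """Hypothetical stand-in for a Preserves Symbol."""

@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a Preserves Record."""
    label: object
    fields: tuple

# Position in this list is the rank of the kind in the total order.
KINDS = [bool, float, int, str, bytes, Symbol, Record, tuple, frozenset]

def kind(v):
    # Exact type match keeps bool apart from int and Symbol apart from str.
    return KINDS.index(type(v))

def double_key(x):
    # Map the binary64 bit pattern to an unsigned integer whose natural
    # order agrees with the IEEE 754 totalOrder predicate.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits ^ 0xFFFFFFFFFFFFFFFF if bits >> 63 else bits | (1 << 63)

def cmp(a, b):
    return (a > b) - (a < b)

def compare(a, b):
    """Return -1, 0 or 1 according to the total order sketched above."""
    if kind(a) != kind(b):
        return cmp(kind(a), kind(b))
    if isinstance(a, float):
        return cmp(double_key(a), double_key(b))
    if isinstance(a, Record):
        # Records compare as tuples: label first, then the fields.
        return compare((a.label,) + a.fields, (b.label,) + b.fields)
    if isinstance(a, frozenset):
        key = cmp_to_key(compare)
        return compare(tuple(sorted(a, key=key)), tuple(sorted(b, key=key)))
    if isinstance(a, tuple):
        for x, y in zip(a, b):
            c = compare(x, y)
            if c != 0:
                return c
        return cmp(len(a), len(b))
    return cmp(a, b)  # bool, int, str, bytes, Symbol: Python's own ordering

values = [Symbol("a"), (1, 2), "a", frozenset(), 1, b"a", 1.0, True,
          Record(Symbol("void"), ())]
ordered = sorted(values, key=cmp_to_key(compare))
print([type(v).__name__ for v in ordered])
```

One caveat of this particular mapping: Python treats `1`, `1.0` and `True` as equal when hashing, so a `frozenset` cannot hold all three at once, whereas a Preserves `Set` can. A fuller implementation would need wrapper types for atoms.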
**Examples.** `#set{}`, the empty set; `#set{#set{}}`, the set containing only the empty set; `{4 "hello" void() 9.0f}`, the set containing 4, the string `"hello"`, the record with label `void` and no fields, and the `Float` denoting the number 9.0; `{1 1.0f}`, the set containing a `SignedInteger` and a `Float`; `{mime(application/xml #"") mime(application/xml #"")}`, a set containing two different `mime` records.[^mime-xml-difference]

  [^mime-xml-difference]: The two XML documents `` and `` differ by bytewise comparison, and thus yield different record values, even though under the semantics of XML they denote the same XML infoset.

**Non-examples.** `{1 1}`, because it contains multiple equivalent `Value`s; `{}`, because without the `#set` marker, it denotes the empty dictionary.

### Dictionaries.

A `Dictionary` is an unordered finite collection of pairs of `Value`s. Each pair comprises a *key* and a *value*. Keys in a `Dictionary` must be pairwise distinct. Instances of `Dictionary` are compared by lexicographic comparison of the sequences resulting from ordering each `Dictionary`'s pairs in ascending order by key.

**Examples.** `{}`, the empty dictionary; `{a: 1}`, the dictionary mapping the `Symbol` `a` to the `SignedInteger` 1; `{[1 2 3]: a}`, mapping `[1 2 3]` to `a`; `{"hi": 0, hi: 0, there: []}`, having a `String` and two `Symbol` keys, and `SignedInteger` and `Sequence` values.

**Non-examples.** `{a:1 b:2 a:3}`, because it contains duplicate keys; `{[7 8]:[] [7 8]:99}`, for the same reason.

## Textual Syntax

Now that we have discussed `Value`s and their meanings, we may turn to techniques for *representing* `Value`s for communication or storage.

In this section, we use [case-sensitive ABNF][abnf] to define a textual syntax that is easy for people to read and write.[^json-superset] Most of the examples in this document are written using this syntax. In the following section, we will define an equivalent compact machine-readable syntax.
  [^json-superset]: The grammar of the textual syntax is a superset of JSON, with the slightly unusual feature that `true`, `false`, and `null` are all read as `Symbol`s, and that `SignedInteger`s are never read as `Double`s.

### Character set

[ABNF][abnf] allows easy definition of US-ASCII-based languages. However, Preserves is a Unicode-based language. Therefore, we reinterpret ABNF as a grammar for recognising sequences of Unicode code points.

Textual syntax for a `Value` *SHOULD* be encoded using UTF-8 where possible.

### Whitespace

Whitespace is defined as any number of spaces, tabs, carriage returns, line feeds, comments, or commas. A comment is a semicolon followed by the Unicode code points up to and including the next carriage return or line feed.

    ws      = *(%x20 / %x09 / newline / comment / ",")
    newline = CR / LF
    comment = ";" *(WSP / nonnl) newline
    nonnl   = <any Unicode code point except CR or LF>

### Grammar

Standalone documents containing textual representations of `Value`s may have trailing whitespace.

    Document = Value ws

Any `Value` may be preceded by whitespace.

    Value      = ws (Record / Collection / Atom / Compact)
    Collection = Sequence / Dictionary / Set
    Atom       = Boolean / Float / Double / SignedInteger /
                 String / ByteString / Symbol

Each `Record` is its label-`Value` followed by a parenthesised grouping of its field-`Value`s. Whitespace is not permitted between the label and the open-parenthesis.

    Record = Value "(" *Value ws ")"

`Sequence`s are enclosed in square brackets. `Dictionary` values are curly-brace-enclosed colon-separated pairs of values.
`Set`s are written either as a simple curly-brace-enclosed non-empty sequence of values, or as a possibly-empty sequence of values enclosed by the tokens `#set{` and `}`.[^printing-collections] Sequence = "[" *Value ws "]" Dictionary = "{" *(Value ws ":" Value) ws "}" Set = %s"#set{" *Value ws "}" / "{" 1*Value ws "}" [^printing-collections]: **Implementation note.** When implementing printing of `Value`s using the textual syntax, consider supporting (a) optional pretty-printing with indentation, (b) optional JSON-compatible print mode for that subset of `Value` that is compatible with JSON, and (c) optional submodes for no commas, commas separating, and commas terminating elements or key/value pairs within a collection. Any `Value` may be represented using the [compact binary syntax](#compact-binary-syntax) by directly prefixing the binary form of the `Value` with ASCII `SOH` (`%x01`), or by enclosing a hexadecimal representation of the binary form of the `Value` in the tokens `#hexvalue{` and `}`. Compact = %x01 / %s"#hexvalue{" *(ws / HEXDIG) ws "}" `Boolean`s are the simple literal strings `#true` and `#false`. Boolean = %s"#true" / %s"#false" Numeric data follow the [JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with the addition of a trailing "f" distinguishing `Float` from `Double` values. `Float`s and `Double`s always have either a fractional part or an exponent part, where `SignedInteger`s never have either.[^reading-and-writing-floats-accurately] [^arbitrary-precision-signedinteger] Float = flt %i"f" Double = flt SignedInteger = int digit1-9 = %x31-39 nat = %x30 / ( digit1-9 *DIGIT ) int = ["-"] nat frac = "." 1*DIGIT exp = %i"e" ["-"/"+"] 1*DIGIT flt = int (frac exp / frac / exp) [^reading-and-writing-floats-accurately]: **Implementation note.** Your language's standard library likely has a good routine for converting between decimal notation and IEEE 754 floating-point. 
However, if not, or if you are interested in the challenges of accurately reading and writing floating point numbers, see the excellent matched pair of 1990 papers by Clinger and Steele & White, and a recent follow-up by Jaffer:

    Clinger, William D. ‘How to Read Floating Point Numbers Accurately’. In Proc. PLDI. White Plains, New York, 1990.

    Steele, Guy L., Jr., and Jon L. White. ‘How to Print Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains, New York, 1990.

    Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013.

  [^arbitrary-precision-signedinteger]: **Implementation note.** Be aware when implementing reading and writing of `SignedInteger`s that the data model *requires* arbitrary-precision integers. Your I/O routines must not truncate precision either when reading or writing a `SignedInteger`.

`String`s are, [as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly escaped text surrounded by double quotes. The escaping rules are the same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]

    String    = %x22 *char %x22
    char      = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
    unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
    escape    = %x5C              ; \
    escaped   = ( %x5C /          ; \    reverse solidus U+005C
                  %x2F /          ; /    solidus         U+002F
                  %x62 /          ; b    backspace       U+0008
                  %x66 /          ; f    form feed       U+000C
                  %x6E /          ; n    line feed       U+000A
                  %x72 /          ; r    carriage return U+000D
                  %x74 )          ; t    tab             U+0009

  [^string-json-correspondence]: The grammar for `String` has the same effect as the [JSON](https://tools.ietf.org/html/rfc8259#section-7) grammar for `string`. Some auxiliary definitions (e.g. `escaped`) are lifted largely unmodified from the text of RFC 8259.

  [^escaping-surrogate-pairs]: In particular, note JSON's rules around the use of surrogate pairs for code points not in the Basic Multilingual Plane.
We encourage implementations to avoid escaping such characters when producing output, and instead to rely on the UTF-8 encoding of the entire document to handle them correctly.

A `ByteString` may be written in any of three different forms.

The first is similar to a `String`, but prepended with a hash sign `#`. In addition, only Unicode code points overlapping with printable 7-bit ASCII are permitted unescaped inside such a `ByteString`; other byte values must be written as `\x` followed by a two-digit hexadecimal value.

    ByteString   = "#" %x22 *binchar %x22
    binchar      = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
    binunescaped = %x20-21 / %x23-5B / %x5D-7E

The second is as a sequence of pairs of hexadecimal digits interleaved with whitespace and surrounded by `#hex{` and `}`.

    ByteString =/ %s"#hex{" *(ws / 2HEXDIG) ws "}"

The third is as a sequence of [Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved with whitespace and surrounded by `#base64{` and `}`. Plain and URL-safe Base64 characters are allowed.

    ByteString =/ %s"#base64{" *(ws / base64char) ws "}"
    base64char  = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="

A `Symbol` may be written in a "bare" form[^cf-sexp-token] so long as it conforms to certain restrictions on the characters appearing in the symbol. Alternatively, it may be written in a quoted form. The quoted form is much the same as the syntax for `String`s, including embedded escape syntax, except using a bar or pipe character (`|`) instead of a double quote mark.

    Symbol   = symstart *symcont / "|" *symchar "|"
    symstart = ALPHA / sympunct / symunicode
    symcont  = ALPHA / sympunct / symunicode / DIGIT / "-" / "."
    sympunct = "~" / "!" / "@" / "$" / "%" / "^" / "&" / "*" / "?"
/ "_" / "=" / "+" / "<" / ">" / "/" symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG) symunicode = [^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt] definition of "token representation", and with the [R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4). ## Compact Binary Syntax A `Repr` is an encoding, or representation, of a specific `Value`. Each `Repr` comprises one or more bytes describing first the kind of represented `Value` and the length of the representation, and then the encoded details of the `Value` itself. For a value `v`, we write `[[v]]` for the `Repr` of v. ### Type and Length representation Each `Repr` takes one of three possible forms: - (A) a fixed-length form, used for simple values such as `Boolean`s or `Float`s. - (B) a variable-length form with length specified up-front, used for almost all `Record`s as well as for most `Sequence`s and `String`s, when their sizes are known at the time serialization begins. - (C) a variable-length streaming form with unknown or unpredictable length, used only seldom for `Record`s, since the number of fields in a `Record` is usually statically known, but sometimes used for `Sequence`s, `String`s etc., such as in cases when serialization begins before the number of elements or bytes in the corresponding `Value` is known. Applications may choose between formats B and C depending on their needs at serialization time. #### The lead byte Every `Repr` starts with a *lead byte*, constructed by `leadbyte(t,n,m)`, where `t`,`n`∈{0,1,2,3} and 0≤`m`<16: leadbyte(t,n,m) = [t*64 + n*16 + m] The arguments `t` and `n` describe the rest of the representation:[^some-encodings-unused] [^some-encodings-unused]: Some encodings are unused. All such encodings are reserved for future versions of this specification. - `t`=0, `n`=0 (format A) represents an `Atom` with fixed-length binary representation. 
- `t`=0, `n`=1 (format A) represents certain small `SignedInteger`s. - `t`=0, `n`=2 (format C) is a Stream Start byte. - `t`=0, `n`=3 (format C) is a Stream End byte. - `t`=1 (format B) represents an `Atom` with variable-length binary representation. - `t`=2 (format B) represents a `Record`. - `t`=3 (format B) represents a `Sequence`, `Set` or `Dictionary`. #### Encoding data of fixed length (format A) Each specific type of data defines its own rules for this format. #### Encoding data of known length (format B) A `Repr` where the length of the `Value` to be encoded is variable but known uses the value of `m` in `leadbyte` to encode its length. The length counts *bytes* for atomic `Value`s, but counts *contained values* for compound `Value`s. - A length `l` between 0 and 14 is represented using `leadbyte` with `m=l`. - A length of 15 or greater is represented by `m=15` and additional bytes describing the length following the lead byte. The function `header(t,n,m)` yields an appropriate sequence of bytes describing a `Repr`'s type and length when `t`, `n` and `m` are appropriate non-negative integers: header(t,n,m) = leadbyte(t,n,m) when m < 15 or leadbyte(t,n,15) ++ varint(m) otherwise The additional length bytes are formatted as [base 128 varints][varint]. We write `varint(m)` for the varint-encoding of `m`. Quoting the [Google Protocol Buffers][varint] definition, > Each byte in a varint, except the last byte, has the most > significant bit (msb) set – this indicates that there are further > bytes to come. The lower 7 bits of each byte are used to store the > two's complement representation of the number in groups of 7 bits, > least significant group first. **Examples.** - The varint representation of 15 is just the byte 15. - 300 (binary, grouped into 7-bit chunks, `10 0101100`) varint-encodes to the two bytes 172 and 2. - 1000000000 (binary `11 1011100 1101011 0010100 0000000`) varint-encodes to bytes 128, 148, 235, 220, and 3. 
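The definitions of `leadbyte`, `header` and `varint` translate almost directly into code. The following is a minimal sketch in Python (the language is chosen here only for illustration, and the function bodies are one possible reading of the definitions above, not a reference implementation).

```python
def leadbyte(t: int, n: int, m: int) -> bytes:
    """Single lead byte: two bits of t, two bits of n, four bits of m."""
    assert 0 <= t <= 3 and 0 <= n <= 3 and 0 <= m < 16
    return bytes([t * 64 + n * 16 + m])

def varint(m: int) -> bytes:
    """Base 128 varint: 7 bits per byte, least significant group first."""
    assert m >= 0
    out = bytearray()
    while True:
        group = m & 0x7F
        m >>= 7
        if m:
            out.append(group | 0x80)  # msb set: more bytes follow
        else:
            out.append(group)         # last byte: msb clear
            return bytes(out)

def header(t: int, n: int, m: int) -> bytes:
    """Type-and-length prefix for format B."""
    if m < 15:
        return leadbyte(t, n, m)
    return leadbyte(t, n, 15) + varint(m)

print(list(varint(15)))          # [15]
print(list(varint(300)))         # [172, 2]
print(list(varint(1000000000)))  # [128, 148, 235, 220, 3]
print(header(3, 2, 16).hex())    # ef10
```

The last line shows a length of 16, which no longer fits in the four bits of `m`, so the lead byte carries 15 and the real length follows as a varint.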
#### Streaming data of unknown length (format C) A `Repr` where the length of the `Value` to be encoded is variable and not known at the time serialization of the `Value` starts is encoded by a single Stream Start (“open”) byte, followed by zero or more *chunks*, followed by a matching Stream End (“close”) byte: open(t,n) = leadbyte(0,2, t*4 + n) close(t,n) = leadbyte(0,3, t*4 + n) For a `Repr` of a `Value` containing binary data, each chunk is to be a format B `Repr` of a `ByteString`, no matter the type of the overall `Repr`. For a `Repr` of a `Value` containing other `Value`s, each chunk is to be a single `Repr`. ### Records Format B (known length): [[ L(F_1...F_m) ]] = header(2,3,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] For `m` fields, `m+1` is supplied to `header`, to account for the encoding of the record label. Format C (streaming): [[ L(F_1...F_m) ]] = open(2,3) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close(2,3) Applications *SHOULD* prefer the known-length format for encoding `Record`s. #### Application-specific short form for labels Any given protocol using Preserves may additionally define an interpretation for `n`∈{0,1,2}, mapping each *short form label number* `n` to a specific record label. When encoding `m` fields with short form label number `n`, format B becomes header(2,n,m) ++ [[F_1]] ++...++ [[F_m]] and format C becomes open(2,n) ++ [[F_1]] ++...++ [[F_m]] ++ close(2,n) **Examples.** For example, a protocol may choose to map records labelled `void` to `n=0`, making [[void()]] = header(2,0,0) = [0x80] or it may map records labelled `person` to short form label number 1, making [[person("Dr", "Elizabeth", "Blackwell")]] = header(2,1,3) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] = [0x93] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] for format B, or = open(2,1) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ close(2,1) = [0x29] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] ++ [0x39] for format C. 
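The record encodings above can be sketched as follows, in Python. This is an illustrative reading only: `record_b`, `record_c` and `atom` are hypothetical helper names, fields and labels are passed in as already-encoded `Repr`s, and `atom` anticipates the `String` (`n`=1) and `Symbol` (`n`=3) encodings defined later in this document.

```python
def leadbyte(t, n, m):
    return bytes([t * 64 + n * 16 + m])

def varint(m):
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)

def header(t, n, m):
    return leadbyte(t, n, m) if m < 15 else leadbyte(t, n, 15) + varint(m)

def open_(t, n):  # trailing underscore avoids shadowing Python's open()
    return leadbyte(0, 2, t * 4 + n)

def close(t, n):
    return leadbyte(0, 3, t * 4 + n)

def atom(n, text):
    """Format B String (n=1) or Symbol (n=3)."""
    data = text.encode("utf-8")
    return header(1, n, len(data)) + data

def record_b(fields, label=None, short=None):
    """Known-length record; pass either an encoded label or a short form number."""
    if short is not None:
        return header(2, short, len(fields)) + b"".join(fields)
    return header(2, 3, len(fields) + 1) + label + b"".join(fields)

def record_c(fields, label=None, short=None):
    """Streamed record with the same arguments as record_b."""
    if short is not None:
        return open_(2, short) + b"".join(fields) + close(2, short)
    return open_(2, 3) + label + b"".join(fields) + close(2, 3)

names = [atom(1, s) for s in ("Dr", "Elizabeth", "Blackwell")]
person_b = record_b(names, short=1)
person_c = record_c(names, short=1)
print(person_b[:1].hex(), person_c[:1].hex(), person_c[-1:].hex())  # 93 29 39

void_generic = record_b([], label=atom(3, "void"))
print(void_generic.hex())  # b174766f6964
```

Comparing `void_generic` (six bytes) with the one-byte short form `[0x80]` shows why a protocol might reserve short form numbers for its most common labels.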
### Sequences, Sets and Dictionaries Format B (known length): [[ [X_1...X_m] ]] = header(3,0,m) ++ [[X_1]] ++...++ [[X_m]] [[ #set{X_1...X_m} ]] = header(3,1,m) ++ [[X_1]] ++...++ [[X_m]] [[ {K_1:V_1...K_m:V_m} ]] = header(3,2,m*2) ++ [[K_1]] ++ [[V_1]] ++... ++ [[K_m]] ++ [[V_m]] Note that `m*2` is given to `header` for a `Dictionary`, since there are two `Value`s in each key-value pair. Format C (streaming): [[ [X_1...X_m] ]] = open(3,0) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,0) [[ #set{X_1...X_m} ]] = open(3,1) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,1) [[ {K_1:V_1...K_m:V_m} ]] = open(3,2) ++ [[K_1]] ++ [[V_1]] ++... ++ [[K_m]] ++ [[V_m]] ++ close(3,2) Applications may use whichever format suits their needs on a case-by-case basis. There is *no* ordering requirement on the `X_i` elements or `K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any order. [^no-sorting-rationale]: In the BitTorrent encoding format, [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding), dictionary key/value pairs must be sorted by key. This is a necessary step for ensuring serialization of `Value`s is canonical. We do not require that key/value pairs (or set elements) be in sorted order for serialized `Value`s, because (a) where canonicalization is used for cryptographic signatures, it is more reliable to simply retain the exact binary form of the signed document than to depend on canonical de- and re-serialization, and (b) sorting keys or elements makes no sense in streaming serialization formats. However, a quality implementation may wish to offer the programmer the option of serializing with set elements and dictionary keys in sorted order. Note that `header(3,3,m)` and `open(3,3)`/`close(3,3)` are unused and reserved. 
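As an illustration of the choice between formats B and C for collections, here is a short Python sketch. The function names are hypothetical, elements are supplied as already-encoded `Repr`s, and the streaming variant is written as a generator so that it never needs to know how many elements there will be.

```python
from typing import Iterable, Iterator

def leadbyte(t, n, m):
    return bytes([t * 64 + n * 16 + m])

def varint(m):
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)

def header(t, n, m):
    return leadbyte(t, n, m) if m < 15 else leadbyte(t, n, 15) + varint(m)

def sequence_known_length(elements):
    """Format B: the element count is written up front."""
    return header(3, 0, len(elements)) + b"".join(elements)

def set_known_length(elements):
    return header(3, 1, len(elements)) + b"".join(elements)

def dictionary_known_length(pairs):
    """Format B: the count is twice the number of key/value pairs."""
    body = b"".join(k + v for k, v in pairs)
    return header(3, 2, len(pairs) * 2) + body

def sequence_streamed(elements: Iterable[bytes]) -> Iterator[bytes]:
    """Format C: emit elements as they arrive, bracketed by open/close bytes."""
    yield leadbyte(0, 2, 3 * 4 + 0)  # open(3,0)
    yield from elements
    yield leadbyte(0, 3, 3 * 4 + 0)  # close(3,0)

small = [bytes([0x10 + i]) for i in (1, 2, 3, 4)]  # Reprs of 1, 2, 3, 4
print(sequence_known_length(small).hex())              # c411121314
print(b"".join(sequence_streamed(iter(small))).hex())  # 2c111213143c
```

Neither function sorts its input, in line with the rule above that elements and pairs may appear in any order. An implementation offering canonical output would sort before calling the known-length encoders.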
### SignedIntegers

Format B/A (known length/fixed-size):

    [[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x)  if x<-3 ∨ 13≤x
                                     leadbyte(0,1,x+16)            if -3≤x<0
                                     leadbyte(0,1,x)               if 0≤x<13

Integers in the range [-3,12] are compactly represented using format A, as a lead byte alone, because they are so frequently used. Other integers are represented using format B. Format C *MUST NOT* be used for `SignedInteger`s.

The function `intbytes(x)` gives the big-endian two's-complement binary representation of `x`, taking exactly as many whole bytes as needed to unambiguously identify the value and its sign, and `m = |intbytes(x)|`. The most-significant bit in the first byte in `intbytes(x)` is the sign bit.[^zero-intbytes]

  [^zero-intbytes]: The value 0 needs zero bytes to identify the value, so `intbytes(0)` is the empty byte string. Non-zero values need at least one byte.

For example,

    [[ -257 ]] = 42 FE FF    [[ -3 ]] = 1D       [[ 128 ]]    = 42 00 80
    [[ -256 ]] = 42 FF 00    [[ -2 ]] = 1E       [[ 255 ]]    = 42 00 FF
    [[ -255 ]] = 42 FF 01    [[ -1 ]] = 1F       [[ 256 ]]    = 42 01 00
    [[ -254 ]] = 42 FF 02    [[ 0 ]]  = 10       [[ 32767 ]]  = 42 7F FF
    [[ -129 ]] = 42 FF 7F    [[ 1 ]]  = 11       [[ 32768 ]]  = 43 00 80 00
    [[ -128 ]] = 41 80       [[ 12 ]] = 1C       [[ 65535 ]]  = 43 00 FF FF
    [[ -127 ]] = 41 81       [[ 13 ]] = 41 0D    [[ 65536 ]]  = 43 01 00 00
    [[ -4 ]]   = 41 FC       [[ 127 ]] = 41 7F   [[ 131072 ]] = 43 02 00 00

### Strings, ByteStrings and Symbols

Syntax for these three types varies only in the value of `n` supplied to `header`, `open`, and `close`. In each case, the payload following the header is a binary sequence; for `String` and `Symbol`, it is a UTF-8 encoding of the `Value`'s code points, while for `ByteString` it is the raw data contained within the `Value` unmodified.
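Before moving on to the string formats, the `SignedInteger` rules given earlier can be sketched in Python as follows. This is an illustrative reading, not a reference implementation: the function names are assumptions, and the small-integer cases emit the lead byte alone, which is what the example `[[ -1 ]] = 1F` indicates.

```python
def varint(m: int) -> bytes:
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)

def header(t: int, n: int, m: int) -> bytes:
    lead = t * 64 + n * 16
    return bytes([lead + m]) if m < 15 else bytes([lead + 15]) + varint(m)

def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement form; empty for zero."""
    if x == 0:
        return b""
    # Bits needed for the magnitude, plus one bit for the sign.
    magnitude_bits = (x if x >= 0 else ~x).bit_length()
    length = (magnitude_bits + 1 + 7) // 8
    return x.to_bytes(length, "big", signed=True)

def encode_integer(x: int) -> bytes:
    if -3 <= x < 0:
        return bytes([16 + x + 16])  # lead byte with t=0, n=1, m=x+16
    if 0 <= x < 13:
        return bytes([16 + x])       # lead byte with t=0, n=1, m=x
    payload = intbytes(x)
    return header(1, 0, len(payload)) + payload

for value in (-257, -128, -1, 0, 12, 13, 128, 32768):
    print(value, encode_integer(value).hex())
```

Because Python integers are arbitrary-precision, this sketch also handles values wider than a machine word; `encode_integer(2**200)` produces a 26-byte payload preceded by a lead byte and a varint length.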
Format B (known length): [[ S ]] = header(1,n,m) ++ encode(S) where m = |encode(S)| and (n,encode(S)) = (1,utf8(S)) if S ∈ String (2,S) if S ∈ ByteString (3,utf8(S)) if S ∈ Symbol To stream a `String`, `ByteString` or `Symbol`, emit `open(1,n)` and then a sequence of zero or more format B chunks, followed by `close(1,n)`. Every chunk must be a `ByteString`. While the overall content of a streamed `String` or `Symbol` must be valid UTF-8, individual chunks do not have to conform to UTF-8. ### Fixed-length Atoms Fixed-length atoms all use format A, and do not have a length representation. They repurpose the bits that format B `Repr`s use to specify lengths. Applications *MUST NOT* use format C with `open(0,n)` or `close(0,n)` for any `n`. #### Booleans [[ #false ]] = header(0,0,0) = [0x00] [[ #true ]] = header(0,0,1) = [0x01] #### Floats and Doubles [[ F ]] when F ∈ Float = header(0,0,2) ++ binary32(F) [[ D ]] when D ∈ Double = header(0,0,3) ++ binary64(D) The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and 8-byte IEEE 754 binary representations of `F` and `D`, respectively. ## Examples ### Simple examples For the following examples, imagine an application that maps `Record` short form label number 0 to label `discard`, 1 to `capture`, and 2 to `observe`. 
| Value                                             | Encoded hexadecimal byte sequence                                    |
|---------------------------------------------------|----------------------------------------------------------------------|
| `capture(discard())`                              | 91 80                                                                |
| `observe(speak(discard(), capture(discard())))`   | A1 B3 75 73 70 65 61 6B 80 91 80                                     |
| `[1 2 3 4]` (format B)                            | C4 11 12 13 14                                                       |
| `[1 2 3 4]` (format C)                            | 2C 11 12 13 14 3C                                                    |
| `[-2 -1 0 1]`                                     | C4 1E 1F 10 11                                                       |
| `"hello"` (format B)                              | 55 68 65 6C 6C 6F                                                    |
| `"hello"` (format C, 2 chunks)                    | 25 62 68 65 63 6C 6C 6F 35                                           |
| `"hello"` (format C, 5 chunks)                    | 25 62 68 65 62 6C 6C 60 60 61 6F 35                                  |
| `["hello" there #"world" [] #set{} #true #false]` | C7 55 68 65 6C 6C 6F 75 74 68 65 72 65 65 77 6F 72 6C 64 C0 D0 01 00 |
| `-257`                                            | 42 FE FF                                                             |
| `-1`                                              | 1F                                                                   |
| `0`                                               | 10                                                                   |
| `1`                                               | 11                                                                   |
| `255`                                             | 42 00 FF                                                             |
| `1.0f`                                            | 02 3F 80 00 00                                                       |
| `1.0`                                             | 03 3F F0 00 00 00 00 00 00                                           |
| `-1.202e300`                                      | 03 FE 3C B7 B7 59 BF 04 26                                           |

The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record`

    [titled person 2 thing 1](101, "Blackwell", date(1821 2 3), "Dr")

encodes to

    B5                              ;; Record, generic, 4+1
      C5                            ;; Sequence, 5
        76 74 69 74 6C 65 64        ;; Symbol, "titled"
        76 70 65 72 73 6F 6E        ;; Symbol, "person"
        12                          ;; SignedInteger, "2"
        75 74 68 69 6E 67           ;; Symbol, "thing"
        11                          ;; SignedInteger, "1"
      41 65                         ;; SignedInteger, "101"
      59 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell"
      B4                            ;; Record, generic, 3+1
        74 64 61 74 65              ;; Symbol, "date"
        42 07 1D                    ;; SignedInteger, "1821"
        12                          ;; SignedInteger, "2"
        13                          ;; SignedInteger, "3"
      52 44 72                      ;; String, "Dr"

  [^extensibility2]: It happens to line up with Racket's representation of a record label for an inheritance hierarchy where `titled` extends `person` extends `thing`:

        (struct date (year month day) #:prefab)
        (struct thing (id) #:prefab)
        (struct person thing (name date-of-birth) #:prefab)
        (struct titled person (title) #:prefab)

    For more detail on Racket's representations of record labels, see [the Racket documentation for
`make-prefab-struct`](http://docs.racket-lang.org/reference/structutils.html#%28def._%28%28quote._~23~25kernel%29._make-prefab-struct%29%29). --- ### JSON examples The examples from [RFC 8259](https://tools.ietf.org/html/rfc8259#section-13) read as valid Preserves, though the JSON literals `true`, `false` and `null` read as `Symbol`s. The first example: { "Image": { "Width": 800, "Height": 600, "Title": "View from 15th Floor", "Thumbnail": { "Url": "http://www.example.com/image/481989943", "Height": 125, "Width": 100 }, "Animated" : false, "IDs": [116, 943, 234, 38793] } } encodes to binary as follows: E2 55 "Image" EC 55 "Width" 42 03 20 55 "Title" 5F 14 "View from 15th Floor" 58 "Animated" 75 "false" 56 "Height" 42 02 58 59 "Thumbnail" E6 55 "Width" 41 64 53 "Url" 5F 26 "http://www.example.com/image/481989943" 56 "Height" 41 7D 53 "IDs" C4 41 74 42 03 AF 42 00 EA 43 00 97 89 and the second example: [ { "precision": "zip", "Latitude": 37.7668, "Longitude": -122.3959, "Address": "", "City": "SAN FRANCISCO", "State": "CA", "Zip": "94107", "Country": "US" }, { "precision": "zip", "Latitude": 37.371991, "Longitude": -122.026020, "Address": "", "City": "SUNNYVALE", "State": "CA", "Zip": "94085", "Country": "US" } ] encodes to binary as follows: C2 EF 10 59 "precision" 53 "zip" 58 "Latitude" 03 40 42 E2 26 80 9D 49 52 59 "Longitude" 03 C0 5E 99 56 6C F4 1F 21 57 "Address" 50 54 "City" 5D "SAN FRANCISCO" 55 "State" 52 "CA" 53 "Zip" 55 "94107" 57 "Country" 52 "US" EF 10 59 "precision" 53 "zip" 58 "Latitude" 03 40 42 AF 9D 66 AD B4 03 59 "Longitude" 03 C0 5E 81 AA 4F CA 42 AF 57 "Address" 50 54 "City" 59 "SUNNYVALE" 55 "State" 52 "CA" 53 "Zip" 55 "94085" 57 "Country" 52 "US" ## Conventions for Common Data Types The `Value` data type is essentially an S-Expression, able to represent semi-structured data over `ByteString`, `String`, `SignedInteger` atoms and so on.[^why-not-spki-sexps] [^why-not-spki-sexps]: Rivest's S-Expressions are in many ways similar to Preserves. 
However, while they include binary data and sequences, and an obvious equivalence for them exists, they lack numbers *per se* as well as any kind of unordered structure such as sets or maps. In addition, while "display hints" allow labelling of binary data with an intended interpretation, they cannot be attached to any other kind of structure, and the "hint" itself can only be a binary blob. However, users need a wide variety of data types for representing domain-specific values such as various kinds of encoded and normalized text, calendrical values, machine words, and so on. Appropriately-labelled `Record`s denote these domain-specific data types.[^why-dictionaries] [^why-dictionaries]: Given `Record`'s existence, it may seem odd that `Dictionary`, `Set`, `Float`, etc. are given special treatment. Preserves aims to offer a useful basic equivalence predicate to programmers, and so if a data type demands a special equivalence predicate, as `Dictionary`, `Set` and `Float` all do, then the type should be included in the base language. Otherwise, it can be represented as a `Record` and treated separately. Both `Boolean` and `String` are seeming exceptions: they merit inclusion because of their cultural importance. All of these conventions are optional. They form a layer atop the core `Value` structure. Non-domain-specific tools do not in general need to treat them specially. **Validity.** Many of the labels we will describe in this section come with side-conditions on the contents of labelled `Record`s. It is possible to construct an instance of `Value` that violates these side-conditions without ceasing to be a `Value` or becoming unrepresentable. However, we say that such a `Value` is *invalid* because it fails to honour the necessary side-conditions. Implementations *SHOULD* allow two modes of working: one which treats all `Value`s identically, without regard for side-conditions, and one which enforces validity (i.e. 
side-conditions) when reading, writing, or constructing `Value`s.

### MIME-type tagged binary data

Many internet protocols use [media types](https://tools.ietf.org/html/rfc6838) (a.k.a. MIME types) to indicate the format of some associated binary data. For this purpose, we define `MIMEData` to be a record labelled `mime` with two fields, the first being a `Symbol`, the media type, and the second being a `ByteString`, the binary data.

While each media type may define its own rules for comparing documents, we define ordering among `MIMEData` *representations* of such media types following the general rules for ordering of `Record`s.

**Examples.**

| Value | Encoded hexadecimal byte sequence |
|-------|-----------------------------------|
| `mime(application/octet-stream #"abcde")` | B3 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
| `mime(text/plain #"ABC")` | B3 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 |
| `mime(application/xml #"<xhtml/>")` | B3 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E |
| `mime(text/csv #"123,234,345")` | B3 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 |

Applications making heavy use of `mime` records may choose to use a short form label number for the record type. For example, if short form label number 1 were chosen, the second example above, `mime(text/plain #"ABC")`, would be encoded with "92" in place of "B3 74 6D 69 6D 65".

### Unicode normalization forms

Unicode defines multiple [normalization forms](http://unicode.org/reports/tr15/) for text. While no particular normalization form is required for `String`s, users may need to unambiguously signal or require a particular normalization form.
A `NormalizedString` is a `Record` labelled with `unicode-normalization` and having two fields, the first of which is a `Symbol` specifying the normalization form used (e.g. `nfc`, `nfd`, `nfkc`, `nfkd`), and the second of which is a `String` whose underlying code point representation *MUST* be normalized according to the named normalization form.

### IRIs (URIs, URLs, URNs, etc.)

An `IRI` is a `Record` labelled with `iri` and having one field, a `String` which is the IRI itself and which *MUST* be a valid absolute or relative IRI.

### Machine words

The definition of `SignedInteger` captures all integers. However, in certain circumstances it can be valuable to assert that a number inhabits a particular range, such as a fixed-width machine word.

A family of labels `i`*n* and `u`*n* for *n* ∈ {8,16,32,64} denote *n*-bit-wide signed and unsigned range restrictions, respectively. Records with these labels *MUST* have one field, a `SignedInteger`, which *MUST* fall within the appropriate range. That is, to be valid,

 - in `i8(`*x*`)`, -128 <= *x* <= 127.
 - in `u8(`*x*`)`, 0 <= *x* <= 255.
 - in `i16(`*x*`)`, -32768 <= *x* <= 32767.
 - etc.

### Anonymous Tuples and Unit

A `Tuple` is a `Record` with label `tuple` and zero or more fields, denoting an anonymous tuple of values.

The 0-ary tuple, `tuple()`, denotes the empty tuple, sometimes called "unit" or "void" (but *not* e.g. JavaScript's "undefined" value).

### Null and Undefined

Tony Hoare's "[billion-dollar mistake](https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retractions)" can be represented with the 0-ary `Record` `null()`. An "undefined" value can be represented as `undefined()`.

### Dates and Times

Dates, times, moments, and timestamps can be represented with a `Record` with label `rfc3339` having a single field, a `String`, which *MUST* conform to one of the `full-date`, `partial-time`, `full-time`, or `date-time` productions of [section 5.6 of RFC 3339](https://tools.ietf.org/html/rfc3339#section-5.6).
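The side-conditions attached to these conventions can be checked mechanically. As a minimal sketch of the "validity-enforcing" mode for the machine-word labels only, the Python below assumes a hypothetical `Record` class whose label is a plain string standing in for a `Symbol`; neither the class nor the function names come from this specification.

```python
import re
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a Preserves Record: a label plus fields."""
    label: str
    fields: Tuple[Any, ...]


# Labels i8, i16, i32, i64, u8, u16, u32, u64.
WORD_LABEL = re.compile(r"([iu])(8|16|32|64)")


def word_range(label: str) -> Optional[Tuple[int, int]]:
    """Return the inclusive (lo, hi) range for a machine-word label, else None."""
    m = WORD_LABEL.fullmatch(label)
    if m is None:
        return None
    bits = int(m.group(2))
    if m.group(1) == "i":
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def is_valid_word(rec: Record) -> bool:
    """Check the side-conditions of an i<n>/u<n> record.

    Records with any other label are not this function's concern and pass.
    """
    bounds = word_range(rec.label)
    if bounds is None:
        return True
    if len(rec.fields) != 1:
        return False  # exactly one field is required
    (x,) = rec.fields
    # bool is a subclass of int in Python, but a Boolean is not a SignedInteger.
    if isinstance(x, bool) or not isinstance(x, int):
        return False
    lo, hi = bounds
    return lo <= x <= hi


print(is_valid_word(Record("i8", (127,))))  # True
print(is_valid_word(Record("i8", (128,))))  # False
```

A reader or writer running in the permissive mode would simply skip this check; the `Record` remains a perfectly good `Value` either way.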
## Security Considerations

**Empty chunks.** Streamed (format C) `String`s, `ByteString`s and `Symbol`s may include chunks of zero length. This opens up a possibility for denial-of-service: an attacker may begin streaming a string, sending an endless sequence of zero length chunks, appearing to make progress but not actually doing so. Implementations may place optional reasonable restrictions on the number of consecutive empty chunks that may appear in a stream, and may even supply an optional mode that rejects empty chunks entirely.

**Whitespace.** Similarly, the textual format for `Value`s allows arbitrary whitespace in many positions. In streaming transfer situations, consider optional restrictions on the amount of consecutive whitespace and comments that may appear in a serialized `Value`.

**Canonical form for cryptographic hashing and signing.** As specified, neither the textual nor the compact binary encoding rules for `Value`s force canonical serializations. Two serializations of the same `Value` may yield different binary `Repr`s.

## Appendix. Table of lead byte values

     00 - False
     01 - True
     02 - Float
     03 - Double
    (0x)  RESERVED 04-0F
     1x - Small integers 0..12,-3..-1
     2x - Start Stream
     3x - End Stream
     4x - SignedInteger
     5x - String
     6x - ByteString
     7x - Symbol
     8x - short form Record label index 0
     9x - short form Record label index 1
     Ax - short form Record label index 2
     Bx - Record
     Cx - Sequence
     Dx - Set
     Ex - Dictionary
    (Fx)  RESERVED F0-FF

## Appendix. Bit fields within lead byte values

     tt nn mmmm  contents
     ----------  ---------

     00 00 0000  False
     00 00 0001  True
     00 00 0010  Float, 32 bits big-endian binary
     00 00 0011  Double, 64 bits big-endian binary

     00 01 xxxx  Small integers 0..12,-3..-1

     00 10 ttnn  Start Stream
                   When tt = 00 --> error
                             01 --> each chunk is a ByteString
                             1x --> each chunk is a single encoded Value
     00 11 ttnn  End Stream (must match preceding Start Stream)

     01 00 mmmm  SignedInteger, big-endian binary
     01 01 mmmm  String, UTF-8 binary
     01 10 mmmm  ByteString
     01 11 mmmm  Symbol, UTF-8 binary

     10 00 mmmm  application-specific Record
     10 01 mmmm  application-specific Record
     10 10 mmmm  application-specific Record
     10 11 mmmm  Record

     11 00 mmmm  Sequence
     11 01 mmmm  Set
     11 10 mmmm  Dictionary

If mmmm = 1111, a varint(m) follows, giving the length, before the body; otherwise, m is the length of the body to follow.

## Appendix. Representing Values in Programming Languages

We have given a definition of `Value` and its semantics, and proposed a concrete syntax for communicating and storing `Value`s. We now turn to **suggested** representations of `Value`s as *programming-language values* for various programming languages.

When designing a language mapping, an important consideration is roundtripping: serialization after deserialization, and vice versa, should both be identities.
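To make the roundtripping requirement concrete, here is a rough Python sketch that encodes and decodes `SignedInteger`s only. It is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical; the varint is assumed to be the base-128 format that this document links to; the mapping of the `1x` lead bytes (0..12 as `10`..`1C`, -3..-1 as `1D`..`1F`) is read off the lead-byte table; and always emitting the shortest two's-complement body is an assumption that happens to match the JSON example encodings (for instance `42 00 EA` for 234).

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint, least significant group first (assumed format)."""
    out = bytearray()
    while True:
        group = n & 0x7F
        n >>= 7
        if n:
            out.append(group | 0x80)  # more groups follow
        else:
            out.append(group)
            return bytes(out)


def encode_int(x: int) -> bytes:
    """Encode a SignedInteger using lead bytes 1x (small) or 4x (general)."""
    if -3 <= x <= 12:
        # Assumed layout: 0..12 -> 0x10..0x1C, -3..-1 -> 0x1D..0x1F.
        return bytes([0x10 | (x & 0x0F)])
    # Shortest big-endian two's-complement body that can hold x.
    n = ((x if x >= 0 else ~x).bit_length() + 8) // 8
    body = x.to_bytes(n, "big", signed=True)
    if n < 15:
        return bytes([0x40 | n]) + body
    # mmmm = 1111: the real length follows as a varint.
    return bytes([0x4F]) + encode_varint(n) + body


def decode_int(data: bytes) -> int:
    """Decode exactly one SignedInteger occupying all of `data`."""
    lead = data[0]
    if lead >> 4 == 0x1:
        v = lead & 0x0F
        return v if v <= 12 else v - 16
    if lead >> 4 != 0x4:
        raise ValueError("not a SignedInteger")
    n, pos = lead & 0x0F, 1
    if n == 15:
        n, shift = 0, 0
        while True:
            b = data[pos]
            pos += 1
            n |= (b & 0x7F) << shift
            shift += 7
            if not b & 0x80:
                break
    body = data[pos:pos + n]
    if len(body) != n or pos + n != len(data):
        raise ValueError("length mismatch")
    return int.from_bytes(body, "big", signed=True)


# Roundtrip in both directions for a few sample values.
for value in (800, 234, 38793, -1, 12, 2**200):
    encoded = encode_int(value)
    assert decode_int(encoded) == value
    assert encode_int(decode_int(encoded)) == encoded

print(encode_int(234).hex())  # 4200ea
```

Note that the second identity (encode after decode) only holds for byte strings this encoder itself would produce; a decoder that also accepts non-minimal bodies would need a canonicalization step to restore it.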
### JavaScript

 - `Boolean` ↔ `Boolean`
 - `Float` and `Double` ↔ numbers
 - `SignedInteger` ↔ numbers or `BigInt` (see [here](https://developers.google.com/web/updates/2018/05/bigint) and [here](https://github.com/tc39/proposal-bigint))
 - `String` ↔ strings
 - `ByteString` ↔ `Uint8Array`
 - `Symbol` ↔ `Symbol.for(...)`
 - `Record` ↔ `{ "_label": theLabel, "_fields": [field0, ..., fieldN] }`, plus convenience accessors
    - `undefined()` ↔ the undefined value
    - `rfc3339(F)` ↔ `Date`, if `F` matches the `date-time` RFC 3339 production
 - `Sequence` ↔ `Array`
 - `Set` ↔ `{ "_set": M }` where `M` is a `Map` from the elements of the set to `true`
 - `Dictionary` ↔ a `Map`

### Scheme/Racket

 - `Boolean` ↔ booleans
 - `Float` and `Double` ↔ inexact numbers (Racket: single- and double-precision floats)
 - `SignedInteger` ↔ exact numbers
 - `String` ↔ strings
 - `ByteString` ↔ byte vector (Racket: "Bytes")
 - `Symbol` ↔ symbols
 - `Record` ↔ structures (Racket: prefab struct)
 - `Sequence` ↔ lists
 - `Set` ↔ Racket: sets
 - `Dictionary` ↔ Racket: hash-table

### Java

 - `Boolean` ↔ `Boolean`
 - `Float` and `Double` ↔ `Float` and `Double`
 - `SignedInteger` ↔ `Integer`, `Long`, `BigInteger`
 - `String` ↔ `String`
 - `ByteString` ↔ `byte[]`
 - `Symbol` ↔ a simple data class wrapping a `String`
 - `Record` ↔ in a simple implementation, a generic `Record` class; else perhaps a bean mapping?
    - `mime(T B)` ↔ an implementation of `javax.activation.DataSource`?
 - `Sequence` ↔ an implementation of `java.util.List`
 - `Set` ↔ an implementation of `java.util.Set`
 - `Dictionary` ↔ an implementation of `java.util.Map`

### Erlang

 - `Boolean` ↔ `true` and `false`
 - `Float` and `Double` ↔ floats (unsure how Erlang deals with single-precision)
 - `SignedInteger` ↔ integers
 - `String` ↔ pair of `utf8` and a binary
 - `ByteString` ↔ a binary
 - `Symbol` ↔ pair of `atom` and a binary
 - `Record` ↔ triple of `obj`, label, and field list
 - `Sequence` ↔ a list
 - `Set` ↔ a `sets` set
 - `Dictionary` ↔ a [map][erlang-map] (new in Erlang/OTP R17)

This is a somewhat unsatisfactory mapping because: (a) Erlang doesn't garbage-collect its atoms, meaning that (a.1) representing `Symbol`s as atoms could lead to denial-of-service and (a.2) representing `Symbol`-labelled `Record`s as Erlang records must be rejected for the same reason; (b) even if it did, Erlang's boolean values are atoms, which would then clash with the `Symbol`s `true` and `false`; and (c) Erlang has no distinct string type, making for a trilemma where `String`s are in danger of clashing with `ByteString`s, `Sequence`s, or `Record`s.

### Python

 - `Boolean` ↔ `True` and `False`
 - `Float` ↔ a `Float` wrapper-class for a double-precision value
 - `Double` ↔ float
 - `SignedInteger` ↔ int and long
 - `String` ↔ `unicode`
 - `ByteString` ↔ `bytes`
 - `Symbol` ↔ a simple data class wrapping a `unicode`
 - `Record` ↔ something like `namedtuple`, but that doesn't care about class identity?
 - `Sequence` ↔ `tuple` (but accept `list` during encoding)
 - `Set` ↔ `frozenset` (but accept `set` during encoding)
 - `Dictionary` ↔ a hashable (immutable) dictionary-like thing (but accept `dict` during encoding)

### Squeak Smalltalk

 - `Boolean` ↔ `true` and `false`
 - `Float` ↔ perhaps a subclass of `Float`?
 - `Double` ↔ `Float`
 - `SignedInteger` ↔ `Integer`
 - `String` ↔ `WideString`
 - `ByteString` ↔ `ByteArray`
 - `Symbol` ↔ `WideSymbol`
 - `Record` ↔ a simple data class
 - `Sequence` ↔ `ArrayedCollection` (usually `OrderedCollection`)
 - `Set` ↔ `Set`
 - `Dictionary` ↔ `Dictionary`

## Appendix. Why not Just Use JSON?

JSON offers *syntax* for numbers, strings, booleans, null, arrays and string-keyed maps. However, it suffers from two major problems. First, it offers no *semantics* for the syntax: it is left to each implementation to determine how to treat each JSON term. This causes [interoperability](http://seriot.ch/parsing_json.php) and even [security](http://web.archive.org/web/20180906202559/http://docs.couchdb.org/en/stable/cve/2017-12635.html) issues. Second, JSON's lack of support for type tags leads to awkward and incompatible *encodings* of type information in terms of the fixed suite of constructors on offer.

There are other minor problems with JSON having to do with its syntax. Examples include its relative verbosity and its lack of support for binary data.

### JSON syntax doesn't *mean* anything

When are two JSON values the same? When are they different?

The specifications are largely silent on these questions. Different JSON implementations give different answers.
Specifically, JSON does not:

 - assign any meaning to numbers,[^meaning-ieee-double]
 - determine how strings are to be compared,[^string-key-comparison]
 - determine whether object key ordering is significant,[^json-member-ordering] or
 - determine whether duplicate object keys are permitted, what it would mean if they were, or how to determine a duplicate in the first place.[^json-key-uniqueness]

In short, JSON syntax doesn't *denote* anything.[^xml-infoset] [^other-formats]

  [^meaning-ieee-double]: [Section 6 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-6) does go so far as to indicate “good interoperability can be achieved” by imagining that parsers are able reliably to understand the syntax of numbers as denoting an IEEE 754 double-precision floating-point value.

  [^string-key-comparison]: [Section 8.3 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-8.3) suggests that *if* an implementation compares strings used as object keys “code unit by code unit”, then it will interoperate with *other such implementations*, but neither requires this behaviour nor discusses comparisons of strings used in other contexts.

  [^json-member-ordering]: [Section 4 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-4) remarks that “[implementations] differ as to whether or not they make the ordering of object members visible to calling software.”

  [^json-key-uniqueness]: [Section 4 of RFC 8259](https://tools.ietf.org/html/rfc8259#section-4) is the only place in the specification that mentions the issue. It explicitly sanctions implementations supporting duplicate keys, noting only that “when the names within an object are not unique, the behavior of software that receives such an object is unpredictable.” Implementations are free to choose any behaviour at all in this situation, including signalling an error, or discarding all but one of a set of duplicates.

  [^xml-infoset]: The XML world has the concept of [XML infoset](https://www.w3.org/TR/xml-infoset/).
Loosely speaking, XML infoset is the *denotation* of an XML document; the *meaning* of the document.

  [^other-formats]: Most other recent data languages are like JSON in specifying only a syntax with no associated semantics. While some do make a sketch of a semantics, the result is often underspecified (e.g. in terms of how strings are to be compared), overly machine-oriented (e.g. treating 32-bit integers as fundamentally distinct from 64-bit integers and from floating-point numbers), overly fine (e.g. giving visibility to the order in which map entries are written), or all three.

Some examples:

 - are the JSON values `1`, `1.0`, and `1e0` the same or different?
 - are the JSON values `1.0` and `1.0000000000000001` the same or different?
 - are the JSON strings `"päron"` (UTF-8 `70c3a4726f6e`) and `"päron"` (UTF-8 `7061cc88726f6e`) the same or different?
 - are the JSON objects `{"a":1, "b":2}` and `{"b":2, "a":1}` the same or different?
 - which, if any, of `{"a":1, "a":2}`, `{"a":1}` and `{"a":2}` are the same? Are all three legal?
 - are `{"päron":1}` and `{"päron":1}` the same or different?

### JSON can multiply nicely, but it can't add very well

JSON includes a fixed set of types: numbers, strings, booleans, null, arrays and string-keyed maps. Domain-specific data must be *encoded* into these types. For example, dates and email addresses are often represented as strings with an implicit internal structure.

There is no convention for *labelling* a value as belonging to a particular category. This makes it difficult to extract, say, all email addresses, or all URLs, from an arbitrary JSON document.

Instead, JSON-encoded data are often labelled in an ad-hoc way. Multiple incompatible approaches exist.
For example, a "money" structure containing a `currency` field and an `amount` may be represented in any number of ways:

    { "_type": "money", "currency": "EUR", "amount": 10 }
    { "type": "money", "value": { "currency": "EUR", "amount": 10 } }
    [ "money", { "currency": "EUR", "amount": 10 } ]
    { "@money": { "currency": "EUR", "amount": 10 } }

This causes particular problems when JSON is used to represent *sum* or *union* types, such as "either a value or an error, but not both". Again, multiple incompatible approaches exist.

For example, imagine an API for depositing money in an account. The response might be either a "success" response indicating the new balance, or one of a set of possible errors.

Sometimes, a *pair* of values is used, with `null` marking the option not taken.[^interesting-failure-mode]

    { "ok": { "balance": 210 }, "error": null }
    { "ok": null, "error": "Unauthorized" }

  [^interesting-failure-mode]: What is the meaning of a document where both `ok` and `error` are non-null? What might happen when a program is presented with such a document?

The branch not chosen is sometimes present, sometimes omitted as if it were an optional field:

    { "ok": { "balance": 210 } }
    { "error": "Unauthorized" }

Sometimes, an array of a label and a value is used:

    [ "ok", { "balance": 210 } ]
    [ "error", "Unauthorized" ]

Sometimes, the shape of the data is sufficient to distinguish among the alternatives, and the label is left implicit:

    { "balance": 210 }
    "Unauthorized"

JSON itself does not offer any guidance for which of these options to choose. In many real cases on the web, poor choices have led to encodings that are irrecoverably ambiguous.

# Open questions

Q. Should "symbols" instead be URIs? Relative, usually; relative to what? Some domain-specific base URI?

Q. Literal small integers: are they pulling their weight? They're not absolutely necessary. They mess up the connection between value-ordering and repr-ordering!

Q.
Should we go for trying to make the data ordering line up with the encoding ordering? We'd have to only use streaming forms, and avoid the small integer encoding, and not store record arities, and sort sets and dictionaries, and mask floats and doubles (perhaps [like this](https://stackoverflow.com/questions/43299299/sorting-floating-point-values-using-their-byte-representation)), and I don't know what to do about SignedIntegers. Perhaps make them more like float formats, with the byte count acting as a kind of exponent underneath the sign bit.

  - Perhaps define separate additional canonicalization restrictions? Doesn't help the ordering, but does help the equivalence.

  - Canonicalization and early-bailout-equivalence-checking are in tension with support for streaming values.

Q. The postfix fields in the textual syntax come unannounced: "oh, and another thing, what you just read is a label, and here are some fields." This is a problem for interactive reading of textual syntax, because after a complete term, it needs to see the next character to tell whether it is an open-parenthesis or not! For this reason, I've disallowed whitespace between a label `Value` and the open-parenthesis of the fields. Is this reasonable??

## Notes