preserve.md based on codec.md which I'm about to check in

This commit is contained in:
Tony Garnock-Jones 2018-09-23 14:37:20 +01:00
parent f6ab8320c5
commit c23285781c
1 changed files with 982 additions and 0 deletions

syndicate/mc/preserve.md Normal file

@ -0,0 +1,982 @@
---
---
<style>
body { padding-top: 2rem; font-family: palatino, "Palatino Linotype", "Palatino LT STD", "URW Palladio L", serif; max-width: 40em; margin: auto; font-size: 120%; }
h1, h2, h3, h4, h5, h6 { margin-left: -1rem; color: #4f81bd; }
h2 { border-bottom: solid #4f81bd 1px; }
pre, code { background-color: #eee; }
pre { padding: 0.33rem; }
</style>
# Preserves: Semantic Serialization of Node-labelled Data
      _________
     <_________>     Tony Garnock-Jones <tonyg@leastfixedpoint.com>
      | FRμIT |      September 2018
     |Preserves|     Version 0.0.2
     \_________/
 
[sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
[spki]: http://world.std.com/~cme/html/spki.html
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
[erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
## Introduction
Most data serialization formats used on the web represent
*edge-labelled* semi-structured data.
This document proposes a data model and serialization format that
takes a *node-labelled* approach.
This makes it both extensible and much more like S-expressions, making
it easily able to represent the *labelled sums of products* as seen in
Rust, Haskell, OCaml, and other functional programming languages.
## Starting with Semantics
Taking inspiration from functional programming, we start with a
definition of the *values* that we want to work with and give them
meaning independent of their syntax. We will treat syntax separately,
later in this document.
                          Value = Atom
                                | Compound

                           Atom = SignedInteger
                                | String
                                | ByteString
                                | Symbol
                                | Boolean
                                | Float
                                | Double
                                | MIMEData

                       Compound = Record
                                | Sequence
                                | Set
                                | Dictionary
Our `Value`s fall into two broad categories: *atomic* and *compound*
data.[^zephyr-asdl]
[^zephyr-asdl]: This design was loosely inspired by S-expressions,
as seen in Lisp, Scheme, [SPKI/SDSI][sexp.txt], and many others,
and by the ML type system, as seen in languages such as SML,
OCaml, Haskell, Rust, and many others. It is also related to
Zephyr ASDL (h/t
[Darius Bacon](https://twitter.com/abecedarius/status/993545767884226561)),
which doesn't offer much in the way of atoms, but offers
general-purpose labelled sums and products. See D. C. Wang, A. W.
Appel, J. L. Korn, and C. S. Serra, “The Zephyr Abstract Syntax
Description Language,” in USENIX Conference on Domain-Specific
Languages, 1997, pp. 213–228.
[PDF available.](https://www.usenix.org/legacy/publications/library/proceedings/dsl97/full_papers/wang/wang.pdf)
**Total order.**<a name="total-order"></a> As we go, we will
incrementally specify a total order over `Value`s. Two values of the
same kind are compared using kind-specific rules. The ordering among
values of different kinds is essentially arbitrary, but having a total
order is convenient for many tasks, so we define it as
follows:[^ordering-by-syntax]
    (Values)        Compound < Atom
    (Compounds)     Record < Sequence < Set < Dictionary
    (Atoms)         SignedInteger < String < ByteString < Symbol
                    < Boolean < Float < Double < MIMEData
[^ordering-by-syntax]: The observant reader may note that the
ordering here is the same as that implied by the tagging scheme
used in the concrete binary syntax for `Value`s.
**Equivalence.**<a name="equivalence"></a> Two `Value`s are equal if
neither is less than the other according to the total order.
<!-- We should avoid unnecessary restrictions such as machine-oriented -->
<!-- fixed-width integer or floating-point values where possible. -->
### Signed integers.
A `SignedInteger` is a signed integer of arbitrary width.
`SignedInteger`s are compared as mathematical integers. We will write
examples of `SignedInteger`s using standard mathematical notation.
**Examples.** 10; -6; 0.
**Non-examples.** NaN (the clue is in the name!); ∞ (not finite); 0.2
(not an integer); 1/7 (likewise); 2+*i*3 (likewise); √2 (likewise).
### Unicode strings.
A `String` is a sequence of Unicode
[code-point](http://www.unicode.org/glossary/#code_point)s. Two
`String`s are compared lexicographically, code-point by
code-point.[^utf8-is-awesome] We will write examples of `String`s as
text surrounded by double-quotes “`"`” using a monospace font.
[^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
gives the same result as a lexicographic byte-by-byte comparison
of the UTF-8 encoding of a string!
**Examples.** `"Hello world"`, an eleven-code-point string; `"z水𝄞"`,
the string containing the three Unicode code-points `z` (0x7A), `水`
(0x6C34) and `𝄞` (0x1D11E); `""`, the empty string.
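The claim that code-point order and UTF-8 byte order coincide can be checked with a small experiment. The following is only a sketch in Python, which happens to compare its `str` values code point by code point; the sample strings are arbitrary choices of ours, not part of this specification.

```python
# Sample strings spanning 1-, 2-, 3- and 4-byte UTF-8 sequences.
samples = ["z水𝄞", "", "Hello world", "z", "水", "𝄞", "\uffff", "é"]

# Python orders str values lexicographically by code point.
by_code_point = sorted(samples)

# Order the same strings by the bytes of their UTF-8 encodings instead.
by_utf8_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))

# Big-endian UTF-16 bytes compare the same way as UTF-16 code units do.
by_utf16_units = sorted(samples, key=lambda s: s.encode("utf-16-be"))

# UTF-8 preserves code-point order; UTF-16 code units do not, because
# surrogate pairs (used for U+1D11E) sort below U+FFFF.
print(by_code_point == by_utf8_bytes)
print(by_code_point == by_utf16_units)
```

The first comparison holds and the second does not, which is why an implementation whose native strings are UTF-16 needs some care when comparing `String`s.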
**Normalization forms.** Unicode defines multiple
[normalization forms](http://unicode.org/reports/tr15/) for text. No
particular normalization form is required for `String`s;
[see below](#normalization-forms).
### Binary data.
A `ByteString` is an ordered sequence of zero or more integers in the
inclusive range [0..255]. `ByteString`s are compared
lexicographically, byte by byte. We will only write examples of
`ByteString`s that contain bytes mapping to printable ASCII
characters, using “`#"`” as an opening quote mark and “`"`” as a
closing quote mark.
**Examples.** The `ByteString` containing the integers 65, 66 and 67
(corresponding to ASCII characters `A`, `B` and `C`) is written as
`#"ABC"`. The empty `ByteString` is written as `#""`. **N.B.** Despite
appearances, these are *binary* data.
### Symbols or identifiers.
Programming languages like Lisp and Prolog frequently use string-like
values called *symbols*. Here, a `Symbol` is, like a `String`, a
sequence of Unicode code-points, intended to represent an identifier
of some kind. `Symbol`s are also compared lexicographically by
code-point. We will write examples including only non-empty sequences
of non-whitespace characters, using a monospace font without quotation
marks.
**Examples.** `hello-world`; `utf8-string`; `exact-integer?`.
### Booleans.
There are exactly two `Boolean` values, “false” and “true”. The
“false” value compares less-than the “true” value. We write `#f` for
“false”, and `#t` for “true”.
**Examples.** `#f`; `#t`.
### IEEE floating-point values.
A `Float` is a single-precision IEEE 754 floating-point value; a
`Double` is a double-precision IEEE 754 floating-point value.
`Float`s, `Double`s and `SignedInteger`s are considered disjoint, and
so by the rules [above](#total-order), every `Float` is less than
every `Double`, and every `SignedInteger` is less than both. Two
`Float`s or two `Double`s are to be ordered by the `totalOrder`
predicate defined in section 5.10 of
[IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).
We write examples using standard mathematical notation, avoiding NaN
and infinities, using a suffix `f` or `d` to indicate `Float` or
`Double`, respectively.
**Examples.** 10f; -6d; 0f; 0.5d; -1.202e300d.
**Non-examples.** 10, -6, and 0, because writing them this way
indicates `SignedInteger`s, not `Float`s or `Double`s.
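One common way to realise the `totalOrder` predicate is to map each bit pattern to an unsigned integer key and compare keys. The Python sketch below does this for `Double`s; the function name `total_order_key` is our own, and a `Float` could presumably be handled the same way with the `">f"` and `">I"` formats and 32-bit masks.

```python
import struct

def total_order_key(x: float) -> int:
    """Map a double to an integer whose ordering follows IEEE 754 totalOrder."""
    # Reinterpret the 64-bit big-endian binary64 pattern as an unsigned integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    if bits >> 63:
        # Negative sign: flip every bit so larger magnitudes sort earlier.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Positive sign: set the top bit so positives sort after all negatives.
    return bits | 0x8000000000000000

# Unlike ordinary numeric comparison, -0.0 sorts strictly before +0.0.
values = [1.0, float("inf"), 0.0, -0.0, float("-inf"), -1.0]
ordered = sorted(values, key=total_order_key)
```

Note that ordinary `==` on floats treats `-0.0` and `0.0` as equal, whereas under `totalOrder` they are distinct `Value`s.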
### MIME-type tagged binary data.
A `MIMEData` is a pair of a `Symbol` denoting a
[media type](https://tools.ietf.org/html/rfc6838) and a `ByteString`
body, intended to be interpreted as an encoding of a document having
that media type. While each media type may define its own rules for
comparing documents, we define ordering among `MIMEData`
*representations* of such media types lexicographically over the
(`Symbol`, `ByteString`) pair. We write examples using the same syntax
as for byte strings, but with the media type `Symbol` sandwiched
between the “`#`” and the first “`"`”.
**Examples.** `#application/octet-stream""`; `#text/plain"ABC"`;
`#application/xml"<xhtml/>"`; `#text/csv"123,234,345"`.
### Records.
A `Record` is a *labelled* tuple of zero or more `Value`s, called the
record's *fields*. A record's label is, itself, a `Value`, though it
will usually be a `Symbol`.[^extensibility] [^iri-labels] `Record`s
are compared lexicographically as if they were just tuples; that is,
first by their labels, and then by the remainder of their fields. We
will only write examples of `Record`s having labels that are `Symbol`s
entirely composed of ASCII characters. Such `Record`s will be written
as a parenthesised, space-separated sequence of their label followed
by their fields.
[^extensibility]: The [Racket](https://racket-lang.org/) programming
language defines
[“prefab”](http://docs.racket-lang.org/guide/define-struct.html#(part._prefab-struct))
structure types, which map well to our `Record`s. Racket supports
record extensibility by encoding record supertypes into record
labels as specially-formatted lists.
[^iri-labels]: It is occasionally (but seldom) necessary to
interpret such `Symbol` labels as UTF-8 encoded IRIs. Where a
label can be read as a relative IRI, it is notionally interpreted
with respect to the IRI
`urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34`; where a label can
be read as an absolute IRI, it stands for that IRI; and otherwise,
it cannot be read as an IRI at all, and so the label simply stands
for itself - for its own `Value`.
**Examples.** The `Record` with label `foo` and fields 1, 2 and 3 is
written `(foo 1 2 3)`; the `Record` with label `void` and no fields is
written `(void)`.
### Sequences.
A `Sequence` is a general-purpose, variable-length ordered sequence of
zero or more `Value`s. `Sequence`s are compared lexicographically,
appealing to the ordering on `Value`s for comparisons at each position
in the `Sequence`s. We write examples space-separated, surrounded with
square brackets.
**Examples.** `[]`, the empty sequence; `[1 2 3]`, the sequence of
`SignedInteger`s 1, 2 and 3.
### Sets.
A `Set` is an unordered finite set of `Value`s. It contains no
duplicate values, following the [equivalence relation](#equivalence)
induced by the total order on `Value`s. Two `Set`s are compared by
sorting their elements using the [total order](#total-order) and
comparing the resulting sequences as `Sequence`s. We write examples
space-separated, surrounded with curly braces, prefixed by `#set`.
**Examples.** `#set{}`, the empty set; `#set{#set{}}`, the set
containing only the empty set; `#set{4 "hello" (void) 9.0f}`, the set
containing 4, the string `"hello"`, the record with label `void` and
no fields, and the `Float` denoting the number 9.0; `#set{1 1.0f}`,
the set containing a `SignedInteger` and a `Float`, both denoting the
number 1; `#set{#application/xml"<x/>" #application/xml"<x />"}`, a
set containing two different `MIMEData`
values.[^mimedata-xml-difference]
[^mimedata-xml-difference]: The two XML documents `<x/>` and `<x />`
differ by bytewise comparison, and thus yield different `MIMEData`
values, even though under the semantics of XML they denote
the same XML infoset.
**Non-examples.** `#set{1 1 1}`, because it contains multiple
equivalent `Value`s.
### Dictionaries, hash-tables or maps.
A `Dictionary` is an unordered finite collection of zero or more pairs
of `Value`s. Each pair comprises a *key* and a *value*. Keys in a
`Dictionary` must be pairwise distinct. Instances of `Dictionary` are
compared by lexicographic comparison of the sequences resulting from
ordering each `Dictionary`'s pairs in ascending order by key. Examples
are written as a `#dict`-prefixed, curly-brace-surrounded sequence of
space-separated key-value pairs, each written with a colon between the
key and value.
**Examples.** `#dict{}`, the empty dictionary; `#dict{a:1}`, the
dictionary mapping the `Symbol` `a` to the `SignedInteger` 1;
`#dict{1:a}`, mapping 1 to `a`; `#dict{"hi":0 hi:0 there:[]}`, having
a `String` and two `Symbol` keys, and `SignedInteger` and `Sequence`
values.
**Non-examples.** `#dict{a:1 b:2 a:3}`, because it contains duplicate
keys; `#dict{[]:[] []:99}`, for the same reason.
## Syntax
Now that we have discussed `Value`s and their meanings, we may turn to
techniques for *representing* `Value`s for communication or storage.
The syntax we have used for the examples so far is inadequate in many
ways, not least of which is that it cannot represent every `Value`.
Separation of the meaning of a piece of syntax from the syntax itself
opens the door to domain-specific syntaxes, all equivalent and
interconvertible.[^asn1] With a robust semantic foundation,
connections to other data languages can also be made.
[^asn1]: Those who remember
[ASN.1](https://www.itu.int/en/ITU-T/asn1/Pages/introduction.aspx)
will recall BER, DER, PER, CER, XER and so on, each appropriate to
a different setting. Similarly,
[Rivest's S-Expression design][sexp.txt] offers a human-friendly
syntax, a syntax robust to network-induced message corruption, and
an unambiguous, simple and easily-parsed machine-friendly syntax
for the same underlying values.
### Binary syntax
For now, we limit our attention to an easily-parsed, easily-produced
machine-readable syntax.
Every `Value` is represented as one or more bytes describing first its
kind and its length, and then its specific contents.
For a value `v`, we write `[[v]]` for the encoding of `v`.
The following figure summarises the definitions below:
    tt nn mmmm  varint(m)  contents
    -------------------------------
    00 00 mmmm  ...        application-specific Record
    00 01 mmmm  ...        application-specific Record
    00 10 mmmm  ...        application-specific Record
    00 11 mmmm  ...        Record
    01 00 mmmm  ...        Sequence
    01 01 mmmm  ...        Set
    01 10 mmmm  ...        Dictionary
    10 00 mmmm  ...        SignedInteger, big-endian binary
    10 01 mmmm  ...        String, UTF-8 binary
    10 10 mmmm  ...        Bytes
    10 11 mmmm  ...        Symbol, UTF-8 binary
    11 00 0000             False
    11 00 0001             True
    11 00 0010             Float, 32 bits big-endian binary
    11 00 0011             Double, 64 bits big-endian binary
    11 01 mmmm  ...        MIME-type-labelled binary data

    If mmmm = 1111, varint(m) is present; otherwise, m is the length
#### Type and Length representation
A `Value`'s type and length are represented by use of a function
`header(t,n,m)` that yields a sequence of bytes when `t`, `n` and `m`
are appropriate non-negative integers.
    header(t,n,m) = leadbyte(t,n,m)               when m < 15
                 or leadbyte(t,n,15) ++ varint(m) otherwise
The lead byte in a `Value`'s representation is constructed by a function
    leadbyte(t,n,m) = [t*64 + n*16 + m]
The lead byte describes the rest of the representation as
follows:[^some-encodings-unused]
    leadbyte(0,-,-) represents a Record
    leadbyte(1,-,-) represents a Sequence, Set or Dictionary
    leadbyte(2,-,-) represents an Atom with variable-length binary representation
    leadbyte(3,0,-) represents an Atom with fixed-length binary representation
    leadbyte(3,1,-) represents certain special variable-length values
[^some-encodings-unused]: Some encodings are unused. All such
encodings are reserved for future versions of this specification.
Variable-length representations use the value of `m` to encode their
lengths:
- Lengths between 0 and 14 are represented using `leadbyte` with `m`
values 0 through 14.
- Lengths of 15 or greater are represented by `m` value 15, and
additional "length bytes" describing the length then follow the
lead byte.
These additional length bytes are formatted as
[base 128 varints][varint]. Quoting the
[Google Protocol Buffers][varint] definition,
> Each byte in a varint, except the last byte, has the most
> significant bit (msb) set – this indicates that there are further
> bytes to come. The lower 7 bits of each byte are used to store the
> two's complement representation of the number in groups of 7 bits,
> least significant group first.
**Examples.**
- The varint representation of 15 is just the byte 15.
- 300 (binary, grouped into 7-bit chunks, `10 0101100`) varint-encodes to the two bytes 172 and 2.
- 1000000000 (binary `11 1011100 1101011 0010100 0000000`) varint-encodes to bytes 128, 148, 235, 220, and 3.
We write `varint(m)` for the varint-encoding of `m`.
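The definitions of `varint`, `leadbyte` and `header` translate almost directly into code. The following Python sketch is one possible transcription, not a reference implementation; the function names simply mirror the notation used above.

```python
def varint(m: int) -> bytes:
    """Base 128 varint: 7 bits per byte, least significant group first."""
    if m < 0:
        raise ValueError("lengths are non-negative")
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)  # msb set: more bytes follow
        m >>= 7
    out.append(m)  # final byte has the msb clear
    return bytes(out)

def leadbyte(t: int, n: int, m: int) -> bytes:
    """Pack the two-bit t, two-bit n and four-bit m into a single byte."""
    return bytes([t * 64 + n * 16 + m])

def header(t: int, n: int, m: int) -> bytes:
    """Short form when m < 15; otherwise m = 15 followed by varint(m)."""
    if m < 15:
        return leadbyte(t, n, m)
    return leadbyte(t, n, 15) + varint(m)

# The varint examples from the text.
print(list(varint(15)), list(varint(300)), list(varint(1000000000)))
```

For instance, `header(2, 1, 300)`, the header of a 300-byte `String`, is the lead byte `0x9F` followed by the two varint bytes 172 and 2.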
#### Records
    [[ (L F_1 ... F_m) ]] = header(0,3,m+1) ++ [[L]] ++ [[F_1]] ++ ... ++ [[F_m]]
For `m` fields, `m+1` is supplied to `header`, to account for the
encoding of the record label.
##### Application-specific short form for labels
Any given protocol using Preserves may additionally define an
interpretation for `n ∈ {0,1,2}`, mapping each *short form label
number* `n` to a specific record label. When encoding `m` fields with
short form label number `n`, the header is `header(0,n,m)` (rather
than `m+1`) since the label is implicit.
**Examples.** A protocol may choose to map records labelled `void` to
short form label number 0, making
    [[(void)]] = header(0,0,0) = [0x00]
or it may map records labelled `person` to short form label number 1,
making
    [[(person "Dr" "Elizabeth" "Blackwell")]]
      = header(0,1,3) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
      = [0x13] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
#### Sequences, Sets and Dictionaries
    [[ [X_1 ... X_m] ]]     = header(1,0,m) ++ [[X_1]] ++ ... ++ [[X_m]]
    [[ #set{X_1 ... X_m} ]] = header(1,1,m) ++ [[Y_1]] ++ ... ++ [[Y_m]]
                              where [Y_1 ... Y_m] = sort([X_1 ... X_m])
    [[ #dict{K_1:V_1 ... K_m:V_m} ]]
                            = header(1,2,m) ++ [[K'_1]] ++ [[V'_1]] ++ ... ++ [[K'_m]] ++ [[V'_m]]
                              where [[K'_1 V'_1] ... [K'_m V'_m]]
                                    = sort([[K_1 V_1] ... [K_m V_m]])
Note that `n=3` is unused and reserved.
#### Variable-length Atoms
##### SignedInteger
    [[ x ]] when x ∈ SignedInteger = header(2,0,m) ++ intbytes(x)
      where m           = |intbytes(x)|
        and intbytes(x) = a big-endian two's-complement representation
                          of the signed integer x, taking exactly as
                          many whole bytes as needed to unambiguously
                          identify the value
For example,
    [[   -257 ]] = [0x82, 0xFE, 0xFF]
    [[   -256 ]] = [0x82, 0xFF, 0x00]
    [[   -255 ]] = [0x82, 0xFF, 0x01]
    [[   -254 ]] = [0x82, 0xFF, 0x02]
    [[   -129 ]] = [0x82, 0xFF, 0x7F]
    [[   -128 ]] = [0x81, 0x80]
    [[   -127 ]] = [0x81, 0x81]
    [[     -2 ]] = [0x81, 0xFE]
    [[     -1 ]] = [0x81, 0xFF]
    [[      0 ]] = [0x80]
    [[      1 ]] = [0x81, 0x01]
    [[    127 ]] = [0x81, 0x7F]
    [[    128 ]] = [0x82, 0x00, 0x80]
    [[    255 ]] = [0x82, 0x00, 0xFF]
    [[    256 ]] = [0x82, 0x01, 0x00]
    [[  32767 ]] = [0x82, 0x7F, 0xFF]
    [[  32768 ]] = [0x83, 0x00, 0x80, 0x00]
    [[  65535 ]] = [0x83, 0x00, 0xFF, 0xFF]
    [[  65536 ]] = [0x83, 0x01, 0x00, 0x00]
    [[ 131072 ]] = [0x83, 0x02, 0x00, 0x00]
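The phrase "exactly as many whole bytes as needed" can be made concrete. The Python sketch below is one way to compute `intbytes`; the helper name `encode_small_integer` is hypothetical, and for brevity it handles only integers whose representation is shorter than 15 bytes, so the `varint` long form is deliberately left out.

```python
def intbytes(x: int) -> bytes:
    """Minimal big-endian two's-complement bytes; zero needs no bytes at all."""
    if x == 0:
        return b""
    # Bits needed for the magnitude, plus one bit for the sign.
    magnitude = x if x > 0 else ~x
    nbits = magnitude.bit_length() + 1
    return x.to_bytes((nbits + 7) // 8, "big", signed=True)

def encode_small_integer(x: int) -> bytes:
    """header(2,0,m) ++ intbytes(x), for m < 15 only (an assumption of this sketch)."""
    body = intbytes(x)
    if len(body) >= 15:
        raise ValueError("long form (varint length) omitted from this sketch")
    return bytes([2 * 64 + 0 * 16 + len(body)]) + body

# A few rows of the table above, as hexadecimal strings.
for x in (-257, -128, 0, 127, 128, 131072):
    print(x, encode_small_integer(x).hex())
```

Using `~x` for negative numbers is what makes -128 fit in one byte while -129 needs two.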
##### String
    [[ S ]] when S ∈ String = header(2,1,m) ++ utf8(S)
      where m       = |utf8(S)|
        and utf8(S) = the UTF-8 encoding of S
##### ByteString
    [[ B ]] when B ∈ ByteString = header(2,2,m) ++ B
      where m = |B|
##### Symbol
    [[ S ]] when S ∈ Symbol = header(2,3,m) ++ utf8(S)
      where m       = |utf8(S)|
        and utf8(S) = the UTF-8 encoding of S
#### Fixed-length Atoms
##### Booleans
    [[ #f ]] = header(3,0,0) = [0xC0]
    [[ #t ]] = header(3,0,1) = [0xC1]
##### Floats and Doubles
    [[ F ]] when F ∈ Float  = header(3,0,2) ++ binary32(F)
    [[ D ]] when D ∈ Double = header(3,0,3) ++ binary64(D)
      where binary32(F) and binary64(D) are big-endian 4- and 8-byte
            IEEE 754 binary representations
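As a sketch of the fixed-length encodings, Python's standard `struct` module can produce the big-endian IEEE 754 representations directly. The function names below are ours; the lead bytes `0xC2` and `0xC3` are `header(3,0,2)` and `header(3,0,3)` respectively.

```python
import struct

def encode_float(f: float) -> bytes:
    """header(3,0,2) followed by the 4-byte big-endian binary32 pattern."""
    return b"\xC2" + struct.pack(">f", f)

def encode_double(d: float) -> bytes:
    """header(3,0,3) followed by the 8-byte big-endian binary64 pattern."""
    return b"\xC3" + struct.pack(">d", d)

print(encode_float(1.0).hex(" "))
print(encode_double(1.0).hex(" "))
```

Python itself has only double-precision floats, so `encode_float` rounds its argument to the nearest single-precision value as a side effect of packing.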
#### Special variable-length values
##### MIMEData
Each `MIMEData` value comprises a media type `Symbol` and a raw
binary body.
    [[ M ]] when M ∈ MIMEData = header(3,1,m) ++ [[T]] ++ B
      where m = |B|
        and T is the Symbol media type of M
        and B is the ByteString body of M
## Examples
<!-- TODO: Give some examples of large and small Preserves, perhaps -->
<!-- translated from various JSON blobs floating around the internet. -->
For the following examples, imagine an application that maps `Record`
short form label number 0 to label `discard`, 1 to `capture`, and 2 to
`observe`.
| Value | Encoded hexadecimal byte sequence |
|--------------------------------------------------------------------|----------------------------------------------------|
| `(capture (discard))` | 11 00 |
| `(observe (speak (discard) (capture (discard))))` | 21 33 B5 73 70 65 61 6B 00 11 00 |
| `[1 2 3 4]` | 44 81 01 81 02 81 03 81 04 |
| `[-2 -1 0 1]` | 44 81 FE 81 FF 80 81 01 |
| `["hello" there #"world" [] #set{} #t #f]` | 47 95 68 65 6C 6C 6F B5 74 68 65 72 65 A5 77 6F 72 6C 64 40 50 C1 C0 |
| `-257` | 82 FE FF |
| `-1` | 81 FF |
| `0` | 80 |
| `1` | 81 01 |
| `255` | 82 00 FF |
| `1f` | C2 3F 80 00 00 |
| `1d` | C3 3F F0 00 00 00 00 00 00 |
| `-1.202e300d` | C3 FE 3C B7 B7 59 BF 04 26 |
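
Rows of this table can be reproduced mechanically. The following Python sketch is a deliberately partial encoder, covering `Record`s (including the three short form labels assumed above), `Sequence`s, and all atoms except `Float`, `Double` and `MIMEData`. The classes `Sym` and `Record` and the helper `rec` are hypothetical names chosen for this illustration; `Set` and `Dictionary` are omitted because they require the full total order.

```python
from typing import NamedTuple

class Sym(str):
    """Hypothetical marker: a str that stands for a Symbol, not a String."""

class Record(NamedTuple):
    label: object
    fields: tuple = ()

# Short form label numbers assumed by the examples in this section.
SHORT_FORMS = {Sym("discard"): 0, Sym("capture"): 1, Sym("observe"): 2}

def rec(label: str, *fields) -> Record:
    """Convenience constructor for Symbol-labelled records."""
    return Record(Sym(label), fields)

def varint(m: int) -> bytes:
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)

def header(t: int, n: int, m: int) -> bytes:
    if m < 15:
        return bytes([t * 64 + n * 16 + m])
    return bytes([t * 64 + n * 16 + 15]) + varint(m)

def intbytes(x: int) -> bytes:
    if x == 0:
        return b""
    magnitude = x if x > 0 else ~x
    return x.to_bytes((magnitude.bit_length() + 8) // 8, "big", signed=True)

def encode(v) -> bytes:
    if isinstance(v, Record):
        body = b"".join(encode(f) for f in v.fields)
        n = SHORT_FORMS.get(v.label) if isinstance(v.label, Sym) else None
        if n is not None:
            return header(0, n, len(v.fields)) + body  # label is implicit
        return header(0, 3, len(v.fields) + 1) + encode(v.label) + body
    if isinstance(v, bool):  # checked before int: bool is a subclass of int
        return header(3, 0, 1 if v else 0)
    if isinstance(v, int):
        body = intbytes(v)
        return header(2, 0, len(body)) + body
    if isinstance(v, Sym):  # checked before str: Sym is a subclass of str
        body = v.encode("utf-8")
        return header(2, 3, len(body)) + body
    if isinstance(v, str):
        body = v.encode("utf-8")
        return header(2, 1, len(body)) + body
    if isinstance(v, bytes):
        return header(2, 2, len(v)) + v
    if isinstance(v, list):
        return header(1, 0, len(v)) + b"".join(encode(x) for x in v)
    raise TypeError(f"not covered by this sketch: {v!r}")

print(encode(rec("capture", rec("discard"))).hex(" "))
print(encode([1, 2, 3, 4]).hex(" "))
```

The order of the `isinstance` checks matters, and it illustrates a general hazard of language mappings: the host language's own subtyping (here `bool` under `int`) must not be allowed to blur distinctions that `Value`s keep sharp.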
Finally, a larger example, using a non-`Symbol` label for a record.[^extensibility2] The `Value`
    ([titled person 2 thing 1]
     101
     "Blackwell"
     (date 1821 2 3)
     "Dr")
encodes to
    35                                ;; Record, generic, 4+1
      45                              ;; Sequence, 5
        b6 74 69 74 6c 65 64          ;; Symbol, "titled"
        b6 70 65 72 73 6f 6e          ;; Symbol, "person"
        81 02                         ;; SignedInteger, "2"
        b5 74 68 69 6e 67             ;; Symbol, "thing"
        81 01                         ;; SignedInteger, "1"
      81 65                           ;; SignedInteger, "101"
      99 42 6c 61 63 6b 77 65 6c 6c   ;; String, "Blackwell"
      34                              ;; Record, generic, 3+1
        b4 64 61 74 65                ;; Symbol, "date"
        82 07 1d                      ;; SignedInteger, "1821"
        81 02                         ;; SignedInteger, "2"
        81 03                         ;; SignedInteger, "3"
      92 44 72                        ;; String, "Dr"
[^extensibility2]: It happens to line up with Racket's
representation of a record label for an inheritance hierarchy
where `titled` extends `person` extends `thing`:
        (struct date (year month day) #:prefab)
        (struct thing (id) #:prefab)
        (struct person thing (name date-of-birth) #:prefab)
        (struct titled person (title) #:prefab)
## Conventions for Common Data Types
The `Value` data type is essentially an S-Expression, able to
represent semi-structured data over `ByteString`, `String`,
`SignedInteger` atoms and so on.
However, users need a wide variety of data types for representing
domain-specific values such as various kinds of encoded and normalized
text, calendrical values, machine words, and so on.
We use appropriately-labelled `Record`s to denote these
domain-specific data types.
All of these conventions are optional. They form a layer atop the core
`Value` structure. Non-domain-specific tools do not in general need to
treat them specially.
**Validity.** Many of the labels we will describe in this section come
with side-conditions on the contents of labelled `Record`s. It is
possible to construct an instance of `Value` that violates these
side-conditions without ceasing to be a `Value` or becoming
unrepresentable. However, we say that such a `Value` is *invalid*
because it fails to honour the necessary side-conditions.
Implementations *SHOULD* allow two modes of working: one which
treats all `Value`s identically, without regard for side-conditions,
and one which enforces validity (i.e. side-conditions) when reading,
writing, or constructing `Value`s.
### Text
#### Normalization forms
In order for users to unambiguously signal or require a particular
[normalization form](http://unicode.org/reports/tr15/), we define a
`NormalizedString`, which is a `Record` labelled with
`unicode-normalization` and having two fields, the first of which is a
`Symbol` specifying the normalization form used (e.g. `nfc`, `nfd`,
`nfkc`, `nfkd`), and the second of which is a `String` whose
underlying code point representation *MUST* be normalized according to
the named normalization form.
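A validity check for this side-condition might look like the following Python sketch, which relies on the standard `unicodedata` module. The function name is ours, and the sketch recognises only the four form names given as examples above; it raises an error for any other name rather than guessing at its meaning.

```python
import unicodedata

# The four normalization forms named in the text, keyed by Symbol name.
FORMS = {"nfc": "NFC", "nfd": "NFD", "nfkc": "NFKC", "nfkd": "NFKD"}

def is_valid_normalized_string(form: str, text: str) -> bool:
    """True if `text` is already normalized according to `form`."""
    if form not in FORMS:
        raise ValueError(f"normalization form not known to this sketch: {form}")
    # A string is normalized exactly when normalizing it changes nothing.
    return unicodedata.normalize(FORMS[form], text) == text

precomposed = "p\u00e4ron"  # U+00E4, a single code point for the a-umlaut
decomposed = "pa\u0308ron"  # "a" followed by U+0308 COMBINING DIAERESIS

print(is_valid_normalized_string("nfc", precomposed))
print(is_valid_normalized_string("nfc", decomposed))
```

Under this check, `(unicode-normalization nfc "päron")` is valid only when the string uses the precomposed code point.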
#### IRIs (URIs, URLs, URNs, etc.)
An `IRI` is a `Record` labelled with `iri` and having one field, a
`String` which is the IRI itself and which *MUST* be a valid absolute
or relative IRI.
### Machine words
The definition of `SignedInteger` captures all integers. However, in
certain circumstances it can be valuable to assert that a number
inhabits a particular range, such as a fixed-width machine word.
A family of labels `i`*n* and `u`*n* for *n* ∈ {16,32,64} denote
*n*-bit-wide signed and unsigned range restrictions, respectively.
Records with these labels *MUST* have one field, a `SignedInteger`,
which *MUST* fall within the appropriate range. That is, to be valid,
- in `(i16 `*x*`)`, -32768 <= *x* <= 32767.
- in `(u16 `*x*`)`, 0 <= *x* <= 65535.
- in `(i32 `*x*`)`, -2147483648 <= *x* <= 2147483647.
- etc.
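The range restrictions follow one pattern for every width, which the Python sketch below spells out. The function names are hypothetical; only the labels `i16`, `u16`, `i32`, `u32`, `i64` and `u64` described above are accepted.

```python
def machine_word_range(label: str) -> range:
    """The range of integers permitted inside a record labelled i<n> or u<n>."""
    kind, width = label[:1], label[1:]
    if kind not in ("i", "u") or width not in ("16", "32", "64"):
        raise ValueError(f"not a machine word label: {label}")
    n = int(width)
    if kind == "i":
        return range(-(1 << (n - 1)), 1 << (n - 1))  # -2^(n-1) .. 2^(n-1)-1
    return range(0, 1 << n)                          # 0 .. 2^n-1

def is_valid_machine_word(label: str, x: int) -> bool:
    """Check the side-condition on the single field of (label x)."""
    return x in machine_word_range(label)

print(is_valid_machine_word("i16", -32768), is_valid_machine_word("i16", 32768))
print(is_valid_machine_word("u16", 65535), is_valid_machine_word("u16", -1))
```

A reader working in validity-enforcing mode would apply such a check after decoding the record; a reader in the permissive mode would skip it.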
### Anonymous Tuples and Unit
A `Tuple` is a `Record` with label `tuple` and zero or more fields,
denoting an anonymous tuple of values.
The 0-ary tuple, `(tuple)`, denotes the empty tuple, sometimes called
"unit" or "void" (but *not* e.g. JavaScript's "undefined" value).
### Null and Undefined
Tony Hoare's
"[billion-dollar mistake](https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retractions)"
can be represented with the 0-ary `Record` `(null)`. An "undefined"
value can be represented as `(undefined)`.
### Dates and Times
Dates, times, moments, and timestamps can be represented with a
`Record` with label `rfc3339` having a single field, a `String`, which
*MUST* conform to one of the `full-date`, `partial-time`, `full-time`,
or `date-time` productions of
[section 5.6 of RFC 3339](https://tools.ietf.org/html/rfc3339#section-5.6).
## Representing Values in Programming Languages
We have given a definition of `Value` and its semantics, and proposed
a concrete syntax for communicating and storing `Value`s. We now turn
to **suggested** representations of `Value`s as *programming-language
values* for various programming languages.
When designing a language mapping, an important consideration is
roundtripping: serialization after deserialization, and vice versa,
should both be identities.
### JavaScript
- `SignedInteger` ↔ numbers or `BigInt` [[1](https://developers.google.com/web/updates/2018/05/bigint), [2](https://github.com/tc39/proposal-bigint)]
- `String` ↔ strings
- `ByteString` ↔ `Uint8Array`
- `Symbol` ↔ `Symbol.for(...)`
- `Boolean` ↔ `Boolean`
- `Float` and `Double` ↔ numbers
- `MIMEData` ↔ `{ "type": aString, "data": aUint8Array }`
- `Record` ↔ `{ "_label": theLabel, "_fields": [field0, ..., fieldN] }`, plus convenience accessors
    - `(undefined)` ↔ the undefined value
    - `(rfc3339 F)` ↔ `Date`, if `F` matches the `date-time` RFC 3339 production
- `Sequence` ↔ `Array`
- `Set` ↔ `{ "_set": M }` where `M` is a `Map` from the elements of the set to `true`
- `Dictionary` ↔ a `Map`
### Scheme/Racket
- `SignedInteger` ↔ exact numbers
- `String` ↔ strings
- `ByteString` ↔ byte vector (Racket: "Bytes")
- `Symbol` ↔ symbols
- `Boolean` ↔ booleans
- `Float` and `Double` ↔ inexact numbers (Racket: single- and double-precision floats)
- `MIMEData` ↔ a structure with a `type` and a `data` field (Racket: `(struct mime (type data))`)
- `Record` ↔ structures (Racket: prefab struct)
- `Sequence` ↔ lists
- `Set` ↔ Racket: sets
- `Dictionary` ↔ Racket: hash-table
### Java
- `SignedInteger` ↔ `Integer`, `Long`, `BigInteger`
- `String` ↔ `String`
- `ByteString` ↔ `byte[]`
- `Symbol` ↔ a simple data class wrapping a `String`
- `Boolean` ↔ `Boolean`
- `Float` and `Double` ↔ `Float` and `Double`
- `MIMEData` ↔ an implementation of `javax.activation.DataSource`, maybe?
- `Record` ↔ in a simple implementation, a generic `Record` class; else perhaps a bean mapping?
- `Sequence` ↔ an implementation of `java.util.List`
- `Set` ↔ an implementation of `java.util.Set`
- `Dictionary` ↔ an implementation of `java.util.Map`
### Erlang
- `SignedInteger` ↔ integers
- `String` ↔ tuple of `utf8` and a binary
- `ByteString` ↔ a binary
- `Symbol` ↔ the underlying string converted to an Erlang atom, if
some kind of an "unsafe" mode is set on the decoder (because Erlang
atoms are not GC'd); otherwise perhaps a tuple of `symbol` and a
binary of the utf-8
- `Boolean` ↔ `true` and `false`
- `Float` and `Double` ↔ floats (unsure how Erlang deals with single-precision)
- `MIMEData` ↔ tuple of the type as a utf8 binary, and the data as a binary
- `Record` ↔ a tuple with the label in the first position, and the fields in subsequent positions
- `Sequence` ↔ a list
- `Set` ↔ a `sets` set (is this unambiguous? Maybe a [map][erlang-map] from elements to `true`?)
- `Dictionary` ↔ a [map][erlang-map] (new in Erlang/OTP R17)
## Appendix. Table of lead byte values
     0x - short form Record label index 0
     1x - short form Record label index 1
     2x - short form Record label index 2
     3x - Record
     4x - Sequence
     5x - Set
     6x - Dictionary
    (7x) RESERVED
     8x - SignedInteger
     9x - String
     Ax - Bytes
     Bx - Symbol
     C0 - False
     C1 - True
     C2 - Float
     C3 - Double
    (Cx) RESERVED C4-CF
     Dx - MIMEData
    (Ex) RESERVED
    (Fx) RESERVED
## Appendix. Why not Just Use JSON?
<!-- JSON lacks semantics: JSON syntax doesn't denote anything -->
JSON offers *syntax* for numbers, strings, booleans, null, arrays and
string-keyed maps. However, it suffers from two major problems. First,
it offers no *semantics* for the syntax: it is left to each
implementation to determine how to treat each JSON term. This causes
[interoperability](http://seriot.ch/parsing_json.php) and even
[security](http://web.archive.org/web/20180906202559/http://docs.couchdb.org/en/stable/cve/2017-12635.html)
issues. Second, JSON's lack of support for type tags leads to awkward
and incompatible *encodings* of type information in terms of the fixed
suite of constructors on offer.
There are other minor problems with JSON having to do with its syntax.
Examples include its relative verbosity and its lack of support for
binary data.
### JSON syntax doesn't *mean* anything
When are two JSON values the same? When are they different?
<!-- When is one JSON value "less than" another? -->
The specifications are largely silent on these questions. Different
JSON implementations give different answers.
Specifically, JSON does not:
- assign any meaning to numbers,[^meaning-ieee-double]
- determine how strings are to be compared,[^string-key-comparison]
- determine whether object key ordering is significant,[^json-member-ordering] or
- determine whether duplicate object keys are permitted, what it
would mean if they were, or how to determine a duplicate in the
first place.[^json-key-uniqueness]
In short, JSON syntax doesn't *denote* anything.[^xml-infoset] [^other-formats]
[^meaning-ieee-double]:
[Section 6 of RFC 7159](https://tools.ietf.org/html/rfc7159#section-6)
does go so far as to indicate “good interoperability can be
achieved” by imagining that parsers are able reliably to
understand the syntax of numbers as denoting an IEEE 754
double-precision floating-point value.
[^string-key-comparison]:
[Section 8.3 of RFC 7159](https://tools.ietf.org/html/rfc7159#section-8.3)
suggests that *if* an implementation compares strings used as
object keys “code unit by code unit”, then it will interoperate
with *other such implementations*, but neither requires this
behaviour nor discusses comparisons of strings used in other
contexts.
[^json-member-ordering]:
[Section 4 of RFC 7159](https://tools.ietf.org/html/rfc7159#section-4)
remarks that “[implementations] differ as to whether or not they
make the ordering of object members visible to calling software.”
[^json-key-uniqueness]:
[Section 4 of RFC 7159](https://tools.ietf.org/html/rfc7159#section-4)
is the only place in the specification that mentions the issue. It
explicitly sanctions implementations supporting duplicate keys,
noting only that “when the names within an object are not unique,
the behavior of software that receives such an object is
unpredictable.” Implementations are free to choose any behaviour
at all in this situation, including signalling an error, or
discarding all but one of a set of duplicates.
[^xml-infoset]: The XML world has the concept of
[XML infoset](https://www.w3.org/TR/xml-infoset/). Loosely
speaking, XML infoset is the *denotation* of an XML document; the
*meaning* of the document.
[^other-formats]: Most other recent data languages are like JSON in
specifying only a syntax with no associated semantics. While some
do make a sketch of a semantics, the result is often
underspecified (e.g. in terms of how strings are to be compared),
overly machine-oriented (e.g. treating 32-bit integers as
fundamentally distinct from 64-bit integers and from
floating-point numbers), overly fine (e.g. giving visibility to
the order in which map entries are written), or all three.
Some examples:
- are the JSON values `1`, `1.0`, and `1e0` the same or different?
- are the JSON values `1.0` and `1.0000000000000001` the same or different?
- are the JSON strings `"päron"` (UTF-8 `70c3a4726f6e`) and `"päron"`
(UTF-8 `7061cc88726f6e`) the same or different?
- are the JSON objects `{"a":1, "b":2}` and `{"b":2, "a":1}` the same
or different?
- which, if any, of `{"a":1, "a":2}`, `{"a":1}` and `{"a":2}` are the
same? Are all three legal?
- are `{"päron":1}` and `{"päron":1}` the same or different?
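To see how one particular implementation answers these questions, consider the following Python sketch using the standard `json` module. The answers it gives are those of that module alone; other JSON libraries are free to answer differently, which is precisely the problem.

```python
import json

# 1, 1.0 and 1e0 compare equal as numbers, yet parse to different types.
same_number = json.loads("1") == json.loads("1.0") == json.loads("1e0")
number_types = (type(json.loads("1")), type(json.loads("1.0")))

# The extra digit is lost when rounding to the nearest double.
same_double = json.loads("1.0") == json.loads("1.0000000000000001")

# Precomposed and decomposed spellings of the same text differ.
nfc = json.loads(r'"p\u00e4ron"')
nfd = json.loads(r'"pa\u0308ron"')
same_string = nfc == nfd

# Member order is ignored when comparing the resulting dicts.
same_object = json.loads('{"a":1, "b":2}') == json.loads('{"b":2, "a":1}')

# Duplicate keys are accepted silently; the last one wins.
duplicate = json.loads('{"a":1, "a":2}')

print(same_number, number_types, same_double, same_string, same_object, duplicate)
```

Here the two strings are judged different while the two spellings of the number 1 are judged the same, and the document with a duplicate key is silently treated as `{"a": 2}`.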
### JSON can multiply nicely, but it can't add very well
JSON includes a fixed set of types: numbers, strings, booleans, null,
arrays and string-keyed maps. Domain-specific data must be *encoded*
into these types. For example, dates and email addresses are often
represented as strings with an implicit internal structure.
There is no convention for *labelling* a value as belonging to a
particular category. This makes it difficult to extract, say, all
email addresses, or all URLs, from an arbitrary JSON document.
Instead, JSON-encoded data are often labelled in an ad-hoc way.
Multiple incompatible approaches exist. For example, a "money"
structure containing a `currency` field and an `amount` may be
represented in any number of ways:

    { "_type": "money", "currency": "EUR", "amount": 10 }
    { "type": "money", "value": { "currency": "EUR", "amount": 10 } }
    [ "money", { "currency": "EUR", "amount": 10 } ]
    { "@money": { "currency": "EUR", "amount": 10 } }
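To make the cost of these conventions concrete, here is a minimal Python sketch of a consumer that wants to recognise the money value in each of the four encodings above. The helper `extract_money` is hypothetical and knows only these four shapes; any fifth convention would need yet another branch.

```python
import json

def extract_money(doc):
    """Return (currency, amount) if `doc` follows one of the four ad-hoc
    "money" conventions listed above, otherwise None. Hypothetical helper."""
    if isinstance(doc, dict):
        # Convention 1: a "_type" tag stored alongside the fields.
        if doc.get("_type") == "money":
            return (doc["currency"], doc["amount"])
        # Convention 2: a "type" tag with the fields nested under "value".
        if doc.get("type") == "money" and isinstance(doc.get("value"), dict):
            inner = doc["value"]
            return (inner["currency"], inner["amount"])
        # Convention 4: a single "@money" key wrapping the fields.
        if isinstance(doc.get("@money"), dict):
            inner = doc["@money"]
            return (inner["currency"], inner["amount"])
    # Convention 3: a two-element array of label and fields.
    if (isinstance(doc, list) and len(doc) == 2
            and doc[0] == "money" and isinstance(doc[1], dict)):
        return (doc[1]["currency"], doc[1]["amount"])
    return None

ENCODINGS = [
    '{ "_type": "money", "currency": "EUR", "amount": 10 }',
    '{ "type": "money", "value": { "currency": "EUR", "amount": 10 } }',
    '[ "money", { "currency": "EUR", "amount": 10 } ]',
    '{ "@money": { "currency": "EUR", "amount": 10 } }',
]

results = [extract_money(json.loads(text)) for text in ENCODINGS]
```

All four inputs yield the same pair, but only because the consumer was told about each convention in advance.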
The lack of a standard labelling convention causes particular problems
when JSON is used to represent *sum* or *union* types, such as "either
a value or an error, but not both". Again, multiple incompatible
approaches exist.
For example, imagine an API for depositing money in an account. The
response might be either a "success" response indicating the new
balance, or one of a set of possible errors.
Sometimes, a *pair* of values is used, with `null` marking the option
not taken.[^interesting-failure-mode]

    { "ok": { "balance": 210 }, "error": null }
    { "ok": null, "error": "Unauthorized" }
[^interesting-failure-mode]: What is the meaning of a document where
both `ok` and `error` are non-null? What might happen when a
program is presented with such a document?
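The footnote's question can be made concrete. In the following Python sketch, two hypothetical consumers (`read_ok_first` and `read_error_first`, names invented here) agree on both well-formed responses above but disagree about a document in which both fields are non-null.

```python
import json

def read_ok_first(doc):
    """Hypothetical consumer that inspects "ok" before "error"."""
    if doc.get("ok") is not None:
        return ("ok", doc["ok"])
    return ("error", doc.get("error"))

def read_error_first(doc):
    """Hypothetical consumer that inspects "error" before "ok"."""
    if doc.get("error") is not None:
        return ("error", doc["error"])
    return ("ok", doc.get("ok"))

success = json.loads('{ "ok": { "balance": 210 }, "error": null }')
failure = json.loads('{ "ok": null, "error": "Unauthorized" }')
# Syntactically valid JSON that the pair encoding cannot rule out.
both = json.loads('{ "ok": { "balance": 210 }, "error": "Unauthorized" }')

agree_on_success = read_ok_first(success) == read_error_first(success)
agree_on_failure = read_ok_first(failure) == read_error_first(failure)
agree_on_both = read_ok_first(both) == read_error_first(both)
```

Neither consumer is wrong according to the encoding itself, since nothing in it says which field takes priority.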
The branch not taken is not always present; sometimes it is omitted,
as if it were an optional field:

    { "ok": { "balance": 210 } }
    { "error": "Unauthorized" }
Sometimes, an array of a label and a value is used:

    [ "ok", { "balance": 210 } ]
    [ "error", "Unauthorized" ]
Sometimes, the shape of the data is sufficient to distinguish among
the alternatives, and the label is left implicit:

    { "balance": 210 }
    "Unauthorized"
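A decoder for this implicit-label style has to guess the alternative from the shape of the value. The Python sketch below shows one such guesser; `classify_by_shape` and the structured error document are hypothetical and are not taken from any real API.

```python
import json

def classify_by_shape(doc):
    """Guess which alternative `doc` represents from its shape alone.
    Hypothetical helper: a bare string is taken to be an error, and an
    object with a "balance" member is taken to be a success."""
    if isinstance(doc, str):
        return "error"
    if isinstance(doc, dict) and "balance" in doc:
        return "ok"
    return "unknown"

success = json.loads('{ "balance": 210 }')
failure = json.loads('"Unauthorized"')
# Hypothetical later revision in which an error also reports a balance.
structured_error = json.loads('{ "error": "Insufficient funds", "balance": 5 }')

guesses = [classify_by_shape(d) for d in (success, failure, structured_error)]
```

The two documents from the text are classified as intended, while the hypothetical structured error is mistaken for a success, because the only available evidence is the presence of a `balance` member.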
JSON itself does not offer any guidance on which of these options to
choose. In many real cases on the web, poor choices have led to
encodings that are irrecoverably ambiguous.
---
---
# Open questions
Q. Should "symbols" instead be URIs? Relative, usually; relative to
what? Some domain-specific base URI?
Q. What about general rationals, subsuming integers and IEEE floats
(except NaN and the Infinities)?
Q. Should I map to SPKI SEXP or is that nonsense / for later?[^why-not-spki-sexps]
[^why-not-spki-sexps]: Why not just use Rivest's S-Expressions as
they are? While they include binary data and sequences, and an
obvious equivalence for them exists, they lack numbers *per se* as
well as any kind of unordered structure such as sets or maps. In
addition, while "display hints" allow labelling of binary data
with an intended interpretation, they cannot be attached to any
other kind of structure, and the "hint" itself can only be a
binary blob.
Q. Should `MIMEData` be a special syntax for `Record`s with a single
`ByteString` field?
A. Not even. It should probably just be moved to the "conventions"
section. Compare:

    D5 BA text/plain hello -- using special MIMEData encoding
    32 BA text/plain A5 hello -- using bog standard type-labelled Record
Q. Should `Symbol` be a special syntax for a `Record` with a `Symbol`
label (recursive!?) and a single `String` field?
Q. Should `String` be a special syntax for `(utf8 Bytes)`? Again,
recursiveness problems...?
Q. Should `Dictionary` be a special syntax for etc etc.? `Set`?
`Float`? `Double`?
--> Rule of thumb: if there's a special equivalence predicate for it,
it needs to be built-in syntax. Otherwise it can be a regular
record. (So: `Boolean` might not make the cut for special
treatment?? Likewise `String`...? Ugh those are psychologically
important perhaps)
Q. Are the language mappings reasonable? How about one for Python?
---
OK so. No built-in `MIMEData`, but maybe a conventional `(mime-data
Symbol Bytes)`? Applications can put it in a short slot if they like.
Streaming: needed for variable-sized structures. Tricky to design
syntax for this that isn't gratuitously warty. End byte value.
Literal small integers: could be nice? Not absolutely necessary.
Give algorithm for computing size of integers.
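One possible shape for such an algorithm, sketched in Python. It assumes (the notes above do not say so) that integers are written as minimal-length big-endian two's complement and that zero occupies one byte. The names `signed_byte_length` and `encode_signed` are hypothetical.

```python
def signed_byte_length(n: int) -> int:
    """Smallest number of bytes that can hold `n` in two's complement.
    Assumption: zero takes one byte rather than none."""
    # For negative n, ~n == -n - 1 is non-negative and needs the same number
    # of magnitude bits as n does; one further bit is needed for the sign.
    magnitude_bits = (~n if n < 0 else n).bit_length()
    # ceil((magnitude_bits + 1) / 8), written with integer arithmetic.
    return magnitude_bits // 8 + 1

def encode_signed(n: int) -> bytes:
    """Big-endian two's complement encoding of `n` at its minimal length."""
    return n.to_bytes(signed_byte_length(n), "big", signed=True)
```

Under these assumptions `encode_signed(127)` fits in one byte while `encode_signed(128)` needs two, because the top bit of the first byte is reserved for the sign.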
Give up on sorting requirement for representation of sets and
dictionaries?? Probably a good idea if there are streaming forms of
them because that sounds impossible to do??
Maybe reorder: fixed-length atoms first, then variable-length atoms,
then fixed-length compounds, then variable-length compounds? Reason
being that then maybe can put the streaming forms of the
variable-length ones very last.
---