WIP from the early hours of this morning, adding textual syntax

Tony Garnock-Jones 2018-09-27 11:42:55 +01:00
parent 906f8a01b6
commit 6fa0dde8f4
1 changed files with 249 additions and 99 deletions


@ -6,12 +6,13 @@
# Preserves: an Expressive Data Language # Preserves: an Expressive Data Language
Tony Garnock-Jones <tonyg@leastfixedpoint.com> Tony Garnock-Jones <tonyg@leastfixedpoint.com>
September 2018. Version 0.0.2. September 2018. Version 0.0.3.
[sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
[spki]: http://world.std.com/~cme/html/spki.html [spki]: http://world.std.com/~cme/html/spki.html
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints [varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
[erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
[abnf]: https://tools.ietf.org/html/rfc7405
This document proposes a data model and serialization format called This document proposes a data model and serialization format called
*Preserves*. *Preserves*.
@ -47,7 +48,8 @@ structures of any particular implementation language.
Taking inspiration from functional programming, we start with a Taking inspiration from functional programming, we start with a
definition of the *values* that we want to work with and give them definition of the *values* that we want to work with and give them
meaning independent of their syntax. We will treat syntax separately, meaning independent of their syntax. When we write examples of values,
we will do so using the [textual syntax](#textual-syntax) defined
later in this document. later in this document.
Our `Value`s fall into two broad categories: *atomic* and *compound* Our `Value`s fall into two broad categories: *atomic* and *compound*
@ -94,8 +96,7 @@ neither is less than the other according to the total order.
### Signed integers. ### Signed integers.
A `SignedInteger` is a signed integer of arbitrary width. A `SignedInteger` is a signed integer of arbitrary width.
`SignedInteger`s are compared as mathematical integers. We will write `SignedInteger`s are compared as mathematical integers.
examples of `SignedInteger`s using standard mathematical notation.
**Examples.** 10; -6; 0. **Examples.** 10; -6; 0.
@ -107,8 +108,7 @@ examples of `SignedInteger`s using standard mathematical notation.
A `String` is a sequence of Unicode A `String` is a sequence of Unicode
[code-point](http://www.unicode.org/glossary/#code_point)s. `String`s [code-point](http://www.unicode.org/glossary/#code_point)s. `String`s
are compared lexicographically, code-point by are compared lexicographically, code-point by
code-point.[^utf8-is-awesome] We will write examples of `String`s as code-point.[^utf8-is-awesome]
text surrounded by quotes “`"`”.
[^utf8-is-awesome]: Happily, the design of UTF-8 is such that this [^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
gives the same result as a lexicographic byte-by-byte comparison gives the same result as a lexicographic byte-by-byte comparison
@ -121,33 +121,27 @@ the string containing the three Unicode code-points `z` (0x7A), `水`
### Binary data. ### Binary data.
A `ByteString` is an ordered sequence of zero or more eight-bit bytes. A `ByteString` is an ordered sequence of zero or more eight-bit bytes.
`ByteString`s are compared lexicographically. We will only write `ByteString`s are compared lexicographically.
examples of `ByteString`s that contain bytes denoting printable ASCII
characters, using “`#"`” as an open-quote and “`"`” as a close-quote
mark.
**Examples.** The `ByteString` containing the integers 65, 66 and 67 **Examples.** `#""`, the empty `ByteString`; `#"ABC"`, the
(corresponding to ASCII characters `A`, `B` and `C`) is written as `ByteString` containing the integers 65, 66 and 67 (corresponding to
`#"ABC"`. The empty `ByteString` is written as `#""`. **N.B.** Despite ASCII characters `A`, `B` and `C`). **N.B.** Despite appearances,
appearances, these are *binary* data. these are *binary* data.
### Symbols. ### Symbols.
Programming languages like Lisp and Prolog frequently use string-like Programming languages like Lisp and Prolog frequently use string-like
values called *symbols*. Here, a `Symbol` is, like a `String`, a values called *symbols*. Here, a `Symbol` is, like a `String`, a
sequence of Unicode code-points representing an identifier of some sequence of Unicode code-points representing an identifier of some
kind. `Symbol`s are also compared lexicographically by code-point. We kind. `Symbol`s are also compared lexicographically by code-point.
will write examples including only non-empty sequences of
non-whitespace characters, using a monospace font without quotation
marks.
**Examples.** `hello-world`; `utf8-string`; `exact-integer?`. **Examples.** `hello-world`; `utf8-string`; `exact-integer?`.
### Booleans. ### Booleans.
There are exactly two `Boolean` values, “false” and “true”. The There are exactly two `Boolean` values, “false” and “true”. The
“false” value compares less-than the “true” value. We write `#f` for “false” value compares less-than the “true” value. We write `#false`
“false”, and `#t` for “true”. for “false”, and `#true` for “true”.
### IEEE floating-point values. ### IEEE floating-point values.
@ -159,11 +153,11 @@ every `Double`, and every `SignedInteger` is greater than both. Two
`Float`s or two `Double`s are to be ordered by the `totalOrder` `Float`s or two `Double`s are to be ordered by the `totalOrder`
predicate defined in section 5.10 of predicate defined in section 5.10 of
[IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935). [IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).
We write examples using standard mathematical notation, avoiding NaN We write examples using a fractional part and/or an exponent to
and infinities, using a suffix `f` or `d` to indicate `Float` or distinguish them from `SignedInteger`s. An additional suffix `f`
`Double`, respectively. distinguishes `Float`s from `Double`s.
**Examples.** 10f; -6d; 0f; 0.5d; -1.202e300d. **Examples.** 10.0f; -6.0; 0.0f; 0.5; -1.202e300.
**Non-examples.** 10, -6, and 0, because writing them this way **Non-examples.** 10, -6, and 0, because writing them this way
indicates `SignedInteger`s, not `Float`s or `Double`s. indicates `SignedInteger`s, not `Float`s or `Double`s.
@ -174,9 +168,7 @@ A `Record` is a *labelled* tuple of zero or more `Value`s, called the
record's *fields*. A record's label is itself a `Value`, though it record's *fields*. A record's label is itself a `Value`, though it
will usually be a `Symbol`.[^extensibility] [^iri-labels] `Record`s will usually be a `Symbol`.[^extensibility] [^iri-labels] `Record`s
are compared lexicographically as if they were just tuples; that is, are compared lexicographically as if they were just tuples; that is,
first by their labels, and then by the remainder of their fields. We first by their labels, and then by the remainder of their fields.
will write examples of `Record`s as a parenthesised, space-separated
sequence of their label `Value` followed by their field `Value`s.
[^extensibility]: The [Racket](https://racket-lang.org/) programming [^extensibility]: The [Racket](https://racket-lang.org/) programming
language defines language defines
@ -194,17 +186,16 @@ sequence of their label `Value` followed by their field `Value`s.
it cannot be read as an IRI at all, and so the label simply stands it cannot be read as an IRI at all, and so the label simply stands
for itself—for its own `Value`. for itself—for its own `Value`.
**Examples.** The `Record` with label `foo` and fields 1, 2 and 3 is **Examples.** `foo(1 2 3)`, a `Record` with label `foo` and fields 1,
written `(foo 1 2 3)`; the `Record` with label `void` and no fields is 2 and 3; `void()`, a `Record` with label `void` and no fields.
written `(void)`.
**Non-examples.** `()`, because it lacks a label. **Non-examples.** `()`, because it lacks a label; `void`, because it
lacks even an empty tuple of fields.
### Sequences. ### Sequences.
A `Sequence` is a general-purpose, variable-length ordered sequence of A `Sequence` is a general-purpose, variable-length ordered sequence of
zero or more `Value`s. `Sequence`s are compared lexicographically. We zero or more `Value`s. `Sequence`s are compared lexicographically.
write examples space-separated, surrounded with square brackets.
**Examples.** `[]`, the empty sequence; `[1 2 3]`, the sequence of **Examples.** `[]`, the empty sequence; `[1 2 3]`, the sequence of
`SignedInteger`s 1, 2 and 3. `SignedInteger`s 1, 2 and 3.
@ -215,25 +206,24 @@ A `Set` is an unordered finite set of `Value`s. It contains no
duplicate values, following the [equivalence relation](#equivalence) duplicate values, following the [equivalence relation](#equivalence)
induced by the total order on `Value`s. Two `Set`s are compared by induced by the total order on `Value`s. Two `Set`s are compared by
sorting their elements ascending using the [total order](#total-order) sorting their elements ascending using the [total order](#total-order)
and comparing the resulting `Sequence`s. We write examples and comparing the resulting `Sequence`s.
space-separated, surrounded with curly braces, prefixed by `#set`.
**Examples.** `#set{}`, the empty set; `#set{#set{}}`, the set **Examples.** `#set{}`, the empty set; `#set{#set{}}`, the set
containing only the empty set; `#set{4 "hello" (void) 9.0f}`, the set containing only the empty set; `{4 "hello" (void) 9.0f}`, the set
containing 4, the string `"hello"`, the record with label `void` and containing 4, the string `"hello"`, the record with label `void` and
no fields, and the `Float` denoting the number 9.0; `#set{1 1.0f}`, no fields, and the `Float` denoting the number 9.0; `{1 1.0f}`, the
the set containing a `SignedInteger` and a `Float`; `#set{(mime set containing a `SignedInteger` and a `Float`; `{mime(application/xml
application/xml #"<x/>") (mime application/xml #"<x />")}`, a set #"<x/>") mime(application/xml #"<x />")}`, a set containing two
containing two different type-labelled byte different `mime` records.[^mime-xml-difference]
arrays.[^mime-xml-difference]
[^mime-xml-difference]: The two XML documents `<x/>` and `<x />` [^mime-xml-difference]: The two XML documents `<x/>` and `<x />`
differ by bytewise comparison, and thus yield different record differ by bytewise comparison, and thus yield different record
values, even though under the semantics of XML they denote values, even though under the semantics of XML they denote
identical XML infoset. identical XML infoset.
**Non-examples.** `#set{1 1 1}`, because it contains multiple **Non-examples.** `{1 1}`, because it contains multiple equivalent
equivalent `Value`s. `Value`s; `{}`, because without the `#set` marker, it denotes the
empty dictionary.
### Dictionaries. ### Dictionaries.
@ -241,27 +231,189 @@ A `Dictionary` is an unordered finite collection of pairs of `Value`s.
Each pair comprises a *key* and a *value*. Keys in a `Dictionary` must Each pair comprises a *key* and a *value*. Keys in a `Dictionary` must
be pairwise distinct. Instances of `Dictionary` are compared by be pairwise distinct. Instances of `Dictionary` are compared by
lexicographic comparison of the sequences resulting from ordering each lexicographic comparison of the sequences resulting from ordering each
`Dictionary`'s pairs in ascending order by key. Examples are written `Dictionary`'s pairs in ascending order by key.
as a `#dict`-prefixed, curly-brace-surrounded sequence of
space-separated key-value pairs, each written with a colon between the
key and value.
**Examples.** `#dict{}`, the empty dictionary; `#dict{a:1}`, the **Examples.** `{}`, the empty dictionary; `{a: 1}`, the dictionary
dictionary mapping the `Symbol` `a` to the `SignedInteger` 1; mapping the `Symbol` `a` to the `SignedInteger` 1; `{[1 2 3]: a}`,
`#dict{[1 2 3]:a}`, mapping `[1 2 3]` to `a`; `#dict{"hi":0 hi:0 mapping `[1 2 3]` to `a`; `{"hi": 0, hi: 0, there: []}`, having a
there:[]}`, having a `String` and two `Symbol` keys, and `String` and two `Symbol` keys, and `SignedInteger` and `Sequence`
`SignedInteger` and `Sequence` values. values.
**Non-examples.** `#dict{a:1 b:2 a:3}`, because it contains duplicate **Non-examples.** `{a:1 b:2 a:3}`, because it contains duplicate
keys; `#dict{[7 8]:[] [7 8]:99}`, for the same reason. keys; `{[7 8]:[] [7 8]:99}`, for the same reason.
## Syntax ## Textual Syntax
Now we have discussed `Value`s and their meanings, we may turn to Now we have discussed `Value`s and their meanings, we may turn to
techniques for *representing* `Value`s for communication or storage. techniques for *representing* `Value`s for communication or storage.
For now, we limit our attention to an easily-parsed, easily-produced In this section, we use [case-sensitive ABNF][abnf] to define a
machine-readable syntax. textual syntax that is easy for people to read and
write.[^json-superset] Most of the examples in this document are
written using this syntax. In the following section, we will define an
equivalent compact machine-readable syntax.
[^json-superset]: The grammar of the textual syntax is a superset of
JSON, with the slightly unusual feature that `true`, `false`, and
`null` are all read as `Symbol`s, and that `SignedInteger`s are
never read as `Double`s.
### Character set
[ABNF][abnf] allows easy definition of US-ASCII-based languages.
However, Preserves is a Unicode-based language. Therefore, we
reinterpret ABNF as a grammar for recognising sequences of Unicode
code points.
Textual syntax for a `Value` *SHOULD* be encoded using UTF-8 where
possible.
### Whitespace
Whitespace is defined as any number of spaces, tabs, carriage returns,
line feeds, comments, or commas. A comment is a semicolon followed by
the Unicode code points up to and including the next carriage return
or line feed.
ws = *(%x20 / %x09 / newline / comment / ",")
newline = CR / LF
comment = ";" *(WSP / nonnl) newline
nonnl = <any Unicode code point except CR or LF>
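The `ws` rule can be sketched as a small scanner. This is an illustrative sketch only, not part of the commit: the function name `skip_ws` is hypothetical, and it follows the grammar strictly in that a comment with no terminating newline is not treated as whitespace.

```python
def skip_ws(text: str, pos: int = 0) -> int:
    """Return the index of the first code point at or after pos that is not ws."""
    n = len(text)
    while pos < n:
        ch = text[pos]
        if ch in " \t\r\n,":
            # Space, tab, CR, LF and comma all count as whitespace.
            pos += 1
        elif ch == ";":
            # A comment runs up to and including the next CR or LF.
            end = pos + 1
            while end < n and text[end] not in "\r\n":
                end += 1
            if end == n:
                # No newline follows: the grammar does not match, so stop here.
                break
            pos = end + 1
        else:
            break
    return pos
```

A reader for `Value` could call such a function before each token, since the grammar allows any `Value` to be preceded by whitespace.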
### Grammar
Standalone documents containing textual representations of `Value`s may have trailing whitespace.
Document = Value ws
Any `Value` may be preceded by whitespace.
Value = ws (Record / Collection / Atom / Compact)
Collection = Sequence / Dictionary / Set
Atom = Boolean / Float / Double / SignedInteger /
String / ByteString / Symbol
Each `Record` is its label-`Value` followed by a parenthesised
grouping of its field-`Value`s.
Record = Value ws "(" *Value ws ")"
`Sequence`s are enclosed in square brackets. `Dictionary` values are
curly-brace-enclosed colon-separated pairs of values. `Set`s are
written either as a simple curly-brace-enclosed non-empty sequence of
values, or as a possibly-empty sequence of values enclosed by the
tokens `#set{` and `}`.
Sequence = "[" *Value ws "]"
Dictionary = "{" *(Value ws ":" Value) ws "}"
Set = %s"#set{" *Value ws "}" / "{" 1*Value ws "}"
Any `Value` may be represented using the
[compact binary syntax](#compact-binary-syntax) by directly prefixing
the binary form of the `Value` with ASCII `SOH` (`%x01`), or by
enclosing a hexadecimal representation of the binary form of the
`Value` in the tokens `#hexvalue{` and `}`.
Compact = %x01 <binary data> / %s"#hexvalue{" *(ws / HEXDIG) ws "}"
`Boolean`s are the simple literal strings `#true` and `#false`.
Boolean = %s"#true" / %s"#false"
Numeric data follow the
[JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with
the addition of a trailing "f" distinguishing `Float` from `Double`
values. `Float`s and `Double`s always have either a fractional part or
an exponent part, where `SignedInteger`s never have either.
TODO: talk about precise reading of floats, and the need for arbitrary
precision. Your language will often have a good floating-point reading
library.
Float = flt %i"f"
Double = flt
SignedInteger = int
digit1-9 = %x31-39
nat = %x30 / ( digit1-9 *DIGIT )
int = ["-"] nat
frac = "." 1*DIGIT
exp = %i"e" ["-"/"+"] 1*DIGIT
flt = int (frac exp / frac / exp)
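As a rough illustration of how the three numeric productions partition tokens, the grammar above can be transcribed into regular expressions. The names `classify`, `INT` and `FLT` are hypothetical, and the sketch only classifies a token; it does not address the precise-reading question raised in the TODO.

```python
import re

# int = ["-"] nat, where nat is "0" or a non-zero digit followed by digits.
INT = r"-?(?:0|[1-9][0-9]*)"
# flt = int (frac exp / frac / exp)
FLT = INT + r"(?:\.[0-9]+(?:[eE][-+]?[0-9]+)?|[eE][-+]?[0-9]+)"

_FLOAT = re.compile(FLT + r"[fF]")   # Float = flt %i"f"
_DOUBLE = re.compile(FLT)            # Double = flt
_INTEGER = re.compile(INT)           # SignedInteger = int


def classify(token: str):
    """Return 'Float', 'Double', 'SignedInteger', or None if no rule matches."""
    if _FLOAT.fullmatch(token):
        return "Float"
    if _DOUBLE.fullmatch(token):
        return "Double"
    if _INTEGER.fullmatch(token):
        return "SignedInteger"
    return None
```

Under this transcription `10f` matches none of the rules, because `flt` requires a fractional part or an exponent.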
`String`s are,
[as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly
escaped text surrounded by double quotes. The escaping rules are the
same as for JSON.[^string-json-correspondence]
TODO: discuss surrogate pairs in \uXXXX form
String = %x22 *char %x22
char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
escape = %x5C ; \
escaped = ( %x5C / ; \ reverse solidus U+005C
%x2F / ; / solidus U+002F
%x62 / ; b backspace U+0008
%x66 / ; f form feed U+000C
%x6E / ; n line feed U+000A
%x72 / ; r carriage return U+000D
%x74 ) ; t tab U+0009
[^string-json-correspondence]: The grammar for `String` has the same
effect as the
[JSON](https://tools.ietf.org/html/rfc8259#section-7) grammar for
`string`. Some auxiliary definitions (e.g. `escaped`) are lifted
largely unmodified from the text of RFC 8259.
A `ByteString` may be written in any of three different forms.
The first is similar to a `String`, but prepended with a hash sign
`#`. In addition, only Unicode code points overlapping with printable
7-bit ASCII are permitted unescaped inside such a `ByteString`; other
byte values must be escaped by prepending a two-digit hexadecimal
value with `\x`.
ByteString = "#" %x22 *binchar %x22
binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
binunescaped = %x20-21 / %x23-5B / %x5D-7E
The second is as a sequence of pairs of hexadecimal digits interleaved
with whitespace and surrounded by `#hex{` and `}`.
ByteString =/ %s"#hex{" *(ws / 2HEXDIG) ws "}"
The third is as a sequence of
[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
with whitespace and surrounded by `#base64{` and `}`. Plain and
URL-safe Base64 characters are allowed.
ByteString =/ %s"#base64{" *(ws / base64char) ws "}"
base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
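The bodies of the `#hex{...}` and `#base64{...}` forms might be decoded along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, comments inside the braces are not handled (only spaces, tabs, newlines and commas are skipped), and `=` is accepted only as trailing padding.

```python
import base64
import re

_WS = re.compile(r"[ \t\r\n,]+")
_HEX = re.compile(r"[0-9A-Fa-f]+")


def decode_hex_body(body: str) -> bytes:
    """Decode the text between '#hex{' and '}'; digits must come in pairs."""
    out = bytearray()
    for chunk in _WS.split(body):
        if not chunk:
            continue
        if len(chunk) % 2 or not _HEX.fullmatch(chunk):
            raise ValueError(f"bad hex chunk: {chunk!r}")
        out += bytes.fromhex(chunk)
    return bytes(out)


def decode_base64_body(body: str) -> bytes:
    """Decode the text between '#base64{' and '}' (plain or URL-safe alphabet)."""
    compact = _WS.sub("", body)
    # Map the URL-safe alphabet onto the plain one, then normalise padding.
    compact = compact.translate(str.maketrans("-_", "+/")).rstrip("=")
    padded = compact + "=" * (-len(compact) % 4)
    return base64.b64decode(padded, validate=True)
```

For example, `decode_hex_body("41 42 43")` and `decode_base64_body("QUJD")` both yield the bytes of `#"ABC"`.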
A `Symbol` may be written in a "bare" form,[^cf-sexp-token] so long as
it conforms to certain restrictions on the characters appearing in the
symbol, or in a quoted form. The quoted form is much the same as the
syntax for `String`s, including embedded escape syntax, except using a
bar or pipe character (`|`) instead of a double quote mark.
Symbol = symstart *symcont / "|" *symchar "|"
symstart = ALPHA / sympunct
symcont = ALPHA / sympunct / DIGIT / "-" / "."
sympunct = "~" / "!" / "@" / "$" / "%" / "^" / "&" / "*" /
"?" / "_" / "=" / "+" / "<" / ">" / "/"
symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
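A printer has to decide whether a `Symbol` can be written bare or needs the quoted `|...|` form. One possible check, transcribed from `symstart` and `symcont` above, is sketched here; the name `is_bare_symbol` is hypothetical.

```python
import re

# sympunct, escaped so it can be embedded in a character class.
_SYMPUNCT = re.escape("~!@$%^&*?_=+<>/")

# symstart = ALPHA / sympunct
# symcont  = ALPHA / sympunct / DIGIT / "-" / "."
_BARE = re.compile(rf"[A-Za-z{_SYMPUNCT}][A-Za-z0-9{_SYMPUNCT}.\-]*")


def is_bare_symbol(text: str) -> bool:
    """True if text matches the bare (unquoted) Symbol production."""
    return _BARE.fullmatch(text) is not None
```

Symbols that fail this check, such as the empty symbol or one containing a space, would be written in the quoted form instead.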
[^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt]
definition of "token representation".
TODO: More unicode in unescaped symbols?
### Printing
Recommend a JSON-compatible print mode. Recommend a submode with trailing commas.
## Compact Binary Syntax
A `Repr` is an encoding, or representation, of a specific `Value`. A `Repr` is an encoding, or representation, of a specific `Value`.
Each `Repr` comprises one or more bytes describing first the kind of Each `Repr` comprises one or more bytes describing first the kind of
@ -373,14 +525,14 @@ be a single `Repr`.
Format B (known length): Format B (known length):
[[ (L F_1...F_m) ]] = header(2,3,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] [[ L(F_1...F_m) ]] = header(2,3,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]
For `m` fields, `m+1` is supplied to `header`, to account for the For `m` fields, `m+1` is supplied to `header`, to account for the
encoding of the record label. encoding of the record label.
Format C (streaming): Format C (streaming):
[[ (L F_1...F_m) ]] = open(2,3) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close(2,3) [[ L(F_1...F_m) ]] = open(2,3) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close(2,3)
Applications *SHOULD* prefer the known-length format for encoding Applications *SHOULD* prefer the known-length format for encoding
`Record`s. `Record`s.
@ -401,12 +553,12 @@ and format C becomes
**Examples.** For example, a protocol may choose to map records **Examples.** For example, a protocol may choose to map records
labelled `void` to `n=0`, making labelled `void` to `n=0`, making
[[(void)]] = header(2,0,0) = [0x80] [[void()]] = header(2,0,0) = [0x80]
or it may map records labelled `person` to short form label number 1, or it may map records labelled `person` to short form label number 1,
making making
[[(person "Dr" "Elizabeth" "Blackwell")]] [[person("Dr", "Elizabeth", "Blackwell")]]
= header(2,1,3) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] = header(2,1,3) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
= [0x93] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]] = [0x93] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
@ -421,20 +573,20 @@ for format C.
Format B (known length): Format B (known length):
[[ [X_1...X_m] ]] = header(3,0,m) ++ [[X_1]] ++...++ [[X_m]] [[ [X_1...X_m] ]] = header(3,0,m) ++ [[X_1]] ++...++ [[X_m]]
[[ #set{X_1...X_m} ]] = header(3,1,m) ++ [[X_1]] ++...++ [[X_m]] [[ #set{X_1...X_m} ]] = header(3,1,m) ++ [[X_1]] ++...++ [[X_m]]
[[ #dict{K_1:V_1...K_m:V_m} ]] = header(3,2,m*2) ++ [[K_1]] ++ [[V_1]] ++... [[ {K_1:V_1...K_m:V_m} ]] = header(3,2,m*2) ++ [[K_1]] ++ [[V_1]] ++...
++ [[K_m]] ++ [[V_m]] ++ [[K_m]] ++ [[V_m]]
Note that `m*2` is given to `header` for a `Dictionary`, since there Note that `m*2` is given to `header` for a `Dictionary`, since there
are two `Value`s in each key-value pair. are two `Value`s in each key-value pair.
Format C (streaming): Format C (streaming):
[[ [X_1...X_m] ]] = open(3,0) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,0) [[ [X_1...X_m] ]] = open(3,0) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,0)
[[ #set{X_1...X_m} ]] = open(3,1) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,1) [[ #set{X_1...X_m} ]] = open(3,1) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,1)
[[ #dict{K_1:V_1...K_m:V_m} ]] = open(3,2) ++ [[K_1]] ++ [[V_1]] ++... [[ {K_1:V_1...K_m:V_m} ]] = open(3,2) ++ [[K_1]] ++ [[V_1]] ++...
++ [[K_m]] ++ [[V_m]] ++ close(3,2) ++ [[K_m]] ++ [[V_m]] ++ close(3,2)
Applications may use whichever format suits their needs on a Applications may use whichever format suits their needs on a
case-by-case basis. case-by-case basis.
@ -528,8 +680,8 @@ specify lengths. Applications *MUST NOT* use format C with
#### Booleans #### Booleans
[[ #f ]] = header(0,0,0) = [0x00] [[ #false ]] = header(0,0,0) = [0x00]
[[ #t ]] = header(0,0,1) = [0x01] [[ #true ]] = header(0,0,1) = [0x01]
#### Floats and Doubles #### Floats and Doubles
@ -550,31 +702,27 @@ short form label number 0 to label `discard`, 1 to `capture`, and 2 to
| Value | Encoded hexadecimal byte sequence | | Value | Encoded hexadecimal byte sequence |
|---------------------------------------------------|----------------------------------------------------------------------| |---------------------------------------------------|----------------------------------------------------------------------|
| `(capture (discard))` | 91 80 | | `capture(discard())` | 91 80 |
| `(observe (speak (discard) (capture (discard))))` | A1 B3 75 73 70 65 61 6B 80 91 80 | | `observe(speak(discard(), capture(discard())))` | A1 B3 75 73 70 65 61 6B 80 91 80 |
| `[1 2 3 4]` (format B) | C4 11 12 13 14 | | `[1 2 3 4]` (format B) | C4 11 12 13 14 |
| `[1 2 3 4]` (format C) | 2C 11 12 13 14 3C | | `[1 2 3 4]` (format C) | 2C 11 12 13 14 3C |
| `[-2 -1 0 1]` | C4 1E 1F 10 11 | | `[-2 -1 0 1]` | C4 1E 1F 10 11 |
| `"hello"` (format B) | 55 68 65 6C 6C 6F | | `"hello"` (format B) | 55 68 65 6C 6C 6F |
| `"hello"` (format C, 2 chunks) | 25 62 68 65 63 6C 6C 6F 35 | | `"hello"` (format C, 2 chunks) | 25 62 68 65 63 6C 6C 6F 35 |
| `"hello"` (format C, 5 chunks) | 25 62 68 65 62 6C 6C 60 60 61 6F 35 | | `"hello"` (format C, 5 chunks) | 25 62 68 65 62 6C 6C 60 60 61 6F 35 |
| `["hello" there #"world" [] #set{} #t #f]` | C7 55 68 65 6C 6C 6F 75 74 68 65 72 65 65 77 6F 72 6C 64 C0 D0 01 00 | | `["hello" there #"world" [] #set{} #true #false]` | C7 55 68 65 6C 6C 6F 75 74 68 65 72 65 65 77 6F 72 6C 64 C0 D0 01 00 |
| `-257` | 42 FE FF | | `-257` | 42 FE FF |
| `-1` | 1F | | `-1` | 1F |
| `0` | 10 | | `0` | 10 |
| `1` | 11 | | `1` | 11 |
| `255` | 42 00 FF | | `255` | 42 00 FF |
| `1f` | 02 3F 80 00 00 | | `1.0f` | 02 3F 80 00 00 |
| `1d` | 03 3F F0 00 00 00 00 00 00 | | `1.0` | 03 3F F0 00 00 00 00 00 00 |
| `-1.202e300d` | 03 FE 3C B7 B7 59 BF 04 26 | | `-1.202e300` | 03 FE 3C B7 B7 59 BF 04 26 |
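The last three rows of the table are consistent with a single lead byte followed by the big-endian IEEE 754 bit pattern. The sketch below assumes exactly that; the lead byte values `0x02` and `0x03` are read off the table rows, and the function names are hypothetical.

```python
import struct

FLOAT_LEAD = 0x02   # assumption: taken from the table row for 1.0f
DOUBLE_LEAD = 0x03  # assumption: taken from the table row for 1.0


def encode_float(x: float) -> bytes:
    """Lead byte plus 4-byte big-endian IEEE 754 single."""
    return bytes([FLOAT_LEAD]) + struct.pack(">f", x)


def encode_double(x: float) -> bytes:
    """Lead byte plus 8-byte big-endian IEEE 754 double."""
    return bytes([DOUBLE_LEAD]) + struct.pack(">d", x)


def hexdump(data: bytes) -> str:
    """Format bytes the way the table does, e.g. '02 3F 80 00 00'."""
    return " ".join(f"{b:02X}" for b in data)
```

Calling `hexdump(encode_float(1.0))` gives `02 3F 80 00 00`, matching the table row for `1.0f`.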
Finally, a larger example, using a non-`Symbol` label for a record.[^extensibility2] The `Record` Finally, a larger example, using a non-`Symbol` label for a record.[^extensibility2] The `Record`
([titled person 2 thing 1] [titled person 2 thing 1](101, "Blackwell", date(1821 2 3), "Dr")
101
"Blackwell"
(date 1821 2 3)
"Dr")
encodes to encodes to
@ -671,16 +819,16 @@ such media types following the general rules for ordering of
| Value | Encoded hexadecimal byte sequence | | Value | Encoded hexadecimal byte sequence |
|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------| |--------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
| `(mime application/octet-stream #"abcde")` | B3 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 | | `mime(application/octet-stream #"abcde")` | B3 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
| `(mime text/plain #"ABC")` | B3 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 | | `mime(text/plain #"ABC")` | B3 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 |
| `(mime application/xml #"<xhtml/>")` | B3 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E | | `mime(application/xml #"<xhtml/>")` | B3 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E |
| `(mime text/csv #"123,234,345")` | B3 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 | | `mime(text/csv #"123,234,345")` | B3 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 |
Applications making heavy use of `mime` records may choose to use a Applications making heavy use of `mime` records may choose to use a
short form label number for the record type. For example, if short short form label number for the record type. For example, if short
form label number 1 were chosen, the second example above, `(mime form label number 1 were chosen, the second example above,
text/plain "ABC")`, would be encoded with "92" in place of "B3 74 6D `mime(text/plain "ABC")`, would be encoded with "92" in place of "B3
69 6D 65". 74 6D 69 6D 65".
### Unicode normalization forms ### Unicode normalization forms
@ -707,13 +855,13 @@ The definition of `SignedInteger` captures all integers. However, in
certain circumstances it can be valuable to assert that a number certain circumstances it can be valuable to assert that a number
inhabits a particular range, such as a fixed-width machine word. inhabits a particular range, such as a fixed-width machine word.
A family of labels `i`*n* and `u`*n* for *n* ∈ {16,32,64} denote A family of labels `i`*n* and `u`*n* for *n* ∈ {8,16,32,64} denote
*n*-bit-wide signed and unsigned range restrictions, respectively. *n*-bit-wide signed and unsigned range restrictions, respectively.
Records with these labels *MUST* have one field, a `SignedInteger`, Records with these labels *MUST* have one field, a `SignedInteger`,
which *MUST* fall within the appropriate range. That is, to be valid, which *MUST* fall within the appropriate range. That is, to be valid,
- in `(i16 `*x*`)`, -32768 <= *x* <= 32767. - in `i8(`*x*`)`, -128 <= *x* <= 127.
- in `(u16 `*x*`)`, 0 <= *x* <= 65535. - in `u8(`*x*`)`, 0 <= *x* <= 255.
- in `(i32 `*x*`)`, -2147483648 <= *x* <= 2147483647. - in `i16(`*x*`)`, -32768 <= *x* <= 32767.
- etc. - etc.
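The range restrictions follow the usual two's-complement bounds, so a validity check can be derived from the label alone. The sketch below uses the post-change label set (*n* ∈ {8,16,32,64}); `range_for` and `is_valid` are hypothetical names.

```python
import re

_LABEL = re.compile(r"([iu])(8|16|32|64)")


def range_for(label: str):
    """Return the inclusive (low, high) bounds for a label such as 'i16'."""
    m = _LABEL.fullmatch(label)
    if m is None:
        raise ValueError(f"not a range-restriction label: {label!r}")
    bits = int(m.group(2))
    if m.group(1) == "i":
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def is_valid(label: str, x: int) -> bool:
    """True if x falls within the range denoted by label."""
    low, high = range_for(label)
    return low <= x <= high
```

For instance, `is_valid("i8", -128)` holds while `is_valid("i8", 128)` does not.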
### Anonymous Tuples and Unit ### Anonymous Tuples and Unit
@ -721,15 +869,15 @@ which *MUST* fall within the appropriate range. That is, to be valid,
A `Tuple` is a `Record` with label `tuple` and zero or more fields, A `Tuple` is a `Record` with label `tuple` and zero or more fields,
denoting an anonymous tuple of values. denoting an anonymous tuple of values.
The 0-ary tuple, `(tuple)`, denotes the empty tuple, sometimes called The 0-ary tuple, `tuple()`, denotes the empty tuple, sometimes called
"unit" or "void" (but *not* e.g. JavaScript's "undefined" value). "unit" or "void" (but *not* e.g. JavaScript's "undefined" value).
### Null and Undefined ### Null and Undefined
Tony Hoare's Tony Hoare's
"[billion-dollar mistake](https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retractions)" "[billion-dollar mistake](https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retractions)"
can be represented with the 0-ary `Record` `(null)`. An "undefined" can be represented with the 0-ary `Record` `null()`. An "undefined"
value can be represented as `(undefined)`. value can be represented as `undefined()`.
### Dates and Times ### Dates and Times
@ -741,6 +889,8 @@ or `date-time` productions of
## Security Considerations ## Security Considerations
TODO: Lots of whitespace is just like lots of empty chunks
**Empty chunks.** Streamed (format C) `String`s, `ByteString`s and **Empty chunks.** Streamed (format C) `String`s, `ByteString`s and
`Symbol`s may include chunks of zero length. This opens up a `Symbol`s may include chunks of zero length. This opens up a
possibility for denial-of-service: an attacker may begin streaming a possibility for denial-of-service: an attacker may begin streaming a
@ -751,9 +901,9 @@ chunks that may appear in a stream, and may even supply an optional
mode that rejects empty chunks entirely. mode that rejects empty chunks entirely.
**Canonical form for cryptographic hashing and signing.** As **Canonical form for cryptographic hashing and signing.** As
specified, the encoding rules for `Value`s do not force canonical specified, neither the textual nor the compact binary encoding rules
serializations for `Set` or `Dictionary` values. Two serializations of for `Value`s force canonical serializations. Two serializations of the
the same `Value` may yield different binary `Repr`s. same `Value` may yield different binary `Repr`s.
## Appendix. Table of lead byte values ## Appendix. Table of lead byte values