WIP from the early hours of this morning, adding textual syntax

This commit is contained in:
Tony Garnock-Jones 2018-09-27 11:42:55 +01:00
parent 906f8a01b6
commit 6fa0dde8f4
1 changed files with 249 additions and 99 deletions

@ -6,12 +6,13 @@
# Preserves: an Expressive Data Language
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
September 2018. Version 0.0.2.
September 2018. Version 0.0.3.
[sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
[spki]: http://world.std.com/~cme/html/spki.html
[varint]: https://developers.google.com/protocol-buffers/docs/encoding#varints
[erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
[abnf]: https://tools.ietf.org/html/rfc7405
This document proposes a data model and serialization format called
*Preserves*.
@ -47,7 +48,8 @@ structures of any particular implementation language.
Taking inspiration from functional programming, we start with a
definition of the *values* that we want to work with and give them
meaning independent of their syntax. We will treat syntax separately,
meaning independent of their syntax. When we write examples of values,
we will do so using the [textual syntax](#textual-syntax) defined
later in this document.
Our `Value`s fall into two broad categories: *atomic* and *compound*
@ -94,8 +96,7 @@ neither is less than the other according to the total order.
### Signed integers.
A `SignedInteger` is a signed integer of arbitrary width.
`SignedInteger`s are compared as mathematical integers. We will write
examples of `SignedInteger`s using standard mathematical notation.
`SignedInteger`s are compared as mathematical integers.
**Examples.** 10; -6; 0.
@ -107,8 +108,7 @@ examples of `SignedInteger`s using standard mathematical notation.
A `String` is a sequence of Unicode
[code-point](http://www.unicode.org/glossary/#code_point)s. `String`s
are compared lexicographically, code-point by
code-point.[^utf8-is-awesome] We will write examples of `String`s as
text surrounded by quotes “`"`”.
code-point.[^utf8-is-awesome]
[^utf8-is-awesome]: Happily, the design of UTF-8 is such that this
gives the same result as a lexicographic byte-by-byte comparison
@ -121,33 +121,27 @@ the string containing the three Unicode code-points `z` (0x7A), `水`
### Binary data.
A `ByteString` is an ordered sequence of zero or more eight-bit bytes.
`ByteString`s are compared lexicographically. We will only write
examples of `ByteString`s that contain bytes denoting printable ASCII
characters, using “`#"`” as an open-quote and “`"`” as a close-quote
mark.
`ByteString`s are compared lexicographically.
**Examples.** The `ByteString` containing the integers 65, 66 and 67
(corresponding to ASCII characters `A`, `B` and `C`) is written as
`#"ABC"`. The empty `ByteString` is written as `#""`. **N.B.** Despite
appearances, these are *binary* data.
**Examples.** `#""`, the empty `ByteString`; `#"ABC"`, the
`ByteString` containing the integers 65, 66 and 67 (corresponding to
ASCII characters `A`, `B` and `C`). **N.B.** Despite appearances,
these are *binary* data.
### Symbols.
Programming languages like Lisp and Prolog frequently use string-like
values called *symbols*. Here, a `Symbol` is, like a `String`, a
sequence of Unicode code-points representing an identifier of some
kind. `Symbol`s are also compared lexicographically by code-point. We
will write examples including only non-empty sequences of
non-whitespace characters, using a monospace font without quotation
marks.
kind. `Symbol`s are also compared lexicographically by code-point.
**Examples.** `hello-world`; `utf8-string`; `exact-integer?`.
### Booleans.
There are exactly two `Boolean` values, “false” and “true”. The
“false” value compares less-than the “true” value. We write `#f` for
“false”, and `#t` for “true”.
“false” value compares less-than the “true” value. We write `#false`
for “false”, and `#true` for “true”.
### IEEE floating-point values.
@ -159,11 +153,11 @@ every `Double`, and every `SignedInteger` is greater than both. Two
`Float`s or two `Double`s are to be ordered by the `totalOrder`
predicate defined in section 5.10 of
[IEEE Std 754-2008](https://dx.doi.org/10.1109/IEEESTD.2008.4610935).
We write examples using standard mathematical notation, avoiding NaN
and infinities, using a suffix `f` or `d` to indicate `Float` or
`Double`, respectively.
We write examples using a fractional part and/or an exponent to
distinguish them from `SignedInteger`s. An additional suffix `f`
distinguishes `Float`s from `Double`s.
**Examples.** 10f; -6d; 0f; 0.5d; -1.202e300d.
**Examples.** 10.0f; -6.0; 0.0f; 0.5; -1.202e300.
**Non-examples.** 10, -6, and 0, because writing them this way
indicates `SignedInteger`s, not `Float`s or `Double`s.
@ -174,9 +168,7 @@ A `Record` is a *labelled* tuple of zero or more `Value`s, called the
record's *fields*. A record's label is itself a `Value`, though it
will usually be a `Symbol`.[^extensibility] [^iri-labels] `Record`s
are compared lexicographically as if they were just tuples; that is,
first by their labels, and then by the remainder of their fields. We
will write examples of `Record`s as a parenthesised, space-separated
sequence of their label `Value` followed by their field `Value`s.
first by their labels, and then by the remainder of their fields.
[^extensibility]: The [Racket](https://racket-lang.org/) programming
language defines
@ -194,17 +186,16 @@ sequence of their label `Value` followed by their field `Value`s.
it cannot be read as an IRI at all, and so the label simply stands
for itself—for its own `Value`.
**Examples.** The `Record` with label `foo` and fields 1, 2 and 3 is
written `(foo 1 2 3)`; the `Record` with label `void` and no fields is
written `(void)`.
**Examples.** `foo(1 2 3)`, a `Record` with label `foo` and fields 1,
2 and 3; `void()`, a `Record` with label `void` and no fields.
**Non-examples.** `()`, because it lacks a label.
**Non-examples.** `()`, because it lacks a label; `void`, because it
lacks even an empty tuple of fields.
### Sequences.
A `Sequence` is a general-purpose, variable-length ordered sequence of
zero or more `Value`s. `Sequence`s are compared lexicographically. We
write examples space-separated, surrounded with square brackets.
zero or more `Value`s. `Sequence`s are compared lexicographically.
**Examples.** `[]`, the empty sequence; `[1 2 3]`, the sequence of
`SignedInteger`s 1, 2 and 3.
@ -215,25 +206,24 @@ A `Set` is an unordered finite set of `Value`s. It contains no
duplicate values, following the [equivalence relation](#equivalence)
induced by the total order on `Value`s. Two `Set`s are compared by
sorting their elements ascending using the [total order](#total-order)
and comparing the resulting `Sequence`s. We write examples
space-separated, surrounded with curly braces, prefixed by `#set`.
and comparing the resulting `Sequence`s.
**Examples.** `#set{}`, the empty set; `#set{#set{}}`, the set
containing only the empty set; `#set{4 "hello" (void) 9.0f}`, the set
containing only the empty set; `{4 "hello" void() 9.0f}`, the set
containing 4, the string `"hello"`, the record with label `void` and
no fields, and the `Float` denoting the number 9.0; `#set{1 1.0f}`,
the set containing a `SignedInteger` and a `Float`; `#set{(mime
application/xml #"<x/>") (mime application/xml #"<x />")}`, a set
containing two different type-labelled byte
arrays.[^mime-xml-difference]
no fields, and the `Float` denoting the number 9.0; `{1 1.0f}`, the
set containing a `SignedInteger` and a `Float`; `{mime(application/xml
#"<x/>") mime(application/xml #"<x />")}`, a set containing two
different `mime` records.[^mime-xml-difference]
[^mime-xml-difference]: The two XML documents `<x/>` and `<x />`
differ by bytewise comparison, and thus yield different record
values, even though under the semantics of XML they denote
identical XML infosets.
**Non-examples.** `#set{1 1 1}`, because it contains multiple
equivalent `Value`s.
**Non-examples.** `{1 1}`, because it contains multiple equivalent
`Value`s; `{}`, because without the `#set` marker, it denotes the
empty dictionary.
### Dictionaries.
@ -241,27 +231,189 @@ A `Dictionary` is an unordered finite collection of pairs of `Value`s.
Each pair comprises a *key* and a *value*. Keys in a `Dictionary` must
be pairwise distinct. Instances of `Dictionary` are compared by
lexicographic comparison of the sequences resulting from ordering each
`Dictionary`'s pairs in ascending order by key. Examples are written
as a `#dict`-prefixed, curly-brace-surrounded sequence of
space-separated key-value pairs, each written with a colon between the
key and value.
`Dictionary`'s pairs in ascending order by key.
**Examples.** `#dict{}`, the empty dictionary; `#dict{a:1}`, the
dictionary mapping the `Symbol` `a` to the `SignedInteger` 1;
`#dict{[1 2 3]:a}`, mapping `[1 2 3]` to `a`; `#dict{"hi":0 hi:0
there:[]}`, having a `String` and two `Symbol` keys, and
`SignedInteger` and `Sequence` values.
**Examples.** `{}`, the empty dictionary; `{a: 1}`, the dictionary
mapping the `Symbol` `a` to the `SignedInteger` 1; `{[1 2 3]: a}`,
mapping `[1 2 3]` to `a`; `{"hi": 0, hi: 0, there: []}`, having a
`String` and two `Symbol` keys, and `SignedInteger` and `Sequence`
values.
**Non-examples.** `#dict{a:1 b:2 a:3}`, because it contains duplicate
keys; `#dict{[7 8]:[] [7 8]:99}`, for the same reason.
**Non-examples.** `{a:1 b:2 a:3}`, because it contains duplicate
keys; `{[7 8]:[] [7 8]:99}`, for the same reason.
## Syntax
## Textual Syntax
Now we have discussed `Value`s and their meanings, we may turn to
techniques for *representing* `Value`s for communication or storage.
For now, we limit our attention to an easily-parsed, easily-produced
machine-readable syntax.
In this section, we use [case-sensitive ABNF][abnf] to define a
textual syntax that is easy for people to read and
write.[^json-superset] Most of the examples in this document are
written using this syntax. In the following section, we will define an
equivalent compact machine-readable syntax.
[^json-superset]: The grammar of the textual syntax is a superset of
JSON, with the slightly unusual feature that `true`, `false`, and
`null` are all read as `Symbol`s, and that `SignedInteger`s are
never read as `Double`s.
### Character set
[ABNF][abnf] allows easy definition of US-ASCII-based languages.
However, Preserves is a Unicode-based language. Therefore, we
reinterpret ABNF as a grammar for recognising sequences of Unicode
code points.
Textual syntax for a `Value` *SHOULD* be encoded using UTF-8 where
possible.
### Whitespace
Whitespace is defined as any number of spaces, tabs, carriage returns,
line feeds, comments, or commas. A comment is a semicolon followed by
the Unicode code points up to and including the next carriage return
or line feed.
ws = *(%x20 / %x09 / newline / comment / ",")
newline = CR / LF
comment = ";" *(WSP / nonnl) newline
nonnl = <any Unicode code point except CR or LF>
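As a rough illustration of the `ws` rule, the following Python sketch skips whitespace, commas and comments in a string. The function name `skip_ws` is hypothetical, and tolerating a comment that runs to the end of input without a newline is an assumption that is slightly more lenient than the grammar.

```python
def skip_ws(text: str, pos: int = 0) -> int:
    """Return the index of the first code point at or after `pos`
    that is not whitespace in the sense of the `ws` rule."""
    while pos < len(text):
        ch = text[pos]
        if ch in " \t\r\n,":
            # Space, tab, CR, LF and comma all count as whitespace.
            pos += 1
        elif ch == ";":
            # A comment runs up to the next CR or LF; the newline itself
            # is consumed by the branch above on the next iteration.
            # Assumption: a comment ending at end-of-input is tolerated.
            while pos < len(text) and text[pos] not in "\r\n":
                pos += 1
        else:
            break
    return pos
```

For instance, `skip_ws("  , ; hi\n  foo")` lands on the `f` of `foo`.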
### Grammar
Standalone documents containing textual representations of `Value`s may have trailing whitespace.
Document = Value ws
Any `Value` may be preceded by whitespace.
Value = ws (Record / Collection / Atom / Compact)
Collection = Sequence / Dictionary / Set
Atom = Boolean / Float / Double / SignedInteger /
String / ByteString / Symbol
Each `Record` is its label-`Value` followed by a parenthesised
grouping of its field-`Value`s.
Record = Value ws "(" *Value ws ")"
`Sequence`s are enclosed in square brackets. `Dictionary` values are
curly-brace-enclosed colon-separated pairs of values. `Set`s are
written either as a simple curly-brace-enclosed non-empty sequence of
values, or as a possibly-empty sequence of values enclosed by the
tokens `#set{` and `}`.
Sequence = "[" *Value ws "]"
Dictionary = "{" *(Value ws ":" Value) ws "}"
Set = %s"#set{" *Value ws "}" / "{" 1*Value ws "}"
Any `Value` may be represented using the
[compact binary syntax](#compact-binary-syntax) by directly prefixing
the binary form of the `Value` with ASCII `SOH` (`%x01`), or by
enclosing a hexadecimal representation of the binary form of the
`Value` in the tokens `#hexvalue{` and `}`.
Compact = %x01 <binary data> / %s"#hexvalue{" *(ws / HEXDIG) ws "}"
`Boolean`s are the simple literal strings `#true` and `#false`.
Boolean = %s"#true" / %s"#false"
Numeric data follow the
[JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with
the addition of a trailing "f" distinguishing `Float` from `Double`
values. `Float`s and `Double`s always have either a fractional part or
an exponent part, where `SignedInteger`s never have either.
TODO: talk about precise reading of floats, and the need for arbitrary
precision. Your language will often have a good floating-point reading
library.
Float = flt %i"f"
Double = flt
SignedInteger = int
digit1-9 = %x31-39
nat = %x30 / ( digit1-9 *DIGIT )
int = ["-"] nat
frac = "." 1*DIGIT
exp = %i"e" ["-"/"+"] 1*DIGIT
flt = int (frac exp / frac / exp)
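The numeric rules can be transcribed into regular expressions fairly directly. The sketch below is one such transcription in Python; `classify_number` is a hypothetical helper name, and the code only classifies a token without converting it to a number (precise reading of floats is a separate concern).

```python
import re
from typing import Optional

# Direct transcriptions of the ABNF rules `int`, `frac`, `exp` and `flt`.
_INT = r"-?(?:0|[1-9][0-9]*)"
_FRAC = r"\.[0-9]+"
_EXP = r"[eE][-+]?[0-9]+"
_FLT = rf"{_INT}(?:{_FRAC}{_EXP}|{_FRAC}|{_EXP})"

_FLOAT = re.compile(rf"{_FLT}[fF]")   # Float = flt %i"f"
_DOUBLE = re.compile(_FLT)            # Double = flt
_SIGNED = re.compile(_INT)            # SignedInteger = int


def classify_number(token: str) -> Optional[str]:
    """Name the numeric kind that `token` matches, or None if it matches none."""
    if _FLOAT.fullmatch(token):
        return "Float"
    if _DOUBLE.fullmatch(token):
        return "Double"
    if _SIGNED.fullmatch(token):
        return "SignedInteger"
    return None
```

Note that `10f` is rejected: a `Float` needs a fractional part or an exponent before the suffix.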
`String`s are,
[as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly
escaped text surrounded by double quotes. The escaping rules are the
same as for JSON.[^string-json-correspondence]
TODO: discuss surrogate pairs in \uXXXX form
String = %x22 *char %x22
char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
escape = %x5C ; \
escaped = ( %x5C / ; \ reverse solidus U+005C
%x2F / ; / solidus U+002F
%x62 / ; b backspace U+0008
%x66 / ; f form feed U+000C
%x6E / ; n line feed U+000A
%x72 / ; r carriage return U+000D
%x74 ) ; t tab U+0009
[^string-json-correspondence]: The grammar for `String` has the same
effect as the
[JSON](https://tools.ietf.org/html/rfc8259#section-7) grammar for
`string`. Some auxiliary definitions (e.g. `escaped`) are lifted
largely unmodified from the text of RFC 8259.
A `ByteString` may be written in any of three different forms.
The first is similar to a `String`, but prepended with a hash sign
`#`. In addition, only Unicode code points overlapping with printable
7-bit ASCII are permitted unescaped inside such a `ByteString`; other
byte values must be escaped by prepending a two-digit hexadecimal
value with `\x`.
ByteString = "#" %x22 *binchar %x22
binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
binunescaped = %x20-21 / %x23-5B / %x5D-7E
The second is as a sequence of pairs of hexadecimal digits interleaved
with whitespace and surrounded by `#hex{` and `}`.
ByteString =/ %s"#hex{" *(ws / 2HEXDIG) ws "}"
The third is as a sequence of
[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
with whitespace and surrounded by `#base64{` and `}`. Plain and
URL-safe Base64 characters are allowed.
ByteString =/ %s"#base64{" *(ws / base64char) ws "}"
base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
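To make the second and third `ByteString` forms concrete, here is a small Python sketch that decodes the text found between the braces of `#hex{...}` and `#base64{...}`. The function names are hypothetical. As simplifying assumptions, comments inside the braces are not handled, and missing Base64 padding is tolerated.

```python
import base64
import re

# Spaces, tabs, CR, LF and commas; comments are not handled in this sketch.
_WS = re.compile(r"[ \t\r\n,]+")


def decode_hex_body(body: str) -> bytes:
    """Decode the text between `#hex{` and `}`.

    bytes.fromhex skips ASCII whitespace between byte pairs but rejects
    whitespace inside a pair, which matches the `2HEXDIG` rule.
    """
    return bytes.fromhex(body.replace(",", " "))


def decode_base64_body(body: str) -> bytes:
    """Decode the text between `#base64{` and `}`.

    Both the plain and the URL-safe alphabets are accepted.
    """
    compact = _WS.sub("", body).rstrip("=")
    # Map the URL-safe alphabet onto the plain one.
    compact = compact.translate(str.maketrans("-_", "+/"))
    # Assumption: padding is optional on input, so restore it here.
    padded = compact + "=" * (-len(compact) % 4)
    return base64.b64decode(padded, validate=True)
```

Under these assumptions, `#hex{41 42 43}` and `#base64{QUJD}` both denote the same `ByteString` as `#"ABC"`.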
A `Symbol` may be written in a "bare" form,[^cf-sexp-token] so long as
it conforms to certain restrictions on the characters appearing in the
symbol, or in a quoted form. The quoted form is much the same as the
syntax for `String`s, including embedded escape syntax, except using a
bar or pipe character (`|`) instead of a double quote mark.
Symbol = symstart *symcont / "|" *symchar "|"
symstart = ALPHA / sympunct
symcont = ALPHA / sympunct / DIGIT / "-" / "."
sympunct = "~" / "!" / "@" / "$" / "%" / "^" / "&" / "*" /
"?" / "_" / "=" / "+" / "<" / ">" / "/"
symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
[^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt]
definition of "token representation".
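The bare form of `Symbol` can be checked with a single regular expression built from `symstart` and `symcont`. This Python sketch covers only the bare form, not the `|`-quoted form; `is_bare_symbol` is a hypothetical name.

```python
import re

# sympunct, as listed in the grammar above.
_SYMPUNCT = r"~!@$%^&*?_=+<>/"

# symstart = ALPHA / sympunct
# symcont  = ALPHA / sympunct / DIGIT / "-" / "."
_BARE_SYMBOL = re.compile(
    rf"[A-Za-z{re.escape(_SYMPUNCT)}][A-Za-z0-9{re.escape(_SYMPUNCT)}.\-]*"
)


def is_bare_symbol(token: str) -> bool:
    """True if `token` can be written as a Symbol without `|` quoting."""
    return _BARE_SYMBOL.fullmatch(token) is not None
```

Tokens such as `hello-world`, `exact-integer?` and `application/xml` pass, while a token starting with a digit or a hyphen does not.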
TODO: More unicode in unescaped symbols?
### Printing
Recommend a JSON-compatible print mode. Recommend a submode with trailing commas.
## Compact Binary Syntax
A `Repr` is an encoding, or representation, of a specific `Value`.
Each `Repr` comprises one or more bytes describing first the kind of
@ -373,14 +525,14 @@ be a single `Repr`.
Format B (known length):
[[ (L F_1...F_m) ]] = header(2,3,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]
[[ L(F_1...F_m) ]] = header(2,3,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]
For `m` fields, `m+1` is supplied to `header`, to account for the
encoding of the record label.
Format C (streaming):
[[ (L F_1...F_m) ]] = open(2,3) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close(2,3)
[[ L(F_1...F_m) ]] = open(2,3) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close(2,3)
Applications *SHOULD* prefer the known-length format for encoding
`Record`s.
@ -401,12 +553,12 @@ and format C becomes
**Examples.** For example, a protocol may choose to map records
labelled `void` to `n=0`, making
[[(void)]] = header(2,0,0) = [0x80]
[[void()]] = header(2,0,0) = [0x80]
or it may map records labelled `person` to short form label number 1,
making
[[(person "Dr" "Elizabeth" "Blackwell")]]
[[person("Dr", "Elizabeth", "Blackwell")]]
= header(2,1,3) ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
= [0x93] ++ [["Dr"]] ++ [["Elizabeth"]] ++ [["Blackwell"]]
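The lead bytes in these examples can be reproduced with a short Python sketch. The definition of `header` lies outside this part of the document, so the bit layout below is an assumption inferred from the worked examples: two bits for the first argument, two for the second, and four for the length, with the length value 15 signalling that a [varint][] follows.

```python
def varint(n: int) -> bytes:
    """Base-128 varint, least significant group first."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more groups follow
        else:
            out.append(low7)
            return bytes(out)


def header(t: int, n: int, m: int) -> bytes:
    """Assumed layout: one byte `ttnnmmmm`, with m >= 15 spilling into a varint."""
    if not (0 <= t <= 3 and 0 <= n <= 3 and m >= 0):
        raise ValueError("header arguments out of range")
    if m < 15:
        return bytes([(t << 6) | (n << 4) | m])
    return bytes([(t << 6) | (n << 4) | 0x0F]) + varint(m)
```

With this layout, `header(2,0,0)` is `0x80` and `header(2,1,3)` is `0x93`, agreeing with the two examples above.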
@ -421,20 +573,20 @@ for format C.
Format B (known length):
[[ [X_1...X_m] ]] = header(3,0,m) ++ [[X_1]] ++...++ [[X_m]]
[[ #set{X_1...X_m} ]] = header(3,1,m) ++ [[X_1]] ++...++ [[X_m]]
[[ #dict{K_1:V_1...K_m:V_m} ]] = header(3,2,m*2) ++ [[K_1]] ++ [[V_1]] ++...
++ [[K_m]] ++ [[V_m]]
[[ [X_1...X_m] ]] = header(3,0,m) ++ [[X_1]] ++...++ [[X_m]]
[[ #set{X_1...X_m} ]] = header(3,1,m) ++ [[X_1]] ++...++ [[X_m]]
[[ {K_1:V_1...K_m:V_m} ]] = header(3,2,m*2) ++ [[K_1]] ++ [[V_1]] ++...
++ [[K_m]] ++ [[V_m]]
Note that `m*2` is given to `header` for a `Dictionary`, since there
are two `Value`s in each key-value pair.
Format C (streaming):
[[ [X_1...X_m] ]] = open(3,0) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,0)
[[ #set{X_1...X_m} ]] = open(3,1) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,1)
[[ #dict{K_1:V_1...K_m:V_m} ]] = open(3,2) ++ [[K_1]] ++ [[V_1]] ++...
++ [[K_m]] ++ [[V_m]] ++ close(3,2)
[[ [X_1...X_m] ]] = open(3,0) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,0)
[[ #set{X_1...X_m} ]] = open(3,1) ++ [[X_1]] ++...++ [[X_m]] ++ close(3,1)
[[ {K_1:V_1...K_m:V_m} ]] = open(3,2) ++ [[K_1]] ++ [[V_1]] ++...
++ [[K_m]] ++ [[V_m]] ++ close(3,2)
Applications may use whichever format suits their needs on a
case-by-case basis.
@ -528,8 +680,8 @@ specify lengths. Applications *MUST NOT* use format C with
#### Booleans
[[ #f ]] = header(0,0,0) = [0x00]
[[ #t ]] = header(0,0,1) = [0x01]
[[ #false ]] = header(0,0,0) = [0x00]
[[ #true ]] = header(0,0,1) = [0x01]
#### Floats and Doubles
@ -550,31 +702,27 @@ short form label number 0 to label `discard`, 1 to `capture`, and 2 to
| Value | Encoded hexadecimal byte sequence |
|---------------------------------------------------|----------------------------------------------------------------------|
| `(capture (discard))` | 91 80 |
| `(observe (speak (discard) (capture (discard))))` | A1 B3 75 73 70 65 61 6B 80 91 80 |
| `capture(discard())` | 91 80 |
| `observe(speak(discard(), capture(discard())))` | A1 B3 75 73 70 65 61 6B 80 91 80 |
| `[1 2 3 4]` (format B) | C4 11 12 13 14 |
| `[1 2 3 4]` (format C) | 2C 11 12 13 14 3C |
| `[-2 -1 0 1]` | C4 1E 1F 10 11 |
| `"hello"` (format B) | 55 68 65 6C 6C 6F |
| `"hello"` (format C, 2 chunks) | 25 62 68 65 63 6C 6C 6F 35 |
| `"hello"` (format C, 5 chunks) | 25 62 68 65 62 6C 6C 60 60 61 6F 35 |
| `["hello" there #"world" [] #set{} #t #f]` | C7 55 68 65 6C 6C 6F 75 74 68 65 72 65 65 77 6F 72 6C 64 C0 D0 01 00 |
| `["hello" there #"world" [] #set{} #true #false]` | C7 55 68 65 6C 6C 6F 75 74 68 65 72 65 65 77 6F 72 6C 64 C0 D0 01 00 |
| `-257` | 42 FE FF |
| `-1` | 1F |
| `0` | 10 |
| `1` | 11 |
| `255` | 42 00 FF |
| `1f` | 02 3F 80 00 00 |
| `1d` | 03 3F F0 00 00 00 00 00 00 |
| `-1.202e300d` | 03 FE 3C B7 B7 59 BF 04 26 |
| `1.0f` | 02 3F 80 00 00 |
| `1.0` | 03 3F F0 00 00 00 00 00 00 |
| `-1.202e300` | 03 FE 3C B7 B7 59 BF 04 26 |
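The `Float` and `Double` rows of the table can be checked with Python's `struct` module. This sketch assumes, going by the table rows, that a `Float` is the lead byte `0x02` followed by the big-endian IEEE 754 single-precision bytes, and a `Double` is `0x03` followed by the big-endian double-precision bytes. The function names are hypothetical.

```python
import struct


def encode_float(x: float) -> bytes:
    """Lead byte 0x02, then 4 bytes of big-endian IEEE 754 binary32."""
    return b"\x02" + struct.pack(">f", x)


def encode_double(x: float) -> bytes:
    """Lead byte 0x03, then 8 bytes of big-endian IEEE 754 binary64."""
    return b"\x03" + struct.pack(">d", x)
```

Calling `encode_float(1.0).hex()` and `encode_double(1.0).hex()` gives the byte sequences shown for `1.0f` and `1.0`.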
Finally, a larger example, using a non-`Symbol` label for a record.[^extensibility2] The `Record`
([titled person 2 thing 1]
101
"Blackwell"
(date 1821 2 3)
"Dr")
[titled person 2 thing 1](101, "Blackwell", date(1821 2 3), "Dr")
encodes to
@ -671,16 +819,16 @@ such media types following the general rules for ordering of
| Value | Encoded hexadecimal byte sequence |
|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
| `(mime application/octet-stream #"abcde")` | B3 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
| `(mime text/plain #"ABC")` | B3 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 |
| `(mime application/xml #"<xhtml/>")` | B3 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E |
| `(mime text/csv #"123,234,345")` | B3 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 |
| `mime(application/octet-stream #"abcde")` | B3 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
| `mime(text/plain #"ABC")` | B3 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 |
| `mime(application/xml #"<xhtml/>")` | B3 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E |
| `mime(text/csv #"123,234,345")` | B3 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 |
Applications making heavy use of `mime` records may choose to use a
short form label number for the record type. For example, if short
form label number 1 were chosen, the second example above, `(mime
text/plain "ABC")`, would be encoded with "92" in place of "B3 74 6D
69 6D 65".
form label number 1 were chosen, the second example above,
`mime(text/plain #"ABC")`, would be encoded with "92" in place of "B3
74 6D 69 6D 65".
### Unicode normalization forms
@ -707,13 +855,13 @@ The definition of `SignedInteger` captures all integers. However, in
certain circumstances it can be valuable to assert that a number
inhabits a particular range, such as a fixed-width machine word.
A family of labels `i`*n* and `u`*n* for *n* ∈ {16,32,64} denote
A family of labels `i`*n* and `u`*n* for *n* ∈ {8,16,32,64} denote
*n*-bit-wide signed and unsigned range restrictions, respectively.
Records with these labels *MUST* have one field, a `SignedInteger`,
which *MUST* fall within the appropriate range. That is, to be valid,
- in `(i16 `*x*`)`, -32768 <= *x* <= 32767.
- in `(u16 `*x*`)`, 0 <= *x* <= 65535.
- in `(i32 `*x*`)`, -2147483648 <= *x* <= 2147483647.
- in `i8(`*x*`)`, -128 <= *x* <= 127.
- in `u8(`*x*`)`, 0 <= *x* <= 255.
- in `i16(`*x*`)`, -32768 <= *x* <= 32767.
- etc.
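The range restrictions follow the usual two's-complement bounds, which the following Python sketch computes from the label. The helper name `in_range` is hypothetical, and it takes the label as a plain string rather than as a `Symbol` value.

```python
def in_range(label: str, x: int) -> bool:
    """Check whether x fits the range denoted by a label such as 'i16' or 'u8'."""
    kind, bits = label[0], int(label[1:])
    if kind not in ("i", "u") or bits not in (8, 16, 32, 64):
        raise ValueError(f"unsupported label: {label}")
    if kind == "i":
        # Signed n-bit range: -2^(n-1) <= x <= 2^(n-1) - 1
        return -(1 << (bits - 1)) <= x <= (1 << (bits - 1)) - 1
    # Unsigned n-bit range: 0 <= x <= 2^n - 1
    return 0 <= x <= (1 << bits) - 1
```

For example, `in_range("i8", -128)` holds while `in_range("i8", 128)` does not.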
### Anonymous Tuples and Unit
@ -721,15 +869,15 @@ which *MUST* fall within the appropriate range. That is, to be valid,
A `Tuple` is a `Record` with label `tuple` and zero or more fields,
denoting an anonymous tuple of values.
The 0-ary tuple, `(tuple)`, denotes the empty tuple, sometimes called
The 0-ary tuple, `tuple()`, denotes the empty tuple, sometimes called
"unit" or "void" (but *not* e.g. JavaScript's "undefined" value).
### Null and Undefined
Tony Hoare's
"[billion-dollar mistake](https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retractions)"
can be represented with the 0-ary `Record` `(null)`. An "undefined"
value can be represented as `(undefined)`.
can be represented with the 0-ary `Record` `null()`. An "undefined"
value can be represented as `undefined()`.
### Dates and Times
@ -741,6 +889,8 @@ or `date-time` productions of
## Security Considerations
TODO: Lots of whitespace is just like lots of empty chunks
**Empty chunks.** Streamed (format C) `String`s, `ByteString`s and
`Symbol`s may include chunks of zero length. This opens up a
possibility for denial-of-service: an attacker may begin streaming a
@ -751,9 +901,9 @@ chunks that may appear in a stream, and may even supply an optional
mode that rejects empty chunks entirely.
**Canonical form for cryptographic hashing and signing.** As
specified, the encoding rules for `Value`s do not force canonical
serializations for `Set` or `Dictionary` values. Two serializations of
the same `Value` may yield different binary `Repr`s.
specified, neither the textual nor the compact binary encoding rules
for `Value`s force canonical serializations. Two serializations of the
same `Value` may yield different binary `Repr`s.
## Appendix. Table of lead byte values