MUCH simpler binary format, inspired by Syrup; alterations to text format

Tony Garnock-Jones 2020-12-28 23:25:02 +01:00
parent ccf4f97ed8
commit 5d719c2c6f
6 changed files with 399 additions and 596 deletions

NOTICE

@ -1,2 +1,2 @@
Preserves: an Expressive Data Language Preserves: an Expressive Data Language
Copyright 2018-2019 Tony Garnock-Jones Copyright 2018-2020 Tony Garnock-Jones


@ -38,7 +38,7 @@ For that, see the [Preserves specification](preserves.html).
If you're familiar with JSON, Preserves looks fairly similar: If you're familiar with JSON, Preserves looks fairly similar:
``` javascript ```
{"name": "Missy Rose", {"name": "Missy Rose",
"species": "Felis Catus", "species": "Felis Catus",
"age": 13, "age": 13,
@ -49,35 +49,35 @@ Preserves also has something we can use for debugging/development
information called "annotations"; they aren't actually read in as data information called "annotations"; they aren't actually read in as data
but we can use them for comments. but we can use them for comments.
(They can also be used for other development tools and are not (They can also be used for other development tools and are not
restricted to strings; more on this later, but for now interpret them restricted to strings; more on this later, but for now, we will stick
as comments.) to the special comment annotation syntax.)
``` javascript ```
@"I'm an annotation... basically a comment. Ignore me!" ;I'm an annotation... basically a comment. Ignore me!
"I'm data! Don't ignore me!" "I'm data! Don't ignore me!"
``` ```
Preserves supports some data types you're probably already familiar Preserves supports some data types you're probably already familiar
with from JSON, and which look fairly similar in the textual format: with from JSON, and which look fairly similar in the textual format:
``` javascript ```
@"booleans" ;booleans
#true #t
#false #f
@"various kinds of numbers:" ;various kinds of numbers:
42 42
123556789012345678901234567890 123556789012345678901234567890
-10 -10
13.5 13.5
@"strings" ;strings
"I'm feeling stringy!" "I'm feeling stringy!"
@"sequences (lists)" ;sequences (lists)
["cat", "dog", "mouse", "goldfish"] ["cat", "dog", "mouse", "goldfish"]
@"dictionaries (hashmaps)" ;dictionaries (hashmaps)
{"cat": "meow", {"cat": "meow",
"dog": "woof", "dog": "woof",
"goldfish": "glub glub", "goldfish": "glub glub",
@ -90,16 +90,16 @@ with from JSON, and which look fairly similar in the textual format:
## Going beyond JSON ## Going beyond JSON
We can observe a few differences from JSON already; it's possible to We can observe a few differences from JSON already; it's possible to
express numbers of arbitrary length in Preserves, and booleans look a little *reliably* express integers of arbitrary length in Preserves, and booleans look a little
bit different. bit different.
A few more interesting differences: A few more interesting differences:
``` javascript ```
@"Preserves treats commas as whitespace, so these are the same" ;Preserves treats commas as whitespace, so these are the same
["cat", "dog", "mouse", "goldfish"] ["cat", "dog", "mouse", "goldfish"]
["cat" "dog" "mouse" "goldfish"] ["cat" "dog" "mouse" "goldfish"]
@"We can use anything as keys in dictionaries, not just strings" ;We can use anything as keys in dictionaries, not just strings
{1: "the loneliest number", {1: "the loneliest number",
["why", "was", 6, "afraid", "of", 7]: "because 7 8 9", ["why", "was", 6, "afraid", "of", 7]: "because 7 8 9",
{"dictionaries": "as keys???"}: "well, why not?"} {"dictionaries": "as keys???"}: "well, why not?"}
@ -107,17 +107,17 @@ A few more interesting differences:
Preserves technically provides a few types of numbers: Preserves technically provides a few types of numbers:
``` javascript ```
@"Signed Integers" ;Signed Integers
42 42
-42 -42
5907212309572059846509324862304968273468909473609826340 5907212309572059846509324862304968273468909473609826340
-5907212309572059846509324862304968273468909473609826340 -5907212309572059846509324862304968273468909473609826340
@"Floats (Single-precision IEEE floats) (notice the trailing f)" ;Floats (Single-precision IEEE floats) (notice the trailing f)
3.1415927f 3.1415927f
@"Doubles (Double-precision IEEE floats)" ;Doubles (Double-precision IEEE floats)
3.141592653589793 3.141592653589793
``` ```
@ -129,33 +129,33 @@ Often they're meant to be used for something that has symbolic importance
to the program, but not textual importance (other than to guide the to the program, but not textual importance (other than to guide the
programmer… not unlike variable names). programmer… not unlike variable names).
``` javascript ```
@"A symbol (NOT a string!)" ;A symbol (NOT a string!)
JustASymbol JustASymbol
@"You can do mixedCase or CamelCase too of course, pick your poison" ;You can do mixedCase or CamelCase too of course, pick your poison
@"(but be consistent, for the sake of your collaborators!" ;(but be consistent, for the sake of your collaborators!)
iAmASymbol iAmASymbol
i-am-a-symbol i-am-a-symbol
@"A list of symbols" ;A list of symbols
[GET, PUT, POST, DELETE] [GET, PUT, POST, DELETE]
@"A symbol with spaces in it" ;A symbol with spaces in it
|this is just one symbol believe it or not| |this is just one symbol believe it or not|
``` ```
We can also add binary data, aka ByteStrings: We can also add binary data, aka ByteStrings:
``` javascript ```
@"Some binary data, base64 encoded" ;Some binary data, base64 encoded
#base64{cGljdHVyZSBvZiBhIGNhdA==} #[cGljdHVyZSBvZiBhIGNhdA==]
@"Some other binary data, hexadecimal encoded" ;Some other binary data, hexadecimal encoded
#hex{616263} #x"616263"
@"Same binary data as above, base64 encoded" ;Same binary data as above, base64 encoded
#base64{YWJj} #[YWJj]
``` ```
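As a quick cross-check of the ByteString examples above, here is a minimal Python sketch using only the standard library. The variable names are illustrative and not part of Preserves; the payloads are copied from the examples.

```python
import base64

# Payloads copied from the ByteString examples above.
hex_form = bytes.fromhex("616263")   # body of #x"616263"
b64_form = base64.b64decode("YWJj")  # body of #[YWJj]
cat = base64.b64decode("cGljdHVyZSBvZiBhIGNhdA==")

# Both notations denote the same three bytes.
print(hex_form == b64_form, cat)
```

The hexadecimal and Base64 notations are two spellings of the same underlying bytes, so a reader can choose whichever is more convenient for the data at hand.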
What's neat about this is that we don't have to "pay the cost" of What's neat about this is that we don't have to "pay the cost" of
@ -165,48 +165,41 @@ the length of the binary data is the length of the binary data.
Conveniently, Preserves also includes Sets, which are collections of Conveniently, Preserves also includes Sets, which are collections of
unique elements where ordering of items is unimportant. unique elements where ordering of items is unimportant.
``` javascript ```
#set{flour, salt, water} #{flour, salt, water}
``` ```
<a id="orgefafe56"></a> <a id="orgefafe56"></a>
## Total ordering and canonicalization ## Canonicalization
This is a good time to mention that even though from a semantic This is a good time to mention that even though from a semantic
perspective sets and dictionaries do not carry information about the perspective sets and dictionaries do not carry information about the
ordering of their elements (and Preserves doesn't care what order we ordering of their elements (and Preserves doesn't care what order we
enter them in for our hand-written-as-text Preserves documents), enter them in for our hand-written-as-text Preserves documents),
Preserves has a well-defined "total ordering". [Preserves provides support for canonical ordering](canonical-binary.html)
when serializing.
Based on this total ordering, Preserves provides support for canonical In canonicalizing output mode, Preserves will always write out a given
ordering when serializing; in this mode, Preserves will always write value using exactly the same bytes, every time. This is important and
out the elements in the same order, every time. useful for many contexts, but especially for cryptographic signatures
When combined with binary serialization, this is Preserves' "canonical and hashing.
form".
This is important and useful for many contexts, but especially for
cryptographic signatures and hashing.
``` javascript ```
@"This hand-typed Preserves document..." ;This hand-typed Preserves document...
{monkey: {"noise": "ooh-ooh", {monkey: {"noise": "ooh-ooh",
"eats": #set{"bananas", "berries"}} "eats": #{"bananas", "berries"}}
cat: {"noise": "meow", cat: {"noise": "meow",
"eats": #set{"kibble", "cat treats", "tinned meat"}}} "eats": #{"kibble", "cat treats", "tinned meat"}}}
@"Will always, always be written out in this order when canonicalized:" ;Will always, always be written out in this order (except in
{cat: {"eats": #set{"cat treats", "kibble", "tinned meat"}, ;binary, of course) when canonicalized:
{cat: {"eats": #{"cat treats", "kibble", "tinned meat"},
"noise": "meow"} "noise": "meow"}
monkey: {"eats": #set{"bananas", "berries"}, monkey: {"eats": #{"bananas", "berries"},
"noise": "ooh-ooh"}} "noise": "ooh-ooh"}}
``` ```
Clever implementations can get canonicalized output for free by
carefully ordering set elements and dictionary entries at construction
time, but even in simple implementations, canonical serialization is
almost as cheap as normal serialization.
<a id="org0366627"></a> <a id="org0366627"></a>
## Defining our own types using Records ## Defining our own types using Records
@ -216,7 +209,7 @@ sense, it's a meta-type.
`Record` objects have a label and a series of arguments (or "fields"). `Record` objects have a label and a series of arguments (or "fields").
For example, we can make a `Date` record: For example, we can make a `Date` record:
``` javascript ```
<Date 2019 8 15> <Date 2019 8 15>
``` ```
@ -228,7 +221,7 @@ We could instead just decide to encode our date data in a string,
like "2019-08-15". like "2019-08-15".
A document using such a date structure might look like so: A document using such a date structure might look like so:
``` javascript ```
{"name": "Gregor Samsa", {"name": "Gregor Samsa",
"description": "humanoid trapped in an insect body", "description": "humanoid trapped in an insect body",
"born": "1915-10-04"} "born": "1915-10-04"}
@ -243,13 +236,13 @@ know the date exactly.
This causes a problem. This causes a problem.
Now we might have two kinds of entries: Now we might have two kinds of entries:
``` javascript ```
@"Exact date known" ;Exact date known
{"name": "Gregor Samsa", {"name": "Gregor Samsa",
"description": "humanoid trapped in an insect body", "description": "humanoid trapped in an insect body",
"born": "1915-10-04"} "born": "1915-10-04"}
@"Not sure about exact date..." ;Not sure about exact date...
{"name": "Gregor Samsa", {"name": "Gregor Samsa",
"description": "humanoid trapped in an insect body", "description": "humanoid trapped in an insect body",
"born": "Sometime in October 1915? Or was that when he became an insect?"} "born": "Sometime in October 1915? Or was that when he became an insect?"}
@ -261,13 +254,13 @@ like a date", but doing this kind of thing is prone to errors and weird
edge cases. edge cases.
No, it's better to be able to have a separate type: No, it's better to be able to have a separate type:
``` javascript ```
@"Exact date known" ;Exact date known
{"name": "Gregor Samsa", {"name": "Gregor Samsa",
"description": "humanoid trapped in an insect body", "description": "humanoid trapped in an insect body",
"born": <Date 1915 10 04>} "born": <Date 1915 10 04>}
@"Not sure about exact date..." ;Not sure about exact date...
{"name": "Gregor Samsa", {"name": "Gregor Samsa",
"description": "humanoid trapped in an insect body", "description": "humanoid trapped in an insect body",
"born": <Unknown "Sometime in October 1915? Or was that when he became an insect?">} "born": <Unknown "Sometime in October 1915? Or was that when he became an insect?">}
@ -285,7 +278,7 @@ the meaning the label signifies for it to be of use.
Still, there are plenty of interesting labels we can define. Still, there are plenty of interesting labels we can define.
Here is one for an "iri", a hyperlink: Here is one for an "iri", a hyperlink:
``` javascript ```
<iri "https://dustycloud.org/blog/"> <iri "https://dustycloud.org/blog/">
``` ```
@ -294,11 +287,11 @@ Records are usually symbols but aren't necessarily so.
They can also be strings or numbers or even dictionaries. They can also be strings or numbers or even dictionaries.
And very interestingly, they can also be other records: And very interestingly, they can also be other records:
``` javascript ```
<<iri "https://www.w3.org/ns/activitystreams#Note"> < <iri "https://www.w3.org/ns/activitystreams#Note">
{"to": [<iri "https://chatty.example/ben/">], {"to": [<iri "https://chatty.example/ben/">],
"attributedTo": <iri "https://social.example/alyssa/">, "attributedTo": <iri "https://social.example/alyssa/">,
"content": "Say, did you finish reading that book I lent you?"}> "content": "Say, did you finish reading that book I lent you?"} >
``` ```
Do you see it? This Record's label is&#x2026; an `iri` Record! Do you see it? This Record's label is&#x2026; an `iri` Record!
@ -327,16 +320,18 @@ Annotations are not strictly a necessary feature, but they are useful
in some circumstances. in some circumstances.
We have previously shown them used as comments: We have previously shown them used as comments:
``` javascript ```
@"I'm a comment!" ;I'm a comment!
"I am not a comment, I am data!" "I am not a comment, I am data!"
``` ```
Annotations annotate the values they precede. Annotations annotate the values they precede.
It is possible to have multiple annotations on a value. It is possible to have multiple annotations on a value.
The `;`-based comment syntax is syntactic sugar for the general
`@`-prefixed string annotation syntax.
``` javascript ```
@"I am annotating this number" ;I am annotating this number
@"And so am I!" @"And so am I!"
42 42
``` ```
@ -349,7 +344,7 @@ Many implementations will, in the same mode, also supply line number
and column information attached to each read value. and column information attached to each read value.
So what's the point of them then? So what's the point of them then?
If annotations were just for comments, there would be indeed hardly If annotations were just for comments, there would be indeed hardly any
point at all&#x2026; it would be simpler to just provide a comment syntax. point at all&#x2026; it would be simpler to just provide a comment syntax.
However, annotations can be used for more than just comments. However, annotations can be used for more than just comments.
@ -360,13 +355,17 @@ For instance, here's a reply from an HTTP API service running in
"debug" mode annotated with the time it took to produce the reply and "debug" mode annotated with the time it took to produce the reply and
the internal name of the server that produced the response: the internal name of the server that produced the response:
``` javascript ```
@<ResponseTime <Milliseconds 64.4>> @<ResponseTime <Milliseconds 64.4>>
@<BackendServer "humpty-dumpty.example.com"> @<BackendServer "humpty-dumpty.example.com">
<Success <Success
<Employees [ <Employees [
<Employee "Alyssa P. Hacker" #set{<Role Programmer>, <Role Manager>}, <Date 2018, 1, 24>> <Employee "Alyssa P. Hacker"
<Employee "Ben Bitdiddle" #set{<Role Programmer>}, <Date 2019, 2, 13>> ]>> #{<Role Programmer>, <Role Manager>}
<Date 2018, 1, 24>>
<Employee "Ben Bitdiddle"
#{<Role Programmer>}
<Date 2019, 2, 13>> ]>>
``` ```
The annotations aren't related to the data requested, which is all The annotations aren't related to the data requested, which is all


@ -20,22 +20,17 @@ are equal.
This document specifies canonical form for the Preserves compact This document specifies canonical form for the Preserves compact
binary syntax. binary syntax.
**General rules.** **Annotations.**
Streaming formats ("format C") *MUST NOT* be used.
Annotations *MUST NOT* be present. Annotations *MUST NOT* be present.
Whenever there is a choice between fixed-length ("format A") or
variable-length ("format B") formats, the fixed-length format *MUST* be
used.
**Sets.** **Sets.**
The elements of a `Set` *MUST* be serialized sorted in ascending order The elements of a `Set` *MUST* be serialized sorted in ascending order
following the total order relation defined in the by comparing their canonical encoded binary representations.
[Preserves specification][spec].
**Dictionaries.** **Dictionaries.**
The key-value pairs in a `Dictionary` *MUST* be serialized sorted in The key-value pairs in a `Dictionary` *MUST* be serialized sorted in
ascending order by key, following the total order relation defined in ascending order by comparing the canonical encoded binary
the [Preserves specification][spec].[^no-need-for-by-value] representations of their keys.[^no-need-for-by-value]
[^no-need-for-by-value]: There is no need to order by (key, value) [^no-need-for-by-value]: There is no need to order by (key, value)
pair, since a `Dictionary` has no duplicate keys. pair, since a `Dictionary` has no duplicate keys.
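The ordering rules for `Set`s and `Dictionary`s can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the inputs are taken to be already-encoded canonical byte strings, and "ascending order" is assumed to mean bytewise lexicographic comparison, which is how Python compares `bytes` values.

```python
def canonical_set_order(encoded_elements):
    """Order already-encoded set elements by comparing their bytes."""
    return sorted(encoded_elements)


def canonical_dict_order(encoded_pairs):
    """Order (encoded_key, encoded_value) pairs by encoded key only.

    Keys are unique within a dictionary, so the values never need to
    take part in the comparison.
    """
    return sorted(encoded_pairs, key=lambda pair: pair[0])


# Opaque placeholder byte strings standing in for encoded values.
elements = [b"\x02b", b"\x01z", b"\x02a"]
pairs = [(b"\x02k", b"\x09"), (b"\x01k", b"\x08")]

print(canonical_set_order(elements))
print(canonical_dict_order(pairs))
```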
@ -43,7 +38,9 @@ the [Preserves specification][spec].[^no-need-for-by-value]
**Other kinds of `Value`.** **Other kinds of `Value`.**
There are no special canonicalization restrictions on There are no special canonicalization restrictions on
`SignedInteger`s, `String`s, `ByteString`s, `Symbol`s, `Boolean`s, `SignedInteger`s, `String`s, `ByteString`s, `Symbol`s, `Boolean`s,
`Float`s, `Double`s, `Record`s, or `Sequence`s. `Float`s, `Double`s, `Record`s, or `Sequence`s. The constraints given
for these `Value`s in the [specification][spec] suffice to ensure
canonicity.
<!-- Heading to visually offset the footnotes from the main document: --> <!-- Heading to visually offset the footnotes from the main document: -->
## Notes ## Notes


@ -65,28 +65,29 @@ interior portions of a tree.
## Comments. ## Comments.
`String` values used as annotations are conventionally interpreted as `String` values used as annotations are conventionally interpreted as
comments. comments. Special syntax exists for such string annotations, though
the usual `@`-prefixed annotation notation can also be used.
@"I am a comment for the Dictionary" ;I am a comment for the Dictionary
{ {
@"I am a comment for the key" ;I am a comment for the key
key: @"I am a comment for the value" key: ;I am a comment for the value
value value
} }
@"I am a comment for this entire IOList" ;I am a comment for this entire IOList
[ [
#hex{00010203} #x"00010203"
@"I am a comment for the middle half of the IOList" ;I am a comment for the middle half of the IOList
@"A second comment for the same portion of the IOList" ;A second comment for the same portion of the IOList
@ @"I am the first and only comment for the following comment" @ ;I am the first and only comment for the following comment
"A third (itself commented!) comment for the same part of the IOList" "A third (itself commented!) comment for the same part of the IOList"
[ [
@"I am a comment for the following ByteString" ;"I am a comment for the following ByteString"
#hex{04050607} #x"04050607"
#hex{08090A0B} #x"08090A0B"
] ]
#hex{0C0D0E0F} #x"0C0D0E0F"
] ]
## MIME-type tagged binary data. ## MIME-type tagged binary data.
@ -105,12 +106,17 @@ such media types following the general rules for ordering of
**Examples.** **Examples.**
| Value | Encoded hexadecimal byte sequence | «<mime application/octet-stream #"abcde">»
|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------| = B4 B3 04 "mime" B3 18 "application/octet-stream" B2 05 "abcde" 84
| `<mime application/octet-stream #"abcde">` | 83 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
| `<mime text/plain #"ABC">` | 83 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 | «<mime text/plain #"ABC">»
| `<mime application/xml #"<xhtml/>">` | 83 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E | = B4 B3 04 "mime" B3 0A "text/plain" B2 03 "ABC" 84
| `<mime text/csv #"123,234,345">` | 83 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 |
«<mime application/xml #"<xhtml/>">»
= B4 B3 04 "mime" B3 0F "application/xml" B2 08 "<xhtml/>" 84
«<mime text/csv #"123,234,345">»
= B4 B3 04 "mime" B3 08 "text/csv" B2 0B "123,234,345" 84
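The new-format examples above can be reproduced with a short Python sketch. The tag bytes (`B4` for the start of a record, `B3` for a symbol, `B2` for a byte string, `84` for the end tag) are read off the examples themselves; the helper names are hypothetical, and the sketch assumes that a symbol's length counts the bytes of its UTF-8 encoding.

```python
def varint(n):
    """Base-128 varint: 7 bits per byte, least significant group first."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # more groups follow
        else:
            out.append(low)
            return bytes(out)


def symbol(name):
    data = name.encode("utf-8")
    return b"\xB3" + varint(len(data)) + data


def bytestring(data):
    return b"\xB2" + varint(len(data)) + data


def record(label, *fields):
    # Start tag, encoded label, encoded fields, then the end tag.
    return b"\xB4" + label + b"".join(fields) + b"\x84"


encoded = record(symbol("mime"), symbol("text/plain"), bytestring(b"ABC"))
print(encoded.hex(" ").upper())
```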
## Unicode normalization forms. ## Unicode normalization forms.


@ -4,7 +4,7 @@ title: "Preserves: an Expressive Data Language"
--- ---
Tony Garnock-Jones <tonyg@leastfixedpoint.com> Tony Garnock-Jones <tonyg@leastfixedpoint.com>
May 2020. Version 0.0.8. Jan 2021. Version 0.4.0.
[sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt [sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
[spki]: http://world.std.com/~cme/html/spki.html [spki]: http://world.std.com/~cme/html/spki.html
@ -12,6 +12,7 @@ May 2020. Version 0.0.8.
[LEB128]: https://en.wikipedia.org/wiki/LEB128 [LEB128]: https://en.wikipedia.org/wiki/LEB128
[erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map [erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
[abnf]: https://tools.ietf.org/html/rfc7405 [abnf]: https://tools.ietf.org/html/rfc7405
[canonical]: canonical-binary.html
This document proposes a data model and serialization format called This document proposes a data model and serialization format called
*Preserves*. *Preserves*.
@ -42,20 +43,20 @@ Our `Value`s fall into two broad categories: *atomic* and *compound*
data. Every `Value` is finite and non-cyclic. data. Every `Value` is finite and non-cyclic.
Value = Atom Value = Atom
| Compound | Compound
Atom = Boolean Atom = Boolean
| Float | Float
| Double | Double
| SignedInteger | SignedInteger
| String | String
| ByteString | ByteString
| Symbol | Symbol
Compound = Record Compound = Record
| Sequence | Sequence
| Set | Set
| Dictionary | Dictionary
**Total order.**<a name="total-order"></a> As we go, we will **Total order.**<a name="total-order"></a> As we go, we will
incrementally specify a total order over `Value`s. Two values of the incrementally specify a total order over `Value`s. Two values of the
@ -215,14 +216,13 @@ label-`Value` followed by its field-`Value`s.
`Sequence`s are enclosed in square brackets. `Dictionary` values are `Sequence`s are enclosed in square brackets. `Dictionary` values are
curly-brace-enclosed colon-separated pairs of values. `Set`s are curly-brace-enclosed colon-separated pairs of values. `Set`s are
written either as one or more values enclosed in curly braces, or zero written as values enclosed by the tokens `#{` and
or more values enclosed by the tokens `#set{` and
`}`.[^printing-collections] It is an error for a set to contain `}`.[^printing-collections] It is an error for a set to contain
duplicate elements or for a dictionary to contain duplicate keys. duplicate elements or for a dictionary to contain duplicate keys.
Sequence = "[" *Value ws "]" Sequence = "[" *Value ws "]"
Dictionary = "{" *(Value ws ":" Value) ws "}" Dictionary = "{" *(Value ws ":" Value) ws "}"
Set = %s"#set{" *Value ws "}" / "{" 1*Value ws "}" Set = "#{" *Value ws "}"
[^printing-collections]: **Implementation note.** When implementing [^printing-collections]: **Implementation note.** When implementing
printing of `Value`s using the textual syntax, consider supporting printing of `Value`s using the textual syntax, consider supporting
@ -232,9 +232,10 @@ duplicate elements or for a dictionary to contain duplicate keys.
commas separating, and commas terminating elements or key/value commas separating, and commas terminating elements or key/value
pairs within a collection. pairs within a collection.
`Boolean`s are the simple literal strings `#true` and `#false`. `Boolean`s are the simple literal strings `#t` and `#f` for true and
false, respectively.
Boolean = %s"#true" / %s"#false" Boolean = %s"#t" / %s"#f"
Numeric data follow the Numeric data follow the
[JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with [JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with
@ -310,9 +311,10 @@ same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]
[^escaping-surrogate-pairs]: In particular, note JSON's rules around [^escaping-surrogate-pairs]: In particular, note JSON's rules around
the use of surrogate pairs for code points not in the Basic the use of surrogate pairs for code points not in the Basic
Multilingual Plane. We encourage implementations to avoid escaping Multilingual Plane. We encourage implementations to avoid using
such characters when producing output, and instead to rely on the `\u` escapes when producing output, and instead to rely on the
UTF-8 encoding of the entire document to handle them correctly. UTF-8 encoding of the entire document to handle non-ASCII
codepoints correctly.
A `ByteString` may be written in any of three different forms. A `ByteString` may be written in any of three different forms.
@ -327,16 +329,16 @@ value with `\x`.
binunescaped = %x20-21 / %x23-5B / %x5D-7E binunescaped = %x20-21 / %x23-5B / %x5D-7E
The second is as a sequence of pairs of hexadecimal digits interleaved The second is as a sequence of pairs of hexadecimal digits interleaved
with whitespace and surrounded by `#hex{` and `}`. with whitespace and surrounded by `#x"` and `"`.
ByteString =/ %s"#hex{" *(ws / 2HEXDIG) ws "}" ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22
The third is as a sequence of The third is as a sequence of
[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved [Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
with whitespace and surrounded by `#base64{` and `}`. Plain and with whitespace and surrounded by `#[` and `]`. Plain and URL-safe
URL-safe Base64 characters are allowed. Base64 characters are allowed.
ByteString =/ %s"#base64{" *(ws / base64char) ws "}" / ByteString =/ "#[" *(ws / base64char) ws "]" /
base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "=" base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
A `Symbol` may be written in a “bare” form[^cf-sexp-token] so long as A `Symbol` may be written in a “bare” form[^cf-sexp-token] so long as
@ -365,10 +367,10 @@ double quote mark.
Finally, any `Value` may be represented by escaping from the textual Finally, any `Value` may be represented by escaping from the textual
syntax to the [compact binary syntax](#compact-binary-syntax) by syntax to the [compact binary syntax](#compact-binary-syntax) by
prefixing a `ByteString` containing the binary representation of the prefixing a `ByteString` containing the binary representation of the
`Value` with `#value`.[^rationale-switch-to-binary] `Value` with `#`.[^rationale-switch-to-binary]
[^no-literal-binary-in-text] [^compact-value-annotations] [^no-literal-binary-in-text] [^compact-value-annotations]
Compact = %s"#value" ws ByteString Compact = "#" ws ByteString
[^rationale-switch-to-binary]: **Rationale.** The textual syntax [^rationale-switch-to-binary]: **Rationale.** The textual syntax
cannot express every `Value`: specifically, it cannot express the cannot express every `Value`: specifically, it cannot express the
@ -387,8 +389,8 @@ prefixing a `ByteString` containing the binary representation of the
access the representation of the text from within the text itself. access the representation of the text from within the text itself.
[^compact-value-annotations]: Any text-syntax annotations preceding [^compact-value-annotations]: Any text-syntax annotations preceding
the `#value` are prepended to any binary-syntax annotations the `#` are prepended to any binary-syntax annotations yielded by
yielded by decoding the `ByteString`. decoding the `ByteString`.
### Annotations. ### Annotations.
@ -403,6 +405,17 @@ Each annotation is preceded by `@`; the underlying annotated value
follows its annotations. Here we extend only the syntactic nonterminal follows its annotations. Here we extend only the syntactic nonterminal
named “`Value`” without altering the semantic class of `Value`s. named “`Value`” without altering the semantic class of `Value`s.
**Comments.** Strings annotating a `Value` are conventionally
interpreted as comments associated with that value. Comments are
sufficiently common that special syntax exists for them.
Value =/ ws
";" *(%x00-09 / %x0B-0C / %x0E-10FFFF) newline
Value
When written this way, everything between the `;` and the newline is
included in the string annotating the `Value`.
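To illustrate the relationship between the two notations, here is a minimal Python sketch that rewrites a `;` comment line as the equivalent `@`-prefixed string annotation. The function name is hypothetical, and the sketch assumes that JSON string escaping, as produced by `json.dumps`, is acceptable for Preserves strings.

```python
import json


def comment_to_annotation(line):
    """Rewrite a ';' comment line as an '@'-prefixed string annotation."""
    stripped = line.lstrip()
    if not stripped.startswith(";"):
        raise ValueError("not a comment line")
    # Everything between the ';' and the newline becomes the string.
    text = stripped[1:].rstrip("\r\n")
    return "@" + json.dumps(text, ensure_ascii=False)


print(comment_to_annotation(";I'm a comment!\n"))
```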
**Equivalence.** Annotations appear within syntax denoting a `Value`; **Equivalence.** Annotations appear within syntax denoting a `Value`;
however, the annotations are not part of the denoted value. They are however, the annotations are not part of the denoted value. They are
only part of the syntax. Annotations do not play a part in only part of the syntax. Annotations do not play a part in
@ -421,86 +434,25 @@ different.
## Compact Binary Syntax ## Compact Binary Syntax
A `Repr` is a binary-syntax encoding, or representation, of either a A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
`Value` or an annotation on a `Repr`. For a value `v`, we write `«v»` for the `Repr` of v.
Each `Repr` comprises one or more bytes describing the kind of
represented information and the length of the representation, followed
by the encoded details.
For a value `v`, we write `[[v]]` for the `Repr` of v.
### Type and Length representation. ### Type and Length representation.
Each `Repr` takes one of three possible forms: Each `Repr` starts with a tag byte, describing the kind of information
represented. Depending on the tag, a length indicator, further encoded
information, and/or an ending tag may follow.
- (A) type-specific form, used for simple values such as `Boolean`s tag (simple atomic data and small integers)
or `Float`s as well as for introducing annotations. tag ++ binarydata (most integers)
tag ++ length ++ binarydata (large integers, strings, symbols, and binary)
tag ++ repr ++ ... ++ endtag (compound data)
- (B) a variable-length form with length specified up-front, used for The unique end tag is byte value `0x84`.
compound and variable-length atomic data structures when their
sizes are known at the time serialization begins.
- (C) a variable-length streaming form with unknown or unpredictable If present after a tag, the length of a following piece of binary data
length, used in cases when serialization begins before the number is formatted as a [base 128 varint][varint].[^see-also-leb128] We
of elements or bytes in the corresponding `Value` is known. write `varint(m)` for the varint-encoding of `m`. Quoting the
Applications may choose between formats B and C depending on their
needs at serialization time.
#### The lead byte.
Every `Repr` starts with a *lead byte*, constructed by
`leadbyte(t,n,m)`, where `t`,`n`∈{0,1,2,3} and 0≤`m`<16:
leadbyte(t,n,m) = [t*64 + n*16 + m]
The arguments `t`, `n` and `m` describe the rest of the
representation.[^some-encodings-unused]
[^some-encodings-unused]: Some encodings are unused. All such
encodings are reserved for future versions of this specification.
| `t` | `n` | `m` | Meaning |
| --- | --- | --- | ------- |
| 0 | 0 | 0–3 | (format A) An `Atom` with fixed-length binary representation |
| 0 | 0 | 4 | (format C) Stream end |
| 0 | 0 | 5 | (format A) Annotation |
| 0 | 2 | | (format C) Stream start |
| 0 | 3 | | (format A) Certain small `SignedInteger`s |
| 1 | | | (format B) An `Atom` with variable-length binary representation |
| 2 | | | (format B) A `Compound` with variable-length representation |
| 3 | 3 | 15 | (format A) 0xFF byte; no-op |
#### Encoding data of type-specific length (format A).
Each type of data defines its own rules for this format.
Of particular note is lead byte `0xFF`, which is a no-op byte acting
as a kind of pseudo-whitespace in a binary-syntax encoding.
#### Encoding data of known length (format B).
Format B is used where the length `l` of the `Value` to be encoded is
known when serialization begins. Format B `Repr`s use `m` in
`leadbyte` to encode `l`. The length counts *bytes* for atomic
`Value`s, but counts *contained values* for compound `Value`s.
- A length `l` between 0 and 14 is represented using `leadbyte` with
`m=l`.
- A length of 15 or greater is represented by `m=15` and additional
bytes describing the length following the lead byte.
The function `header(t,n,m)` yields an appropriate sequence of bytes
describing a `Repr`'s type and length when `t`, `n` and `m` are
appropriate non-negative integers:
header(t,n,m) = leadbyte(t,n,m) when m < 15
or leadbyte(t,n,15) ++ varint(m) otherwise
The additional length bytes are formatted as
[base 128 varints][varint].[^see-also-leb128] We write `varint(m)` for
the varint-encoding of `m`. Quoting the
[Google Protocol Buffers][varint] definition, [Google Protocol Buffers][varint] definition,
[^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned [^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned
@ -515,174 +467,114 @@ the varint-encoding of `m`. Quoting the
The following table illustrates varint-encoding. The following table illustrates varint-encoding.
| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes | | Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
| ------ | ------------------- | ------------ | | ------ | ------------------- | ------------ |
| 15 | `0001111` | 15 | | 15 | `0001111` | 15 |
| 300 | `0000010 0101100` | 172 2 | | 300 | `0000010 0101100` | 172 2 |
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 | | 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
It is an error for a varint-encoded `m` in a `Repr` to be anything It is an error for a varint-encoded `m` in a `Repr` to be anything
other than the unique shortest encoding for that `m`. That is, a other than the unique shortest encoding for that `m`. That is, a
varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0. However, varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0.
the `varint(m)` encoding of a length *MUST NOT* be used when `m`<15,
meaning that a `Repr` *MUST NOT* contain any varint-encoding with
final byte `0`.
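
The varint rule quoted above can be sketched in a few lines. This is a minimal illustration, not a reference implementation; the function name `varint` simply mirrors the notation used in the text.

```python
def varint(m: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (unsigned LEB128)."""
    if m < 0:
        raise ValueError("varint is defined only for non-negative integers")
    out = bytearray()
    while True:
        low7 = m & 0x7F          # least-significant 7-bit group goes first
        m >>= 7
        if m:
            out.append(low7 | 0x80)  # more groups follow: set the continuation bit
        else:
            out.append(low7)         # final group: continuation bit clear
            return bytes(out)
```

Because the loop stops as soon as the remaining value is zero, the output is always the shortest encoding, so the final byte is `0` only when `m` is 0.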
#### Streaming data of unknown length (format C). ### Records, Sequences, Sets and Dictionaries.
A `Repr` where the length of the `Value` to be encoded is variable and «<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
not known at the time serialization of the `Value` starts is encoded «[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
by a single Stream Start (“open”) byte, followed by zero or more «#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
*chunks*, followed by a matching Stream End (“close”) byte: «{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]
open(t,n) = leadbyte(0,2, t*4 + n) = [0x20 + t*4 + n]
close() = leadbyte(0,0, 4) = [0x04]
For a format C `Repr` of an atomic `Value`, each chunk is to be a
format B `Repr` of a `ByteString`, no matter the type of the overall
`Value`. Annotations are not allowed on these individual chunks.
For a format C `Repr` of a compound `Value`, each chunk is to be a
single `Repr`, which may itself be annotated.
Each chunk within a format C `Repr` *MUST* have non-zero length.
Software that decodes `Repr`s *MUST* reject `Repr`s that include
zero-length chunks.
### Records.
Format B (known length):
[[ <L F_1...F_m> ]] = header(2,0,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]
For `m` fields, `m+1` is supplied to `header`, to account for the
encoding of the record label.
Format C (streaming):
[[ <L F_1...F_m> ]] = open(2,0) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close()
Applications *SHOULD* prefer the known-length format for encoding
`Record`s.
### Sequences, Sets and Dictionaries.
Format B (known length):
[[ [X_1...X_m] ]] = header(2,1,m) ++ [[X_1]] ++...++ [[X_m]]
[[ #set{X_1...X_m} ]] = header(2,2,m) ++ [[X_1]] ++...++ [[X_m]]
[[ {K_1:V_1...K_m:V_m} ]] = header(2,3,m*2) ++ [[K_1]] ++ [[V_1]] ++...
++ [[K_m]] ++ [[V_m]]
Note that `m*2` is given to `header` for a `Dictionary`, since there
are two `Value`s in each key-value pair.
Format C (streaming):
[[ [X_1...X_m] ]] = open(2,1) ++ [[X_1]] ++...++ [[X_m]] ++ close()
[[ #set{E_1...E_m} ]] = open(2,2) ++ [[E_1]] ++...++ [[E_m]] ++ close()
[[ {K_1:V_1...K_m:V_m} ]] = open(2,3) ++ [[K_1]] ++ [[V_1]] ++...
++ [[K_m]] ++ [[V_m]] ++ close()
Applications may use whichever format suits their needs on a
case-by-case basis.
There is *no* ordering requirement on the `E_i` elements or There is *no* ordering requirement on the `E_i` elements or
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any `K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
addition, implementations *SHOULD* default to writing set elements and
dictionary key/value pairs in order sorted lexicographically by their
`Repr`s[^not-sorted-semantically], and *MAY* offer the option of
serializing in some other implementation-defined order.
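
As a rough sketch of the new compound framing and the recommended default ordering, the helpers below wrap already-encoded `Repr`s (as `bytes`) in the `0xB5`/`0xB6`/`0xB7` tags and the `0x84` end marker. The function names are hypothetical, and ordering dictionary pairs by the key's `Repr` is an assumption, although it agrees with the dictionary examples later in this document.

```python
END = b"\x84"  # end marker shared by all compound Reprs


def encode_sequence(items: list[bytes]) -> bytes:
    # Sequences keep the order given by the caller.
    return b"\xB5" + b"".join(items) + END


def encode_set(elements: list[bytes]) -> bytes:
    # This only detects byte-identical Reprs; it is a sketch, not a full
    # check that the underlying Values are pairwise distinct.
    if len(set(elements)) != len(elements):
        raise ValueError("set elements must be pairwise distinct")
    # Python compares bytes lexicographically by unsigned byte value.
    return b"\xB6" + b"".join(sorted(elements)) + END


def encode_dictionary(pairs: list[tuple[bytes, bytes]]) -> bytes:
    keys = [k for k, _ in pairs]
    if len(set(keys)) != len(keys):
        raise ValueError("dictionary keys must be pairwise distinct")
    ordered = sorted(pairs, key=lambda kv: kv[0])  # assumed: sort by key Repr
    return b"\xB7" + b"".join(k + v for k, v in ordered) + END
```

For instance, a set holding 0, 1 and -1 is written with `Repr`s `90`, `91`, `9F` in that order, so -1 comes last even though it is semantically the smallest.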
[^no-sorting-rationale]: In the BitTorrent encoding format, [^no-sorting-rationale]: In the BitTorrent encoding format,
[bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding), [bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
dictionary key/value pairs must be sorted by key. This is a dictionary key/value pairs must be sorted by key. This is a
necessary step for ensuring serialization of `Value`s is necessary step for ensuring serialization of `Value`s is
canonical. We do not require that key/value pairs (or set canonical. We do not require that key/value pairs (or set
elements) be in sorted order for serialized `Value`s, because (a) elements) be in sorted order for serialized `Value`s; however, a
where canonicalization is used for cryptographic signatures, it is [canonical form][canonical] for `Repr`s does exist where a sorted
more reliable to simply retain the exact binary form of the signed ordering is required.
document than to depend on canonical de- and re-serialization, and
(b) sorting keys or elements makes no sense in streaming
serialization formats.
However, a quality implementation may wish to offer the programmer [^not-sorted-semantically]: It's important to note that the sort
the option of serializing with set elements and dictionary keys in ordering for writing out set elements and dictionary key/value
sorted order. pairs is *not* the same as the sort ordering implied by the
semantic ordering of those elements or keys. For example, the
`Repr` of a negative number very far from zero will start with
a byte that is *greater* than the byte which starts the `Repr` of
zero, making it sort lexicographically later by `Repr`, despite
being semantically *less than* zero.
**Rationale**. This is for ease-of-implementation reasons: not all
languages can easily represent sorted sets or sorted dictionaries,
but encoding and then sorting byte strings is much more likely to
be within easy reach.
### SignedIntegers. ### SignedIntegers.
Format B/A (known length/fixed-size): «x» when x ∈ SignedInteger = [0xB0] ++ varint(m) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m>16
([0xA0] + m - 1) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m≤16
([0xA0] + x) if (-3≤x≤-1)
([0x90] + x) if ( 0≤x≤12)
where m = |intbytes(x)|
[[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x) if x<-3 ∨ 13≤x Integers in the range [-3,12] are compactly represented with tags
header(0,3,x+16) if -3≤x<0 between `0x90` and `0x9F` because they are so frequently used.
header(0,3,x) if 0≤x<13 Integers up to 16 bytes long are represented with a single-byte tag
encoding the length of the integer. Larger integers are represented
Integers in the range [-3,12] are compactly represented using format A with an explicit varint length. Every `SignedInteger` *MUST* be
because they are so frequently used. Other integers are represented represented with its shortest possible encoding.
using format B.
Format C *MUST NOT* be used for `SignedInteger`s. Format A *MUST* be
used for integers in the range -3 to 12, inclusive.
The function `intbytes(x)` gives the big-endian two's-complement The function `intbytes(x)` gives the big-endian two's-complement
binary representation of `x`, taking exactly as many whole bytes as binary representation of `x`, taking exactly as many whole bytes as
needed to unambiguously identify the value and its sign, and `m = needed to unambiguously identify the value and its sign, and `m =
|intbytes(x)|`. The most-significant bit in the first byte in |intbytes(x)|`. The most-significant bit in the first byte in
`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes] `intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes] For
example,
«87112285931760246646623899502532662132736»
= B0 12 01 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00
«-257» = A1 FE FF «-3» = 9D «128» = A1 00 80
«-256» = A1 FF 00 «-2» = 9E «255» = A1 00 FF
«-255» = A1 FF 01 «-1» = 9F «256» = A1 01 00
«-254» = A1 FF 02 «0» = 90 «32767» = A1 7F FF
«-129» = A1 FF 7F «1» = 91 «32768» = A2 00 80 00
«-128» = A0 80 «12» = 9C «65535» = A2 00 FF FF
«-127» = A0 81 «13» = A0 0D «65536» = A2 01 00 00
«-4» = A0 FC «127» = A0 7F «131072» = A2 02 00 00
[^zero-intbytes]: The value 0 needs zero bytes to identify the [^zero-intbytes]: The value 0 needs zero bytes to identify the
value, so `intbytes(0)` is the empty byte string. Non-zero values value, so `intbytes(0)` is the empty byte string. Non-zero values
need at least one byte. need at least one byte.
For example,
[[ -257 ]] = 42 FE FF [[ -3 ]] = 3D [[ 128 ]] = 42 00 80
[[ -256 ]] = 42 FF 00 [[ -2 ]] = 3E [[ 255 ]] = 42 00 FF
[[ -255 ]] = 42 FF 01 [[ -1 ]] = 3F [[ 256 ]] = 42 01 00
[[ -254 ]] = 42 FF 02 [[ 0 ]] = 30 [[ 32767 ]] = 42 7F FF
[[ -129 ]] = 42 FF 7F [[ 1 ]] = 31 [[ 32768 ]] = 43 00 80 00
[[ -128 ]] = 41 80 [[ 12 ]] = 3C [[ 65535 ]] = 43 00 FF FF
[[ -127 ]] = 41 81 [[ 13 ]] = 41 0D [[ 65536 ]] = 43 01 00 00
[[ -4 ]] = 41 FC [[ 127 ]] = 41 7F [[ 131072 ]] = 43 02 00 00
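
The new-format `SignedInteger` rules (tags `0x90`–`0x9F`, `0xA0`–`0xAF` and `0xB0`) can be sketched as follows. The names `intbytes` and `varint` mirror the notation in the text, while `encode_signed_integer` is a hypothetical name chosen for illustration.

```python
def varint(m: int) -> bytes:
    # Base-128 varint: 7 bits per byte, least-significant group first.
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def intbytes(x: int) -> bytes:
    # Shortest big-endian two's-complement form; zero needs no bytes at all.
    if x == 0:
        return b""
    magnitude = x if x > 0 else x + 1
    length = magnitude.bit_length() // 8 + 1  # leaves room for the sign bit
    return x.to_bytes(length, "big", signed=True)


def encode_signed_integer(x: int) -> bytes:
    if -3 <= x <= -1:
        return bytes([0xA0 + x])       # 0x9D..0x9F
    if 0 <= x <= 12:
        return bytes([0x90 + x])       # 0x90..0x9C
    body = intbytes(x)
    m = len(body)
    if m <= 16:
        return bytes([0xA0 + m - 1]) + body
    return b"\xB0" + varint(m) + body  # explicit varint length for big values
```

Each branch produces the only permitted encoding for its range, which is how the shortest-encoding requirement is met in this sketch.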
### Strings, ByteStrings and Symbols. ### Strings, ByteStrings and Symbols.
Syntax for these three types varies only in the value of `n` supplied Syntax for these three types varies only in the tag used. For `String`
to `header` and `open`. In each case, the payload following the header and `Symbol`, the data following the tag is a UTF-8 encoding of the
is a binary sequence; for `String` and `Symbol`, it is a UTF-8 `Value`'s code points, while for `ByteString` it is the raw data
encoding of the `Value`'s code points, while for `ByteString` it is contained within the `Value` unmodified.
the raw data contained within the `Value` unmodified.
Format B (known length): «S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ String
[0xB2] ++ varint(|S|) ++ S if S ∈ ByteString
[0xB3] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ Symbol
[[ S ]] = header(1,n,m) ++ encode(S) ### Booleans.
where m = |encode(S)|
and (n,encode(S)) = (1,utf8(S)) if S ∈ String
(2,S) if S ∈ ByteString
(3,utf8(S)) if S ∈ Symbol
To stream a `String`, `ByteString` or `Symbol`, emit `open(1,n)` and «#f» = [0x80]
then a sequence of zero or more format B chunks, followed by «#t» = [0x81]
`close()`. Every chunk must be a `ByteString`, and no chunk may be
annotated.
While the overall content of a streamed `String` or `Symbol` must be ### Floats and Doubles.
valid UTF-8, individual chunks do not have to conform to UTF-8.
### Fixed-length Atoms. «F» when F ∈ Float = [0x82] ++ binary32(F)
«D» when D ∈ Double = [0x83] ++ binary64(D)
Fixed-length atoms all use format A, and do not have a length
representation. They repurpose the bits that format B `Repr`s use to
specify lengths. Applications *MUST NOT* use format C with `open(0,n)`
for any `n`.
#### Booleans.
[[ #false ]] = header(0,0,0) = [0x00]
[[ #true ]] = header(0,0,1) = [0x01]
#### Floats and Doubles.
[[ F ]] when F ∈ Float = header(0,0,2) ++ binary32(F)
[[ D ]] when D ∈ Double = header(0,0,3) ++ binary64(D)
The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
8-byte IEEE 754 binary representations of `F` and `D`, respectively. 8-byte IEEE 754 binary representations of `F` and `D`, respectively.
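
For the new-format atoms other than integers, a minimal sketch might look like this. All function names are hypothetical; `struct.pack` with the `>f` and `>d` formats is assumed to be an adequate stand-in for `binary32` and `binary64`.

```python
import struct


def varint(m: int) -> bytes:
    # Base-128 varint: 7 bits per byte, least-significant group first.
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return b"\xB1" + varint(len(data)) + data  # length counts UTF-8 bytes


def encode_bytestring(b: bytes) -> bytes:
    return b"\xB2" + varint(len(b)) + b


def encode_symbol(name: str) -> bytes:
    data = name.encode("utf-8")
    return b"\xB3" + varint(len(data)) + data


def encode_boolean(b: bool) -> bytes:
    return b"\x81" if b else b"\x80"


def encode_float(f: float) -> bytes:
    return b"\x82" + struct.pack(">f", f)  # big-endian IEEE 754 binary32


def encode_double(d: float) -> bytes:
    return b"\x83" + struct.pack(">d", d)  # big-endian IEEE 754 binary64
```

Note that the length prefix is always a byte count, so a `String` containing non-ASCII code points has a length larger than its number of characters.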
@ -690,40 +582,43 @@ The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
### Annotations. ### Annotations.
To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
`[0x05] ++ [[v]]`. `[0x85] ++ «v»`. For example, the `Repr` corresponding to textual
syntax `@a@b[]`, i.e. an empty sequence annotated with two symbols,
`a` and `b`, is
For example, the `Repr` corresponding to textual syntax `@a@b[]`, «@a @b []»
i.e. an empty sequence annotated with two symbols, `a` and `b`, is = [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
= [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]
[[ @a @b [] ]]
= [0x05] ++ [[a]] ++ [0x05] ++ [[b]] ++ [[ [] ]]
= [0x05, 0x71, 0x61, 0x05, 0x71, 0x62, 0x90]
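
The new-format annotation example can be reproduced with a short sketch. The helper names `short_symbol` and `annotate` are hypothetical, and `short_symbol` deliberately handles only symbols shorter than 128 bytes, where the varint length is a single byte.

```python
def short_symbol(name: str) -> bytes:
    data = name.encode("utf-8")
    if len(data) >= 128:
        raise ValueError("this sketch only handles single-byte varint lengths")
    return bytes([0xB3, len(data)]) + data


def annotate(repr_: bytes, *annotations: bytes) -> bytes:
    # Each annotation is itself a Repr, introduced by the 0x85 tag and
    # placed in front of the Repr being annotated.
    prefix = b"".join(b"\x85" + a for a in annotations)
    return prefix + repr_


EMPTY_SEQUENCE = b"\xB5\x84"
annotated = annotate(EMPTY_SEQUENCE, short_symbol("a"), short_symbol("b"))
```

Here `annotated` holds the ten bytes `85 B3 01 61 85 B3 01 62 B5 84`.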
## Examples ## Examples
### Ordering.
The total ordering specified [above](#total-order) means that the following statements are true:
"bzz" < "c" < "caa"
#t < 3.0f < 3.0 < 3 < "3" < |3| < []
### Simple examples. ### Simple examples.
<!-- TODO: Give some examples of large and small Preserves, perhaps --> <!-- TODO: Give some examples of large and small Preserves, perhaps -->
<!-- translated from various JSON blobs floating around the internet. --> <!-- translated from various JSON blobs floating around the internet. -->
| Value | Encoded byte sequence | | Value | Encoded byte sequence |
|---------------------------------------------------|-------------------------------------------------------------------------------------| |-----------------------------|---------------------------------------------------------------------------------|
| `<capture <discard>>` | 82 77 'c' 'a' 'p' 't' 'u' 'r' 'e' 81 77 'd' 'i' 's' 'c' 'a' 'r' 'd' | | `<capture <discard>>` | B4 B3 07 'c' 'a' 'p' 't' 'u' 'r' 'e' B4 B3 07 'd' 'i' 's' 'c' 'a' 'r' 'd' 84 84 |
| `[1 2 3 4]` (format B) | 94 31 32 33 34 | | `[1 2 3 4]` | B5 91 92 93 94 84 |
| `[1 2 3 4]` (format C) | 29 31 32 33 34 04 | | `[-2 -1 0 1]` | B5 9E 9F 90 91 84 |
| `[-2 -1 0 1]` | 94 3E 3F 30 31 | | `"hello"` | B1 05 'h' 'e' 'l' 'l' 'o' |
| `"hello"` (format B) | 55 'h' 'e' 'l' 'l' 'o' | | `["a" b #"c" [] #{} #t #f]` | B5 B1 01 'a' B3 01 'b' B2 01 'c' B5 84 B6 84 81 80 84 |
| `"hello"` (format C, 2 chunks) | 25 62 'h' 'e' 63 'l' 'l' 'o' 35 | | `-257` | A1 FE FF |
| `"hello"` (format C, 5 chunks) | 25 61 'h' 61 'e' 61 'l' 61 'l' 61 'o' 35 | | `-1` | 9F |
| `["hello" there #"world" [] #set{} #true #false]` | 97 55 'h' 'e' 'l' 'l' 'o' 75 't' 'h' 'e' 'r' 'e' 65 'w' 'o' 'r' 'l' 'd' 90 A0 01 00 | | `0` | 90 |
| `-257` | 42 FE FF | | `1` | 91 |
| `-1` | 3F | | `255` | A1 00 FF |
| `0` | 30 | | `1.0f` | 82 3F 80 00 00 |
| `1` | 31 | | `1.0` | 83 3F F0 00 00 00 00 00 00 |
| `255` | 42 00 FF | | `-1.202e300` | 83 FE 3C B7 B7 59 BF 04 26 |
| `1.0f` | 02 3F 80 00 00 |
| `1.0` | 03 3F F0 00 00 00 00 00 00 |
| `-1.202e300` | 03 FE 3C B7 B7 59 BF 04 26 |
The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record` The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record`
@ -731,21 +626,24 @@ The next example uses a non-`Symbol` label for a record.[^extensibility2] The `R
encodes to encodes to
85 ;; Record, generic, 4+1 B4 ;; Record
95 ;; Sequence, 5 B5 ;; Sequence
76 74 69 74 6C 65 64 ;; Symbol, "titled" B3 06 74 69 74 6C 65 64 ;; Symbol, "titled"
76 70 65 72 73 6F 6E ;; Symbol, "person" B3 06 70 65 72 73 6F 6E ;; Symbol, "person"
32 ;; SignedInteger, "2" 92 ;; SignedInteger, "2"
75 74 68 69 6E 67 ;; Symbol, "thing" B3 05 74 68 69 6E 67 ;; Symbol, "thing"
31 ;; SignedInteger, "1" 91 ;; SignedInteger, "1"
41 65 ;; SignedInteger, "101" 84 ;; End (sequence)
59 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell" A0 65 ;; SignedInteger, "101"
84 ;; Record, generic, 3+1 B1 09 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell"
74 64 61 74 65 ;; Symbol, "date" B4 ;; Record
42 07 1D ;; SignedInteger, "1821" B3 04 64 61 74 65 ;; Symbol, "date"
32 ;; SignedInteger, "2" A1 07 1D ;; SignedInteger, "1821"
33 ;; SignedInteger, "3" 92 ;; SignedInteger, "2"
52 44 72 ;; String, "Dr" 93 ;; SignedInteger, "3"
84 ;; End (record)
B1 02 44 72 ;; String, "Dr"
84 ;; End (record)
[^extensibility2]: It happens to line up with Racket's [^extensibility2]: It happens to line up with Racket's
representation of a record label for an inheritance hierarchy representation of a record label for an inheritance hierarchy
@ -785,23 +683,27 @@ read as `Symbol`s. The first example:
encodes to binary as follows: encodes to binary as follows:
B2 B7
55 "Image" B1 05 "Image"
BC B7
55 "Width" 42 03 20 B1 05 "Title" B1 14 "View from 15th Floor"
55 "Title" 5F 14 "View from 15th Floor" B1 05 "Width" A1 03 20
58 "Animated" 75 "false" B1 06 "Height" A1 02 58
56 "Height" 42 02 58 B1 08 "Animated" B3 05 "false"
59 "Thumbnail" B1 09 "Thumbnail"
B6 B7
55 "Width" 41 64 B1 03 "Url" B1 26 "http://www.example.com/image/481989943"
53 "Url" 5F 26 "http://www.example.com/image/481989943" B1 03 "IDs" B5
56 "Height" 41 7D A0 74
53 "IDs" 94 A1 03 AF
41 74 A1 00 EA
42 03 AF A2 00 97 89
42 00 EA 84
43 00 97 89 B1 05 "Width" A0 64
B1 06 "Height" A0 7D
84
84
84
and the second example: and the second example:
@ -830,55 +732,51 @@ and the second example:
encodes to binary as follows: encodes to binary as follows:
92 B5
BF 10 B7
59 "precision" 53 "zip" B1 03 "Zip" B1 05 "94107"
58 "Latitude" 03 40 42 E2 26 80 9D 49 52 B1 04 "City" B1 0D "SAN FRANCISCO"
59 "Longitude" 03 C0 5E 99 56 6C F4 1F 21 B1 05 "State" B1 02 "CA"
57 "Address" 50 B1 07 "Address" B1 00
54 "City" 5D "SAN FRANCISCO" B1 07 "Country" B1 02 "US"
55 "State" 52 "CA" B1 08 "Latitude" 83 40 42 E2 26 80 9D 49 52
53 "Zip" 55 "94107" B1 09 "Longitude" 83 C0 5E 99 56 6C F4 1F 21
57 "Country" 52 "US" B1 09 "precision" B1 03 "zip"
BF 10 84
59 "precision" 53 "zip" B7
58 "Latitude" 03 40 42 AF 9D 66 AD B4 03 B1 03 "Zip" B1 05 "94085"
59 "Longitude" 03 C0 5E 81 AA 4F CA 42 AF B1 04 "City" B1 09 "SUNNYVALE"
57 "Address" 50 B1 05 "State" B1 02 "CA"
54 "City" 59 "SUNNYVALE" B1 07 "Address" B1 00
55 "State" 52 "CA" B1 07 "Country" B1 02 "US"
53 "Zip" 55 "94085" B1 08 "Latitude" 83 40 42 AF 9D 66 AD B4 03
57 "Country" 52 "US" B1 09 "Longitude" 83 C0 5E 81 AA 4F CA 42 AF
B1 09 "precision" B1 03 "zip"
84
84
## Security Considerations ## Security Considerations
**Empty chunks.** Chunks of zero length are prohibited in streamed **Whitespace.** The textual format allows arbitrary whitespace in many
(format C) `Repr`s. However, a malicious or broken encoder may include positions. Consider optional restrictions on the amount of consecutive
them nonetheless. This opens up a possibility for denial-of-service: whitespace that may appear.
an attacker may begin streaming a `String`, for example, sending an
endless sequence of zero length chunks, appearing to make progress but
not actually doing so. Implementations *MUST* reject zero length
chunks when decoding, and *MUST NOT* produce them when encoding.
**Whitespace and no-ops.** Similarly, the binary format allows `0xFF` **Annotations.** Similarly, in modes where a `Value` is being read
no-ops and the textual format allows arbitrary whitespace in many while annotations are skipped, an endless sequence of annotations may
positions. In streaming transfer situations, consider optional give an illusion of progress.
restrictions on the amount of consecutive whitespace or the number of
consecutive no-ops that may appear.
**Annotations.** Also similarly, in modes where a `Value` is being **Canonical form for cryptographic hashing and signing.** No canonical
read while annotations are skipped, an endless sequence of annotations textual encoding of a `Value` is specified. A
may give an illusion of progress. [canonical form][canonical] exists for binary encoded `Value`s, and
implementations *SHOULD* produce canonical binary encodings by
**Canonical form for cryptographic hashing and signing.** As default; however, an implementation *MAY* permit two serializations of
specified, neither the textual nor the compact binary encoding rules the same `Value` to yield different binary `Repr`s.
for `Value`s force canonical serializations. Two serializations of the
same `Value` may yield different binary `Repr`s.
## Acknowledgements ## Acknowledgements
The use of low-order bits of each lead byte for the length of short The use of the low-order bits in certain SignedInteger tags for the
values is inspired by a similar feature of [CBOR](http://cbor.io/). length of the following data is inspired by a similar feature of
[CBOR](http://cbor.io/).
The treatment of commas as whitespace in the text syntax is inspired The treatment of commas as whitespace in the text syntax is inspired
by the same feature of [EDN](https://github.com/edn-format/edn). by the same feature of [EDN](https://github.com/edn-format/edn).
@ -889,126 +787,42 @@ syntax.
## Appendix. Autodetection of textual or binary syntax ## Appendix. Autodetection of textual or binary syntax
Whitespace characters `0x09` (ASCII HT (tab)), `0x0A` (LF), `0x0D` Every tag byte in a binary Preserves `Document` falls within the range
(CR), `0x20` (space) and `0x2C` (comma) are ignored at the start of a [`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
textual-syntax Preserves `Document`, and their UTF-8 encodings are bytes*, and will never occur as the first byte of a UTF-8 encoded code
reserved lead byte values in binary-syntax Preserves. point. This means no binary-encoded document can be misinterpreted as
valid UTF-8.
The byte `0xFF`, signifying a no-op in binary-syntax Preserves, has no Conversely, a UTF-8 document must start with a valid codepoint,
meaning in either 7-bit ASCII or UTF-8, and therefore cannot appear in meaning in particular that it must not start with a byte in the range
a valid textual-syntax Preserves `Document`. [`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
Preserves document can be misinterpreted as a binary-syntax document.
If applications prefix their textual-syntax documents with e.g. a Examination of the top two bits of the first byte of a document gives
space or newline character, and their binary-syntax documents with a its syntax: if the top two bits are `10`, it should be interpreted as
`0xFF` byte, consumers of these documents may reliably autodetect the a binary-syntax document; otherwise, it should be interpreted as text.
syntax being used. In a network protocol supporting this kind of
autodetection, clients may transmit LF or `0xFF` to select text or
binary syntax, respectively.
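
The new autodetection rule, which inspects the top two bits of the first byte, can be sketched as a one-line check. The function name `detect_syntax` and its string return values are illustrative assumptions.

```python
def detect_syntax(document: bytes) -> str:
    """Classify a Preserves document by the top two bits of its first byte."""
    if not document:
        raise ValueError("cannot classify an empty document")
    # Bytes 0x80..0xBF have top bits 10: new-format binary tags, which are
    # UTF-8 continuation bytes and so can never begin a valid UTF-8 text.
    return "binary" if (document[0] & 0xC0) == 0x80 else "text"
```

For example, `detect_syntax(b"\xB5\x84")` reports `"binary"`, while `detect_syntax(b"[1 2 3]")` reports `"text"`.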
Furthermore, if an application consistently uses `Record`s for its ## Appendix. Table of tag values
top-level messages,[^records-and-nonatoms] eschewing `Atom`s in
particular, then autodetection of the encoding used for a given input
can be done as follows:
| First byte of encoded input | Encoding | Other conclusions | 80 - False
| --- | --- | --- | 81 - True
| `0x80`--`0x8F` | binary | `Record` (format B) | 82 - Float
| `0x28` | binary | `Record` (format C) | 83 - Double
| `0x05` | binary | annotated value (presumably a `Record`) | 84 - End marker
| `0xFF` | binary | no-op; value will follow | 85 - Annotation
| --- | --- | --- | (8x) RESERVED 86-8F
| `0x3C` ("<") | text | `Record` |
| `0x40` ("@") | text | annotated value (presumably a `Record`) |
| `0x09`, `0x0A`, `0x0D`, `0x20` or `0x2C` | text | whitespace; value will follow |
[^records-and-nonatoms]: Similar reasoning can be used to permit 9x - Small integers 0..12,-3..-1
unambiguous detection of encoding when `Collection`s are allowed An - Small integers, (n+1) bytes long
as top-level messages as well as `Record`s. B0 - Large integers, variable length
B1 - String
B2 - ByteString
B3 - Symbol
## Appendix. Table of lead byte values B4 - Record
B5 - Sequence
00 - False B6 - Set
01 - True B7 - Dictionary
02 - Float
03 - Double
04 - End stream
05 - Annotation
(0x) RESERVED 06-0F (NB. 09, 0A, 0D specially reserved)
(1x) RESERVED
2x - Start Stream (NB. 20, 2C specially reserved)
3x - Small integers 0..12,-3..-1
4x - SignedInteger
5x - String
6x - ByteString
7x - Symbol
8x - Record
9x - Sequence
Ax - Set
Bx - Dictionary
(Cx) RESERVED C0-CF
(Dx) RESERVED D0-DF
(Ex) RESERVED E0-EF
(Fx) RESERVED F0-FE
FF No-op
## Appendix. Bit fields within lead byte values
tt nn mmmm contents
---------- ---------
00 00 0000 False
00 00 0001 True
00 00 0010 Float, 32 bits big-endian binary
00 00 0011 Double, 64 bits big-endian binary
00 00 0100 End Stream (to match a previous Start Stream)
00 00 0101 Annotation; two more Reprs follow
00 00 1001 (ASCII HT (tab)) \
00 00 1010 (ASCII LF) |- Reserved: may be used to indicate
00 00 1101 (ASCII CR) / use of text encoding
00 01 xxxx error, RESERVED
00 10 ttnn Start Stream <tt,nn>
When tt = 00 --> error
When nn = 00 --> (ASCII space)
Reserved: may be used to indicate
use of text encoding
otherwise --> error
01 --> each chunk is a ByteString
10 --> each chunk is a single encoded Value
11 --> error (RESERVED)
When nn = 00 --> (ASCII comma)
Reserved: may be used to indicate
use of text encoding
otherwise --> error
00 11 xxxx Small integers 0..12,-3..-1
01 00 mmmm SignedInteger, big-endian binary
01 01 mmmm String, UTF-8 binary
01 10 mmmm ByteString
01 11 mmmm Symbol, UTF-8 binary
10 00 mmmm Record
10 01 mmmm Sequence
10 10 mmmm Set
10 11 mmmm Dictionary
11 00 xxxx error, RESERVED
11 01 xxxx error, RESERVED
11 10 xxxx error, RESERVED
11 11 1111 no-op; unambiguous indication of binary Preserves format
Where `mmmm` appears, interpret it as an unsigned 4-bit number `m`. If
`m`<15, let `l`=`m`. Otherwise, `m`=15; let `l` be the result of
decoding the varint that follows.
Then, `l` is the length of the body that follows, counted in bytes for
`tt`=`01` and in `Repr`s for `tt`=`10`.
## Appendix. Binary SignedInteger representation ## Appendix. Binary SignedInteger representation
@ -1016,17 +830,17 @@ Languages that provide fixed-width machine word types may find the
following table useful in encoding and decoding binary `SignedInteger` following table useful in encoding and decoding binary `SignedInteger`
values. values.
| Integer range | Bytes required | Encoding (hex) | | Integer range | Bytes required | Encoding (hex) |
| --- | --- | --- | | --- | --- | --- |
| -3 ≤ n < 13 (numbers -3..12 encoded specially) | 1 | `3X` | | -3 ≤ n ≤ 12 | 1 | `9X` |
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `41` `XX` | | -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `A0` `XX` |
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `42` `XX` `XX` | | -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `A1` `XX` `XX` |
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `43` `XX` `XX` `XX` | | -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `A2` `XX` `XX` `XX` |
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `44` `XX` `XX` `XX` `XX` | | -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `A3` `XX` `XX` `XX` `XX` |
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `45` `XX` `XX` `XX` `XX` `XX` | | -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `A4` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `46` `XX` `XX` `XX` `XX` `XX` `XX` | | -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `A5` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `47` `XX` `XX` `XX` `XX` `XX` `XX` `XX` | | -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `A6` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `48` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` | | -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `A7` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
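
Going the other way, the new-format rows of the table suggest a simple decoder for integers of up to 16 bytes. This is a sketch under stated assumptions: `decode_signed_integer` is a hypothetical name, the `0xB0` variable-length form is not handled, and no check is made that the input uses the shortest encoding.

```python
def decode_signed_integer(data: bytes) -> int:
    tag = data[0]
    if 0x90 <= tag <= 0x9C:
        return tag - 0x90              # small integers 0..12
    if 0x9D <= tag <= 0x9F:
        return tag - 0xA0              # small integers -3..-1
    if 0xA0 <= tag <= 0xAF:
        m = tag - 0xA0 + 1             # tag An means (n+1) body bytes
        body = data[1:1 + m]
        if len(body) != m:
            raise ValueError("truncated SignedInteger")
        return int.from_bytes(body, "big", signed=True)
    raise ValueError("not a fixed-tag SignedInteger")
```

In a language with fixed-width machine words, the `int.from_bytes` step would correspond to sign-extending the body into the smallest word type that holds `m` bytes.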
<!-- Heading to visually offset the footnotes from the main document: --> <!-- Heading to visually offset the footnotes from the main document: -->
## Notes ## Notes


@ -29,16 +29,3 @@ not. There's only one (?) at the moment, the `%i"f"` in `Float`;
should it be changed to case-sensitive? should it be changed to case-sensitive?
Q. Should `IOList`s be wrapped in an identifying unary record constructor? Q. Should `IOList`s be wrapped in an identifying unary record constructor?
TODO: Examples of the ordering. `"bzz" < "c" < "caa"`; `#true < 3 < "3" < |3|`
TODO: Probably should add a canonicalized subset. Consider adding
explicit "I promise this is canonical" marker, like a BOM, which
identifies a binary value as (first) binary and (second, optionally)
as canonical. UTF-8 disallows byte `0xFF` from appearing anywhere in a
text; this might be a good candidate for a marker sequence.
((Actually, perhaps `0x10` would be good! It corresponds to DLE, "data
link escape"; it is not a printable ASCII character, and is disallowed
in the textual Preserves grammar; and it is also mnemonic for "version
0", since it is the Preserves binary encoding of the small integer
zero.))