MUCH simpler binary format, inspired by Syrup; alterations to text format
parent ccf4f97ed8
commit 5d719c2c6f
2
NOTICE
|
@ -1,2 +1,2 @@
|
||||||
Preserves: an Expressive Data Language
|
Preserves: an Expressive Data Language
|
||||||
Copyright 2018-2019 Tony Garnock-Jones
|
Copyright 2018-2020 Tony Garnock-Jones
|
||||||
|
|
191
TUTORIAL.md
|
@ -38,7 +38,7 @@ For that, see the [Preserves specification](preserves.html).
|
||||||
|
|
||||||
If you're familiar with JSON, Preserves looks fairly similar:
|
If you're familiar with JSON, Preserves looks fairly similar:
|
||||||
|
|
||||||
``` javascript
|
```
|
||||||
{"name": "Missy Rose",
|
{"name": "Missy Rose",
|
||||||
"species": "Felis Catus",
|
"species": "Felis Catus",
|
||||||
"age": 13,
|
"age": 13,
|
||||||
|
@ -49,35 +49,35 @@ Preserves also has something we can use for debugging/development
|
||||||
information called "annotations"; they aren't actually read in as data
|
information called "annotations"; they aren't actually read in as data
|
||||||
but we can use them for comments.
|
but we can use them for comments.
|
||||||
(They can also be used for other development tools and are not
|
(They can also be used for other development tools and are not
|
||||||
restricted to strings; more on this later, but for now interpret them
|
restricted to strings; more on this later, but for now, we will stick
|
||||||
as comments.)
|
to the special comment annotation syntax.)
|
||||||
|
|
||||||
``` javascript
|
```
|
||||||
@"I'm an annotation... basically a comment. Ignore me!"
|
;I'm an annotation... basically a comment. Ignore me!
|
||||||
"I'm data! Don't ignore me!"
|
"I'm data! Don't ignore me!"
|
||||||
```
|
```
|
||||||
|
|
||||||
Preserves supports some data types you're probably already familiar
|
Preserves supports some data types you're probably already familiar
|
||||||
with from JSON, and which look fairly similar in the textual format:
|
with from JSON, and which look fairly similar in the textual format:
|
||||||
|
|
||||||
``` javascript
|
```
|
||||||
@"booleans"
|
;booleans
|
||||||
#true
|
#t
|
||||||
#false
|
#f
|
||||||
|
|
||||||
@"various kinds of numbers:"
|
;various kinds of numbers:
|
||||||
42
|
42
|
||||||
123556789012345678901234567890
|
123556789012345678901234567890
|
||||||
-10
|
-10
|
||||||
13.5
|
13.5
|
||||||
|
|
||||||
@"strings"
|
;strings
|
||||||
"I'm feeling stringy!"
|
"I'm feeling stringy!"
|
||||||
|
|
||||||
@"sequences (lists)"
|
;sequences (lists)
|
||||||
["cat", "dog", "mouse", "goldfish"]
|
["cat", "dog", "mouse", "goldfish"]
|
||||||
|
|
||||||
@"dictionaries (hashmaps)"
|
;dictionaries (hashmaps)
|
||||||
{"cat": "meow",
|
{"cat": "meow",
|
||||||
"dog": "woof",
|
"dog": "woof",
|
||||||
"goldfish": "glub glub",
|
"goldfish": "glub glub",
|
||||||
|
@ -90,16 +90,16 @@ with from JSON, and which look fairly similar in the textual format:
|
||||||
## Going beyond JSON
|
## Going beyond JSON
|
||||||
|
|
||||||
We can observe a few differences from JSON already; it's possible to
|
We can observe a few differences from JSON already; it's possible to
|
||||||
express numbers of arbitrary length in Preserves, and booleans look a little
|
*reliably* express integers of arbitrary length in Preserves, and booleans look a little
|
||||||
bit different.
|
bit different.
|
||||||
A few more interesting differences:
|
A few more interesting differences:
|
||||||
|
|
||||||
``` javascript
|
```
|
||||||
@"Preserves treats commas as whitespace, so these are the same"
|
;Preserves treats commas as whitespace, so these are the same
|
||||||
["cat", "dog", "mouse", "goldfish"]
|
["cat", "dog", "mouse", "goldfish"]
|
||||||
["cat" "dog" "mouse" "goldfish"]
|
["cat" "dog" "mouse" "goldfish"]
|
||||||
|
|
||||||
@"We can use anything as keys in dictionaries, not just strings"
|
;We can use anything as keys in dictionaries, not just strings
|
||||||
{1: "the loneliest number",
|
{1: "the loneliest number",
|
||||||
["why", "was", 6, "afraid", "of", 7]: "because 7 8 9",
|
["why", "was", 6, "afraid", "of", 7]: "because 7 8 9",
|
||||||
{"dictionaries": "as keys???"}: "well, why not?"}
|
{"dictionaries": "as keys???"}: "well, why not?"}
|
||||||
|
@ -107,17 +107,17 @@ A few more interesting differences:
|
||||||
|
|
||||||
Preserves technically provides a few types of numbers:
|
Preserves technically provides a few types of numbers:
|
||||||
|
|
||||||
``` javascript
|
```
|
||||||
@"Signed Integers"
|
;Signed Integers
|
||||||
42
|
42
|
||||||
-42
|
-42
|
||||||
5907212309572059846509324862304968273468909473609826340
|
5907212309572059846509324862304968273468909473609826340
|
||||||
-5907212309572059846509324862304968273468909473609826340
|
-5907212309572059846509324862304968273468909473609826340
|
||||||
|
|
||||||
@"Floats (Single-precision IEEE floats) (notice the trailing f)"
|
;Floats (Single-precision IEEE floats) (notice the trailing f)
|
||||||
3.1415927f
|
3.1415927f
|
||||||
|
|
||||||
@"Doubles (Double-precision IEEE floats)"
|
;Doubles (Double-precision IEEE floats)
|
||||||
3.141592653589793
|
3.141592653589793
|
||||||
```
|
```
|
||||||
|
|
||||||
|
@ -129,33 +129,33 @@ Often they're meant to be used for something that has symbolic importance
|
||||||
to the program, but not textual importance (other than to guide the
|
to the program, but not textual importance (other than to guide the
|
||||||
programmer… not unlike variable names).
|
programmer… not unlike variable names).
|
||||||
|
|
||||||
``` javascript
|
```
|
||||||
@"A symbol (NOT a string!)"
|
;A symbol (NOT a string!)
|
||||||
JustASymbol
|
JustASymbol
|
||||||
|
|
||||||
@"You can do mixedCase or CamelCase too of course, pick your poison"
|
;You can do mixedCase or CamelCase too of course, pick your poison
|
||||||
@"(but be consistent, for the sake of your collaborators!"
|
;(but be consistent, for the sake of your collaborators!)
|
||||||
iAmASymbol
|
iAmASymbol
|
||||||
i-am-a-symbol
|
i-am-a-symbol
|
||||||
|
|
||||||
@"A list of symbols"
|
;A list of symbols
|
||||||
[GET, PUT, POST, DELETE]
|
[GET, PUT, POST, DELETE]
|
||||||
|
|
||||||
@"A symbol with spaces in it"
|
;A symbol with spaces in it
|
||||||
|this is just one symbol believe it or not|
|
|this is just one symbol believe it or not|
|
||||||
```
|
```
|
||||||
|
|
||||||
We can also add binary data, aka ByteStrings:
|
We can also add binary data, aka ByteStrings:
|
||||||
|
|
||||||
``` javascript
|
```
|
||||||
@"Some binary data, base64 encoded"
|
;Some binary data, base64 encoded
|
||||||
#base64{cGljdHVyZSBvZiBhIGNhdA==}
|
#[cGljdHVyZSBvZiBhIGNhdA==]
|
||||||
|
|
||||||
@"Some other binary data, hexadecimal encoded"
|
;Some other binary data, hexadecimal encoded
|
||||||
#hex{616263}
|
#x"616263"
|
||||||
|
|
||||||
@"Same binary data as above, base64 encoded"
|
;Same binary data as above, base64 encoded
|
||||||
#base64{YWJj}
|
#[YWJj]
|
||||||
```
|
```
|
||||||
|
|
||||||
What's neat about this is that we don't have to "pay the cost" of
|
What's neat about this is that we don't have to "pay the cost" of
|
||||||
|
@ -165,48 +165,41 @@ the length of the binary data is the length of the binary data.
|
||||||
Conveniently, Preserves also includes Sets, which are collections of
|
Conveniently, Preserves also includes Sets, which are collections of
|
||||||
unique elements where ordering of items is unimportant.
|
unique elements where ordering of items is unimportant.
|
||||||
|
|
||||||
``` javascript
|
```
|
||||||
#set{flour, salt, water}
|
#{flour, salt, water}
|
||||||
```
|
```
|
||||||
|
|
||||||
<a id="orgefafe56"></a>
|
<a id="orgefafe56"></a>
|
||||||
|
|
||||||
## Total ordering and canonicalization
|
## Canonicalization
|
||||||
|
|
||||||
This is a good time to mention that even though from a semantic
|
This is a good time to mention that even though from a semantic
|
||||||
perspective sets and dictionaries do not carry information about the
|
perspective sets and dictionaries do not carry information about the
|
||||||
ordering of their elements (and Preserves doesn't care what order we
|
ordering of their elements (and Preserves doesn't care what order we
|
||||||
enter them in for our hand-written-as-text Preserves documents),
|
enter them in for our hand-written-as-text Preserves documents),
|
||||||
Preserves has a well-defined "total ordering".
|
[Preserves provides support for canonical ordering](canonical-binary.html)
|
||||||
|
when serializing.
|
||||||
|
|
||||||
Based on this total ordering, Preserves provides support for canonical
|
In canonicalizing output mode, Preserves will always write out a given
|
||||||
ordering when serializing; in this mode, Preserves will always write
|
value using exactly the same bytes, every time. This is important and
|
||||||
out the elements in the same order, every time.
|
useful for many contexts, but especially for cryptographic signatures
|
||||||
When combined with binary serialization, this is Preserves' "canonical
|
and hashing.
|
||||||
form".
|
|
||||||
This is important and useful for many contexts, but especially for
|
|
||||||
cryptographic signatures and hashing.
|
|
||||||
|
|
||||||
``` javascript
|
```
|
||||||
@"This hand-typed Preserves document..."
|
;This hand-typed Preserves document...
|
||||||
{monkey: {"noise": "ooh-ooh",
|
{monkey: {"noise": "ooh-ooh",
|
||||||
"eats": #set{"bananas", "berries"}}
|
"eats": #{"bananas", "berries"}}
|
||||||
cat: {"noise": "meow",
|
cat: {"noise": "meow",
|
||||||
"eats": #set{"kibble", "cat treats", "tinned meat"}}}
|
"eats": #{"kibble", "cat treats", "tinned meat"}}}
|
||||||
|
|
||||||
@"Will always, always be written out in this order when canonicalized:"
|
;Will always, always be written out in this order (except in
|
||||||
{cat: {"eats": #set{"cat treats", "kibble", "tinned meat"},
|
;binary, of course) when canonicalized:
|
||||||
|
{cat: {"eats": #{"cat treats", "kibble", "tinned meat"},
|
||||||
"noise": "meow"}
|
"noise": "meow"}
|
||||||
monkey: {"eats": #set{"bananas", "berries"},
|
monkey: {"eats": #{"bananas", "berries"},
|
||||||
"noise": "ooh-ooh"}}
|
"noise": "ooh-ooh"}}
|
||||||
```
|
```
|
||||||
|
|
||||||
Clever implementations can get canonicalized output for free by
|
|
||||||
carefully ordering set elements and dictionary entries at construction
|
|
||||||
time, but even in simple implementations, canonical serialization is
|
|
||||||
almost as cheap as normal serialization.
|
|
||||||
|
|
||||||
|
|
||||||
<a id="org0366627"></a>
|
<a id="org0366627"></a>
|
||||||
|
|
||||||
## Defining our own types using Records
|
## Defining our own types using Records
|
||||||
|
@ -216,7 +209,7 @@ sense, it's a meta-type.
|
||||||
`Record` objects have a label and a series of arguments (or "fields").
|
`Record` objects have a label and a series of arguments (or "fields").
|
||||||
For example, we can make a `Date` record:
|
For example, we can make a `Date` record:
|
||||||
|
|
||||||
``` javascript
|
```
|
||||||
<Date 2019 8 15>
|
<Date 2019 8 15>
|
||||||
```
|
```
|
||||||
|
|
||||||
|
@ -228,7 +221,7 @@ We could instead just decide to encode our date data in a string,
|
||||||
like "2019-08-15".
|
like "2019-08-15".
|
||||||
A document using such a date structure might look like so:
|
A document using such a date structure might look like so:
|
||||||
|
|
||||||
``` javascript
|
```
|
||||||
{"name": "Gregor Samsa",
|
{"name": "Gregor Samsa",
|
||||||
"description": "humanoid trapped in an insect body",
|
"description": "humanoid trapped in an insect body",
|
||||||
"born": "1915-10-04"}
|
"born": "1915-10-04"}
|
||||||
|
@ -243,13 +236,13 @@ know the date exactly.
|
||||||
This causes a problem.
|
This causes a problem.
|
||||||
Now we might have two kinds of entries:
|
Now we might have two kinds of entries:
|
||||||
|
|
||||||
``` javascript
|
```
|
||||||
@"Exact date known"
|
;Exact date known
|
||||||
{"name": "Gregor Samsa",
|
{"name": "Gregor Samsa",
|
||||||
"description": "humanoid trapped in an insect body",
|
"description": "humanoid trapped in an insect body",
|
||||||
"born": "1915-10-04"}
|
"born": "1915-10-04"}
|
||||||
|
|
||||||
@"Not sure about exact date..."
|
;Not sure about exact date...
|
||||||
{"name": "Gregor Samsa",
|
{"name": "Gregor Samsa",
|
||||||
"description": "humanoid trapped in an insect body",
|
"description": "humanoid trapped in an insect body",
|
||||||
"born": "Sometime in October 1915? Or was that when he became an insect?"}
|
"born": "Sometime in October 1915? Or was that when he became an insect?"}
|
||||||
|
@ -261,13 +254,13 @@ like a date", but doing this kind of thing is prone to errors and weird
|
||||||
edge cases.
|
edge cases.
|
||||||
No, it's better to be able to have a separate type:
|
No, it's better to be able to have a separate type:
|
||||||
|
|
||||||
``` javascript
|
```
|
||||||
@"Exact date known"
|
;Exact date known
|
||||||
{"name": "Gregor Samsa",
|
{"name": "Gregor Samsa",
|
||||||
"description": "humanoid trapped in an insect body",
|
"description": "humanoid trapped in an insect body",
|
||||||
"born": <Date 1915 10 04>}
|
"born": <Date 1915 10 04>}
|
||||||
|
|
||||||
@"Not sure about exact date..."
|
;Not sure about exact date...
|
||||||
{"name": "Gregor Samsa",
|
{"name": "Gregor Samsa",
|
||||||
"description": "humanoid trapped in an insect body",
|
"description": "humanoid trapped in an insect body",
|
||||||
"born": <Unknown "Sometime in October 1915? Or was that when he became an insect?">}
|
"born": <Unknown "Sometime in October 1915? Or was that when he became an insect?">}
|
||||||
|
@ -285,7 +278,7 @@ the meaning the label signifies for it to be of use.
|
||||||
Still, there are plenty of interesting labels we can define.
|
Still, there are plenty of interesting labels we can define.
|
||||||
Here is one for an "iri", a hyperlink:
|
Here is one for an "iri", a hyperlink:
|
||||||
|
|
||||||
``` javascript
|
```
|
||||||
<iri "https://dustycloud.org/blog/">
|
<iri "https://dustycloud.org/blog/">
|
||||||
```
|
```
|
||||||
|
|
||||||
|
@ -294,11 +287,11 @@ Records are usually symbols but aren't necessarily so.
|
||||||
They can also be strings or numbers or even dictionaries.
|
They can also be strings or numbers or even dictionaries.
|
||||||
And very interestingly, they can also be other records:
|
And very interestingly, they can also be other records:
|
||||||
|
|
||||||
``` javascript
|
```
|
||||||
<<iri "https://www.w3.org/ns/activitystreams#Note">
|
< <iri "https://www.w3.org/ns/activitystreams#Note">
|
||||||
{"to": [<iri "https://chatty.example/ben/">],
|
{"to": [<iri "https://chatty.example/ben/">],
|
||||||
"attributedTo": <iri "https://social.example/alyssa/">,
|
"attributedTo": <iri "https://social.example/alyssa/">,
|
||||||
"content": "Say, did you finish reading that book I lent you?"}>
|
"content": "Say, did you finish reading that book I lent you?"} >
|
||||||
```
|
```
|
||||||
|
|
||||||
Do you see it? This Record's label is… an `iri` Record!
|
Do you see it? This Record's label is… an `iri` Record!
|
||||||
|
@ -327,16 +320,18 @@ Annotations are not strictly a necessary feature, but they are useful
|
||||||
in some circumstances.
|
in some circumstances.
|
||||||
We have previously shown them used as comments:
|
We have previously shown them used as comments:
|
||||||
|
|
||||||
``` javascript
|
```
|
||||||
@"I'm a comment!"
|
;I'm a comment!
|
||||||
"I am not a comment, I am data!"
|
"I am not a comment, I am data!"
|
||||||
```
|
```
|
||||||
|
|
||||||
Annotations annotate the values they precede.
|
Annotations annotate the values they precede.
|
||||||
It is possible to have multiple annotations on a value.
|
It is possible to have multiple annotations on a value.
|
||||||
|
The `;`-based comment syntax is syntactic sugar for the general
|
||||||
|
`@`-prefixed string annotation syntax.
|
||||||
|
|
||||||
``` javascript
|
```
|
||||||
@"I am annotating this number"
|
;I am annotating this number
|
||||||
@"And so am I!"
|
@"And so am I!"
|
||||||
42
|
42
|
||||||
```
|
```
|
||||||
|
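The tutorial text in this hunk says the `;` comment form is sugar for an `@`-prefixed string annotation. A minimal sketch of that rewrite follows. The helper name `desugar_comment` is hypothetical and is not part of any Preserves implementation. JSON string escaping is assumed to be sufficient here, because the specification describes string syntax as the same as JSON's.

```python
import json


def desugar_comment(line: str) -> str:
    """Rewrite one `;comment` line as the equivalent `@"comment"` annotation.

    Hypothetical helper for illustration only. Everything between the `;`
    and the end of the line becomes the annotation string.
    """
    stripped = line.lstrip()
    if not stripped.startswith(";"):
        raise ValueError("not a `;` comment line")
    # Drop the leading `;` and the line terminator, keep everything else.
    text = stripped[1:].rstrip("\r\n")
    # Assumption: JSON escaping yields a valid Preserves string literal.
    return "@" + json.dumps(text, ensure_ascii=False)


print(desugar_comment(";I am annotating this number\n"))
```

Under these assumptions the two annotations in the example above, one written with `;` and one with `@"..."`, are the same kind of string annotation on `42`.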
@ -349,7 +344,7 @@ Many implementations will, in the same mode, also supply line number
|
||||||
and column information attached to each read value.
|
and column information attached to each read value.
|
||||||
|
|
||||||
So what's the point of them then?
|
So what's the point of them then?
|
||||||
If annotations were just for comments, there would be indeed hardly
|
If annotations were just for comments, there would be indeed hardly any
|
||||||
point at all… it would be simpler to just provide a comment syntax.
|
point at all… it would be simpler to just provide a comment syntax.
|
||||||
|
|
||||||
However, annotations can be used for more than just comments.
|
However, annotations can be used for more than just comments.
|
||||||
|
@ -360,13 +355,17 @@ For instance, here's a reply from an HTTP API service running in
|
||||||
"debug" mode annotated with the time it took to produce the reply and
|
"debug" mode annotated with the time it took to produce the reply and
|
||||||
the internal name of the server that produced the response:
|
the internal name of the server that produced the response:
|
||||||
|
|
||||||
``` javascript
|
```
|
||||||
@<ResponseTime <Milliseconds 64.4>>
|
@<ResponseTime <Milliseconds 64.4>>
|
||||||
@<BackendServer "humpty-dumpty.example.com">
|
@<BackendServer "humpty-dumpty.example.com">
|
||||||
<Success
|
<Success
|
||||||
<Employees [
|
<Employees [
|
||||||
<Employee "Alyssa P. Hacker" #set{<Role Programmer>, <Role Manager>}, <Date 2018, 1, 24>>
|
<Employee "Alyssa P. Hacker"
|
||||||
<Employee "Ben Bitdiddle" #set{<Role Programmer>}, <Date 2019, 2, 13>> ]>>
|
#{<Role Programmer>, <Role Manager>}
|
||||||
|
<Date 2018, 1, 24>>
|
||||||
|
<Employee "Ben Bitdiddle"
|
||||||
|
#{<Role Programmer>}
|
||||||
|
<Date 2019, 2, 13>> ]>>
|
||||||
```
|
```
|
||||||
|
|
||||||
The annotations aren't related to the data requested, which is all
|
The annotations aren't related to the data requested, which is all
|
||||||
|
|
|
@ -20,22 +20,17 @@ are equal.
|
||||||
This document specifies canonical form for the Preserves compact
|
This document specifies canonical form for the Preserves compact
|
||||||
binary syntax.
|
binary syntax.
|
||||||
|
|
||||||
**General rules.**
|
**Annotations.**
|
||||||
Streaming formats ("format C") *MUST NOT* be used.
|
|
||||||
Annotations *MUST NOT* be present.
|
Annotations *MUST NOT* be present.
|
||||||
Whenever there is a choice between fixed-length ("format A") or
|
|
||||||
variable-length ("format B") formats, the fixed-length format *MUST* be
|
|
||||||
used.
|
|
||||||
|
|
||||||
**Sets.**
|
**Sets.**
|
||||||
The elements of a `Set` *MUST* be serialized sorted in ascending order
|
The elements of a `Set` *MUST* be serialized sorted in ascending order
|
||||||
following the total order relation defined in the
|
by comparing their canonical encoded binary representations.
|
||||||
[Preserves specification][spec].
|
|
||||||
|
|
||||||
**Dictionaries.**
|
**Dictionaries.**
|
||||||
The key-value pairs in a `Dictionary` *MUST* be serialized sorted in
|
The key-value pairs in a `Dictionary` *MUST* be serialized sorted in
|
||||||
ascending order by key, following the total order relation defined in
|
ascending order by comparing the canonical encoded binary
|
||||||
the [Preserves specification][spec].[^no-need-for-by-value]
|
representations of their keys.[^no-need-for-by-value]
|
||||||
|
|
||||||
[^no-need-for-by-value]: There is no need to order by (key, value)
|
[^no-need-for-by-value]: There is no need to order by (key, value)
|
||||||
pair, since a `Dictionary` has no duplicate keys.
|
pair, since a `Dictionary` has no duplicate keys.
|
||||||
|
@ -43,7 +38,9 @@ the [Preserves specification][spec].[^no-need-for-by-value]
|
||||||
**Other kinds of `Value`.**
|
**Other kinds of `Value`.**
|
||||||
There are no special canonicalization restrictions on
|
There are no special canonicalization restrictions on
|
||||||
`SignedInteger`s, `String`s, `ByteString`s, `Symbol`s, `Boolean`s,
|
`SignedInteger`s, `String`s, `ByteString`s, `Symbol`s, `Boolean`s,
|
||||||
`Float`s, `Double`s, `Record`s, or `Sequence`s.
|
`Float`s, `Double`s, `Record`s, or `Sequence`s. The constraints given
|
||||||
|
for these `Value`s in the [specification][spec] suffice to ensure
|
||||||
|
canonicity.
|
||||||
|
|
||||||
<!-- Heading to visually offset the footnotes from the main document: -->
|
<!-- Heading to visually offset the footnotes from the main document: -->
|
||||||
## Notes
|
## Notes
|
||||||
|
|
|
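The new ordering rule in the hunks above (sort `Set` elements, and `Dictionary` entries by key, by comparing canonical encoded bytes) can be sketched as below. This is an illustration under stated assumptions, not the reference implementation: the function names are hypothetical, the `B3` symbol tag is taken from the encoded examples elsewhere in this commit, and lengths are limited to a single byte (under 128) so the length encoding need not be guessed.

```python
def encode_symbol(name: str) -> bytes:
    """Encode a Symbol as tag B3, a one-byte length, then the UTF-8 bytes.

    Hypothetical helper; restricted to names shorter than 128 bytes.
    """
    data = name.encode("utf-8")
    if len(data) >= 128:
        raise ValueError("sketch only handles lengths below 128")
    return bytes([0xB3, len(data)]) + data


def canonical_set_order(encoded_elements):
    """Sort already-encoded Set elements by unsigned bytewise comparison."""
    return sorted(encoded_elements)


def canonical_dictionary_order(encoded_pairs):
    """Sort (encoded_key, encoded_value) pairs by the encoded key only."""
    return sorted(encoded_pairs, key=lambda pair: pair[0])


elements = [encode_symbol(s) for s in ("flour", "salt", "water")]
for item in canonical_set_order(elements):
    print(item.hex(" "))
```

Because the comparison is over encoded bytes, the length prefix takes part in it: in this sketch `salt` (length 4) sorts before `flour` (length 5), which differs from alphabetical order.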
@ -65,28 +65,29 @@ interior portions of a tree.
|
||||||
## Comments.
|
## Comments.
|
||||||
|
|
||||||
`String` values used as annotations are conventionally interpreted as
|
`String` values used as annotations are conventionally interpreted as
|
||||||
comments.
|
comments. Special syntax exists for such string annotations, though
|
||||||
|
the usual `@`-prefixed annotation notation can also be used.
|
||||||
|
|
||||||
@"I am a comment for the Dictionary"
|
;I am a comment for the Dictionary
|
||||||
{
|
{
|
||||||
@"I am a comment for the key"
|
;I am a comment for the key
|
||||||
key: @"I am a comment for the value"
|
key: ;I am a comment for the value
|
||||||
value
|
value
|
||||||
}
|
}
|
||||||
|
|
||||||
@"I am a comment for this entire IOList"
|
;I am a comment for this entire IOList
|
||||||
[
|
[
|
||||||
#hex{00010203}
|
#x"00010203"
|
||||||
@"I am a comment for the middle half of the IOList"
|
;I am a comment for the middle half of the IOList
|
||||||
@"A second comment for the same portion of the IOList"
|
;A second comment for the same portion of the IOList
|
||||||
@ @"I am the first and only comment for the following comment"
|
@ ;I am the first and only comment for the following comment
|
||||||
"A third (itself commented!) comment for the same part of the IOList"
|
"A third (itself commented!) comment for the same part of the IOList"
|
||||||
[
|
[
|
||||||
@"I am a comment for the following ByteString"
|
;"I am a comment for the following ByteString"
|
||||||
#hex{04050607}
|
#x"04050607"
|
||||||
#hex{08090A0B}
|
#x"08090A0B"
|
||||||
]
|
]
|
||||||
#hex{0C0D0E0F}
|
#x"0C0D0E0F"
|
||||||
]
|
]
|
||||||
|
|
||||||
## MIME-type tagged binary data.
|
## MIME-type tagged binary data.
|
||||||
|
@ -105,12 +106,17 @@ such media types following the general rules for ordering of
|
||||||
|
|
||||||
**Examples.**
|
**Examples.**
|
||||||
|
|
||||||
| Value | Encoded hexadecimal byte sequence |
|
«<mime application/octet-stream #"abcde">»
|
||||||
|--------------------------------------------|-------------------------------------------------------------------------------------------------------------------|
|
= B4 B3 04 "mime" B3 18 "application/octet-stream" B2 05 "abcde"
|
||||||
| `<mime application/octet-stream #"abcde">` | 83 74 6D 69 6D 65 7F 18 61 70 70 6C 69 63 61 74 69 6F 6E 2F 6F 63 74 65 74 2D 73 74 72 65 61 6D 65 61 62 63 64 65 |
|
|
||||||
| `<mime text/plain #"ABC">` | 83 74 6D 69 6D 65 7A 74 65 78 74 2F 70 6C 61 69 6E 63 41 42 43 |
|
«<mime text/plain #"ABC">»
|
||||||
| `<mime application/xml #"<xhtml/>">` | 83 74 6D 69 6D 65 7F 0F 61 70 70 6C 69 63 61 74 69 6F 6E 2F 78 6D 6C 68 3C 78 68 74 6D 6C 2F 3E |
|
= B4 B3 04 "mime" B3 0A "text/plain" B2 03 "ABC" 84
|
||||||
| `<mime text/csv #"123,234,345">` | 83 74 6D 69 6D 65 78 74 65 78 74 2F 63 73 76 6B 31 32 33 2C 32 33 34 2C 33 34 35 |
|
|
||||||
|
«<mime application/xml #"<xhtml/>">»
|
||||||
|
= B4 B3 04 "mime" B3 0F "application/xml" B2 08 "<xhtml/>" 84
|
||||||
|
|
||||||
|
«<mime text/csv #"123,234,345">»
|
||||||
|
= B4 B3 04 "mime" B3 08 "text/csv" B2 0B "123,234,345" 84
|
||||||
|
|
||||||
## Unicode normalization forms.
|
## Unicode normalization forms.
|
||||||
|
|
||||||
|
|
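The byte layout in the `mime` examples above can be reproduced with a short sketch. The tag values (`B4` to open a record, `B3` for a symbol, `B2` for a byte string, `84` to close) are read off those examples. The function names are hypothetical, and lengths are restricted to a single byte (under 128), which covers every example shown.

```python
def _length_prefixed(tag: int, data: bytes) -> bytes:
    """Emit tag, a one-byte length, then the payload (sketch: length < 128)."""
    if len(data) >= 128:
        raise ValueError("sketch only handles lengths below 128")
    return bytes([tag, len(data)]) + data


def encode_symbol(name: str) -> bytes:
    return _length_prefixed(0xB3, name.encode("utf-8"))


def encode_bytestring(data: bytes) -> bytes:
    return _length_prefixed(0xB2, data)


def encode_record(label: bytes, *fields: bytes) -> bytes:
    """Wrap an encoded label and encoded fields between B4 and 84."""
    return b"\xb4" + label + b"".join(fields) + b"\x84"


def encode_mime(media_type: str, body: bytes) -> bytes:
    """Encode <mime media_type #"body"> following the examples' layout."""
    return encode_record(
        encode_symbol("mime"),
        encode_symbol(media_type),
        encode_bytestring(body),
    )


print(encode_mime("text/plain", b"ABC").hex(" "))
```

The length bytes in the examples match this layout: `text/plain` is 10 bytes (`0A`), `application/xml` is 15 (`0F`), and `application/octet-stream` is 24 (`18`).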
728
preserves.md
|
@ -4,7 +4,7 @@ title: "Preserves: an Expressive Data Language"
|
||||||
---
|
---
|
||||||
|
|
||||||
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
|
Tony Garnock-Jones <tonyg@leastfixedpoint.com>
|
||||||
May 2020. Version 0.0.8.
|
Jan 2021. Version 0.4.0.
|
||||||
|
|
||||||
[sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
|
[sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
|
||||||
[spki]: http://world.std.com/~cme/html/spki.html
|
[spki]: http://world.std.com/~cme/html/spki.html
|
||||||
|
@ -12,6 +12,7 @@ May 2020. Version 0.0.8.
|
||||||
[LEB128]: https://en.wikipedia.org/wiki/LEB128
|
[LEB128]: https://en.wikipedia.org/wiki/LEB128
|
||||||
[erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
|
[erlang-map]: http://erlang.org/doc/reference_manual/data_types.html#map
|
||||||
[abnf]: https://tools.ietf.org/html/rfc7405
|
[abnf]: https://tools.ietf.org/html/rfc7405
|
||||||
|
[canonical]: canonical-binary.html
|
||||||
|
|
||||||
This document proposes a data model and serialization format called
|
This document proposes a data model and serialization format called
|
||||||
*Preserves*.
|
*Preserves*.
|
||||||
|
@ -42,20 +43,20 @@ Our `Value`s fall into two broad categories: *atomic* and *compound*
|
||||||
data. Every `Value` is finite and non-cyclic.
|
data. Every `Value` is finite and non-cyclic.
|
||||||
|
|
||||||
Value = Atom
|
Value = Atom
|
||||||
| Compound
|
| Compound
|
||||||
|
|
||||||
Atom = Boolean
|
Atom = Boolean
|
||||||
| Float
|
| Float
|
||||||
| Double
|
| Double
|
||||||
| SignedInteger
|
| SignedInteger
|
||||||
| String
|
| String
|
||||||
| ByteString
|
| ByteString
|
||||||
| Symbol
|
| Symbol
|
||||||
|
|
||||||
Compound = Record
|
Compound = Record
|
||||||
| Sequence
|
| Sequence
|
||||||
| Set
|
| Set
|
||||||
| Dictionary
|
| Dictionary
|
||||||
|
|
||||||
**Total order.**<a name="total-order"></a> As we go, we will
|
**Total order.**<a name="total-order"></a> As we go, we will
|
||||||
incrementally specify a total order over `Value`s. Two values of the
|
incrementally specify a total order over `Value`s. Two values of the
|
||||||
|
@ -215,14 +216,13 @@ label-`Value` followed by its field-`Value`s.
|
||||||
|
|
||||||
`Sequence`s are enclosed in square brackets. `Dictionary` values are
|
`Sequence`s are enclosed in square brackets. `Dictionary` values are
|
||||||
curly-brace-enclosed colon-separated pairs of values. `Set`s are
|
curly-brace-enclosed colon-separated pairs of values. `Set`s are
|
||||||
written either as one or more values enclosed in curly braces, or zero
|
written as values enclosed by the tokens `#{` and
|
||||||
or more values enclosed by the tokens `#set{` and
|
|
||||||
`}`.[^printing-collections] It is an error for a set to contain
|
`}`.[^printing-collections] It is an error for a set to contain
|
||||||
duplicate elements or for a dictionary to contain duplicate keys.
|
duplicate elements or for a dictionary to contain duplicate keys.
|
||||||
|
|
||||||
Sequence = "[" *Value ws "]"
|
Sequence = "[" *Value ws "]"
|
||||||
Dictionary = "{" *(Value ws ":" Value) ws "}"
|
Dictionary = "{" *(Value ws ":" Value) ws "}"
|
||||||
Set = %s"#set{" *Value ws "}" / "{" 1*Value ws "}"
|
Set = "#{" *Value ws "}"
|
||||||
|
|
||||||
[^printing-collections]: **Implementation note.** When implementing
|
[^printing-collections]: **Implementation note.** When implementing
|
||||||
printing of `Value`s using the textual syntax, consider supporting
|
printing of `Value`s using the textual syntax, consider supporting
|
||||||
|
@ -232,9 +232,10 @@ duplicate elements or for a dictionary to contain duplicate keys.
|
||||||
commas separating, and commas terminating elements or key/value
|
commas separating, and commas terminating elements or key/value
|
||||||
pairs within a collection.
|
pairs within a collection.
|
||||||
|
|
||||||
`Boolean`s are the simple literal strings `#true` and `#false`.
|
`Boolean`s are the simple literal strings `#t` and `#f` for true and
|
||||||
|
false, respectively.
|
||||||
|
|
||||||
Boolean = %s"#true" / %s"#false"
|
Boolean = %s"#t" / %s"#f"
|
||||||
|
|
||||||
Numeric data follow the
|
Numeric data follow the
|
||||||
[JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with
|
[JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with
|
||||||
|
@ -310,9 +311,10 @@ same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]
|
||||||
|
|
||||||
[^escaping-surrogate-pairs]: In particular, note JSON's rules around
|
[^escaping-surrogate-pairs]: In particular, note JSON's rules around
|
||||||
the use of surrogate pairs for code points not in the Basic
|
the use of surrogate pairs for code points not in the Basic
|
||||||
Multilingual Plane. We encourage implementations to avoid escaping
|
Multilingual Plane. We encourage implementations to avoid using
|
||||||
such characters when producing output, and instead to rely on the
|
`\u` escapes when producing output, and instead to rely on the
|
||||||
UTF-8 encoding of the entire document to handle them correctly.
|
UTF-8 encoding of the entire document to handle non-ASCII
|
||||||
|
codepoints correctly.
|
||||||
|
|
||||||
A `ByteString` may be written in any of three different forms.
|
A `ByteString` may be written in any of three different forms.
|
||||||
|
|
||||||
|
@ -327,16 +329,16 @@ value with `\x`.
|
||||||
binunescaped = %x20-21 / %x23-5B / %x5D-7E
|
binunescaped = %x20-21 / %x23-5B / %x5D-7E
|
||||||
|
|
||||||
The second is as a sequence of pairs of hexadecimal digits interleaved
|
The second is as a sequence of pairs of hexadecimal digits interleaved
|
||||||
with whitespace and surrounded by `#hex{` and `}`.
|
with whitespace and surrounded by `#x"` and `"`.
|
||||||
|
|
||||||
ByteString =/ %s"#hex{" *(ws / 2HEXDIG) ws "}"
|
ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22
|
||||||
|
|
||||||
The third is as a sequence of
|
The third is as a sequence of
|
||||||
[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
|
[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
|
||||||
with whitespace and surrounded by `#base64{` and `}`. Plain and
|
with whitespace and surrounded by `#[` and `]`. Plain and URL-safe
|
||||||
URL-safe Base64 characters are allowed.
|
Base64 characters are allowed.
|
||||||
|
|
||||||
ByteString =/ %s"#base64{" *(ws / base64char) ws "}" /
|
ByteString =/ "#[" *(ws / base64char) ws "]" /
|
||||||
base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
|
base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
|
||||||
|
|
||||||
A `Symbol` may be written in a “bare” form[^cf-sexp-token] so long as
|
A `Symbol` may be written in a “bare” form[^cf-sexp-token] so long as
|
||||||
|
@ -365,10 +367,10 @@ double quote mark.
|
||||||
Finally, any `Value` may be represented by escaping from the textual
|
Finally, any `Value` may be represented by escaping from the textual
|
||||||
syntax to the [compact binary syntax](#compact-binary-syntax) by
|
syntax to the [compact binary syntax](#compact-binary-syntax) by
|
||||||
prefixing a `ByteString` containing the binary representation of the
|
prefixing a `ByteString` containing the binary representation of the
|
||||||
`Value` with `#value`.[^rationale-switch-to-binary]
|
`Value` with `#`.[^rationale-switch-to-binary]
|
||||||
[^no-literal-binary-in-text] [^compact-value-annotations]
|
[^no-literal-binary-in-text] [^compact-value-annotations]
|
||||||
|
|
||||||
Compact = %s"#value" ws ByteString
|
Compact = "#" ws ByteString
|
||||||
|
|
||||||
[^rationale-switch-to-binary]: **Rationale.** The textual syntax
|
[^rationale-switch-to-binary]: **Rationale.** The textual syntax
|
||||||
cannot express every `Value`: specifically, it cannot express the
|
cannot express every `Value`: specifically, it cannot express the
|
||||||
|
@ -387,8 +389,8 @@ prefixing a `ByteString` containing the binary representation of the
|
||||||
access the representation of the text from within the text itself.
|
access the representation of the text from within the text itself.
|
||||||
|
|
||||||
[^compact-value-annotations]: Any text-syntax annotations preceding
|
[^compact-value-annotations]: Any text-syntax annotations preceding
|
||||||
the `#value` are prepended to any binary-syntax annotations
|
the `#` are prepended to any binary-syntax annotations yielded by
|
||||||
yielded by decoding the `ByteString`.
|
decoding the `ByteString`.
|
||||||
|
|
||||||
### Annotations.
|
### Annotations.
|
||||||
|
|
||||||
|
@ -403,6 +405,17 @@ Each annotation is preceded by `@`; the underlying annotated value
|
||||||
follows its annotations. Here we extend only the syntactic nonterminal
|
follows its annotations. Here we extend only the syntactic nonterminal
|
||||||
named “`Value`” without altering the semantic class of `Value`s.
|
named “`Value`” without altering the semantic class of `Value`s.
|
||||||
|
|
||||||
|
**Comments.** Strings annotating a `Value` are conventionally
|
||||||
|
interpreted as comments associated with that value. Comments are
|
||||||
|
sufficiently common that special syntax exists for them.
|
||||||
|
|
||||||
|
Value =/ ws
|
||||||
|
";" *(%x00-09 / %x0B-0C / %x0E-%x10FFFF) newline
|
||||||
|
Value
|
||||||
|
|
||||||
|
When written this way, everything between the `;` and the newline is
|
||||||
|
included in the string annotating the `Value`.
|
||||||
|
|
||||||
**Equivalence.** Annotations appear within syntax denoting a `Value`;
|
**Equivalence.** Annotations appear within syntax denoting a `Value`;
|
||||||
however, the annotations are not part of the denoted value. They are
|
however, the annotations are not part of the denoted value. They are
|
||||||
only part of the syntax. Annotations do not play a part in
|
only part of the syntax. Annotations do not play a part in
|
||||||
|
@ -421,86 +434,25 @@ different.
|
||||||
|
|
||||||
## Compact Binary Syntax
|
## Compact Binary Syntax
|
||||||
|
|
||||||
A `Repr` is a binary-syntax encoding, or representation, of either a
|
A `Repr` is a binary-syntax encoding, or representation, of a `Value`.
|
||||||
`Value` or an annotation on a `Repr`.
|
For a value `v`, we write `«v»` for the `Repr` of `v`.
|
||||||
|
|
||||||
Each `Repr` comprises one or more bytes describing the kind of
|
|
||||||
represented information and the length of the representation, followed
|
|
||||||
by the encoded details.
|
|
||||||
|
|
||||||
For a value `v`, we write `[[v]]` for the `Repr` of v.
|
|
||||||
|
|
||||||
### Type and Length representation.
|
### Type and Length representation.
|
||||||
|
|
||||||
Each `Repr` takes one of three possible forms:
|
Each `Repr` starts with a tag byte, describing the kind of information
|
||||||
|
represented. Depending on the tag, a length indicator, further encoded
|
||||||
|
information, and/or an ending tag may follow.
|
||||||
|
|
||||||
- (A) type-specific form, used for simple values such as `Boolean`s
|
tag (simple atomic data and small integers)
|
||||||
or `Float`s as well as for introducing annotations.
|
tag ++ binarydata (most integers)
|
||||||
|
tag ++ length ++ binarydata (large integers, strings, symbols, and binary)
|
||||||
|
tag ++ repr ++ ... ++ endtag (compound data)
|
||||||
|
|
||||||
- (B) a variable-length form with length specified up-front, used for
|
The unique end tag is byte value `0x84`.
|
||||||
compound and variable-length atomic data structures when their
|
|
||||||
sizes are known at the time serialization begins.
|
|
||||||
|
|
||||||
- (C) a variable-length streaming form with unknown or unpredictable
|
If present after a tag, the length of a following piece of binary data
|
||||||
length, used in cases when serialization begins before the number
|
is formatted as a [base 128 varint][varint].[^see-also-leb128] We
|
||||||
of elements or bytes in the corresponding `Value` is known.
|
write `varint(m)` for the varint-encoding of `m`. Quoting the
|
||||||
|
|
||||||
Applications may choose between formats B and C depending on their
|
|
||||||
needs at serialization time.
|
|
||||||
|
|
||||||
#### The lead byte.
|
|
||||||
|
|
||||||
Every `Repr` starts with a *lead byte*, constructed by
|
|
||||||
`leadbyte(t,n,m)`, where `t`,`n`∈{0,1,2,3} and 0≤`m`<16:
|
|
||||||
|
|
||||||
leadbyte(t,n,m) = [t*64 + n*16 + m]
|
|
||||||
|
|
||||||
The arguments `t`, `n` and `m` describe the rest of the
|
|
||||||
representation.[^some-encodings-unused]
|
|
||||||
|
|
||||||
[^some-encodings-unused]: Some encodings are unused. All such
|
|
||||||
encodings are reserved for future versions of this specification.
|
|
||||||
|
|
||||||
| `t` | `n` | `m` | Meaning |
|
|
||||||
| --- | --- | --- | ------- |
|
|
||||||
| 0 | 0 | 0–3 | (format A) An `Atom` with fixed-length binary representation |
|
|
||||||
| 0 | 0 | 4 | (format C) Stream end |
|
|
||||||
| 0 | 0 | 5 | (format A) Annotation |
|
|
||||||
| 0 | 2 | | (format C) Stream start |
|
|
||||||
| 0 | 3 | | (format A) Certain small `SignedInteger`s |
|
|
||||||
| 1 | | | (format B) An `Atom` with variable-length binary representation |
|
|
||||||
| 2 | | | (format B) A `Compound` with variable-length representation |
|
|
||||||
| 3 | 3 | 15 | (format A) 0xFF byte; no-op |
|
|
||||||
|
|
||||||
#### Encoding data of type-specific length (format A).
|
|
||||||
|
|
||||||
Each type of data defines its own rules for this format.
|
|
||||||
|
|
||||||
Of particular note is lead byte `0xFF`, which is a no-op byte acting
|
|
||||||
as a kind of pseudo-whitespace in a binary-syntax encoding.
|
|
||||||
|
|
||||||
#### Encoding data of known length (format B).
|
|
||||||
|
|
||||||
Format B is used where the length `l` of the `Value` to be encoded is
|
|
||||||
known when serialization begins. Format B `Repr`s use `m` in
|
|
||||||
`leadbyte` to encode `l`. The length counts *bytes* for atomic
|
|
||||||
`Value`s, but counts *contained values* for compound `Value`s.
|
|
||||||
|
|
||||||
- A length `l` between 0 and 14 is represented using `leadbyte` with
|
|
||||||
`m=l`.
|
|
||||||
- A length of 15 or greater is represented by `m=15` and additional
|
|
||||||
bytes describing the length following the lead byte.
|
|
||||||
|
|
||||||
The function `header(t,n,m)` yields an appropriate sequence of bytes
|
|
||||||
describing a `Repr`'s type and length when `t`, `n` and `m` are
|
|
||||||
appropriate non-negative integers:
|
|
||||||
|
|
||||||
header(t,n,m) = leadbyte(t,n,m) when m < 15
|
|
||||||
or leadbyte(t,n,15) ++ varint(m) otherwise
|
|
||||||
|
|
||||||
The additional length bytes are formatted as
|
|
||||||
[base 128 varints][varint].[^see-also-leb128] We write `varint(m)` for
|
|
||||||
the varint-encoding of `m`. Quoting the
|
|
||||||
[Google Protocol Buffers][varint] definition,
|
[Google Protocol Buffers][varint] definition,
|
||||||
|
|
||||||
[^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned
|
[^see-also-leb128]: Also known as [LEB128][] encoding, for unsigned
|
||||||
|
@ -515,174 +467,114 @@ the varint-encoding of `m`. Quoting the
|
||||||
|
|
||||||
The following table illustrates varint-encoding.
|
The following table illustrates varint-encoding.
|
||||||
|
|
||||||
| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
|
| Number, `m` | `m` in binary, grouped into 7-bit chunks | `varint(m)` bytes |
|
||||||
| ------ | ------------------- | ------------ |
|
| ------ | ------------------- | ------------ |
|
||||||
| 15 | `0001111` | 15 |
|
| 15 | `0001111` | 15 |
|
||||||
| 300 | `0000010 0101100` | 172 2 |
|
| 300 | `0000010 0101100` | 172 2 |
|
||||||
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
|
| 1000000000 | `0000011 1011100 1101011 0010100 0000000` | 128 148 235 220 3 |
|
||||||
|
|
||||||
It is an error for a varint-encoded `m` in a `Repr` to be anything
|
It is an error for a varint-encoded `m` in a `Repr` to be anything
|
||||||
other than the unique shortest encoding for that `m`. That is, a
|
other than the unique shortest encoding for that `m`. That is, a
|
||||||
varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0. However,
|
varint-encoding of `m` *MUST NOT* end in `0` unless `m`=0.
|
||||||
the `varint(m)` encoding of a length *MUST NOT* be used when `m`<15,
|
|
||||||
meaning that a `Repr` *MUST NOT* contain any varint-encoding with
|
|
||||||
final byte `0`.
|
|
||||||
|
|
||||||
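The varint rules above can be sketched in Python as follows. This is a minimal illustration, not code from any Preserves implementation; the names `encode_varint` and `decode_varint` are hypothetical. The decoder enforces the shortest-encoding rule by rejecting a multi-byte varint whose final byte is `0`.

```python
def encode_varint(m: int) -> bytes:
    """Encode a non-negative integer as a base 128 varint (unsigned LEB128)."""
    if m < 0:
        raise ValueError("varint lengths must be non-negative")
    out = bytearray()
    while True:
        low7 = m & 0x7F
        m >>= 7
        if m:
            out.append(low7 | 0x80)  # more groups follow: set the continuation bit
        else:
            out.append(low7)         # final group: continuation bit clear
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> tuple:
    """Decode a varint starting at data[pos]; return (value, next position)."""
    value = 0
    shift = 0
    start = pos
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not (byte & 0x80):
            break
    # Shortest-form rule: the final byte may be 0 only when encoding m = 0.
    if byte == 0 and pos - start > 1:
        raise ValueError("non-canonical varint")
    return value, pos
```

The three rows of the table above serve as a quick check: 15, 300 and 1000000000 should yield the byte sequences listed there.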
#### Streaming data of unknown length (format C).
|
### Records, Sequences, Sets and Dictionaries.
|
||||||
|
|
||||||
A `Repr` where the length of the `Value` to be encoded is variable and
|
«<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
|
||||||
not known at the time serialization of the `Value` starts is encoded
|
«[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
|
||||||
by a single Stream Start (“open”) byte, followed by zero or more
|
«#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
|
||||||
*chunks*, followed by a matching Stream End (“close”) byte:
|
«{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]
|
||||||
|
|
||||||
open(t,n) = leadbyte(0,2, t*4 + n) = [0x20 + t*4 + n]
|
|
||||||
close() = leadbyte(0,0, 4) = [0x04]
|
|
||||||
|
|
||||||
For a format C `Repr` of an atomic `Value`, each chunk is to be a
|
|
||||||
format B `Repr` of a `ByteString`, no matter the type of the overall
|
|
||||||
`Value`. Annotations are not allowed on these individual chunks.
|
|
||||||
|
|
||||||
For a format C `Repr` of a compound `Value`, each chunk is to be a
|
|
||||||
single `Repr`, which may itself be annotated.
|
|
||||||
|
|
||||||
Each chunk within a format C `Repr` *MUST* have non-zero length.
|
|
||||||
Software that decodes `Repr`s *MUST* reject `Repr`s that include
|
|
||||||
zero-length chunks.
|
|
||||||
|
|
||||||
### Records.
|
|
||||||
|
|
||||||
Format B (known length):
|
|
||||||
|
|
||||||
[[ <L F_1...F_m> ]] = header(2,0,m+1) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]]
|
|
||||||
|
|
||||||
For `m` fields, `m+1` is supplied to `header`, to account for the
|
|
||||||
encoding of the record label.
|
|
||||||
|
|
||||||
Format C (streaming):
|
|
||||||
|
|
||||||
[[ <L F_1...F_m> ]] = open(2,0) ++ [[L]] ++ [[F_1]] ++...++ [[F_m]] ++ close()
|
|
||||||
|
|
||||||
Applications *SHOULD* prefer the known-length format for encoding
|
|
||||||
`Record`s.
|
|
||||||
|
|
||||||
### Sequences, Sets and Dictionaries.
|
|
||||||
|
|
||||||
Format B (known length):
|
|
||||||
|
|
||||||
[[ [X_1...X_m] ]] = header(2,1,m) ++ [[X_1]] ++...++ [[X_m]]
|
|
||||||
[[ #set{X_1...X_m} ]] = header(2,2,m) ++ [[X_1]] ++...++ [[X_m]]
|
|
||||||
[[ {K_1:V_1...K_m:V_m} ]] = header(2,3,m*2) ++ [[K_1]] ++ [[V_1]] ++...
|
|
||||||
++ [[K_m]] ++ [[V_m]]
|
|
||||||
|
|
||||||
Note that `m*2` is given to `header` for a `Dictionary`, since there
|
|
||||||
are two `Value`s in each key-value pair.
|
|
||||||
|
|
||||||
Format C (streaming):
|
|
||||||
|
|
||||||
[[ [X_1...X_m] ]] = open(2,1) ++ [[X_1]] ++...++ [[X_m]] ++ close()
|
|
||||||
[[ #set{E_1...E_m} ]] = open(2,2) ++ [[E_1]] ++...++ [[E_m]] ++ close()
|
|
||||||
[[ {K_1:V_1...K_m:V_m} ]] = open(2,3) ++ [[K_1]] ++ [[V_1]] ++...
|
|
||||||
++ [[K_m]] ++ [[V_m]] ++ close()
|
|
||||||
|
|
||||||
Applications may use whichever format suits their needs on a
|
|
||||||
case-by-case basis.
|
|
||||||
|
|
||||||
There is *no* ordering requirement on the `E_i` elements or
|
There is *no* ordering requirement on the `E_i` elements or
|
||||||
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
|
`K_i`/`V_i` pairs.[^no-sorting-rationale] They may appear in any
|
||||||
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct.
|
order. However, the `E_i` and `K_i` *MUST* be pairwise distinct. In
|
||||||
|
addition, implementations *SHOULD* default to writing set elements and
|
||||||
|
dictionary key/value pairs in order sorted lexicographically by their
|
||||||
|
`Repr`s[^not-sorted-semantically], and *MAY* offer the option of
|
||||||
|
serializing in some other implementation-defined order.
|
||||||
|
|
||||||
[^no-sorting-rationale]: In the BitTorrent encoding format,
|
[^no-sorting-rationale]: In the BitTorrent encoding format,
|
||||||
[bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
|
[bencoding](http://www.bittorrent.org/beps/bep_0003.html#bencoding),
|
||||||
dictionary key/value pairs must be sorted by key. This is a
|
dictionary key/value pairs must be sorted by key. This is a
|
||||||
necessary step for ensuring serialization of `Value`s is
|
necessary step for ensuring serialization of `Value`s is
|
||||||
canonical. We do not require that key/value pairs (or set
|
canonical. We do not require that key/value pairs (or set
|
||||||
elements) be in sorted order for serialized `Value`s, because (a)
|
elements) be in sorted order for serialized `Value`s; however, a
|
||||||
where canonicalization is used for cryptographic signatures, it is
|
[canonical form][canonical] for `Repr`s does exist where a sorted
|
||||||
more reliable to simply retain the exact binary form of the signed
|
ordering is required.
|
||||||
document than to depend on canonical de- and re-serialization, and
|
|
||||||
(b) sorting keys or elements makes no sense in streaming
|
|
||||||
serialization formats.
|
|
||||||
|
|
||||||
However, a quality implementation may wish to offer the programmer
|
[^not-sorted-semantically]: It's important to note that the sort
|
||||||
the option of serializing with set elements and dictionary keys in
|
ordering for writing out set elements and dictionary key/value
|
||||||
sorted order.
|
pairs is *not* the same as the sort ordering implied by the
|
||||||
|
semantic ordering of those elements or keys. For example, the
|
||||||
|
`Repr` of a negative number very far from zero will start with
|
||||||
|
a byte that is *greater* than the byte which starts the `Repr` of
|
||||||
|
zero, making it sort lexicographically later by `Repr`, despite
|
||||||
|
being semantically *less than* zero.
|
||||||
|
|
||||||
|
**Rationale**. This is for ease-of-implementation reasons: not all
|
||||||
|
languages can easily represent sorted sets or sorted dictionaries,
|
||||||
|
but encoding and then sorting byte strings is much more likely to
|
||||||
|
be within easy reach.
|
||||||
|
|
||||||
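The recommended default ordering can be sketched as below, assuming the set elements have already been encoded to `Repr`s. The function name `encode_set` is hypothetical, and the duplicate check only catches byte-identical `Repr`s; it is an illustration rather than a full distinctness test on `Value`s.

```python
def encode_set(element_reprs):
    """Wrap already-encoded elements in the set tags, sorted bytewise by Repr."""
    ordered = sorted(element_reprs)  # bytes objects compare lexicographically
    if any(a == b for a, b in zip(ordered, ordered[1:])):
        raise ValueError("set elements must be pairwise distinct")
    return b"\xB6" + b"".join(ordered) + b"\x84"


# Reprs taken from the SignedInteger examples in this document.
zero = bytes([0x90])
minus_one = bytes([0x9F])
minus_257 = bytes([0xA1, 0xFE, 0xFF])

encoded = encode_set([minus_257, zero, minus_one])
```

Here both negative numbers are written after zero, because their `Repr`s start with larger bytes: the order is by encoded bytes, not by numeric value.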
### SignedIntegers.
|
### SignedIntegers.
|
||||||
|
|
||||||
Format B/A (known length/fixed-size):
|
«x» when x ∈ SignedInteger = [0xB0] ++ varint(m) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m>16
|
||||||
|
([0xA0] + m - 1) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m≤16
|
||||||
|
([0xA0] + x) if (-3≤x≤-1)
|
||||||
|
([0x90] + x) if ( 0≤x≤12)
|
||||||
|
where m = |intbytes(x)|
|
||||||
|
|
||||||
[[ x ]] when x ∈ SignedInteger = header(1,0,m) ++ intbytes(x) if x<-3 ∨ 13≤x
|
Integers in the range [-3,12] are compactly represented with tags
|
||||||
header(0,3,x+16) if -3≤x<0
|
between `0x90` and `0x9F` because they are so frequently used.
|
||||||
header(0,3,x) if 0≤x<13
|
Integers up to 16 bytes long are represented with a single-byte tag
|
||||||
|
encoding the length of the integer. Larger integers are represented
|
||||||
Integers in the range [-3,12] are compactly represented using format A
|
with an explicit varint length. Every `SignedInteger` *MUST* be
|
||||||
because they are so frequently used. Other integers are represented
|
represented with its shortest possible encoding.
|
||||||
using format B.
|
|
||||||
|
|
||||||
Format C *MUST NOT* be used for `SignedInteger`s. Format A *MUST* be
|
|
||||||
used for integers in the range -3 to 12, inclusive.
|
|
||||||
|
|
||||||
The function `intbytes(x)` gives the big-endian two's-complement
|
The function `intbytes(x)` gives the big-endian two's-complement
|
||||||
binary representation of `x`, taking exactly as many whole bytes as
|
binary representation of `x`, taking exactly as many whole bytes as
|
||||||
needed to unambiguously identify the value and its sign, and `m =
|
needed to unambiguously identify the value and its sign, and `m =
|
||||||
|intbytes(x)|`. The most-significant bit in the first byte in
|
|intbytes(x)|`. The most-significant bit in the first byte in
|
||||||
`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes]
|
`intbytes(x)` <!-- for `x`≠0 --> is the sign bit.[^zero-intbytes] For
|
||||||
|
example,
|
||||||
|
|
||||||
|
«87112285931760246646623899502532662132736»
|
||||||
|
= B0 12 01 00 00 00 00 00 00 00
|
||||||
|
00 00 00 00 00 00 00 00
|
||||||
|
00 00
|
||||||
|
|
||||||
|
«-257» = A1 FE FF «-3» = 9D «128» = A1 00 80
|
||||||
|
«-256» = A1 FF 00 «-2» = 9E «255» = A1 00 FF
|
||||||
|
«-255» = A1 FF 01 «-1» = 9F «256» = A1 01 00
|
||||||
|
«-254» = A1 FF 02 «0» = 90 «32767» = A1 7F FF
|
||||||
|
«-129» = A1 FF 7F «1» = 91 «32768» = A2 00 80 00
|
||||||
|
«-128» = A0 80 «12» = 9C «65535» = A2 00 FF FF
|
||||||
|
«-127» = A0 81 «13» = A0 0D «65536» = A2 01 00 00
|
||||||
|
«-4» = A0 FC «127» = A0 7F «131072» = A2 02 00 00
|
||||||
|
|
||||||
[^zero-intbytes]: The value 0 needs zero bytes to identify the
|
[^zero-intbytes]: The value 0 needs zero bytes to identify the
|
||||||
value, so `intbytes(0)` is the empty byte string. Non-zero values
|
value, so `intbytes(0)` is the empty byte string. Non-zero values
|
||||||
need at least one byte.
|
need at least one byte.
|
||||||
|
|
||||||
For example,
|
|
||||||
|
|
||||||
[[ -257 ]] = 42 FE FF [[ -3 ]] = 3D [[ 128 ]] = 42 00 80
|
|
||||||
[[ -256 ]] = 42 FF 00 [[ -2 ]] = 3E [[ 255 ]] = 42 00 FF
|
|
||||||
[[ -255 ]] = 42 FF 01 [[ -1 ]] = 3F [[ 256 ]] = 42 01 00
|
|
||||||
[[ -254 ]] = 42 FF 02 [[ 0 ]] = 30 [[ 32767 ]] = 42 7F FF
|
|
||||||
[[ -129 ]] = 42 FF 7F [[ 1 ]] = 31 [[ 32768 ]] = 43 00 80 00
|
|
||||||
[[ -128 ]] = 41 80 [[ 12 ]] = 3C [[ 65535 ]] = 43 00 FF FF
|
|
||||||
[[ -127 ]] = 41 81 [[ 13 ]] = 41 0D [[ 65536 ]] = 43 01 00 00
|
|
||||||
[[ -4 ]] = 41 FC [[ 127 ]] = 41 7F [[ 131072 ]] = 43 02 00 00
|
|
||||||
|
|
||||||
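A minimal Python sketch of the new `SignedInteger` rules follows. The function names are hypothetical and the code is an illustration of the tag scheme described above (`0x90`-`0x9F` for small integers, `0xA0`-`0xAF` for bodies of 1 to 16 bytes, `0xB0` plus a varint length otherwise), not a reference implementation.

```python
def varint(m: int) -> bytes:
    """Base 128 varint encoding of a non-negative length."""
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement form; empty for zero."""
    if x == 0:
        return b""
    magnitude = x if x > 0 else ~x
    nbits = magnitude.bit_length() + 1  # one extra bit for the sign
    return x.to_bytes((nbits + 7) // 8, "big", signed=True)


def encode_signed_integer(x: int) -> bytes:
    if -3 <= x <= -1:
        return bytes([0xA0 + x])          # 0x9D, 0x9E, 0x9F
    if 0 <= x <= 12:
        return bytes([0x90 + x])          # 0x90 .. 0x9C
    body = intbytes(x)
    m = len(body)
    if m <= 16:
        return bytes([0xA0 + m - 1]) + body
    return b"\xB0" + varint(m) + body
```

Feeding the values from the example table through `encode_signed_integer` is a convenient way to check an implementation; for instance, `-257` should give `A1 FE FF` and `13` should give `A0 0D`.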
### Strings, ByteStrings and Symbols.
|
### Strings, ByteStrings and Symbols.
|
||||||
|
|
||||||
Syntax for these three types varies only in the value of `n` supplied
|
Syntax for these three types varies only in the tag used. For `String`
|
||||||
to `header` and `open`. In each case, the payload following the header
|
and `Symbol`, the data following the tag is a UTF-8 encoding of the
|
||||||
is a binary sequence; for `String` and `Symbol`, it is a UTF-8
|
`Value`'s code points, while for `ByteString` it is the raw data
|
||||||
encoding of the `Value`'s code points, while for `ByteString` it is
|
contained within the `Value` unmodified.
|
||||||
the raw data contained within the `Value` unmodified.
|
|
||||||
|
|
||||||
Format B (known length):
|
«S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ String
|
||||||
|
[0xB2] ++ varint(|S|) ++ S if S ∈ ByteString
|
||||||
|
[0xB3] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ Symbol
|
||||||
|
|
||||||
[[ S ]] = header(1,n,m) ++ encode(S)
|
### Booleans.
|
||||||
where m = |encode(S)|
|
|
||||||
and (n,encode(S)) = (1,utf8(S)) if S ∈ String
|
|
||||||
(2,S) if S ∈ ByteString
|
|
||||||
(3,utf8(S)) if S ∈ Symbol
|
|
||||||
|
|
||||||
To stream a `String`, `ByteString` or `Symbol`, emit `open(1,n)` and
|
«#f» = [0x80]
|
||||||
then a sequence of zero or more format B chunks, followed by
|
«#t» = [0x81]
|
||||||
`close()`. Every chunk must be a `ByteString`, and no chunk may be
|
|
||||||
annotated.
|
|
||||||
|
|
||||||
While the overall content of a streamed `String` or `Symbol` must be
|
### Floats and Doubles.
|
||||||
valid UTF-8, individual chunks do not have to conform to UTF-8.
|
|
||||||
|
|
||||||
### Fixed-length Atoms.
|
«F» when F ∈ Float = [0x82] ++ binary32(F)
|
||||||
|
«D» when D ∈ Double = [0x83] ++ binary64(D)
|
||||||
Fixed-length atoms all use format A, and do not have a length
|
|
||||||
representation. They repurpose the bits that format B `Repr`s use to
|
|
||||||
specify lengths. Applications *MUST NOT* use format C with `open(0,n)`
|
|
||||||
for any `n`.
|
|
||||||
|
|
||||||
#### Booleans.
|
|
||||||
|
|
||||||
[[ #false ]] = header(0,0,0) = [0x00]
|
|
||||||
[[ #true ]] = header(0,0,1) = [0x01]
|
|
||||||
|
|
||||||
#### Floats and Doubles.
|
|
||||||
|
|
||||||
[[ F ]] when F ∈ Float = header(0,0,2) ++ binary32(F)
|
|
||||||
[[ D ]] when D ∈ Double = header(0,0,3) ++ binary64(D)
|
|
||||||
|
|
||||||
The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
|
The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
|
||||||
8-byte IEEE 754 binary representations of `F` and `D`, respectively.
|
8-byte IEEE 754 binary representations of `F` and `D`, respectively.
|
||||||
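The atom encodings in the new format are simple enough to sketch directly. The Python below is an illustration under assumptions: the function names are hypothetical, and because Python has a single `float` type the caller chooses between `encode_float` and `encode_double` explicitly.

```python
import struct


def varint(m: int) -> bytes:
    """Base 128 varint encoding of a non-negative length."""
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def encode_boolean(b: bool) -> bytes:
    return b"\x81" if b else b"\x80"


def encode_float(f: float) -> bytes:
    return b"\x82" + struct.pack(">f", f)   # big-endian IEEE 754 binary32


def encode_double(d: float) -> bytes:
    return b"\x83" + struct.pack(">d", d)   # big-endian IEEE 754 binary64


def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return b"\xB1" + varint(len(data)) + data


def encode_bytestring(data: bytes) -> bytes:
    return b"\xB2" + varint(len(data)) + data


def encode_symbol(name: str) -> bytes:
    data = name.encode("utf-8")
    return b"\xB3" + varint(len(data)) + data
```

Note that the length prefix counts UTF-8 bytes, not code points, so a `String` containing non-ASCII characters has a length larger than its character count.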
|
@ -690,40 +582,43 @@ The functions `binary32(F)` and `binary64(D)` yield big-endian 4- and
|
||||||
### Annotations.
|
### Annotations.
|
||||||
|
|
||||||
To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
|
To annotate a `Repr` `r` with some `Value` `v`, prepend `r` with
|
||||||
`[0x05] ++ [[v]]`.
|
`[0x85] ++ «v»`. For example, the `Repr` corresponding to textual
|
||||||
|
syntax `@a@b[]`, i.e. an empty sequence annotated with two symbols,
|
||||||
|
`a` and `b`, is
|
||||||
|
|
||||||
For example, the `Repr` corresponding to textual syntax `@a@b[]`,
|
«@a @b []»
|
||||||
i.e. an empty sequence annotated with two symbols, `a` and `b`, is
|
= [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
|
||||||
|
= [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]
|
||||||
[[ @a @b [] ]]
|
|
||||||
= [0x05] ++ [[a]] ++ [0x05] ++ [[b]] ++ [[ [] ]]
|
|
||||||
= [0x05, 0x71, 0x61, 0x05, 0x71, 0x62, 0x90]
|
|
||||||
|
|
||||||
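The compound and annotation rules of the new format compose by plain concatenation, which the following Python sketch tries to show. All helper names are hypothetical, each helper takes already-encoded `Repr`s, and `symbol` is deliberately limited to names shorter than 128 bytes so that the varint length fits in one byte.

```python
END = b"\x84"  # the unique end tag


def symbol(name: str) -> bytes:
    data = name.encode("utf-8")
    if len(data) >= 0x80:
        raise ValueError("sketch only handles single-byte varint lengths")
    return b"\xB3" + bytes([len(data)]) + data


def record(label: bytes, *fields: bytes) -> bytes:
    return b"\xB4" + label + b"".join(fields) + END


def sequence(*items: bytes) -> bytes:
    return b"\xB5" + b"".join(items) + END


def annotate(repr_: bytes, *annotations: bytes) -> bytes:
    """Prefix repr_ with [0x85] ++ annotation for each annotation, in order."""
    return b"".join(b"\x85" + a for a in annotations) + repr_


annotated_empty_sequence = annotate(sequence(), symbol("a"), symbol("b"))
```

With these helpers, `annotated_empty_sequence` reproduces the byte sequence given for `@a @b []` in the new text, and `record(symbol("capture"), record(symbol("discard")))` reproduces the first row of the updated examples table.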
## Examples
|
## Examples
|
||||||
|
|
||||||
|
### Ordering.
|
||||||
|
|
||||||
|
The total ordering specified [above](#total-order) means that the following statements are true:
|
||||||
|
|
||||||
|
"bzz" < "c" < "caa"
|
||||||
|
#t < 3.0f < 3.0 < 3 < "3" < |3| < []
|
||||||
|
|
||||||
### Simple examples.
|
### Simple examples.
|
||||||
|
|
||||||
<!-- TODO: Give some examples of large and small Preserves, perhaps -->
|
<!-- TODO: Give some examples of large and small Preserves, perhaps -->
|
||||||
<!-- translated from various JSON blobs floating around the internet. -->
|
<!-- translated from various JSON blobs floating around the internet. -->
|
||||||
|
|
||||||
| Value | Encoded byte sequence |
|
| Value | Encoded byte sequence |
|
||||||
|---------------------------------------------------|-------------------------------------------------------------------------------------|
|
|-----------------------------|---------------------------------------------------------------------------------|
|
||||||
| `<capture <discard>>` | 82 77 'c' 'a' 'p' 't' 'u' 'r' 'e' 81 77 'd' 'i' 's' 'c' 'a' 'r' 'd' |
|
| `<capture <discard>>` | B4 B3 07 'c' 'a' 'p' 't' 'u' 'r' 'e' B4 B3 07 'd' 'i' 's' 'c' 'a' 'r' 'd' 84 84 |
|
||||||
| `[1 2 3 4]` (format B) | 94 31 32 33 34 |
|
| `[1 2 3 4]` | B5 91 92 93 94 84 |
|
||||||
| `[1 2 3 4]` (format C) | 29 31 32 33 34 04 |
|
| `[-2 -1 0 1]` | B5 9E 9F 90 91 84 |
|
||||||
| `[-2 -1 0 1]` | 94 3E 3F 30 31 |
|
| `"hello"` | B1 05 'h' 'e' 'l' 'l' 'o' |
|
||||||
| `"hello"` (format B) | 55 'h' 'e' 'l' 'l' 'o' |
|
| `["a" b #"c" [] #{} #t #f]` | B5 B1 01 'a' B3 01 'b' B2 01 'c' B5 84 B6 84 81 80 84 |
|
||||||
| `"hello"` (format C, 2 chunks) | 25 62 'h' 'e' 63 'l' 'l' 'o' 35 |
|
| `-257` | A1 FE FF |
|
||||||
| `"hello"` (format C, 5 chunks) | 25 61 'h' 61 'e' 61 'l' 61 'l' 61 'o' 35 |
|
| `-1` | 9F |
|
||||||
| `["hello" there #"world" [] #set{} #true #false]` | 97 55 'h' 'e' 'l' 'l' 'o' 75 't' 'h' 'e' 'r' 'e' 65 'w' 'o' 'r' 'l' 'd' 90 A0 01 00 |
|
| `0` | 90 |
|
||||||
| `-257` | 42 FE FF |
|
| `1` | 91 |
|
||||||
| `-1` | 3F |
|
| `255` | A1 00 FF |
|
||||||
| `0` | 30 |
|
| `1.0f` | 82 3F 80 00 00 |
|
||||||
| `1` | 31 |
|
| `1.0` | 83 3F F0 00 00 00 00 00 00 |
|
||||||
| `255` | 42 00 FF |
|
| `-1.202e300` | 83 FE 3C B7 B7 59 BF 04 26 |
|
||||||
| `1.0f` | 02 3F 80 00 00 |
|
|
||||||
| `1.0` | 03 3F F0 00 00 00 00 00 00 |
|
|
||||||
| `-1.202e300` | 03 FE 3C B7 B7 59 BF 04 26 |
|
|
||||||
|
|
||||||
The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record`
|
The next example uses a non-`Symbol` label for a record.[^extensibility2] The `Record`
|
||||||
|
|
||||||
|
@ -731,21 +626,24 @@ The next example uses a non-`Symbol` label for a record.[^extensibility2] The `R
|
||||||
|
|
||||||
encodes to
|
encodes to
|
||||||
|
|
||||||
85 ;; Record, generic, 4+1
|
B4 ;; Record
|
||||||
95 ;; Sequence, 5
|
B5 ;; Sequence
|
||||||
76 74 69 74 6C 65 64 ;; Symbol, "titled"
|
B3 06 74 69 74 6C 65 64 ;; Symbol, "titled"
|
||||||
76 70 65 72 73 6F 6E ;; Symbol, "person"
|
B3 06 70 65 72 73 6F 6E ;; Symbol, "person"
|
||||||
32 ;; SignedInteger, "2"
|
92 ;; SignedInteger, "2"
|
||||||
75 74 68 69 6E 67 ;; Symbol, "thing"
|
B3 05 74 68 69 6E 67 ;; Symbol, "thing"
|
||||||
31 ;; SignedInteger, "1"
|
91 ;; SignedInteger, "1"
|
||||||
41 65 ;; SignedInteger, "101"
|
84 ;; End (sequence)
|
||||||
59 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell"
|
A0 65 ;; SignedInteger, "101"
|
||||||
84 ;; Record, generic, 3+1
|
B1 09 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell"
|
||||||
74 64 61 74 65 ;; Symbol, "date"
|
B4 ;; Record
|
||||||
42 07 1D ;; SignedInteger, "1821"
|
B3 04 64 61 74 65 ;; Symbol, "date"
|
||||||
32 ;; SignedInteger, "2"
|
A1 07 1D ;; SignedInteger, "1821"
|
||||||
33 ;; SignedInteger, "3"
|
92 ;; SignedInteger, "2"
|
||||||
52 44 72 ;; String, "Dr"
|
93 ;; SignedInteger, "3"
|
||||||
|
84 ;; End (record)
|
||||||
|
B1 02 44 72 ;; String, "Dr"
|
||||||
|
84 ;; End (record)
|
||||||
|
|
||||||
[^extensibility2]: It happens to line up with Racket's
|
[^extensibility2]: It happens to line up with Racket's
|
||||||
representation of a record label for an inheritance hierarchy
|
representation of a record label for an inheritance hierarchy
|
||||||
|
@ -785,23 +683,27 @@ read as `Symbol`s. The first example:
|
||||||
|
|
||||||
encodes to binary as follows:
|
encodes to binary as follows:
|
||||||
|
|
||||||
B2
|
B7
|
||||||
55 "Image"
|
B1 05 "Image"
|
||||||
BC
|
B7
|
||||||
55 "Width" 42 03 20
|
B1 05 "Title" B1 14 "View from 15th Floor"
|
||||||
55 "Title" 5F 14 "View from 15th Floor"
|
B1 05 "Width" A1 03 20
|
||||||
58 "Animated" 75 "false"
|
B1 06 "Height" A1 02 58
|
||||||
56 "Height" 42 02 58
|
B1 08 "Animated" B3 05 "false"
|
||||||
59 "Thumbnail"
|
B1 09 "Thumbnail"
|
||||||
B6
|
B7
|
||||||
55 "Width" 41 64
|
B1 03 "Url" B1 26 "http://www.example.com/image/481989943"
|
||||||
53 "Url" 5F 26 "http://www.example.com/image/481989943"
|
B1 03 "IDs" B5
|
||||||
56 "Height" 41 7D
|
A0 74
|
||||||
53 "IDs" 94
|
A1 03 AF
|
||||||
41 74
|
A1 00 EA
|
||||||
42 03 AF
|
A2 00 97 89
|
||||||
42 00 EA
|
84
|
||||||
43 00 97 89
|
B1 05 "Width" A0 64
|
||||||
|
B1 06 "Height" A0 7D
|
||||||
|
84
|
||||||
|
84
|
||||||
|
84
|
||||||
|
|
||||||
and the second example:
|
and the second example:
|
||||||
|
|
||||||
|
@ -830,55 +732,51 @@ and the second example:
|
||||||
|
|
||||||
encodes to binary as follows:
|
encodes to binary as follows:
|
||||||
|
|
||||||
92
|
B5
|
||||||
BF 10
|
B7
|
||||||
59 "precision" 53 "zip"
|
B1 03 "Zip" B1 05 "94107"
|
||||||
58 "Latitude" 03 40 42 E2 26 80 9D 49 52
|
B1 04 "City" B1 0D "SAN FRANCISCO"
|
||||||
59 "Longitude" 03 C0 5E 99 56 6C F4 1F 21
|
B1 05 "State" B1 02 "CA"
|
||||||
57 "Address" 50
|
B1 07 "Address" B1 00
|
||||||
54 "City" 5D "SAN FRANCISCO"
|
B1 07 "Country" B1 02 "US"
|
||||||
55 "State" 52 "CA"
|
B1 08 "Latitude" 83 40 42 E2 26 80 9D 49 52
|
||||||
53 "Zip" 55 "94107"
|
B1 09 "Longitude" 83 C0 5E 99 56 6C F4 1F 21
|
||||||
57 "Country" 52 "US"
|
B1 09 "precision" B1 03 "zip"
|
||||||
BF 10
|
84
|
||||||
59 "precision" 53 "zip"
|
B7
|
||||||
58 "Latitude" 03 40 42 AF 9D 66 AD B4 03
|
B1 03 "Zip" B1 05 "94085"
|
||||||
59 "Longitude" 03 C0 5E 81 AA 4F CA 42 AF
|
B1 04 "City" B1 09 "SUNNYVALE"
|
||||||
57 "Address" 50
|
B1 05 "State" B1 02 "CA"
|
||||||
54 "City" 59 "SUNNYVALE"
|
B1 07 "Address" B1 00
|
||||||
55 "State" 52 "CA"
|
B1 07 "Country" B1 02 "US"
|
||||||
53 "Zip" 55 "94085"
|
B1 08 "Latitude" 83 40 42 AF 9D 66 AD B4 03
|
||||||
57 "Country" 52 "US"
|
B1 09 "Longitude" 83 C0 5E 81 AA 4F CA 42 AF
|
||||||
|
B1 09 "precision" B1 03 "zip"
|
||||||
|
84
|
||||||
|
84
|
||||||
|
|
||||||
## Security Considerations
|
## Security Considerations
|
||||||
|
|
||||||
**Empty chunks.** Chunks of zero length are prohibited in streamed
|
**Whitespace.** The textual format allows arbitrary whitespace in many
|
||||||
(format C) `Repr`s. However, a malicious or broken encoder may include
|
positions. Consider optional restrictions on the amount of consecutive
|
||||||
them nonetheless. This opens up a possibility for denial-of-service:
|
whitespace that may appear.
|
||||||
an attacker may begin streaming a `String`, for example, sending an
|
|
||||||
endless sequence of zero length chunks, appearing to make progress but
|
|
||||||
not actually doing so. Implementations *MUST* reject zero length
|
|
||||||
chunks when decoding, and *MUST NOT* produce them when encoding.
|
|
||||||
|
|
||||||
**Whitespace and no-ops.** Similarly, the binary format allows `0xFF`
|
**Annotations.** Similarly, in modes where a `Value` is being read
|
||||||
no-ops and the textual format allows arbitrary whitespace in many
|
while annotations are skipped, an endless sequence of annotations may
|
||||||
positions. In streaming transfer situations, consider optional
|
give an illusion of progress.
|
||||||
restrictions on the amount of consecutive whitespace or the number of
|
|
||||||
consecutive no-ops that may appear.
|
|
||||||
|
|
||||||
**Annotations.** Also similarly, in modes where a `Value` is being
|
**Canonical form for cryptographic hashing and signing.** No canonical
|
||||||
read while annotations are skipped, an endless sequence of annotations
|
textual encoding of a `Value` is specified. A
|
||||||
may give an illusion of progress.
|
[canonical form][canonical] exists for binary encoded `Value`s, and
|
||||||
|
implementations *SHOULD* produce canonical binary encodings by
|
||||||
**Canonical form for cryptographic hashing and signing.** As
|
default; however, an implementation *MAY* permit two serializations of
|
||||||
specified, neither the textual nor the compact binary encoding rules
|
the same `Value` to yield different binary `Repr`s.
|
||||||
for `Value`s force canonical serializations. Two serializations of the
|
|
||||||
same `Value` may yield different binary `Repr`s.
|
|
||||||
|
|
||||||
## Acknowledgements
|
## Acknowledgements
|
||||||
|
|
||||||
The use of low-order bits of each lead byte for the length of short
|
The use of the low-order bits in certain SignedInteger tags for the
|
||||||
values is inspired by a similar feature of [CBOR](http://cbor.io/).
|
length of the following data is inspired by a similar feature of
|
||||||
|
[CBOR](http://cbor.io/).
|
||||||
|
|
||||||
The treatment of commas as whitespace in the text syntax is inspired
|
The treatment of commas as whitespace in the text syntax is inspired
|
||||||
by the same feature of [EDN](https://github.com/edn-format/edn).
|
by the same feature of [EDN](https://github.com/edn-format/edn).
|
||||||
|
@ -889,126 +787,42 @@ syntax.
|
||||||
|
|
||||||
## Appendix. Autodetection of textual or binary syntax
|
## Appendix. Autodetection of textual or binary syntax
|
||||||
|
|
||||||
Whitespace characters `0x09` (ASCII HT (tab)), `0x0A` (LF), `0x0D`
|
Every tag byte in a binary Preserves `Document` falls within the range
|
||||||
(CR), `0x20` (space) and `0x2C` (comma) are ignored at the start of a
|
[`0x80`, `0xBF`]. These bytes, interpreted as UTF-8, are *continuation
|
||||||
textual-syntax Preserves `Document`, and their UTF-8 encodings are
|
bytes*, and will never occur as the first byte of a UTF-8 encoded code
|
||||||
reserved lead byte values in binary-syntax Preserves.
|
point. This means no binary-encoded document can be misinterpreted as
|
||||||
|
valid UTF-8.
|
||||||
|
|
||||||
The byte `0xFF`, signifying a no-op in binary-syntax Preserves, has no
|
Conversely, a UTF-8 document must start with a valid codepoint,
|
||||||
meaning in either 7-bit ASCII or UTF-8, and therefore cannot appear in
|
meaning in particular that it must not start with a byte in the range
|
||||||
a valid textual-syntax Preserves `Document`.
|
[`0x80`, `0xBF`]. This means that no UTF-8 encoded textual-syntax
|
||||||
|
Preserves document can be misinterpreted as a binary-syntax document.
|
||||||
|
|
||||||
If applications prefix their textual-syntax documents with e.g. a
|
Examination of the top two bits of the first byte of a document gives
|
||||||
space or newline character, and their binary-syntax documents with a
|
its syntax: if the top two bits are `10`, it should be interpreted as
|
||||||
`0xFF` byte, consumers of these documents may reliably autodetect the
|
a binary-syntax document; otherwise, it should be interpreted as text.
|
||||||
syntax being used. In a network protocol supporting this kind of
|
|
||||||
autodetection, clients may transmit LF or `0xFF` to select text or
|
|
||||||
binary syntax, respectively.
|
|
||||||
|
|
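As an illustration of the new autodetection rule above (top two bits of the first byte equal to `10` means binary, anything else means text), here is a minimal sketch in Python. The function name `detect_syntax` and its return values are hypothetical and are not part of the Preserves specification or of any implementation. Rejecting empty input is also an assumption, since the text only discusses documents that have a first byte.

```python
def detect_syntax(document: bytes) -> str:
    """Guess whether `document` uses binary or textual Preserves syntax.

    Hypothetical helper: it applies only the rule stated in the appendix,
    namely that every binary tag byte lies in the range 0x80..0xBF.
    """
    if not document:
        # Assumption: an empty input has no first byte to examine.
        raise ValueError("cannot autodetect the syntax of an empty document")
    first = document[0]
    # 0xC0 masks the top two bits; 0x80 is the bit pattern `10` followed by zeros.
    if (first & 0xC0) == 0x80:
        return "binary"
    return "text"
```

For example, `detect_syntax(b"\xb4")` reports `"binary"` (`B4` is the Record tag in the new table), while `detect_syntax(b'{"name": 1}')` reports `"text"`. A UTF-8 lead byte such as `0xC3` has top bits `11`, so it is also classified as text.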
||||||
Furthermore, if an application consistently uses `Record`s for its
|
## Appendix. Table of tag values
|
||||||
top-level messages,[^records-and-nonatoms] eschewing `Atom`s in
|
|
||||||
particular, then autodetection of the encoding used for a given input
|
|
||||||
can be done as follows:
|
|
||||||
|
|
||||||
| First byte of encoded input | Encoding | Other conclusions |
|
80 - False
|
||||||
| --- | --- | --- |
|
81 - True
|
||||||
| `0x80`--`0x8F` | binary | `Record` (format B) |
|
82 - Float
|
||||||
| `0x28` | binary | `Record` (format C) |
|
83 - Double
|
||||||
| `0x05` | binary | annotated value (presumably a `Record`) |
|
84 - End marker
|
||||||
| `0xFF` | binary | no-op; value will follow |
|
85 - Annotation
|
||||||
| --- | --- | --- |
|
(8x) RESERVED 86-8F
|
||||||
| `0x7B` ("<") | text | `Record` |
|
|
||||||
| `0x40` ("@") | text | annotated value (presumably a `Record`) |
|
|
||||||
| `0x09`, `0x0A`, `0x0D`, `0x20` or `0x2C` | text | whitespace; value will follow |
|
|
||||||
|
|
||||||
[^records-and-nonatoms]: Similar reasoning can be used to permit
|
9x - Small integers 0..12,-3..-1
|
||||||
unambiguous detection of encoding when `Collection`s are allowed
|
An - Small integers, (n+1) bytes long
|
||||||
as top-level messages as well as `Record`s.
|
B0 - Small integers, variable length
|
||||||
|
B1 - String
|
||||||
|
B2 - ByteString
|
||||||
|
B3 - Symbol
|
||||||
|
|
||||||
## Appendix. Table of lead byte values
|
B4 - Record
|
||||||
|
B5 - Sequence
|
||||||
00 - False
|
B6 - Set
|
||||||
01 - True
|
B7 - Dictionary
|
||||||
02 - Float
|
|
||||||
03 - Double
|
|
||||||
04 - End stream
|
|
||||||
05 - Annotation
|
|
||||||
(0x) RESERVED 06-0F (NB. 09, 0A, 0D specially reserved)
|
|
||||||
(1x) RESERVED
|
|
||||||
2x - Start Stream (NB. 20, 2C specially reserved)
|
|
||||||
3x - Small integers 0..12,-3..-1
|
|
||||||
|
|
||||||
4x - SignedInteger
|
|
||||||
5x - String
|
|
||||||
6x - ByteString
|
|
||||||
7x - Symbol
|
|
||||||
|
|
||||||
8x - Record
|
|
||||||
9x - Sequence
|
|
||||||
Ax - Set
|
|
||||||
Bx - Dictionary
|
|
||||||
|
|
||||||
(Cx) RESERVED C0-CF
|
|
||||||
(Dx) RESERVED D0-DF
|
|
||||||
(Ex) RESERVED E0-EF
|
|
||||||
(Fx) RESERVED F0-FE
|
|
||||||
FF No-op
|
|
||||||
|
|
||||||
## Appendix. Bit fields within lead byte values
|
|
||||||
|
|
||||||
tt nn mmmm contents
|
|
||||||
---------- ---------
|
|
||||||
|
|
||||||
00 00 0000 False
|
|
||||||
00 00 0001 True
|
|
||||||
00 00 0010 Float, 32 bits big-endian binary
|
|
||||||
00 00 0011 Double, 64 bits big-endian binary
|
|
||||||
00 00 0100 End Stream (to match a previous Start Stream)
|
|
||||||
00 00 0101 Annotation; two more Reprs follow
|
|
||||||
|
|
||||||
00 00 1001 (ASCII HT (tab)) \
|
|
||||||
00 00 1010 (ASCII LF) |- Reserved: may be used to indicate
|
|
||||||
00 00 1101 (ASCII CR) / use of text encoding
|
|
||||||
|
|
||||||
00 01 xxxx error, RESERVED
|
|
||||||
|
|
||||||
00 10 ttnn Start Stream <tt,nn>
|
|
||||||
When tt = 00 --> error
|
|
||||||
When nn = 00 --> (ASCII space)
|
|
||||||
Reserved: may be used to indicate
|
|
||||||
use of text encoding
|
|
||||||
otherwise --> error
|
|
||||||
01 --> each chunk is a ByteString
|
|
||||||
10 --> each chunk is a single encoded Value
|
|
||||||
11 --> error (RESERVED)
|
|
||||||
When nn = 00 --> (ASCII comma)
|
|
||||||
Reserved: may be used to indicate
|
|
||||||
use of text encoding
|
|
||||||
otherwise --> error
|
|
||||||
|
|
||||||
00 11 xxxx Small integers 0..12,-3..-1
|
|
||||||
|
|
||||||
01 00 mmmm SignedInteger, big-endian binary
|
|
||||||
01 01 mmmm String, UTF-8 binary
|
|
||||||
01 10 mmmm ByteString
|
|
||||||
01 11 mmmm Symbol, UTF-8 binary
|
|
||||||
|
|
||||||
10 00 mmmm Record
|
|
||||||
10 01 mmmm Sequence
|
|
||||||
10 10 mmmm Set
|
|
||||||
10 11 mmmm Dictionary
|
|
||||||
|
|
||||||
11 00 xxxx error, RESERVED
|
|
||||||
11 01 xxxx error, RESERVED
|
|
||||||
11 10 xxxx error, RESERVED
|
|
||||||
11 11 1111 no-op; unambiguous indication of binary Preserves format
|
|
||||||
|
|
||||||
Where `mmmm` appears, interpret it as an unsigned 4-bit number `m`. If
|
|
||||||
`m`<15, let `l`=`m`. Otherwise, `m`=15; let `l` be the result of
|
|
||||||
decoding the varint that follows.
|
|
||||||
|
|
||||||
Then, `l` is the length of the body that follows, counted in bytes for
|
|
||||||
`tt`=`01` and in `Repr`s for `tt`=`10`.
|
|
||||||
|
|
||||||
## Appendix. Binary SignedInteger representation
|
## Appendix. Binary SignedInteger representation
|
||||||
|
|
||||||
|
@ -1016,17 +830,17 @@ Languages that provide fixed-width machine word types may find the
|
||||||
following table useful in encoding and decoding binary `SignedInteger`
|
following table useful in encoding and decoding binary `SignedInteger`
|
||||||
values.
|
values.
|
||||||
|
|
||||||
| Integer range | Bytes required | Encoding (hex) |
|
| Integer range | Bytes required | Encoding (hex) |
|
||||||
| --- | --- | --- |
|
| --- | --- | --- |
|
||||||
| -3 ≤ n < 13 (numbers -3..12 encoded specially) | 1 | `3X` |
|
| -3 ≤ n ≤ 12 | 1 | `3X` |
|
||||||
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `41` `XX` |
|
| -2<sup>7</sup> ≤ n < 2<sup>7</sup> (i8) | 2 | `A0` `XX` |
|
||||||
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `42` `XX` `XX` |
|
| -2<sup>15</sup> ≤ n < 2<sup>15</sup> (i16) | 3 | `A1` `XX` `XX` |
|
||||||
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `43` `XX` `XX` `XX` |
|
| -2<sup>23</sup> ≤ n < 2<sup>23</sup> (i24) | 4 | `A2` `XX` `XX` `XX` |
|
||||||
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `44` `XX` `XX` `XX` `XX` |
|
| -2<sup>31</sup> ≤ n < 2<sup>31</sup> (i32) | 5 | `A3` `XX` `XX` `XX` `XX` |
|
||||||
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `45` `XX` `XX` `XX` `XX` `XX` |
|
| -2<sup>39</sup> ≤ n < 2<sup>39</sup> (i40) | 6 | `A4` `XX` `XX` `XX` `XX` `XX` |
|
||||||
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `46` `XX` `XX` `XX` `XX` `XX` `XX` |
|
| -2<sup>47</sup> ≤ n < 2<sup>47</sup> (i48) | 7 | `A5` `XX` `XX` `XX` `XX` `XX` `XX` |
|
||||||
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `47` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
|
| -2<sup>55</sup> ≤ n < 2<sup>55</sup> (i56) | 8 | `A6` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
|
||||||
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `48` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
|
| -2<sup>63</sup> ≤ n < 2<sup>63</sup> (i64) | 9 | `A7` `XX` `XX` `XX` `XX` `XX` `XX` `XX` `XX` |
|
||||||
|
|
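The table above can be turned into an encoder fairly directly. The following Python sketch is a rough illustration under stated assumptions, not the reference implementation. The name `encode_signed_integer` is hypothetical. It follows the new "Table of tag values" (`9x` for small integers, `An` for an integer body of n+1 bytes), so the one-byte case is emitted as `9X` even though the first row of the new-side table still reads `3X`. The mapping of -3..-1 onto the low nibble values 13..15 is an assumption inferred from the ordering "0..12,-3..-1". The variable-length `B0` form is not covered.

```python
def encode_signed_integer(n: int) -> bytes:
    """Encode `n` as a binary SignedInteger (sketch; see assumptions above)."""
    if -3 <= n <= 12:
        # Assumed nibble layout: 0..12 map to 0x0..0xC, -3..-1 map to 0xD..0xF.
        return bytes([0x90 | (n & 0x0F)])

    # Smallest number of bytes that holds `n` in two's complement.
    magnitude = n if n >= 0 else ~n
    length = (magnitude.bit_length() + 1 + 7) // 8

    if length > 16:
        # The B0 (variable length) form would be needed; omitted from this sketch.
        raise NotImplementedError("integers longer than 16 bytes are not handled here")

    tag = 0xA0 + (length - 1)  # An: (n+1) bytes of big-endian body follow
    return bytes([tag]) + n.to_bytes(length, "big", signed=True)
```

Under these assumptions, 13 no longer fits the one-byte form and is encoded as `A0 0D`, and 128 needs two body bytes, giving `A1 00 80`, which matches the i8 and i16 rows of the table.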
||||||
<!-- Heading to visually offset the footnotes from the main document: -->
|
<!-- Heading to visually offset the footnotes from the main document: -->
|
||||||
## Notes
|
## Notes
|
||||||
|
|
13
questions.md
|
@ -29,16 +29,3 @@ not. There's only one (?) at the moment, the `%i"f"` in `Float`;
|
||||||
should it be changed to case-sensitive?
|
should it be changed to case-sensitive?
|
||||||
|
|
||||||
Q. Should `IOList`s be wrapped in an identifying unary record constructor?
|
Q. Should `IOList`s be wrapped in an identifying unary record constructor?
|
||||||
|
|
||||||
TODO: Examples of the ordering. `"bzz" < "c" < "caa"`; `#true < 3 < "3" < |3|`
|
|
||||||
|
|
||||||
TODO: Probably should add a canonicalized subset. Consider adding
|
|
||||||
explicit "I promise this is canonical" marker, like a BOM, which
|
|
||||||
identifies a binary value as (first) binary and (second, optionally)
|
|
||||||
as canonical. UTF-8 disallows byte `0xFF` from appearing anywhere in a
|
|
||||||
text; this might be a good candidate for a marker sequence.
|
|
||||||
((Actually, perhaps `0x10` would be good! It corresponds to DLE, "data
|
|
||||||
link escape"; it is not a printable ASCII character, and is disallowed
|
|
||||||
in the textual Preserves grammar; and it is also mnemonic for "version
|
|
||||||
0", since it is the Preserves binary encoding of the small integer
|
|
||||||
zero.))
|
|
||||||
|
|