Preserves: an Expressive Data Language
Tony Garnock-Jones tonyg@leastfixedpoint.com
Jan 2021. Version 0.4.0.
This document proposes a data model and serialization format called Preserves.
Preserves supports records with user-defined labels. This relieves the confusion caused by encoding records as dictionaries, seen in most data languages in use on the web. It also allows Preserves to easily represent the labelled sums of products as seen in many functional programming languages.
Preserves also supports the usual suite of atomic and compound data types, in particular including binary data as a distinct type from text strings. Its annotations allow separation of data from metadata such as comments, trace information, and provenance information.
Finally, Preserves defines precisely how to compare two values. Comparison is based on the data model, not on syntax or on data structures of any particular implementation language.
Starting with Semantics
Taking inspiration from functional programming, we start with a definition of the values that we want to work with and give them meaning independent of their syntax.
Our Values fall into two broad categories: atomic and compound data. Every Value is finite and non-cyclic.
Value = Atom
| Compound
Atom = Boolean
| Float
| Double
| SignedInteger
| String
| ByteString
| Symbol
Compound = Record
| Sequence
| Set
| Dictionary
Total order. As we go, we will incrementally specify a total order over Values. Two values of the same kind are compared using kind-specific rules. The ordering among values of different kinds is essentially arbitrary, but having a total order is convenient for many tasks, so we define it as follows:
(Values) Atom < Compound
(Compounds) Record < Sequence < Set < Dictionary
(Atoms) Boolean < Float < Double < SignedInteger
< String < ByteString < Symbol
Equivalence. Two Values are equal if neither is less than the other according to the total order.
Signed integers.
A SignedInteger is a signed integer of arbitrary width. SignedIntegers are compared as mathematical integers.
Unicode strings.
A String is a sequence of Unicode code-points. Strings are compared lexicographically, code-point by code-point.1
Binary data.
A ByteString is a sequence of octets. ByteStrings are compared lexicographically.
Symbols.
Programming languages like Lisp and Prolog frequently use string-like values called symbols. Here, a Symbol is, like a String, a sequence of Unicode code-points representing an identifier of some kind. Symbols are also compared lexicographically by code-point.
Booleans.
There are two Booleans, “false” and “true”. The “false” value is less than the “true” value.
IEEE floating-point values.
Floats and Doubles are single- and double-precision IEEE 754 floating-point values, respectively. Floats, Doubles and SignedIntegers are disjoint; by the rules above, every Float is less than every Double, and every SignedInteger is greater than both. Two Floats or two Doubles are to be ordered by the totalOrder predicate defined in section 5.10 of IEEE Std 754-2008.
Records.
A Record is a labelled tuple of Values, the record's fields. A label can be any Value, but is usually a Symbol.2 3 Records are compared lexicographically: first by label, then by field sequence.
Sequences.
A Sequence is a sequence of Values. Sequences are compared lexicographically.
Sets.
A Set is an unordered finite set of Values. It contains no duplicate values, following the equivalence relation induced by the total order on Values. Two Sets are compared by sorting their elements ascending using the total order and comparing the resulting Sequences.
Dictionaries.
A Dictionary is an unordered finite collection of pairs of Values. Each pair comprises a key and a value. Keys in a Dictionary are pairwise distinct. Instances of Dictionary are compared by lexicographic comparison of the sequences resulting from ordering each Dictionary's pairs in ascending order by key.
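The cross-kind and lexicographic rules above can be sketched in Python. This is only an illustration under stated assumptions, not part of the specification: the names Symbol, KIND_ORDER and compare are hypothetical, only a subset of kinds is covered (Python's float stands in for Double and list for Sequence), and native float comparison is used where the specification calls for the IEEE 754 totalOrder predicate.

```python
from functools import cmp_to_key


class Symbol(str):
    """Hypothetical marker type distinguishing Symbols from Strings."""


# Assumed mapping of Python types to a subset of Preserves kinds, listed in
# ascending order: Boolean < Double < SignedInteger < String < ByteString
# < Symbol < Sequence.
KIND_ORDER = [bool, float, int, str, bytes, Symbol, list]


def compare(a, b):
    """Return -1, 0 or 1 according to the (partial) total order sketch."""
    rank_a = KIND_ORDER.index(type(a))
    rank_b = KIND_ORDER.index(type(b))
    if rank_a != rank_b:
        # Values of different kinds are ordered by kind alone.
        return -1 if rank_a < rank_b else 1
    if isinstance(a, list):
        # Sequences compare lexicographically, element by element.
        for x, y in zip(a, b):
            c = compare(x, y)
            if c != 0:
                return c
        return (len(a) > len(b)) - (len(a) < len(b))
    # Same atomic kind: Python's native ordering matches the rules for
    # integers, code-point strings and byte strings. NaN handling for
    # floats is NOT covered here.
    return (a > b) - (a < b)


values = [[], Symbol("3"), b"3", "3", 3, 3.0, True]
ordered = sorted(values, key=cmp_to_key(compare))
```

Two values are equivalent in this sketch exactly when compare returns 0 for them.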
Textual Syntax
Now that we have discussed Values and their meanings, we may turn to techniques for representing Values for communication or storage.
In this section, we use case-sensitive ABNF to define a textual syntax that is easy for people to read and write.4 Most of the examples in this document are written using this syntax. In the following section, we will define an equivalent compact machine-readable syntax.
Character set.
ABNF allows easy definition of US-ASCII-based languages. However, Preserves is a Unicode-based language. Therefore, we reinterpret ABNF as a grammar for recognising sequences of Unicode code points.
Textual syntax for a Value SHOULD be encoded using UTF-8 where possible.
Whitespace.
Whitespace is defined as any number of spaces, tabs, carriage returns, line feeds, or commas.
ws = *(%x20 / %x09 / newline / ",")
newline = CR / LF
Grammar.
Standalone documents may have trailing whitespace.
Document = Value ws
Any Value may be preceded by whitespace.
Value = ws (Record / Collection / Atom / Compact)
Collection = Sequence / Dictionary / Set
Atom = Boolean / Float / Double / SignedInteger /
String / ByteString / Symbol
Each Record is an angle-bracket enclosed grouping of its label-Value followed by its field-Values.
Record = "<" Value *Value ws ">"
Sequences are enclosed in square brackets. Dictionary values are curly-brace-enclosed colon-separated pairs of values. Sets are written as values enclosed by the tokens #{ and }.5 It is an error for a set to contain duplicate elements or for a dictionary to contain duplicate keys.
Sequence = "[" *Value ws "]"
Dictionary = "{" *(Value ws ":" Value) ws "}"
Set = "#{" *Value ws "}"
Booleans are the simple literal strings #t and #f for true and false, respectively.
Boolean = %s"#t" / %s"#f"
Numeric data follow the JSON grammar, with the addition of a trailing “f” distinguishing Float from Double values. Floats and Doubles always have either a fractional part or an exponent part, whereas SignedIntegers never have either.6 7
Float = flt %i"f"
Double = flt
SignedInteger = int
digit1-9 = %x31-39
nat = %x30 / ( digit1-9 *DIGIT )
int = ["-"] nat
frac = "." 1*DIGIT
exp = %i"e" ["-"/"+"] 1*DIGIT
flt = int (frac exp / frac / exp)
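As a rough illustration of how the numeric grammar separates the three kinds, the rules can be transcribed into Python regular expressions. The function name classify_number is hypothetical, and the sketch only classifies a token that has already been isolated; it does not convert it to a number.

```python
import re

# Direct transcriptions of the ABNF rules above.
INT = r"-?(?:0|[1-9][0-9]*)"          # int = ["-"] nat
FRAC = r"\.[0-9]+"                    # frac = "." 1*DIGIT
EXP = r"[eE][-+]?[0-9]+"              # exp = %i"e" ["-"/"+"] 1*DIGIT
FLT = rf"{INT}(?:{FRAC}{EXP}|{FRAC}|{EXP})"


def classify_number(token: str):
    """Return the kind named by a numeric token, or None if it is not numeric."""
    if re.fullmatch(FLT + r"[fF]", token):
        return "Float"
    if re.fullmatch(FLT, token):
        return "Double"
    if re.fullmatch(INT, token):
        return "SignedInteger"
    return None
```

For instance, classify_number("3.0f"), classify_number("3.0") and classify_number("3") name three different kinds, while a token with a leading zero such as "01" is rejected.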
Strings are, as in JSON, possibly escaped text surrounded by double quotes. The escaping rules are the same as for JSON.8 9
String = %x22 *char %x22
char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
escape = %x5C ; \
escaped = ( %x5C / ; \ reverse solidus U+005C
%x2F / ; / solidus U+002F
%x62 / ; b backspace U+0008
%x66 / ; f form feed U+000C
%x6E / ; n line feed U+000A
%x72 / ; r carriage return U+000D
%x74 ) ; t tab U+0009
A ByteString may be written in any of three different forms.
The first is similar to a String, but prepended with a hash sign #. In addition, only Unicode code points overlapping with printable 7-bit ASCII are permitted unescaped inside such a ByteString; other byte values must be escaped by prepending a two-digit hexadecimal value with \x.
ByteString = "#" %x22 *binchar %x22
binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
binunescaped = %x20-21 / %x23-5B / %x5D-7E
The second is as a sequence of pairs of hexadecimal digits interleaved with whitespace and surrounded by #x" and ".
ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22
The third is as a sequence of Base64 characters, interleaved with whitespace and surrounded by #[ and ]. Plain and URL-safe Base64 characters are allowed.
ByteString =/ "#[" *(ws / base64char) ws "]"
base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
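A small Python sketch of decoding the bodies of the second and third forms follows. The helper names are hypothetical, each function receives only the text between the delimiters, and the approach is slightly more lenient than the grammar (for example, it tolerates whitespace between the two digits of a hexadecimal pair).

```python
import base64

# Preserves whitespace: space, tab, CR, LF and comma.
WS = " \t\r\n,"


def strip_ws(body: str) -> str:
    """Remove every Preserves whitespace character from the body."""
    return "".join(ch for ch in body if ch not in WS)


def decode_hex_body(body: str) -> bytes:
    """Decode the text between #x" and " into bytes."""
    return bytes.fromhex(strip_ws(body))


def decode_base64_body(body: str) -> bytes:
    """Decode the text between #[ and ], accepting plain or URL-safe Base64."""
    compact = strip_ws(body).replace("-", "+").replace("_", "/").rstrip("=")
    # Restore the padding that the standard decoder expects.
    padded = compact + "=" * (-len(compact) % 4)
    return base64.b64decode(padded)
```

Under these assumptions, #x"68 65 6c 6c 6f" and #[aGVsbG8=] both denote the five bytes of ASCII "hello".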
A Symbol may be written in a “bare” form10 so long as it conforms to certain restrictions on the characters appearing in the symbol. Alternatively, it may be written in a quoted form. The quoted form is much the same as the syntax for Strings, including embedded escape syntax, except using a bar or pipe character (|) instead of a double quote mark.
Symbol = symstart *symcont / "|" *symchar "|"
symstart = ALPHA / sympunct / symustart
symcont = ALPHA / sympunct / symustart / symucont / DIGIT / "-"
sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
"?" / "_" / "=" / "+" / "/" / "."
symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
symustart = <any code point greater than 127 whose Unicode
category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me,
Pc, Po, Sc, Sm, Sk, So, or Co>
symucont = <any code point greater than 127 whose Unicode
category is Nd, Nl, No, or Pd>
Finally, any Value may be represented by escaping from the textual syntax to the compact binary syntax by prefixing a ByteString containing the binary representation of the Value with #=.11 12 13
Compact = "#=" ws ByteString
Annotations.
Syntax. When written down, a Value may have an associated sequence of annotations carrying “out-of-band” contextual metadata about the value. Each annotation is, in turn, a Value, and may itself have annotations.
Value =/ ws "@" Value Value
Each annotation is preceded by @; the underlying annotated value follows its annotations. Here we extend only the syntactic nonterminal named “Value” without altering the semantic class of Values.
Comments. Strings annotating a Value are conventionally interpreted as comments associated with that value. Comments are sufficiently common that special syntax exists for them.
Value =/ ws
";" *(%x00-09 / %x0B-0C / %x0E-10FFFF) newline
Value
When written this way, everything between the ; and the newline is included in the string annotating the Value.
Equivalence. Annotations appear within syntax denoting a Value; however, the annotations are not part of the denoted value. They are only part of the syntax. Annotations do not play a part in equivalences and orderings of Values.
Reflective tools such as debuggers, user interfaces, and message routers and relays (tools which process Values generically) may use annotated inputs to tailor their operation, or may insert annotations in their outputs. By contrast, in ordinary programs, as a rule of thumb, the presence, absence or content of an annotation should not change the control flow or output of the program. Annotations are data describing Values, and are not in the domain of any specific application of Values. That is, an annotation will almost never cause a non-reflective program to do anything observably different.
Compact Binary Syntax
A Repr is a binary-syntax encoding, or representation, of a Value. For a value v, we write «v» for the Repr of v.
Type and Length representation.
Each Repr starts with a tag byte, describing the kind of information represented. Depending on the tag, a length indicator, further encoded information, and/or an ending tag may follow.
tag (simple atomic data and small integers)
tag ++ binarydata (most integers)
tag ++ length ++ binarydata (large integers, strings, symbols, and binary)
tag ++ repr ++ ... ++ endtag (compound data)
The unique end tag is byte value 0x84.
If present after a tag, the length of a following piece of binary data is formatted as a base 128 varint.14 We write varint(m) for the varint-encoding of m. Quoting the Google Protocol Buffers definition,
Each byte in a varint, except the last byte, has the most significant bit (msb) set – this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two's complement representation of the number in groups of 7 bits, least significant group first.
The following table illustrates varint-encoding.
Number, m | m in binary, grouped into 7-bit chunks | varint(m) bytes |
---|---|---|
15 | 0001111 | 15 |
300 | 0000010 0101100 | 172 2 |
1000000000 | 0000011 1011100 1101011 0010100 0000000 | 128 148 235 220 3 |
It is an error for a varint-encoded m in a Repr to be anything other than the unique shortest encoding for that m. That is, a varint-encoding of m MUST NOT end in 0 unless m=0.
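The varint encoding described above can be sketched in a few lines of Python. The function name mirrors the varint(m) notation of this section; the rest is an illustrative assumption rather than a reference implementation.

```python
def varint(m: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint, low group first."""
    if m < 0:
        raise ValueError("varint is only defined here for unsigned values")
    out = bytearray()
    while True:
        low7 = m & 0x7F
        m >>= 7
        if m:
            # More groups follow: set the most significant bit.
            out.append(low7 | 0x80)
        else:
            # Last group: the top bit stays clear.
            out.append(low7)
            return bytes(out)
```

Because the loop stops as soon as no higher bits remain, the output is always the shortest encoding, so it never ends in a zero byte unless m is 0.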
Records, Sequences, Sets and Dictionaries.
«<L F_1...F_m>» = [0xB4] ++ «L» ++ «F_1» ++...++ «F_m» ++ [0x84]
«[X_1...X_m]» = [0xB5] ++ «X_1» ++...++ «X_m» ++ [0x84]
«#{E_1...E_m}» = [0xB6] ++ «E_1» ++...++ «E_m» ++ [0x84]
«{K_1:V_1...K_m:V_m}» = [0xB7] ++ «K_1» ++ «V_1» ++...++ «K_m» ++ «V_m» ++ [0x84]
There is no ordering requirement on the E_i elements or K_i/V_i pairs.15 They may appear in any order. However, the E_i and K_i MUST be pairwise distinct. In addition, implementations SHOULD default to writing set elements and dictionary key/value pairs in order sorted lexicographically by their Reprs16, and MAY offer the option of serializing in some other implementation-defined order.
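The following Python sketch shows one way to assemble compound Reprs from already-encoded parts, using the default sorted order for sets and dictionaries. The function names are hypothetical, the inputs are assumed to be valid Reprs, and duplicate detection compares Reprs byte-for-byte, which is only sound if each value has a single possible encoding. Sorting dictionary pairs by the key's Repr is also an assumption of this sketch.

```python
END = b"\x84"  # the unique end tag


def encode_sequence(item_reprs):
    """Wrap already-encoded items as a Sequence, preserving their order."""
    return b"\xB5" + b"".join(item_reprs) + END


def encode_set(element_reprs):
    """Wrap already-encoded elements as a Set, sorted by Repr bytes."""
    if len(set(element_reprs)) != len(element_reprs):
        raise ValueError("set elements must be pairwise distinct")
    return b"\xB6" + b"".join(sorted(element_reprs)) + END


def encode_dictionary(pair_reprs):
    """Wrap (key_repr, value_repr) pairs as a Dictionary, sorted by key Repr."""
    keys = [k for k, _ in pair_reprs]
    if len(set(keys)) != len(keys):
        raise ValueError("dictionary keys must be pairwise distinct")
    body = b"".join(k + v for k, v in sorted(pair_reprs, key=lambda kv: kv[0]))
    return b"\xB7" + body + END
```

Note that byte-wise sorting places the Repr of 0 (90) before the Repr of -4 (A0 FC), even though -4 is semantically smaller.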
SignedIntegers.
«x» when x ∈ SignedInteger = [0xB0] ++ varint(m) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m>16
([0xA0] + m - 1) ++ intbytes(x) if ¬(-3≤x≤12) ∧ m≤16
([0xA0] + x) if (-3≤x≤-1)
([0x90] + x) if ( 0≤x≤12)
where m = |intbytes(x)|
Integers in the range [-3,12] are compactly represented with tags between 0x90 and 0x9F because they are so frequently used. Integers up to 16 bytes long are represented with a single-byte tag encoding the length of the integer. Larger integers are represented with an explicit varint length. Every SignedInteger MUST be represented with its shortest possible encoding.
The function intbytes(x) gives the big-endian two's-complement binary representation of x, taking exactly as many whole bytes as needed to unambiguously identify the value and its sign, and m = |intbytes(x)|. The most-significant bit in the first byte in intbytes(x) is the sign bit.17 For example,
«87112285931760246646623899502532662132736»
= B0 12 01 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00
«-257» = A1 FE FF «-3» = 9D «128» = A1 00 80
«-256» = A1 FF 00 «-2» = 9E «255» = A1 00 FF
«-255» = A1 FF 01 «-1» = 9F «256» = A1 01 00
«-254» = A1 FF 02 «0» = 90 «32767» = A1 7F FF
«-129» = A1 FF 7F «1» = 91 «32768» = A2 00 80 00
«-128» = A0 80 «12» = 9C «65535» = A2 00 FF FF
«-127» = A0 81 «13» = A0 0D «65536» = A2 01 00 00
«-4» = A0 FC «127» = A0 7F «131072» = A2 02 00 00
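The rules for SignedInteger can be sketched in Python as follows. The names intbytes and varint follow the notation of this section; encode_int is a hypothetical name, and the code is an illustration rather than a reference implementation.

```python
def varint(m: int) -> bytes:
    """Base-128 varint of a non-negative integer, low group first."""
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def intbytes(x: int) -> bytes:
    """Shortest big-endian two's-complement form of x; empty for zero."""
    if x == 0:
        return b""
    # Magnitude bits plus one sign bit, rounded up to whole bytes.
    bits = (x if x >= 0 else ~x).bit_length() + 1
    return x.to_bytes((bits + 7) // 8, "big", signed=True)


def encode_int(x: int) -> bytes:
    """Encode a SignedInteger using the tag scheme of this section."""
    if 0 <= x <= 12:
        return bytes([0x90 + x])
    if -3 <= x <= -1:
        return bytes([0xA0 + x])       # 0x9D, 0x9E, 0x9F
    body = intbytes(x)
    m = len(body)
    if m <= 16:
        return bytes([0xA0 + m - 1]) + body
    return b"\xB0" + varint(m) + body
```

The extra sign bit in intbytes is what makes 128 take two bytes (00 80) while -128 fits in one (80).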
Strings, ByteStrings and Symbols.
Syntax for these three types varies only in the tag used. For String and Symbol, the data following the tag is a UTF-8 encoding of the Value's code points, while for ByteString it is the raw data contained within the Value unmodified.
«S» = [0xB1] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ String
[0xB2] ++ varint(|S|) ++ S if S ∈ ByteString
[0xB3] ++ varint(|utf8(S)|) ++ utf8(S) if S ∈ Symbol
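Since the three encodings differ only in the tag, a single helper covers them. The sketch below is illustrative; the function names are hypothetical and varint is re-implemented so that the block stands alone.

```python
def varint(m: int) -> bytes:
    """Base-128 varint of a non-negative integer, low group first."""
    out = bytearray()
    while m >= 0x80:
        out.append((m & 0x7F) | 0x80)
        m >>= 7
    out.append(m)
    return bytes(out)


def _tagged(tag: int, payload: bytes) -> bytes:
    """tag ++ varint(length) ++ payload."""
    return bytes([tag]) + varint(len(payload)) + payload


def encode_string(s: str) -> bytes:
    return _tagged(0xB1, s.encode("utf-8"))


def encode_bytestring(b: bytes) -> bytes:
    return _tagged(0xB2, b)


def encode_symbol(name: str) -> bytes:
    return _tagged(0xB3, name.encode("utf-8"))
```

The length prefix counts UTF-8 bytes, not code points, so a one-character String such as "é" carries a length of 2.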
Booleans.
«#f» = [0x80]
«#t» = [0x81]
Floats and Doubles.
«F» when F ∈ Float = [0x82] ++ binary32(F)
«D» when D ∈ Double = [0x83] ++ binary64(D)
The functions binary32(F) and binary64(D) yield big-endian 4- and 8-byte IEEE 754 binary representations of F and D, respectively.
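In Python, the standard struct module can play the role of binary32 and binary64. The function names encode_float and encode_double are hypothetical; note that Python's own float is a double, so encode_float rounds its argument to single precision.

```python
import struct


def encode_float(f: float) -> bytes:
    """[0x82] ++ big-endian IEEE 754 binary32 of f."""
    return b"\x82" + struct.pack(">f", f)


def encode_double(d: float) -> bytes:
    """[0x83] ++ big-endian IEEE 754 binary64 of d."""
    return b"\x83" + struct.pack(">d", d)
```

With these definitions, 1.0f and 1.0 encode to the byte sequences listed for them in the Examples section.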
Annotations.
To annotate a Repr r with some Value v, prepend r with [0x85] ++ «v». For example, the Repr corresponding to textual syntax @a@b[], i.e. an empty sequence annotated with two symbols, a and b, is
«@a @b []»
= [0x85] ++ «a» ++ [0x85] ++ «b» ++ «[]»
= [0x85, 0xB3, 0x01, 0x61, 0x85, 0xB3, 0x01, 0x62, 0xB5, 0x84]
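The prepending rule can be sketched as a short Python helper. The name annotate is hypothetical, and both the annotated Repr and the annotation Reprs are assumed to be already encoded.

```python
def annotate(repr_bytes: bytes, *annotation_reprs: bytes) -> bytes:
    """Prefix a Repr with [0x85] ++ «v» for each annotation, in order."""
    prefix = b"".join(b"\x85" + a for a in annotation_reprs)
    return prefix + repr_bytes


# The empty sequence annotated with the symbols a and b.
example = annotate(b"\xb5\x84", b"\xb3\x01a", b"\xb3\x01b")
```

Here example reproduces the ten bytes shown in the worked example above.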
Examples
Ordering.
The total ordering specified above means that the following statements are true:
"bzz" < "c" < "caa"
#t < 3.0f < 3.0 < 3 < "3" < |3| < []
Simple examples.
Value | Encoded byte sequence |
---|---|
<capture <discard>> | B4 B3 07 'c' 'a' 'p' 't' 'u' 'r' 'e' B4 B3 07 'd' 'i' 's' 'c' 'a' 'r' 'd' 84 84 |
[1 2 3 4] | B5 91 92 93 94 84 |
[-2 -1 0 1] | B5 9E 9F 90 91 84 |
"hello" (format B) | B1 05 'h' 'e' 'l' 'l' 'o' |
["a" b #"c" [] #{} #t #f] | B5 B1 01 'a' B3 01 'b' B2 01 'c' B5 84 B6 84 81 80 84 |
-257 | A1 FE FF |
-1 | 9F |
0 | 90 |
1 | 91 |
255 | A1 00 FF |
1.0f | 82 3F 80 00 00 |
1.0 | 83 3F F0 00 00 00 00 00 00 |
-1.202e300 | 83 FE 3C B7 B7 59 BF 04 26 |
The next example uses a non-Symbol label for a record.18 The Record
<[titled person 2 thing 1] 101 "Blackwell" <date 1821 2 3> "Dr">
encodes to
B4 ;; Record
B5 ;; Sequence
B3 06 74 69 74 6C 65 64 ;; Symbol, "titled"
B3 06 70 65 72 73 6F 6E ;; Symbol, "person"
92 ;; SignedInteger, "2"
B3 05 74 68 69 6E 67 ;; Symbol, "thing"
91 ;; SignedInteger, "1"
84 ;; End (sequence)
A0 65 ;; SignedInteger, "101"
B1 09 42 6C 61 63 6B 77 65 6C 6C ;; String, "Blackwell"
B4 ;; Record
B3 04 64 61 74 65 ;; Symbol, "date"
A1 07 1D ;; SignedInteger, "1821"
92 ;; SignedInteger, "2"
93 ;; SignedInteger, "3"
84 ;; End (record)
B1 02 44 72 ;; String, "Dr"
84 ;; End (record)
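To show how such a byte sequence can be read back, here is a minimal Python decoder sketch. It is an illustration under several assumptions: the names read_varint and decode are hypothetical, decoded values are represented as plain Python tuples and lists chosen for this sketch, and Float, Double, annotations and the B0 large-integer form are deliberately left out.

```python
def read_varint(buf: bytes, pos: int):
    """Decode a base-128 varint at buf[pos]; return (value, next_pos)."""
    shift = value = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def decode(buf: bytes, pos: int = 0):
    """Decode one Repr at buf[pos]; return (value, next_pos)."""
    tag = buf[pos]
    pos += 1
    if tag in (0x80, 0x81):                      # Booleans
        return tag == 0x81, pos
    if 0x90 <= tag <= 0x9F:                      # small integers -3..12
        n = tag - 0x90
        return (n if n <= 12 else n - 16), pos
    if 0xA0 <= tag <= 0xAF:                      # integers of 1..16 bytes
        end = pos + (tag - 0xA0) + 1
        return int.from_bytes(buf[pos:end], "big", signed=True), end
    if tag in (0xB1, 0xB2, 0xB3):                # String, ByteString, Symbol
        length, pos = read_varint(buf, pos)
        data = bytes(buf[pos:pos + length])
        kind = {0xB1: "string", 0xB2: "bytes", 0xB3: "symbol"}[tag]
        payload = data if tag == 0xB2 else data.decode("utf-8")
        return (kind, payload), pos + length
    if 0xB4 <= tag <= 0xB7:                      # compound data
        items = []
        while buf[pos] != 0x84:                  # read until the end tag
            item, pos = decode(buf, pos)
            items.append(item)
        kind = {0xB4: "record", 0xB5: "sequence",
                0xB6: "set", 0xB7: "dictionary"}[tag]
        return (kind, items), pos + 1
    raise ValueError(f"unsupported tag 0x{tag:02X}")


blackwell = bytes.fromhex(
    "B4 B5 B3 06 74 69 74 6C 65 64 B3 06 70 65 72 73 6F 6E 92"
    " B3 05 74 68 69 6E 67 91 84 A0 65"
    " B1 09 42 6C 61 63 6B 77 65 6C 6C"
    " B4 B3 04 64 61 74 65 A1 07 1D 92 93 84"
    " B1 02 44 72 84"
)
value, consumed = decode(blackwell)
```

In this representation a record's first item is its label, and a dictionary's items alternate between keys and values.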
JSON examples.
The examples from RFC 8259 read as valid Preserves, though the JSON literals true, false and null read as Symbols. The first example:
{
"Image": {
"Width": 800,
"Height": 600,
"Title": "View from 15th Floor",
"Thumbnail": {
"Url": "http://www.example.com/image/481989943",
"Height": 125,
"Width": 100
},
"Animated" : false,
"IDs": [116, 943, 234, 38793]
}
}
encodes to binary as follows:
B7
B1 05 "Image"
B7
B1 03 "IDs" B5
A0 74
A1 03 AF
A1 00 EA
A2 00 97 89
84
B1 05 "Title" B1 14 "View from 15th Floor"
B1 05 "Width" A1 03 20
B1 06 "Height" A1 02 58
B1 08 "Animated" B3 05 "false"
B1 09 "Thumbnail"
B7
B1 03 "Url" B1 26 "http://www.example.com/image/481989943"
B1 05 "Width" A0 64
B1 06 "Height" A0 7D
84
84
84
and the second example:
[
{
"precision": "zip",
"Latitude": 37.7668,
"Longitude": -122.3959,
"Address": "",
"City": "SAN FRANCISCO",
"State": "CA",
"Zip": "94107",
"Country": "US"
},
{
"precision": "zip",
"Latitude": 37.371991,
"Longitude": -122.026020,
"Address": "",
"City": "SUNNYVALE",
"State": "CA",
"Zip": "94085",
"Country": "US"
}
]
encodes to binary as follows:
B5
B7
B1 03 "Zip" B1 05 "94107"
B1 04 "City" B1 0D "SAN FRANCISCO"
B1 05 "State" B1 02 "CA"
B1 07 "Address" B1 00
B1 07 "Country" B1 02 "US"
B1 08 "Latitude" 83 40 42 E2 26 80 9D 49 52
B1 09 "Longitude" 83 C0 5E 99 56 6C F4 1F 21
B1 09 "precision" B1 03 "zip"
84
B7
B1 03 "Zip" B1 05 "94085"
B1 04 "City" B1 09 "SUNNYVALE"
B1 05 "State" B1 02 "CA"
B1 07 "Address" B1 00
B1 07 "Country" B1 02 "US"
B1 08 "Latitude" 83 40 42 AF 9D 66 AD B4 03
B1 09 "Longitude" 83 C0 5E 81 AA 4F CA 42 AF
B1 09 "precision" B1 03 "zip"
84
84
Security Considerations
Whitespace. The textual format allows arbitrary whitespace in many positions. Consider optional restrictions on the amount of consecutive whitespace that may appear.
Annotations. Similarly, in modes where a Value is being read while annotations are skipped, an endless sequence of annotations may give an illusion of progress.
Canonical form for cryptographic hashing and signing. No canonical textual encoding of a Value is specified. A canonical form exists for binary encoded Values, and implementations SHOULD produce canonical binary encodings by default; however, an implementation MAY permit two serializations of the same Value to yield different binary Reprs.
Acknowledgements
The use of the low-order bits in certain SignedInteger tags for the length of the following data is inspired by a similar feature of CBOR.
The treatment of commas as whitespace in the text syntax is inspired by the same feature of EDN.
The text syntax for Booleans, Symbols, and ByteStrings is directly inspired by Racket's lexical syntax.
Appendix. Autodetection of textual or binary syntax
Every tag byte in a binary Preserves Document falls within the range [0x80, 0xBF]. These bytes, interpreted as UTF-8, are continuation bytes, and will never occur as the first byte of a UTF-8 encoded code point. This means no binary-encoded document can be misinterpreted as valid UTF-8.
Conversely, a UTF-8 document must start with a valid codepoint, meaning in particular that it must not start with a byte in the range [0x80, 0xBF]. This means that no UTF-8 encoded textual-syntax Preserves document can be misinterpreted as a binary-syntax document.
Examination of the top two bits of the first byte of a document gives its syntax: if the top two bits are 10, it should be interpreted as a binary-syntax document; otherwise, it should be interpreted as text.
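The check on the top two bits amounts to a single mask-and-compare. A minimal Python sketch, with the hypothetical name detect_syntax:

```python
def detect_syntax(document: bytes) -> str:
    """Return "binary" or "text" based on the first byte of a document."""
    if not document:
        raise ValueError("an empty input has no first byte to inspect")
    # Top two bits 10 means a byte in [0x80, 0xBF], i.e. a binary tag.
    return "binary" if (document[0] & 0xC0) == 0x80 else "text"
```

A textual document that begins with a non-ASCII character is still detected as text, because the lead byte of a multi-byte UTF-8 sequence has its top two bits set to 11.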
Appendix. Table of tag values
80 - False
81 - True
82 - Float
83 - Double
84 - End marker
85 - Annotation
(8x) RESERVED 86-8F
9x - Small integers 0..12,-3..-1
An - Medium integers, (n+1) bytes long
B0 - Large integers, variable length
B1 - String
B2 - ByteString
B3 - Symbol
B4 - Record
B5 - Sequence
B6 - Set
B7 - Dictionary
Appendix. Binary SignedInteger representation
Languages that provide fixed-width machine word types may find the
following table useful in encoding and decoding binary SignedInteger
values.
Integer range | Bytes required | Encoding (hex) |
---|---|---|
-3 ≤ n ≤ 12 | 1 | 9X |
-2^7 ≤ n < 2^7 (i8) | 2 | A0 XX |
-2^15 ≤ n < 2^15 (i16) | 3 | A1 XX XX |
-2^23 ≤ n < 2^23 (i24) | 4 | A2 XX XX XX |
-2^31 ≤ n < 2^31 (i32) | 5 | A3 XX XX XX XX |
-2^39 ≤ n < 2^39 (i40) | 6 | A4 XX XX XX XX XX |
-2^47 ≤ n < 2^47 (i48) | 7 | A5 XX XX XX XX XX XX |
-2^55 ≤ n < 2^55 (i56) | 8 | A6 XX XX XX XX XX XX XX |
-2^63 ≤ n < 2^63 (i64) | 9 | A7 XX XX XX XX XX XX XX XX |
Notes
1. Happily, the design of UTF-8 is such that this gives the same result as a lexicographic byte-by-byte comparison of the UTF-8 encoding of a string!
2. The Racket programming language defines “prefab” structure types, which map well to our Records. Racket supports record extensibility by encoding record supertypes into record labels as specially-formatted lists.
3. It is occasionally (but seldom) necessary to interpret such Symbol labels as UTF-8 encoded IRIs. Where a label can be read as a relative IRI, it is notionally interpreted with respect to the IRI urn:uuid:6bf094a6-20f1-4887-ada7-46834a9b5b34; where a label can be read as an absolute IRI, it stands for that IRI; and otherwise, it cannot be read as an IRI at all, and so the label simply stands for itself, that is, for its own Value.
4. The grammar of the textual syntax is a superset of JSON, with the slightly unusual feature that true, false, and null are all read as Symbols, and that SignedIntegers are never read as Doubles.
5. Implementation note. When implementing printing of Values using the textual syntax, consider supporting (a) optional pretty-printing with indentation, (b) optional JSON-compatible print mode for that subset of Value that is compatible with JSON, and (c) optional submodes for no commas, commas separating, and commas terminating elements or key/value pairs within a collection.
6. Implementation note. Your language's standard library likely has a good routine for converting between decimal notation and IEEE 754 floating-point. However, if not, or if you are interested in the challenges of accurately reading and writing floating point numbers, see the excellent matched pair of 1990 papers by Clinger and Steele & White, and a recent follow-up by Jaffer:
    Clinger, William D. ‘How to Read Floating Point Numbers Accurately’. In Proc. PLDI. White Plains, New York, 1990. https://doi.org/10.1145/93542.93557.
    Steele, Guy L., Jr., and Jon L. White. ‘How to Print Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains, New York, 1990. https://doi.org/10.1145/93542.93559.
    Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013. http://arxiv.org/abs/1310.8121.
7. Implementation note. Be aware when implementing reading and writing of SignedIntegers that the data model requires arbitrary-precision integers. Your implementation may (but, ideally, should not) truncate precision when reading or writing a SignedInteger; however, if it does so, it should (a) signal its client that truncation has occurred, and (b) make it clear to the client that comparing such truncated values for equality or ordering will not yield results that match the expected semantics of the data model.
8. The grammar for String has the same effect as the JSON grammar for string. Some auxiliary definitions (e.g. escaped) are lifted largely unmodified from the text of RFC 8259.
9. In particular, note JSON's rules around the use of surrogate pairs for code points not in the Basic Multilingual Plane. We encourage implementations to avoid using \u escapes when producing output, and instead to rely on the UTF-8 encoding of the entire document to handle non-ASCII codepoints correctly.
10. Compare with the SPKI S-expression definition of “token representation”, and with the R6RS definition of identifiers.
11. Rationale. The textual syntax cannot express every Value: specifically, it cannot express the several million floating-point NaNs, or the two floating-point Infinities. Since the compact binary format for Values expresses each Value with precision, embedding binary Values solves the problem.
12. Every text is ultimately physically stored as bytes; therefore, it might seem possible to escape to the raw binary form of compact binary encoding from within a piece of textual syntax. However, while bytes must be involved in any representation of text, the text itself is logically a sequence of code points and is not intrinsically a binary structure at all. It would be incoherent to expect to be able to access the representation of the text from within the text itself.
13. Any text-syntax annotations preceding the # are prepended to any binary-syntax annotations yielded by decoding the ByteString.
14. Also known as LEB128 encoding, for unsigned integers. Varints and LEB128-encoded integers differ only for signed integers, which are not used in Preserves.
15. In the BitTorrent encoding format, bencoding, dictionary key/value pairs must be sorted by key. This is a necessary step for ensuring serialization of Values is canonical. We do not require that key/value pairs (or set elements) be in sorted order for serialized Values; however, a canonical form for Reprs does exist where a sorted ordering is required.
16. It's important to note that the sort ordering for writing out set elements and dictionary key/value pairs is not the same as the sort ordering implied by the semantic ordering of those elements or keys. For example, the Repr of a negative number very far from zero will start with a byte that is greater than the byte which starts the Repr of zero, making it sort lexicographically later by Repr, despite being semantically less than zero. Rationale. This is for ease-of-implementation reasons: not all languages can easily represent sorted sets or sorted dictionaries, but encoding and then sorting byte strings is much more likely to be within easy reach.
17. The value 0 needs zero bytes to identify the value, so intbytes(0) is the empty byte string. Non-zero values need at least one byte.
18. It happens to line up with Racket's representation of a record label for an inheritance hierarchy where titled extends person extends thing:
    (struct date (year month day) #:prefab)
    (struct thing (id) #:prefab)
    (struct person thing (name date-of-birth) #:prefab)
    (struct titled person (title) #:prefab)
    For more detail on Racket's representations of record labels, see the Racket documentation for make-prefab-struct.