---
no_site_title: true
title: "Preserves: Text Syntax"
---

Tony Garnock-Jones <tonyg@leastfixedpoint.com>
{{ site.version_date }}. Version {{ site.version }}.

[sexp.txt]: http://people.csail.mit.edu/rivest/Sexp.txt
[abnf]: https://tools.ietf.org/html/rfc7405

*Preserves* is a data model, with associated serialization formats. This
document defines one of those formats: a textual syntax for `Value`s
from the [Preserves data model](preserves.html) that is easy for people
to read and write. An [equivalent machine-oriented binary
syntax](preserves-binary.html) also exists.

## Preliminaries

The definition uses [case-sensitive ABNF][abnf].

ABNF allows easy definition of US-ASCII-based languages. However,
Preserves is a Unicode-based language. Therefore, we reinterpret ABNF as
a grammar for recognising sequences of Unicode code points.

**Encoding.** Textual syntax for a `Value` *SHOULD* be encoded using
UTF-8 where possible.

**Whitespace.** Whitespace is defined as any number of spaces, tabs,
carriage returns, line feeds, or commas.

    ws      = *(%x20 / %x09 / newline / ",")
    newline = CR / LF

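**Example (non-normative).** The `ws` rule can be sketched as a small
scanning helper. The names `WS_CHARS` and `skip_ws` are illustrative
assumptions and are not part of any Preserves implementation.

```python
# Space, tab, carriage return, line feed and comma, per the `ws` rule.
WS_CHARS = " \t\r\n,"


def skip_ws(text: str, pos: int = 0) -> int:
    """Return the index of the first non-whitespace code point at or after pos."""
    while pos < len(text) and text[pos] in WS_CHARS:
        pos += 1
    return pos
```

Because commas are members of `WS_CHARS`, a reader built on such a
helper treats `[1, 2, 3]` and `[1 2 3]` identically.
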
## Grammar

Standalone documents may have trailing whitespace.

    Document = Value ws

Any `Value` may be preceded by whitespace.

    Value      = ws (Record / Collection / Atom / Embedded / Machine)
    Collection = Sequence / Dictionary / Set
    Atom       = Boolean / Float / Double / SignedInteger /
                 String / ByteString / Symbol

Each `Record` is an angle-bracket enclosed grouping of its
label-`Value` followed by its field-`Value`s.

    Record = "<" Value *Value ws ">"

`Sequence`s are enclosed in square brackets. `Dictionary` values are
curly-brace-enclosed colon-separated pairs of values. `Set`s are
written as values enclosed by the tokens `#{` and
`}`.[^printing-collections] It is an error for a set to contain
duplicate elements or for a dictionary to contain duplicate keys.

    Sequence   = "[" *Value ws "]"
    Dictionary = "{" *(Value ws ":" Value) ws "}"
    Set        = "#{" *Value ws "}"

  [^printing-collections]: **Implementation note.** When implementing
    printing of `Value`s using the textual syntax, consider supporting
    (a) optional pretty-printing with indentation, (b) optional
    JSON-compatible print mode for that subset of `Value` that is
    compatible with JSON, and (c) optional submodes for no commas,
    commas separating, and commas terminating elements or key/value
    pairs within a collection.

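**Example (non-normative).** A reader might enforce the rule against
duplicates while assembling collections, roughly as sketched here. The
function names are hypothetical. The sketch assumes that parsed values
are hashable Python objects whose `==` agrees with the data model's
equivalence. That assumption does not hold for plain Python numbers and
booleans (`1 == 1.0 == True` in Python), so a real implementation would
need its own value types.

```python
def build_dictionary(pairs):
    """Collect (key, value) pairs, rejecting duplicate keys."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate dictionary key: {key!r}")
        result[key] = value
    return result


def build_set(elements):
    """Collect set elements, rejecting duplicates."""
    result = set()
    for element in elements:
        if element in result:
            raise ValueError(f"duplicate set element: {element!r}")
        result.add(element)
    return frozenset(result)
```
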
`Boolean`s are the simple literal strings `#t` and `#f` for true and
false, respectively.

    Boolean = %s"#t" / %s"#f"

Numeric data follow the
[JSON grammar](https://tools.ietf.org/html/rfc8259#section-6), with
the addition of a trailing “f” distinguishing `Float` from `Double`
values. `Float`s and `Double`s always have a fractional part, an
exponent part, or both, whereas `SignedInteger`s never have
either.[^reading-and-writing-floats-accurately]
[^arbitrary-precision-signedinteger]

    Float         = flt %i"f"
    Double        = flt
    SignedInteger = int

    digit1-9 = %x31-39
    nat      = %x30 / ( digit1-9 *DIGIT )
    int      = ["-"] nat
    frac     = "." 1*DIGIT
    exp      = %i"e" ["-"/"+"] 1*DIGIT
    flt      = int (frac exp / frac / exp)

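**Example (non-normative).** The numeric productions translate fairly
directly into regular expressions. The following sketch classifies a
complete token as `Float`, `Double` or `SignedInteger`. The function
name `classify_number` and the pattern constants are illustrative
assumptions.

```python
import re

NAT = r"(?:0|[1-9][0-9]*)"                    # nat  = %x30 / ( digit1-9 *DIGIT )
INT = rf"-?{NAT}"                             # int  = ["-"] nat
FRAC = r"\.[0-9]+"                            # frac = "." 1*DIGIT
EXP = r"[eE][-+]?[0-9]+"                      # exp  = %i"e" ["-"/"+"] 1*DIGIT
FLT = rf"{INT}(?:{FRAC}{EXP}|{FRAC}|{EXP})"   # flt  = int (frac exp / frac / exp)


def classify_number(token: str):
    """Return the name of the numeric production matching token, or None."""
    if re.fullmatch(FLT + "[fF]", token):
        return "Float"
    if re.fullmatch(FLT, token):
        return "Double"
    if re.fullmatch(INT, token):
        return "SignedInteger"
    return None
```

Under this sketch `1` is a `SignedInteger`, `1.0` and `1e10` are
`Double`s, and `1.0f` is a `Float`. A token such as `1f` matches none of
the three productions, because a `Float` needs a fractional or exponent
part before the suffix.
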
  [^reading-and-writing-floats-accurately]: **Implementation note.**
    Your language's standard library likely has a good routine for
    converting between decimal notation and IEEE 754 floating-point.
    However, if not, or if you are interested in the challenges of
    accurately reading and writing floating point numbers, see the
    excellent matched pair of 1990 papers by Clinger and Steele &
    White, and a 2013 follow-up by Jaffer:

    Clinger, William D. ‘How to Read Floating Point Numbers
    Accurately’. In Proc. PLDI. White Plains, New York, 1990.
    <https://doi.org/10.1145/93542.93557>.

    Steele, Guy L., Jr., and Jon L. White. ‘How to Print
    Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains,
    New York, 1990. <https://doi.org/10.1145/93542.93559>.

    Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of
    Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013.
    <http://arxiv.org/abs/1310.8121>.

  [^arbitrary-precision-signedinteger]: **Implementation note.** Be
    aware when implementing reading and writing of `SignedInteger`s
    that the data model *requires* arbitrary-precision integers. Your
    implementation may (but, ideally, should not) truncate precision
    when reading or writing a `SignedInteger`; however, if it does so,
    it should (a) signal its client that truncation has occurred, and
    (b) make it clear to the client that comparing such truncated
    values for equality or ordering will not yield results that match
    the expected semantics of the data model.

`String`s are,
[as in JSON](https://tools.ietf.org/html/rfc8259#section-7), possibly
escaped text surrounded by double quotes. The escaping rules are the
same as for JSON.[^string-json-correspondence] [^escaping-surrogate-pairs]

    String    = %x22 *char %x22
    char      = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
    unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
    escape    = %x5C              ; \
    escaped   = ( %x5C /          ; \    reverse solidus U+005C
                  %x2F /          ; /    solidus         U+002F
                  %x62 /          ; b    backspace       U+0008
                  %x66 /          ; f    form feed       U+000C
                  %x6E /          ; n    line feed       U+000A
                  %x72 /          ; r    carriage return U+000D
                  %x74 )          ; t    tab             U+0009

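**Example (non-normative).** Since the `String` grammar is intended to
have the same effect as JSON's `string` grammar, one possible shortcut
is to hand a complete `String` literal to an existing JSON string
decoder. The sketch below does so with Python's standard `json` module.
The function name `read_string_literal` is an assumption, and the
sketch expects the caller to have already isolated the literal from the
surrounding text.

```python
import json


def read_string_literal(literal: str) -> str:
    """Decode one double-quoted String literal, including its escapes."""
    if len(literal) < 2 or literal[0] != '"' or literal[-1] != '"':
        raise ValueError("not a double-quoted String literal")
    # json.loads rejects unescaped control characters and unknown escapes,
    # and combines an escaped surrogate pair into a single code point.
    return json.loads(literal)
```

Note that Python's decoder also accepts an unpaired `\u` surrogate
escape, which a stricter reader may prefer to reject.
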
  [^string-json-correspondence]: The grammar for `String` has the same
    effect as the
    [JSON](https://tools.ietf.org/html/rfc8259#section-7) grammar for
    `string`. Some auxiliary definitions (e.g. `escaped`) are lifted
    largely unmodified from the text of RFC 8259.

  [^escaping-surrogate-pairs]: In particular, note JSON's rules around
    the use of surrogate pairs for code points not in the Basic
    Multilingual Plane. We encourage implementations to avoid using
    `\u` escapes when producing output, and instead to rely on the
    UTF-8 encoding of the entire document to handle non-ASCII
    codepoints correctly.

A `ByteString` may be written in any of three different forms.

The first is similar to a `String`, but prefixed with a hash sign
`#`. In addition, only Unicode code points overlapping with printable
7-bit ASCII are permitted unescaped inside such a `ByteString`; other
byte values must be escaped by writing `\x` followed by a two-digit
hexadecimal value.

    ByteString   = "#" %x22 *binchar %x22
    binchar      = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
    binunescaped = %x20-21 / %x23-5B / %x5D-7E

The second is as a sequence of pairs of hexadecimal digits interleaved
with whitespace and surrounded by `#x"` and `"`.

    ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22

The third is as a sequence of
[Base64](https://tools.ietf.org/html/rfc4648) characters, interleaved
with whitespace and surrounded by `#[` and `]`. Both the standard and
the URL-safe Base64 alphabets are allowed.

    ByteString =/ "#[" *(ws / base64char) ws "]"
    base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="

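**Example (non-normative).** The three `ByteString` forms can be
decoded along the following lines. This is a sketch under stated
assumptions: the names `decode_bytestring` and `_decode_quoted` are
hypothetical, the input is a single complete `ByteString` token, and
`=` characters in the Base64 form are simply ignored, with padding
recomputed before decoding.

```python
import base64
import re

WS = " \t\r\n,"
SIMPLE_ESCAPES = {"\\": 0x5C, "/": 0x2F, "b": 0x08, "f": 0x0C,
                  "n": 0x0A, "r": 0x0D, "t": 0x09, '"': 0x22}


def _decode_quoted(body: str) -> bytes:
    """Decode the contents of the `#"..."` form."""
    out = bytearray()
    i = 0
    while i < len(body):
        c = body[i]
        if c == "\\":
            if i + 1 >= len(body):
                raise ValueError("dangling escape")
            e = body[i + 1]
            if e == "x":
                digits = body[i + 2:i + 4]
                if not re.fullmatch(r"[0-9A-Fa-f]{2}", digits):
                    raise ValueError("bad \\x escape")
                out.append(int(digits, 16))
                i += 4
            elif e in SIMPLE_ESCAPES:
                out.append(SIMPLE_ESCAPES[e])
                i += 2
            else:
                raise ValueError("unknown escape")
        elif c != '"' and 0x20 <= ord(c) <= 0x7E:
            out.append(ord(c))  # printable ASCII stands for itself
            i += 1
        else:
            raise ValueError("code point must be escaped")
    return bytes(out)


def decode_bytestring(text: str) -> bytes:
    """Decode one ByteString token written in any of the three forms."""
    if text.startswith('#x"') and text.endswith('"') and len(text) >= 4:
        body = text[3:-1]
        if not re.fullmatch(r"(?:[ \t\r\n,]|[0-9A-Fa-f]{2})*", body):
            raise ValueError("malformed hexadecimal ByteString")
        return bytes(int(p, 16) for p in re.findall(r"[0-9A-Fa-f]{2}", body))
    if text.startswith("#[") and text.endswith("]"):
        body = "".join(c for c in text[2:-1] if c not in WS)
        if not re.fullmatch(r"[A-Za-z0-9+/_=-]*", body):
            raise ValueError("malformed Base64 ByteString")
        # Map the URL-safe alphabet onto the standard one and re-pad.
        body = body.replace("=", "").replace("-", "+").replace("_", "/")
        return base64.b64decode(body + "=" * (-len(body) % 4))
    if text.startswith('#"') and text.endswith('"') and len(text) >= 3:
        return _decode_quoted(text[2:-1])
    raise ValueError("not a ByteString")
```

With this sketch, `#"hello"`, `#x"68 65 6c 6c 6f"` and `#[aGVsbG8=]` all
decode to the same five bytes.
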
A `Symbol` may be written in a “bare” form[^cf-sexp-token] so long as
it conforms to certain restrictions on the characters appearing in the
symbol. Alternatively, it may be written in a quoted form. The quoted
form is much the same as the syntax for `String`s, including embedded
escape syntax, except using a bar or pipe character (`|`) instead of a
double quote mark.

    Symbol    = symstart *symcont / "|" *symchar "|"
    symstart  = ALPHA / sympunct / symustart
    symcont   = ALPHA / sympunct / symustart / symucont / DIGIT / "-"
    sympunct  = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
                "?" / "_" / "=" / "+" / "/" / "."
    symchar   = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
    symustart = <any code point greater than 127 whose Unicode
                 category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me,
                 Pc, Po, Sc, Sm, Sk, So, or Co>
    symucont  = <any code point greater than 127 whose Unicode
                 category is Nd, Nl, No, or Pd>

  [^cf-sexp-token]: Compare with the [SPKI S-expression][sexp.txt]
    definition of “token representation”, and with the
    [R6RS definition of identifiers](http://www.r6rs.org/final/html/r6rs/r6rs-Z-H-7.html#node_sec_4.2.4).

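**Example (non-normative).** A printer has to decide whether a symbol
can be written bare or needs the quoted `|...|` form. The check below
is a sketch of the `symstart` and `symcont` rules using Python's
`unicodedata` module. The function names are hypothetical, and the
result depends on the Unicode database version bundled with the Python
runtime.

```python
import unicodedata

SYMPUNCT = set("~!$%^&*?_=+/.")
USTART = {"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me",
          "Pc", "Po", "Sc", "Sm", "Sk", "So", "Co"}
UCONT = {"Nd", "Nl", "No", "Pd"}


def is_symstart(c: str) -> bool:
    if ord(c) > 127:
        return unicodedata.category(c) in USTART
    # For ASCII input, str.isalpha() is true exactly for A-Z and a-z.
    return c.isalpha() or c in SYMPUNCT


def is_symcont(c: str) -> bool:
    if ord(c) > 127:
        return unicodedata.category(c) in USTART | UCONT
    # For ASCII input, str.isdigit() is true exactly for 0-9.
    return is_symstart(c) or c.isdigit() or c == "-"


def is_bare_symbol(name: str) -> bool:
    """True if name can be written without the |...| quoting."""
    return (len(name) > 0
            and is_symstart(name[0])
            and all(is_symcont(c) for c in name[1:]))
```
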
An `Embedded` is written as a `Value` chosen to represent the denoted
object, prefixed with `#!`.

    Embedded = "#!" Value

Finally, any `Value` may be represented by escaping from the textual
syntax to the [machine-oriented binary syntax](preserves-binary.html)
by prefixing a `ByteString` containing the binary representation of the
`Value` with `#=`.[^rationale-switch-to-binary]
[^no-literal-binary-in-text] [^machine-value-annotations]

    Machine = "#=" ws ByteString

  [^rationale-switch-to-binary]: **Rationale.** The textual syntax
    cannot express every `Value`: specifically, it cannot express the
    many distinct floating-point NaNs (over sixteen million for
    `Float` alone), or the two floating-point Infinities. Since the
    machine-oriented binary format for `Value`s expresses each `Value`
    with precision, embedding binary `Value`s solves the problem.

  [^no-literal-binary-in-text]: Every text is ultimately physically
    stored as bytes; therefore, it might seem possible to escape to the
    raw form of binary encoding from within a piece of textual syntax.
    However, while bytes must be involved in any *representation* of
    text, the text *itself* is logically a sequence of *code points* and
    is not *intrinsically* a binary structure at all. It would be
    incoherent to expect to be able to access the representation of the
    text from within the text itself.

  [^machine-value-annotations]: Any text-syntax annotations preceding
    the `#=` are prepended to any binary-syntax annotations yielded by
    decoding the `ByteString`.

## Annotations

When written down, a `Value` may have an associated sequence of
*annotations* carrying “out-of-band” contextual metadata about the
value. Each annotation is, in turn, a `Value`, and may itself have
annotations. The ordering of annotations attached to a `Value` is
significant.

    Value =/ ws "@" Value Value

Each annotation is preceded by `@`; the underlying annotated value
follows its annotations. Here we extend only the syntactic nonterminal
named “`Value`” without altering the semantic class of `Value`s.

**Comments.** Strings annotating a `Value` are conventionally
interpreted as comments associated with that value. Comments are
sufficiently common that special syntax exists for them.

    Value =/ ws
             ";" *(%x00-09 / %x0B-0C / %x0E-10FFFF) newline
             Value

When written this way, everything between the `;` and the newline is
included in the string annotating the `Value`.

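**Example (non-normative).** A comment is shorthand for a `String`
annotation, so a tool could rewrite one form into the other. The sketch
below turns the text of a comment, taken from the `;` up to but not
including the terminating newline, into the equivalent `@`-prefixed
annotation. The function name `comment_to_annotation` is an assumption,
and `json.dumps` is used only as a convenient way to produce a
double-quoted literal with JSON-style escapes.

```python
import json


def comment_to_annotation(comment: str) -> str:
    """Rewrite `;text` (without its newline) as an `@"text"` annotation."""
    if not comment.startswith(";"):
        raise ValueError("a comment starts with ';'")
    text = comment[1:]
    if "\r" in text or "\n" in text:
        raise ValueError("a comment ends at the first CR or LF")
    # ensure_ascii=False leaves non-ASCII code points unescaped.
    return "@" + json.dumps(text, ensure_ascii=False)
```

For instance, the comment `; hello` becomes `@" hello"`: the space after
the `;` is part of the annotation string.
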
**Equivalence.** Annotations appear within syntax denoting a `Value`;
however, the annotations are not part of the denoted value. They are
only part of the syntax. Annotations do not play a part in
equivalences and orderings of `Value`s.

Reflective tools such as debuggers, user interfaces, and message
routers and relays---tools which process `Value`s generically---may
use annotated inputs to tailor their operation, or may insert
annotations in their outputs. By contrast, in ordinary programs, as a
rule of thumb, the presence, absence or content of an annotation
should not change the control flow or output of the program.
Annotations are data *describing* `Value`s, and are not in the domain
of any specific application of `Value`s. That is, an annotation will
almost never cause a non-reflective program to do anything observably
different.

## Security Considerations

**Whitespace.** The textual format allows arbitrary whitespace in many
positions. Consider optional restrictions on the amount of consecutive
whitespace that may appear.

**Annotations.** Similarly, in modes where a `Value` is being read
while annotations are skipped, an endless sequence of annotations may
give an illusion of progress.

## Acknowledgements

The treatment of commas as whitespace in the text syntax is inspired
by the same feature of [EDN](https://github.com/edn-format/edn).

The text syntax for `Boolean`s, `Symbol`s, and `ByteString`s is
directly inspired by [Racket](https://racket-lang.org/)'s lexical
syntax.

<!-- Heading to visually offset the footnotes from the main document: -->
## Notes