13 KiB
no_site_title | title |
---|---|
true | Preserves: Text Syntax |
Tony Garnock-Jones tonyg@leastfixedpoint.com
{{ site.version_date }}. Version {{ site.version }}.
Preserves is a data model, with associated serialization formats. This
document defines one of those formats: a textual syntax for Value
s
from the Preserves data model that is easy for people
to read and write. An equivalent machine-oriented binary
syntax also exists.
Preliminaries
The definition uses case-sensitive ABNF.
ABNF allows easy definition of US-ASCII-based languages. However, Preserves is a Unicode-based language. Therefore, we reinterpret ABNF as a grammar for recognising sequences of Unicode code points.
Encoding. Textual syntax for a Value
SHOULD be encoded using
UTF-8 where possible.
Whitespace. Whitespace is defined as any number of spaces, tabs, carriage returns, line feeds, or commas.
ws = *(%x20 / %x09 / newline / ",")
newline = CR / LF
Grammar
Standalone documents may have trailing whitespace.
Document = Value ws
Any Value
may be preceded by whitespace.
Value = ws (Record / Collection / Atom / Embedded / Machine)
Collection = Sequence / Dictionary / Set
Atom = Boolean / Float / Double / SignedInteger /
String / ByteString / Symbol
Each Record
is an angle-bracket enclosed grouping of its
label-Value
followed by its field-Value
s.
Record = "<" Value *Value ws ">"
Sequence
s are enclosed in square brackets. Dictionary
values are
curly-brace-enclosed colon-separated pairs of values. Set
s are
written as values enclosed by the tokens #{
and
}
.1 It is an error for a set to contain
duplicate elements or for a dictionary to contain duplicate keys.
Sequence = "[" *Value ws "]"
Dictionary = "{" *(Value ws ":" Value) ws "}"
Set = "#{" *Value ws "}"
Boolean
s are the simple literal strings #t
and #f
for true and
false, respectively.
Boolean = %s"#t" / %s"#f"
Numeric data follow the
JSON grammar, with
the addition of a trailing “f” distinguishing Float
from Double
values. Float
s and Double
s always have either a fractional part or
an exponent part, where SignedInteger
s never have
either.2
3
Float = flt %i"f"
Double = flt
SignedInteger = int
digit1-9 = %x31-39
nat = %x30 / ( digit1-9 *DIGIT )
int = ["-"] nat
frac = "." 1*DIGIT
exp = %i"e" ["-"/"+"] 1*DIGIT
flt = int (frac exp / frac / exp)
String
s are,
as in JSON, possibly
escaped text surrounded by double quotes. The escaping rules are the
same as for JSON.4 5
String = %x22 *char %x22
char = unescaped / %x7C / escape (escaped / %x22 / %s"u" 4HEXDIG)
unescaped = %x20-21 / %x23-5B / %x5D-7B / %x7D-10FFFF
escape = %x5C ; \
escaped = ( %x5C / ; \ reverse solidus U+005C
%x2F / ; / solidus U+002F
%x62 / ; b backspace U+0008
%x66 / ; f form feed U+000C
%x6E / ; n line feed U+000A
%x72 / ; r carriage return U+000D
%x74 ) ; t tab U+0009
A ByteString
may be written in any of three different forms.
The first is similar to a String
, but prepended with a hash sign
#
. In addition, only Unicode code points overlapping with printable
7-bit ASCII are permitted unescaped inside such a ByteString
; other
byte values must be escaped by prepending a two-digit hexadecimal
value with \x
.
ByteString = "#" %x22 *binchar %x22
binchar = binunescaped / escape (escaped / %x22 / %s"x" 2HEXDIG)
binunescaped = %x20-21 / %x23-5B / %x5D-7E
The second is as a sequence of pairs of hexadecimal digits interleaved
with whitespace and surrounded by #x"
and "
.
ByteString =/ %s"#x" %x22 *(ws / 2HEXDIG) ws %x22
The third is as a sequence of
Base64 characters, interleaved
with whitespace and surrounded by #[
and ]
. Plain and URL-safe
Base64 characters are allowed.
ByteString =/ "#[" *(ws / base64char) ws "]"
base64char = %x41-5A / %x61-7A / %x30-39 / "+" / "/" / "-" / "_" / "="
A Symbol
may be written in a “bare” form6 so long as
it conforms to certain restrictions on the characters appearing in the
symbol. Alternatively, it may be written in a quoted form. The quoted
form is much the same as the syntax for String
s, including embedded
escape syntax, except using a bar or pipe character (|
) instead of a
double quote mark.
Symbol = symstart *symcont / "|" *symchar "|"
symstart = ALPHA / sympunct / symustart
symcont = ALPHA / sympunct / symustart / symucont / DIGIT / "-"
sympunct = "~" / "!" / "$" / "%" / "^" / "&" / "*" /
"?" / "_" / "=" / "+" / "/" / "."
symchar = unescaped / %x22 / escape (escaped / %x7C / %s"u" 4HEXDIG)
symustart = <any code point greater than 127 whose Unicode
category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me,
Pc, Po, Sc, Sm, Sk, So, or Co>
symucont = <any code point greater than 127 whose Unicode
category is Nd, Nl, No, or Pd>
An Embedded
is written as a Value
chosen to represent the denoted
object, prefixed with #!
.
Embedded = "#!" Value
Finally, any Value
may be represented by escaping from the textual
syntax to the machine-oriented binary syntax
by prefixing a ByteString
containing the binary representation of the
Value
with #=
.7
8 9
Machine = "#=" ws ByteString
Annotations
When written down, a Value
may have an associated sequence of
annotations carrying “out-of-band” contextual metadata about the
value. Each annotation is, in turn, a Value
, and may itself have
annotations. The ordering of annotations attached to a Value
is
significant.
Value =/ ws "@" Value Value
Each annotation is preceded by @
; the underlying annotated value
follows its annotations. Here we extend only the syntactic nonterminal
named “Value
” without altering the semantic class of Value
s.
Comments. Strings annotating a Value
are conventionally
interpreted as comments associated with that value. Comments are
sufficiently common that special syntax exists for them.
Value =/ ws
";" *(%x00-09 / %x0B-0C / %x0E-10FFFF) newline
Value
When written this way, everything between the ;
and the newline is
included in the string annotating the Value
.
Equivalence. Annotations appear within syntax denoting a Value
;
however, the annotations are not part of the denoted value. They are
only part of the syntax. Annotations do not play a part in
equivalences and orderings of Value
s.
Reflective tools such as debuggers, user interfaces, and message
routers and relays---tools which process Value
s generically---may
use annotated inputs to tailor their operation, or may insert
annotations in their outputs. By contrast, in ordinary programs, as a
rule of thumb, the presence, absence or content of an annotation
should not change the control flow or output of the program.
Annotations are data describing Value
s, and are not in the domain
of any specific application of Value
s. That is, an annotation will
almost never cause a non-reflective program to do anything observably
different.
Security Considerations
Whitespace. The textual format allows arbitrary whitespace in many positions. Consider optional restrictions on the amount of consecutive whitespace that may appear.
Annotations. Similarly, in modes where a Value
is being read
while annotations are skipped, an endless sequence of annotations may
give an illusion of progress.
Acknowledgements
The treatment of commas as whitespace in the text syntax is inspired by the same feature of EDN.
The text syntax for Boolean
s, Symbol
s, and ByteString
s is
directly inspired by Racket's lexical
syntax.
Notes
-
Implementation note. When implementing printing of
Value
s using the textual syntax, consider supporting (a) optional pretty-printing with indentation, (b) optional JSON-compatible print mode for that subset ofValue
that is compatible with JSON, and (c) optional submodes for no commas, commas separating, and commas terminating elements or key/value pairs within a collection. ↩︎ -
Implementation note. Your language's standard library likely has a good routine for converting between decimal notation and IEEE 754 floating-point. However, if not, or if you are interested in the challenges of accurately reading and writing floating point numbers, see the excellent matched pair of 1990 papers by Clinger and Steele & White, and a recent follow-up by Jaffer:
Clinger, William D. ‘How to Read Floating Point Numbers Accurately’. In Proc. PLDI. White Plains, New York, 1990. https://doi.org/10.1145/93542.93557.
Steele, Guy L., Jr., and Jon L. White. ‘How to Print Floating-Point Numbers Accurately’. In Proc. PLDI. White Plains, New York, 1990. https://doi.org/10.1145/93542.93559.
Jaffer, Aubrey. ‘Easy Accurate Reading and Writing of Floating-Point Numbers’. ArXiv:1310.8121 [Cs], 27 October 2013. http://arxiv.org/abs/1310.8121. ↩︎
-
Implementation note. Be aware when implementing reading and writing of
SignedInteger
s that the data model requires arbitrary-precision integers. Your implementation may (but, ideally, should not) truncate precision when reading or writing aSignedInteger
; however, if it does so, it should (a) signal its client that truncation has occurred, and (b) make it clear to the client that comparing such truncated values for equality or ordering will not yield results that match the expected semantics of the data model. ↩︎ -
The grammar for
String
has the same effect as the JSON grammar forstring
. Some auxiliary definitions (e.g.escaped
) are lifted largely unmodified from the text of RFC 8259. ↩︎ -
In particular, note JSON's rules around the use of surrogate pairs for code points not in the Basic Multilingual Plane. We encourage implementations to avoid using
\u
escapes when producing output, and instead to rely on the UTF-8 encoding of the entire document to handle non-ASCII codepoints correctly. ↩︎ -
Compare with the SPKI S-expression definition of “token representation”, and with the R6RS definition of identifiers. ↩︎
-
Rationale. The textual syntax cannot express every
Value
: specifically, it cannot express the several million floating-point NaNs, or the two floating-point Infinities. Since the machine-oriented binary format forValue
s expresses eachValue
with precision, embedding binaryValue
s solves the problem. ↩︎ -
Every text is ultimately physically stored as bytes; therefore, it might seem possible to escape to the raw form of binary encoding from within a piece of textual syntax. However, while bytes must be involved in any representation of text, the text itself is logically a sequence of code points and is not intrinsically a binary structure at all. It would be incoherent to expect to be able to access the representation of the text from within the text itself. ↩︎
-
Any text-syntax annotations preceding the
#
are prepended to any binary-syntax annotations yielded by decoding theByteString
. ↩︎