diff --git a/cheatsheet.md b/cheatsheet.md
index 99efff7..2df7324 100644
--- a/cheatsheet.md
+++ b/cheatsheet.md
@@ -12,12 +12,12 @@ For a value `v`, we write `«v»` for the binary encoding of `v`.
 
                «#f» = [0xA0]
                «#t» = [0xA1]
-                «F» = [0xA2] ++ binary32(F)   if F ∈ Float
-                «D» = [0xA2] ++ binary64(D)   if D ∈ Double
-                «x» = [0xA3] ++ intbytes(x)   if x ∈ SignedInteger
-                «S» = [0xA4] ++ utf8(S)       if S ∈ String
-                      [0xA5] ++ S             if S ∈ ByteString
-                      [0xA6] ++ utf8(S)       if S ∈ Symbol
+                «F» = [0xA2] ++ binary32(F)     if F ∈ Float
+                «D» = [0xA2] ++ binary64(D)     if D ∈ Double
+                «x» = [0xA3] ++ intbytes(x)     if x ∈ SignedInteger
+                «S» = [0xA4] ++ utf8(S) ++ [0]  if S ∈ String
+                      [0xA5] ++ S               if S ∈ ByteString
+                      [0xA6] ++ utf8(S)         if S ∈ Symbol
     «<L F_1...F_m>» = [0xA7] ++ seq(«L», «F_1», ..., «F_m»)
       «[X_1...X_m]» = [0xA8] ++ seq(«X_1», ..., «X_m»)
 
diff --git a/preserves-binary.md b/preserves-binary.md
index 7717beb..cc94369 100644
--- a/preserves-binary.md
+++ b/preserves-binary.md
@@ -153,14 +153,22 @@ For example,
 
 ### Strings, ByteStrings and Symbols.
 
-Syntax for these three types varies only in the tag used. For `String`
-and `Symbol`, the data following the tag is a UTF-8 encoding of the
-`Value`'s code points, while for `ByteString` it is the raw data
-contained within the `Value` unmodified.
+    «S» = [0xA4] ++ utf8(S) ++ [0]  if S ∈ String
+          [0xA5] ++ S               if S ∈ ByteString
+          [0xA6] ++ utf8(S)         if S ∈ Symbol
 
-    «S» = [0xA4] ++ utf8(S)   if S ∈ String
-          [0xA5] ++ S         if S ∈ ByteString
-          [0xA6] ++ utf8(S)   if S ∈ Symbol
+For `String` and `Symbol`, the data following the tag is a UTF-8
+encoding of the `Value`'s code points, while for `ByteString` it is the
+raw data contained within the `Value` unmodified.
+
+Each `String` has a trailing zero byte appended. This extra byte *MUST
+NOT* be treated as part of the `Value`: it exists to permit zero-copy C
+interoperability.[^zero-copy-c-string-interop]
+
+  [^zero-copy-c-string-interop]: Some care must still be taken when
+    passing `String` `Repr`s directly to a C-style ABI, since `String`s
+    may contain the zero Unicode code point, which C library routines
+    will usually misinterpret as an end-of-string marker.
 
 ### Booleans.
 
@@ -221,7 +229,8 @@ the same `Value` to yield different binary `Repr`s.
 ## Acknowledgements
 
 The exclusion of lengths from `Repr`s, placing lengths instead ahead of
-contained values in sequences, is inspired by [argdata][].
+contained values in sequences, is inspired by [argdata][], as is the
+inclusion of a `NUL` byte in `String` `Repr`s for C interoperability.
 
 ## Appendix. Autodetection of textual or binary syntax
 
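
Review note: the `String`, `ByteString` and `Symbol` rules in this patch can be illustrated with a minimal Python sketch. It is not taken from any Preserves implementation; the `Symbol` class and the `encode`/`decode` helpers are hypothetical names introduced only for illustration, and the sketch assumes the caller already knows where each `Repr` ends, since lengths are carried by the enclosing sequence rather than by the `Repr` itself.

```python
# Tags as given in the patch.
TAG_STRING, TAG_BYTESTRING, TAG_SYMBOL = 0xA4, 0xA5, 0xA6


class Symbol(str):
    """Hypothetical marker type so a Symbol can be told apart from a String."""


def encode(value):
    """Encode a String, ByteString or Symbol into its binary Repr."""
    if isinstance(value, Symbol):  # check before str: Symbol subclasses str
        return bytes([TAG_SYMBOL]) + value.encode("utf-8")
    if isinstance(value, str):
        # Only String gets the trailing zero byte.
        return bytes([TAG_STRING]) + value.encode("utf-8") + b"\x00"
    if isinstance(value, (bytes, bytearray)):
        return bytes([TAG_BYTESTRING]) + bytes(value)
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def decode(repr_bytes):
    """Decode one complete Repr whose extent is already known."""
    if not repr_bytes:
        raise ValueError("empty Repr")
    tag, body = repr_bytes[0], repr_bytes[1:]
    if tag == TAG_STRING:
        if not body.endswith(b"\x00"):
            raise ValueError("String Repr lacks its trailing zero byte")
        # The trailing zero is not part of the Value, so drop exactly one byte.
        return body[:-1].decode("utf-8")
    if tag == TAG_BYTESTRING:
        return bytes(body)
    if tag == TAG_SYMBOL:
        return Symbol(body.decode("utf-8"))
    raise ValueError(f"unexpected tag: {tag:#x}")
```

The decoder strips exactly one byte from the end instead of scanning for the first zero, so a `String` that contains the zero code point survives a round trip. That is the same hazard the footnote describes for C routines that stop at the first `NUL`.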