JSON is awesome, but it's troublesome that it can't represent binary data (without a separate encoding, which requires metadata, and more code on both sides of the wire).
I "discovered" a format that easily solves this, which I call QSN (quoted string notation):
It's just Rust string literals with single quotes instead of double quotes. Rust string literals have \xHH escapes (e.g. \x01) for arbitrary bytes, like C or Python string literals, but without the legacy of, say, \v and octal, and with just \u{} rather than \u and \U.
I use it in Oil to unambiguously parse and print filenames, display argv arrays, etc. All of these are arbitrary binary data.
GNU tools have about 10 different ways to do this, but QSN makes it consistent and precise.
I'm expanding it to QTSV, a variant of TSV, but if you like you could also make some JSON variant. Technically using single quotes rather than double would make it backward compatible with JSON.
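Roughly, the encoder just walks the bytes and escapes anything that isn't printable ASCII. This is a simplified sketch in Python, not the full spec -- real QSN can also emit \u{} escapes when the bytes happen to be valid UTF-8:

def qsn_encode(data: bytes) -> str:
    """Simplified QSN-style encoding: printable ASCII passes through,
    everything else becomes an escape sequence."""
    out = ["'"]
    for b in data:
        if b == 0x27:                 # single quote
            out.append("\\'")
        elif b == 0x5C:               # backslash
            out.append("\\\\")
        elif b == 0x0A:               # newline
            out.append("\\n")
        elif 0x20 <= b <= 0x7E:       # printable ASCII
            out.append(chr(b))
        else:                         # arbitrary byte
            out.append("\\x%02x" % b)
    out.append("'")
    return "".join(out)

print(qsn_encode(b"my file\nname\x01.txt"))   # 'my file\nname\x01.txt'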
Interesting solution! I usually use base64 encoding unless I'm pushing lots of data; then unfortunately it's easier to make some kind of "file upload" separate from the JSON if you're going through HTTP.
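Roughly like this, with made-up field names, if you haven't seen the pattern:

import base64, json

payload = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF])   # some arbitrary binary
wire = json.dumps({"name": "logo.png",
                   "data": base64.b64encode(payload).decode("ascii")})
raw = base64.b64decode(json.loads(wire)["data"])          # other side of the wire
assert raw == payload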
My demand for that too has slowed in the past couple years, mainly because it's getting easier to do more on the client. e.g. I don't need to upload an entire .docx file, if all my backend needs is ~50kB of values queried out of one of its contained .xml files. Not saying this is a _solution_ to any of the encoding questions, only that it's reduced my immediate need.
Hm it looks like it would satisfy a lot of binary data use cases, but so would MessagePack, CBOR, and probably a few dozen other formats.
A design goal of QSN is to be human readable like JSON. You can view it at the terminal, in an editor, in an e-mail, or in the browser. IMO that fits more naturally with Unix and the web.
So I guess I should make clear that it's not only about "binary data", but it's that + a few other constraints.
I looked at it more. It has both a string type (which has \xHH escapes oddly specified as code points), and a binary type (which is specified as base64 in text format -- this is bad because base64 isn't human-readable despite being ASCII).
The 2 different types make it unsuitable for the use cases I'm targeting QSN at, which deal only with byte strings, which may be encoded in UTF-8 (like kernel APIs).
I don't see what Ion adds over MessagePack, CBOR, or several other such formats... but that's a different discussion :)
JSON is great and all, but we really, really need to agree on a binary format that can support a few more data types, including binary and dates.
I get that JSON being text is easy to debug, but text is just a binary format (UTF-8) that has viewers built into everything. If we agreed on a more structured, hierarchical binary format, there would be a viewer for it everywhere.
Taking it further, a text file should just be one of these hypothetical hierarchical binary format files with a single string node for the text, maybe with some agreed-upon metadata nodes as well; the same goes for most other file formats.
My current hypothesis is that it really depends on the viewer and other parts of its toolset. A format like this is easily defined (and there are already dozens that would qualify), but right now, text is in many cases a lot more convenient to use.
So to convince more people to use such a format, we'd need viewers and editors that allow you to exploit the structure of the format, that are easy, intuitive, and convenient to use, and that are readily available on whatever system you're working on.
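To be fair, the format side really is the easy part -- CBOR, for instance, already handles raw binary and tagged dates natively. A quick sketch with Python's cbor2 package (assuming it's installed), just to show both surviving a round trip:

from datetime import datetime, timezone
import cbor2   # assuming the cbor2 package; CBOR itself has native byte strings and tagged dates

record = {
    "blob": bytes([0xDE, 0xAD, 0xBE, 0xEF]),            # raw binary, no base64 detour
    "when": datetime(2021, 1, 1, tzinfo=timezone.utc),  # round-trips as a tagged date
}
wire = cbor2.dumps(record)       # compact binary on the wire
print(cbor2.loads(wire))         # bytes and an aware datetime come back out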
Text is also more robust than binary formats: Even if your json/yaml/xml/whatever file got corrupted, you can still open it in a text editor, make sense of it and fix it manually if necessary. An equivalent binary format would need to have the same property.
I agree with all you have said, but I stick to my point: text is just a binary format we have agreed upon for decoding individual characters, and secondary decoders are built on top of that. Since we have agreed on this binary format for characters, there is a viewer for this format built into almost everything, hence its convenience.
If we agreed upon a binary format that encodes hierarchical nodes/keys/values with various data types (raw binary, strings, numbers, dates), then hopefully there would be a similar viewer everywhere and no need for most secondary decoders (JSON on top of text).
It should also be faster and more compact when sending binary, numbers, or dates.
Properly designed, it should still be able to handle corruption, but that has a lot to do with the decoder and how it handles the corruption.
Proper NaN and Inf support is by far the biggest trouble in my popular Perl library for JSON.
But it's easily done, just as an extension to the standard. The standard lacks many things and has a lot of holes. Newer iterations of the standard have also made things worse, not better. Bad, but still the best, easiest, and most secure, if you ignore the latest nonsense.
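For comparison, the same hole is visible from Python's stdlib (not my Perl library, just to show it's the standard that's lacking):

import json, math

doc = {"x": float("nan"), "y": math.inf}
print(json.dumps(doc))                # {"x": NaN, "y": Infinity}  -- not actually valid JSON
try:
    json.dumps(doc, allow_nan=False)  # what a strictly conforming encoder has to do
except ValueError as err:
    print(err)                        # Out of range float values are not JSON compliant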
I'm always impressed by this website's design and clear information. Obvious flow charts, the basic info, and all on one simple page. I know json is very simple, but I would love it if other technologies followed this design.
I'm thinking that the book I learned Pascal from back in 1981 used a similar notation (if not identical) for showing syntax. Of course with nearly 30 years intervening, I can't come up with the name of the book, let alone the book itself nor can I be 100% certain that I remember what the syntax diagrams looked like.
It's quite amazing that all these years on, SQL is still king. There are basically no rules for the syntax, how come nothing cleaner replaced it in RDBMSes?
Two things, I think. First, it ultimately doesn't matter, because the key differentiator in RDBMS selection is the management system, not the query system.
And second, RDBMSes are historically closed-source enterprise tooling -- the vendors themselves have little interest in rocking the boat, and there's not much freedom for the community to inject a new language into the system (except as ad-hoc, wonky transpilers, or framework wrappers like ORMs).
EDN[0] competes in the space, for some value of 'competes'. I think it's better, strictly speaking, but lacking JSON's ubiquity, and relentless simplicity, I expect it will always lack the network effects which are most of what make JSON so powerful.
But with that positive comes all the negatives of YAML, especially around arrays of structs vs regular arrays or arrays of structs which contain one key-value pair each, and especially the struggles around multi-line strings or other kinds of entries in YAML.
I don’t think it’s worth the cost and would rather pass some extra keys in JSON that my parser ignores (since additive changes should never cause bugs in data contracts), or a regular key ‘comment/description’.
> But with that positive comes all the negatives of YAML, especially around arrays of structs vs regular arrays or arrays of structs which contain one key-value pair each
Are you able to elaborate on this problem? I'm not going to defend the complexity of YAML but I've never run into any issues with storing complex structures within it.
> and especially the struggles around multi-line strings or other kinds of entries in YAML.
YAML actually has pretty sophisticated handling of white space within strings. The problem isn't that whatever edge case you run into can't be done, the problem is that YAML covers so many edge cases with different parsing operators that it becomes a bit of a cryptic mess trying to remember which operator is needed when. Though in fairness, JSON was never intended to be human readable (it was meant to be machine generated and machine read) so it's not any better in the readable whitespace department.
> I don’t think it’s worth the cost and would rather pass some extra keys in JSON that my parser ignores (since additive changes should never cause bugs in data contracts), or a regular key ‘comment/description’.
A third option would be to use hash-prefixed comments (like in Bash) then run that JSON through a YAML parser since technically YAML is a superset of JSON (literally, valid JSON is also valid YAML). Though I do accept that would be an unattractive option to some because you end up with less strict format checking of your source JSON (less strict in the JSON sense).
> the problem is that YAML covers so many edge cases with different parsing operators that it becomes a bit of a cryptic mess trying to remember which operator is needed when.
This was exactly my point around multi-line strings. You look at a mess of >'s and |'s and it's absolutely not intuitive which one you should use when a tool you're required to work with happens to configure itself in YAML. In JSON, there's virtually no ambiguity. Everything's either a string, a number, a bool, an array, or a struct, and they all have exactly the same shape with ... really no options to make things "easier"
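For example, just the two most common block operators (a sketch using PyYAML; there are several more chomping/indentation variants on top of these), and I can never remember which is which without running it:

import yaml   # PyYAML, assuming it's available

doc = """
literal: |
  line one
  line two
folded: >
  line one
  line two
"""
print(yaml.safe_load(doc))
# {'literal': 'line one\nline two\n', 'folded': 'line one line two\n'}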
As for structs and arrays, YAML doesn't really make them clear, in my opinion, due to its lack of opening and closing delimiters.
So, if you're new to k8s and need to make a configuration change to something because the darn thing doesn't work, you're forced to learn yet another markup language when it could just be a very familiar and comparatively intuitive json blob.
For example,
options:
  - key: value
    foo: bar
    thing: thing2
    smell: apple
"Oh, so to fix this, I just need to add another entry to turn on the debug flag? And it's 'debug: true'? Oh, okay, so that's ...
options:
  - key: value
    foo: bar
    thing: thing2
    smell: apple
    debug: true
right?
Oh, no? It's not... well what is it?
And then a long conversation with a coworker later, they explain, "Oh! No no no, it's this:"
item:
  - key: value
    foo: bar
    thing: thing2
    smell: apple
  - key: value2
    debug: true
Turns out debug was another option you needed to add.
Or, in other places in some syntax, you see a bunch of:
items:
  - entry
  - entry2
  - entry3
or
item: 1
item2: 2
item3: 3
I've familiarized myself more with YAML over time; but, its learning curve is substantially more difficult than:
{
  "everything is inside curly brackets": "keys and values can be strings",
  "there's a comma after everything except the last entry": [
    "arrays exist",
    "they're also comma delimited",
    1, "types don't matter"
  ]
}
To be honest I much much prefer YAML for manually handling multiline strings. And frankly JSON's strictness on commas after all except the final entry catches me out so many times (it doesn't help that most parsers aren't great at pointing out where the missing comma is).
Being pragmatic, I'd say neither serialisation format is better than the other. JSON does some things better (it's easier to grok nested structures and simpler to reason about the specification) but YAML does other things better (easier to embed multiline blocks of text, handles streaming better, supports comments).
Let's not also forget that most of the stuff that people dismiss in JSON is only solved by unofficial hacks (eg jsonlines) that might be widely supported but you cannot rely upon universally. So then you have two problems: a standard that doesn't support x and multiple different implementations that don't strictly support the standard. YAML is a hell of a lot better when it comes to removing undefined behaviour in parsers -- even if that does come at a cost to the complexity of the specification.
> especially around arrays of structs vs regular arrays or arrays of structs which contain one key-value pair each, and especially the struggles around multi-line strings or other kinds of entries in YAML.
Can you give examples? I don't see how they pose a problem.
So, besides not playing nice with other parsers, I wonder if there's much harm in adding `//` and `/*` style comments. For example, if you use a specific structure of json for a config file in a project you make, you could add comment support to your parser.
I also wonder how much of a pain it would be to add comments to json, like a json v2
Comments in JSON are easy. You just strip them out before sending to a strict parser. Some parsers even support them out of the box.
In fact they're so easy they were already in JSON, and then later removed.
> I removed comments from JSON because I saw people were using them to hold parsing directives, a practice which would have destroyed interoperability. ~Douglas Crockford
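Stripping them yourself is maybe 25 lines if you track whether you're inside a string. A rough sketch, not production code -- a regex-only version would mangle URLs like "http://..." inside values:

import json

def strip_json_comments(text: str) -> str:
    """Remove // line comments and /* block comments outside of strings."""
    out = []
    i, n = 0, len(text)
    in_string = False
    while i < n:
        c = text[i]
        if in_string:
            out.append(c)
            if c == '\\' and i + 1 < n:       # keep escaped char, including \"
                out.append(text[i + 1])
                i += 2
                continue
            if c == '"':
                in_string = False
            i += 1
        elif c == '"':
            in_string = True
            out.append(c)
            i += 1
        elif text.startswith('//', i):
            while i < n and text[i] != '\n':  # drop to end of line
                i += 1
        elif text.startswith('/*', i):
            end = text.find('*/', i + 2)
            i = n if end == -1 else end + 2   # drop the whole block
        else:
            out.append(c)
            i += 1
    return ''.join(out)

doc = '''
{
    // enable verbose logging
    "debug": true,
    "url": "http://example.com"  /* the // inside the string above is kept */
}
'''
print(json.loads(strip_json_comments(doc)))   # {'debug': True, 'url': 'http://example.com'}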
However I think it goes too far. E.g. why support single quoted strings? That just makes parsing harder.
I prefer the format Microsoft uses in VSCode and Typescript - JSONC. It's just JSON but with trailing commas and comments. The downside is it isn't obvious when something is JSON and when it is JSONC because they use the same extension.
While there are many things that make any change to JSON challenging, single vs. double quotes really aren't a problem. The JSC JSON parser supports them at an implementation level, since it's also used for JSONP insanity; it's just that in pure JSON mode (e.g. JSON.parse) it doesn't allow single quotes or a few other similar "real" JS language features it supports for the sake of JSONP.
> why support single quoted strings? That just makes parsing harder.
And something that a lot of language designs ignore: it makes writing harder and unnecessarily contentious. People will use them inconsistently, which has the usual effects any inconsistency has on cognitive load: causes others to question why/how/where.
I like JSON a lot, it's my go to for ad-hoc data storage. I've found it extremely convenient because nearly all modern APIs can return json. Further, jq makes working with json extremely ergonomic.
For certain tasks (personal use) I hand roll my own json document storage database. I make a bunch of rest API calls to collect data and store it in flat files. While doing this, I use JQ to store subsets of that information (the stuff I really care about) in smaller json blobs. I then write a script to aggregate all of the smaller blobs into a larger json array.
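The aggregation step is basically one jq invocation (a sketch; the paths and the .processed field are just whatever your layout and schema happen to be):

jq -s '.' blobs/*.json > db.json                       # slurp every small blob into one array
jq '.[] | select(.processed == false)' db.json        # then filter however you like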
That's almost no work and you already get a nice schema-less database. Write some commands (stored-procedures) to do any kind of filtering/modifications and you can immediately get whatever view you want of your data with one command. Write a wrapper script to identify documents by a field (primary key) and you can make data modifications in an ergonomic way. I run "modify-document <primary-key>" and it runs a tiny readline bash script prompting me for info which immediately modifies the corresponding database row.
A work flow could be...
1. Make API requests (with gnu parallel) storing the response json and a smaller json blob.
2. Aggregate the small blobs into an array.
3. Filter blobs for any that are missing information.
4. Manually update blobs missing information with readline script.
5. Filter for blobs that need processing.
6. Process blobs, use readline script to mark them as processed.
7. Continue until all blobs are processed.
This is the kind of thing I would use excel to do in the past. Hopefully I never make that mistake again.
Are you me? Bash and jq (and sendemail) go so far in reporting and analyzing data, it’s like magic. Haven’t figured out a great way to present the data so folks can draw their own conclusions, though.
Protocol buffers solve most of the issues mentioned by people here, on top of being typesafe, space efficient (no need to encode key names because everything is an ordered struct), and having great tooling.
I "discovered" a format that easily solves this, which I call QSN (quoted string notation):
http://www.oilshell.org/release/latest/doc/qsn.html
It's just Rust string literals with single quotes instead of double quotes. Rust string literals have \x01 for arbitrary bytes like C or Python string literals, but without the legacy of say \v and octal, and with just \u{} rather than \u and \U.
I use in Oil to unambiguously parse and print filenames, display argv arrays, etc. All of these are arbitrary binary data.
GNU tools have about 10 different ways to do this, but QSN makes it consistent and precise.
I'm expanding it to QTSV, a variant of TSV, but if you like you could also make some JSON variant. Technically using single quotes rather than double would make it backward compatible with JSON.