JSON/YAML/TOML/etc. parsing performance

JSON/YAML/TOML/etc. parsing performance

Ted Zlatanov
I wanted to ask if there's any chance of improving the parsing
performance of JSON, YAML, TOML, and similar data formats. It's pretty
poor today.

That could be done in the core with C code, improved Lisp code,
integration with an external library, or a mix of those.

Ted



Re: JSON/YAML/TOML/etc. parsing performance

Mark Oteiza

Ted Zlatanov <[hidden email]> writes:

> I wanted to ask if there's any chance of improving the parsing
> performance of JSON, YAML, TOML, and similar data formats. It's pretty
> poor today.
>
> That could be done in the core with C code, improved Lisp code,
> integration with an external library, or a mix of those.

There are a ton of external libraries for parsing JSON, many of which
have lots of high level functions for dealing with it--at the cost of
being forced to use their object system.  I had a go at integrating
jansson but had issues.

I'm fond of this JSON tokenizer, alas AIUI we cannot use it without
copyright assignment:
https://github.com/zserge/jsmn

I'm interested in JSON parsing in core, and to that end I'm learning
Ragel to generate a parser.  This could take a while, though :)


Re: JSON/YAML/TOML/etc. parsing performance

Richard Stallman
In reply to this post by Ted Zlatanov
[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > I wanted to ask if there's any chance of improving the parsing
  > performance of JSON, YAML, TOML, and similar data formats. It's pretty
  > poor today.

Can you design a primitive general enough to speed up parsing of all those
formats?  That would make it feasible to handle them all without
too much work.


--
Dr Richard Stallman
President, Free Software Foundation (gnu.org, fsf.org)
Internet Hall-of-Famer (internethalloffame.org)
Skype: No way! See stallman.org/skype.html.



Re: JSON/YAML/TOML/etc. parsing performance

Richard Stallman
In reply to this post by Mark Oteiza
[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > I'm fond of this JSON tokenizer, alas AIUI we cannot use it without
  > copyright assignment:
  > https://github.com/zserge/jsmn

If it is an external package, not specifically for Emacs,
we can include it along with Emacs without copyright papers
if its license is compatible with GPLv3+.
--
Dr Richard Stallman
President, Free Software Foundation (gnu.org, fsf.org)
Internet Hall-of-Famer (internethalloffame.org)
Skype: No way! See stallman.org/skype.html.



Re: JSON/YAML/TOML/etc. parsing performance

Mark Oteiza

Richard Stallman <[hidden email]> writes:

>   > I'm fond of this JSON tokenizer, alas AIUI we cannot use it without
>   > copyright assignment:
>   > https://github.com/zserge/jsmn
>
> If it is an external package, not specifically for Emacs,
> we can include it along with Emacs without copyright papers
> if its license is compatible with GPLv3+.

I see. It's MIT licensed, so I guess it can be integrated.  Thanks for
the correction.


Re: JSON/YAML/TOML/etc. parsing performance

Philipp Stephani
In reply to this post by Ted Zlatanov


Ted Zlatanov <[hidden email]> wrote on Sat, 16 Sep 2017 at 17:55:
> I wanted to ask if there's any chance of improving the parsing
> performance of JSON, YAML, TOML, and similar data formats. It's pretty
> poor today.
>
> That could be done in the core with C code, improved Lisp code,
> integration with an external library, or a mix of those.

I don't know much about the others, but given the importance of JSON as data exchange and serialization format, I think it's worthwhile to invest some time here. I've implemented a wrapper around the json-c library (license: Expat/X11/MIT), resulting in significant speedups using the test data from https://github.com/miloyip/nativejson-benchmark: a factor of 3.9 to 6.4 for parsing, and a factor of 27 to 67 for serializing. If people agree that this is useful I can send a patch.

Re: JSON/YAML/TOML/etc. parsing performance

Eli Zaretskii
> From: Philipp Stephani <[hidden email]>
> Date: Sun, 17 Sep 2017 18:46:45 +0000
>
> I don't know much about the others, but given the importance of JSON as data exchange and serialization
> format, I think it's worthwhile to invest some time here. I've implemented a wrapper around the json-c library
> (license: Expat/X11/MIT), resulting in significant speedups using the test data from
> https://github.com/miloyip/nativejson-benchmark: a factor of 3.9 to 6.4 for parsing, and a factor of 27 to 67 for
> serializing. If people agree that this is useful I can send a patch.

Before we make a decision on which library to use, I'd prefer some
kind of survey of available free software libraries, including their
popularity and development activity.  The survey doesn't have to be
exhaustive, but I think we should compare at least a few candidates.
We already have 2, so maybe we should start by comparing them.

Thanks.


Re: JSON/YAML/TOML/etc. parsing performance

Philipp Stephani


Eli Zaretskii <[hidden email]> wrote on Sun, 17 Sep 2017 at 21:05:
> > From: Philipp Stephani <[hidden email]>
> > Date: Sun, 17 Sep 2017 18:46:45 +0000
> >
> > I don't know much about the others, but given the importance of JSON as data exchange and serialization
> > format, I think it's worthwhile to invest some time here. I've implemented a wrapper around the json-c library
> > (license: Expat/X11/MIT), resulting in significant speedups using the test data from
> > https://github.com/miloyip/nativejson-benchmark: a factor of 3.9 to 6.4 for parsing, and a factor of 27 to 67 for
> > serializing. If people agree that this is useful I can send a patch.
>
> Before we make a decision on which library to use, I'd prefer some
> kind of survey of available free software libraries, including their
> popularity and development activity.  The survey doesn't have to be
> exhaustive, but I think we should compare at least a few candidates.
> We already have 2, so maybe we should start by comparing them.


Sure, I've made a quick overview based on https://github.com/miloyip/nativejson-benchmark. I've only used the libraries that are written in C and have been tested in that benchmark; it's still quite a few. I've checked the conformance and speed metrics from the benchmark as well as number of GitHub stars (as proxy for popularity) and number of commits in the last month (as proxy for development activity). Here are the results: https://docs.google.com/spreadsheets/d/e/2PACX-1vTqKxqo47s67L3EJ9AWvZclNuT2xbd9rgoRuJ_UYbXgnV171owr8h2mksHjrjNGADDR3DVTWQvUMBpe/pubhtml?gid=0&single=true
Note that some of the libraries (jsmn, ujson4c) don't appear to support serialization at all; I'd suggest avoiding them, because we'd then need to wrap another library for serialization. Also, even though JSMN advertises itself as "world's fastest JSON parser", it's actually the slowest of the libraries in the survey. json-c appears to be reasonably conformant and fast for both parsing and serialization, and has by far the largest development activity.

Speed of Elisp (was: JSON/YAML/TOML/etc. parsing performance)

Stefan Monnier
In reply to this post by Philipp Stephani
[ To clarify up front: I'm in favor of using those libraries.
  The questions below don't mean that I think it's better to speed up
  Elisp than to use a C implementation of those json primitives: In any
  case, it makes sense to use existing C libraries for that, both for
  speed reasons and for maintenance reasons; like we do for XML.
  The choice between C and Elisp would only make sense if we had to
  write&maintain the C code.  ]

> (license: Expat/X11/MIT), resulting in significant speedups using the test
> data from https://github.com/miloyip/nativejson-benchmark: a factor of 3.9
> to 6.4 for parsing,

Very interesting.  The way I read it, it means either that Elisp is not
nearly as slow as we tend to assume, or that the overhead introduced
when turning json-c's output into an Elisp-usable form dwarfs the json-c
parsing itself.

> and a factor of 27 to 67 for serializing.

I'm curious why there is such a wide discrepancy between the speedup for
parsing and that for serializing (sounds like a factor 10 difference).

Is it because parsing with json-c is slowed down by the conversion to
(especially allocation of) Elisp data structures, or is it because the
Elisp implementation of json serialization suffers more from poor
performance (in which case, maybe it could point to a performance issue
in Elisp which we could try to tackle)?
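
One way to probe this question is to time json.el's parser and serializer separately on the same data. The following is only a minimal sketch, not a measurement from this thread: `my-json-el-parse-vs-encode` is a hypothetical helper, and it exercises just the pure-Elisp `json-read-from-string` and `json-encode` from json.el, so it says nothing about the json-c wrapper itself.

```elisp
(require 'json)
(require 'benchmark)

;; Hypothetical helper, named here for illustration only.
(defun my-json-el-parse-vs-encode (json-text)
  "Return (PARSE-SECONDS ENCODE-SECONDS) for json.el on JSON-TEXT.
Each figure is the elapsed time of 100 repetitions."
  (let* ((value (json-read-from-string json-text))
         ;; `benchmark-run' returns (ELAPSED GC-COUNT GC-ELAPSED).
         (parse (benchmark-run 100 (json-read-from-string json-text)))
         (encode (benchmark-run 100 (json-encode value))))
    (list (car parse) (car encode))))
```

If the encode figure dominated on realistic inputs, that would hint at the json.el serializer rather than at conversion overhead on the C side, though a toy run like this cannot settle the question.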


        Stefan



Re: JSON/YAML/TOML/etc. parsing performance

Mark Oteiza
In reply to this post by Philipp Stephani
Philipp Stephani <[hidden email]> writes:

> Eli Zaretskii <[hidden email]> wrote on Sun, 17 Sep 2017 at 21:05:
>
>  > From: Philipp Stephani <[hidden email]>
>  > Date: Sun, 17 Sep 2017 18:46:45 +0000
>  >
>  > I don't know much about the others, but given the importance of JSON as data exchange and serialization
>  > format, I think it's worthwhile to invest some time here. I've implemented a wrapper around the json-c library
>  > (license: Expat/X11/MIT), resulting in significant speedups using the test data from
>  > https://github.com/miloyip/nativejson-benchmark: a factor of 3.9 to 6.4 for parsing, and a factor of 27 to 67 for
>  > serializing. If people agree that this is useful I can send a patch.
>
>  Before we make a decision on which library to use, I'd prefer some
>  kind of survey of available free software libraries, including their
>  popularity and development activity.  The survey doesn't have to be
>  exhaustive, but I think we should compare at least a few candidates.
>  We already have 2, so maybe we should start by comparing them.
>
> Sure, I've made a quick overview based on
> https://github.com/miloyip/nativejson-benchmark. I've only used the
> libraries that are written in C and have been tested in that
> benchmark; it's still quite a few. I've checked the conformance and
> speed metrics from the benchmark as well as number of GitHub stars (as
> proxy for popularity) and number of commits in the last month (as proxy
> for development activity). Here are the results:
> https://docs.google.com/spreadsheets/d/e/2PACX-1vTqKxqo47s67L3EJ9AWvZclNuT2xbd9rgoRuJ_UYbXgnV171owr8h2mksHjrjNGADDR3DVTWQvUMBpe/pubhtml?gid=0&single=true

Thanks for coming up with these comparisons.

> Note that some of the libraries (jsmn, ujson4c) don't appear to
> support serialization at all; I'd suggest avoiding them, because we'd
> then need to wrap another library for serialization. Also, even though
> JSMN advertises itself as "world's fastest JSON parser", it's actually
> the slowest of the libraries in the survey. json-c appears to be
> reasonably conformant and fast for both parsing and serialization, and
> has by far the largest development activity.

I was a little confused how they got a "parsing" benchmark out of jsmn,
since all it does is tokenize.  No matter, it makes sense to pursue
something that does both.


Re: JSON/YAML/TOML/etc. parsing performance

Richard Stallman
In reply to this post by Mark Oteiza
[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > I see. It's MIT licensed, so I guess it can be integrated.

Yes it can be, but please avoid the term "MIT license".  There are two
different licenses people sometimes call by that term, the X11 license
and the Expat license.  See https://gnu.org/licenses/license-list.html.

Both of them are weak licenses -- we also call them "pushover"
licenses -- since they permit inclusion in nonfree software.

Please don't associate them with the name of MIT, as the association
tends to promote them.

--
Dr Richard Stallman
President, Free Software Foundation (gnu.org, fsf.org)
Internet Hall-of-Famer (internethalloffame.org)
Skype: No way! See stallman.org/skype.html.



Re: JSON/YAML/TOML/etc. parsing performance

Philipp Stephani
In reply to this post by Philipp Stephani


Philipp Stephani <[hidden email]> wrote on Sun, 17 Sep 2017 at 20:46:
> Ted Zlatanov <[hidden email]> wrote on Sat, 16 Sep 2017 at 17:55:
> > I wanted to ask if there's any chance of improving the parsing
> > performance of JSON, YAML, TOML, and similar data formats. It's pretty
> > poor today.
> >
> > That could be done in the core with C code, improved Lisp code,
> > integration with an external library, or a mix of those.
>
> I don't know much about the others, but given the importance of JSON as data exchange and serialization format, I think it's worthwhile to invest some time here. I've implemented a wrapper around the json-c library (license: Expat/X11/MIT), resulting in significant speedups using the test data from https://github.com/miloyip/nativejson-benchmark: a factor of 3.9 to 6.4 for parsing, and a factor of 27 to 67 for serializing. If people agree that this is useful I can send a patch.

I've discovered that the interface and documentation of Jansson are much better than the ones of json-c, so I switched to Jansson. I've attached a patch.

Attachment: 0001-Implement-native-JSON-support-using-Jansson.txt (30K)

Re: JSON/YAML/TOML/etc. parsing performance

Ted Zlatanov
In reply to this post by Richard Stallman
On Sat, 16 Sep 2017 20:02:17 -0400 Richard Stallman <[hidden email]> wrote:

> Ted wrote:
>> I wanted to ask if there's any chance of improving the parsing
>> performance of JSON, YAML, TOML, and similar data formats. It's pretty
>> poor today.

RS> Can you design a primitive general enough to speed up parsing of all those
RS> formats?  That would make it feasible to handle them all without
RS> too much work.

I don't think that's easy. They are pretty different. Maybe TOML and INI
parsing can be unified, but that's about it.

I think the best approach is to choose one of the libraries Philipp
suggested but I don't have a favorite.

Thanks
Ted



Re: JSON/YAML/TOML/etc. parsing performance

Ted Zlatanov
In reply to this post by Philipp Stephani
On Sun, 17 Sep 2017 20:27:02 +0000 Philipp Stephani <[hidden email]> wrote:

PS> Sure, I've made a quick overview based on
PS> https://github.com/miloyip/nativejson-benchmark. I've only used the
PS> libraries that are written in C and have been tested in that benchmark;
PS> it's still quite a few. I've checked the conformance and speed metrics from
PS> the benchmark as well as number of GitHub stars (as proxy for popularity)
PS> and number of commits in the last month (as proxy for development activity):
PS> Here are the results:
PS> https://docs.google.com/spreadsheets/d/e/2PACX-1vTqKxqo47s67L3EJ9AWvZclNuT2xbd9rgoRuJ_UYbXgnV171owr8h2mksHjrjNGADDR3DVTWQvUMBpe/pubhtml?gid=0&single=true
PS> Note that some of the libraries (jsmn, ujson4c) don't appear to support
PS> serialization at all; I'd suggest avoiding them, because we'd then need to
PS> wrap another library for serialization. Also, even though JSMN advertises
PS> itself as "world's fastest JSON parser", it's actually the slowest of the
PS> libraries in the survey. json-c appears to be reasonably conformant and
PS> fast for both parsing and serialization, and has by far the largest
PS> development activity.

Hi Philipp,

thanks for doing all that work!

I'd suggest posting the survey results here directly.

Also maybe consider the jq built-in JSON parser, which could be a good
fit (it's usable as a library IIRC). The rest of the libraries look good
and I hope we make a choice soon. I don't have a favorite.

Thanks
Ted



Re: JSON/YAML/TOML/etc. parsing performance

Mark Oteiza
In reply to this post by Philipp Stephani

> Philipp Stephani <[hidden email]> wrote on Sun, 17 Sep 2017 at 20:46:
>
> I've discovered that the interface and documentation of Jansson are much better than
> the ones of json-c, so I switched to Jansson. I've attached a patch.

Doing the following on a 276K file:

(with-temp-buffer
  (insert-file-contents-literally "test.json")
  (benchmark-run 10
    (goto-char (point-min))
    (json-parse-buffer)))

These are my rough average benchmarks:

           Time    GCs   GC time
Jansson    0.33    10    0.15
json.el    1.21    38    0.48

Nice.  Was there a particular reason (aside from access time) you chose
hash tables instead of a sexp form?
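
For comparison, the json.el side of such a measurement can presumably be reproduced with the same harness by calling `json-read` in place of `json-parse-buffer`. This is a sketch under that assumption, not necessarily the exact code behind the table above; `my-json-el-benchmark` is a hypothetical name, and the literal 10 simply mirrors the repetition count used earlier.

```elisp
(require 'json)
(require 'benchmark)

;; Hypothetical helper, named here for illustration only.
(defun my-json-el-benchmark (json-text)
  "Parse JSON-TEXT ten times with json.el's `json-read'.
Return the list (ELAPSED-SECONDS GC-COUNT GC-SECONDS) that
`benchmark-run' produces."
  (with-temp-buffer
    (insert json-text)
    (benchmark-run 10
      ;; `json-read' parses from point, so rewind before each run.
      (goto-char (point-min))
      (json-read))))
```

Passing the contents of a real file instead of a short string gives figures in the same three-column shape as the table: time, number of GCs, and GC time.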


Re: JSON/YAML/TOML/etc. parsing performance

Philipp Stephani


Mark Oteiza <[hidden email]> wrote on Mon, 18 Sep 2017 at 15:58:
> Was there a particular reason (aside from access time) you chose
> hash tables instead of a sexp form?

- Hashtables have similar constraints as the underlying JSON objects (no duplicate keys, no ordering), so they are a better match.
- Hashtables have non-nil empty values. If I had used alists, I would have had to introduce a separate keyword :json-null for null.
- Hashtables always represent maps, but alists are also normal sequences, so users could expect that they get translated into arrays instead of objects.
- Using only one data structure per JSON object type makes the interface and implementation simpler.
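
The first two points can be illustrated with core Emacs Lisp alone. This is a small sketch rather than code from the patch; `my-json-empty-object-demo` is a hypothetical name, and it deliberately avoids the new JSON functions so it does not depend on how the patch behaves.

```elisp
;; Hypothetical demo function, named here for illustration only.
(defun my-json-empty-object-demo ()
  "Return (EMPTY-HASH-IS-NIL EMPTY-ALIST-IS-NIL HASH-COUNT ALIST-LENGTH)."
  (let ((table (make-hash-table :test 'equal))
        (alist '()))
    ;; Storing the same key twice leaves a single entry in a hash table...
    (puthash "a" 1 table)
    (puthash "a" 2 table)
    ;; ...whereas an alist holds both pairs.
    (push (cons "a" 1) alist)
    (push (cons "a" 2) alist)
    (list (null (make-hash-table :test 'equal)) ; empty hash table is non-nil
          (null '())                            ; empty alist is just nil
          (hash-table-count table)
          (length alist))))

(my-json-empty-object-demo) ; => (nil t 1 2)
```

Because the empty alist is the same object as nil, an alist-based representation needs some extra marker to tell an empty object from null or false, which is the motivation given above for a keyword such as :json-null.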

Re: JSON/YAML/TOML/etc. parsing performance

Mark Oteiza
On 18/09/17 at 02:14pm, Philipp Stephani wrote:

> Mark Oteiza <[hidden email]> wrote on Mon, 18 Sep 2017 at 15:58:
>
> > Was there a particular reason (aside from access time) you chose
> > hash tables instead of a sexp form?
>
> - Hashtables have similar constraints as the underlying JSON objects (no
> duplicate keys, no ordering), so they are a better match.
> - Hashtables have non-nil empty values. If I had used alists, I would have
> had to introduce a separate keyword :json-null for null.
> - Hashtables always represent maps, but alists are also normal sequences,
> so users could expect that they get translated into arrays instead of
> objects.
> - Using only one data structure per JSON object type makes the interface
> and implementation simpler.

Makes sense and I agree, thank you.  Thanks for the patch.


Re: JSON/YAML/TOML/etc. parsing performance

Philipp Stephani


Mark Oteiza <[hidden email]> wrote on Mon, 18 Sep 2017 at 16:28:
> On 18/09/17 at 02:14pm, Philipp Stephani wrote:
> > Mark Oteiza <[hidden email]> wrote on Mon, 18 Sep 2017 at 15:58:
> >
> > > Was there a particular reason (aside from access time) you chose
> > > hash tables instead of a sexp form?
> >
> > - Hashtables have similar constraints as the underlying JSON objects (no
> > duplicate keys, no ordering), so they are a better match.
> > - Hashtables have non-nil empty values. If I had used alists, I would have
> > had to introduce a separate keyword :json-null for null.
> > - Hashtables always represent maps, but alists are also normal sequences,
> > so users could expect that they get translated into arrays instead of
> > objects.
> > - Using only one data structure per JSON object type makes the interface
> > and implementation simpler.
>
> Makes sense and I agree, thank you.  Thanks for the patch.

Thanks for the review; pushed to master as cb99cf5a99. 

Re: JSON/YAML/TOML/etc. parsing performance

Eli Zaretskii
In reply to this post by Philipp Stephani
> From: Philipp Stephani <[hidden email]>
> Date: Mon, 18 Sep 2017 13:26:34 +0000
>
> I've discovered that the interface and documentation of Jansson are much better than the ones of json-c, so I
> switched to Jansson.

Thanks, but isn't Jansson less actively developed, judging by your
survey?

> I've attached a patch.

I thought we wanted to import the library into Emacs proper, didn't
we?  What is the purpose of providing such a core functionality as an
optional feature?


Re: JSON/YAML/TOML/etc. parsing performance

Eli Zaretskii
In reply to this post by Philipp Stephani
> From: Philipp Stephani <[hidden email]>
> Date: Mon, 18 Sep 2017 14:36:43 +0000
> Cc: [hidden email]
>
> Thanks for the review; pushed to master as cb99cf5a99.

Boom!  We've just started talking about this, and AFAIU didn't even
agree on this library.  And I'm not sure we want a JSON library as an
optional feature.  So I must ask you to please revert, and let's wait
until the discussion comes to its completion.
