Improving regexp-opt

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Improving regexp-opt

Miguel V. S. Frasson
Hi

I had an idea to improve regexp-opt. (I use Emacs 25.3.2).

In a regexp when you have a group with alternatives, sometimes all alternatives *finish* with one or more common atom regexps. You could take the common part out of the group and try to improve the remaining smaller group, splitting all strings that match and recurse regexp-opt.

Example 1: we read from regexp-opt.el:
;; One possible improvement would be to compile '("aa" "ab" "ba" "bb")
;; into "[ab][ab]" rather than "a[ab]\\|b[ab]".  I'm not sure it's worth
;; it but if someone knows how to do it without going through too many
;; contortions, I'm all ears.

Evaluating,
(regexp-opt '("aa" "ab"  "ba" "bb"))
 -> "\\(?:a[ab]\\|b[ab]\\)"

All alternatives finish with "[ab]", so it is equivalent to
  "\\(?:a\\|b\\)[ab]"

The remaining group is "\\(?:a\\|b\\)". We can further improve making a list of all strings that match it and recourse regexp-opt:
"\\(?:a\\|b\\)"
all-matchs = ("a" "b")
(regexp-opt '("a" "b")) -> "[ab]"

Finally join the two regexps: "[ab][ab]"
... and the wish of the developer is fulfilled :) 

Example 2:
(regexp-opt '("car" "cdr" "caar" "cadr" "cdar" "cddr"))
-> "\\(?:c\\(?:\\(?:a[ad]\\|d[ad]\\|[ad]\\)r\\)\\)"

First of all, there is an (apparently) unnecessary group around the result. 

Look the inner group of the result:
  "\\(?:a[ad]\\|d[ad]\\|[ad]\\)"
Notice that all alternatives finish with "[ad]", so this group is equivalent to
 "\\(?:a\\|d\\|\\)[ad]" 

The smaller group matches the strings ("a" "d" "") and
  (regexp-opt '("a" "d" "")) ->"[ad]?"
and the inner group is equivalent to 
 "[ad]?[ad]"

So,
  (regexp-opt '("car" "cdr" "caar" "cadr" "cdar" "cddr"))
could return (eliminating outer group) 
  "c[ad]?[ad]r"

Of course the splitting and recurse not always improve smaller group. For example, 
(regexp-opt '("master" "monster" "mister"))
-> "\\(?:m\\(?:\\(?:on\\|[ai]\\)ster\\)\\)"

It would be better (imo) eliminate unnecessary groups (those without "\\|") that the result was
  "m\\(?:on\\|[ai]\\)ster"

Cheers

Miguel Frasson

Reply | Threaded
Open this post in threaded view
|

Re: Improving regexp-opt

Stefan Monnier
> (regexp-opt '("car" "cdr" "caar" "cadr" "cdar" "cddr"))
> -> "\\(?:c\\(?:\\(?:a[ad]\\|d[ad]\\|[ad]\\)r\\)\\)"
> First of all, there is an (apparently) unnecessary group around the result.

FWIW, I think this is not an error: we want (concat (regexp-opt STRS) "*")
to have a well-defined behavior (i.e. allow any number of repetitions of
STRS).

> (regexp-opt '("master" "monster" "mister"))
> -> "\\(?:m\\(?:\\(?:on\\|[ai]\\)ster\\)\\)"
> It would be better (imo) eliminate unnecessary groups (those without "\\|")
> that the result was
>   "m\\(?:on\\|[ai]\\)ster"

Here, OTOH, the second (shy) subgroup is indeed unnecessary.

Regarding improving regexp-opt, in the general case you're looking at
minimizing finite state automatons.  When regexp-opt was written, the
main purpose was to try and reduce backtracking and for that it's
perfectly sufficient to turn ("ack" "attack") into
"a\\(?:ck\\|ttack\\)".  I later added "tail sharing" so that ("ack"
"attack") turns into "a\\(?:tta\\)?ck" but that's not really much use in
practice.  We could try and get fancier, but it will tend to slow down
regexp-opt even more for rather small benefits (except in corner cases).
A much better approach is to go for a real "regexp to NFA/DFA
conversion".  The `lex.el` package is one such example, but it's very
inefficient (in terms of building the FA and in the size of the FA, not
in terms of running the FA).


        Stefan