regular expressions that match nothing

classic Classic list List threaded Threaded
29 messages Options
12
Reply | Threaded
Open this post in threaded view
|

regular expressions that match nothing

philippe schnoebelen
I was very happy to see that in v27.0.50 (regexp-opt nil) now properly returns a regular expression that matches nothing, namely a\`. Thanks to whoever fixed that old bug.

I was wondering why (regexp-opt nil) uses a\` and not \'a or another option like \=a\= so I did some profiling (see attached code).

The different options that I tried have more or less the same response time when one checks, via looking-at, whether the regexp matches at point. But when one searches for a match across a whole buffer, some options behave notably faster than the others. And a\` is not the best, e.g., \=a\= is way faster. Maybe some other solutions would be even faster.

Of course this may be dependent on the internals of the specific regexp library at hand. I do not know enough to judge. In fact I believe that a solid regular expression library should provide a specific regular expression that matches nothing with special but easy treatment that guarantees best response time.

--phs

profile-empty-regexp.el (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: regular expressions that match nothing

Mattias Engdegård-2
tis 2019-05-14 klockan 09:25 +0200 skrev philippe schnoebelen:
> I was wondering why (regexp-opt nil) uses a\` and not \'a or another
> option like \=a\= so I did some profiling (see attached code).

Thank you, and sorry about my bad initial attempt. I tried a few more,
like [z-a], \c* and \sq, but these were no better. The distribution is
decidedly bimodal; there seems to be no significant difference between
the 'fast' ones, so I went with \`a\` in the attached patch.

> Of course this may be dependent on the internals of the specific
> regexp library at hand. I do not know enough to judge. In fact I
> believe that a solid regular expression library should provide a
> specific regular expression that matches nothing with special but
> easy treatment that guarantees best response time.

We could add a standard constant for it, like unmatchable-regexp, so
that at least people don't keep reinventing it.
We could also make (rx (or)) work. (It does in my complete rx rewrite.)



0001-Use-faster-unmatchable-regexp.patch (19K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: regular expressions that match nothing

Stefan Monnier
> Thank you, and sorry about my bad initial attempt. I tried a few more,
> like [z-a], \c* and \sq, but these were no better. The distribution is
> decidedly bimodal; there seems to be no significant difference between
> the 'fast' ones,

Not surprised: the "fast" ones are the ones that the regexp engine
recognizes as "anchored" so the search is reduced to a looking-at.

> so I went with \`a\` in the attached patch.

Sounds good.

>> Of course this may be dependent on the internals of the specific
>> regexp library at hand. I do not know enough to judge. In fact I
>> believe that a solid regular expression library should provide a
>> specific regular expression that matches nothing with special but
>> easy treatment that guarantees best response time.
> We could add a standard constant for it, like unmatchable-regexp, so

Yes, please.  I'd recommend a `regexp-` prefix for it.
[ And I'll carefully avoid having an opinion on the rest of the name. ]

> that at least people don't keep reinventing it.

Not only that, but I'm pretty sure casual users will find it much easier
to understand what's going on when they see `regexp-<foo>` than when
they bump into "\\`a\\`".  I.e. the constant's name should work as
a good comment.


        Stefan


Reply | Threaded
Open this post in threaded view
|

Re: regular expressions that match nothing

Mattias Engdegård-2
tis 2019-05-14 klockan 15:41 -0400 skrev Stefan Monnier:

> Yes, please.  I'd recommend a `regexp-` prefix for it.

Well, since you asked so nicely!

> [ And I'll carefully avoid having an opinion on the rest of the name.
> ]

The correct name is obviously something like `regexp-empty', but I have
to concede that it might be misinterpreted. The attached patch uses
`regexp-unmatchable' which is reasonably descriptive. Better
suggestions are welcome.


0001-Add-standard-unmatchable-regexp.patch (28K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: regular expressions that match nothing

Alan Mackenzie
Hello, Mattias.

On Wed, May 15, 2019 at 18:21:07 +0200, Mattias Engdegård wrote:
> tis 2019-05-14 klockan 15:41 -0400 skrev Stefan Monnier:

> > Yes, please.  I'd recommend a `regexp-` prefix for it.

> Well, since you asked so nicely!

> > [ And I'll carefully avoid having an opinion on the rest of the name.
> > ]

> The correct name is obviously something like `regexp-empty', but I have
> to concede that it might be misinterpreted. The attached patch uses
> `regexp-unmatchable' which is reasonably descriptive. Better
> suggestions are welcome.

I think regexp-unmatchable is too much of a mouthful.  It is more
difficult to type that a\\` (or whatever), even after having to think
where the seldom used keys are on the keyboard.  Also it is difficult to
spell.  is it unmatchable or unmatcheable?

I would suggest re-nomatch (or possibly nomatch-re), which is just 10
characters, as opposed to your suggestion which is 18.  Quite possibly,
re-nomatch is easier to type than a\\` (or whatever).

This ease of typing is important, because it encourages hackers to use
it rather than typing out the shorter thing.

--
Alan Mackenzie (Nuremberg, Germany).

Reply | Threaded
Open this post in threaded view
|

Re: regular expressions that match nothing

Michael Heerdegen
In reply to this post by Mattias Engdegård-2
Mattias Engdegård <[hidden email]> writes:

> The correct name is obviously something like `regexp-empty', but I have
> to concede that it might be misinterpreted. The attached patch uses
> `regexp-unmatchable' which is reasonably descriptive. Better
> suggestions are welcome.

Should there be an rx regexp form for this?

Michael.

Reply | Threaded
Open this post in threaded view
|

Re: regular expressions that match nothing

Stefan Monnier
> Should there be an rx regexp form for this?

(or)


        Stefan

Reply | Threaded
Open this post in threaded view
|

Re: regular expressions that match nothing

Mattias Engdegård-2
In reply to this post by Michael Heerdegen
15 maj 2019 kl. 22.17 skrev Michael Heerdegen <[hidden email]>:
>
> Should there be an rx regexp form for this?

We don't necessarily need a special form for it; we can just make `(or)' work.

Proposed patch attached. (I also added its dual, (seq), since it would be silly not to.)


0001-Allow-zero-argument-rx-or-and-seq-forms.patch (5K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: regular expressions that match nothing

Michael Heerdegen
Mattias Engdegård <[hidden email]> writes:

> 15 maj 2019 kl. 22.17 skrev Michael Heerdegen <[hidden email]>:
> >
> > Should there be an rx regexp form for this?
>
> We don't necessarily need a special form for it; we can just make
> `(or)' work.
>
> Proposed patch attached. (I also added its dual, (seq), since it would
> be silly not to.)

Makes sense to me, thanks.

Michael.

Reply | Threaded
Open this post in threaded view
|

More re odditie [Was: regular expressions that match nothing]

phs-2
In reply to this post by Mattias Engdegård-2
While testing how rx handles the new regexp-unmatchable construct, I
stumbled on some odd emacs behavior.

The regular expression "*", instead of matching zero-or-more occurrences
of the empty string, behaves as "\*" or "[*]" and only matches the *
character.

I believe this is a bug, not a feature, and in any case the Emacs manual
does not document this behavior.
Perhaps it is not a bug in Emacs per se, rather a bug in the gnulib
regex library so I'll wait for comments before submitting a bug report...

If this behavior is the expected one, then (rx (0+ "")) should not
return "*".

We have the same problem with "+" and (rx (1+ "")).

-phs

On 2019/05/15 23:07, Mattias Engdegård wrote:
> 15 maj 2019 kl. 22.17 skrev Michael Heerdegen <[hidden email]>:
>> Should there be an rx regexp form for this?
> We don't necessarily need a special form for it; we can just make `(or)' work.
>
> Proposed patch attached. (I also added its dual, (seq), since it would be silly not to.)
>


Reply | Threaded
Open this post in threaded view
|

Re: More re odditie [Was: regular expressions that match nothing]

Mattias Engdegård-2
16 maj 2019 kl. 08.57 skrev phs <[hidden email]>:
>
> The regular expression "*", instead of matching zero-or-more occurrences
> of the empty string, behaves as "\*" or "[*]" and only matches the *
> character.
>
> I believe this is a bug, not a feature, and in any case the Emacs manual
> does not document this behavior.

Actually this one is documented:

  For historical compatibility, special characters are treated as ordinary
  ones if they are in contexts where their special meanings make no sense.
  For example, `*foo' treats `*' as ordinary since there is no preceding
  expression on which the `*' can act.

and in any case the 'correct' behaviour would be to signal a syntax error, not repeat the empty string.

> If this behavior is the expected one, then (rx (0+ "")) should not
> return "*".

Thanks for reporting it, and you are right, that's a (known) bug in rx.


Reply | Threaded
Open this post in threaded view
|

Re: regular expressions that match nothing

Mattias Engdegård-2
In reply to this post by Alan Mackenzie
15 maj 2019 kl. 21.41 skrev Alan Mackenzie <[hidden email]>:
>
> I think regexp-unmatchable is too much of a mouthful.  It is more
> difficult to type that a\\` (or whatever), even after having to think
> where the seldom used keys are on the keyboard.  Also it is difficult to
> spell.  is it unmatchable or unmatcheable?
>
> I would suggest re-nomatch (or possibly nomatch-re), which is just 10
> characters, as opposed to your suggestion which is 18.  Quite possibly,
> re-nomatch is easier to type than a\\` (or whatever).

Thanks Alan, and you may have a point. I'm definitely not against a better name if there is a consensus for it.
Let me just dispassionately note that:

1. As 'match' does not end in 'e', there is no more reason to write
   'unmatcheable' than 'undrinkeable'.

2. (rx (or)) is even shorter than re-nomatch, and is very memorable.
   (rx (|)) is shorter still.

3. Lisp tradition is unafraid of the verbose, partly because `-' is
   allowed in identifiers which lowers the friction.

4. The point of this name isn't to be shorter than the regexp string
   it represents, but to be more readable and avoid mistakes and
   substandard reinventions.

Start bikeshedding; I'll try to low-pass filter.


Reply | Threaded
Open this post in threaded view
|

Re: More re odditie [Was: regular expressions that match nothing]

phs-2
In reply to this post by Mattias Engdegård-2
Hi Mattias,

On 2019/05/16 11:29, Mattias Engdegård wrote:
> 16 maj 2019 kl. 08.57 skrev phs <[hidden email]>:

> Actually this one is documented:
>
>   For historical compatibility, special characters are treated as ordinary
>   ones if they are in contexts where their special meanings make no sense.
>   For example, `*foo' treats `*' as ordinary since there is no preceding
>   expression on which the `*' can act.

I missed that. Thanks for pointing it out.

> and in any case the 'correct' behaviour would be to signal a syntax error, not repeat the empty string.

I'd rather read `*' as meaning "repeat the empty string", as with
`\(\)*', but this is a matter of taste, and historical compatibility is
very important.

BTW, can your scans of regexps tell if this compatibility is relied on a
lot? It would be safe to replace `*' and `+' with `\*' and `\+' where
this happens.  I've just grep'ed quickly through the code and only
noticed a risky use of "+" (and "[..]") in the definition of `term-word'
in term.el

> Thanks for reporting it, and you are right, that's a (known) bug in rx.

When rx is fixed, I suggest we add the following extra tests (see patch)

--phs

more-rx-tests.patch (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: More re odditie [Was: regular expressions that match nothing]

Stefan Monnier
> I'd rather read `*' as meaning "repeat the empty string",

`*` repeats the immediately preceding regular expression, but here there
is no preceding regular expression:
- there are various ways to resolve this problem.
- all "reasonable" semantics are pretty useless in the sense that it's
  trivial to write another regexp with the same semantics.
- the combination of the previous two points implies that signaling an
  error is probably the better option.

> BTW, can your scans of regexps tell if this compatibility is relied on a
> lot?

His scan does catch those (and many other of its friends).

> It would be safe to replace `*' and `+' with `\*' and `\+' where
> this happens.

We've done that, indeed.

> I've just grep'ed quickly through the code and only noticed a risky
> use of "+" (and "[..]") in the definition of `term-word' in term.el

That's because we already fixed the occurrences that his tool finds ;-)

>> Thanks for reporting it, and you are right, that's a (known) bug in rx.
> When rx is fixed, I suggest we add the following extra tests (see patch)

Good idea.


        Stefan


Reply | Threaded
Open this post in threaded view
|

Re: regular expressions that match nothing

Eric Abrahamsen-2
In reply to this post by Mattias Engdegård-2
Mattias Engdegård <[hidden email]> writes:

> tis 2019-05-14 klockan 15:41 -0400 skrev Stefan Monnier:
>
>> Yes, please.  I'd recommend a `regexp-` prefix for it.
>
> Well, since you asked so nicely!
>
>> [ And I'll carefully avoid having an opinion on the rest of the name.
>> ]
>
> The correct name is obviously something like `regexp-empty', but I have
> to concede that it might be misinterpreted. The attached patch uses
> `regexp-unmatchable' which is reasonably descriptive. Better
> suggestions are welcome.

regexp-null? It's also a little cryptic, but it's short!

Reply | Threaded
Open this post in threaded view
|

Re: More re odditie [Was: regular expressions that match nothing]

Michael Heerdegen
In reply to this post by phs-2
BTW, while we are here: I'm also curious whether we want to add
nl, newline -> "\n" to rx.  I mean we have nonl and space - only for
newlines one needs to use string syntax in rx.

Michael.

Reply | Threaded
Open this post in threaded view
|

Re: More re odditie [Was: regular expressions that match nothing]

Mattias Engdegård-2
16 maj 2019 kl. 20.35 skrev Michael Heerdegen <[hidden email]>:
>
> BTW, while we are here: I'm also curious whether we want to add
> nl, newline -> "\n" to rx.  I mean we have nonl and space - only for
> newlines one needs to use string syntax in rx.

Maybe, but there is no special name for any other literal string. `space' doesn't mean the space character but [[:space:]], or \s-, whose exact meaning depends on the current syntax table. (Which is sometimes very surprising; look at rfc2047-syntax-table!) `nonl' has plenty of legacy -- the name is from SRE, and even in string regexps it has a special convenience symbol.

I would much rather see a clean, robust and expressive extension mechanism for rx (`rx-constituents' does not count). There are various hacks and libraries but it probably needs to be integrated.


Reply | Threaded
Open this post in threaded view
|

Global and local definitions of non-functions/variable (was: More re odditie [Was: regular expressions that match nothing])

Stefan Monnier
> I would much rather see a clean, robust and expressive extension mechanism
> for rx (`rx-constituents' does not count).  There are various hacks and
> libraries but it probably needs to be integrated.

Agreed.  Something like `rx-defmacro` and `rx-macrolet`.

Reminds me that this is a recurring need (we see it in `pcase`, in
`peg`, here, and arguably in `gv`).  It would be nice if we could design
a general solution that those packages can (re)use.


        Stefan


PS: Arguably `cl-defmethod` could also be extended to a kind of
    `methodlet` for scoped methods, but it might be tricky to do that.
    OTOH, this might provide exactly the generic mechanism we need to
    implement the others.


Reply | Threaded
Open this post in threaded view
|

Re: regular expressions that match nothing

Phil Sainty
In reply to this post by Mattias Engdegård-2
> 15 maj 2019 kl. 21.41 skrev Alan Mackenzie <[hidden email]>:
>> I think regexp-unmatchable is too much of a mouthful.

I like it, myself.  I think the meaning is 100% clear and unambiguous
for the reader (which I can't say about the alternative suggestions
that I've seen).

Are we expecting this to be used so much that we're prioritising
brevity over clarity?

(That's a genuine question -- I have a similar definition in my own
config, and I have exactly one use for it.)


On 2019-05-16 22:54, Mattias Engdegård wrote:
> 4. The point of this name isn't to be shorter than the regexp string
>    it represents, but to be more readable and avoid mistakes and
>    substandard reinventions.

Quite.


> 2. (rx (or)) is even shorter than re-nomatch, and is very memorable.
>    (rx (|)) is shorter still.

I don't think those are much better than people using "a\\`".

*Surely* `rx` can simply acquire a symbol for this?

(rx unmatchable) or similar?


-Phil


Reply | Threaded
Open this post in threaded view
|

Re: regular expressions that match nothing

Alan Mackenzie
Hello, Phil.

On Fri, May 17, 2019 at 11:18:49 +1200, Phil Sainty wrote:
> > 15 maj 2019 kl. 21.41 skrev Alan Mackenzie <[hidden email]>:
> >> I think regexp-unmatchable is too much of a mouthful.

> I like it, myself.  I think the meaning is 100% clear and unambiguous
> for the reader (which I can't say about the alternative suggestions
> that I've seen).

I find it too difficult to read.  My brain simply doesn't recognise it
instantly, the way it would re-nomatch.  At the moment, given there is
no similar symbol name in Emacs, this is less urgent, but if more
similar long symbols were introduced this would be a pain - a minor pain
yes, but a pain nevertheless.

> Are we expecting this to be used so much that we're prioritising
> brevity over clarity?

Brevity is clarity - up to a point.  We write `defun', not
`define-function'.  Who would argue that the latter of these is clearer?

Why has nobody commented on my suggestion of using re- rather than
regexp- as the prefix?  We already have re-search-forward.

> (That's a genuine question -- I have a similar definition in my own
> config, and I have exactly one use for it.)

There are quite a few uses of "a\\`" in CC Mode.  If they were to be
replaced by regexp-unmatchable, I might have to re-flow the code, to
avoid it going too far over 80 columns.


> On 2019-05-16 22:54, Mattias Engdegård wrote:
> > 4. The point of this name isn't to be shorter than the regexp string
> >    it represents, but to be more readable and avoid mistakes and
> >    substandard reinventions.

> Quite.


> > 2. (rx (or)) is even shorter than re-nomatch, and is very memorable.
> >    (rx (|)) is shorter still.

This is undesirable in source files which don't otherwise use rx.  It's
also cryptic, forcing some readers to do research.

> I don't think those are much better than people using "a\\`".

> *Surely* `rx` can simply acquire a symbol for this?

> (rx unmatchable) or similar?


> -Phil

--
Alan Mackenzie (Nuremberg, Germany).

12