bug#43598: replace-in-string: finishing touches

classic Classic list List threaded Threaded
18 messages Options
Reply | Threaded
Open this post in threaded view
|

bug#43598: replace-in-string: finishing touches

Mattias Engdegård-2
25 sep. 2020 kl. 01.54 skrev Lars Ingebrigtsen <[hidden email]>:

> I went ahead and checked in a new C-level function string-search, which
> should be an efficient way to search for strings in strings (using
> memmem, which Emacs has via Gnulib?), and this fixed these corner cases.

Thank you! Here are some proposed tweaks (diff attached):

1. Check the range of the START-POS argument so that we don't crash.
The permitted range is [0..N] where N is (length HAYSTACK), thus we permit a start right after the last character but no further.
We could also return nil in these cases but I think an error is more useful.

2. Make the docs more precise about various things.

3. Slight simplification of the implementation logic to avoid testing the same conditions multiple times.

4. More tests, especially for edge cases. Can't have too many!
One test still fails:

 (string-search "ø" "\303\270")

which should return nil but currently matches.
I think it's wrong to convert the needle to unibyte (using Fstring_as_unibyte) in this case, but I haven't decided what the best solution would be.

We should also consider the optimisations:
- If SCHARS(needle)>SCHARS(haystack) then no match is possible.
- If either needle or haystack is all-ASCII (all bytes in 0..127), then we can use memmem without conversion.


string-search.diff (6K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

bug#43598: replace-in-string: finishing touches

Lars Ingebrigtsen
Mattias Engdegård <[hidden email]> writes:

> 1. Check the range of the START-POS argument so that we don't crash.
> The permitted range is [0..N] where N is (length HAYSTACK), thus we
> permit a start right after the last character but no further.
> We could also return nil in these cases but I think an error is more useful.

Good point.  :-)

> 2. Make the docs more precise about various things.
>
> 3. Slight simplification of the implementation logic to avoid testing
> the same conditions multiple times.
>
> 4. More tests, especially for edge cases. Can't have too many!

It all looks good to me; please apply.

> One test still fails:
>
>  (string-search "ø" "\303\270")
>
> which should return nil but currently matches.
> I think it's wrong to convert the needle to unibyte (using
> Fstring_as_unibyte) in this case, but I haven't decided what the best
> solution would be.

Yeah, that's the bit I was most unsure about, because it just didn't
look quite correct to me, but I couldn't come up with the correct test
case last night; thanks.

> We should also consider the optimisations:
> - If SCHARS(needle)>SCHARS(haystack) then no match is possible.

Yup.

> - If either needle or haystack is all-ASCII (all bytes in 0..127),
> then we can use memmem without conversion.

Right, so if the multibyteness differs, then do another check to see
whether both strings are all-ASCII anyway, and do the comparison without
conversion...  Yes, makes sense to me.

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



Reply | Threaded
Open this post in threaded view
|

bug#43598: replace-in-string: finishing touches

Mattias Engdegård-2
25 sep. 2020 kl. 13.11 skrev Lars Ingebrigtsen <[hidden email]>:

> It all looks good to me; please apply.

Thanks, will do shortly.

> Right, so if the multibyteness differs, then do another check to see
> whether both strings are all-ASCII anyway, and do the comparison without
> conversion...

Both strings don't need to be all-ASCII; one of them suffices.

By the way, I added an argument check to replace-in-string to prevent it from entering an infinite loop.




Reply | Threaded
Open this post in threaded view
|

bug#43598: replace-in-string: finishing touches

Lars Ingebrigtsen
In reply to this post by Mattias Engdegård-2
Mattias Engdegård <[hidden email]> writes:

> We should also consider the optimisations:
> - If SCHARS(needle)>SCHARS(haystack) then no match is possible.

I've now done this.

> - If either needle or haystack is all-ASCII (all bytes in 0..127),
> then we can use memmem without conversion.

I thought that surely there's be a function like that in Emacs, but I
can't find it?

Instead there's code like

          && (STRING_MULTIBYTE (string)
              ? (chars == bytes) : string_ascii_p (string))
[...]
/* Whether STRING only contains chars in the 0..127 range.  */
static bool
string_ascii_p (Lisp_Object string)
{
  ptrdiff_t nbytes = SBYTES (string);
  for (ptrdiff_t i = 0; i < nbytes; i++)
    if (SREF (string, i) > 127)
      return false;
  return true;
}

and

          unsigned char *p = SDATA (name);
          while (*p && ASCII_CHAR_P (*p))
            p++;

sprinkled around the code base.

Would it make sense to add a new utility function that does the right
thing for both multibyte and unibyte strings?  (The multibyte case is
just chars == bytes, but the unibyte case would be a loop.)

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



Reply | Threaded
Open this post in threaded view
|

bug#43598: replace-in-string: finishing touches

Lars Ingebrigtsen
In reply to this post by Mattias Engdegård-2
Mattias Engdegård <[hidden email]> writes:

> Both strings don't need to be all-ASCII; one of them suffices.

I've now added that, and after that, everything fell into place for the
multibyte-needle/unibyte-haystack case, too.

That can only match if the needle contains nothing but ASCII and
eighth-bit chars, so I've altered it to return Qnil if there's any other
chars, and then convert to unibyte and do memmem otherwise.

*phew*

Is that all cases covered now?

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



Reply | Threaded
Open this post in threaded view
|

bug#43598: replace-in-string: finishing touches

Lars Ingebrigtsen
Lars Ingebrigtsen <[hidden email]> writes:

> Is that all cases covered now?

Well, I can't think of any more, but I said that before.  :-/

Anyway, we now have this slightly amusing situation:

(string-search (string-to-multibyte "o\303\270") "o\303\270")
=> 0

(string-match (string-to-multibyte "o\303\270") "o\303\270")
=> 0

(equal (string-to-multibyte "o\303\270") "o\303\270")
=> nil

But I guess we've lived with this for...  decades?  So that's probably
OK.

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



Reply | Threaded
Open this post in threaded view
|

bug#43598: replace-in-string: finishing touches

Mattias Engdegård-2
27 sep. 2020 kl. 02.34 skrev Lars Ingebrigtsen <[hidden email]>:

> (string-search (string-to-multibyte "o\303\270") "o\303\270")
> => 0
>
> (string-match (string-to-multibyte "o\303\270") "o\303\270")
> => 0
>
> (equal (string-to-multibyte "o\303\270") "o\303\270")
> => nil

(compare-strings (string-to-multibyte "o\303\270") nil nil "o\303\270" nil nil)
=> t

I'd say it's equal, string-equal, string-lessp and string-greaterp that are odd and that we probably should fix if it can be done without making them slower. Unless, of course, we can come up with an alternative theory of operation that is satisfactory.

> But I guess we've lived with this for...  decades?  So that's probably
> OK.

Yes, there is no hurry.




Reply | Threaded
Open this post in threaded view
|

bug#43598: replace-in-string: finishing touches

Mattias Engdegård-2
In reply to this post by Lars Ingebrigtsen
27 sep. 2020 kl. 02.03 skrev Lars Ingebrigtsen <[hidden email]>:

> *phew*

Not bad! This seems to work all right.
Here are some minor optimisations:

- Do the fast all-ASCII test (bytes == chars) before iterating through the bytes to check for non-ASCII chars.
- Faster check for non-ASCII non-raw bytes (no need for the complex code in string_char_advance).

It is tempting to vectorise the all-ASCII loop. Maybe another day...

The patch also adds some more test cases for completeness.


string-search-tweaks.diff (8K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

bug#43598: replace-in-string: finishing touches

Lars Ingebrigtsen
Mattias Engdegård <[hidden email]> writes:

> Here are some minor optimisations:
>
> - Do the fast all-ASCII test (bytes == chars) before iterating through
> the bytes to check for non-ASCII chars.

[...]

> -  if (STRING_MULTIBYTE (string))
> -    return SBYTES (string) == SCHARS (string);

[...]

> -  if (STRING_MULTIBYTE (haystack) == STRING_MULTIBYTE (needle)
> -      || string_ascii_p (needle)
> -      || string_ascii_p (haystack))
> +  /* We can do a direct byte-string search if both strings have the
> +     same multibyteness, or if at least one of them consists of ASCII
> +     characters only.  */
> +  if (STRING_MULTIBYTE (haystack)
> +      ? (STRING_MULTIBYTE (needle)
> +         || SCHARS (haystack) == SBYTES (haystack) || string_ascii_p (needle))
> +      : (!STRING_MULTIBYTE (needle)
> +         || SCHARS (needle) == SBYTES (needle) || string_ascii_p (haystack)))

Didn't you just move the STRING_MULTIBYTE bits of the test from the
string_ascii_p function and open-code it into Fstring_search function
here?  I'm not sure how that's an optimisation?

> +      ptrdiff_t nbytes = SBYTES (needle);
> +      for (ptrdiff_t i = 0; i < nbytes; i++)
> +        {
> +          int c = SREF (needle, i);
> +          if (CHAR_BYTE8_HEAD_P (c))
> +            i++;                /* Skip raw byte.  */
> +          else if (!ASCII_CHAR_P (c))
> +            return Qnil;  /* Found a char that can't be in the haystack.  */
> +        }

Looks good.

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



Reply | Threaded
Open this post in threaded view
|

bug#43598: replace-in-string: finishing touches

Mattias Engdegård-2
27 sep. 2020 kl. 13.48 skrev Lars Ingebrigtsen <[hidden email]>:

> Didn't you just move the STRING_MULTIBYTE bits of the test from the
> string_ascii_p function and open-code it into Fstring_search function
> here?

No, look again. Previously, we would loop through all bytes of a unibyte needle before checking the lengths of a multibyte haystack. With the patch, we always do the cheap (length) check first. That's why that check had to be moved out of the helper function.




Reply | Threaded
Open this post in threaded view
|

bug#43598: replace-in-string: finishing touches

Lars Ingebrigtsen
Mattias Engdegård <[hidden email]> writes:

> No, look again. Previously, we would loop through all bytes of a
> unibyte needle before checking the lengths of a multibyte
> haystack.

Duh; you're right.  Please go ahead and apply.

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



Reply | Threaded
Open this post in threaded view
|

bug#43598: replace-in-string: finishing touches

Mattias Engdegård-2
27 sep. 2020 kl. 14.02 skrev Lars Ingebrigtsen <[hidden email]>:

> Please go ahead and apply.

Applied, thank you.

Looks like we are done now. string-replace seems to be substantially faster than replace-regexp-in-string.
Good job!




Reply | Threaded
Open this post in threaded view
|

bug#43598: replace-in-string: finishing touches

Eli Zaretskii
> From: Mattias Engdegård <[hidden email]>
> Date: Sun, 27 Sep 2020 18:14:36 +0200
> Cc: [hidden email]
>
> Looks like we are done now. string-replace seems to be substantially faster than replace-regexp-in-string.
> Good job!

Thanks.  Is it possible to have some speed comparison for these two?



Reply | Threaded
Open this post in threaded view
|

bug#43598: replace-in-string: finishing touches

Lars Ingebrigtsen
Eli Zaretskii <[hidden email]> writes:

> Thanks.  Is it possible to have some speed comparison for these two?

This is what I used:

(let ((elems (mapcar (lambda (s)
                       (let ((start (random 80)))
                         (cons (substring s start (+ start (random 20)))
                               s)))
                     (cl-loop repeat 1000
                              collect (cl-coerce
                                       (cl-loop repeat 100
                                                collect (+ (random 26) ?a))
                                       'string)))))
  (list
   (benchmark-run 10000 (dolist (elem elems)
                          (string-search (car elem) (cdr elem))))
   (benchmark-run 10000 (dolist (elem elems)
                          (string-match (car elem) (cdr elem))))))

=>
((7.47099299 29 3.773541741999992)
 (19.673036086 74 9.616665831000006))

This is rather geared towards the weaknesses of string-match, though --
we're blowing through the regexp cache.

If you decrease the number of regexps to 10 and the run to 1000000, we get:

((7.818917279000001 37 4.791844609999998)
 (11.049133279 37 4.713127558000011))

And to compare with a "do-nothing" version:

   (benchmark-run 10000 (dolist (elem elems)
                          elem))))

=>

((5.74714395 28 3.722243896000009))

Using that as a baseline, the difference is 2s vs 5.2s.  

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



Reply | Threaded
Open this post in threaded view
|

bug#43598: replace-in-string: finishing touches

Eli Zaretskii
> From: Lars Ingebrigtsen <[hidden email]>
> Cc: Mattias Engdegård <[hidden email]>,
>   [hidden email]
> Date: Sun, 27 Sep 2020 18:41:39 +0200
>
> Using that as a baseline, the difference is 2s vs 5.2s.  

Thanks.  So a factor of 2.5, not bad.



Reply | Threaded
Open this post in threaded view
|

bug#43598: replace-in-string: finishing touches

Richard Stallman
In reply to this post by Lars Ingebrigtsen
[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > (string-match (string-to-multibyte "o\303\270") "o\303\270")
  > => 0

  > (equal (string-to-multibyte "o\303\270") "o\303\270")
  > => nil

It is paradoxical, but I think it is correct.
Equal compares the type of the string, not just the
characters in it.

--
Dr Richard Stallman
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)





Reply | Threaded
Open this post in threaded view
|

bug#43598: replace-in-string: finishing touches

Mattias Engdegård-2
28 sep. 2020 kl. 05.41 skrev Richard Stallman <[hidden email]>:

> It is paradoxical, but I think it is correct.
> Equal compares the type of the string, not just the
> characters in it.

No it doesn't. (equal (string-to-multibyte "A") "A") => t.

There is no deep reason for the current behaviour. It's just how things came to be, and nobody has been sufficiently annoyed to change it. The implementation is efficient and good enough for most purposes.

It is inconsistent and confusing though, and occasionally it does break down. One such case is when two strings that are not 'equal' become 'equal' after printing and reading them back, since unibyte and multibyte strings have the same printed representation. This can arise in conjunction with byte compilation.

Again, I have no plans to do anything about it right now.




Reply | Threaded
Open this post in threaded view
|

bug#43598: replace-in-string: finishing touches

Richard Stallman
[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > > It is paradoxical, but I think it is correct.
  > > Equal compares the type of the string, not just the
  > > characters in it.

  > No it doesn't. (equal (string-to-multibyte "A") "A") => t.

I am puzzled, then.  Why DOES the other example return nil?

--
Dr Richard Stallman
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)