bug#45652: so-long mode not triggered despite big file with very long lines

積丹尼 Dan Jacobson
Here we see the file(1) command knows a file has very long lines.
$ file www.youtube.com.har
www.youtube.com.har: UTF-8 Unicode text, with very long lines
$ wc www.youtube.com.har
   45982   330703 14075335 www.youtube.com.har

Alas, even in my https://www.jidanni.org/comp/configuration/.emacs with
(global-so-long-mode 1)
(setq so-long-action 'so-long-minor-mode)
this certain file does not trigger so-long mode.

Where to get such a file to test?
Real simple. In Chrome Developer Tools network tab, visit
https://www.youtube.com/c/jidanni2/playlists?view=1 and right click
"Save all as HAR with content."


[Attachment: 20210104T194123.jpg (111K)]

Lars Ingebrigtsen
積丹尼 Dan Jacobson <[hidden email]> writes:

> Where to get such a file to test?
> Real simple. In Chrome Developer Tools network tab, visit
> https://www.youtube.com/c/jidanni2/playlists?view=1 and right click
> "Save all as HAR with content."

Sorry, that's not a good way to reproduce this problem.  Could you put
a test file somewhere that can be retrieved with curl?

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




積丹尼 Dan Jacobson
In reply to this post by 積丹尼 Dan Jacobson
>>>>> "LI" == Lars Ingebrigtsen <[hidden email]> writes:
LI> 積丹尼 Dan Jacobson <[hidden email]> writes:

>> Where to get such a file to test?
>> Real simple. In Chrome Developer Tools network tab, visit
>> https://www.youtube.com/c/jidanni2/playlists?view=1 and right click
>> "Save all as HAR with content."

LI> Sorry, that's not a good way to reproduce this problem.  Could you put
LI> a test file somewhere that can be retrieved with curl?

I know.
Problem is with my 64K ADSL upload speeds here on my mountain,
it would take hours to upload.




Phil Sainty
In reply to this post by 積丹尼 Dan Jacobson
Hi Dan,

On 5/01/21 12:47 am, 積丹尼 Dan Jacobson wrote:
> Alas, even in my https://www.jidanni.org/comp/configuration/.emacs with
> (global-so-long-mode 1)
> (setq so-long-action 'so-long-minor-mode)
> this certain file does not trigger so-long mode.

I see that Emacs uses js-mode for .har files, so it *is* a targeted
major mode for so-long, which means that it's probably just not meeting
the configured criteria for a file with long lines.  See the docstring
for `so-long-detected-long-line-p' for details.

I'm also in the process of making changes to some of the relevant
default values, so you could test the current WIP version from here:

https://git.savannah.nongnu.org/cgit/so-long.git/plain/so-long.el?h=wip


> Where to get such a file to test?
> Real simple. In Chrome Developer Tools network tab, visit
> https://www.youtube.com/c/jidanni2/playlists?view=1 and right click
> "Save all as HAR with content."

I don't have Chrome installed, sorry; but if it's not a config issue
then I'll be happy to take a look at an example file, if you provide
one (compressed, please).


> Here we see the file(1) command knows a file has very long lines.
> $ file www.youtube.com.har
> www.youtube.com.har: UTF-8 Unicode text, with very long lines
> $ wc www.youtube.com.har
>    45982   330703 14075335 www.youtube.com.har

Those programs have no bearing on so-long's criteria, of course.

In particular, so-long doesn't (by default) look at the whole file,
and the above file is apparently 45,982 lines long.  For context
I am currently looking at increasing the maximum number of lines
checked by so-long (by default) to 500 -- approximately the first
1% of your file.

Even if we go with a larger number, I think it's likely to *be*
a fixed maximum number in the default config, which means there
will always be scope for a file to have its first gigantic line
on the (max+1)th line, and not be detected by so-long.

Hopefully the new default config will be more reliable than before,
though.
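
For reference, the limits discussed above can be adjusted in an init file. The following is a minimal sketch rather than a recommended setup: it assumes the `so-long-max-lines' and `so-long-threshold' user options behave as their docstrings describe (a nil `so-long-max-lines' meaning "no limit"), and the threshold value is purely illustrative.

```elisp
;; Sketch only: ask so-long to consider every line, not just the first few.
(require 'so-long)

;; nil removes the cap on how many lines are tested (assumption: as
;; documented in the docstring of `so-long-max-lines').
(setq so-long-max-lines nil)

;; Line length treated as "long".  1000 is an arbitrary example value.
(setq so-long-threshold 1000)

(global-so-long-mode 1)
```

Scanning without a limit trades detection reliability for time spent on very large buffers, which is the trade-off described above.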


-Phil




積丹尼 Dan Jacobson
In reply to this post by 積丹尼 Dan Jacobson
Let's make a deal: have so-long mode be the default (for those who have
opted in) for all "big" files in the first place.

I recall for this .har, it starts out well behaved, but then the
whopping lines are closer to the bottom of the file.

I would say that opening any large file is fraught with danger (of
locking up emacs, requiring kill -1.)

And, if the user is sure his file is safe, he can always toggle so-long
mode off for that file.

Even having emacs scan a file for big lines sounds like it might be
risky. So it would be great to give the scanner a vacation in such large
file cases.

PS> for `so-long-detected-long-line-p' for details.
Not yet in 27.1
PS> https://git.savannah.nongnu.org/cgit/so-long.git/plain/so-long.el?h=wip
That's OK, I'm not really following this that closely. Just giving suggestions.




Lars Ingebrigtsen
In reply to this post by Phil Sainty
Phil Sainty <[hidden email]> writes:

> In particular, so-long doesn't (by default) look at the whole file,
> and the above file is apparently 45,982 lines long.  For context
> I am currently looking at increasing the maximum number of lines
> checked by so-long (by default) to 500 -- approximately the first
> 1% of your file.

Have you considered adding a C-level primitive that just looks at the
entire buffer?  It should be reasonably simple and very fast -- just
count areas between "\n"s, skipping the buffer gap.  We don't have to
care about characters as such, I think, so this should be massively
faster than counting line lengths in Lisp.
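
To make the comparison concrete, here is roughly what "counting line lengths in Lisp" over the entire buffer looks like. This is only an illustrative sketch; `my-longest-line-length' is a hypothetical helper, not part of so-long or Emacs, and it counts characters where a C primitive could simply count bytes.

```elisp
(defun my-longest-line-length ()
  "Return the length, in characters, of the longest line in the buffer.
Hypothetical helper for illustration only."
  (save-excursion
    (save-restriction
      (widen)
      (goto-char (point-min))
      (let ((longest 0))
        ;; Visit each line once, remembering the largest length seen.
        (while (not (eobp))
          (setq longest (max longest
                             (- (line-end-position)
                                (line-beginning-position))))
          (forward-line 1))
        longest))))
```

Each iteration goes through several Lisp-level calls per line, which is the overhead a C-level scan between newlines would avoid.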

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Phil Sainty
In reply to this post by 積丹尼 Dan Jacobson
On 12/01/21 1:55 am, 積丹尼 Dan Jacobson wrote:
> Let's make a deal: have so-long mode be the default (for those
> who have opted in) for all "big" files in the first place.

Emacs currently notices when you ask it to visit a very large
file, and offers to let you open the file 'literally' which I
would suggest trying if you are unsure of the contents.

See (info "(emacs)Visiting")
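
As a small sketch of that mechanism: the size at which Emacs asks for confirmation is controlled by `large-file-warning-threshold'. The 5 MB figure below is an arbitrary assumption chosen for illustration, not a suggested value.

```elisp
;; Sketch only: prompt when visiting files larger than about 5 MB.
;; The number is an illustrative assumption.
(setq large-file-warning-threshold (* 5 1024 1024))
```

A file can also be opened literally on demand with M-x find-file-literally.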

That's already a good way to improve performance in the buffer
(with its own set of trade-offs), but adding a so-long option
to that menu could be something to consider.

That aside, a custom `so-long-predicate' could use the buffer
length as a trigger, instead of scanning for newlines.
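
A minimal sketch of such a predicate follows. The names `my-so-long-size-limit' and `my-so-long-big-buffer-p' are hypothetical, and the 5 MB limit is an arbitrary assumption; only `so-long-predicate' and `so-long-detected-long-line-p' come from so-long itself.

```elisp
(require 'so-long)

(defvar my-so-long-size-limit (* 5 1024 1024)
  "Hypothetical buffer size, in characters, above which so-long is triggered.")

(defun my-so-long-big-buffer-p ()
  "Return non-nil if the buffer is big, or a long line is detected.
Hypothetical predicate for illustration only."
  (or (> (buffer-size) my-so-long-size-limit)
      ;; Otherwise fall back to the usual long-line scan.
      (so-long-detected-long-line-p)))

(setq so-long-predicate #'my-so-long-big-buffer-p)
```

The size check runs first, so a large buffer is never scanned for newlines at all.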


> I recall for this .har, it starts out well behaved, but then
> the whopping lines are closer to the bottom of the file.

That might remain problematic, then.

You could always configure Emacs to open all .har files in
so-long-mode, if this is the only way you encounter them?
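
A sketch of that configuration, assuming the files always carry the .har extension:

```elisp
;; Sketch only: visit every *.har file in so-long-mode directly.
(add-to-list 'auto-mode-alist '("\\.har\\'" . so-long-mode))
```

This bypasses long-line detection entirely for those files, at the cost of never getting js-mode for a .har file that happens to be well behaved.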


> Even having emacs scan a file for big lines sounds like it
> might be risky.

There's no risk in checking where the newlines are, other than
time expended in doing it.  (It should be pretty quick, provided
the number of lines being scanned isn't absolutely massive.)

On that note though, I have been wondering whether the newline
cache (not part of so-long) could keep tabs on the longest line
seen thus far when it's being built, and store that value in
a buffer-local variable which so-long could see?

I don't know much about that cache; but if it's all built when
the file is initially inserted into the buffer, then this could
be a nice alternative to having so-long scan for long lines itself,
and may let us 'see' the whole file, rather than just checking
the first N lines.


> PS> for `so-long-detected-long-line-p' for details.
> Not yet in 27.1

It's definitely in 27.1, if so-long.el is loaded.  It's a function.
"C-h o so-long-detected-long-line-p" should find it.


-Phil





Phil Sainty
In reply to this post by Lars Ingebrigtsen
On 12/01/21 7:39 am, Lars Ingebrigtsen wrote:
> Have you considered adding a C-level primitive that just looks at the
> entire buffer?  It should be reasonably simple and very fast -- just
> count areas between "\n"s, skipping the buffer gap.  We don't have to
> care about characters as such, I think, so this should be massively
> faster than counting line lengths in Lisp.

Agreed.  See my reply to Dan a few minutes ago for a similar thought.

I don't know what the correct approach to this would be, but there's
definitely scope for some kind of improvement along these lines (pun
not intended :)


-Phil




積丹尼 Dan Jacobson
In reply to this post by 積丹尼 Dan Jacobson
PS> Emacs currently notices when you ask it to visit a very large
PS> file, and offers to let you open the file 'literally' which I
PS> would suggest trying if you are unsure of the contents.

PS> See (info "(emacs)Visiting")

PS> That's already a good way to improve performance in the buffer
PS> (with its own set of trade-offs), but adding a so-long option
PS> to that menu could be something to consider.

Yup, definitely add so-long mode to those choices. As even with
"literally" the problems come when the user hits ^S and searches within
possible long lines.

(Sure hope there is a "?" or "C-h" choice there too, to describe what
each choice does.)

PS> You could always configure Emacs to open all .har files in
PS> so-long-mode, if this is the only way you encounter them?

Any archive format that one day has some nasty file at the bottom of it
will have the same problem.

PS> It's definitely in 27.1, if so-long.el is loaded.  It's a function.
PS> "C-h o so-long-detected-long-line-p" should find it.

I see. I was still using C-h v as today is the first time I heard of C-h o.




Phil Sainty
On 2021-01-12 12:04, 積丹尼 Dan Jacobson wrote:
> Yup, definitely add so-long mode to those choices. As even with
> "literally" the problems come when the user hits ^S and searches
> within possible long lines.

I think that so-long may not perform any better, but it would
still be *different* (and hence potentially useful), so I think
it's probably a good idea.


> PS> You could always configure Emacs to open all .har files in
> PS> so-long-mode, if this is the only way you encounter them?
>
> Any archive format that one day has some nasty file at the bottom
> of it will have the same problem.

Oh, it's an archive format?  In that case Emacs should probably
learn to treat it similarly to tar files, etc?

If the "long line" is actually some distinct file within an archive,
and Emacs treated it as an archive, then opening that file from
within the archive would be more likely to trigger so-long when
necessary.

Looking at https://en.wikipedia.org/wiki/.har I can see why js-mode
has been associated, but it's surely not the ideal solution.


> I was still using C-h v as today is the first time I heard of C-h o.

Remember that C-h v is only for variables.  C-h f is for functions
(which would have worked here), and C-h o is a good choice when you
know what something is named, but don't know exactly what it is.


-Phil





Phil Sainty
In reply to this post by Phil Sainty
On 12/01/21 7:39 am, Lars Ingebrigtsen wrote:
> Have you considered adding a C-level primitive that just looks at the
> entire buffer?  It should be reasonably simple and very fast -- just
> count areas between "\n"s, skipping the buffer gap.  We don't have to
> care about characters as such, I think, so this should be massively
> faster than counting line lengths in Lisp.

FYI, this also triggered a memory from several years ago:
https://lists.gnu.org/archive/html/emacs-devel/2016-07/msg00761.html

Stefan's thought at the time was:
> I wonder if we could improve the detection part with some help from
> the C code.  I'm thinking of trying to keep track of "the last \n
> before point" and calling a hook whenever this is larger than a
> threshold.

I never followed up on that (I ended up shelving the idea of releasing
so-long.el for a couple of years or so, as my own use-cases at the
time turned out to be due to one specific badly-behaved library which
had already been fixed upstream, so for a while I thought it might not
be as useful as I'd originally thought).


On 2021-01-12 10:07, Lars Ingebrigtsen wrote:
>> I can give writing something like that a shot...  It could return,
>> say, the length of the longest line, and the number of lines?  The
>> median line length would be nice, but would be slower.
>
> Mean would be easy, and standard deviation would be possible (but
> require two passes, I guess?)

I'm not sure that so-long itself would find it useful to know any kind
of averages, but longest line is obviously useful, and the number of
lines sounds more or less like a freebie, and might be useful too.


-Phil





Eli Zaretskii
In reply to this post by Phil Sainty
> From: Lars Ingebrigtsen <[hidden email]>
> Date: Mon, 11 Jan 2021 22:07:20 +0100
> Cc: [hidden email],
>  積丹尼 Dan Jacobson <[hidden email]>
>
> Lars Ingebrigtsen <[hidden email]> writes:
>
> > I can give writing something like that a shot...  It could return, say,
> > the length of the longest line, and the number of lines?  The median
> > line length would be nice, but would be slower.
>
> Mean would be easy, and standard deviation would be possible (but
> require two passes, I guess?)

There are algorithms that don't require 2 passes.  Let me know if you
need me to describe something like that.
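
One well-known single-pass method is Welford's algorithm, sketched below. This is only an illustration of the general idea and not necessarily the algorithm meant above; `my-line-length-mean-and-sd' is a hypothetical helper that takes a list of line lengths.

```elisp
(defun my-line-length-mean-and-sd (lengths)
  "Return (MEAN . SD) for the numbers in LENGTHS, using a single pass.
SD is the population standard deviation.  Hypothetical helper for
illustration only (Welford's algorithm)."
  (let ((n 0) (mean 0.0) (m2 0.0))
    (dolist (x lengths)
      (setq n (1+ n))
      (let ((delta (- x mean)))
        ;; Update the running mean, then the running sum of squared
        ;; deviations, using the old and the new mean respectively.
        (setq mean (+ mean (/ delta n)))
        (setq m2 (+ m2 (* delta (- x mean))))))
    (cons mean (if (> n 0) (sqrt (/ m2 n)) 0.0))))
```

Only three running values are kept, so the variance costs a few arithmetic operations per line and a single square root at the end.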




Lars Ingebrigtsen
Eli Zaretskii <[hidden email]> writes:

>> Mean would be easy, and standard deviation would be possible (but
>> require two passes, I guess?)
>
> There are algorithms that don't require 2 passes.  Let me know if you
> need me to describe something like that.

It sounds like this wouldn't be useful for Phil, so I'll skip it.

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Eli Zaretskii
In reply to this post by Phil Sainty
> From: Lars Ingebrigtsen <[hidden email]>
> Date: Tue, 12 Jan 2021 13:41:58 +0100
> Cc: [hidden email],
>  積丹尼 Dan Jacobson <[hidden email]>
>
> > I'm not sure that so-long itself would find it useful to know any kind
> > of averages, but longest line is obviously useful, and the number of
> > lines sounds more or less like a freebie, and might be useful too.
>
> OK, I'll write up a simple function, and we can tweak it later, if it
> turns out that more info is useful.

You are going to use find_newline_no_quit or somesuch, I hope?




Lars Ingebrigtsen
Eli Zaretskii <[hidden email]> writes:

> You are going to use find_newline_no_quit or somesuch, I hope?

I'm not sure whether that would make sense -- this is something that
typically would be called right after visiting a file...  and
find_newline doesn't count line sizes.

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Eli Zaretskii
> From: Lars Ingebrigtsen <[hidden email]>
> Cc: [hidden email],  [hidden email],  [hidden email]
> Date: Tue, 12 Jan 2021 16:37:20 +0100
>
> Eli Zaretskii <[hidden email]> writes:
>
> > You are going to use find_newline_no_quit or somesuch, I hope?
>
> I'm not sure whether that would make sense -- this is something that
> typically would be called right after visiting a file...  and
> find_newline doesn't count line sizes.

It returns the position, though.  And it's lightning-fast, because it
uses memchr.

Maybe I don't understand what function you intended to write.  Can
you tell me more about that?

Thanks.




Lars Ingebrigtsen
Eli Zaretskii <[hidden email]> writes:

> It returns the position, though.  And it's lightning-fast, because it
> uses memchr.
>
> Maybe I don't understand what function you intended to write.  Can
> you tell me more about that?

I think it's faster just to code than explain, so I did that, and pushed
now.

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Lars Ingebrigtsen
In reply to this post by Phil Sainty
Phil Sainty <[hidden email]> writes:

> I'm not sure that so-long itself would find it useful to know any kind
> of averages, but longest line is obviously useful, and the number of
> lines sounds more or less like a freebie, and might be useful too.

The `buffer-line-statistics' function is now on the trunk, and it
returns count/longest/mean, since those three are free-ish.  (I could
also add standard deviation or variance -- it would basically just be a
sqrt and a division per line?)

Testing in an 18MB zip file (which is probably on the outer limits of
what people load into Emacs, size-wise, for editing), it takes 0.002
seconds to compute the data, so for real usage, it should be
unnoticeable.

The data for that file is

(81472 2632 220.35040259229885)

There are a few more newlines in that zip file than I'd expect?  I mean,
if it were totally random, the mean line length should be 255 or something?

Anyway, I digress.
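
A sketch of how a caller might use the result: `my-buffer-has-long-line-p' is a hypothetical helper, the 10000 default is an arbitrary assumption, and the code relies on the return value being a list of (count longest mean) as described above, with lengths that, as far as I understand, are measured in bytes rather than characters.

```elisp
(defun my-buffer-has-long-line-p (&optional limit)
  "Return non-nil if the longest line in the buffer exceeds LIMIT.
LIMIT defaults to 10000.  Returns nil when `buffer-line-statistics'
is not available.  Hypothetical helper for illustration only."
  (and (fboundp 'buffer-line-statistics)
       ;; The second element of the result is the longest line length.
       (> (nth 1 (buffer-line-statistics)) (or limit 10000))))
```

Because the whole buffer is examined, a long line near the end of a file is seen just as readily as one at the top.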

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Phil Sainty
On 2021-01-13 08:22, Lars Ingebrigtsen wrote:
> The `buffer-line-statistics' function is now on the trunk

That all sounds great, thanks Lars.

I'll try to do some testing with this sometime soon.


-Phil