bug#20891: emacs: Back off if .doc is not an Office document

era eriksson-4
Package: emacs
Severity: normal
Version: 24.4+1-4ubuntu5
X-Debbugs-Cc: [hidden email]

(I am forwarding the following bug from the Ubuntu Launchpad bug
tracking system.  The original report contains some upset language; the
boiled-down summary at the top is mine.)

https://bugs.launchpad.net/ubuntu/+source/emacs24/+bug/1466139

It is not uncommon for *.doc files to contain plain ASCII text. In this
case, the default behavior of Emacs is less than ideal, as described in
more detail in the problem report below. Perhaps the .doc file name
mapping should contain some additional heuristics, and fall back to
plain text if the file is not an Office document.

Original problem description follows.

-----

Today I downloaded the sources of secure delete. Inspected some files
with vi and some with Emacs 24. Did what I wanted to do, started to
listen to my favourite internet radio station, wanted to cite on
Facebook a citation from the secure delete docs.

I wanted to open the file "secure_delete.doc" (a pure ASCII text file)
in Emacs 24 and: "Whenever you see this buffer I'm going to make a
picture of it and you won't be able to edit anything." Haha, no this
really reminds me of the monkey face during the Ubuntu installation. But
don't make a monkey out of me because Emacs 24 is going to be replaced
with svi an extensible text base line editor yet to be written.

Emacs' open file is broken:
 - whenever it sees a file with the extension or suffix ".doc" it
 treats it like an Office document.
 - it takes an image of it
 - and shows you the image - which for a pure text file shows you the
 contents of the file as an image in that gone editor

They should use the /file/ utility to check for the file type - but
showing an immutable picture of pure text is like making a monkey out of
the user.

--
If this were a real .signature, it would suck less.  Well, maybe not.



bug#20891: emacs: Back off if .doc is not an Office document

Lars Ingebrigtsen
[hidden email] writes:

> It is not uncommon for *.doc files to contain plain ASCII text. In this
> case, the default behavior of Emacs is less than ideal, as described in
> more detail in the problem report below. Perhaps the .doc file name
> mapping should contain some additional heuristics, and fall back to
> plain text if the file is not an Office document.

(I'm going through old bug reports that have unfortunately not gotten
any responses.)

I think this makes sense.  A fix in Emacs would mean moving the .doc
recognition from `auto-mode-alist' to...  `magic-fallback-mode-alist', I
guess.

According to the interwebs, the magic sequence for Word .doc files is:

D0 CF 11 E0 A1 B1 1A E1

Does anybody have an opinion here?
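
To make the proposed check concrete, here is a minimal sketch in Python rather than Emacs Lisp; it is an illustration only, not what Emacs does. The names OLE2_MAGIC and has_ole2_magic are hypothetical. Note that these eight bytes identify the OLE2 compound file container that the legacy Office formats share, so a match suggests "some old Office file" rather than Word specifically.

```python
# The eight signature bytes quoted in the thread, written as hex.
OLE2_MAGIC = bytes.fromhex("D0 CF 11 E0 A1 B1 1A E1")


def has_ole2_magic(data: bytes) -> bool:
    """Return True if data begins with the eight signature bytes."""
    return data.startswith(OLE2_MAGIC)
```

A plain ASCII file named secure_delete.doc would fail this test, which is the case the bug report is about.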

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



bug#20891: emacs: Back off if .doc is not an Office document

Stefan Kangas
Lars Ingebrigtsen <[hidden email]> writes:

> [hidden email] writes:
>
>> It is not uncommon for *.doc files to contain plain ASCII text. In this
>> case, the default behavior of Emacs is less than ideal, as described in
>> more detail in the problem report below. Perhaps the .doc file name
>> mapping should contain some additional heuristics, and fall back to
>> plain text if the file is not an Office document.
>
> (I'm going through old bug reports that have unfortunately not gotten
> any responses.)
>
> I think this makes sense.  A fix in Emacs would mean moving the .doc
> recognition from `auto-mode-alist' to...  `magic-fallback-mode-alist', I
> guess.
>
> According to the interwebs, the magic sequence for Word .doc files is:
>
> D0 CF 11 E0 A1 B1 1A E1
>
> Does anybody have an opinion here?

I wasn't aware of the practice of naming plain text files *.doc; I can't
remember having encountered any file like that.  Perhaps this practice
is rare.

Would implementing this risk making opening *.doc files slower for most
users?  Perhaps that could make the trade-off not worth it.  Other
than that, I see no problem with the proposal.

Best regards,
Stefan Kangas



bug#20891: emacs: Back off if .doc is not an Office document

era eriksson-4
On Wed, Nov 6, 2019, at 03:53, Stefan Kangas wrote:

> Lars Ingebrigtsen <[hidden email]> writes:
> > [hidden email] writes:
> >> It is not uncommon for *.doc files to contain plain ASCII text. In this
> >> case, the default behavior of Emacs is less than ideal
> > I think this makes sense.  A fix in Emacs would mean moving the .doc
> > recognition from `auto-mode-alist' to...  `magic-fallback-mode-alist', I
> > guess.
> I wasn't aware of the practice of naming plain text files *.doc; I can't
> remember having encountered any file like that.  Perhaps this practice
> is rare.
> Would implementing this risk making opening *.doc files slower for most
> users?  Perhaps that could make the trade-off not worth it.  Other
> than that, I see no problem with the proposal.

I'd agree that this is probably increasingly rare, but it used to be a practice which wasn't entirely uncommon back when Microsoft was not yet a household brand name and Word wasn't taught in schools.

On the other hand, if the behavior described in the original bug report is still current, that's quirky and unexpected. Really, how many people *expect* Emacs to be able to open a Word document, and are any of them happy when they get a static image to look at in Emacs?

--
If this were a real .signature, it would suck less.  Well, maybe not.



bug#20891: emacs: Back off if .doc is not an Office document

Stefan Kangas
era <[hidden email]> writes:

> On the other hand, if the behavior described in the original bug report is still
> current, that's quirky and unexpected. Really, how many people *expect* Emacs to
> be able to open a Word document, and are any of them happy when they get a
> static image to look at in Emacs?

AFAIU, the problem is that we do not have a mode to edit Microsoft
Word documents.  It would obviously be fantastic if someone would be
willing to write such a package, but it's a potentially big task.

So, as long as we lack editing capabilities, showing an image of the
document in Emacs is actually pretty useful.  More useful than getting
garbled text, at any rate.

Best regards,
Stefan Kangas



bug#20891: emacs: Back off if .doc is not an Office document

Richard Stallman
[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > So, as long as we lack editing capabilities, showing an image of the
  > document in Emacs is actually pretty useful.

How would Emacs do that?

--
Dr Richard Stallman
Founder, Free Software Foundation (https://gnu.org, https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)





bug#20891: emacs: Back off if .doc is not an Office document

era eriksson-4
On Thu, Nov 7, 2019, at 06:45, Richard Stallman wrote:
>   > So, as long as we lack editing capabilities, showing an image of the
>   > document in Emacs is actually pretty useful.
> How would Emacs do that?

The Emacs-side entry point seems to be doc-view-mode-maybe, which is hooked in auto-mode-alist for a number of file name extensions.

As described in https://www.emacswiki.org/emacs/DocViewMode it relies on external utilities to provide the actual image.

I was unable to quickly repro in a fresh Debian or Ubuntu image, but that might be because I didn't have the external utility installed.

Tangentially, googling for doc-view-mode-maybe suggests that lots of people are annoyed by it and want to turn it off, probably often for related but distinct reasons.

--
If this were a real .signature, it would suck less.  Well, maybe not.



bug#20891: emacs: Back off if .doc is not an Office document

Lars Ingebrigtsen
In reply to this post by Stefan Kangas
Stefan Kangas <[hidden email]> writes:

> Would implementing this risk making opening *.doc files slower for most
> users?  Perhaps that could make the trade-off not worth it.  Other
> than that, I see no problem with the proposal.

I don't think it'd be any performance problem -- we'd just have to read
the first 8 bytes of the file to see whether the magic sequence is there.
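
The cost argument can be sketched as follows, again in Python as an illustration rather than as Emacs code. The helper name starts_with_magic is hypothetical. The point is that the read is bounded by the length of the signature, so the work does not grow with the size of the file.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def starts_with_magic(path: str) -> bool:
    """Check the signature by reading at most len(OLE2_MAGIC) bytes."""
    with open(path, "rb") as f:
        # read(n) returns fewer than n bytes for short files, never more.
        return f.read(len(OLE2_MAGIC)) == OLE2_MAGIC
```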

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



bug#20891: emacs: Back off if .doc is not an Office document

Eli Zaretskii
> From: Lars Ingebrigtsen <[hidden email]>
> Date: Fri, 08 Nov 2019 21:59:54 +0100
> Cc: [hidden email], [hidden email]
>
> Stefan Kangas <[hidden email]> writes:
>
> > Would implementing this risk making opening *.doc files slower for most
> > users?  Perhaps that could make the trade-off not worth it.  Other
> > than that, I see no problem with the proposal.
>
> I don't think it'd be any performance problem -- we'd just have to read
> the first 8 bytes of the file to see whether the magic sequence is there.

*.doc files are rare nowadays.  Do the *.docx files have the same
signature?  I doubt that, since they are actually *.zip files in
disguise.



bug#20891: emacs: Back off if .doc is not an Office document

Lars Ingebrigtsen
Eli Zaretskii <[hidden email]> writes:

>> I don't think it'd be any performance problem -- we'd just have to read
>> the first 8 bytes of the file to see whether the magic sequence is there.
>
> *.doc files are rare nowadays.  Do the *.docx files have the same
> signature?  I doubt that, since they are actually *.zip files in
> disguise.

Yeah, *.docx have a different signature, so this would be for *.doc files
only (and since the Windows *.doc files are becoming rarer, perhaps that
means that doing doc-view only on files that have the magic bytes is
more important than it used to be).
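
The difference between the two signatures can be sketched like this, in Python for illustration only. The function name sniff and its return labels are hypothetical. The ZIP value is the standard local file header signature; a match only says "ZIP archive", not that the archive is a Word document.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .doc container
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header, as used by .docx


def sniff(head: bytes) -> str:
    """Classify the leading bytes of a file by signature."""
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```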

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



bug#20891: emacs: Back off if .doc is not an Office document

Eli Zaretskii
> From: Lars Ingebrigtsen <[hidden email]>
> Cc: [hidden email],  [hidden email],  [hidden email]
> Date: Sat, 09 Nov 2019 21:14:52 +0100
>
> since the Windows *.doc files are becoming rarer, perhaps that
> means that doing doc-view only on files that have the magic bytes is
> more important than it used to be

Sorry, I don't follow that logic.  I'd expect that *.doc MS Word files
becoming rarer would mean plain-text *.doc files become relatively
more important, i.e. the opposite conclusion.  What did I miss?



bug#20891: emacs: Back off if .doc is not an Office document

Lars Ingebrigtsen
Eli Zaretskii <[hidden email]> writes:

>> From: Lars Ingebrigtsen <[hidden email]>
>> Cc: [hidden email],  [hidden email],  [hidden email]
>> Date: Sat, 09 Nov 2019 21:14:52 +0100
>>
>> since the Windows *.doc files are becoming rarer, perhaps that
>> means that doing doc-view only on files that have the magic bytes is
>> more important than it used to be
>
> Sorry, I don't follow that logic.  I'd expect that *.doc MS Word files
> becoming rarer would mean plain-text *.doc files become relatively
> more important, i.e. the opposite conclusion.  What did I miss?

That's what I'm saying.  :-) Or at least I tried to.  It's more
important to add magic byte recognition to doc-mode for .doc files now
than before.

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



bug#20891: emacs: Back off if .doc is not an Office document

Eli Zaretskii
> From: Lars Ingebrigtsen <[hidden email]>
> Cc: [hidden email],  [hidden email],  [hidden email]
> Date: Thu, 14 Nov 2019 10:55:35 +0100
>
> > Sorry, I don't follow that logic.  I'd expect that *.doc MS Word files
> > becoming rarer would mean plain-text *.doc files become relatively
> > more important, i.e. the opposite conclusion.  What did I miss?
>
> That's what I'm saying.  :-) Or at least I tried to.  It's more
> important to add magic byte recognition to doc-mode for .doc files now
> than before.

How would the magic signature recognition help with plain-text files?
They don't have any such signatures?  I'm still missing something,
sorry.



bug#20891: emacs: Back off if .doc is not an Office document

Robert Pluim
>>>>> On Thu, 14 Nov 2019 16:12:22 +0200, Eli Zaretskii <[hidden email]> said:

    Eli> How would the magic signature recognition help with plain-text files?
    Eli> They don't have any such signatures?  I'm still missing something,
    Eli> sorry.

Today we go: ".doc extension -> show an image of the contents of the
file" which is manifestly the wrong thing to do for a non-doc file. If
we do the signature recognition, those files which are not recognized
end up in (probably) fundamental-mode
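
The change in dispatch can be sketched as below. This is a Python model of the decision only, not Emacs code; pick_mode is a hypothetical name, the returned strings are just labels, and the sketch covers the .doc case alone.

```python
import os

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def pick_mode(filename: str, head: bytes) -> str:
    """Choose a mode label from the extension and the file's leading bytes."""
    ext = os.path.splitext(filename)[1].lower()
    # Proposed behavior: the extension alone is no longer enough;
    # the signature must also be present.
    if ext == ".doc" and head.startswith(OLE2_MAGIC):
        return "doc-view-mode"
    # Everything else, including plain-text .doc files, falls through.
    return "fundamental-mode"
```

With the current extension-only rule, the first condition would be just the extension test, which is why a plain-text secure_delete.doc ends up rendered as an image.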

Robert



bug#20891: emacs: Back off if .doc is not an Office document

Eli Zaretskii
> From: Robert Pluim <[hidden email]>
> Cc: Lars Ingebrigtsen <[hidden email]>,  [hidden email],
>   [hidden email],  [hidden email]
> Date: Thu, 14 Nov 2019 16:06:36 +0100
>
> Today we go: ".doc extension -> show an image of the contents of the
> file"

Where do we have the code or data which does that?

> which is manifestly the wrong thing to do for a non-doc file. If
> we do the signature recognition, those files which are not recognized
> end up in (probably) fundamental-mode

That's OK, but I'm still missing the code which makes this happen.
E.g., I just did "C-x C-f foo.doc RET" and got a buffer in Fundamental
mode.



bug#20891: emacs: Back off if .doc is not an Office document

Andreas Schwab
On Nov 14 2019, Eli Zaretskii wrote:

> Where do we have the code or data which does that?

See auto-mode-alist, and doc-view-mode-maybe.

> That's OK, but I'm still missing the code which makes this happen.
> E.g., I just did "C-x C-f foo.doc RET" and got a buffer in Fundamental
> mode.

It only works if you have a doc-view-odf->pdf-converter-program.

Andreas.

--
Andreas Schwab, SUSE Labs, [hidden email]
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."



bug#20891: emacs: Back off if .doc is not an Office document

Eli Zaretskii
> From: Andreas Schwab <[hidden email]>
> Cc: Robert Pluim <[hidden email]>,  [hidden email],  [hidden email],  [hidden email],  [hidden email]
> Date: Thu, 14 Nov 2019 17:33:28 +0100
>
> On Nov 14 2019, Eli Zaretskii wrote:
>
> > Where do we have the code or data which does that?
>
> See auto-mode-alist, and doc-view-mode-maybe.
>
> > That's OK, but I'm still missing the code which makes this happen.
> > E.g., I just did "C-x C-f foo.doc RET" and got a buffer in Fundamental
> > mode.
>
> It only works if you have a doc-view-odf->pdf-converter-program.

Thanks, I was blind.

So we want to remove docx? from auto-mode-alist and instead add the
magic signature to magic-mode-alist?  But then AFAIK MS Word documents
had different signatures for different versions, so we should have
several.  And a literal docx should be left in auto-mode-alist, right?



bug#20891: emacs: Back off if .doc is not an Office document

Lars Ingebrigtsen
In reply to this post by Eli Zaretskii
Eli Zaretskii <[hidden email]> writes:

>> That's what I'm saying.  :-) Or at least I tried to.  It's more
>> important to add magic byte recognition to doc-mode for .doc files now
>> than before.
>
> How would the magic signature recognition help with plain-text files?
> They don't have any such signatures?  I'm still missing something,
> sorry.

The magic signature recognition is for the MS .doc files, not the text
files.

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



bug#20891: emacs: Back off if .doc is not an Office document

Lars Ingebrigtsen
In reply to this post by Eli Zaretskii
Eli Zaretskii <[hidden email]> writes:

> But then AFAIK MS Word documents had different signatures for
> different versions, so we should have several.

All .doc files allegedly start with the same eight bytes.

--
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no



bug#20891: emacs: Back off if .doc is not an Office document

Eli Zaretskii
> From: Lars Ingebrigtsen <[hidden email]>
> Cc: Andreas Schwab <[hidden email]>,  [hidden email],  [hidden email],
>   [hidden email],  [hidden email]
> Date: Fri, 15 Nov 2019 08:51:40 +0100
>
> Eli Zaretskii <[hidden email]> writes:
>
> > But then AFAIK MS Word documents had different signatures for
> > different versions, so we should have several.
>
> All .doc files allegedly start with the same eight bytes.

Maybe my reading of the 'magic' file is wrong, but it seems to say
otherwise.


