Detecting the coding system of a file programmatically

Andrea Cardaci
Hi,

I'm in a situation where I need to batch-process (read-only) a number
of files with Emacs. My current approach is the following:

(with-temp-buffer
  (insert-file-contents-literally path)
  (decode-coding-region (point-min) (point-max) 'utf-8)
  (... do stuff with the buffer ...))

I use `insert-file-contents-literally' because the non-literal
counterpart is too slow (about twice as slow, apparently), as it does a
bunch of stuff in addition to simply populating the buffer.
Unfortunately, one of these things is decoding the buffer.

Now instead of hardcoding 'utf-8 I'd like to detect the correct
encoding where possible, so I tried experimenting with
`find-operation-coding-system'. I created a latin-1 file (which gets
recognised properly when I visit it) and tried the following:

(with-temp-buffer
  (setq path "~/tmp/latin-1")
  (insert-file-contents-literally path)
  (find-operation-coding-system
   'insert-file-contents
   (cons path (current-buffer))))

But all I get is (undecided). Now my question is twofold: is this the
best approach for what I'm trying to achieve? And in any case, why
does the latter example not work as expected? (And hence, how can I
detect the coding system programmatically?)


Best,

Andrea

Re: Detecting the coding system of a file programmatically

Eli Zaretskii
> From: Andrea Cardaci <[hidden email]>
> Date: Fri, 10 Aug 2018 03:02:55 +0200
>
> (with-temp-buffer
>   (insert-file-contents-literally path)
>   (decode-coding-region (point-min) (point-max) 'utf-8)
>   (... do stuff with the buffer ...))
>
> I use `insert-file-contents-literally' because the non-literal
> counterpart is too slow (about twice as slow, apparently), as it does a
> bunch of stuff in addition to simply populating the buffer.
> Unfortunately, one of these things is decoding the buffer.
>
> Now instead of hardcoding 'utf-8 I'd like to detect the correct
> encoding where possible, so I tried experimenting with
> `find-operation-coding-system'.

That's the wrong function to use in this case; you want
decode-coding-inserted-region instead.  Alternatively, you could use
detect-coding-region and then decode-coding-region with the value it
returns.  I suggest a good read of the "Explicit Encoding" and "Lisp
and Coding Systems" nodes of the ELisp manual.

> I created a latin-1 file (which gets
> recognised properly when I visit it) and tried the following:
>
> (with-temp-buffer
>   (setq path "~/tmp/latin-1")
>   (insert-file-contents-literally path)
>   (find-operation-coding-system
>    'insert-file-contents
>    (cons path (current-buffer))))
>
> But all I get is (undecided).

That's expected: find-operation-coding-system returns the _default_ to
use for the named operation.  It doesn't consider the contents of the
buffer.

> Now my question is twofold: is this the best approach for what I'm
> trying to achieve? And in any case, why does the latter example not
> work as expected? (And hence, how can I detect the coding system
> programmatically?)

I hope I answered all of those questions; if not, please ask more.

In any case, it is definitely OK to call decode-coding-region with the
value 'undecided' returned by find-operation-coding-system, because
'undecided' is a special value which signals to decode-coding-region
that detection of the actual encoding is necessary.  Thus, I expect
this to work for you:

  (with-temp-buffer
    (insert-file-contents-literally path)
    (decode-coding-region (point-min) (point-max)
                          (find-operation-coding-system
                            'insert-file-contents
                            (cons path (current-buffer)))))

But I still recommend using decode-coding-inserted-region, because it
will do all of the above (and slightly more) for you internally.
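
The two alternatives named above could be sketched roughly as follows. This is an illustration only, not code from the thread: the helper names `my/read-file-decoded' and `my/read-file-detect' are hypothetical, and the second helper assumes that the single highest-priority guess from `detect-coding-region' is good enough for the files at hand.

```elisp
;; Hypothetical helpers for illustration; the names are not from the thread.

(defun my/read-file-decoded (path)
  "Return the text of PATH, decoded as if it had been read from that file."
  (with-temp-buffer
    (insert-file-contents-literally path)
    ;; Chooses a coding system from `coding-system-for-read', from the
    ;; file name and contents (via `set-auto-coding-function'), or from
    ;; the default for `insert-file-contents', then decodes the region.
    (decode-coding-inserted-region (point-min) (point-max)
                                   (expand-file-name path))
    (buffer-string)))

(defun my/read-file-detect (path)
  "Return the text of PATH, decoded with the coding system Emacs guesses."
  (with-temp-buffer
    (insert-file-contents-literally path)
    ;; A non-nil third argument asks for the single most likely coding
    ;; system instead of the full list of candidates.
    (let ((coding (detect-coding-region (point-min) (point-max) t)))
      (decode-coding-region (point-min) (point-max) coding))
    (buffer-string)))
```

Both helpers return a string so that they are easy to try out; in a batch job the processing would more likely happen inside the temporary buffer itself.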

Re: Detecting the coding system of a file programmatically

Andrea Cardaci
Hi Eli,

Thanks for the thorough reply.

> That's the wrong function to use in this case; you want
> decode-coding-inserted-region instead.

Yes, that works!

> Thus, I expect this to work for you:
>
>   (with-temp-buffer
>     (insert-file-contents-literally path)
>     (decode-coding-region (point-min) (point-max)
>                           (find-operation-coding-system
>                             'insert-file-contents
>                             (cons path (current-buffer)))))

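Yes, except that `decode-coding-region' accepts a single symbol, while
`find-operation-coding-system' returns a cons cell, so its car is
needed. I also tried directly with: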

(decode-coding-region (point-min) (point-max) 'undecided)

which in my use case resulted in snappier performance.
Basically, this latter `decode-coding-region' call doesn't noticeably
slow down `insert-file-contents-literally', whereas using
`decode-coding-inserted-region' is more or less as slow as using
`insert-file-contents' alone. I guess I'll go with the former.


Andrea
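
For reference, the adjustment described in this message might look like the sketch below. It relies on `find-operation-coding-system' returning a cons cell whose car is the coding system for decoding; the wrapper name `my/decode-with-default' and the fallback to `undecided' are assumptions made for the example.

```elisp
;; Hypothetical wrapper for illustration; the name is not from the thread.
(defun my/decode-with-default (path)
  "Return the text of PATH decoded with the default for reading files."
  (with-temp-buffer
    (insert-file-contents-literally path)
    (let ((coding (car (find-operation-coding-system
                        'insert-file-contents
                        (cons path (current-buffer))))))
      ;; The lookup returns nil when no rule matches PATH; in that case
      ;; fall back to `undecided' so the contents drive the detection.
      (decode-coding-region (point-min) (point-max)
                            (or coding 'undecided)))
    (buffer-string)))
```

When the lookup yields `undecided', this behaves like passing 'undecided directly, which is the shorter form chosen above.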

Re: Detecting the coding system of a file programmatically

Eli Zaretskii
> From: Andrea Cardaci <[hidden email]>
> Date: Fri, 10 Aug 2018 15:37:08 +0200
> Cc: Emacs developers <[hidden email]>
>
> (decode-coding-region (point-min) (point-max) 'undecided)
>
> which in my use case resulted in snappier performance.
> Basically, this latter `decode-coding-region' call doesn't noticeably
> slow down `insert-file-contents-literally', whereas using
> `decode-coding-inserted-region' is more or less as slow as using
> `insert-file-contents' alone. I guess I'll go with the former.

Suit yourself, but you need to be aware that while speeding up the
code, you lose some features, which may or may not be important, such
as setting the default coding-system based on the file's name.  If
this code ever needs to handle a file whose contents fool the Emacs
guesswork (which is based on a small part of the buffer contents),
your shortcut might misfire.  E.g., UTF-8 encoded files sometimes dupe
Emacs into thinking they are encoded in some Windows codepage, if that
codepage is the default encoding under the user's locale, so
processing XML or LaTeX files might use a wrong encoding.
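
One possible way to reduce the risk described above, when most of the files are known to be UTF-8, is to put utf-8 at the front of the detection priority for the duration of the decode. This is a sketch under that assumption rather than a recommendation made in the thread; `my/read-file-prefer-utf-8' is a hypothetical name, and `with-coding-priority' is the macro provided by mule-util.

```elisp
(require 'mule-util)  ; provides `with-coding-priority'

;; Hypothetical helper for illustration; the name is not from the thread.
(defun my/read-file-prefer-utf-8 (path)
  "Return the text of PATH, trying UTF-8 first when guessing its encoding."
  (with-temp-buffer
    (insert-file-contents-literally path)
    ;; Assumption: the files are mostly UTF-8, so UTF-8 is tried before
    ;; the locale's default encoding while this form runs.
    (with-coding-priority '(utf-8)
      (decode-coding-region (point-min) (point-max) 'undecided))
    (buffer-string)))
```

This does not bring back the file-name-based defaults mentioned above; it only changes which guess wins when the contents are ambiguous.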

Re: Detecting the coding system of a file programmatically

Andrea Cardaci
> Suit yourself, but you need to be aware that while speeding up the
> code, you lose some features, which may or may not be important, such
> as setting the default coding-system based on the file's name.

I'll take that into consideration, thanks.

Re: Detecting the coding system of a file programmatically

Stefan Monnier
In reply to this post by Andrea Cardaci
> I use `insert-file-contents-literally' because the non-literal
> counterpart is too slow (about twice as slow, apparently), as it does a

I'd be interested to hear if your final code is significantly faster
than `insert-file-contents'.


        Stefan


Re: Detecting the coding system of a file programmatically

Juri Linkov-2
In reply to this post by Andrea Cardaci
> I use `insert-file-contents-literally' because the non-literal
> counterpart is too slow (about twice as slow, apparently), as it does a
> bunch of stuff in addition to simply populating the buffer.
> Unfortunately, one of these things is decoding the buffer.

For better performance, I restrict the size of the inserted text by
giving `insert-file-contents' small values for its BEG and END
arguments.  For example, to automatically detect the encodings of
files for diff, I use the following customization:


Attachment: dired-diff.el (984 bytes)
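
The attachment is not reproduced here, so the following is only a guess at the general idea and not the contents of dired-diff.el: read just the beginning of the file through `insert-file-contents' and look at which coding system Emacs used. The helper name `my/guess-file-coding' and the 4096-byte default are assumptions.

```elisp
;; Hypothetical helper for illustration; not taken from the attachment.
(defun my/guess-file-coding (path &optional nbytes)
  "Return the coding system Emacs picks for the first NBYTES bytes of PATH.
NBYTES defaults to 4096."
  (with-temp-buffer
    ;; BEG and END are byte offsets into the file; only that part of
    ;; the file is read and decoded.
    (insert-file-contents path nil 0 (or nbytes 4096))
    ;; `insert-file-contents' records the coding system it used here.
    last-coding-system-used))
```

A guess made this way only sees the start of the file, so non-ASCII text that first appears later is missed, and the cut-off may fall in the middle of a multibyte sequence.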
Re: Detecting the coding system of a file programmatically

Andrea Cardaci
Hi Juri,

But this way the extra operations executed by
`insert-file-contents' are performed anyway, albeit on a smaller
portion of the file. I'm not sure whether the slow part (decoding
excluded) is proportional to the size of the input file.

I'll keep that in mind as an alternative solution, thanks.

I ended up using:

(with-temp-buffer
  (insert-file-contents-literally path)
  (decode-coding-region (point-min) (point-max) 'undecided)
  (... do stuff with the buffer ...))

But take a look at the gotchas mentioned by Eli.

On Thu, 16 Aug 2018 at 23:40, Juri Linkov <[hidden email]> wrote:

>
> > I use `insert-file-contents-literally' because the non-literal
> > counterpart is too slow (about twice as slow, apparently), as it does a
> > bunch of stuff in addition to simply populating the buffer.
> > Unfortunately, one of these things is decoding the buffer.
>
> For better performance, I restrict the size of the inserted text by
> giving `insert-file-contents' small values for its BEG and END
> arguments.  For example, to automatically detect the encodings of
> files for diff, I use the following customization:
>