[ELPA] New package: find-dups

12 messages

[ELPA] New package: find-dups

Michael Heerdegen
Hello,

would somebody want me to add something like the following to GNU ELPA?
Then I would push it after doing some fine-tuning (like adding
autoloads, finding KEYWORDS for the file header, etc. - if nobody wants
it, I can spare myself finding keywords).

TIA,

Michael.

Attachment: find-dups.el (6K)

Re: [ELPA] New package: find-dups

Robert Weiner-2
This seems incredibly complicated.  It would help if you would state the general problem you are trying to solve and the performance characteristics you need.
It certainly is not a generic duplicate removal library.  Why can't you flatten your list and then just apply a sequence of predicate matches as needed or use hashing as mentioned in the commentary?

Bob

Re: [ELPA] New package: find-dups

Michael Heerdegen
Robert Weiner <[hidden email]> writes:

> This seems incredibly complicated.  It would help if you would state
> the general problem you are trying to solve and the performance
> characteristics you need.  It certainly is not a generic duplicate
> removal library.  Why can't you flatten your list and then just apply
> a sequence of predicate matches as needed or use hashing as mentioned
> in the commentary?

I guess the name is misleading, I'll try to find a better one.

Look at the example of finding files with equal contents in your file
system: you have a list or stream of, say, 10000 files in a file
hierarchy.  If you calculate hashes of all of those 10000 files, it will
take hours.

It's wiser to do it in steps: first, look at the sizes of all the
files.  That's a very fast test, and files with equal contents have the
same size.  You can discard all files with unique sizes.

In a second step, we have fewer files.  We could look at the first N
bytes of the files.  That's still quite fast.  Left are groups of files
with equal sizes and equal heads.  For those it's worth calculating a
hash sum to see which also have equal contents.

The idea of the library is to abstract over the type of elements and the
number and kinds of tests.  So you can write the above as

#+begin_src emacs-lisp
(find-dups my-sequence-of-file-names
           (list (list (lambda (file)
                         (file-attribute-size (file-attributes file)))
                       #'eq)
                 (list (lambda (file)
                         (shell-command-to-string
                          (format "head %s"
                                  (shell-quote-argument file))))
                       #'equal)
                 (list (lambda (file)
                         (shell-command-to-string
                          (format "md5sum %s | awk '{print $1;}'"
                                  (shell-quote-argument file))))
                       #'equal)))
#+end_src

and `find-dups' executes the algorithm with the steps as specified.  You
just specify a number of tests and don't need to write out the code
yourself.

Do you need a mathematical formulation of the abstract problem that the
algorithm solves, and how it works?  I had hoped the example in the
header was a good explanation...


Regards,

Michael.


Re: [ELPA] New package: find-dups

Eli Zaretskii
> From: Michael Heerdegen <[hidden email]>
> Date: Wed, 11 Oct 2017 19:56:26 +0200
> Cc: [hidden email], Emacs Development <[hidden email]>
>
> #+begin_src emacs-lisp
> (find-dups my-sequence-of-file-names
>            (list (list (lambda (file)
>                          (file-attribute-size (file-attributes file)))
>                        #'eq)
>                  (list (lambda (file)
>                          (shell-command-to-string
>                           (format "head %s"
>                                   (shell-quote-argument file))))
>                        #'equal)
>                  (list (lambda (file)
>                          (shell-command-to-string
>                           (format "md5sum %s | awk '{print $1;}'"
>                                   (shell-quote-argument file))))
>                        #'equal)))
> #+end_src

Apologies for barging into the middle of a discussion, but starting
processes and making strings out of their output to process just a
portion of a file is sub-optimal, because process creation is not
cheap.  It is easier to simply read a predefined number of bytes into
a buffer; insert-file-contents-literally supports that.  Likewise with
md5sum: we have the md5 primitive for that.

In general, working with buffers is much more efficient in Emacs than
working with strings, so avoid strings, let alone large strings, as
much as you can.

One other comment is that shell-command-to-string decodes the output
from the shell command, which is not something you want here, because
AFAIU you are looking for files whose contents are identical on the
byte-stream level, i.e. two files which have the same characters, but
are encoded differently on disk (like one UTF-8, the other Latin-1),
should be considered different in this context, whereas
shell-command-to-string will/might produce identical strings for them.
(Decoding is also expensive run-time wise.)
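
A minimal sketch of the buffer-based approach Eli describes; the helper
names `my-file-head' and `my-file-md5' and the 1024-byte head size are
illustrative choices, while `insert-file-contents-literally' and the
`md5' primitive are stock Emacs:

#+begin_src emacs-lisp
;; Read just the first N bytes of FILE into a temp buffer, without
;; decoding, and return them as a unibyte string.  The BEG/END
;; arguments of `insert-file-contents-literally' avoid reading the
;; whole file, and no external process is started.
(defun my-file-head (file &optional n)
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally file nil 0 (or n 1024))
    (buffer-string)))

;; Hash FILE's raw bytes with the built-in `md5' primitive instead of
;; spawning an external md5sum process per file.
(defun my-file-md5 (file)
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally file)
    (md5 (current-buffer))))
#+end_src

Because the bytes are never decoded, two files that differ only in
on-disk encoding compare as different, which matches the byte-level
notion of equality discussed above.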


Re: [ELPA] New package: find-dups

Michael Heerdegen
Hi Eli,

Thanks for your comment.


> Apologies for barging into the middle of a discussion, but starting
> processes and making strings out of their output to process just a
> portion of a file is sub-optimal, because process creation is not
> cheap.  It is easier to simply read a predefined number of bytes into
> a buffer; insert-file-contents-literally supports that.  Likewise with
> md5sum: we have the md5 primitive for that.

That's one of the details I delayed until I know whether we want this
for inclusion.  My first version used `insert-file-contents-literally',
but I ran into problems, and calling `debug' crashed Emacs, so I gave up
for the moment.  Personally, I don't care that much because it doesn't
make a big speed difference, but yes, I know it's not absolutely
correct, and I would take care of this.


Regards,

Michael.


Re: [ELPA] New package: find-dups

Thien-Thi Nguyen-3
In reply to this post by Michael Heerdegen

() Michael Heerdegen <[hidden email]>
() Wed, 11 Oct 2017 19:56:26 +0200

   Robert Weiner <[hidden email]> writes:

   > This seems incredibly complicated.  It would help if you
   > would state the general problem you are trying to solve and
   > the performance characteristics you need.  It certainly is
   > not a generic duplicate removal library.  Why can't you
   > flatten your list and then just apply a sequence of
   > predicate matches as needed or use hashing as mentioned in
   > the commentary?

   I guess the name is misleading, I'll try to find a better one.

How about "multi-pass-dups", then use/document "pass" everywhere
in the code/discussion?  (Currently, you use "stage" in the code,
and "step" in this thread.)

   Look at the example of finding files with equal contents in
   your file system: [...]

Can you think of another use-case?  That exercise will help
highlight the general (factorable) concepts to document well.
Conversely, if you cannot, maybe that's a hint that the
abstraction level is too high; some opportunity exists for
specialization (and thus optimization).

   In a second step, we have fewer files.

This is the key motivation for multi-pass...

   Do you need a mathematical formulation of the abstract
   problem that the algorithm solves, and how it works?

...so briefly explaining how iteration mitigates the suffering
due to the (irreducible) N^2 might be good to do early on (in
Commentary), before giving examples.  Leading w/ a small bit of
theory caters to those readers already disposed to that style.

To help the rest of the readers, a common technique is to label
components of the theory (e.g., [1], [2], ...) and refer to
these later, in the concrete examples.  Those readers might
gloss over the theory at first (being indisposed) but the back
references invite them to make connections at their own pace.

In short, "it is wise" to show how "it is wise" and avoid saying
"it is wise" (according to this practiced "wise"ass :-D).

   (find-dups my-sequence-of-file-names
              (list (list (lambda (file) ...)
                          #'eq)
                    (list (lambda (file) ...)
                          #'equal)
                    (list (lambda (file) ...)
                          #'equal)))

IIUC the 2nd level ‘list’ is to associate each characterization
func w/ a comparison func.  I wonder if there is another way.
Too bad Emacs Lisp has no "object properties" like Guile, eh?

OTOH, the 1st level ‘list’ seems like a gratuitous hoop (read:
source of latent PEBKAC/complaints/redesign).  Why not move that
down-chain, so caller need not worry?  Something like:

#+begin_src emacs-lisp
(multi-pass-dups
 MY-SEQUENCE-OF-FILE-NAMES
 (list (lambda (file) ...)
       #'eq)
 (list (lambda (file) ...)
       #'equal)
 (list (lambda (file) ...)
       #'equal))
#+end_src

(I also use the all-caps-for-metavariables convention, here.)

--
Thien-Thi Nguyen -----------------------------------------------
 (defun responsep (query)
   (pcase (context query)
     (`(technical ,ml) (correctp ml))
     ...))                              748E A0E8 1CB8 A748 9BFA
--------------------------------------- 6CE4 6703 2224 4C80 7502



Re: [ELPA] New package: find-dups

Robert Weiner-2
In reply to this post by Michael Heerdegen
On Wed, Oct 11, 2017 at 1:56 PM, Michael Heerdegen <[hidden email]> wrote:

> Robert Weiner <[hidden email]> writes:
>
>> This seems incredibly complicated.  It would help if you would state
>> the general problem you are trying to solve and the performance
>> characteristics you need.  It certainly is not a generic duplicate
>> removal library.  Why can't you flatten your list and then just apply
>> a sequence of predicate matches as needed or use hashing as mentioned
>> in the commentary?
>
> I guess the name is misleading, I'll try to find a better one.

Sounds good.  How about filter-set?  You are filtering a bunch of items
to produce a set.  I'm not sure if this is limited to files or more
generic.

> Look at the example of finding files with equal contents in your file
> system: you have a list or stream of, say, 10000 files in a file
> hierarchy.  If you calculate hashes of all of those 10000 files, it
> will take hours.

Ok, so you want to filter down a set of hierarchically arranged files.

> It's wiser to do it in steps: first, look at the file's sizes of all
> files.  That's a very fast test, and files with equal contents have
> the same size.  You can discard all files with unique sizes.

Yes, but that is just filtering (get the size of the files and filter
to sets of files with the unique sizes).  Then you chain more filters
to filter further:

   (filter-duplicates list-of-filters-to-apply list-of-files-to-filter)

which would produce a chain of filters like:

   (filterN ... (filter2 (filter1 list-of-files-to-filter)))

> In a second step, we have fewer files.  We could look at the first N
> bytes of the files.  That's still quite fast.

So you apply your fastest and most effective filters first.

> Left are groups of files with equal sizes and equal heads.  For those
> it's worth calculating a hash sum to see which also have equal
> contents.

Ok.

> The idea of the library is to abstract over the type of elements and
> the number and kinds of tests.

But as the prior message author noted, you don't need lists of lists to
do that.  We want you to simplify things so they are most generally
useful and easier to understand.

> and `find-dups' executes the algorithm with the steps as specified.
> You need just to specify a number of tests but don't need to write
> out the code yourself.

I don't quite see what code is not being written, except the sequencing
of the filter applications, which is your code.

> Do you need a mathematical formulation of the abstract problem that
> the algorithm solves, and how it works?  I had hoped the example in
> the header is a good explanation...

The example is a good one to use but, as was noted, is only one use
case.  Keep at it and you'll see it will become something much nicer.

Bob
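
The filter chain Bob describes could be sketched generically with
`seq-reduce'; the function name `my-chain-filters' is illustrative and
not part of any package:

#+begin_src emacs-lisp
(require 'seq)

;; Apply each filter in FILTERS to ITEMS in turn, threading the
;; shrinking result through the chain, i.e. producing
;; (filterN ... (filter2 (filter1 items))).
(defun my-chain-filters (filters items)
  (seq-reduce (lambda (remaining filter)
                (funcall filter remaining))
              filters
              items))

;; E.g., keep even numbers, then keep those greater than 2:
;; (my-chain-filters
;;  (list (lambda (xs) (seq-filter (lambda (x) (= 0 (% x 2))) xs))
;;        (lambda (xs) (seq-filter (lambda (x) (> x 2)) xs)))
;;  '(1 2 3 4 5 6))
#+end_src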

Re: [ELPA] New package: find-dups

Andreas Politz, INF|INF-I-2
In reply to this post by Michael Heerdegen

What you seem to be doing is something like this (minus equal vs eq).

#+BEGIN_SRC emacs-lisp
  (defun partition-by (s &rest fns)
    "Partition sequence S by the function F.

  Partition S using an equivalence relation R constructed by F, such
  that for all x,y ∈ S

       (x, y) ∈ R ⇔ f(x) = f(y) .

  If GS is non-nil, it should be a list of functions acting as necessary
  conditions, such that for all x,y ∈ S and g₁,...,gₙ ∈ GS:

       g₁(x) = g₁(y)  ⇐ ... ⇐ gₙ(x) = gₙ(y) ⇐ f(x) = f(y)

  The idea is that the functions in GS may be computed more efficiently
  than F, thus allowing an overall faster computation of the partition
  in some cases.

  Returns the partition of S as a list of lists.

  \(fn S &rest G1 ... GN F\)"

    (cond
     ((= (length s) 0) nil)
     ((or (null fns)
          (< (length s) 2))
      (list s))
     (t
      (seq-mapcat
       (lambda (elt)
         (apply #'partition-by (cdr elt) (cdr fns)))
       (seq-group-by (car fns) s)))))

  (partition-by
   (seq-filter #'file-regular-p (directory-files "/tmp" t))
   (lambda (file) (nth 7 (file-attributes file)))
   (lambda (file) (with-temp-buffer
                    (insert-file-contents-literally file)
                    (md5 (current-buffer)))))
#+END_SRC

Andreas


Re: [ELPA] New package: find-dups

Michael Heerdegen
Andreas Politz <[hidden email]> writes:

> What you seem to be doing is something like this (minus equal vs eq).

I didn't see that `seq-group-by' could be used as a factor - thanks.  It
would be good if `seq-group-by' was able to use hash tables when
appropriate, however.


Michael.


Re: [ELPA] New package: find-dups

Nicolas Petton-2
Michael Heerdegen <[hidden email]> writes:


> I didn't see that `seq-group-by' could be used as a factor - thanks.  It
> would be good if `seq-group-by' was able to use hash tables when
> appropriate, however.

When would it be appropriate?

Cheers,
Nico


Re: [ELPA] New package: find-dups

Michael Heerdegen
Nicolas Petton <[hidden email]> writes:

> Michael Heerdegen <[hidden email]> writes:
>
>
> > I didn't see that `seq-group-by' could be used as a factor - thanks.
> > It
> > would be good if `seq-group-by' was able to use hash tables when
> > appropriate, however.
>
> When would it be appropriate?

Probably when there are enough elements - `delete-dups' hardcodes a
limit of 100 for using a list instead of a hash table.
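
A sketch of what the hash-table variant might look like; the function
name is illustrative (`seq-group-by' itself accumulates an alist, which
means a linear lookup per element):

#+begin_src emacs-lisp
(require 'seq)

;; Group SEQUENCE by the value of FN, like `seq-group-by', but collect
;; the groups in a hash table so each lookup is constant-time instead
;; of a linear scan.  Returns an alist of (KEY . ELEMENTS); unlike
;; `seq-group-by', the order of the groups is not guaranteed.
(defun my-group-by-hash (fn sequence)
  (let ((table (make-hash-table :test #'equal))
        (result '()))
    (seq-doseq (elt sequence)
      (push elt (gethash (funcall fn elt) table)))
    (maphash (lambda (key elts)
               (push (cons key (nreverse elts)) result))
             table)
    result))
#+end_src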


Michael.


Re: [ELPA] New package: find-dups

Michael Heerdegen
Michael Heerdegen <[hidden email]> writes:

> Probably when there are enough elements - `delete-dups' hardcodes a
> limit of 100 for using a list instead of a hash table.

BTW, it would also be nice if `seq-group-by' had features similar to
`cl-remove-duplicates' (in particular, :test and :key) in some way.
AFAICT, `cl-remove-duplicates' never uses hash tables, so the perfect
function of that kind is not yet available.
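
One way such keywords might look, sketched as a `cl-defun' wrapper; the
name and the keyword defaults are hypothetical, modeled on
`cl-remove-duplicates'.  Note that `make-hash-table' only accepts `eq',
`eql', `equal', or a test registered with `define-hash-table-test':

#+begin_src emacs-lisp
(require 'seq)
(require 'cl-lib)

;; Hypothetical `seq-group-by' variant with :test and :key in the
;; style of `cl-remove-duplicates', backed by a hash table.  KEY is
;; applied to each element before FN, and TEST must be a valid
;; hash-table test.
(cl-defun my-seq-group-by* (fn sequence &key (test #'equal) (key #'identity))
  (let ((table (make-hash-table :test test))
        (result '()))
    (seq-doseq (elt sequence)
      (push elt (gethash (funcall fn (funcall key elt)) table)))
    (maphash (lambda (k elts) (push (cons k (nreverse elts)) result))
             table)
    result))
#+end_src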


Michael.