Find and list duplicates of files (26)

1 Name: #!/usr/bin/anonymous : 2008-04-26 14:40 ID:mNaORxmZ

Well, this was easier than I expected it to be. Now for traversing subdirectories... I hate recursion...

#!/usr/bin/perl
$dir = $ARGV[0] or die "No valid directory given.\n";
opendir DIR, $dir or die "Cannot open $dir: $!\n";
foreach $file (readdir DIR)
{

    next if ($file eq '.') or ($file eq '..');
    $fullname = "$dir/$file";
    next unless -f $fullname;   # -f must test the full path, not just the basename
    chomp($md5sum = `md5sum "$fullname" | cut -d" " -f1`);
    $files{$md5sum} = defined($files{$md5sum}) ? "$files{$md5sum};$file" : $file;

}
closedir DIR;

@files = values %files;
foreach $file (@files)
{

    next unless ($file =~ /;/);
    $file =~ s/;/ /g;
    print "Copies: $file\n";

}

2 Name: #!/usr/bin/anonymous : 2008-04-26 16:46 ID:Heaven

and you're posting this because...?

3 Name: #!/usr/bin/anonymous : 2008-04-26 17:21 ID:mNaORxmZ

>>2
Yes.

4 Name: #!/usr/bin/anonymous : 2008-04-26 17:23 ID:Heaven

My version uses File::Find, so I don't have to deal with recursion directly. It maps each md5 to the file names that produced it, and if more than one file shares an md5, it prints those names.

use File::Find;

sub wanted {
    push @{$dups{substr(`md5sum -b "$_"`, 0, 32)}}, $File::Find::name if -f;
}


find \&wanted, ('.');

while (($k,$v) = each(%dups)) {
    if (scalar @$v > 1) {
        print "$k:";
        foreach (@$v) {
            print " [$_]";
        }
        print "\n";
    }
}

5 Name: #!/usr/bin/anonymous : 2008-04-26 17:24 ID:Heaven

>>2
This is /code/ ...seriously.

6 Name: #!/usr/bin/anonymous : 2008-04-26 17:52 ID:5q+5gGj8

i think you should only compute the digests of files that have the same size.

7 Name: #!/usr/bin/anonymous : 2008-04-26 17:54 ID:mNaORxmZ

>>6
Good point, thank you!

8 Name: #!/usr/bin/anonymous : 2008-04-26 18:00 ID:NC3CTkuc

Hello, here is my factorial program I wrote in Python

def fact(n):
    if n == 0: return 1
    return n * fact(n - 1)

number = int(raw_input("Enter a number:"))
if number < 0:
    print "No"
else:
    print fact(number)

9 Name: #!/usr/bin/anonymous : 2008-04-26 18:50 ID:Heaven

>>8

import operator
def fact(x): return reduce(operator.mul, xrange(2, x+1))

10 Name: #!/usr/bin/anonymous : 2008-04-26 18:57 ID:Heaven

>>9

fact(1)

11 Name: #!/usr/bin/anonymous : 2008-04-26 19:02 ID:Heaven

>>6
Good idea. Here is the code. Whole thing is pretty slow though.
To >>8,9,10, look at the topic thanks, faggots.

use File::Find;

sub wanted {
    $flen = -s $_;
    push @{$sizees{$flen}}, $File::Find::name if -f;
}


find \&wanted, ('.');

while (($k,$v) = each(%sizees)) {
    if (scalar @$v > 1) {
        foreach (@$v) {
            push @{$dups{substr(`md5sum "$_"`, 0, 32)}}, $_;
        }
    }
}

while (($k,$v) = each(%dups)) {
    if (scalar @$v > 1) {
        print "$k:";
        foreach (@$v) {
            print " [$_]";
        }
        print "\n";
    }
}

12 Name: #!/usr/bin/anonymous : 2008-04-27 11:39 ID:fd1yg6ZN

ruby mix.

require "find"
require "digest/md5"

module Digest
  class MD5
    def self.filehexdigest(filename)
      h = MD5.new
      File.open(filename, "r") do |file|
        h.update(file.read(32 * 1024)) until file.eof
      end
      h.hexdigest
    end
  end
end

def dupes(dir)
  s, h = {}, {}
  Find.find(dir) do |path|
    # if this is a file
    if FileTest.file?(path)
      # append it to the array of files having the same size
      size = File.stat(path).size
      s[size] ||= []
      s[size] << path
    end
  end
  # only consider files that have the same size as some other file
  filenames = s.values.select {|e| e.size > 1}.flatten
  filenames.each do |filename|
    # compute its digest
    digest = nil
    begin
      digest = Digest::MD5.filehexdigest(filename)
    rescue => e
      $stderr.puts(e)
    end
    # skip files whose digest could not be computed
    next if digest.nil?
    # append it to the array of files having the same digest
    h[digest] ||= []
    h[digest] << filename
  end
  # only return entries where two or more files had the same digest
  h.values.select {|e| e.size > 1}
end

exit 1 unless ARGV.size == 1
dupes(ARGV[0]).each do |ds|
puts(ds.join(" "))
end

13 Name: #!/usr/bin/anonymous : 2008-04-27 11:45 ID:fd1yg6ZN

ruby fact.

def fact(n)
  n <= 1 ? 1 : (2..n).to_a.inject(1) {|s,e| s * e}
end

14 Name: #!/usr/bin/anonymous : 2008-04-27 16:34 ID:GA3w+AGu

>>9

f=lambda x:reduce(lambda a,b:a*b,xrange(2,x+1))

15 Name: #!/usr/bin/anonymous : 2008-04-27 17:51 ID:58v/IFcO

>>13,14
For fucks sake, every nigger knows how to factorial. Would you stop posting that shit?

16 Name: #!/usr/bin/anonymous : 2008-04-27 19:01 ID:Heaven

>>15
And collecting md5 hashes for each file in a directory isn't trivial?

17 Name: #!/usr/bin/anonymous : 2008-04-27 19:57 ID:58v/IFcO

>>16
Well it sure is more useful than fucking factorial.

19 Name: #!/usr/bin/anonymous : 2008-04-27 23:21 ID:58v/IFcO

>>18
lulz

20 Name: #!/usr/bin/anonymous : 2008-04-28 15:38 ID:fd1yg6ZN

you probably mean that everybody knows how to do factorial the simple way. but finding all the duplicate files in a tree and computing factorial are both as complicated as you have time for them to be, like pretty much any subject.

21 Name: #!/usr/bin/anonymous : 2008-04-28 22:47 ID:58v/IFcO

>>20
What I actually mean is that:
1) The topic of this thread is "Find and list duplicates of files"
2) >>8 is a ridiculously obvious troll, it's not even funny that you don't recognize it.
3) So make another thread about factorial and at the very least post a less obvious formula. All I see here is n! = n*(n-1)! and n! = prod(1..n).
4) Oh fuck it
    人

  (__)
\(__)/ poooooooooooop
 ( ・∀・ )

22 Name: #!/usr/bin/anonymous : 2008-04-29 02:52 ID:JoAFhg0D

why don't you do something other than complain?

23 Name: #!/usr/bin/anonymous : 2008-04-29 03:01 ID:58v/IFcO

>>22
I poasted >>4 and >>11, thanks. And what else is there to do in this thread more than that? Now this thread gets bumped over or someone else comments or posts another way to do the same thing. What else is there?

24 Name: #!/usr/bin/anonymous : 2008-04-29 04:26 ID:JoAFhg0D

the 'what else' is only limited by your time, energy, and interest. what other heuristics could/should you use to avoid having to do the md5sum? at what point would those heuristics become more harmful than helpful? is there some much more clever way to prove that the files are identical or different? etc.

12.rb takes 0m0.204s on my /usr/include. 11.pl takes 0m3.066s. why? do you know? i'm kind of interested. maybe i'll check that out. there is always more to whatever you're looking at. you just stop when you lose interest or lack energy.

25 Name: #!/usr/bin/anonymous : 2008-04-29 15:41 ID:58v/IFcO

>>24
Oh, fine. Let's see, the speed difference is mostly because >>12 uses the Digest module in-process while >>11 shells out to md5sum for every file. I used the same module with perl and it's much faster.

I have an idea though. After grouping the files by size, I could pick a block of data at a random offset (the same offset in each file of a group) and check whether it matches; only if it does, md5sum the whole file. I don't know whether reading a block from the middle of a file is fast, though. Maybe it's better to match the first few bytes of each file. I could try it next week; I have finals this week and next.
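
A rough sketch of that first-bytes idea, reusing >>11's size grouping; the 4 KB prefix length and the switch to the in-process Digest::MD5 module are assumptions, nothing here has been timed:

use File::Find;
use Digest::MD5;

# group files by size, exactly as in >>11
sub wanted {
    push @{$sizees{-s $_}}, $File::Find::name if -f;
}
find \&wanted, ('.');

while (($size, $names) = each %sizees) {
    next unless @$names > 1;
    # cheap pre-check: bucket the group by its first 4 KB
    my %byprefix;
    foreach $name (@$names) {
        open my $fh, '<', $name or next;
        binmode $fh;
        read $fh, my $prefix, 4096;
        close $fh;
        push @{$byprefix{$prefix}}, $name;
    }
    # only files whose prefixes collide get the full md5
    foreach $group (values %byprefix) {
        next unless @$group > 1;
        foreach $name (@$group) {
            open my $fh, '<', $name or next;
            binmode $fh;
            push @{$dups{Digest::MD5->new->addfile($fh)->hexdigest}}, $name;
            close $fh;
        }
    }
}

while (($k, $v) = each %dups) {
    print "$k: @$v\n" if @$v > 1;
}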

26 Name: #!/usr/bin/anonymous : 2008-04-29 16:16 ID:fd1yg6ZN

yeah, ok. figured it was the md5sums.

yeah, i thought about checking certain bytes or chunks first. or running some much faster but less trustworthy checksum stuff first. like crc32(A) == crc32(B) && md5/sha.
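
A rough sketch of that cheap-checksum-first idea, assuming Compress::Zlib's crc32 for the fast pass (hypothetical helper, not something anyone in the thread actually ran or timed):

use Compress::Zlib ();   # provides crc32(); any cheap checksum would do
use Digest::MD5;

# Takes one group of same-sized file names (e.g. one of >>11's size buckets)
# and returns only the sets that also agree on CRC-32 and then on MD5.
sub md5_after_crc {
    my @names = @_;
    my (%bycrc, %bymd5);
    foreach my $name (@names) {
        open my $fh, '<', $name or next;
        binmode $fh;
        my $data = '';
        read $fh, $data, -s $fh;                          # slurp; fine for a sketch
        close $fh;
        push @{$bycrc{Compress::Zlib::crc32($data)}}, $name;
    }
    foreach my $group (values %bycrc) {
        next unless @$group > 1;                          # unique CRC: can't be a dupe
        foreach my $name (@$group) {
            open my $fh, '<', $name or next;
            binmode $fh;
            push @{$bymd5{Digest::MD5->new->addfile($fh)->hexdigest}}, $name;
            close $fh;
        }
    }
    return grep { @$_ > 1 } values %bymd5;
}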

if you want to check a few bytes, i think checking the beginning is just as good as anywhere else. i can't see how you'd know.

one thing that occurs to me is that it's not always necessary to fully compute the md5sum. i think maybe it'd be better to compute the md5sums of the blocks of the files of equal size in parallel so that you can compare the intermediate hashes. that is, right now we say the file sizes are equal, so compute both their hashes and see if those are equal too. but a better way might be to compute the hash of the first block of A and compare that to the hash of the first blocks of B and C and D and the other files whose sizes matched. that way you could kick some out of consideration earlier and do less hashing.
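
A rough sketch of that block-by-block idea (hypothetical helper; the 32 KB block size is arbitrary and nothing in the thread actually runs this):

use Digest::MD5 qw(md5);

# Take one group of same-sized candidate files and split it into sets whose
# blocks all hash alike, reading block by block "in parallel": after each
# block, any file left alone in its bucket is dropped, so mismatching files
# cost only a block or two of hashing instead of a full pass.
sub split_by_blocks {
    my ($names, $size) = @_;
    my $blocksize = 32 * 1024;
    my @groups = ($names);
    for (my $offset = 0; $offset < $size; $offset += $blocksize) {
        my @next;
        foreach my $group (@groups) {
            my %bychunk;
            foreach my $name (@$group) {
                open my $fh, '<', $name or next;
                binmode $fh;
                seek $fh, $offset, 0;
                my $chunk = '';
                read $fh, $chunk, $blocksize;
                close $fh;
                push @{$bychunk{md5($chunk)}}, $name;    # bucket by this block's hash
            }
            push @next, grep { @$_ > 1 } values %bychunk;
        }
        @groups = @next;
        last unless @groups;
    }
    return @groups;
}

# e.g. inside >>11's size loop:  print "@$_\n" for split_by_blocks($v, $k);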

This thread has been closed. You cannot post in this thread any longer.