Maybe you had a lot of files scattered around on different drives, and you added them all into a single git-annex repository. Some of the files are surely duplicates of others.
While git-annex stores the file contents efficiently, it would still help to clean up this mess if you could find, and perhaps remove, the duplicate files.
Here's a command line that will show duplicate sets of files grouped together:
git annex find --include '*' --format='${file} ${escaped_key}\n' | \
sort -k2 | uniq --all-repeated=separate -f1 | \
sed 's/ [^ ]*$//'
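The pipeline above splits fields on whitespace, so filenames containing spaces can confuse `sort -k2` and `uniq -f1`. A minimal sketch of one way around that is to print the key first and treat the rest of the line as the filename. `group_duplicates` is a hypothetical helper name, not part of git-annex, and the sketch assumes that keys contain no spaces and filenames contain no newlines.

```shell
# Hypothetical helper (not part of git-annex): reads "KEY FILENAME" lines on
# stdin and prints the filenames of every key that occurs more than once,
# one group per key, with a blank line between groups.
# Assumes keys contain no spaces and filenames contain no newlines.
group_duplicates() {
    # Sort on the key field only, bytewise, so equal keys end up adjacent.
    LC_ALL=C sort -k1,1 | awk '
        NF < 2 { next }                           # skip empty or malformed lines
        {
            key = $1
            name = substr($0, length(key) + 2)    # everything after "KEY "
            if (key == prev) {
                if (count == 1) {                 # second member: open a new group
                    if (printed) print ""
                    print first
                    printed = 1
                }
                print name
                count++
            } else {
                prev = key; first = name; count = 1
            }
        }'
}
```

It could be fed from the same command with the fields swapped, for example `git annex find --include '*' --format='${key} ${file}\n' | group_duplicates`.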
Here's a command line that will remove one of each duplicate set of files:
git annex find --include '*' --format='${file} ${escaped_key}\n' | \
sort -k2 | uniq --repeated -f1 | sed 's/ [^ ]*$//' | \
xargs -d '\n' git rm
--Joey
Spaces and other special characters can make filename handling ugly. If you don't need to keep the exact filenames, then it might be easiest just to get rid of the problematic characters.
Maybe you can run something like this before checking for duplicates.
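One way such a cleanup could look, as a minimal sketch: it assumes that replacing each space with an underscore is acceptable and leaves every other character alone. `sanitize_name` is a hypothetical helper, not something git or git-annex provides.

```shell
# Hypothetical helper: prints the given name with every space replaced by
# an underscore. Other special characters are deliberately left untouched.
sanitize_name() {
    printf '%s\n' "$1" | tr ' ' '_'
}
```

For an affected file `$f`, the rename itself could then be done with `git mv -- "$f" "$(sanitize_name "$f")"`, after checking that the new name does not collide with an existing file.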
Is there any simple way to search for files with a given key?
At the moment, the best I've come up with is this:
git annex find --include '*' --format='${key} ${file}\n' | grep <KEY>
where <KEY> is the key. This seems like an awfully longwinded approach, but I don't see anything in the docs indicating a simpler way to do it. Am I missing something?
@Chris I guess there's no really easy way because searching for a given key is not something many people need to do.
However, git does provide a way. Try
git log --stat -S $KEY
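If the `git annex find` pipeline from the question is kept instead, a plain `grep` also matches keys that merely contain the search string. A minimal sketch of an exact-match filter follows. `files_with_key` is a hypothetical helper name, and the sketch assumes one "KEY FILENAME" pair per input line and keys without spaces or backslashes.

```shell
# Hypothetical helper: reads "KEY FILENAME" lines on stdin and prints the
# filenames whose key is exactly equal to $1 (no substring or regex matching).
# Assumes keys contain no spaces or backslashes.
files_with_key() {
    # Concatenating "" forces a string comparison rather than a numeric one.
    awk -v want="$1" '($1 "") == (want "") { print substr($0, length($1) + 2) }'
}
```

Usage might look like `git annex find --include '*' --format='${key} ${file}\n' | files_with_key "$KEY"`.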
Thanks. I have quite a lot of papers in PDF format. Now I'm saving space, have them under control and synchronized with many devices, and have found more than 200 duplicates. Is there a way to donate to the project? You really deserve it. Thanks.
@Juan the best thing to do is tell people about git-annex, help them use it, and file bug reports. Just generally be part of the git-annex community.
(If you really want to donate to me, http://campaign.joeyh.name/ is still open.)
I'm already spreading the word. Handling scientific papers, data, simulations and code has been quite a challenge during my academic career. While code was solved long ago, the first three items remained a huge problem. I'm sure many of my colleagues will be happy to use it. Is there any hashtag or Twitter account? I've seen that you collected some of my tweets, but I don't know how you did it. Did you search for git-annex? Best, Juan