
Git literacy : Cleaning

denvazh

Note: this article was written more than a year ago, so its content may be outdated.

Origins of the problem

When you write code and use git to manage its versions, you might (usually unintentionally) add some files to the repository and then delete them in a later commit. That doesn’t sound so bad, does it? However, I would like to show what actually happens.
Note: this is covered in books about git, but for some reason people tend NOT to read them :)

Digging…

Git is a great tool in many ways. In particular, when a new project member or another user clones the project, git downloads the entire history of the project, including every version of every file. This is great when all files are source code, but it stops being great as soon as somebody accidentally adds big binary file(s). If this happens, everybody downloads that file on every clone, even if it was removed in a following commit. It will always be there, as long as the file is reachable within the project history.
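To make this concrete, here is a minimal sketch that builds a throwaway repository, adds and then deletes a binary file, and checks that a fresh clone still receives the blob. The directory layout, the file name big.bin and the demo identity are assumptions made up for the illustration.

```shell
#!/bin/sh
# Sketch: a file deleted in a later commit still travels with every clone.
# All names here (demo dirs, big.bin, the identity) are hypothetical.
set -e
demo=$(mktemp -d)
git init -q "$demo/origin"
cd "$demo/origin"
git config user.name "Demo"
git config user.email "demo@example.com"

echo 'echo hello' > nice_script.sh
git add nice_script.sh
git commit -q -m "Some scripts added"

# Add a 1 MiB binary, then delete it in the next commit.
head -c 1048576 /dev/urandom > big.bin
git add big.bin
git commit -q -m "Added big binary"
blob=$(git rev-parse HEAD:big.bin)
git rm -q big.bin
git commit -q -m "Removed big binary"

# A clone over file:// goes through the regular transfer machinery.
git clone -q "file://$demo/origin" "$demo/clone"
cd "$demo/clone"

# The file is absent from the checkout, but its blob was downloaded.
if [ ! -e big.bin ] && git cat-file -e "$blob"; then
    clone_result="blob still present in clone"
else
    clone_result="blob not present in clone"
fi
echo "$clone_result"
```

The check relies on git cat-file -e, which only reports whether an object exists in the local object database.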

Suppose you work on some code and have in your repository something like this:

$: ls -1
list_files.sh
nice_script.sh

$: git log --oneline
60cc42f Some scripts added

$: du -skh .
 96K	.

$: git count-objects -v
count: 4
size: 16
in-pack: 0
packs: 0
size-pack: 0
prune-packable: 0
garbage: 0

So, there are two files, one commit, and the total size of the repository is 96K. Let’s make a mess.

Now, suppose somebody made a very smart move and added a whole archive of the Linux kernel to the repository:

$: curl -o linux_kernel.tar.bz2 http://www.kernel.org/pub/linux/kernel/v3.0/linux-3.1.5.tar.bz2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 73.6M  100 73.6M    0     0   317k      0  0:03:57  0:03:57 --:--:--  322k
$: git add linux_kernel.tar.bz2
$: git commit -am "being smart: added linux kernel"
[master 4a7c516] being smart: added linux kernel
 1 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 linux_kernel.tar.bz2
$: du -skh .   
147M	.

Now we have the file in both the working tree and the repository, which brings the total to 147M ( 73.6M * 2 ). It’s not nice to make a repository that big,
so let’s assume a very strict project manager found the file, deleted it from the working tree and committed the removal.

$: git commit -am "Removed linux kernel"
[master fed0978] Removed linux kernel
 1 files changed, 0 insertions(+), 0 deletions(-)
 delete mode 100644 linux_kernel.tar.bz2

$: git log --oneline
fed0978 Removed linux kernel
4a7c516 being smart: added linux kernel
60cc42f Some scripts added

Looks nice. However, even though the file was removed from the working tree and is no longer present there, it still stays in the repository.
We can easily confirm this:

$: du -skh .
 74M	.

$: git gc
Counting objects: 8, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (6/6), done.
Writing objects: 100% (8/8), done.
Total 8 (delta 2), reused 0 (delta 0)

$: git count-objects -v
count: 0
size: 0
in-pack: 8
packs: 1
size-pack: 75384
prune-packable: 0
garbage: 0
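The size and size-pack fields of git count-objects -v are reported in KiB, so 75384 is roughly the 73.6M archive again. A small sketch of the conversion follows; pack_mib is a hypothetical helper name, and the sample input is copied from the transcript above.

```shell
#!/bin/sh
# Sketch: turn the KiB figure from `git count-objects -v` into MiB.
# pack_mib is a hypothetical helper name; it reads the report on stdin.
pack_mib() {
    LC_ALL=C awk '$1 == "size-pack:" { printf "%.1f MiB\n", $2 / 1024 }'
}

# Sample input copied from the transcript above.
sample_mib=$(printf 'in-pack: 8\npacks: 1\nsize-pack: 75384\n' | pack_mib)
echo "$sample_mib"

# In a real repository: git count-objects -v | pack_mib
```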

Cleaning

Now we will remove this big file from the repository itself. However, I must warn you that this method rewrites history: every commit
from the one that introduced the file onward gets a new hash. If you work on a local repository and do this before pushing your branches
to a remote repository ( where other contributors usually push as well ), that’s fine. However, if someone else added the file, or you
added it but only noticed after other project members had already fetched those commits, then you have to
notify everybody, so that they can rebase their work onto your new commits with “git rebase”.
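What that rebase might look like on a colleague's side is sketched below on two throwaway repositories. Everything in it is an assumption for the sake of the demo: the directory names, big.bin, the identities, and the choice of git rebase --onto with the old upstream tip as the cut-off point.

```shell
#!/bin/sh
# Sketch: a colleague moves local work onto rewritten upstream history.
# Directory names, big.bin and the demo identities are hypothetical.
set -e
demo=$(mktemp -d)

# Upstream repository with the add/remove pair in its history.
git init -q "$demo/origin"
cd "$demo/origin"
git config user.name "Maintainer"
git config user.email "maintainer@example.com"
branch=$(git symbolic-ref --short HEAD)
echo 'echo hello' > nice_script.sh
git add nice_script.sh
git commit -q -m "Some scripts added"
head -c 100000 /dev/urandom > big.bin
git add big.bin
git commit -q -m "Added big binary"
git rm -q big.bin
git commit -q -m "Removed big binary"

# The colleague clones the old history and commits on top of it.
git clone -q "file://$demo/origin" "$demo/colleague"
cd "$demo/colleague"
git config user.name "Colleague"
git config user.email "colleague@example.com"
old_tip=$(git rev-parse HEAD)
echo 'ls -1' > list_files.sh
git add list_files.sh
git commit -q -m "Colleague work"

# The maintainer rewrites history upstream.
cd "$demo/origin"
export FILTER_BRANCH_SQUELCH_WARNING=1
git filter-branch --index-filter \
    'git rm --cached --ignore-unmatch big.bin' -- --all >/dev/null

# The colleague fetches the rewritten branch and replays only the commits
# made after the old upstream tip onto the new one.
cd "$demo/colleague"
git fetch -q origin
git rebase -q --onto "origin/$branch" "$old_tip"

new_parent=$(git rev-parse HEAD^)
new_upstream=$(git rev-parse "origin/$branch")
echo "rebased onto: $new_upstream"
```

The important detail is remembering the old upstream tip: with it, only the colleague's own commits are replayed, and the old commits that carried the big file are left behind.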

Suppose we do not know which file is the culprit, so let’s search. After we ran git gc, all objects were moved into a packfile.
Another command, git verify-pack, lets us list the contents of the packfile. We are mainly interested in blobs ( actual file contents ) and in the object size,
which is the third field, so we can sort by it. Let’s search:

$: git verify-pack -v .git/objects/pack/pack-*.idx | sort -k 3 -n | tail -5
60cc42f69fcb1691f0218faa2ca5c22ac3606792 commit 51 60 354 1 fed0978273e79ccf79e30e0e5019204d843eac57
1982f8e6f72b03112fdc5f948600faabea3b0d44 tree   131 136 414
fed0978273e79ccf79e30e0e5019204d843eac57 commit 241 168 12
4a7c516a07ac843d8a58e1e15552ee6615e231f7 commit 252 174 180
f5f5ccbb633e98cc115f6035725e5b57c61c8117 blob   77244311 77192085 656

Aha! The object at the bottom is the one we are looking for.
Let’s verify which file it belongs to:

$: git rev-list --objects --all | grep f5f5ccbb633e98cc115f6035725e5b57c61c8117
f5f5ccbb633e98cc115f6035725e5b57c61c8117 linux_kernel.tar.bz2
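The two steps (find the biggest object, then map it to a path) can also be combined without looking into the packfile at all. The sketch below is one possible way, assuming a reasonably recent git; largest_blobs is a hypothetical helper name, and the demo repository with big.bin exists only to exercise it.

```shell
#!/bin/sh
# Sketch: list the largest blobs reachable from any ref, with their paths.
# largest_blobs is a hypothetical helper; run it inside a repository.
largest_blobs() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
        awk '$1 == "blob"' |
        sort -k 2 -n |
        tail -n "${1:-5}"
}

# Throwaway repository to try the helper on (names are made up).
set -e
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
git config user.name "Demo"
git config user.email "demo@example.com"
echo 'echo hello' > nice_script.sh
head -c 200000 /dev/zero > big.bin
git add .
git commit -q -m "demo content"

# Print only the single largest blob: type, size in bytes, hash, path.
biggest=$(largest_blobs 1)
echo "$biggest"
```

Unlike git verify-pack, this reports the uncompressed object size and also sees loose objects, so it works before git gc has been run.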

Now we need to find the commits that touched this file:

$: git log --pretty=oneline -- linux_kernel.tar.bz2
fed0978273e79ccf79e30e0e5019204d843eac57 Removed linux kernel
4a7c516a07ac843d8a58e1e15552ee6615e231f7 being smart: added linux kernel

Finally, we must rewrite all commits from 4a7c516 onward to actually remove the file from the repository:

$: git filter-branch --index-filter 'git rm --cached --ignore-unmatch linux_kernel.tar.bz2' -- 4a7c516^..
Rewrite 4a7c516a07ac843d8a58e1e15552ee6615e231f7 (1/2)rm 'linux_kernel.tar.bz2'
Rewrite fed0978273e79ccf79e30e0e5019204d843eac57 (2/2)
Ref 'refs/heads/master' was rewritten

A plain rm is not enough here: an index filter works on the index of each commit rather than on checked-out files, so we have to remove the entry from the index itself with git rm --cached.
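For anyone who wants to try the rewrite safely first, here is a minimal sketch on a throwaway repository. The file name big.bin and the identity are made up, and it rewrites all refs with -- --all instead of a commit range, which is a simplification for the demo. The FILTER_BRANCH_SQUELCH_WARNING variable only matters on newer git versions, which print a warning recommending alternative tools before filter-branch starts.

```shell
#!/bin/sh
# Sketch: remove one path from every commit of a throwaway repository.
# big.bin and the demo identity are hypothetical.
set -e
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
git config user.name "Demo"
git config user.email "demo@example.com"

echo 'echo hello' > nice_script.sh
git add nice_script.sh
git commit -q -m "Some scripts added"
head -c 100000 /dev/urandom > big.bin
git add big.bin
git commit -q -m "Added big binary"
git rm -q big.bin
git commit -q -m "Removed big binary"

# Newer git versions print a warning and pause before filter-branch runs;
# this variable silences it.
export FILTER_BRANCH_SQUELCH_WARNING=1

# --index-filter edits each commit's index directly, without checking out
# files, so the path is dropped with `git rm --cached`.
# --ignore-unmatch keeps the filter from failing on commits without the file.
git filter-branch --index-filter \
    'git rm --cached --ignore-unmatch big.bin' -- --all >/dev/null

# Count commits on the current branch that still touch the path.
touching=$(git log --oneline -- big.bin | wc -l | tr -d ' ')
echo "commits touching big.bin: $touching"
```

Note that the two commits that used to add and remove the file stay in the history as empty commits; they just no longer reference the blob.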

Even though we removed the file from the index and the rewritten history no longer references it, the reflog and the backup refs that filter-branch created
under .git/refs/original still point to it. Thus we have to remove them as well and then repack the whole database with git gc.

$: rm -Rf .git/refs/original
$: rm -Rf .git/logs/
$: git gc
Counting objects: 6, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 2), reused 2 (delta 0)

$: git count-objects -v
count: 4
size: 75408
in-pack: 6
packs: 1
size-pack: 1
prune-packable: 0
garbage: 0

Now the file sits among the loose objects and is no longer part of the history,
so it won’t be transferred on a push or a subsequent clone.

To remove such objects completely, we can use git prune --expire with a time value, which removes unreachable loose objects
older than that time.

$: git prune --expire 0
$: git count-objects -v
count: 0
size: 0
in-pack: 6
packs: 1
size-pack: 1
prune-packable: 0
garbage: 0
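The same cleanup can be done with git commands instead of deleting files under .git by hand. The sketch below is one possible variant, again on a throwaway repository with a made-up big.bin; it swaps the rm -Rf calls for git update-ref and git reflog expire, and folds the prune into git gc --prune=now.

```shell
#!/bin/sh
# Sketch: after a history rewrite, drop the leftover references and check
# that the blob is really gone. big.bin and the identity are hypothetical.
set -e
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
git config user.name "Demo"
git config user.email "demo@example.com"

echo 'echo hello' > nice_script.sh
git add nice_script.sh
git commit -q -m "Some scripts added"
head -c 100000 /dev/urandom > big.bin
git add big.bin
git commit -q -m "Added big binary"
blob=$(git rev-parse HEAD:big.bin)
git rm -q big.bin
git commit -q -m "Removed big binary"

export FILTER_BRANCH_SQUELCH_WARNING=1
git filter-branch --index-filter \
    'git rm --cached --ignore-unmatch big.bin' -- --all >/dev/null

# 1. Delete the backup refs that filter-branch left under refs/original.
git for-each-ref --format='delete %(refname)' refs/original |
    git update-ref --stdin

# 2. Expire every reflog entry instead of deleting .git/logs by hand.
git reflog expire --expire=now --all

# 3. Repack and prune unreachable objects regardless of their age.
git gc -q --prune=now

# git cat-file -e succeeds only if the object still exists locally.
if git cat-file -e "$blob" 2>/dev/null; then
    cleanup_result="blob still in object database"
else
    cleanup_result="blob removed"
fi
echo "$cleanup_result"
```

Going through git's own commands keeps the reflog files in place for future use and avoids touching the internal layout of .git directly.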

Conclusion

It is always important to think beforehand, as this saves you from making critical mistakes. However, in case they happen, you should know the tools you use well enough to solve problems with minimum effort.
