Friday, 11 July 2008

Bazaar has the model right

Some people in the GNOME community have suggested that if Bazaar has nice usability, then GNOME can just use Git on the back-end, and Bazaar lovers can just use the Git back-end via Bazaar. It's true that Bazaar could support this — an experimental plug-in exists to do this right now. But this suggestion betrays several wrong assumptions.

People assume Git and Bazaar are the same. They're not. People assume that if Git and Bazaar have technical differences, then Git must have it right.

The problem with these assumptions is that usability begins at the ground level. Bazaar started with a focus on usability. Git began with a focus on speed. The data models of both Bazaar and Git reflect their initial focus. But Bazaar's model can also be fast. In fact, the Bazaar developers are currently optimising a number of key operations for speed.

Data retrieval

Git and Bazaar are both key/value mapping systems. Each stores bytes under a key, and when those bytes are needed, they are requested with that key.

The big difference is that Git's keys are also the hashes of the bytes. This is why Git is called a content-addressable file system. It allows Git to guarantee that if the value hashes to the key, it has not been modified, whether deliberately or by accident. The Bazaar team considered adopting this approach, but decided it was too constricting. Bazaar uses UUIDs instead.
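
To make the distinction concrete, here is a minimal Python sketch of the two kinds of key. The blob header layout follows Git's documented object format; `opaque_key` and `matches_key` are hypothetical helpers invented for illustration, and a random UUID is only a stand-in for a Bazaar identifier, not Bazaar's actual id scheme.

```python
import hashlib
import uuid


def git_blob_key(content: bytes) -> str:
    """Return the SHA-1 name Git gives a blob: hash of a short header plus the bytes."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def opaque_key() -> str:
    """Return a key unrelated to the content (hypothetical stand-in for a Bazaar id)."""
    return uuid.uuid4().hex


def matches_key(key: str, content: bytes) -> bool:
    """A content-derived key can be re-checked against the bytes it names."""
    return git_blob_key(content) == key


data = b"hello\n"
content_addressed = {git_blob_key(data): data}
id_addressed = {opaque_key(): data}

# Tampering is visible from the key alone only in the content-addressed store.
key = next(iter(content_addressed))
print(matches_key(key, data))            # True
print(matches_key(key, b"tampered\n"))   # False
```

With an opaque key, the same check needs a hash stored alongside the value, which is the approach the rest of this post describes.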

Authenticating revisions

For detecting malicious modification of revisions, Git uses its cryptographic hashes.

Bazaar uses revision-signing. All revisions can be PGP-signed. No signed revision can be forged. And the hashed representation can easily be generated and passed around to ensure that exactly the same content is used.

If SHA-1 is broken, both Bazaar and Git will lose their ability to detect malicious modification. But since Bazaar uses UUIDs to identify revisions, users can re-sign their old revisions with whatever method proves to be secure. Changing the hash used by Git would make it incompatible with all existing repositories.
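
A rough Python sketch of that argument follows. It models a signature as nothing more than a digest recorded per algorithm (a real system would PGP-sign it), and the revision id and revision bytes are made-up illustrative values, not real Bazaar data.

```python
import hashlib

# Hypothetical revision: an opaque id plus the bytes it names.
revision_id = "rev-0001"  # illustrative only, not a real Bazaar revision id
revision_bytes = b"committer: someone\nmessage: initial revision\n"


def digest(algorithm: str, payload: bytes) -> str:
    """Hash the payload with the named algorithm."""
    return hashlib.new(algorithm, payload).hexdigest()


# Opaque-id model: attestations hang off the id and can be added later.
attestations = {revision_id: {"sha1": digest("sha1", revision_bytes)}}
attestations[revision_id]["sha256"] = digest("sha256", revision_bytes)

# Content-addressed model: the key is the digest, so a new hash means a new key.
old_key = digest("sha1", revision_bytes)
new_key = digest("sha256", revision_bytes)

print(sorted(attestations[revision_id]))  # ['sha1', 'sha256']
print(old_key == new_key)                 # False
```

Under these assumptions the opaque id survives the change of algorithm, while anything that referred to the old content-derived key has to be rewritten.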

Data Integrity and Serialization formats

Bazaar stores hashes of every value, so it is equally capable of detecting accidental modification. It can be useful to have different representations of a tree in different repositories. For example, when Git lists files, it divides this data by directory. This is a good approach, but not necessarily the best approach. An alternative approach would be to use a radix tree. This would ensure that Git performed quickly even if users put unreasonable numbers of files in a single directory. But because Git's keys are hashes, upgrading Git's format to use radix trees would change the keys, which means that people could not use the commit-id from one repository to refer to the same tree stored in another form.

Bazaar doesn't assume it has the perfect format. It provides an upgrade path, and doesn't change the commit-id of a revision if you change your format. What's more, Bazaar can even reference data it has never seen. This allows partial imports from other VCSes to be fully compatible with more complete imports. And if a VCS provides UUIDs (content hashes certainly qualify as UUIDs), Bazaar can refer to those UUIDs directly.
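
The effect of a format change on keys can be sketched in a few lines of Python. The two serializers below are toy formats invented for this example (JSON, not Git's or Bazaar's real on-disk layout), and the paths, blob names and revision id are placeholders.

```python
import hashlib
import json

# The same logical tree: path -> blob name (placeholder values).
tree = {"src/foo.c": "blob-a", "src/foo.h": "blob-b", "README": "blob-c"}


def serialize_flat(tree: dict) -> bytes:
    """Toy format 1: one sorted list of (path, blob) pairs."""
    return json.dumps(sorted(tree.items())).encode("utf-8")


def serialize_by_directory(tree: dict) -> bytes:
    """Toy format 2: entries grouped under their parent directory."""
    nested = {}
    for path, blob in tree.items():
        directory, _, name = path.rpartition("/")
        nested.setdefault(directory, {})[name] = blob
    return json.dumps(nested, sort_keys=True).encode("utf-8")


# Content-derived keys follow the serialization, so they differ per format.
flat_key = hashlib.sha1(serialize_flat(tree)).hexdigest()
nested_key = hashlib.sha1(serialize_by_directory(tree)).hexdigest()

# Opaque ids are assigned once and kept across formats.
repo_old_format = {"rev-0001": serialize_flat(tree)}
repo_new_format = {"rev-0001": serialize_by_directory(tree)}

print(flat_key == nested_key)                                    # False
print(repo_old_format.keys() == repo_new_format.keys())          # True
```

Both repositories hold the same tree, but only the opaque id can be quoted in one and looked up in the other.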

File and directory representation

Git refers to files by path. It makes no attempt to track renames in its data store.

Bazaar has an inode abstraction; files and directories both have ids. When a file is renamed, its id stays the same. Bazaar's core code refers to files by their id, so merging a renamed file requires no special effort.
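
As an illustration of why ids make this easy, consider a toy inventory that maps file ids to paths. This is a simplified sketch, not Bazaar's actual inventory classes; the ids, paths and the `apply_edits` helper are hypothetical.

```python
# Inventory on the shared ancestor: file id -> path.
base_inventory = {"id-1": "foo.c", "id-2": "foo.h"}

# Branch A renames both files. The ids do not change.
renamed_inventory = dict(base_inventory)
renamed_inventory["id-1"] = "bar.c"
renamed_inventory["id-2"] = "bar.h"

# Branch B edits the content of the file it still knows as foo.c,
# but records the edit against the file id.
edits_from_other_branch = {"id-1": b'#include "bar.h"\n'}


def apply_edits(inventory: dict, edits: dict) -> dict:
    """Resolve each edited file id to its current path in this inventory."""
    return {inventory[file_id]: content for file_id, content in edits.items()}


merged = apply_edits(renamed_inventory, edits_from_other_branch)
print(merged)  # {'bar.c': b'#include "bar.h"\n'}
```

Because the edit is keyed by id, it lands on bar.c without any guessing about which path the file has moved to.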

Git's approach means that users are warned not to rename files while changing their content. But when files are renamed, those files that refer to the renamed files must have their contents changed as well. For example, if you rename foo.h and foo.c to bar.h and bar.c, you should update the contents of bar.c, or else you will break the build. With Bazaar, users can do whatever they want, and the VCS just works. While Git must always use heuristics to deduce renames, Bazaar does not have to. Of course, it can if it wants to. This is an example of why it is important to design a model for usability from the beginning.
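
For comparison, a rename heuristic of the general kind described above might look like the following Python sketch. It pairs deleted and added paths by text similarity using the standard library's `difflib`; this is a stand-in measure, not Git's actual algorithm, and `guess_renames`, the sample file contents and the 0.5 cut-off used here are assumptions made for the example.

```python
import difflib

C_SOURCE = '#include "foo.h"\n\nmain() {\nprintf("hello there kind sir\\n");\n}\n'
H_SOURCE = "#include <stdio.h>\n#include <stdlib.h>\n"

# Paths that vanished and paths that appeared between two revisions.
deleted = {"foo.c": C_SOURCE, "foo.h": H_SOURCE}
added = {"bar.c": C_SOURCE.replace("foo.h", "bar.h"), "bar.h": H_SOURCE}


def similarity(old_text: str, new_text: str) -> float:
    """Score two texts between 0.0 (nothing shared) and 1.0 (identical)."""
    return difflib.SequenceMatcher(None, old_text, new_text).ratio()


def guess_renames(deleted: dict, added: dict, threshold: float = 0.5) -> dict:
    """Pair each deleted path with its most similar added path, if similar enough."""
    renames = {}
    for old_path, old_text in deleted.items():
        best = max(added, key=lambda p: similarity(old_text, added[p]), default=None)
        if best is not None and similarity(old_text, added[best]) >= threshold:
            renames[old_path] = best
    return renames


print(guess_renames(deleted, added))  # {'foo.c': 'bar.c', 'foo.h': 'bar.h'}
```

The guess is right here, but it depends on the threshold and on how much of the file changed in the same revision; a stored file id has no such dependency.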

Bazaar can import rename data losslessly from foreign VCSes. Some other VCSes support file-ids, and Bazaar can reuse those without change. For VCSes that support renames, but not file-ids, Bazaar's representation is also non-lossy. When data imports are deterministic and non-lossy, it's easy to export them back to their source VCS. Bazaar's Subversion integration is a great example of how this can work.

Choose the back-end with the right model

In any situation, it makes sense to use a back-end that stores the richer dataset. A front-end client that doesn't use all of the back-end's functionality or data representation is a better arrangement than a richer client that cannot store the information it needs because the back-end cannot represent it.

If a single back-end store is going to be used, it makes more sense to use a Bazaar back-end, as Bazaar is able to represent everything that Git does, but the reverse is not true.

Conclusion

The Bazaar developers focused on usability, which requires having a model that supports usability. Bazaar has improved its model to increase the usability of the system. We believe that Bazaar has the right model.

co-written by Aaron Bentley and Tim Penhey

18 comments:

Stoffe said...

That was a very well written and enlightening post. Thank you!

tale said...

Can you provide some context on why you wrote this entry? Does this have to do with Gnome using Git for its projects instead of Subversion or does this have to do with using a VCS system in the Gnome desktop?

Travis said...

How many times did you change bzr's format to get these 'advantages'?

Pau Garcia i Quiles said...

"Git refers to files by path. It makes no attempt to track renames in its data store."

Wrong. Git refers to contents, not to files. Git does not attempt to track renames because it does not care about files. You could take two files, merge them out of git, commit the new file to git, and git would notice where that new file came from and show you its history.

phil said...

What does "changing bzr's format" have to do with it? The post is just about comparing one VCS to another.

Besides, changing bzr's storage format was supported with upgrade tools. I'd like to see how soon Git comes up with 'upgrade tools' when a real SHA-1 exploit hits the market and, more importantly, whether you're going to risk your entire history on them.

Also, since Git used to suck at usability early on (and it's gotten better only recently), I don't see you asking how many versions Git went through to suck less.

Marius Gedminas said...

I tried working with Bazaar yesterday and with Git today. To my utter surprise Git was *more* usable for the actual development work. It has these nice touches like git diff --color that automatically pipe the output to less, or git command --help opening a man page in less with very helpful EXAMPLES sections. I liked the ability to create branches and switch between them in the same working directory without having to create a shared repository in advance and having to juggle an extra directory in your pathnames.

Bazaar wins hands-down when the time comes to share your code. bzr get lp:projectname is great; bzr push bzr+ssh://mywebserver/home/myusername/public_html/newdirectory is great; bzr bind for automatic post-commit mirroring is great. Git required me to jump through some painful hoops to create a repository mirror.

Unfortunately even here Bazaar's usability breaks down due to bugs. Exceptions when you try to do natural things like pull foreign branches into checkouts, unfathomable svn errors from trying bzr-svn on a repository that git-svn handles just fine, magic upgrade tools that leave your bzr branch in a broken state requiring you to manually move hidden directories around to recover it. (All these are filed in Bazaar's bug tracker, but I'm too tired to hunt down bug numbers right now.)

I stopped believing people who say Bazaar is ready to be used today. :-(

Maybe in a couple of years?

Disclaimers:

* I haven't used git as much as I've used bzr, so it had less time to screw up my repositories.

* My understanding of usability may be influenced by reading and watching presentations about both git and Bazaar, quite a few of them since a few years ago. I cannot estimate how user-friendly either of them would be to a completely new user.

Steven Grimm said...

"For example, if you rename foo.h and foo.c to bar.h and bar.c, you should update the contents of bar.c, or else you will break the build."

I won't claim git's rename handling is perfect, but git does handle this kind of thing just fine:

$ git init
Initialized empty Git repository in .git/
$ cat > foo.h << EOF
heredoc> #include <stdio.h>
heredoc> #include <stdlib.h>
heredoc> #include <sys/types.h>
heredoc> #include <signal.h>
heredoc> EOF
$ cat > foo.c << EOF
heredoc> #include "foo.h"
heredoc>
heredoc> main() {
heredoc> printf("hello there kind sir\n");
heredoc> }
heredoc> EOF
$ git add .
$ git commit -m 'initial revision'
... commit output ...
$ git mv foo.c bar.c
$ git mv foo.h bar.h
$ perl -pi -e s/foo.h/bar.h/ bar.c
$ grep include bar.c
#include "bar.h"
$ git commit -m 'new revision with rename'
... commit output ...
$ git show -C
commit 10f5f485c0636c75bfef9c842c381ef200648189
Author: koreth <koreth@mbp>
Date: Fri Jul 11 19:04:09 2008 -0700

new revision with rename

diff --git a/foo.c b/bar.c
similarity index 100%
rename from foo.c
rename to bar.c
diff --git a/foo.h b/bar.h
similarity index 100%
rename from foo.h
rename to bar.h

Both renames were detected despite the fact that bar.c's content changed. In practice I've found git's file rename handling to do the right thing basically all the time.

Where git's rename support breaks down is when you mix *directory* renames and merges. I'm perfectly willing to believe Bazaar does a better job of that.

Jlouis said...

Does bzr have an Index yet? The index is probably the biggest difference in git versus most other systems. When you learn how to use it, and that may take some time, the usability gains are tremendous. I can't live without it anymore.

Then there are all the useful tools that come into play every day when working with git:

git add --patch : With this command we can add individual diff-hunks to the index for commit. We can split hunks to smaller pieces and choose among those.

git commit --amend : add the things in the index to the last commit. Really nice little thing when you forgot to add something and have not published your changes yet. In general, the ability to mangle the history of unpublished branches turns out to be very nice.

git stash : A set of commands that lets you stash changes away and retrieve them later. It is akin to using quilt or mercurial queues. This is one of those features you can't live without when you have tried it.

There is one work directory and git checkout (branch change) changes the files in the current work directory rather than having a directory per branch. If you have a big project with gigabytes of source code, this begins to matter I assume.

So what is the index then? The index is a temporary staging area for your commit. The things added to the index are the things the next commit (or other) command will work on. The neat thing about this is that other commands do 'the-right-thing'. A git-diff, for instance, by default diffs not from the latest revision, but from the latest revision + things in the index.

This gives another kind of development where you do some work, get satisfied with it and shove it into the index. Then you do some more work, get satisfied with that and shove that into the index. At some point you have some working code that ought to be published and then you commit. Of course there are commands to pull things back, forth and around in the index so you can tidy up things.

My disclaimer is, however, that I never tried bzr for anything real. When I had the choice, bzr was still rummaging in a slow format, so that left me with mercurial and git. I tried them both and git won.

FelipeC said...

If git is used in the repositories, and bzr users are expected to use bzr-git, then the 'usability' approach of bzr is not relevant at all.

If you do it the other way; bzr in the repositories and git-bzr for git users, again, the 'usability' is not a relevant factor.

The only relevant factor is the repository format.

The sha-1 issue is not a problem. If sha-1 is broken, then another hash method should be used, regenerate the repository with the new method and that's it. You would have to do the same for bzr.

You say that the fact that it makes it incompatible with existing repos is a bad thing, but isn't that exactly what you want? The old repos are using a broken hash, so they should be updated too, in the meantime they can't be trusted.

The alternative tree representation is something that I don't think will be useful at all in GNOME. No need to support such thing in git.

Git finds renames just fine. In fact it's also more user friendly, because you don't need to remember to call 'git-mv' every time. Also precisely because of this, git is able to import repositories from other VCS without much trouble: renames/copies are found automatically.

If git indeed was unable to represent something *useful* of bzr then your suggestion to use bzr as the backend would make sense, but I don't think you have presented something git can't represent.

For the needs of GNOME both git/bzr repository formats could be used, but git's format is just elegant, and it's efficient right now, so it's a good choice.

Santi said...

Apart from the other things that have been said, I have trouble understanding how Authenticating revisions and Data Integrity work.

As the commit-id is a UUID I cannot trust that the same contents are extracted from my repository and yours (in the same way that you can change the hash used in bazaar without changing the UUID).

You said "All revisions can be PGP-signed. No signed revision can be forged". I assume what you signed was a hash (the UUID is useless in this context). But then, if the hash is broken, Bazaar and Git are in the same position, as both must upgrade their repositories.

So I don't see the advantage of "It provides an upgrade path, and doesn't change the commit-id of a revision if you change your format", because if I cannot trust the commit-id, what can I trust?

Santi said...

Some more comments:

I've searched a bit more about Bazaar signatures and such, and I've found this:

http://blogs.gnome.org/jamesh/2007/10/04/signed-revisions-with-bazaar/

It says that you can sign changesets, as defined by a revision-id (UUID) and a hash (the SHA-1 of the testament) of:
1. The revision ID
2. The name of the committer
3. The date of the commit
4. The parent revision IDs
5. The commit message
6. A list of the files that comprise the source tree for the revision, along with SHA1 sums of their contents
7. Any revision properties

So, you can only trust that you got the same _contents_ for a revision-id if you have a signature (and you trust the gpg key). I'm not talking about who made it, but about what you got on your hard disk.

In contrast, in git with just the commit id you can trust that you got the same _contents_. Another story is who made them or signed them, and that is what a signed tag is about.

Aaron Bentley said...

santi:

Yes, Bazaar revision signatures are much more like Git's signed tags. Their purpose is to confirm who committed the change. We don't think that providing a cryptographically-secure commit-id is terribly useful, and certainly not worth the trade-offs.

FelipeC said...

santi: that's why in git the signatures are part of the content. That way if the sha1 is the same, it means the contents are the same, and the signatures are the same.

Santi said...

Aaron Bentley:

So we disagree on what is a "terribly useful" feature. I find it "terribly useful" to be able to trust the content of a revision just with the identifier. And I don't see the "trade-offs" as significant enough to invalidate the above.

And, as both systems have to be upgraded when the hash is broken, I don't see the benefits.

john said...

Steven,

You wrote:

$ git mv foo.c bar.c
$ git mv foo.h bar.h

And then you said Git detects the moves!!

I mean, you run git mv, why shouldn't it detect it if you explicitly said it has to move the files??

john said...

Pau,

You wrote:

Wrong. Git refers to contents, not to files. Git does not attempt to track renames because it does not care about files. You could take two files, merge them out of git, commit the new file to git, and git would notice where that new file came from and show you its history

Could you please write an example?

wtachi said...

Actually, it works without git mv:

git init
cat > foo.h << EOF
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <signal.h>
EOF
cat > foo.c << EOF
#include "foo.h"

main() {
printf("hello there kind sir\n");
}
EOF
git add .
git commit -m 'initial'
mv foo.c bar.c
mv foo.h bar.h
git rm foo.c foo.h
git add bar.c bar.h
sed -e 's/foo.h/bar.h/g' -i bar.c
git add bar.c
git commit -m 'renamify'

Git will detect that the file was moved even though it was changed slightly.

"You could take two files, merge them out of git, commit the new file to git, and git would notice where that new file came from and show you its history."

I tested this, and Git will not detect that the file was copied from an old revision. (I'm not too familiar with the Git internals, but I'm guessing it just ends up with two trees pointing to the same blob, but completely different history information)

Chris Wellons said...

I know I'm late to the party here, but I just wanted to point out to john at the bottom here that "git mv" is essentially an alias for "mv". It has no effect on the repository.

From the Git FAQ, "Git has a rename command git mv, but that is just a convenience. The effect is indistinguishable from removing the file and adding another with different name and the same content."