Friday 11 July 2008

Bazaar has the model right

Some people in the GNOME community have suggested that if Bazaar has nice usability, then GNOME can just use Git on the back-end, and Bazaar lovers can just use the Git back-end via Bazaar. It's true that Bazaar could support this — an experimental plug-in exists to do this right now. But this suggestion betrays several wrong assumptions.

People assume Git and Bazaar are the same. They're not. People assume that if Git and Bazaar have technical differences, then Git must have it right.

The problem with these assumptions is that usability begins at the ground level. Bazaar started with a focus on usability. Git began with a focus on speed. The data models of both Bazaar and Git reflect their initial focus. But Bazaar's model can also be fast. In fact, the Bazaar developers are currently optimising a number of key operations for speed.

Data retrieval

Git and Bazaar are both key/value mapping systems. When bytes are needed, they are requested with that key.

The big difference is that Git's keys are also the hashes of the bytes. This is why it's called a content-addressable file system. This allows git to offer a guarantee that if the value hashes to the key, it has not been modified, whether deliberately or by accident. The Bazaar team considered adopting this approach, but decided it was too constricting. Bazaar uses UUIDs instead.

Authenticating revisions

For detecting malicious modification of revisions, Git uses its cryptographic hashes.

Bazaar uses revision-signing. All revisions can be PGP-signed. No signed revision can be forged. And the hashed representation can easily be generated and passed around to ensure that exactly the same content is used.

If SHA-1 is broken, both Bazaar and Git will lose their ability to detect malicious modification. But since Bazaar uses UUIDs to identify revisions, users can re-sign their old revisions with whatever method proves to be secure. Changing the hash used by Git would make it incompatible with all existing repositories.

Data Integrity and Serialization formats

Bazaar stores hashes of every value, so it equally capable of detecting accidental modification. It can be useful to have different representations of a tree in different repositories. For example, when Git lists files, it divides this data by directory. This is a good approach, but not necessarily the best approach. An alternative approach would be to use a radix tree. This would ensure that Git performed quickly even if users put unreasonable numbers of files in a single directory. But Git's keys are hashes, upgrading Git's format to use radix trees would change the keys, which means that people could not use the commit-id from one repository to refer to the same tree in an other form.

Bazaar doesn't assume it has the perfect format. It provides an upgrade path, and does't change the commit-id of a revision if you change your format. What's more, Bazaar can even reference data it has never seen. This allows partial imports from other VCSes to be fully compatible with more complete imports. And if a VCS provides UUIDs (content hashes certainly qualify as UUIDs), Bazaar can refer to those UUIDs directly.

File and directory representation

Git refers to files by path. It makes no attempt to track renames in its data store.

Bazaar has an inode abstraction; files and directories both have ids. When a file is renamed, its id stays the same. Bazaar's core code refers to files by their id, so merging a renamed file requires no special effort.

Git's approach means that users are warned not to rename files while changing their content. But when files are renamed, those files that refer to the renamed files must have their contents changed as well. For example, if you rename foo.h and foo.c to bar.h and bar.c, you should update the contents of bar.c, or else you will break the build. With Bazaar, users can do whatever they want, and the VCS just works. While Git must always use heuristics to deduce renames, Bazaar does not have to. Of course, it can if it wants to. This is an example of why it is important to design a model for usability from the beginning.

Bazaar can import rename data losslessly from foreign VCSes. Some other VCSes support file-ids, and Bazaar can reuse those without change. For VCSes that support renames, but not file-ids, Bazaar's representation is also non-lossy. When data imports are deterministic and non-lossy, it's easy to export them back to their source VCS. Bazaar's Subversion integration is a great example of how this can work.

Choose the back-end with the right model

In any situation it makes sense to use a back-end that stores the richer dataset. It makes more sense to have a front end client that doesn't use all the functionality or data representation of the back-end than it does to have a richer client that isn't able to store the required information as the back-end is not able to represent it.

If a single back-end storage is going to be used, it makes more sense to use a Bazaar back-end as Bazaar is able to represent everything that Git does, but the reverse is not true.

Conclusion

The Bazaar developers focused on usability, which requires having a model that supports usability. Bazaar has improved its model to increase the usability of the system. We believe that Bazaar has the right model.

co-written by Aaron Bentley and Tim Penhey

Thursday 10 July 2008

Re: Interesting (rough) statistics at the GNOME distributed RCS BOF

J5 mentioned in his post his interpretation of the number of users for GIT, Bazaar and Hg (Mercurial). He also finishes with "Converse amongst yourselves".

I guess I should first point out that I am a Bazaar user, and that I work for Canonical. I felt somewhat enraged at the post from J5, and have spent some time trying to work out some response.

John Carr mentioned that 83% of statistics are made up on the spot, and that cannot be more true here. I had been waiting for someone else to post the numbers that they saw at the BOF, but so far I have not seen one.

Here is my take on it.

Yes there were more GIT users than Bazaar users at the BOF, but the numbers were more like 50% of the audience were GIT users, and about 40% were Bazaar users. Someone piped up and said "What about Mercurial?" and so the question was asked, and there were about five or six people. There was an overlap of the GIT and Bazaar groups, and there was by far the larger majority of the audience that had not used any DVCS.

What conclusions can we draw from this? Not much. Many people attending the pre-conference work for larger companies, like Red Hat, Novell, and Nokia, and many of those people work on some hard core linux stuff, many of which have chosen GIT. Many have chosen GIT because that is what the linux kernel is using. Is that a good reason to chose a DVCS? I don't feel that we can really answer that question as I am sure there are strong advocates for both sides.

An interesting question is "Which DVCS is easier for the casual contributor to use?" Surely one of the reasons that a project chooses a DVCS is to allow for more community contributions in an easy to merge way that has a clear contribution history. Bazaar just works. It works for the hard-core developers, but is also easy for those soft-core (?).

From the people I talk to, and I've tried to talk to many here, is that of those that use Bazaar it just works. Bazaar doesn't get in your way of developing the software that you are working on. It is just a tool that works.

One final point. The questions were "Who uses <insert DVCS>?", not "Who likes/loves using <insert DVCS>?".