Friday 11 July 2008

Bazaar has the model right

Some people in the GNOME community have suggested that if Bazaar has nice usability, then GNOME can just use Git on the back-end, and Bazaar lovers can just use the Git back-end via Bazaar. It's true that Bazaar could support this — an experimental plug-in exists to do this right now. But this suggestion betrays several wrong assumptions.

People assume Git and Bazaar are the same. They're not. People assume that if Git and Bazaar have technical differences, then Git must have it right.

The problem with these assumptions is that usability begins at the ground level. Bazaar started with a focus on usability. Git began with a focus on speed. The data models of both Bazaar and Git reflect their initial focus. But Bazaar's model can also be fast. In fact, the Bazaar developers are currently optimising a number of key operations for speed.

Data retrieval

Git and Bazaar are both key/value mapping systems. When bytes are needed, they are requested with that key.

The big difference is that Git's keys are also the hashes of the bytes. This is why it's called a content-addressable file system. This allows Git to offer a guarantee that if the value hashes to the key, it has not been modified, whether deliberately or by accident. The Bazaar team considered adopting this approach, but decided it was too constricting. Bazaar uses UUIDs instead.
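The difference between the two keying schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the storage code of either tool: the class names `ContentAddressedStore` and `IdAddressedStore` are hypothetical, `uuid4` is a stand-in for however Bazaar really generates its ids, and the sketch hashes the raw bytes, whereas Git hashes a small header plus the content.

```python
import hashlib
import uuid


class ContentAddressedStore:
    """Git-style idea: the key is the hash of the value."""

    def __init__(self):
        self._data = {}

    def put(self, value: bytes) -> str:
        key = hashlib.sha1(value).hexdigest()
        self._data[key] = value
        return key

    def get(self, key: str) -> bytes:
        value = self._data[key]
        # The key doubles as the integrity check.
        if hashlib.sha1(value).hexdigest() != key:
            raise ValueError("value does not hash to its key")
        return value


class IdAddressedStore:
    """Bazaar-style idea: the key is an opaque unique id (assumed uuid4 here)."""

    def __init__(self):
        self._data = {}

    def put(self, value: bytes) -> str:
        key = uuid.uuid4().hex
        # The hash is stored next to the value instead of being the key.
        self._data[key] = (hashlib.sha1(value).hexdigest(), value)
        return key

    def get(self, key: str) -> bytes:
        digest, value = self._data[key]
        if hashlib.sha1(value).hexdigest() != digest:
            raise ValueError("value does not match its stored hash")
        return value


git_like = ContentAddressedStore()
bzr_like = IdAddressedStore()

# Identical bytes always get the same key in the content-addressed store...
same_key_twice = git_like.put(b"hello") == git_like.put(b"hello")
# ...but two separate ids in the id-addressed store.
different_ids = bzr_like.put(b"hello") != bzr_like.put(b"hello")
```

In both stores a corrupted value is detectable on read; the only thing that differs is whether the key itself is tied to the bytes.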

Authenticating revisions

For detecting malicious modification of revisions, Git uses its cryptographic hashes.

Bazaar uses revision-signing. All revisions can be PGP-signed. No signed revision can be forged. And the hashed representation can easily be generated and passed around to ensure that exactly the same content is used.

If SHA-1 is broken, both Bazaar and Git will lose their ability to detect malicious modification. But since Bazaar uses UUIDs to identify revisions, users can re-sign their old revisions with whatever method proves to be secure. Changing the hash used by Git would make it incompatible with all existing repositories.

Data Integrity and Serialization formats

Bazaar stores hashes of every value, so it is equally capable of detecting accidental modification. It can be useful to have different representations of a tree in different repositories. For example, when Git lists files, it divides this data by directory. This is a good approach, but not necessarily the best approach. An alternative approach would be to use a radix tree. This would ensure that Git performed quickly even if users put unreasonable numbers of files in a single directory. But because Git's keys are hashes, upgrading Git's format to use radix trees would change the keys, which means that people could not use the commit-id from one repository to refer to the same tree in another form.

Bazaar doesn't assume it has the perfect format. It provides an upgrade path, and doesn't change the commit-id of a revision if you change your format. What's more, Bazaar can even reference data it has never seen. This allows partial imports from other VCSes to be fully compatible with more complete imports. And if a VCS provides UUIDs (content hashes certainly qualify as UUIDs), Bazaar can refer to those UUIDs directly.
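A rough sketch of why a format change disturbs hash keys but not assigned ids. Everything here is made up for illustration: the two serializers are hypothetical formats (neither is Git's or Bazaar's real one), the tree contents are placeholder strings, and `uuid4` again stands in for a real revision id.

```python
import hashlib
import json
import uuid

# A tiny "tree": path -> placeholder for the file's content hash.
tree = {"src/foo.c": "aaa", "src/foo.h": "bbb", "README": "ccc"}


def serialize_flat(tree):
    """Hypothetical format 1: one flat, sorted list of paths."""
    return json.dumps(sorted(tree.items())).encode()


def serialize_by_directory(tree):
    """Hypothetical format 2: entries grouped by directory."""
    grouped = {}
    for path, digest in tree.items():
        directory, _, name = path.rpartition("/")
        grouped.setdefault(directory, {})[name] = digest
    return json.dumps(grouped, sort_keys=True).encode()


def content_key(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


# Content-addressed: the key depends on the bytes, and so on the format.
key_flat = content_key(serialize_flat(tree))
key_grouped = content_key(serialize_by_directory(tree))

# Id-addressed: the id is assigned once and survives a format change.
revision_id = uuid.uuid4().hex
repo_old_format = {revision_id: serialize_flat(tree)}
repo_new_format = {revision_id: serialize_by_directory(tree)}
```

The same logical tree ends up under two different content keys, while both repositories in the second half can still talk about it using the one `revision_id`.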

File and directory representation

Git refers to files by path. It makes no attempt to track renames in its data store.

Bazaar has an inode abstraction; files and directories both have ids. When a file is renamed, its id stays the same. Bazaar's core code refers to files by their id, so merging a renamed file requires no special effort.
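As a minimal sketch of the idea (not Bazaar's actual inventory code), suppose each branch keeps a mapping from file id to current path, and content edits are recorded against the id. The ids, the `merge_edits` helper and the file contents below are all hypothetical.

```python
# Hypothetical inventory at the common ancestor: file id -> path.
base = {"id-1": "foo.c", "id-2": "foo.h"}

# Branch A renames both files; the ids stay the same.
branch_a = {"id-1": "bar.c", "id-2": "bar.h"}

# Branch B edits the file it still calls foo.c; the edit is keyed by id.
edits_in_b = {"id-1": "int main(void) { return 0; }\n"}


def merge_edits(inventory, edits):
    """Apply edits keyed by file id to whatever path each id has now."""
    return {inventory[file_id]: content for file_id, content in edits.items()}


# Merging B's edit into A lands on the renamed path without any guessing.
merged = merge_edits(branch_a, edits_in_b)
```

Because the lookup goes through the id, the merge never has to ask whether bar.c "looks like" the old foo.c.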

Git's approach means that users are warned not to rename files while changing their content. But when files are renamed, those files that refer to the renamed files must have their contents changed as well. For example, if you rename foo.h and foo.c to bar.h and bar.c, you should update the contents of bar.c, or else you will break the build. With Bazaar, users can do whatever they want, and the VCS just works. While Git must always use heuristics to deduce renames, Bazaar does not have to. Of course, it can if it wants to. This is an example of why it is important to design a model for usability from the beginning.

Bazaar can import rename data losslessly from foreign VCSes. Some other VCSes support file-ids, and Bazaar can reuse those without change. For VCSes that support renames, but not file-ids, Bazaar's representation is also non-lossy. When data imports are deterministic and non-lossy, it's easy to export them back to their source VCS. Bazaar's Subversion integration is a great example of how this can work.

Choose the back-end with the right model

In any situation it makes sense to use a back-end that stores the richer dataset. It makes more sense to have a front-end client that doesn't use all the functionality or data representation of the back-end than to have a richer client that cannot store the information it needs because the back-end cannot represent it.

If a single back-end storage is going to be used, it makes more sense to use a Bazaar back-end as Bazaar is able to represent everything that Git does, but the reverse is not true.

Conclusion

The Bazaar developers focused on usability, which requires having a model that supports usability. Bazaar has improved its model to increase the usability of the system. We believe that Bazaar has the right model.

co-written by Aaron Bentley and Tim Penhey

Thursday 10 July 2008

Re: Interesting (rough) statistics at the GNOME distributed RCS BOF

J5 mentioned in his post his interpretation of the number of users for GIT, Bazaar and Hg (Mercurial). He also finishes with "Converse amongst yourselves".

I guess I should first point out that I am a Bazaar user, and that I work for Canonical. I felt somewhat enraged at the post from J5, and have spent some time trying to work out some response.

John Carr mentioned that 83% of statistics are made up on the spot, and that cannot be more true here. I had been waiting for someone else to post the numbers that they saw at the BOF, but so far I have not seen one.

Here is my take on it.

Yes, there were more GIT users than Bazaar users at the BOF, but the numbers were more like 50% of the audience were GIT users, and about 40% were Bazaar users. Someone piped up and said "What about Mercurial?" and so the question was asked, and there were about five or six people. The GIT and Bazaar groups overlapped, and by far the larger part of the audience had not used any DVCS.

What conclusions can we draw from this? Not much. Many people attending the pre-conference work for larger companies, like Red Hat, Novell, and Nokia, and many of those people work on some hard core Linux stuff, many of which have chosen GIT. Many have chosen GIT because that is what the Linux kernel is using. Is that a good reason to choose a DVCS? I don't feel that we can really answer that question as I am sure there are strong advocates for both sides.

An interesting question is "Which DVCS is easier for the casual contributor to use?" Surely one of the reasons that a project chooses a DVCS is to allow for more community contributions in an easy to merge way that has a clear contribution history. Bazaar just works. It works for the hard-core developers, but is also easy for those soft-core (?).

What I hear from the people I talk to, and I've tried to talk to many here, is that for those who use Bazaar, it just works. Bazaar doesn't get in your way of developing the software that you are working on. It is just a tool that works.

One final point. The questions were "Who uses <insert DVCS>?", not "Who likes/loves using <insert DVCS>?".

Saturday 28 June 2008

Shelving looms

For a feature that I'm currently working on, I decided to try out the loom plugin. Looms have been around for a little while, but I just hadn't gotten around to trying them out.

We have code reviews of all work that is to be merged. Part of this process is to try to limit changes to 800 lines of unified diff. We have found that when the branches have more changes than this the time to review the branch increases non-linearly with the increase in line count. In the past in order to break up a "chunky" branch I would branch from trunk for the first part:

$ bzr cbranch trunk feature-part-1

(I use cbranch as I don't have my working trees in my repository. This is another story to write about. cbranch is found in bzrtools.)

Once this part was complete, I would branch from that for part-2:

$ bzr cbranch feature-part-1 feature-part-2

Complications come in when I want to bring an updated trunk into my branch for part-2, because I can no longer simply generate a diff of just the part-2 changes. This problem propagates if I need three or four parts to implement the feature.

Enter looms. Looms provide a new branch format for Bazaar. To convert your branch to the new format, you use the command loomify. You can then create threads of your loom. Each thread is like another branch.

So, the process goes something like this:

$ bzr cbranch trunk my-new-feature
$ cd my-new-feature
$ bzr loomify
$ hack, commit, hack, commit, hack, commit
$ bzr create-thread next-part-of-feature

Creating a thread is like creating a new branch which has the same revisions as the last thread.

$ hack, commit, hack, commit, et al

However, after several hours of hacking away, and several diversions in the code that needed fixing, I checked the size of the branch.

$ bzr diff -r thread:

(The loom plugin adds a revision specifier to easily allow things like this to see all the changes that were introduced by the current thread.)

Oh, damn. The resulting size was about 1100 lines. Now while our 800 line limit isn't set in stone, it is considered a bit rude if you could have broken it up but didn't. Stepping through the diff I identified three distinct chunks of work that could be broken out for review. My question was "now that I have all this in a loom, and I have a vested interest in keeping the loom as the current work is based on an earlier thread, how the hell am I going to break this work up?" Shelve to the rescue. Shelve is also found in bzrtools.

I also wanted to have the threads named reasonably, and unfortunately there is no easy way to rename a thread right now. I wanted to have the threads named alpha, bravo and charlie (well, not really, but you get the picture). The first step is to create alpha and get rid of the current thread.

$ bzr create-thread alpha
$ bzr down-thread

create-thread takes you to the new thread too, so I use down-thread to go back to the thread I was working on before.

$ bzr combine-thread

This effectively discards the current thread. The assumption is that the changes from this thread had been merged into the lower thread through merging another branch. This hasn't happened in this case, and discarding is what we want here. So now I'm at the state where I was before except my thread is now called "alpha". Now to break out the changes for alpha.

$ bzr create-thread bravo
$ bzr down-thread

I created a thread bravo. This is also at the state where all three parts are there and working. Next I went back to the "alpha" thread. Now we use shelve with the revision specifier that looms introduce. By default, shelve only lets you shelve (or put to one side) the uncommitted changes.

$ bzr shelve -r thread:

Now I get lots of questions: do I want to shelve each of the chunks? I shelve all changes that are unrelated to the alpha feature. What I'm left with after this command is a working tree as it would be if I had just written the alpha feature on a clean thread. I checked the results with:

$ bzr diff -r thread:

Looks good, so commit.

$ bzr commit -m "Shelved changes unrelated to alpha."

Now for the magic.

$ bzr up-thread

This takes me back to "bravo". However up-thread also merges in the changes from the thread below. Now my tree is showing that all the changes relating to bravo and charlie have been removed. The actual merge magic is done with this command:

$ bzr revert .

Take special note of the dot. Without the dot the revert would revert the entire merge. I don't want this. I just want to revert the changes to the tree. I need bzr to remember that I have merged the changes from the thread below. This is exactly what the "revert ." does. The changes to the tree are reverted, but the merge isn't. Next you need to commit.

$ bzr commit -m "Merge from alpha while splitting up the changes."

Now I have the alpha thread with just the changes needed to implement alpha, and a bravo thread that appears to introduce the bravo and charlie features. I also have a .shelf directory (created by shelve). Since I have no intention of unshelving these changes (as they are already there), I delete this directory. I'm not sure if this is strictly necessary, but I like to run a clean shop.

To break apart the bravo and charlie features I repeated the process. The end result was three separate threads that each appear to introduce a single feature.

Phew.

One point of caution. Sometimes in the breaking apart, you don't always get a clean break. In these situations you need to keep more than you need (i.e. don't shelve that change), and once the revert and commit is done, then go back to the earlier thread and clean up. If you try to do it earlier, the changes will be thrown away in the revert dot command, and then it just gets messier.

All in all I'm really enjoying working with looms. I currently have about eight threads, and will probably need another four or five to finish the feature, but this way it is dead easy to keep the changes small and distinct and simple to review.

Sunday 8 June 2008

bzr alias

I now officially have some of my code in an open source project where the work was actually done entirely on my own time. Despite being involved with professional software development for almost twenty years, I've normally kept my programming to work hours, or private projects.

This time though, it was different. This time it was truly scratching an itch. I've become somewhat of a Bazaar convert. Bazaar has for a long time allowed you to define command aliases in your bazaar.conf configuration file. I have always used bash aliases for commands that I do really often, so it seemed natural to me to define aliases for bazaar for commands that I used often as well.

Next came the internal conflict. I am an inherently lazy person. This is why I like aliases: less typing. One of the things that bugged me was having to actually edit the configuration file any time I wanted to add or modify an alias. It bugged me so much it caused me to actually do something about it. Luckily for me Bazaar is written primarily in Python, and this just happens to be my current favourite language.

The code is now committed to trunk, and should be available generally in bzr 1.6.


  • bzr alias — lists your current aliases

  • bzr alias commit="commit --strict" — sets the alias for commit

  • bzr alias commit — print out the current alias for commit

  • bzr alias --remove commit — removes the commit alias


Obviously if you use aliases as much as I do, one of the first things you'll do is set the alias for unalias.

bzr alias unalias="alias --remove"
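For the curious, here is a rough Python sketch of what alias handling boils down to. It is not the code that went into bzr; `parse_alias_definition` and `expand` are hypothetical helpers, and the sketch assumes an alias replaces only the command word and is expanded once (so that commit="commit --strict" does not loop).

```python
import shlex


def parse_alias_definition(text):
    """Split 'commit="commit --strict"' into ('commit', 'commit --strict')."""
    name, _, value = text.partition("=")
    return name.strip(), value.strip().strip('"')


def expand(argv, aliases):
    """Replace the command word with its alias, keeping the other arguments.

    Expansion happens once only, so an alias may refer to its own name.
    """
    if not argv or argv[0] not in aliases:
        return list(argv)
    return shlex.split(aliases[argv[0]]) + list(argv[1:])


aliases = dict(
    parse_alias_definition(line)
    for line in ['commit="commit --strict"', 'unalias="alias --remove"']
)

# "bzr unalias commit" would be treated as "bzr alias --remove commit".
expanded = expand(["unalias", "commit"], aliases)
```

With these definitions, `expand(["commit", "-m", "msg"], aliases)` yields the commit command with `--strict` slotted in before the original arguments.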

My current aliases (that I'll tell you about):

tim@slacko:~/src/bzr/alias-command$ ./bzr alias
bzr alias cbranch="cbranch --lightweight --hardlink"
bzr alias col="checkout --lightweight"
bzr alias commit="commit --strict"
bzr alias lastdiff="diff -r-2..-1"
bzr alias lastlog="log -r-2..-1"
bzr alias ll="log --line -r-10..-1"
bzr alias my-missing="missing --mine-only"
bzr alias unalias="alias --remove"

Saturday 17 May 2008

Code in Launchpad

Launchpad offers many things to developers, and open source software developers in particular. One of these things is the ability to host Bazaar branches. For those that have looked a little deeper, they will have noticed that there are four types of branches in Launchpad: Hosted; Mirrored; Remote; and Imported. Hmm, this isn't really what I was intending to talk about at all, but I'm going to go with the flow.

Hosted branches are those where Launchpad is the primary public location of the branch. Hosted branches are normally created by pushing a branch directly to Launchpad. Before you do that though, you need to have registered on Launchpad, and supplied an SSH key. This is how Launchpad knows who you are. There are two ways you can push a branch to Launchpad: one is via SFTP; and the other using the Bazaar smart server (bzr+ssh).

As an example I'm going to use my alias-command bzr branch. The complete SFTP location would be sftp://thumper@bazaar.launchpad.net/~thumper/bzr/alias-command, and the smart server one bzr+ssh://thumper@bazaar.launchpad.net/~thumper/bzr/alias-command. These are a bit unwieldy, so we extended the lp type urls for bzr to support writing if the launchpad plug-in knows who you are. In order for you to do this you use the lp-login command. bzr lp-login will tell you the username that is currently set. If you have not done this yet, you'll see a message like "No Launchpad user ID configured." I set mine by saying bzr lp-login thumper. This stores thumper as the launchpad_username in the bazaar.conf file. This also means I can use bzr push lp:~thumper/bzr/alias-command to push to my hosted Launchpad branch.
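The shortening can be pictured with a small Python sketch. This is an assumption-laden illustration rather than the plug-in's real code: `resolve_for_push` is a hypothetical helper, the host name is simply the one from the URLs quoted above, and raising an error when no username is set is a simplification of whatever the plug-in actually does in that case.

```python
HOST = "bazaar.launchpad.net"  # host taken from the URLs quoted above


def resolve_for_push(lp_url, username=None):
    """Expand lp:~owner/project/branch into a full bzr+ssh location.

    Hypothetical helper for illustration only.
    """
    if not lp_url.startswith("lp:"):
        raise ValueError("not an lp: URL")
    if username is None:
        # Simplification: the sketch just refuses to build a writable URL.
        raise ValueError("No Launchpad user ID configured.")
    path = lp_url[len("lp:"):].lstrip("/")
    return "bzr+ssh://%s@%s/%s" % (username, HOST, path)


full_url = resolve_for_push("lp:~thumper/bzr/alias-command", "thumper")
```

Here `full_url` comes out as the long smart-server location given earlier, which is the whole point of the short form.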

Mirrored branches allow you to have your branches stored publicly in some location that you control, and you let Launchpad know where this is. Launchpad will then update its copy of your branch every six hours. This is handy if you don't have an SSH key, or you have a slow network connection, or you just like having your branches available on your own server.

Remote branches are a bit different. Remote branches were sort of created out of necessity. Some people were registering mirrored branches with unreachable locations. Some of these were possibly by mistake, but quite a few were obviously inaccessible. But more strange is that those branches were linked to bugs or blueprints. There was obviously a desire to have branch meta-data there, but not actually allow Launchpad to get access to the branches. So we have remote branches. You cannot get a copy of a remote branch from Launchpad as Launchpad does not have a copy of it.

Imported branches are those branches where Launchpad gets the code from either CVS or Subversion, and puts it into a Bazaar branch. I was really wanting to talk about this as I saw two projects recently where we are importing code that I didn't know about. One is my favourite music player, Amarok, and the other was MPlayer. Just out of curiosity I looked at both of these branches on Launchpad. The Amarok one has 12195 revisions as I'm writing this, and the last revision was 11 hours old, and MPlayer had even more revisions, at 26761. However that isn't even the cool bit. What is really nifty is you can go bzr branch lp:amarok or bzr branch lp:mplayer to get the code. Just to check I did just that, and got a copy of the amarok source. It was the first bit of C++ I had looked at in a long time (it used to be all I did).

Anyway, that was what I really wanted to say. Oh yeah, and bzr rocks.

Saturday 10 May 2008

The Launchpad branch directory service

Recently Bazaar grew a branch directory service. This allows plug-in developers to define custom "protocols" that resolve the branch names into some other branch location.

The Launchpad plug-in defines a protocol "lp". Launchpad uses the other parts of the URL to relate to projects, series or individual's branches. The shortest valid URL for Launchpad is something like lp:do. You can also use an empty authority (or site part of a URL), so lp:///do is exactly the same, just longer. Personally I prefer the one without all the slashes.

The do part of the branch location relates to the GNOME Do project on Launchpad. There is a little magic (a.k.a. configuration) that is needed to make this work. Projects in Launchpad get an initial development focus series created for them. This is intended to represent the line of development where current or new work goes. In order to have the code available through the Launchpad directory service, the code has to be available through Launchpad as a normal branch.

Once a branch has been either hosted, mirrored or imported for the project, one of the people responsible for the project in Launchpad can relate the branch with the series. Once this is done the branch is easily accessible. People that have permission to make these links will be shown a link on the main code tab for their project (we don't taunt people who can't make the links with an invalid option).

If there is another series, say 1.0 for our project fooix, and we have a branch associated with it, then we could get the 1.0 branch using lp:fooix/1.0. Normal Launchpad branches are also accessible through the lp protocol as lp:~username/project/branch-name.
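The three shapes of lp: location described in this post can be told apart purely by their form. The sketch below does only that; `parse_lp` and the "kind" labels are hypothetical, and the real resolution to an actual branch is done by Launchpad itself, which this code does not attempt.

```python
def parse_lp(location):
    """Classify an lp: location by shape. Hypothetical helper for illustration."""
    if not location.startswith("lp:"):
        raise ValueError("not an lp: location")
    # lp:///do and lp:do mean the same thing, so drop any leading slashes.
    path = location[len("lp:"):].lstrip("/")
    parts = path.split("/")
    if parts[0].startswith("~"):
        if len(parts) != 3:
            raise ValueError("expected ~username/project/branch-name")
        return {"kind": "branch", "owner": parts[0][1:],
                "project": parts[1], "branch": parts[2]}
    if len(parts) == 1 and parts[0]:
        # Project only: the branch linked to the development focus series.
        return {"kind": "development-focus", "project": parts[0]}
    if len(parts) == 2:
        return {"kind": "series", "project": parts[0], "series": parts[1]}
    raise ValueError("unrecognised lp: location")


short_form = parse_lp("lp:do")
long_form = parse_lp("lp:///do")
series_form = parse_lp("lp:fooix/1.0")
branch_form = parse_lp("lp:~username/project/branch-name")
```

The short and long spellings of the GNOME Do location parse to the same result, which matches the claim that the empty authority makes no difference.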