How Bazaar

Thursday, 18 June 2009

You're doing it wrong!

Just yesterday I found a missing feature in one of the apps I just started using. My thought processes were something along the lines of “hey, I could add this feature and it would be good”. So I went to the project's website, found their source code repository, and got blown away by the comment that was with it:

Please note that code you get from this repository is not intended for productive use (unless it's tagged as a released version, of course, in which case the usual alpha/beta disclaimers apply ;-)). We like to break our codebase, config files, database schemas and all kinds of stuff. We sometimes commit non-compiling revisions to facilitate collaborative development. Running such an unstable version might trash your settings, your backlog and maybe your computer. You have been warned!

Eh? OK, I get the first sentence. It is even a good disclaimer. Tagged releases are more stable. People regularly commit code that is unpolished. Sometimes even with some known bugs or issues.

The second sentence has me going “NO!?! What are you doing?”

The third sentence just blew my mind. This project is using a DVCS. Not my DVCS of choice, but really that doesn't matter. All DVCSs are made to have good merging and sharing of code between developers. Saying “We sometimes commit non-compiling revisions to facilitate collaborative development” is just a lack of understanding of how to use the tools. You are using a DVCS to facilitate collaborative development! This is centralised version control thinking.

Try this for a code to work by:

Trunk should always at least compile, run, and pass all the tests.

This hasn't stopped me wanting to work on the code, but it has raised my caution levels.

Tuesday, 16 June 2009

kiwipycon

Today NZPUG held yet another organisation meeting for the first kiwipycon. Organising conferences takes a lot of effort by many dedicated people. The Christchurch python user group has volunteered to host the first PyCon in New Zealand. Personally I suggest things from time to time, but a big thanks goes out to those guys for the hard work that has gone on even before the call for papers.

The dates have been set as Saturday the 7th and Sunday the 8th of November 2009. A weekend was chosen to allow those working or studying who can't get leave to attend. As I understand it they are still working on pricing. The call for papers will probably go out next month some time.

Should be interesting...

Monday, 15 June 2009

launchpadlib updates for branches

Just a quick note to let you know of a few changes to launchpadlib for branches. Mainly because I've removed a method that I know someone is using.

You used to be able to get to the branches for a project by saying my_project.branches, but I've removed this. It would have been nicer to deprecate it but we don't have a nice deprecation method right now for launchpadlib, and since it is still in beta, I didn't feel too bad.

The branches of a project used to be an attribute; now we have a getBranches method. The old attribute would give you all the branches of the project, including the merged and abandoned ones. The method defaults to giving you the active branches, and allows you to pass in the statuses that you'd like to get.

Also with this change you can now get the branches for a project group, or the branches owned by a person using the same getBranches method call.
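Here is a minimal sketch of the change, assuming `project` is an entry you have already fetched through launchpadlib. It relies only on the getBranches method described above. The keyword name `status` and the status strings are my assumptions about the web service and are not confirmed in this post, and both helper names are hypothetical.

```python
def active_branches(project):
    """Return what getBranches() gives by default: the active branches."""
    return list(project.getBranches())


def branches_with_status(project, statuses):
    """Return the branches in the given statuses.

    Assumption: statuses are passed with a ``status`` keyword as a list
    of strings such as "Merged" or "Abandoned".
    """
    return list(project.getBranches(status=list(statuses)))
```

Since project groups and people grew the same method, the same helpers should work unchanged when handed one of those instead of a project.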

Project groups also grew the method getMergeProposals in the same way that the method was already available for people and projects.

Please file any bugs on the launchpad-code project on Launchpad.

Wednesday, 11 March 2009

lp:mad

One of the things that I have spent quite a lot of time on recently is the code review stuff in Launchpad. Recently, as of the 2.2.2 release, new merge proposals get a review diff created for them automagically. This review diff is based on the changes that have been done in the branch relative to the least common ancestor (LCA) of the target branch. Since the review diff only has changes that have been added, there is no way for this diff to ever have conflicts.

There is another diff that is useful to see however. This is the diff of what changes would happen if the source branch was merged into the target branch right now. Sometimes this might conflict. Sometimes this might be a smaller diff as some other dependent functionality has landed. This diff isn't generated automatically by Launchpad. However, there is a script that you can run to add it.

The Merge Analysis Daemon

Alright, it isn't exactly a daemon (yet), but the name was cool.

Using the launchpadlib API, this script gets all of the current merge proposals for a project and, for each one, works out the diff that would result from merging the source into the target. We call this the preview diff.

What do you need

Firstly you need branches of wadllib, launchpadlib, and mad.

$ bzr branch lp:wadllib
$ bzr branch lp:launchpadlib
$ bzr branch lp:mad

Inside the mad directory, there is the LICENSE file (GPL v3), and the script.

The script has many parameters.

$ ./mad.py --help
Usage: mad.py [options]

Options:
  -h, --help            show this help message and exit
  -v, --verbose         Display extra information
  -q, --quiet           Display less information
  -p PROJECT, --project=PROJECT
                        The name of the Launchpad project.
  -r DIR, --repo=DIR    The location of a local repository to use.
  --dry-run             Don't upload the diffs
  --force               Force an update of the diff.
  --staging             Update the proposals on staging.launchpad.net.
  -c FILE, --credentials=FILE
                        The credentials file. Defaults to ~/.launchpad/mad
  --cachedir=DIR        The location of the cache directory. Defaults to
                        ~/.launchpadlib/cache.
  --no-op               Don't get the proposals for the project.
  --new-credentials     Get a new OAuth token and save in the credentials
                        file.

I have the following in the server's crontab listing:

20 * * * * PYTHONPATH=/home/tim/launchpadlib:/home/tim/wadllib /home/tim/mad/mad.py -p storm -r /home/tim/sandbox/mad-playground -c /home/tim/.launchpad/mad -v >> /home/tim/mad-storm.log 2>&1

Basically this says:

At 20 minutes past every hour

Run the mad.py script using a PYTHONPATH that knows about wadllib and launchpadlib.

Use the credentials file ~/.launchpad/mad

Use the repository at ~/sandbox/mad-playground

Be verbose

Make all output go to a log file.
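If you would rather drive the script from Python than from cron, the same invocation can be sketched as below. It uses only the flags shown in the --help output above. The function names are hypothetical, and the directory layout under `home` simply mirrors my crontab entry, so treat it as an assumption.

```python
import os
import subprocess


def build_mad_invocation(home, project):
    """Return (argv, env) matching the crontab entry above."""
    env = dict(os.environ)
    # mad.py needs to be able to import launchpadlib and wadllib.
    env["PYTHONPATH"] = ":".join(
        [home + "/launchpadlib", home + "/wadllib"]
    )
    argv = [
        home + "/mad/mad.py",
        "-p", project,                          # Launchpad project name
        "-r", home + "/sandbox/mad-playground",  # local repository
        "-c", home + "/.launchpad/mad",          # credentials file
        "-v",                                    # be verbose
    ]
    return argv, env


def run_mad(argv, env, log_path):
    """Run the script, appending stdout and stderr to a log file."""
    with open(log_path, "a") as log:
        return subprocess.call(
            argv, env=env, stdout=log, stderr=subprocess.STDOUT
        )
```

Calling `build_mad_invocation("/home/tim", "storm")` and passing the result to `run_mad` with a log path does the same job as the cron line, minus the scheduling.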

If the specified repository directory does not exist, a new shared repository with no working trees is created. If there is an existing repository, it will use that.

Each of the source and target branches is pulled into the repository. MAD won't create branches for them; it just grabs all the necessary revisions. MAD then calculates the diff that would result if the source were merged into the target, and sends that to Launchpad to have it annotate the proposals. As an example, see the storm ones.

You will also need to have permission to edit the proposals that you are wanting to update. If you are the person that is running the project, and are in the team that owns the target branches, then you should be able to update them.

There is a --staging option to test the script against what is in staging.

The script also walks you through the necessary OAuth token acquisition the first time you run the script.

Report bugs on Launchpad.

Wednesday, 21 January 2009

Shallow branches or history horizons

There is an idea floating around and I'm curious to see if it is an idea that has merit and is worth putting effort into. This idea is in the DVCS space and is called "shallow branches" or "history horizons".

The concept itself is pretty simple. When using a DVCS with a project with a long history, each and every user has a copy of this history. Now much of this history may be ancient (for some definition of ancient, 6 months, 6 years, whatever). Most developers will never have a need to go into the ancient history of a project, and so a truncated history is fine as long as the branches they create are still mergeable with the main repository.

Here's how it could play out:

Bob wants to work on the fooix project to fix a minor bug. This is Bob's first look at the fooix source. The fooix project has been around for eons and has a huge history. Bob doesn't care about the history; he just wants to do his simple fix (think a typo).

Bob grabs the fooix trunk branch but only gets enough history to create the working files.

Bob makes his fix, and publishes his branch for the fooix developers to grab.

The advantage here is that when Bob grabs his branch, he is only getting just enough history to work, and so his resulting repository is smaller and faster.

Commands that worked by inspecting the history would stop at the repository's horizon and say something like "and that's all I've got". Obviously there'd need to be a way to say "go and get me another 4 months of history" or even "ok, now I'm really interested, get me the complete history".

This is conceptually different from a lazy loading or stacked repository as there is an explicit horizon where normal history commands stop.
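To make the idea concrete, here is a toy sketch of a history command that stops at the horizon. Nothing here comes from any real DVCS: the `parents` mapping, the `horizon` count, and the function name are all hypothetical, and only mainline history is modelled.

```python
def log_with_horizon(parents, tip, horizon=None):
    """Walk mainline history from tip, stopping at the horizon.

    parents maps a revision id to its mainline parent (None for the
    first revision). horizon is the number of revisions held locally;
    None means the complete history is available.
    Returns (revisions, truncated).
    """
    revisions = []
    current = tip
    while current is not None:
        if horizon is not None and len(revisions) >= horizon:
            # "and that's all I've got"
            return revisions, True
        revisions.append(current)
        current = parents[current]
    return revisions, False
```

Asking for more history would then amount to raising `horizon` (or setting it to None) after fetching the extra revisions.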

So lazyweb, the question I have is this: "Is this a worthwhile feature in a DVCS tool?"

Friday, 11 July 2008

Bazaar has the model right

Some people in the GNOME community have suggested that if Bazaar has nice usability, then GNOME can just use Git on the back-end, and Bazaar lovers can just use the Git back-end via Bazaar. It's true that Bazaar could support this — an experimental plug-in exists to do this right now. But this suggestion betrays several wrong assumptions.

People assume Git and Bazaar are the same. They're not. People assume that if Git and Bazaar have technical differences, then Git must have it right.

The problem with these assumptions is that usability begins at the ground level. Bazaar started with a focus on usability. Git began with a focus on speed. The data models of both Bazaar and Git reflect their initial focus. But Bazaar's model can also be fast. In fact, the Bazaar developers are currently optimising a number of key operations for speed.

Data retrieval

Git and Bazaar are both key/value mapping systems. When bytes are needed, they are requested with that key.

The big difference is that Git's keys are also the hashes of the bytes. This is why it's called a content-addressable file system. This allows git to offer a guarantee that if the value hashes to the key, it has not been modified, whether deliberately or by accident. The Bazaar team considered adopting this approach, but decided it was too constricting. Bazaar uses UUIDs instead.
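As a rough illustration of the two kinds of key, the sketch below computes a Git-style blob key (SHA-1 over a short header plus the bytes) and, for contrast, a key that is independent of the content. The function names are hypothetical, and `uuid.uuid4()` is only a stand-in: the actual format of Bazaar's identifiers is not described here.

```python
import hashlib
import uuid


def git_blob_key(data):
    """Key a blob the way Git does: SHA-1 of 'blob <size>\\0' + bytes."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def value_matches_key(key, data):
    """The content-addressable guarantee: the value hashes to its key."""
    return git_blob_key(data) == key


def new_unique_key():
    """A key chosen independently of the bytes it will refer to."""
    return uuid.uuid4().hex
```

With the first scheme any change to the bytes, or to how they are serialized, produces a different key. With the second, the key says nothing about the bytes, so integrity has to be checked some other way.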

Authenticating revisions

For detecting malicious modification of revisions, Git uses its cryptographic hashes.

Bazaar uses revision-signing. All revisions can be PGP-signed. No signed revision can be forged. And the hashed representation can easily be generated and passed around to ensure that exactly the same content is used.

If SHA-1 is broken, both Bazaar and Git will lose their ability to detect malicious modification. But since Bazaar uses UUIDs to identify revisions, users can re-sign their old revisions with whatever method proves to be secure. Changing the hash used by Git would make it incompatible with all existing repositories.

Data Integrity and Serialization formats

Bazaar stores hashes of every value, so it is equally capable of detecting accidental modification. It can be useful to have different representations of a tree in different repositories. For example, when Git lists files, it divides this data by directory. This is a good approach, but not necessarily the best approach. An alternative approach would be to use a radix tree. This would ensure that Git performed quickly even if users put unreasonable numbers of files in a single directory. But because Git's keys are hashes, upgrading Git's format to use radix trees would change the keys, which means that people could not use the commit-id from one repository to refer to the same tree in another form.

Bazaar doesn't assume it has the perfect format. It provides an upgrade path, and doesn't change the commit-id of a revision if you change your format. What's more, Bazaar can even reference data it has never seen. This allows partial imports from other VCSes to be fully compatible with more complete imports. And if a VCS provides UUIDs (content hashes certainly qualify as UUIDs), Bazaar can refer to those UUIDs directly.
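The toy below tries to show why a format change is harmless when ids are separate from content. The same tree listing is serialized two ways: under content hashing the two forms get different keys, while a store keyed by an independent id keeps the id and records a hash beside the value for integrity. Both serializations, the class, and all names are hypothetical and are not Bazaar's or Git's real formats.

```python
import hashlib
import json


def serialize_flat(tree):
    """Format A: one sorted list of (path, file hash) pairs."""
    return json.dumps(sorted(tree.items())).encode("utf-8")


def serialize_by_directory(tree):
    """Format B: the same data, grouped by directory."""
    grouped = {}
    for path, digest in tree.items():
        directory, _, name = path.rpartition("/")
        grouped.setdefault(directory, {})[name] = digest
    return json.dumps(grouped, sort_keys=True).encode("utf-8")


def content_key(data):
    """A key derived from the serialized bytes."""
    return hashlib.sha1(data).hexdigest()


class IdKeyedStore:
    """Values live under caller-chosen ids, with a hash kept alongside."""

    def __init__(self):
        self._records = {}

    def put(self, key, data):
        self._records[key] = (data, content_key(data))

    def get(self, key):
        data, digest = self._records[key]
        if content_key(data) != digest:
            raise ValueError("stored bytes were modified")
        return data
```

Upgrading the format is then a `put` of the new serialization under the existing id, and anything that referred to that id keeps working.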

File and directory representation

Git refers to files by path. It makes no attempt to track renames in its data store.

Bazaar has an inode abstraction; files and directories both have ids. When a file is renamed, its id stays the same. Bazaar's core code refers to files by their id, so merging a renamed file requires no special effort.

Git's approach means that users are warned not to rename files while changing their content. But when files are renamed, those files that refer to the renamed files must have their contents changed as well. For example, if you rename foo.h and foo.c to bar.h and bar.c, you should update the contents of bar.c, or else you will break the build. With Bazaar, users can do whatever they want, and the VCS just works. While Git must always use heuristics to deduce renames, Bazaar does not have to. Of course, it can if it wants to. This is an example of why it is important to design a model for usability from the beginning.
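A small sketch of the file-id idea, under heavy simplification: an inventory maps stable ids to current paths, and another branch's edits arrive keyed by id, so they land on the renamed file without any guessing. The class and function names are hypothetical, and "merging" here just replaces the text wholesale rather than doing a real three-way merge.

```python
class Inventory:
    """Maps stable file ids to their current paths."""

    def __init__(self, paths_by_id):
        self.paths_by_id = dict(paths_by_id)

    def rename(self, file_id, new_path):
        # The id stays the same; only the path changes.
        self.paths_by_id[file_id] = new_path


def apply_content_changes(inventory, contents, changes_by_id):
    """Apply another branch's edits, keyed by file id, to the local tree.

    contents maps path -> text in the local tree; changes_by_id maps
    file id -> new text from the other branch.
    """
    merged = dict(contents)
    for file_id, new_text in changes_by_id.items():
        merged[inventory.paths_by_id[file_id]] = new_text
    return merged
```

In the foo.h to bar.h example, the other branch may still know the header as foo.h, but its change is recorded against the id and so ends up in bar.h.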

Bazaar can import rename data losslessly from foreign VCSes. Some other VCSes support file-ids, and Bazaar can reuse those without change. For VCSes that support renames, but not file-ids, Bazaar's representation is also non-lossy. When data imports are deterministic and non-lossy, it's easy to export them back to their source VCS. Bazaar's Subversion integration is a great example of how this can work.

Choose the back-end with the right model

In any situation it makes sense to use a back-end that stores the richer dataset. It makes more sense to have a front end client that doesn't use all the functionality or data representation of the back-end than it does to have a richer client that isn't able to store the required information as the back-end is not able to represent it.

If a single back-end storage is going to be used, it makes more sense to use a Bazaar back-end as Bazaar is able to represent everything that Git does, but the reverse is not true.

Conclusion

The Bazaar developers focused on usability, which requires having a model that supports usability. Bazaar has improved its model to increase the usability of the system. We believe that Bazaar has the right model.

co-written by Aaron Bentley and Tim Penhey

Thursday, 10 July 2008

Re: Interesting (rough) statistics at the GNOME distributed RCS BOF

J5 mentioned in his post his interpretation of the number of users for GIT, Bazaar and Hg (Mercurial). He also finishes with "Converse amongst yourselves".

I guess I should first point out that I am a Bazaar user, and that I work for Canonical. I felt somewhat enraged at the post from J5, and have spent some time trying to work out some response.

John Carr mentioned that 83% of statistics are made up on the spot, and that cannot be more true here. I had been waiting for someone else to post the numbers that they saw at the BOF, but so far I have not seen one.

Here is my take on it.

Yes there were more GIT users than Bazaar users at the BOF, but the numbers were more like 50% of the audience were GIT users, and about 40% were Bazaar users. Someone piped up and said "What about Mercurial?" and so the question was asked, and there were about five or six people. There was an overlap of the GIT and Bazaar groups, and there was by far the larger majority of the audience that had not used any DVCS.

What conclusions can we draw from this? Not much. Many people attending the pre-conference work for larger companies, like Red Hat, Novell, and Nokia, and many of those people work on some hard core Linux stuff, many of which have chosen GIT. Many have chosen GIT because that is what the Linux kernel is using. Is that a good reason to choose a DVCS? I don't feel that we can really answer that question as I am sure there are strong advocates for both sides.

An interesting question is "Which DVCS is easier for the casual contributor to use?" Surely one of the reasons that a project chooses a DVCS is to allow for more community contributions in an easy to merge way that has a clear contribution history. Bazaar just works. It works for the hard-core developers, but is also easy for those soft-core (?).

What I hear from the people I talk to, and I've tried to talk to many here, is that for those who use Bazaar it just works. Bazaar doesn't get in your way of developing the software that you are working on. It is just a tool that works.

One final point. The questions were "Who uses <insert DVCS>?", not "Who likes/loves using <insert DVCS>?".