Archive for September 2007
A Different Approach to Version Control
Version control systems as they exist today are quite dumb. They usually can only distinguish between two types of files — text and binary — and often are only able to do anything useful at all with text files. Focusing on text files specifically, the only change version control systems can detect is on a line granularity. Attempting to detect changes at a finer level would likely result in chaos due to the limitations of plain text algorithms. Even if your only change is to change the indentation of a line or rename a variable, the version control system can only see that the entire line has been edited, and dutifully replaces the entire old one with the entire new one in the repository.
The Problem is Semantics
Normally, this wouldn’t be so bad. It seems like just an implementation detail. So what if the internal storage mechanism is ever so slightly less efficient than it could be? They’re just text files. Or are they? The files in your projects are probably mostly structured source code. These are not “just text files.” They follow a very strict formatting convention, the syntax of the language, and different kinds of changes have different syntactic and semantic meanings. Sometimes line breaks are good points to partition a source code file. Sometimes you can even rearrange the lines in a file while preserving meaning, but usually not. For a simple example of the problems caused by considering lines atomically, consider the following snippet of C code:
if(x > 0) y = 1; else y = x+1;
What if you want to add z=27; as part of the else block of the code? Obviously, you would have to add brackets if you want to insert it as a separate line, so now the code looks like this:
if(x > 0)
y = 1;
else {
y = x+1;
z = 27;
}
In another branch, let’s say that you have added x++; after the entire if-else block. So in that branch, it looks like this:
if(x > 0) y = 1; else y = x+1; x++;
Let’s say you want to merge these two branches. If you have ever merged two branches in a repository before, you probably can already see where this is headed. For a version control system that is only aware of changes to entire lines at a time, the final result of the merge is ambiguous. If we assume that the VCS can magically tell what is syntactically valid (although it probably can’t), then there are three possible results from the merge. One is the result that you probably intended:
if(x > 0)
y = 1;
else {
y = x+1;
z = 27;
}
x++;
One looks like this:
if(x > 0)
y = 1;
else {
y = x+1;
z = 27;
x++;
}
And one looks like this:
if(x > 0)
y = 1;
else {
y = x+1;
x++;
z = 27;
}
The latter versions have completely different meanings from the former. This could serve as an argument against ever using if-else blocks without brackets, but even better would be if you didn’t have to restrict your programming style just to cater to a picky VCS.
The Solution
Don’t represent the code as a plain text file! It’s the wrong abstraction for source code. Represent it as a parse tree or syntax tree. I know this sounds infinitely more complicated than simple diffs at this point, but bear with me.
Let’s take our example from above. First, we need a simple tree data type for the C concepts we use. Here is a very basic one (with a lot of details absent that even our simple example could have benefited from had I taken the time to demonstrate then, such as finer granularity for expressions) in Haskell syntax:
data Statement = If Exp Exp Exp data Exp = Statement Statement | Compound Exp Exp | ExpStr String
So, basically, all we know about C syntax with this is if statements and the fact that expressions can be chained, terminate with Strings (not the best idea), and can be statements (which isn’t really true, but it makes the example simpler).
Here is what the first version of our example program would look like internally:
Statement (If (ExpStr "x > 0") (ExpStr "y = 1") (ExpStr "y = x+1"))
Changes can be represented as functions that work by pattern matching. Here is the first branch:
Statement (If c t f) = Statement $ If c t $ Compound f $ ExpStr "z = 27"
Here is the second branch:
Statement (If c t f) = Compound (Statement $ If c t f) (ExpStr "x++")
Obviously, an internal representation like this could lose certain formatting details, such as whitespace. However, this is probably more of an advantage than a disadvantage. No longer are visual formatting changes going to affect merges or even visually clash due to differences in team members’ styles. The version control system will always normalize the style on update. You could even customize this output locally. Very slick.
There are other implications to this which may lead to evolutionary developments. I will save them for a future post.