Diffing
If you’re into coding, chances are that you’ve seen diff views before. They usually have two sides: The left shows you the old state of a file, the right shows you the new state, and there are - and + markers indicating what changed.
But how does this actually work? And can we implement our own differ? It turns out that it’s actually based on just a few really elegant ideas.
Optimal Diffing
You probably already guessed that we’ll implement our own diffing tool in this post. In fact, we’ll implement a diffing algorithm that produces the smallest possible number of - and + markers.
We will do this by finding the longest common subsequence, a pretty standard dynamic programming problem. The rest of the post first explains this problem and then shows how it can be used to perform diffing.
Longest Common Subsequence (LCS)
Given two sequences S1 and S2, a common subsequence is a sequence that is a subsequence of both S1 and S2. Note that subsequences do not have to be contiguous. For example, consider these two sequences:
- S1: ABCDE
- S2: ABZZE
Here, AE would be a common subsequence. The longest common subsequence (the LCS) is the longest such subsequence. In our example, that is ABE.
Here’s the fundamental insight behind why the LCS is useful for diffing: The LCS corresponds exactly to the unchanged parts of a diff, i.e. the parts that exist in both S1 and S2. Because we find the longest common subsequence, this means that we maximize the amount of unchanged parts in the diff. In the same way, we minimize the number of change markers.
This is the fundamental idea behind solving diffing this way. Just that realization already seems pretty cool to me: At first, finding the LCS seems like such an abstract problem, but it actually has really nice applications, such as diffing.
Recursive
Instead of finding the LCS itself, let’s start by just computing its length.
We will do so in a recursive way. Let \(f(i, j)\) be the length of the LCS when considering the first \(i\) characters of S1 and the first \(j\) characters of S2. Note that \(i = 0\) corresponds to the empty string for S1, while \(i = |S1|\) would correspond to the full string.
We will recursively find the length of the LCS for different substrings (i.e. \(i, j\) combinations) to build up the final result of \(f(|S1|, |S2|)\). This works as follows:
\[f(i, j) = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0 \\ 1 + f(i - 1, j - 1) & \text{if } S1[i - 1] = S2[j - 1] \\ \max\{f(i - 1, j), f(i, j - 1)\} & \text{otherwise} \end{cases}\]
Let’s go through these cases individually.
1. Recursion base case: Empty strings, i.e. \(i = 0\) or \(j = 0\)
If one of the two strings is empty, then the only common subsequence is the empty string. The empty string has length 0, so we can directly return that and end the recursion.
2. The currently considered elements match: \(S1[i - 1] = S2[j - 1]\)
If the \(i\)-th and the \(j\)-th elements match, then we can increase the length of the currently considered subsequence by \(1\), and continue with the remaining strings.
Note that we index using \(i - 1\), \(j - 1\) because indexing starts at 0 while we agreed that \(i = 1\) would correspond to the first element and \(i = 0\) to the empty string.
3. In any other case, drop one character
In any other case, we have two options:
- We ignore the \(i\)-th character of S1: \(f(i - 1, j)\)
- We ignore the \(j\)-th character of S2: \(f(i, j - 1)\)
We consider both options and then take the better one, i.e. the one yielding the larger (\(\max\)) sequence.
Combining the cases
All of these cases together allow us to compute the length of the LCS. Note that this recursion is really similar to computing the edit distance of two strings. The only differences are that we now keep track of the matching parts (not the ones that have to be changed) and that we only have add and remove operations (no change operation).
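To make the recursion concrete, here is a direct Python translation of \(f\). This is just an illustrative sketch (the name lcs_len_recursive is my own choice), not the implementation we will end up using:

def lcs_len_recursive(s1, s2, i, j):
    """Naive recursive computation of f(i, j)."""
    # Base case: one of the strings is empty.
    if i == 0 or j == 0:
        return 0
    # The currently considered elements match.
    if s1[i - 1] == s2[j - 1]:
        return 1 + lcs_len_recursive(s1, s2, i - 1, j - 1)
    # Otherwise, drop one character from either string and take the better option.
    return max(lcs_len_recursive(s1, s2, i - 1, j),
               lcs_len_recursive(s1, s2, i, j - 1))

For the example from above, lcs_len_recursive("ABCDE", "ABZZE", 5, 5) returns 3, the length of ABE.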
Dynamic Programming
Directly implementing the above recursion would be inefficient because there are a lot of repeated subcalls. Each time one of these subcalls is made, we compute the result from scratch and perform new recursive calls, leading to exponential complexity. Instead of doing that, we could cache the results to reduce the complexity down to quadratic.
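As a sketch, such caching could look like this, using functools.lru_cache as one possible memoization mechanism (again, the name lcs_len_memoized is just for illustration):

from functools import lru_cache

def lcs_len_memoized(s1, s2):
    """Same recursion as before, but each (i, j) pair is computed only once."""
    @lru_cache(maxsize=None)
    def f(i, j):
        if i == 0 or j == 0:
            return 0
        if s1[i - 1] == s2[j - 1]:
            return 1 + f(i - 1, j - 1)
        return max(f(i - 1, j), f(i, j - 1))

    return f(len(s1), len(s2))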
It turns out, we can also build the results in a bottom-up manner using dynamic programming. This is because each recursive call only uses subresults from \(i - 1\) and \(j - 1\).
This works as follows:
def compute_lcs_len(text1, text2):
    """Computes a table of f(i, j) results."""
    n = len(text1)
    m = len(text2)

    # We store the results in a (n + 1) x (m + 1) matrix. The +1s are to
    # allocate space for the empty strings. Cell [i][j] will cache the
    # result of f(i, j).
    lcs = [[None for _ in range(m + 1)]
           for _ in range(n + 1)]

    # We then fill the matrix by going through all rows, using the fact
    # that each call only needs results from the previous (i - 1) or
    # same (i) row, and from the previous (j - 1) or same (j) column.
    for i in range(0, n + 1):
        for j in range(0, m + 1):
            # The remaining code is exactly the same recursion as before, but
            # we do not make recursive calls and instead use the results cached
            # in the matrix.
            if i == 0 or j == 0:
                lcs[i][j] = 0
            elif text1[i - 1] == text2[j - 1]:
                lcs[i][j] = 1 + lcs[i - 1][j - 1]
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])

    return lcs
The returned matrix tells us the results of all \(f(i, j)\). The length of the LCS is then stored in the cell of the last row and column.
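For example, with the strings from the beginning of the post:

table = compute_lcs_len("ABCDE", "ABZZE")
print(table[len("ABCDE")][len("ABZZE")])  # 3, the length of the LCS "ABE"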
Reconstructing the LCS
It might seem weird that so far we focused on finding the length of the LCS rather than the LCS itself. However, the matrix we built up actually tells us exactly how we can find the LCS. Not only that, but it also tells us where + added and - removed parts are.
As a next step, we will focus on reconstructing the actual LCS. Afterwards, we will then adapt the code to find + additions and - removals.
To find the actual LCS string, we traverse the matrix we built up in the previous step in a way that always follows the path of the LCS. That is, we traverse just like the previous recursion did and take the \(\max\) option whenever there is a choice.
def find_lcs_string(text1, text2):
    """Finds the longest common subsequence of the given texts."""
    result = []
    lcs = compute_lcs_len(text1, text2)
    i = len(text1)
    j = len(text2)

    # We iterate until we have consumed all of text1 (i == 0) or text2 (j == 0).
    while i != 0 and j != 0:
        # If the parts of text1 and text2 that we consider are equal, then we
        # can record this as part of the LCS, and move to i - 1, j - 1 since
        # this is also how compute_lcs_len traversed.
        if text1[i - 1] == text2[j - 1]:
            result.append(text1[i - 1])
            i -= 1
            j -= 1
        # Otherwise, compute_lcs_len went into the max direction, which is
        # also what we do here.
        elif lcs[i - 1][j] <= lcs[i][j - 1]:
            j -= 1
        else:
            i -= 1

    # Reverse the result because we iterated over the texts from the end but
    # want it to be in forward order.
    return "".join(reversed(result))
Note how we traversed exactly the way the optimal recursion would traverse. If we were to print the results, we would now get the LCS, i.e. the unchanged parts of the diff.
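Continuing the example from above:

print(find_lcs_string("ABCDE", "ABZZE"))  # ABE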
From LCS to Diffing
Now we only need to find the + additions and - removals. Luckily we already have all required information in the LCS table we built up.
Code
We pretty much follow the same logic as the LCS traversal but break some cases up further to account for + additions and - removals.
Let’s first take a look at the code, and then discuss the cases a bit more.
def diff(text1, text2):
    """Computes the optimal diff of the two given inputs.

    The result is a list where all elements are Removals, Additions or
    Unchanged elements.
    """
    lcs = compute_lcs_len(text1, text2)
    results = []

    i = len(text1)
    j = len(text2)

    # We iterate until we have consumed both texts.
    while i != 0 or j != 0:
        # If we reached the end of one of text1 (i == 0) or text2 (j == 0),
        # then we just need to record the remaining additions and removals.
        if i == 0:
            results.append(Addition(text2[j - 1]))
            j -= 1
        elif j == 0:
            results.append(Removal(text1[i - 1]))
            i -= 1
        # Otherwise there are still parts of text1 and text2 left. If the
        # currently considered parts are equal, then we found an unchanged
        # part which belongs to the longest common subsequence.
        elif text1[i - 1] == text2[j - 1]:
            results.append(Unchanged(text1[i - 1]))
            i -= 1
            j -= 1
        # In any other case, we go in the direction of the longest common
        # subsequence.
        elif lcs[i - 1][j] <= lcs[i][j - 1]:
            results.append(Addition(text2[j - 1]))
            j -= 1
        else:
            results.append(Removal(text1[i - 1]))
            i -= 1

    # Reverse results because we iterated over the texts from the end but
    # want the results to be in forward order.
    return list(reversed(results))
Note that Unchanged, Addition and Removal are just simple data classes that hold the content.
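They are not shown above, so here is one possible minimal definition (the actual implementation may differ slightly), together with an example call using the strings from earlier:

from dataclasses import dataclass

@dataclass
class Unchanged:
    content: str

@dataclass
class Addition:
    content: str

@dataclass
class Removal:
    content: str

print(diff("ABCDE", "ABZZE"))
# [Unchanged(content='A'), Unchanged(content='B'), Removal(content='C'),
#  Removal(content='D'), Addition(content='Z'), Addition(content='Z'),
#  Unchanged(content='E')]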
Going through the cases
Let’s discuss these cases in some more detail.
1. Base case: Both strings are empty, i.e. \(i = 0\) and \(j = 0\)
No more diffs can be produced, so we terminate.
2. Only one string is empty, i.e. \(i = 0\) or \(j = 0\)
If exactly one of the two strings is empty, then the other must contain additions or removals:
- S1 is empty, but S2 is not: This means something was added in S2 and we have to record these elements as + additions
- S2 is empty, but S1 is not: This means something was removed from S1 and we have to record these elements as - removals
3. The currently considered elements match: \(S1[i - 1] = S2[j - 1]\)
As discussed previously, these elements are part of the LCS and as such must be unchanged elements.
4. In any other case, drop one character
In the last case, the LCS recursion dropped either a character from S1 (meaning \(i - 1\)) or from S2 (meaning \(j - 1\)). The LCS table tells us which one led to the optimal (\(\max\)) result:
- lcs[i - 1][j] < lcs[i][j - 1]: Here, \(j - 1\) leads to a longer LCS, meaning that we have to record a + addition to S2, since this is where an element was skipped for the LCS
- lcs[i - 1][j] > lcs[i][j - 1]: Here, \(i - 1\) leads to a longer LCS, meaning that we have to record a - removal from S1, since this is where an element was skipped for the LCS
- lcs[i - 1][j] == lcs[i][j - 1]: If both alternatives are equal, then both a + addition and a - removal happened. Which one we process first does not matter too much since it only influences which one is displayed first
Afterwards, we update i and j exactly as the recursive LCS algorithm did.
Variants
Split-view diffing
We are actually pretty much done now. The code above gives us a results list which tells us which elements remained unchanged, were - removed or + added. We could now directly render this to produce a unified diff view:
As you can see, this shows all diff results in a single view.
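Since the interactive demo is not reproduced here, a rough text-only sketch of such a rendering could look like this (render_unified is a hypothetical helper that assumes line-level diff elements):

def render_unified(results):
    """Prints each diff element with its marker, similar to a unified diff."""
    for element in results:
        if isinstance(element, Addition):
            print("+ " + element.content)
        elif isinstance(element, Removal):
            print("- " + element.content)
        else:
            print("  " + element.content)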
However, as mentioned at the very beginning of the post, most diffing tools also allow you to display results in a split view. The left shows - removals, the right side + additions:
It turns out this is not too difficult to implement once you have the above algorithm in place. We still use the results list we previously produced and just render it twice:
- The left side only renders - removals and unchanged parts
- The right side only renders + additions and unchanged parts.
Additionally, some book-keeping has to be done to figure out how many buffer lines one needs to insert in order for the left and right sides to line up nicely.
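A minimal sketch of that book-keeping could look like this (render_split is a hypothetical helper; it pads the shorter column with empty buffer lines before every unchanged element so that both sides stay aligned):

def render_split(results):
    """Builds the left (removals + unchanged) and right (additions + unchanged)
    columns of a split diff view."""
    left, right = [], []
    for element in results:
        if isinstance(element, Removal):
            left.append("- " + element.content)
        elif isinstance(element, Addition):
            right.append("+ " + element.content)
        else:
            # Pad the shorter side with buffer lines so that the unchanged
            # element appears on the same row in both columns.
            while len(left) < len(right):
                left.append("")
            while len(right) < len(left):
                right.append("")
            left.append("  " + element.content)
            right.append("  " + element.content)
    # Pad once more in case the diff ends with removals or additions.
    while len(left) < len(right):
        left.append("")
    while len(right) < len(left):
        right.append("")
    return left, right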
Char-level vs word-level vs line-level diffing
Something that might have been confusing so far is that all live demos showed diffs on a line-level, while our code never explicitly handled lines in any way. In fact, the algorithm and code have no notion of lines on purpose. Instead, they work on generic sequences.
This means you could implement different types of diff levels, just by controlling what is passed to the diff function:
- Character-level: Just pass in the strings directly (or in some programming languages: pass in arrays of characters)
- Word-level: Tokenize each text and then pass in arrays of tokens
- Line-level: Split the texts by lines and pass in arrays of lines
For all of these, the diff function stays the same. The only thing that changes is how you call it and how results have to be rendered in the visualization.
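For instance, all three variants can reuse the diff function from above unchanged (the example strings here are made up for illustration):

old = "the quick brown fox\njumps over the lazy dog\n"
new = "the quick brown fox\nleaps over the lazy dog\n"

# Character-level: pass the strings directly.
char_results = diff(old, new)

# Word-level: a very naive whitespace tokenization.
word_results = diff(old.split(), new.split())

# Line-level: split the texts into lines first.
line_results = diff(old.splitlines(), new.splitlines())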
However, you can optimize word-level and line-level diffing with hashing to speed up the == comparisons: first, store hashes of all elements. Then, when comparing two elements, compare their hashes first. If the hashes differ, the elements cannot be equal; only if the hashes match do you still need to compare the actual contents to rule out hash collisions.
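One way to sketch this idea (HashedLine is a hypothetical wrapper class, not part of the code above):

class HashedLine:
    """Wraps a line and pre-computes its hash so repeated == checks are cheap."""

    def __init__(self, content):
        self.content = content
        self.hash = hash(content)

    def __eq__(self, other):
        # Different hashes mean the lines definitely differ; equal hashes
        # still require a full comparison to guard against collisions.
        return self.hash == other.hash and self.content == other.content

def diff_lines_hashed(text1, text2):
    """Line-level diff where every line is wrapped in a HashedLine."""
    wrapped1 = [HashedLine(line) for line in text1.splitlines()]
    wrapped2 = [HashedLine(line) for line in text2.splitlines()]
    return diff(wrapped1, wrapped2)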
Is It Really Optimal?
The algorithm we discussed is optimal in the sense that it produces the minimum number of change markers. However, sometimes I find this to produce pretty annoying results.
For example, consider this diff:
We really only added one entire function, but because there is overlap with the previous function, this is not displayed as a single added function but as a change that splits up the original function in the file.
It is worth noting that all code review tools I have used so far seem to have this problem. I can also think of some heuristics to fix this particular case, but that seems out of scope for this blog post, and I’m not sure any of them would generalize nicely.
Conclusion
As we saw, finding the longest common subsequence (LCS) is equivalent to finding the unchanged parts of a diff. The same dynamic programming solution also allows us to reconstruct the - removed and + added parts.
There are several things I find neat about this: for one, finding the LCS seems like such a theoretical problem at first, but then it has really cool applications. Furthermore, you can build your own diffing tool from a few really elegant ideas instead of hacking away at tons of complex rules and heuristics. This was definitely something I did not previously realize.
Appendix: Code
If you are curious about the full code, I put my entire Python implementation on GitHub at florian/diff-tools. I also wrote a JavaScript implementation to power the live demos in this blog post. That code and the matching React visualization code are also on GitHub.