This makes a bold assumption: that the errors (their count and basic content) will be the same across versions. If that turns out to be untrue, this approach may need to get more sophisticated. Either way, it should fail obviously when we hit that edge case.
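
To make "fail obviously" concrete, here is a minimal sketch of that kind of check. The function name `check_errors`, the list-of-strings shape, and the substring match used for "basic content" are all illustrative assumptions, not the actual code this note describes.

```python
def check_errors(expected, actual):
    """Compare two lists of error messages by count and loose content.

    Hypothetical helper: `expected` holds short fragments we assume stay
    stable across versions; `actual` holds the full messages produced.
    """
    # Count mismatch: fail loudly and show both lists for debugging.
    if len(expected) != len(actual):
        raise AssertionError(
            f"error count changed: expected {len(expected)}, got {len(actual)}\n"
            f"expected: {expected}\n"
            f"actual:   {actual}"
        )
    # "Basic content" is modelled here as a substring match (an assumption).
    for i, (want, got) in enumerate(zip(expected, actual)):
        if want not in got:
            raise AssertionError(
                f"error {i} changed: expected text containing {want!r}, got {got!r}"
            )


# Passes: same count, and the stable fragment appears in the full message.
check_errors(["undefined name"], ["line 3: undefined name 'x'"])
```

If a version adds, drops, or rewords an error, the assertion message shows both sides, which is the point at which a more sophisticated comparison would be worth writing.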