Big thread over at Gearslutz on this. Here is the summary:
A null test is only part of the story.
Driver handling, however, is another part of it.
A null test will verify that the files are identical, but different software, or different versions of the same software, may handle the audio driver differently, so what comes out of the speakers can sound different even though the files are identical.
Only recording the speaker output with a neutral mic and then null-comparing those recordings will give a reasonably accurate test of the difference in sound.
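For anyone who has not run one, a null test amounts to inverting the polarity of one signal, summing it with the other, and looking at what is left. Below is a minimal sketch of that idea in plain Python. The function name and the test signals are made up for illustration and are not from the Gearslutz thread or from any DAW; it assumes both signals are already loaded as equal-length lists of float samples in the range -1.0 to 1.0.

```python
import math

def null_residual_dbfs(a, b):
    """Peak level of (a - b) in dBFS for two equal-length lists of float samples.

    Returns -inf when the two signals cancel completely (a perfect null).
    """
    if len(a) != len(b):
        raise ValueError("signals must have the same number of samples")
    # Subtracting b is the same as inverting its polarity and summing.
    peak = max((abs(x - y) for x, y in zip(a, b)), default=0.0)
    return -math.inf if peak == 0.0 else 20.0 * math.log10(peak)

# Hypothetical test data: one second of a 1 kHz sine at 48 kHz,
# an exact copy of it, and a copy that is 0.1% louder.
rate = 48000
render_a = [0.5 * math.sin(2 * math.pi * 1000 * n / rate) for n in range(rate)]
render_b = list(render_a)
render_c = [s * 1.001 for s in render_a]

print(null_residual_dbfs(render_a, render_b))  # identical data: perfect null (-inf)
print(null_residual_dbfs(render_a, render_c))  # tiny gain change: about -66 dBFS
```

For the mic-recording version of the test, the two takes would also need to be time-aligned before subtracting, and some residual should be expected from room noise alone, so there the question is how low the residual gets rather than whether it is a perfect null.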
That's the summary; I'm just passing along the concepts.
I would like to get an opinion from the Cakewalk team on this: has the driver handling stayed pretty much the same over time, or do developers experiment in this area?