I've been exploring this issue a little more. The data corruption that V-Vocal is imposing appears to be a nonlinear phase shift that builds up as the clip proceeds in time. By nonlinear I mean that the higher the frequency, the greater the shift. Because different frequencies are being shifted at different rates, the spectral content is changing and comb filtering is occurring as harmonics pull away from one another.
The effect seems to worsen with the length of the affected clip. The effect is most noticeable on broadband sounds. This might explain why some clips don't noticeably degenerate.
Below about 500 Hz, the effect is almost nonexistent. At 8 kHz, it is rather pronounced. V-Vocal appears to roll off frequencies above 10 kHz, but this kind of weird distortion isn't likely to be audible that high anyway. It's the critical bands between 1 kHz and 8 kHz that suffer the most.
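For anyone who wants to see why a frequency-dependent phase shift matters, here is a rough Python/NumPy sketch. It is only an illustration, not a model of what V-Vocal actually does internally. The 48 kHz sample rate, the four harmonics, and the quadratic shift law are all my own assumptions. It compares a phase shift that is strictly proportional to frequency (which is just a time delay and leaves the wave shape intact) with one that grows faster than frequency (which reshapes the wave even though every harmonic keeps its level).

```python
import numpy as np

SAMPLE_RATE = 48_000      # assumed sample rate
FUNDAMENTAL_HZ = 1_000    # 1 kHz fundamental
HARMONICS = (1, 3, 5, 7)  # odd-order harmonics, square-wave-like (count is assumed)
SAMPLES = 480             # exactly ten cycles of the fundamental
PERIOD = SAMPLE_RATE // FUNDAMENTAL_HZ  # 48 samples per cycle


def tone(phase_for_harmonic):
    """Sum odd harmonics at 1/n amplitude, each with its own phase offset."""
    t = np.arange(SAMPLES) / SAMPLE_RATE
    signal = np.zeros(SAMPLES)
    for n in HARMONICS:
        angle = 2 * np.pi * n * FUNDAMENTAL_HZ * t + phase_for_harmonic(n)
        signal += np.sin(angle) / n
    return signal


def matches_some_delay(candidate, original):
    """True if candidate is just original delayed by a whole number of samples."""
    return any(np.allclose(candidate, np.roll(original, d)) for d in range(PERIOD))


reference = tone(lambda n: 0.0)
# Shift proportional to frequency: equivalent to a 3-sample delay.
delayed = tone(lambda n: -np.pi * n / 8)
# Hypothetical shift that grows with the square of the harmonic number.
dispersed = tone(lambda n: -np.pi * n ** 2 / 8)

print("proportional shift is a plain delay:", matches_some_delay(delayed, reference))
print("disproportionate shift is a plain delay:", matches_some_delay(dispersed, reference))
```

The point of the comparison is that the harmonic levels are identical in all three signals. Only the relative timing of the harmonics differs, and that alone is enough to change the wave shape.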
The screenshots below show how the waveform morphs over time. First, an image of the test tone I used, which is a 1 kHz wave with odd-order harmonics, spectrally similar to a square wave:

The left image is the original test signal. The right image is the clip after converting to a V-Vocal clip and then bouncing back to a normal audio clip. No V-Vocal processing was done. The distortion is obvious.
Here's another view:
The two waveforms on the left show the original (top left) and the V-Vocal clone (bottom left) near the start of the clip. There is no obvious distortion. The two waveforms on the right show the original (top right) and the V-Vocal clone (bottom right) later in the same clip. Note the distortion. Both views show two cycles of the same unchanging waveform. The only difference is that the left-hand image is from a quarter-second into the clip and the right-hand image is from about 1.5 seconds into the 5-second clip.
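If you would rather put numbers on this than eyeball screenshots, something like the following Python/NumPy sketch could be used. It is a rough approach under my own assumptions, not a finished tool: the function names are hypothetical, the 48 kHz rate and 10 ms window are assumed, and the drifting tone at the bottom is a synthetic stand-in rather than real V-Vocal output. It measures the phase of one harmonic relative to the fundamental in a short window, which cancels out any plain delay, so a clean tone reads zero everywhere and only a change in wave shape shows up.

```python
import numpy as np

SAMPLE_RATE = 48_000    # assumed; use the clip's actual rate
FUNDAMENTAL_HZ = 1_000  # fundamental of the test tone
WINDOW_S = 0.010        # 10 ms, a whole number of 1 kHz cycles


def harmonic_phase(signal, start_s, harmonic):
    """Phase in radians (sine convention) of one harmonic in a short window."""
    start = int(round(start_s * SAMPLE_RATE))
    count = int(round(WINDOW_S * SAMPLE_RATE))
    t = (start + np.arange(count)) / SAMPLE_RATE
    probe = np.exp(-2j * np.pi * harmonic * FUNDAMENTAL_HZ * t)
    # Multiplying by 1j turns the result into the phase of a sine, not a cosine.
    return np.angle(1j * np.sum(signal[start:start + count] * probe))


def shape_phase(signal, start_s, harmonic):
    """Harmonic phase relative to the fundamental, with any pure delay removed."""
    raw = (harmonic_phase(signal, start_s, harmonic)
           - harmonic * harmonic_phase(signal, start_s, 1))
    return np.angle(np.exp(1j * raw))  # wrap into (-pi, pi]


def make_tone(duration_s=5.0, drift_rad_per_s=0.0):
    """Synthetic test tone; the optional drift is a made-up stand-in for the fault."""
    t = np.arange(int(round(duration_s * SAMPLE_RATE))) / SAMPLE_RATE
    return sum(
        np.sin(2 * np.pi * n * FUNDAMENTAL_HZ * t + drift_rad_per_s * (n - 1) * t) / n
        for n in (1, 3, 5, 7)
    )


clean = make_tone()
drifting = make_tone(drift_rad_per_s=0.2)
for label, clip in (("clean", clean), ("drifting", drifting)):
    early = shape_phase(clip, 0.25, 3)
    late = shape_phase(clip, 1.5, 3)
    print(f"{label}: 3rd harmonic at 0.25 s = {early:.3f} rad, at 1.5 s = {late:.3f} rad")
```

To try it on real material, the original and bounced clips would need to be loaded as mono arrays at a known sample rate and passed in place of the synthetic tones. Comparing the reading at a quarter-second with the reading at 1.5 seconds should show whether the shift really grows over the length of the clip.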