Several replies here suggest improving your listening environment or monitors as a solution to the problem. But it seems to me that this sort of misses the point.
Let's say you have a perfectly treated room and truly "linear" monitors, and using your ears in that environment you produce an impeccable mix. Will you not have the same type of problem when you play that mix in an imperfect environment on speakers that are undoubtedly not linear? Many of the devices that listeners will be using are distorted by design to boost certain frequencies, and many let the listener adjust an equalizer in ways that you cannot even imagine.
The advantage of a perfect mixing environment, if there is one, is that what you hear will be consistent and will represent the actual frequencies in the mix. But to convert that mix to something you can play in the car without adjustment, you would need to either imagine or model how it will sound in that environment. The ability to pre-visualize (pre-herenize?) how it will sound is probably the major skill that professionals have to master. The best way to "model" various listening environments is to play the mix in those environments, where, as you have found, mixes frequently do not cut it. The VRM devices are a more convenient way to do this than trying to mix while riding down the freeway.
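To make the "model" idea a bit more concrete, here is a rough sketch of what simulating a playback system could look like in software. It is not how the VRM devices work internally (I have no idea what they do under the hood); it just applies a made-up frequency response curve to a signal using an FFT. The function name apply_response and the "car stereo" curve are my own inventions for illustration, and a real car would also add reflections, road noise and distortion that this ignores.

```python
import numpy as np

def apply_response(signal, sample_rate, freqs_hz, gains_db):
    """Color a mono signal with a rough playback-system frequency response."""
    spectrum = np.fft.rfft(signal)
    bin_freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    # Interpolate the gain curve (in dB) onto the FFT bin frequencies.
    gain_db = np.interp(bin_freqs, freqs_hz, gains_db)
    # Convert dB to linear amplitude and scale each bin.
    spectrum *= 10.0 ** (gain_db / 20.0)
    return np.fft.irfft(spectrum, n=len(signal))

# Hypothetical "car stereo" curve (assumed values, not measured):
# boosted bass, flat around 1 kHz, treble rolled off.
CAR_FREQS_HZ = [20, 100, 1000, 5000, 20000]
CAR_GAINS_DB = [6.0, 6.0, 0.0, -3.0, -6.0]

sample_rate = 44100
t = np.arange(sample_rate) / sample_rate   # one second of audio
bass = np.sin(2 * np.pi * 100 * t)         # 100 Hz test tone
mid = np.sin(2 * np.pi * 1000 * t)         # 1 kHz test tone

colored_bass = apply_response(bass, sample_rate, CAR_FREQS_HZ, CAR_GAINS_DB)
colored_mid = apply_response(mid, sample_rate, CAR_FREQS_HZ, CAR_GAINS_DB)

print("bass peak:", np.max(np.abs(colored_bass)))
print("mid peak:", np.max(np.abs(colored_mid)))
```

With this assumed curve the 100 Hz tone comes out at roughly twice the amplitude while the 1 kHz tone is left alone, which is the kind of bass-heavy translation you would then have to compensate for in the mix. Running a real mix through a curve like this tells you about tonal balance only; it is no substitute for actually listening in the car.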
In any case, it is difficult to imagine how a "perfect" mix could be moved to all listening transducers/environments/listeners with equally good results. That is why even cell phones typically have equalizers. Do not blame your reference monitors for the failings of the cheap stuff your music will actually be played on. But you will have to take into account the translation of your perfect mix into real-world sound if you want to minimize this kind of problem.