Reminds me of an article I read a few years back from Dan Holme about a gnarly problem he had with the SharePoint User Profile Service (UPS). He had a BUNCH of people directly from MS product groups trying to troubleshoot it (with none of them able to solve it, of course) and ended up, in the interests of not burning an untold amount of time, taking a snapshot of the VM created right before the UPS issue and restoring it onto a new VM, on the same host! And it worked flawlessly. And that wasn't even in the cloud, LOL!
The moral of THAT story is similar to what yours probably is at this point: there are SO many layers involved now that it is essentially impossible to truly diagnose and confidently correct issues unless you can throw really prodigious resources at them. And, sometimes, not even then.