We moved to Azure Database for MySQL back in March 2020, and while it was stable, the most challenging issue we faced was a complete lack of control over when patches would be applied.
In December 2021, we moved to Azure Database for MySQL Flexible Server per Microsoft's recommendation, since it allows you to specify a patch window. Unfortunately, that was a big mistake due to numerous bugs and a half-baked implementation behind the scenes. I cannot imagine we are alone in experiencing the following issues, which have occurred repeatedly since mid-December:
- Failure with networking code (DNS records not properly updated upon failover)
- Failure with DNS cleanups (our DNS records were erroneously deleted by automated Azure backend cleanup scripts, per our support ticket)
- Failure with connectivity (this has been a CONSTANT, REPEATED issue for us -- the Azure Portal will say Healthy and Available, yet the Flex Server instance is completely unreachable; see the probe sketch after this list)
- Failure to acknowledge connectivity is lost (this compounds problem #3 above, since Azure thinks the instance is still Healthy and Available and continues to report it as such via API calls)
- Failure to automatically fail over to the secondary instance (compounding #3 and #4, the automatic failover never happens; High Availability does not exist without proper failure detection, which Azure Database for MySQL Flexible Server lacks. Let me be clear: this is a HUGE oversight and a MASSIVE problem with the offering)
- VNET port issue(s) -- our most recent ticket on Tuesday, May 24, resulted in us being told the Azure backend team "mitigated the issue by killing the port and setting a new port on the backend".
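Because the portal status cannot be trusted, we ended up watching reachability ourselves. Below is a minimal sketch of the kind of independent probe I mean -- the host name is a placeholder, and it only does a raw TCP connect to port 3306 rather than a full MySQL login, so it proves the listener is reachable, nothing more:

```python
#!/usr/bin/env python3
"""Independent reachability probe for a MySQL Flexible Server endpoint.

Sketch only: HOST is a placeholder, and a TCP connect to 3306 is a crude
check, but it detects the "portal says Available, server is unreachable"
situation without relying on Azure's own health reporting.
"""
import socket
import sys

HOST = "myserver.mysql.database.azure.com"  # placeholder, not our real server
PORT = 3306
TIMEOUT_SECONDS = 5

def is_reachable(host: str, port: int, timeout: float) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    if is_reachable(HOST, PORT, TIMEOUT_SECONDS):
        print(f"{HOST}:{PORT} reachable")
        sys.exit(0)
    print(f"{HOST}:{PORT} UNREACHABLE -- alert, regardless of what the portal says")
    sys.exit(1)
```

In practice we run something along these lines on a schedule and alert on failures, since waiting for Azure to notice the outage has not worked.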
I have repeatedly requested assistance and escalation of these issues, and I am finally working with the World Wide team to get this in front of someone in Engineering who can investigate the underlying problems.
If we focus on just the recurring problems, there are two:
- Why does the Azure Database for MySQL Flexible Server frequently stop accepting connections yet show as Available? We've repeatedly been told there were either backend DNS issues or backend VNET port issues that the Azure backend team has to resolve. That is NOT acceptable. We should NOT have to keep submitting tickets for these outages. We have roughly 14 tickets in the past 6 months.
- Why does the failover not happen? In fact, during our most recent failure on Tuesday, we attempted a Forced Failover, and that too failed with a VNET port issue (the kind of failover command we ran is sketched below).
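For context, a user-initiated forced failover on a Flexible Server with HA enabled is triggered through the Azure CLI roughly like this -- the resource group and server names here are placeholders, not ours:

```sh
# Trigger a forced failover of a MySQL Flexible Server (requires HA to be enabled).
# "my-rg" and "my-mysql-flex" are placeholder names.
az mysql flexible-server restart \
  --resource-group my-rg \
  --name my-mysql-flex \
  --failover Forced
```

Even this explicit, user-initiated path failed for us with the same VNET port error, which is what makes the HA story so hard to trust.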
I want this to be brought up in as many channels as possible, because this service is NOT ready for "Business Critical" applications. It's completely ironic that it is being branded that way, when we've been on the Memory Optimized tier and have experienced the total OPPOSITE of a "Business Critical" experience.
Thanks for coming to my TedTalk.