Evaluating Multi-Turn Agents: A Quality Study of Microsoft Foundry’s Multi-Turn Evaluators