Recently I troubleshooted an issue where a distributed query was randomly getting a poor plan and slow performance where the join would not be remoted. However, based on the actual data distribution and the join condition, you would expect the join to be remoted. All of the remote servers were SQL Server, so we were not dealing with heterogeneous data sources. Further investigation in to the plan also showed that there were some unexplained estimations higher up in the plan.
This all boiled down to a permissions issue with the login used on the remote servers. Your initial reaction may be ‘how can that be a permissions issue!’. Especially in view that the query runs successfully without error and retrieves data from the remote servers successfully.
The root cause is regarding the permissions required when the local server is gathering remote statistics. When SQL Server compiles a distributed query that is accessing data in other SQL Servers, the local server will make calls to the remote server database(s) to gather statistics about the tables referenced by the query. If you gather a Profiler trace on the remote server during one of these compiles (In addition to the RPC and Batch completed events, add the ‘Audit DBCC Event’ and the ‘Exception’ event), you will see calls like the following:
Note the LoginName is
. So the way I had my linked server configured is to connect to the remote server using the login plain. In this case, the login plain had the necessary permissions to run dbcc show_statistics in my remote database. So the optimizer was able to get the statistics it needed and create a good plan.
However, suppose my
login is only a reader in the database. Now see what the trace on the remote server shows:
There is now an Exception event with “Error: 2557”. What is this? You can pull this directly from sys.messages:
select * from sys.messages
where message_id = 2557
And 2557 is “User '%.*ls' does not have permission to run DBCC %ls for object '%.*ls'.”
When this happens, the optimizer on the local server continues to compile the plan without the remote statistics, but will likely make very poor remoting decisions and develop a plan that may not appropriately remote joins or predicates. In the plan you may see that estimates are far off from actual.
Since the permissions are tied to how you have your linked server security defined, you can see how this can introduce some very random plan behavior where sometimes you see a good plan and other times you see a bad plan, and may have trouble reproducing it consistently. Say for example you have the linked server defined to impersonate the login. Some logins have db_owner in the remote database and some logins do not. If the plan is being cached and able for the optimizer to reuse, based on the login permissions at the time the plan was cached you may get a bad plan and then subsequent executions use this bad plan. Then the plan is aged from cache, the next user that runs does have permissions on the remote server, now a good plan is cached and everyone now starts using the good plan.