Forum Discussion

Former Employee

Jul 24, 2018

Welcome everyone to the Skype for Business Server 2019 Preview Forum!

I just want to welcome everyone who has read the blog - SfB Server 2019 preview is ready for you to try. Be sure to post ALL your questions, comments, and feedback here.

Krzysztof Sienkiewicz

Brass Contributor

Apr 17, 2019

Paul Cannon Hi, we also deployed Skype for Business Server 2019 but have run into two strange problems. We don't know what they are related to, because we see no network issues with the VMs or the SQL back-end.

1. The Skype 2019 pool in another region has problems communicating with the back-end in the primary site (where the monitoring DB sits), sometimes once a day, sometimes 2-3 times a week, which sometimes causes the Front End Service to restart:

1a. Failed to execute a stored procedure on the back-end.

Component: QoE Adaptor
Stored Procedure: HeartbeatFrontEnd
Error: System.Data.SqlClient.SqlException (0x80131904): A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The semaphore timeout period has expired.) ---> System.ComponentModel.Win32Exception (0x80004005): The semaphore timeout period has expired

1b. Then it is immediately followed by this error:

A Data Collection QoE adaptor has failed to connect to or lost a connection to the back-end. It will continue to try and reconnect to the back-end.

Adaptor: QoE Adaptor
Connection String: Data Source=primarysiteSQL-BackEnd.domain.com;Initial Catalog=QoEMetrics;Integrated Security=True;Enlist=False;Pooling=True;Max Pool Size=5;Connect Timeout=60;Application Name=Microsoft.Rtc.Management.Core
Dependent Machine: primarysiteSQL-BackEnd.domain.com
Cause: This typically occurs when the back-end is down or not reachable.
Resolution:
Verify the back-end is up and this Skype for Business Server has connectivity to it. If the problem persists, notify your organization's support team with the relevant details.

1c. and then we see:

An attempt by Storage Service to replicate data via the fabric from the primary failed

Fabric Service Id=F0DF488FE78C536EB6DC68BC6B5E6E81. Exception: #CTX#{ctx:{traceId:258301240, activityId:"f2051b0b-d060-4668-a4bd-2de2ad5b9d5b"}}#CTX#System.ApplicationException: Read denied for [fabric:/lync/storage/F0DF488FE78C536EB6DC68BC6B5E6E81] - Access Status [ReconfigurationPending] ---> System.AccessViolationException: PartitionAccessStatus

1d. and then:

Failed to commit session data into the local Storage Service database.

Error:
System.OperationCanceledException: Received queue operation request for fabric service id [F0DF488FE78C536EB6DC68BC6B5E6E81], but that group is not owned by this FE... requested operation unable to continue on this host. ---> Microsoft.Rtc.Internal.Storage.StorageApiParamValidationException: secondsiteFE1.domain.com

1e. and finally there is even a problem connecting to the SQL back-end in the same site:

Failed to connect to back-end database. Skype for Business Server will continuously attempt to reconnect to the back-end. While this condition persists, incoming messages will receive error responses.

Back-end Database: rtcab Connection string of:
driver={SQL Server Native Client 11.0};Trusted_Connection=yes;AutoTranslate=no;server=secondsiteSQL-BackEnd.domain.com.domain.com;database=rtcab;
Cause: Possible issues with back-end database.
Resolution:
Ensure the back-end is functioning correctly.

A few other errors are then thrown in the Event Viewer, after which it suddenly starts working again.

2. A problem uploading QoE metrics specifically, from the second site to the primary site SQL (where the monitoring DB sits):

The Quality-Metric server cannot be contacted. The Quality metric reports are not sent to the server.

QoE Agent:
QoE GRUU: sip:secondsiteFEpool.domain.com@domain.com;gruu;opaque=srvr:QoS:nH-yhC2-dFWNhnvaPm7qGgAA
Exception: Microsoft.Rtc.Internal.Qoe.QoeSendException:This operation has timed out. ---> Microsoft.Rtc.Signaling.OperationTimeoutException:This operation has timed out.

Is this something you are aware of, and will it be fixed in a new CU?

Drive_Heart

Former Employee

Apr 17, 2019

Krzysztof Sienkiewicz, we tried to reproduce the issue you ran into yesterday, and everything worked well. To investigate fully as soon as possible, we need further information. Could you help me clear up the following points?

1. Does one of the strange problems you ran into look like this scenario? (I have two sites, New York and Washington, which share a single monitoring DB in New York, but now the Washington site's Front End pools cannot work.) If so, could you please tell me which other server roles you have deployed? Or you can tell me what you did before the error occurred (such as restarting primarysiteSQLBackEnd.domain.com and so on).

2. The other problem is the operation timing out: you mean it cannot successfully upload QoE metrics specifically (from the second site to the primary site SQL). Did you configure the firewall, and did you check that all servers can ping the DC?

Thanks

Krzysztof Sienkiewicz
Brass Contributor
Apr 17, 2019
Drive_Heart
Let me provide you with our setup:
As with Skype for Business Server 2015, we deployed pools in two sites (Denmark (pri) and U.S. (sec)).
Denmark pool:
3 Front Ends colocated with Mediation (prisiteFE01.domain.com, prisiteFE02..., prisiteFE03...)
1 SQL Back-End (with CMS DB, pool DBs and Monitoring DB) (prisiteSQLBackEnd.domain.com)
U.S. pool:
3 Front Ends colocated with Mediation (secsiteFE01.domain.com, secsiteFE02..., secsiteFE03...)
1 SQL Back-End (with pool DBs only) (secsiteSQLBackEnd.domain.com)
So the U.S. pool sends monitoring data to the Denmark pool.
As mentioned, this is the same setup we had in Skype 2015: we opened the same ports, it is configured the same way, and of course it is part of the same topology.
I spent a few hours yesterday digging into problem nr 2, and it seems it is not occurring constantly. Only once in a while, when a call completes in the U.S., it tries to send QoE metrics, times out and (as I read, "by design") never retries; it just goes on to the next job. Sometimes it times out, sometimes it does not. Do you know what the timeout is exactly?
The firewall (Windows and Cisco) is configured the same way as it was for Skype 2015, so there should be no issues there.
I managed to capture CLS logs when a timeout happened, and they show that it sends the metrics (as it is SIP, it sends through the pool) but does not receive the 202 OK answer. In all the traffic captured when this one timeout occurs, this is the only outgoing SERVICE message that does not receive a 202 OK in return, so it logs a timeout.
I have NetMon logs, but will have to show them to a network guy to really understand them. I only see that one front end in the U.S. pool sends the metrics to another front end in the U.S. pool (why?). Maybe this other front end is then responsible for sending the metrics to the Monitoring DB in the pri site?
All servers ping the DCs :)
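
The check done by hand on the CLS capture (finding the one outgoing SERVICE message that never gets a 202 back) can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout (`kind`, `method`, `call_id`, `cseq`, `status`) is hypothetical and would have to be produced by parsing the real trace first, and the sample values are made up. It relies on the fact that a SIP response carries the same Call-ID and CSeq as the request it answers.

```python
def unanswered_requests(messages, method="SERVICE"):
    """Return (call_id, cseq) of requests that never got a final response."""
    # A final response is any status >= 200; it echoes Call-ID and CSeq.
    answered = {
        (m["call_id"], m["cseq"])
        for m in messages
        if m["kind"] == "response" and m["status"] >= 200
    }
    return [
        (m["call_id"], m["cseq"])
        for m in messages
        if m["kind"] == "request"
        and m["method"] == method
        and (m["call_id"], m["cseq"]) not in answered
    ]


# Made-up sample records, not taken from the real capture.
sample = [
    {"kind": "request", "method": "SERVICE", "call_id": "a1", "cseq": "1 SERVICE"},
    {"kind": "response", "status": 202, "call_id": "a1", "cseq": "1 SERVICE"},
    {"kind": "request", "method": "SERVICE", "call_id": "b2", "cseq": "1 SERVICE"},
]

if __name__ == "__main__":
    for call_id, cseq in unanswered_requests(sample):
        print("no final response for", call_id, cseq)
```

Running the same check on traces from each front end in the path would show at which server the request is last seen.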

Regarding issue nr 1, I am still trying to reproduce it. As I said, it works for a few days, then suddenly has a problem for 2-3 minutes, then goes back to normal, but nothing has been logged in SCOM today yet.
- Krzysztof Sienkiewicz
  Brass Contributor
  Apr 18, 2019
  Drive_Heart
  So, I know more now, after understanding the flow of traffic.
  It may be that there are too many hops between the new Skype 2019 pool in US and the monitoring DB in Denmark.
  To keep it simple: the user who made or received the call is still on Skype 2015, but for all mediation traffic Skype 2019 is now the primary pool:
  1. Skype 2019 in US initiates and controls the call until it ends
  2. Since the calling/called user is still on Skype 2015 in US, it sends the QoE metrics from the First 2019 FE to the Third 2019 FE in US (Hop 1)
  3. Third 2019 FE sends QoE metrics to Second 2015 FE in US (Hop 2)
  4. Second 2015 FE in US sends QoE metrics to whichever (don't have logs from this pool) 2015 FE in DK (Hop 3)
  5. 2015 FE in DK sends data to Skype 2015 Monitoring DB
  Since many calls (tens or hundreds) are happening on the mediation pool, and only a few QoE metrics don't get delivered to the monitoring DB, it does not seem to be a big issue and might be related to network latency, since so many hops are in the path.
  
  But the problem described as Nr 1 (with failing connection to backend) and PowerPoint sharing problems (let's call it problem nr 3) in Skype 2019 are still there. Apart from that we noticed another problem in Skype 2019 (let's call it problem nr 4):
  4a) It starts with LS Protocol Stack Warning (EventID 14397):
  A configured certificate could not be loaded from store. The serial number is attached for reference.
  Extended Error Code: 0x80092004(CRYPT_E_NOT_FOUND).
  
  4b) it then gets LS Protocol Stack ERROR (EventID 14623):
  A serious problem related to certificates is preventing Skype for Business Server from functioning.
  Unable to use a certificate as configured.
  Transport:TLS, IP address:0.0.0.0, Port:5061, Error:0xC3E93C0D(SIP_E_STACK_TRANSPORT_CERT_NOT_FOUND).
  Ensure that a valid certificate is present in the local computer certificate store. Also ensure that the server has sufficient privileges to access the store.
  Cause: The Skype for Business Server failed to initialize with the configured certificate.
  Resolution:
  Review and correct the certificate configuration, then start the service again.
  
  4c) then it goes into shutting down the Front End Service with LS Server ERROR (EventID 12303):
  The protocol stack reported a critical error: code 0xC3E93C0D (SIP_E_STACK_TRANSPORT_CERT_NOT_FOUND). The service has to stop.
  
  4d) it continues with a series of a few warnings and errors about not being able to open a connection to the Storage Service (LS Data Collection EventID 56726), and then after 3 minutes it starts working again, the service is brought back up, and there are no more issues connecting to the certificate store.
  
  If you think writing all this here in the forum is not right, we could switch to another tool (mail/Skype) so I can describe the issues and we could maybe resolve them?
  - openstreamtechnologies
    Copper Contributor
    Jan 15, 2021
    Krzysztof Sienkiewicz
    
    Hi Krzysztof, did you ever find out the root cause of this problem? I have this issue on SfB Server 2015 with the latest CU.
    Front End Service with LS Server ERROR (EventID 12303):
    The protocol stack reported a critical error: code 0xC3E93C0D (SIP_E_STACK_TRANSPORT_CERT_NOT_FOUND). The service has to stop.