Outbound Messaging Publication Approval Service Degradation Update
On October 1, the Khoros outbound message publication approval service experienced a degradation of service, causing outbound message publication to be severely delayed and, in some cases, abandoned to an error state. This was caused by the failure of a server node for our approvals processing service. After diagnosing the issue, we restarted the service and reprocessed the posts with their approvals paths, communicating that posts delayed more than an hour beyond their originally scheduled publication time would error out.
After the October 1 degradation, we continued investigating the node failure and the resulting approvals outage to understand them more fully. This investigation was completed on October 7.
We found that the underlying issue was linked to maintenance on our cloud infrastructure. Specifically, the approvals service did not handle the effects of the maintenance in a resilient manner and ceased to process approvals paths on posts. The immediate-term solution was to restart the service and reprocess the posts with approvals paths, as we did on October 1. A sustained solution required us to make the approvals service more resilient to this particular class of cloud maintenance, and we began working on that longer-term solution on October 7.
From October 8 onward, we identified a failure in an internal approval processing service, and coding changes have been implemented to repair and improve that service and its related nodes. Additionally, we identified a software defect that was hampering our ability to repair approvals services promptly. Coding changes addressing this defect have since been deployed to our production environment.
Our engineering teams have enhanced and expanded monitoring of approvals, across related services and software architectures, so that potential issues can be identified proactively as they arise.
We have a dedicated team actively evaluating and improving the overall health of our approval systems and services.
With these corrective and proactive actions in place, we are now moving from a monitoring status to an operational status.
Thank you for your patience with Khoros at this time. Our customers and their trust are our imperative.