Description: Snow Atlas - West Europe - SAM Core Errors
Timeframe: Mar 20, 2025, 1:17 AM PST to Mar 21, 2025, 9:39 AM PST
Incident Summary
On Thursday, March 20, 2025, at 1:17 AM PST, monitoring systems identified an issue with the Snow Atlas platform that affected SAM Core functionality for a single customer in the West Europe region. Users within this tenant encountered errors and pages that failed to load correctly, which significantly affected self-service capabilities. Intermittent issues were also reported with the SaaS, audit logging, and currency pages.
Our technical team promptly investigated and discovered that the problem was isolated to one customer, with no impact on other users or services.
During the investigation, engineers identified that a messaging server was overloaded, preventing some platform services from registering correctly. As a result, certain data processing streams experienced delays and failures, contributing to the observed functionality issues. To restore service, the team performed a targeted restart of the affected servers, which were restored by 4:01 AM PST.
However, a large volume of pending messages remained in one of the processing streams, requiring additional time for clearance. Extended monitoring continued to ensure platform stability, and after validation, the incident was formally declared resolved on Mar 21, 2025, at 9:39 AM PST.
Root Cause
The root cause was traced to an overloaded messaging server that prevented some services from registering properly. The overload occurred because a single customer sent an unusually high volume of discovery data, which temporarily delayed data processing.
While the servers were restored and fully operational by 4:01 AM PST on Mar 20, 2025, a large backlog of messages remained in one stream, requiring additional time for processing. No other customers or tenants were impacted, and all other services continued to operate normally.
Remediation Actions
· Server Restart & Restoration: The technical team identified and restarted the affected servers, restoring most services.
· Stream Recreation: The overloaded message stream and its associated consumers were re-created to ensure proper functionality.
· Continuous Monitoring: Monitoring was implemented to track the backlog processing and ensure system stability.
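For readers who want a sense of why the backlog took additional time to clear after the servers were restored, the following is a minimal sketch of how a drain time can be estimated while monitoring a stream. It is illustrative only: the function name, parameters, and figures are hypothetical and are not taken from the Snow Atlas platform or from this incident.

```python
import math


def estimate_drain_seconds(pending: int, consume_rate: float, arrival_rate: float = 0.0) -> float:
    """Estimate the seconds needed to clear a message backlog.

    pending      -- messages currently waiting in the stream (hypothetical input)
    consume_rate -- messages processed per second by consumers
    arrival_rate -- new messages arriving per second while draining
    """
    if pending <= 0:
        return 0.0
    net_rate = consume_rate - arrival_rate
    if net_rate <= 0:
        # Consumers are not keeping up, so the backlog never shrinks.
        return math.inf
    return pending / net_rate


# Made-up figures for illustration: 3.6 million pending messages,
# 1,500 processed per second, 500 still arriving per second.
seconds = estimate_drain_seconds(3_600_000, 1_500, 500)
print(f"Estimated drain time: {seconds / 3600:.1f} hours")
```

The point of the sketch is that drain time depends on the difference between the processing rate and the arrival rate, so a backlog can persist well after the underlying servers are healthy again.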
Future Preventative Measures
· Customer Configuration Review: Our teams are working with the impacted customer to review and correct any misconfigurations in their environment to prevent excessive data transmission.
· Proactive Performance Reviews: Regular performance assessments will be conducted to identify and address potential bottlenecks before they cause disruption.