Snow Atlas - West Europe - SAM Core Errors

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Snow Atlas - West Europe - SAM Core Errors

Timeframe: Mar 20, 2025, 1:17 AM PDT to Mar 21, 2025, 9:39 AM PDT

Incident Summary

On Thursday, March 20, 2025, at 1:17 AM PDT, monitoring systems identified an issue with the Snow Atlas platform that affected SAM Core functionality for a single customer in the West Europe region. Users within this tenant encountered errors or pages that failed to load correctly, which significantly affected self-service capabilities. Intermittent issues were also reported with the SaaS, audit logging, and currency pages.

Our technical team promptly investigated and discovered that the problem was isolated to one customer, with no impact on other users or services.

During the investigation, engineers identified that a messaging server was overloaded, preventing some platform services from registering correctly. As a result, certain data processing streams experienced delays and failures, contributing to the observed functionality issues. To restore service, the team performed a targeted restart of the affected servers, which were restored by 4:01 AM PDT.

However, a large volume of pending messages remained in one of the processing streams and required additional time to clear. Extended monitoring continued to ensure platform stability, and after validation, the incident was formally declared resolved on Mar 21, 2025, at 9:39 AM PDT.

Root Cause

The incident's root cause was traced to an overloaded messaging server that prevented some services from registering properly. This occurred due to a single customer sending an unusually high volume of discovery data, leading to temporary delays in data processing.

While the servers were restored and fully operational by 4:01 AM PDT on Mar 20, 2025, a large backlog of messages remained in one stream and required additional time to process. No other customers or tenants were impacted, and all other services continued to operate normally.

Remediation Actions

· Server Restart & Restoration: The technical team identified and restarted the affected servers, restoring most services.

· Stream Recreation: The overloaded message stream and its associated consumers were re-created to ensure proper functionality.

· Continuous Monitoring: Monitoring was implemented to track the backlog processing and ensure system stability.

Future Preventative Measures

· Customer Configuration Review: Our teams are working with the impacted customer to review and correct any misconfigurations in their environment to prevent excessive data transmission.

· Proactive Performance Reviews: We will conduct regular performance assessments to identify and address potential bottlenecks before they cause disruption.

Posted Apr 02, 2025 - 01:41 PDT

Resolved

At this time, all services are fully operational, and no further disruptions are observed. Our analysis has shown that the issue had minimal impact, with the majority of customers remaining unaffected. This incident has been resolved.
Posted Mar 21, 2025 - 10:19 PDT

Update

The service continues to operate normally. As a precaution, we are performing extended monitoring to ensure there are no lingering issues. While message processing has been restored and the service remains stable, we are also conducting additional validations to confirm data integrity.
Posted Mar 20, 2025 - 16:15 PDT

Update

The affected service is operational, and all message streams have been restored. While the service remains stable, we are conducting additional validations to ensure data integrity. We will continue to monitor and assess the situation and provide updates as needed.
Posted Mar 20, 2025 - 14:49 PDT

Update

The affected service is operational, but message processing is still ongoing. We anticipate it may take a few hours to complete. Our teams are actively monitoring the progress, and we will continue to provide updates as the work advances.
Posted Mar 20, 2025 - 09:15 PDT

Update

The services remain stable following our remediation efforts. We are conducting additional validations and extending our monitoring period before officially declaring the services fully restored.
Posted Mar 20, 2025 - 03:57 PDT

Monitoring

Incident Description: Our teams are currently investigating an issue affecting the Snow Atlas platform, where users may encounter errors when accessing SAM Core functionality. This issue impacts multiple customers in the West Europe region, resulting in pages throwing errors or loading incorrectly for affected users within Snow Atlas SAM Core. Some users may have faced intermittent issues with SaaS, audit logging and currency as well.

Priority: P2

Restoration Activity: Our technical team was promptly engaged and was able to isolate an overloaded server as the root cause. They have performed the necessary restarts to restore the services and are observing a significant improvement. We are monitoring the situation closely and will keep you informed of any developments.
Posted Mar 20, 2025 - 01:28 PDT
This incident affected: Snow Atlas (Snow Atlas - Europe).