Flexera One Platform - EU - Service Degradation

Incident Report for Flexera System Status Dashboard

Postmortem

Description: Flexera One Platform – Europe – Access Disruption

Timeframe: July 6, 2025, 11:35 PM PDT to July 7, 2025, 12:28 AM PDT

Incident Summary

On July 7, 2025, at 12:09 AM PDT, the Flexera One platform in the Europe region experienced an access disruption. During this period, customers may have been unable to load the landing page or access key workflows within the platform.

The disruption was caused by a failure in a core service responsible for routing platform traffic. The affected component became unresponsive, making the platform temporarily unavailable, and the impacted node was manually rebooted to restore service availability.

Platform access was restored by 12:28 AM PDT. Comprehensive checks were completed to confirm full functionality, and no customer-reported issues were received during the incident.

Root Cause

Primary Root Cause:
The node hosting the gateway service, one of the infrastructure components responsible for traffic routing, became unresponsive, causing a temporary disruption in platform access. The impacted node was manually rebooted to restore routing capabilities.

Contributing Factors:

• Workload Placement: System components responsible for routing traffic were deployed alongside other workloads on shared infrastructure. This configuration may have contributed to increased resource demand ahead of the failure. Measures are being implemented to isolate critical workloads more effectively.
• Alert Routing Gaps: While multiple alerts were triggered during the incident by product-specific monitoring systems, infrastructure-level alerting did not provide sufficient visibility into the platform-level disruption. Enhancements to alert coverage are in progress to support timely detection of core service impacts (see the sketch after this list).
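The report does not describe Flexera's monitoring stack, so the following standalone probe is only an illustrative sketch of the kind of infrastructure-level synthetic check that could surface a gateway outage independently of product-specific alerts. The endpoint URL, thresholds, and alert hook are placeholders, not actual Flexera configuration.

```python
"""Minimal synthetic-probe sketch; endpoint, thresholds, and alerting are assumptions."""
import time
import urllib.request

HEALTH_URL = "https://gateway.example.eu/healthz"  # placeholder, not a real Flexera endpoint
TIMEOUT_S = 5            # per-request timeout
FAILURES_TO_ALERT = 3    # consecutive failures before paging
INTERVAL_S = 30          # seconds between probes


def probe_once(url: str, timeout: float) -> bool:
    """Return True if the gateway answers with an HTTP 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers connection errors, timeouts, and HTTP error responses.
        return False


def raise_alert(message: str) -> None:
    """Stand-in for an infrastructure-level pager integration."""
    print(f"ALERT: {message}")


def main() -> None:
    consecutive_failures = 0
    while True:
        if probe_once(HEALTH_URL, TIMEOUT_S):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURES_TO_ALERT:
                raise_alert(f"gateway unreachable for {consecutive_failures} consecutive probes")
        time.sleep(INTERVAL_S)


if __name__ == "__main__":
    main()
```

Because a probe of this kind exercises the same routing path customers use, a run of consecutive failures maps directly to the platform-level impact described above rather than to any single product.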

Remediation Actions

  1. Manual Reboot of Affected Node: The impacted node was manually rebooted to restore the gateway service and resume traffic routing across the platform.
  2. Post-Recovery Readiness Checks: After restoration, teams performed comprehensive validations across key workflows to confirm full accessibility and check for residual impact, supporting continued monitoring of the platform (see the sketch after this list).
  3. Targeted Infrastructure Enhancements: Following the incident, our technical teams conducted a detailed analysis and implemented targeted infrastructure adjustments to reduce the risk of recurrence. As part of this effort, throughput capacity for critical services was increased to better accommodate variable usage patterns and ensure smoother traffic handling, part of a broader initiative to strengthen performance and minimize the risk of future resource-related disruptions.
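How the post-recovery validations were executed is not stated in the report; the sketch below is an assumed, minimal readiness sweep across stand-in endpoints for the affected workflows (login page, RESTful APIs, inventory upload). All URLs and pass criteria are hypothetical.

```python
"""Post-recovery readiness sweep sketch; endpoint list and criteria are assumptions."""
from concurrent.futures import ThreadPoolExecutor
import urllib.request

# Hypothetical stand-ins for key workflows listed in the incident scope;
# real checks would exercise authenticated, product-specific paths.
CHECKS = {
    "login page": "https://login.example.eu/",
    "restful apis": "https://api.example.eu/status",
    "inventory upload": "https://upload.example.eu/ping",
}


def check(name: str, url: str) -> tuple[str, bool]:
    """Return (name, ok) where ok means an HTTP 2xx response within 10 seconds."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return name, 200 <= resp.status < 300
    except OSError:
        return name, False


def readiness_sweep() -> bool:
    """Run all checks in parallel; the sweep passes only if every check passes."""
    with ThreadPoolExecutor(max_workers=len(CHECKS)) as pool:
        results = list(pool.map(lambda item: check(*item), CHECKS.items()))
    for name, ok in results:
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
    return all(ok for _, ok in results)


if __name__ == "__main__":
    raise SystemExit(0 if readiness_sweep() else 1)
```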

Future Preventative Measures

Following this incident, a detailed root cause analysis and internal retrospective were completed to identify areas for long-term improvement. The following workstreams were initiated under a platform reliability initiative aimed at strengthening infrastructure performance and minimizing the risk of recurrence. While the underlying cause of the infrastructure disruption remains under investigation, the actions already implemented and those underway are expected to significantly improve the platform’s ability to detect, respond to, and recover from similar events in the future.

  1. Increased Throughput Capacity: Throughput limits for key platform services were raised to accommodate varying levels of system usage. This change is expected to reduce the likelihood of resource-related disruptions under peak or unexpected load conditions, and analysis is ongoing to identify the factors contributing to usage spikes (see the sketch after this list).
  2. Infrastructure Resource Optimization: Work is underway to isolate supporting system processes from core routing services. This adjustment aims to reduce resource contention on shared infrastructure and improve the efficiency of traffic handling during variable demand.
  3. Planned Upgrade of Gateway Codebase: An upgrade to the gateway component is currently being planned to incorporate the latest stability improvements. This initiative is intended to improve resilience, observability coverage, and the ability to diagnose abnormal behavior in future incidents.
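The report does not specify how these throughput limits are enforced; the generic token-bucket sketch below illustrates the kind of per-service ceiling referred to in item 1, with rates and burst sizes chosen purely for illustration rather than taken from Flexera's configuration.

```python
"""Generic token-bucket limiter; all rates here are illustrative placeholders."""
import time


class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Placeholder ceiling for a critical service; raising rate/capacity leaves more
# headroom before requests are shed under a usage spike.
gateway_limit = TokenBucket(rate=500, capacity=1000)

if __name__ == "__main__":
    allowed = sum(gateway_limit.allow() for _ in range(1200))
    print(f"{allowed} of 1200 back-to-back requests admitted within the burst budget")
```

Raising the bucket's rate and capacity gives a service more headroom before requests are rejected, which is the effect the capacity increase above is intended to have during peak or unexpected load.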
Posted Jul 15, 2025 - 20:42 PDT

Resolved

This incident has been resolved.
Posted Jul 07, 2025 - 01:03 PDT

Monitoring

Our teams have restarted some infrastructure components, and this appears to have resolved the issue. We will continue to monitor the platform for stability before declaring it fully restored.
Posted Jul 07, 2025 - 00:34 PDT

Investigating

Issue Description: We are currently investigating a service degradation affecting the Flexera One platform in the EU region. Customers may experience issues accessing services at this time.

Priority: P1

Restoration Activity: Our technical teams have been engaged and are working to identify the affected service and determine the scope of impact. Further updates will be shared as the investigation progresses.
Posted Jul 07, 2025 - 00:26 PDT
This incident affected: Flexera One - IT Asset Management - Europe (IT Asset Management - EU Beacon Communication, IT Asset Management - EU Inventory Upload, IT Asset Management - EU Login Page, IT Asset Management - EU Batch Processing System, IT Asset Management - EU Business Reporting, IT Asset Management - EU SaaS Manager, IT Asset Management - EU Restful APIs) and Flexera One - IT Visibility - Europe (IT Visibility EU).