Description: Flexera One – IT Visibility – NAM – Data Processing Delays
Timeframe: November 7th, 2024, 2:13 PM to November 10th, 2024, 12:11 PM PDT
Incident Summary
On Thursday, November 7th, 2024, at 2:13 PM PDT, delays in data processing were observed within the IT Visibility (ITV) platform, affecting customers in the NAM region. The user interface and dashboards remained accessible, but inventory data updates were delayed, which reduced the timeliness of dashboard insights. No other ITV services were affected during the incident.
The issue originated from a bottleneck in a core processing service, caused by autoscaling challenges in an internal database. The database’s primary node experienced slow response times, and secondary nodes were either in recovery mode or inaccessible due to insufficient disk space. High incoming traffic and unexpected data configurations from a specific data source compounded the processing delays.
To reduce the backlog, we scaled resources, temporarily disabled traffic from specific data sources, and collaborated extensively with our service provider. The backlog was fully cleared, and all data processing returned to normal on November 10th, 2024, at 12:11 PM PDT.
Root Cause
Primary Root Cause:
The delays in data processing were caused by a bottleneck in a core processing service. This bottleneck arose from autoscaling challenges in the internal database, where the primary node experienced slow response times, and secondary nodes became either inaccessible or entered recovery mode due to insufficient disk space.
Contributing Factors:
Remediation Actions:
Future Preventative Measures:
This incident was caused by an unusually high data volume that exceeded the system’s ability to scale automatically. We have held detailed discussions with the service provider to understand the contributing factors and explore measures to prevent similar issues in the future.
To ensure operational stability, the following actions are being taken:
Although the circumstances behind this incident were rare, we remain committed to proactively improving our processes, collaborating with our service provider, and implementing measures to prevent similar issues while ensuring a seamless experience for our customers.