Flexera One - IT Visibility - NA - Data Processing Delayed
Incident Report for Flexera System Status Dashboard
Postmortem

Description: Flexera One – IT Visibility – NAM – Data Processing Delays

Timeframe: November 7th, 2024, 2:13 PM to November 10th, 2024, 12:11 PM PST

Incident Summary

On Thursday, November 7th, 2024, at 2:13 PM PST, delays in data processing were observed within the IT Visibility (ITV) platform, affecting customers in the NAM region. While the user interface and dashboards remained accessible, inventory data updates were delayed, impacting the timeliness of dashboard insights. No other ITV services were affected during the incident.

The issue originated from a bottleneck in a core processing service, caused by autoscaling challenges in an internal database. The database’s primary node experienced slow response times, and its secondary nodes were either in recovery mode or inaccessible due to insufficient disk space. The situation was compounded by high incoming traffic and unexpected data configurations from a specific data source, which further extended the processing delays.

The backlog reduction efforts involved scaling resources, temporarily disabling traffic from specific data sources, and collaborating extensively with our service provider. The backlog was fully cleared, and all data processing returned to normal on November 10th, 2024, at 12:11 PM PST.

Root Cause

Primary Root Cause:

The delays in data processing were caused by a bottleneck in a core processing service. This bottleneck arose from autoscaling challenges in the internal database, where the primary node experienced slow response times and the secondary nodes either became inaccessible or entered recovery mode due to insufficient disk space.

Contributing Factors:

  1. High Incoming Traffic: Elevated levels of incoming traffic during the incident period increased the load on the processing service, exacerbating the delays.
  2. Unexpected Data Configurations: Data from a specific source introduced unanticipated complexity, placing additional strain on the processing system and compounding the backlog.
  3. Resource Constraints: The system's scaling mechanisms and resource allocation were insufficient to manage the combination of high traffic and complex data configurations, leading to prolonged recovery times.
  4. Database Limitations: Disk space constraints in secondary database nodes further hindered the autoscaling and recovery process.

Remediation Actions:

  1. Traffic Limitation: Traffic to the affected processing service was temporarily reduced to stabilize the system and prevent further strain on the database.
  2. Resource Optimization: Additional resources were allocated to the processing service, significantly increasing its capacity to handle the backlog.
  3. Targeted Data Handling Adjustments: Specific data sources contributing to the delays were temporarily disabled to alleviate pressure on the system.
  4. Collaboration with Service Provider Support: Service Provider support teams were engaged to address the database autoscaling issues and assist in clearing disk space on affected nodes.
  5. Incremental Backlog Reduction: The system processed the backlog in stages, monitoring progress continuously to ensure stability and gradual improvement (a sketch of this staged approach follows this list).
  6. Database Recovery: Secondary database nodes were re-synced and brought back online after addressing disk space limitations, restoring full functionality to the database.
  7. Extended Monitoring: Real-time monitoring was conducted throughout the incident to track system performance and ensure the success of recovery efforts.
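
For illustration only, the following minimal Python sketch shows the kind of staged, throttled backlog draining described in item 5. The batch size, pause interval, and the fetch_batch/process/database_healthy helpers are assumptions for the sketch, not details of the actual implementation.

    import time

    # Assumed tuning values, not actual production settings.
    BATCH_SIZE = 500        # records processed per stage
    PAUSE_SECONDS = 5       # back-off when the database reports strain

    def drain_backlog(fetch_batch, process, database_healthy):
        """Drain a processing backlog in stages.

        fetch_batch(n)      -> list of up to n pending records (hypothetical)
        process(record)     -> handles a single inventory record (hypothetical)
        database_healthy()  -> True while the database reports normal load (hypothetical)
        """
        while True:
            batch = fetch_batch(BATCH_SIZE)
            if not batch:
                break  # backlog cleared
            for record in batch:
                process(record)
            # Pause between stages whenever the database is under pressure,
            # giving autoscaling and replication time to catch up.
            if not database_healthy():
                time.sleep(PAUSE_SECONDS)

Processing in bounded batches with a health-gated pause is what allows the backlog to shrink without re-saturating the recovering database.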

Future Preventative Measures:

This incident was caused by an extraordinary and rare surge in data volume that exceeded the system’s ability to scale automatically. Detailed discussions have been held with the service provider to understand the contributing factors and explore measures to prevent similar issues in the future.

To ensure operational stability, the following actions are being taken:

  1. Continuous Monitoring and Alerts: Enhanced monitoring and alert mechanisms are being implemented to detect early warning signs of scaling challenges or abnormal resource usage (an illustrative check follows this list).
  2. Preventative Collaboration: Ongoing collaboration with the service provider will focus on identifying and addressing potential vulnerabilities in advance, ensuring a proactive approach to system stability.
  3. Proactive System Evaluations: Regular evaluations of the system’s capacity and performance during peak scenarios will guide preemptive adjustments to resource allocation strategies.
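
As an illustration of item 1, below is a minimal Python sketch of a disk-usage check of the kind that could feed such alerts. It uses the psutil library; the monitored paths and the 80% threshold are assumptions for the sketch rather than details of the actual monitoring stack.

    import psutil

    # Assumed values: adjust paths and threshold to the actual deployment.
    DISK_ALERT_THRESHOLD = 80.0   # percent used
    MONITORED_PATHS = ("/",)      # e.g. add database data-volume mount points

    def disk_usage_alerts():
        """Return (path, percent_used) for any monitored path above the threshold."""
        alerts = []
        for path in MONITORED_PATHS:
            usage = psutil.disk_usage(path)
            if usage.percent >= DISK_ALERT_THRESHOLD:
                alerts.append((path, usage.percent))
        return alerts

    if __name__ == "__main__":
        for path, pct in disk_usage_alerts():
            print(f"ALERT: {path} is {pct:.1f}% full")

Alerting on disk usage before secondary nodes run out of space would surface the conditions behind this incident well before replicas enter recovery mode.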

This incident resulted from a rare combination of circumstances that exceeded the platform’s capacity to scale. Even so, we remain committed to proactively improving our processes, collaborating with our service provider, and implementing measures to prevent similar issues in the future while ensuring a seamless experience for our customers.

Posted Nov 27, 2024 - 09:33 PST

Resolved
This incident has been resolved. After extended monitoring, our teams have confirmed that all data processing is up to date and has returned to normal.
Posted Nov 10, 2024 - 23:00 PST
Update
The team has identified the bottleneck areas and increased resources significantly. We are seeing positive progress following these measures and will continue to monitor to ensure consistent progress.
Posted Nov 10, 2024 - 07:05 PST
Update
The backlog processing is ongoing. We are actively monitoring the progress and assessing the allocation of additional resources to speed up the process.
Posted Nov 09, 2024 - 15:20 PST
Update
The system is steadily reducing the backlog, and data updates are progressing well. We will continue to monitor the situation closely and share updates as we make progress.
Posted Nov 08, 2024 - 15:02 PST
Update
Our technical teams have been working diligently to expedite the recovery process. An unforeseen inventory issue temporarily delayed progress, but corrective measures have been implemented, and the service is now stable. Data ingestion has resumed, and we continue to closely monitor the situation.
Posted Nov 08, 2024 - 07:19 PST
Update
The database instance has returned to a healthy state. We will continue to monitor the system and provide an updated ETA.
Posted Nov 08, 2024 - 02:49 PST
Update
The sync process is progressing smoothly. The primary sync has been completed successfully, and we’re making steady progress on the remaining nodes. Further updates will be provided once synchronization is complete for all nodes.
Posted Nov 07, 2024 - 20:57 PST
Identified
The data processing issue was due to a disk space problem affecting a database node. We have been actively working with our service provider, and corrective actions have been taken. The system is currently in the process of re-syncing. The initial sync is stable, with completion expected soon, after which additional nodes will be synchronized. We continue to monitor the situation and will provide further updates as progress is made.
Posted Nov 07, 2024 - 17:32 PST
Update
To expedite the recovery process, we have engaged our service provider and are actively collaborating with them to review and assess impacted services. We will continue to provide updates as we make progress.
Posted Nov 07, 2024 - 15:17 PST
Investigating
Incident Description: We are currently experiencing a data processing delay affecting IT Visibility (ITV) in the NA region. While the user interface and dashboards remain accessible, inventory data updates are delayed. All other ITV services continue to operate normally.

Priority: P2

Restoration Activity: Our technical teams have been engaged and are actively working to resolve the underlying cause. Mitigation measures have been implemented to reduce impact while we investigate further. We will provide updates as more information becomes available.
Posted Nov 07, 2024 - 14:25 PST
This incident affected: Flexera One - IT Visibility - North America (IT Visibility US).