Degraded

sieve/youtube-downloader is suffering long queue times, and several other functions are slow to process jobs.

Mar 13 at 05:34am PDT
Affected services
Job Processing

Resolved
Mar 13 at 07:10am PDT

We've root-caused and fixed the issue; the full RCA is below:


Root Cause Analysis (RCA) - March 13, 2025

Incident Summary:

On March 13, 2025, between 2:30 AM and 5:30 AM PDT, a large push of jobs by our internal team to the sieve/youtube-downloader function caused a queue buildup, stalling customer requests to this function. This overwhelmed the servers responsible for video file CRUD operations, affecting multiple functions that generate and output video files. The issue was mitigated by 7:00 AM PDT after manual intervention.

Timeline:
2:30 AM PDT - Internal team initiated a large push of jobs to sieve/youtube-downloader.
3:00 AM PDT - Queue buildup began, impacting file handling servers.
4:00 AM PDT - Degradation in video file processing observed.
5:30 AM PDT - Alerts triggered; issue identified as an internal push causing excessive load.
6:00 AM PDT - The large internal push was removed from the queue to allow customer jobs to process.
7:00 AM PDT - Additional manual scaling applied to video file storage services; system returned to normal operation.

Root Cause:
A sudden and unexpected influx of jobs from an internal team overloaded the queue, leading to excessive demand on video file CRUD operations.
The scaling mechanism for video file storage servers was not aggressive enough to handle the spike in demand.
Alerts were delayed as the load originated from an internal push rather than customer requests.

Resolution & Mitigation:
Queue Separation:
Segregating internal and customer job queues so internal pushes cannot affect customer processing (a sketch of the idea follows this list). This has already been implemented.
Improved Scaling:
Implementing more aggressive auto-scaling policies for video file storage servers to handle sudden spikes (an illustrative sizing sketch follows this list). This will be implemented by March 25th.
Enhanced Monitoring & Alerting:
Refining alerting mechanisms to detect large internal job influxes earlier. This is done.
Setting up dedicated monitoring for queue buildup to ensure proactive mitigation (a sketch follows this list). This is done.
Internal Process Changes:
Implementing guidelines for internal teams to coordinate with infrastructure teams before large job pushes. This is done.
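
As a rough illustration of the queue-separation change (a simplified sketch, not our production scheduler), customer and internal jobs now land in separate queues, and workers always drain customer work first. The names below (Job, route_job, next_job) are illustrative only.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum


class JobSource(Enum):
    CUSTOMER = "customer"
    INTERNAL = "internal"


@dataclass
class Job:
    job_id: str
    function: str
    source: JobSource


# Separate queues: a large internal backfill can no longer sit in front
# of customer-submitted work for the same function.
customer_queue: deque[Job] = deque()
internal_queue: deque[Job] = deque()


def route_job(job: Job) -> None:
    """Route a job to a queue based on who submitted it."""
    if job.source is JobSource.CUSTOMER:
        customer_queue.append(job)
    else:
        internal_queue.append(job)


def next_job() -> Job | None:
    """Workers drain customer jobs first; internal jobs only run
    when no customer work is waiting."""
    if customer_queue:
        return customer_queue.popleft()
    if internal_queue:
        return internal_queue.popleft()
    return None


if __name__ == "__main__":
    route_job(Job("c-1", "sieve/youtube-downloader", JobSource.CUSTOMER))
    route_job(Job("i-1", "sieve/youtube-downloader", JobSource.INTERNAL))
    print(next_job().job_id)  # c-1: customer work is served first
```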
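
The planned scaling improvement can be sketched, in simplified form, as sizing worker capacity from the current backlog so a spike is drained within a target window. The helper and numbers below are illustrative, not our real configuration.

```python
import math


def desired_replicas(
    queue_depth: int,
    jobs_per_replica_per_min: float,
    target_drain_minutes: float = 5.0,
    min_replicas: int = 2,
    max_replicas: int = 100,
) -> int:
    """Pick a replica count that can drain the current backlog within
    target_drain_minutes. All parameters here are placeholders."""
    capacity_needed = queue_depth / (jobs_per_replica_per_min * target_drain_minutes)
    return max(min_replicas, min(max_replicas, math.ceil(capacity_needed)))


# Example: a 3,000-job backlog with replicas that each clear ~20 jobs/min
print(desired_replicas(queue_depth=3000, jobs_per_replica_per_min=20))  # -> 30
```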
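
The queue-buildup monitoring boils down to alerting once a function's backlog stays above a threshold for several consecutive checks. The thresholds below are placeholders; real values depend on each function's throughput.

```python
from collections import defaultdict

# Placeholder thresholds, for illustration only.
QUEUE_DEPTH_THRESHOLD = 500  # jobs waiting
CONSECUTIVE_BREACHES = 3     # checks in a row before paging

_breach_counts: defaultdict[str, int] = defaultdict(int)


def should_alert(function_name: str, queue_depth: int) -> bool:
    """Return True (page on-call) once the backlog for a function has
    stayed above the threshold for CONSECUTIVE_BREACHES checks in a row."""
    if queue_depth > QUEUE_DEPTH_THRESHOLD:
        _breach_counts[function_name] += 1
    else:
        _breach_counts[function_name] = 0
    return _breach_counts[function_name] >= CONSECUTIVE_BREACHES
```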

Conclusion:
This incident on March 13, 2025, was caused by an internal job push overwhelming video file handling services. While the issue was resolved through manual intervention, long-term mitigations including queue separation, improved scaling, and better monitoring are being implemented to prevent recurrence.

Created
Mar 13 at 05:34am PDT

We're looking into why this is happening and will keep you posted. We believe this started at around 2:30-3:00 am PDT.