Infrastructure Issues

Incident Report for Marfeel

Postmortem

Between 11:43 AM and 12:14 PM CET, the platform experienced a full service disruption affecting several core services.

The incident was caused by an issue with the distribution of a renewed internal certificate authority used by our clusters in the load balancer layer. This CA was part of the trust chain used to validate encrypted end-to-end communications between internal components. When that validation failed, internal services could no longer establish trusted connections.

‌

This CA was shared across the communication path between our production and backup infrastructure. As a result, traffic redirected to the backup infrastructure was also affected by the same issue.

Our engineering team identified the root cause and applied mitigation measures to restore the platform to normal operation. The service is now fully restored and we continue to monitor system stability.

‌

After reviewing the incident, we have confirmed that some publishing and analytics data affected during the disruption cannot be recovered.

‌

We apologize for the impact this incident may have caused. We are already working on additional safeguards, including stronger end-to-end validation of certificate renewal, distribution, and activation across our load balancer layer; improved alerting for certificate deployment inconsistencies; and further separation between production and backup infrastructure to reduce shared failure domains.

Posted May 26, 2026 - 12:44 CEST

Resolved

The mitigation measures have been successfully applied, and the platform is now operating normally. We apologize for the inconvenience

Posted May 26, 2026 - 12:31 CEST

Monitoring

The root cause has been identified as an expired SSL certificate affecting communication between internal services.

Mitigation measures have been applied, and services are progressively returning to normal operation while we continue to closely monitor the situation.

Posted May 26, 2026 - 12:18 CEST

Update

Posted May 26, 2026 - 12:17 CEST

Identified

The issue has been identified, and our team is actively working on mitigation measures. The platform is now operating normally again in most cases, although some intermittent issues may still occur.

We continue to closely monitor the situation to ensure full stability.

Posted May 26, 2026 - 12:11 CEST

Investigating

We are currently experiencing infrastructure issues that may cause intermittent errors in the web interface and publishing flows.

We apologize for the inconvenience. Our team is actively investigating the issue.

Posted May 26, 2026 - 11:48 CEST

This incident affected: Data Ingestion and Realtime.