Knack Software Service Disruption Post-Mortem and Improvements
We want to extend our gratitude for your patience during last week’s software outage. We understand the inconvenience and frustration it caused, and we acknowledge that our communication during the incident fell short of expectations. In this post-mortem, we aim to provide more clarity on the cause of the outage and the steps we are taking to prevent similar issues in the future.
What happened during the disruption?
Last week, we performed a major version upgrade on our database systems. Our cloud provider assured us that downtime would be minimal, and our testing environments validated this. However, for our European and US production regions, the downtime was longer than anticipated. The majority of customers experienced 30 minutes or less of downtime, but some faced up to 4 hours.
Our lack of proactive communication on the status page compounded the problem, leaving many of you in the dark.
How are we addressing this?
In light of these events, our leadership team conducted a comprehensive post-mortem to analyze both the technical and communication aspects of the outage. As a historically engineering-centric company, we recognize the need to improve our communication processes as we continue to grow.
On the technical front, our engineering team is revamping our testing environments to more accurately mirror production systems. This will allow us to simulate peak production traffic and better understand the impact of changes on our infrastructure.
To improve communication, we are re-evaluating and redefining our internal processes for maintenance and outages, with a focus on keeping customers informed in real time. We are working towards providing more granular status updates for all our services. In the meantime, we have trained more Knack staff members to manage our status page (https://status.knack.com/), and we are committed to greater transparency about the status of the application.
We sincerely apologize for any inconvenience the prolonged disruption caused and appreciate your continued trust in Knack. We are taking the necessary steps to prevent similar incidents in the future and enhance our communication efforts going forward.
Thank you for reading. Please reach out to me if you would like to discuss this further, or email email@example.com for information or updates on existing tickets.