Cluster issues
Incident Report for Next Tech
Postmortem

Between Wednesday 9/4 at 2230 PT and Thursday 9/5 at 1030 PT, Next Tech experienced two periods of limited sandbox availability lasting approximately 1.5 hours each. During these periods, some projects failed to launch a computing environment, and some environments and background logic (such as task processing) ran slowly.

Since identifying the bottlenecks responsible for these issues, we have rolled out a number of improvements to our sandbox cluster, some already planned and some newly identified. We have:

  • Applied a configuration change resulting in a 5x increase in cluster performance
  • Adjusted a number of job processing queues to ensure that they perform well under heavy load
  • Made several application-level logic changes to handle heavy load better
  • Improved our error alerting so that users are directed to our status page in the event of an issue

These changes also drastically improved the performance of the computing environments. For example, as of today the time to launch a new sandbox has decreased by 40% on average (roughly 1.7x faster). Similar gains can be expected when compiling code, installing software, and performing other CPU-bound or disk I/O-bound tasks.
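
To make the launch-time figure concrete: cutting a task's duration by some percentage corresponds to a speedup factor of 1 / (1 - reduction). The short Python sketch below works this out. The function name is purely illustrative and is not part of our platform or API.

```python
def speedup_from_time_reduction(reduction_pct: float) -> float:
    """Return the speedup factor implied by cutting a task's duration
    by reduction_pct percent (illustrative helper, not a platform API)."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in the range [0, 100)")
    # If the new duration is (1 - r) of the old one, the task runs 1 / (1 - r) times as fast.
    return 1.0 / (1.0 - reduction_pct / 100.0)


# A 40% shorter launch time is about 1.67x faster.
print(round(speedup_from_time_reduction(40), 2))  # 1.67
# A full 2x speedup would require halving the launch time.
print(round(speedup_from_time_reduction(50), 2))  # 2.0
```

In other words, a 40% reduction in launch time means sandboxes start in 60% of the time they used to take.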

We sincerely apologize for any inconvenience this may have caused you or your users, and we appreciate your support as we continue to improve our platform.

Posted Sep 05, 2019 - 20:55 PDT

Resolved
This issue has been resolved. A postmortem report will be added by end of day Thursday. Apologies for any inconvenience this may have caused!
Posted Sep 04, 2019 - 23:53 PDT
Monitoring
We are back up and monitoring the cluster!
Posted Sep 04, 2019 - 23:26 PDT
Update
The sandbox cluster is currently rebuilding. We'll be back online soon!
Posted Sep 04, 2019 - 23:04 PDT
Identified
We've run into some issues with our sandbox cluster, which requires us to start a number of hosts. We apologize for the inconvenience and will be back online soon!
Posted Sep 04, 2019 - 22:28 PDT
This incident affected: Website, Project API, Zapier API, and Sandbox Cluster.