PubNub posted the following postmortem for this incident here. We’ve been extremely satisfied with PubNub thus far as a vendor for the real-time communication within our product and value their transparency and quick response time.
“Problem Description, Impact, and Resolution
In our US-EAST-1 Point of Presence we received an unusually high number of connections, which caused an upstream backlog within our internal network. This manifested as delays in delivering subscription requests to some clients. Through our normal catch-up mechanism, subscription deliveries continued to function throughout the incident, albeit with delays. Operators manually deployed mechanisms that alleviated back pressure on connection creation, bringing the incident to full resolution.
Mitigation Steps and Recommended Future Preventative Measures
Going forward, there are several areas to improve. First, we identified a gap in our internal monitoring that hid some of the connection and channel creation activity from our systems. This caused us to take longer than expected to reach a resolution. This gap will be resolved in the coming days. Additionally, we have automatic throttles that closely guard connection creation, and we are continually working to improve them. In this scenario we believe the specific pattern of the problem kept the influx of connections underneath our rate limiting. We are analyzing the exact pattern so that in the future our connection rate limiting will take more sophisticated usage patterns into account.”
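PubNub does not describe how its connection throttles are implemented, but the failure mode they describe (an influx that stays "underneath" the rate limiting) is easy to illustrate. As a hypothetical sketch, the fixed-window limiter below (class name, limits, and traffic pattern are all invented for illustration) admits every connection attempt in a sustained stream that stays just under the per-window threshold, even though the aggregate load over a minute could still overwhelm downstream systems:

```python
class FixedWindowLimiter:
    """Hypothetical fixed-window connection-rate limiter.

    Allows at most `limit` new connections per `window` seconds;
    attempts beyond that within the window are rejected.
    """

    def __init__(self, limit, window=1.0):
        self.limit = limit
        self.window = window
        self.window_start = 0.0
        self.count = 0

    def allow(self, now):
        # Reset the counter when a new window begins.
        if now - self.window_start >= self.window:
            self.window_start = now
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


# Simulate a sustained influx that stays just under the limit:
# 99 connection attempts per second against a 100/sec limit, for 60s.
limiter = FixedWindowLimiter(limit=100, window=1.0)
admitted = 0
for second in range(60):
    for i in range(99):
        if limiter.allow(now=second + i / 99):
            admitted += 1

print(admitted)  # 5940 -- every attempt admitted; the limiter never fires
```

A per-second threshold like this never triggers on such a pattern, which is consistent with the postmortem's conclusion that the rate limiting needs to account for more sophisticated usage patterns (for example, sustained aggregate load across windows rather than instantaneous rate alone).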