Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
OpenAI is blaming one of the longest outages in its history on a “new telemetry service” that went wrong.
On Wednesday, OpenAI’s AI-powered chatbot platform, ChatGPT; its video generator, Sora; and its developer-facing API experienced major outages beginning at 3:00 p.m. Pacific. OpenAI soon acknowledged the problem and began working on a fix. But it took the company about three hours to restore all services.
In a postmortem published on Thursday, OpenAI wrote that the outage was not caused by a security incident or a recent product release, but by a telemetry service it deployed on Wednesday to collect Kubernetes metrics. Kubernetes is open source software that helps manage containers, which are packages of applications and associated files used to run software in isolated environments.
“Telemetry services have a very large scope, so the configuration of this new service inadvertently led to … resource-intensive Kubernetes API operations,” OpenAI wrote in the postmortem. “(Our) Kubernetes API servers became overloaded, bringing down the Kubernetes control plane in most of our large (Kubernetes) clusters.”
That’s a lot of jargon, but basically, the new telemetry service affected OpenAI’s Kubernetes operations, including a resource that many of the company’s services rely on for DNS resolution. DNS resolution translates domain names into IP addresses; it’s why you can type “Google.com” instead of “142.250.191.78”.
OpenAI’s use of DNS caching, which stores information about previously looked-up domain names (such as website addresses) and their corresponding IP addresses, complicated matters by delaying visibility into the failure, OpenAI wrote, which allowed the rollout of the telemetry service to proceed before the full extent of the problem was understood.
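To see why a cache can hide a failure for a while, consider the minimal Python sketch below. It is only an illustration of the general idea, not OpenAI's setup: the `CachingResolver` class, the `FakeClock` helper, the 30-second lifetime, the host name `service.internal.example` and the address `192.0.2.10` (taken from a range reserved for documentation) are all hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple


class CachingResolver:
    """Toy DNS-style cache: an answer is reused until its lifetime (TTL) expires."""

    def __init__(self, lookup: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup          # upstream lookup, e.g. a real DNS query
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}  # name -> (address, expiry)

    def resolve(self, name: str) -> str:
        now = self._clock()
        entry = self._cache.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]            # cache hit: upstream is never consulted
        address = self._lookup(name)   # raises if upstream is unavailable
        self._cache[name] = (address, now + self._ttl)
        return address


class FakeClock:
    """Manually advanced clock so the demo does not need to sleep."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def demo() -> List[str]:
    clock = FakeClock()
    state = {"healthy": True}

    def upstream(name: str) -> str:
        if not state["healthy"]:
            raise RuntimeError("upstream DNS unavailable")
        return "192.0.2.10"            # hypothetical documentation-range address

    resolver = CachingResolver(upstream, ttl_seconds=30, clock=clock)
    events = [resolver.resolve("service.internal.example")]  # fills the cache

    state["healthy"] = False           # upstream breaks here
    clock.advance(10)
    events.append(resolver.resolve("service.internal.example"))  # still cached

    clock.advance(25)                  # 35s elapsed, cached entry has expired
    try:
        events.append(resolver.resolve("service.internal.example"))
    except RuntimeError:
        events.append("FAILED")
    return events


if __name__ == "__main__":
    print(demo())
```

In this sketch the upstream lookup breaks right after the first request, yet the second request still succeeds from the cache. The failure only surfaces once the cached entry expires, which is the kind of delayed visibility the postmortem describes.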
OpenAI says it was able to detect the issue “minutes” before customers were ultimately impacted, but couldn’t implement a fix quickly because it had to work around overloaded Kubernetes servers.
“It was a combination of several systems and processes failing simultaneously and interacting in unexpected ways,” the company said. “Our tests did not capture the impact of the change on the Kubernetes control plane (and) the fix was very slow due to the locked-out effect.”
OpenAI says it will take a number of steps to prevent similar incidents in the future, including better monitoring for infrastructure changes, improved phased rollouts, and new mechanisms to ensure OpenAI engineers can access the company’s Kubernetes API servers under any circumstances.
“We apologize for the impact this incident has had on all of our customers, from ChatGPT users to companies that rely on OpenAI products,” OpenAI wrote. “We fell short of our expectations.”