GitHub marked the start of the week with more than four hours of downtime, as GitHub Issues, Actions, Pages, Packages and API requests all reported “degraded performance.”
A problem on the world’s most popular code repository and developer collaboration site was first reported around 05:00 UK time (04:00 UTC) this morning and was resolved at 09:30 UK time (08:30 UTC). Basic Git operations were not affected.
GitHub is, on the whole, a relatively reliable site, but the impact of any downtime is considerable because of its wide use and critical importance. The site has over 44 million users and over 100 million repositories (nearly 34 million of which are public).
The last major outage before today was on 29 June, and before that on 19 June and on 22 and 23 May. For such a key service, that isn’t a great recent track record. “You are a dependency to our systems and if this keeps happening, many will say goodbye,” said developer Emad Mokhtar on Twitter.
GitHub has reported on what went wrong in May and June, and database issues turn out to be the most common problem. On May 5, “a shared database table’s auto-incrementing ID column exceeded the size that can be represented by the MySQL Integer type,” said GitHub’s SVP of engineering, Keith Ballinger.
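To put the May 5 failure in concrete terms, the sketch below shows how little room a MySQL integer column offers an auto-incrementing ID. The limits are the documented maxima for MySQL's INT and BIGINT types; GitHub did not say whether the affected column was signed or unsigned, so both are listed. The function names and the 20% warning threshold are hypothetical, chosen only for illustration, and this is not a description of GitHub's own tooling.

```python
# Documented upper bounds for MySQL integer column types.
MYSQL_INT_MAX = {
    "INT": 2**31 - 1,            # 2,147,483,647
    "INT UNSIGNED": 2**32 - 1,   # 4,294,967,295
    "BIGINT": 2**63 - 1,
    "BIGINT UNSIGNED": 2**64 - 1,
}


def headroom(current_max_id: int, column_type: str) -> float:
    """Return the fraction of the ID range still unused (0.0 means exhausted)."""
    limit = MYSQL_INT_MAX[column_type]
    return max(0.0, 1 - current_max_id / limit)


def needs_migration(current_max_id: int, column_type: str,
                    threshold: float = 0.2) -> bool:
    """Flag a column once less than `threshold` of its range remains.

    The threshold is a hypothetical policy, not a MySQL setting.
    """
    return headroom(current_max_id, column_type) < threshold
```

A table whose highest ID has reached two billion would be flagged as an INT column but not as a BIGINT one, which is why widening the column type is the usual remedy for this kind of exhaustion.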
May 22 was another bad day for the company’s MySQL servers. A primary MySQL instance was failed over for planned maintenance, but the newly promoted instance crashed after six seconds. “We manually redirected traffic back to the original primary,” said Ballinger. Recovering the six seconds of writes to the crashed instance, though, caused delays. “A restore of replicas from the new primary was initiated which took approximately four hours with a further hour for cluster reconfiguration to re-enable full read capacity,” he added.