What do you think were the dynamics of the engineering team working on this?
I'd think this isn't too crazy to stress test. If you have 300 million users signed up then your stress test should be 300 million simultaneous streams in HD for 4 hours. I just don't see how Netflix screws this up.
Maybe it was a management incompetence thing? Manager says something like "We only need to support 20 million simultaneous streams" and engineers implement to that spec even if the 20 million number is wildly incorrect.
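Back-of-envelope, the aggregate egress that stress test implies is enormous. A quick sketch, using an assumed 5 Mbps per HD stream (not Netflix's actual bitrate ladder):

```python
# Aggregate egress for simultaneous HD streams.
# Assumed numbers: 300M concurrent streams at 5 Mbps each (a rough 1080p figure).
streams = 300_000_000
bitrate_mbps = 5
total_tbps = streams * bitrate_mbps / 1_000_000  # Mbps -> Tbps
print(f"~{total_tbps:.0f} Tbps aggregate egress")
```

That works out to roughly 1,500 Tbps, i.e. 1.5 Pbps, which is why nobody serves this from origin and it all has to come off CDN edge caches.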
There's no way 300 million people watched this, especially if that number represents every Netflix subscriber. The largest claimed live broadcast across all platforms is last year's Super Bowl, with 202 million unique viewers for at least part of it, but that includes CBS, Nickelodeon, and Univision, not just streaming audiences. Its average viewership for the whole game was 123 million, second all-time only to the Apollo 11 moon landing.
FIFA claimed the 2022 World Cup final reached 1.5 billion people worldwide, but again that seems like it was mostly via broadcast television and cable.
As for a single stream, Disney's Hotstar claimed 59 million concurrent viewers for last year's Cricket World Cup, and on the YouTube platform, the Chandrayaan-3 lunar landing hit 8 million.
100 million is a lot of streams, let alone 300. But also note that not every stream reaches a single individual.
And, as far as the 59 million concurrent streams in India, the bitrate was probably very low (I'd wager no more than 720p on average, possibly even 480p in many cases). It's again a very different problem across the board due to regional differences (such as spread of devices, quality of network, even behavioral differences).
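To put some numbers on why bitrate matters so much: here's the aggregate bandwidth for those 59 million concurrent streams at a few quality levels (the per-rung bitrates are my assumptions, not Hotstar's published figures):

```python
# Same concurrency, very different delivery problem depending on bitrate.
# Bitrates per quality rung are assumed, not Hotstar's actual ladder.
streams = 59_000_000
ladder_mbps = {"480p": 1.0, "720p": 2.5, "1080p": 5.0}
tbps = {q: streams * mbps / 1_000_000 for q, mbps in ladder_mbps.items()}
for q, t in tbps.items():
    print(f"{q}: {t:.1f} Tbps aggregate")
```

Serving mostly 480p is a ~5x smaller pipe than serving mostly 1080p at the same viewer count.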
I mean, yes, but nobody streams RAW video in practice, and I can't imagine any users or service providers who'd be happy with that level of inefficiency. In general, it's safe to assume some reasonable compression (which, yes, is likely lossy).
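For a sense of how far from raw video any real service operates, here's a rough sketch for 1080p60 at 8-bit 4:2:0 (1.5 bytes/pixel) against an assumed 5 Mbps delivery bitrate:

```python
# Raw vs. compressed video, rough numbers.
# 4:2:0 chroma subsampling at 8 bits averages 1.5 bytes per pixel.
width, height, fps = 1920, 1080, 60
raw_mbps = width * height * 1.5 * 8 * fps / 1_000_000
compressed_mbps = 5  # assumed typical H.264/HEVC delivery bitrate
ratio = raw_mbps / compressed_mbps
print(f"raw ~{raw_mbps:.0f} Mbps, compression ratio ~{ratio:.0f}:1")
```

So "reasonable compression" here means on the order of 300:1, which is exactly why lossy codecs are a given.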
Not through a single system, though - that's the advantage of diversity rather than winner-takes-all.
The world cup final itself (and other major events) is distributed from the host broadcaster to either on site at the IBC or at major exchange points.
When I've done major events of that magnitude there's usually a backup scanner and even a tertiary backup. Obviously feeds get sent via all manner of routes - the international feed for example may be handed off at an exchange point, but the reserve is likely available on satellite for people to downlink. If the scanner goes (fire etc), then at least some camera/sound feeds can be switched direct to these points; on some occasions there's a full backup scanner too.
Short of events that take out the venue itself, I can't think of a plausible scenario which would cause the generation or distribution of the broadcast to break on a global basis.
I don't work for OBS/HBS/etc but I can't imagine they are any worse than other broadcast professionals.
The IT part of this stuff is pretty trivial nowadays, even the complex parts like the 2110 networks in the scanner tend to be commoditised and treated as you'd treat any other single system.
The most technically challenging part is unicast streaming to millions of people at low latency (DASH etc). I wouldn't expect an enormous architectural difference between a system that can broadcast to 10 million or 100 million though.
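To illustrate why 10 million vs 100 million isn't an enormous architectural difference: at that scale it's mostly a question of edge fan-out, i.e. how many cache boxes you deploy. A sketch with assumed capacities (5 Mbps per stream, 100 Gbps usable egress per edge appliance - real figures vary):

```python
# Edge fan-out: more viewers mostly means more identical edge boxes,
# not a different architecture. All capacities here are assumptions.
def edge_boxes(streams, stream_mbps=5, box_gbps=100):
    """Ceiling of total egress (Mbps) over per-box capacity (Mbps)."""
    return -(-(streams * stream_mbps) // (box_gbps * 1000))

print(edge_boxes(10_000_000), edge_boxes(100_000_000))
```

Going from 10M to 100M concurrent viewers scales the box count linearly; the control plane, packaging, and origin tiers look much the same either way.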