[14 January 7h00 EST] There is currently an issue with the console and with the service at one location (Roubaix, Europe). We are investigating.
[7h30 EST] There was an issue at one of our radio locations (Roubaix). All the secondary IPs on the server stopped responding. This caused a large number of errors on our console, which degraded the radio player and console access. We restarted the network interface in Roubaix, which resolved the issue with the secondary IPs. Services are now returning to normal. We are still monitoring the situation. More information to come.
[8h03 EST] We are adding monitoring on every secondary IP to have full visibility on this. All radios should be back online. If your radio is still offline, contact email@example.com
Incident details: Last night at around 1h40 EST, our servers in the Roubaix location were applying a critical security update. One of the updates caused the network interface to restart. Upon restart, the secondary network interface failed to load its configuration. Because of this, some IPs were not accessible until we restarted the interface at around 7h15 EST. Multiple radios became unavailable as a result. All the failed calls in turn put a higher load than usual on our console, which explains why the console was inaccessible (or very slow) for our customers.
There are multiple steps we will take to address this issue. First, we added monitoring on the secondary IPs of all our servers so that we have visibility if such a problem arises again. We will also improve the way we receive these critical notifications so that we are alerted day and night. We are reviewing and improving our internal procedures as well, so that everyone on our team knows how to react and whom to contact in such an event. Finally, we will add more resources to our console so that it can better handle the load in this type of situation.
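As an illustration only, a reachability check on secondary IPs can be sketched as follows. This is a minimal Python example, not the monitoring actually deployed: the function names `is_reachable` and `find_unreachable`, the default port and the timeout are all assumptions made for the sketch, and it treats an IP as reachable if a TCP connection to the given port succeeds.

```python
import socket
from typing import Callable, Iterable, List


def is_reachable(ip: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (ip, port) succeeds within timeout.

    Hypothetical helper: the port and timeout defaults are assumptions.
    """
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def find_unreachable(
    ips: Iterable[str],
    port: int = 80,
    timeout: float = 3.0,
    probe: Callable[[str, int, float], bool] = is_reachable,
) -> List[str]:
    """Return the IPs from `ips` for which the probe fails, in input order."""
    return [ip for ip in ips if not probe(ip, port, timeout)]
```

A scheduler could call `find_unreachable` with the list of secondary IPs every minute and send an alert whenever the returned list is not empty. The `probe` parameter exists so the check can be swapped (for example for an ICMP ping) or stubbed in tests.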
We apologize again for the issue; we take your radio uptime very seriously. If you have any questions, do not hesitate to contact us at firstname.lastname@example.org
[16 January 13h10 EST] Update: The issue happened again last night. While we responded quickly to avoid extended downtime, we are still investigating the root cause. I will post an update on this shortly.
[16h00 EST] After a complete review, it appears that automatic updates were causing the issue. On both days (14 and 16 January) there was a systemd update. We have disabled all automatic updates for now and will perform updates manually. We have also updated the firmware of the network interface on the affected servers.