User Details
- User Since: Jan 5 2016, 9:54 PM
- Availability: Available
- LDAP User: Unknown
- MediaWiki User: LToscano (WMF)
Fri, Nov 22
The host is fully in service now and I had a chat with Matthew to put it in production, resolving!
Thu, Nov 21
+1 from my side!
Merged all the patches, and we finally have http://docker-registry.wikimedia.org/wikimedia/mediawiki-services-kartotherian:2024-11-21-145831-production \o/
@Jclark-ctr all configured, the host has been reimaged and all the disks show up.
Wed, Nov 20
Quick note about the reimage step - due to a bug in Supermicro's BMC firmware (at least, this is what we suspect) the first reimage run will likely trigger two consecutive Debian installs, causing the reimage to fail or stall. A subsequent reimage should be enough to fix it and get the node into its final state.
@Jclark-ctr the host is provisioned, the next step is number 2 in T370453#10326159, lemme know if you want me to do it or not!
Mon, Nov 18
@Jclark-ctr I updated the firmware to the correct one, but I'd need the BMC label password in pvt when you are in the DC (it is needed for the factory reset that needs to happen post-firmware upgrade, sigh). Thanks for the patience!
Ok, I found the issue: I asked Jenn to turn off IPv6 for the BMC network last week to test whether that was the problem, but that was before upgrading the firmware. A BMC reset preserves the network settings, so the old test setting caused the last hiccup while running provision.
My bad, I misremembered that we had already got the firmware for config J from Supermicro (somehow I thought it was for the ganeti nodes, too many firmware images floating around :D) and I uploaded it to thanos-be2005, followed by a factory reset. The issue is the same one that happened on backup1012: T371416#10216617
Fri, Nov 15
I was able to upload the firmware via the Web UI, but the issue still seems to be present (new version: 01.04.08). I need to investigate further what the problem is, and/or ping Supermicro to give us the same firmware that they deployed to the ms-be nodes.
@Jclark-ctr Hello! For this host, we have to follow a new workflow:
@Papaul @Jhancock.wm we'd need to upgrade the firmware on this node, I think that we could use this one directly instead of the custom one. I tried to connect to the BMC web UI in various ways but failed, since the BMC network config is exactly what fails while provisioning. I also tried to do it by hand via DEL/Setup at boot, but for some reason I cannot modify any value (or my client prevents me from doing it remotely, not sure why).
While provisioning I see the following error for the BMC NIC config:
Thu, Nov 14
Tried to manually set the continuous flag on sretest2001 and rebooted, but I didn't see the boot options change like on ms-be2088. So at this point it may not be relevant, but I can't explain the above differences. Maybe we just need to reimage all of them another time and they will converge on the same config?
@jhathaway something interesting that I found on Redfish related to BIOS boot options:
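As an aside, the relevant bits live under the `Boot` object of the Redfish ComputerSystem resource. A minimal sketch of pulling out the boot-override settings (the `sample_system` payload below is made up for illustration, not taken from these hosts):

```python
# Hypothetical Redfish ComputerSystem payload; the Boot.* property names are
# from the standard Redfish schema, the values here are illustrative only.
sample_system = {
    "Boot": {
        "BootSourceOverrideEnabled": "Continuous",  # vs "Once" / "Disabled"
        "BootSourceOverrideTarget": "Pxe",
        "BootSourceOverrideMode": "UEFI",
    }
}

def boot_override(system: dict) -> tuple[str, str]:
    """Return (override-enabled, override-target) from a ComputerSystem dict."""
    boot = system.get("Boot", {})
    return (
        boot.get("BootSourceOverrideEnabled", "Disabled"),
        boot.get("BootSourceOverrideTarget", "None"),
    )

enabled, target = boot_override(sample_system)
print(enabled, target)  # Continuous Pxe
```

Comparing these values across hosts (e.g. sretest2001 vs ms-be2088) is what would surface whether the continuous flag actually stuck.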
Wed, Nov 13
@jhathaway another episode of the saga, ms-be2088 :D
Tue, Nov 12
I think I found the issue, this is what I see from a new pod (available only for the duration of the deployment, then helmfile/helm rolls it back):
All action items done, now the next step is to wait for the k8s service to be deployed on Wikikube :)
Mon, Nov 11
The new kartotherian.discovery.wmnet:6543 endpoint is available.
Sun, Nov 10
I ran OPTIMIZE TABLE on archive (11M records, which seemed safe enough) after stopping the slave, and it seems to have recovered. I wasn't confident enough to declare it "ready for prod", so we decided not to repool, leaving the decision to Data Persistence :)
Fri, Nov 8
I also tried not configuring any special JBOD setup for ms-be2087 after provision and kicked off a reimage to see if the double d-i issue appeared (to rule out special SAS controller features/settings), but no luck: still a double d-i on the first try.
So far I provisioned up to ms-be2087, and ms-be2088 was left untouched. The ADMIN/root password should already be set to the one on pwstore, so if you want to go ahead and test with 2088 please do it :)
Another test, leading to weird results. I tried to do the following:
This is the boot order right after provisioning:
Very interesting - I watched the sol1 console of ms-be2086 during provisioning, and right after the second round of reboots (for BIOS updates) I noticed an attempt to PXE boot over HTTP, which failed and ended up in:
I tried with ms-be2085, doing the following:
@jhathaway thanks a ton for the tests, it was exactly what I had in mind to do today :)
Thu, Nov 7
All the maps nodes are now serving traffic from port 6543 too. The next step is to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1087422 to create the new lvs endpoint kartotherian.discovery.wmnet:6543, and after that we'll be able to switch maps.wikimedia.org to it.
Wed, Nov 6
To keep archives happy, I met with @Jgiannelos and we decided to proceed in this way:
I had a chat with @Jgiannelos about the plan and it seems good, we just need to verify if the old plaintext lvs:ip combination is still in use. It will be really easy to check once the Kartotherian Docker image is deployed on k8s, since we'll see the egress rules (the theory is that Kartotherian may need to contact itself via plaintext).
@MatthewVernon I tried to provision/reimage ms-be2083 with UEFI but we have the same /dev/disk/by-path duplication issue, I think it is something intrinsic to how the SAS controller is supported by udev/Linux. We can either wait for the new controller to be deployed or adjust the Puppet fact code to take the new /dev/disk/by-path format into account.
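The fact-code adjustment could amount to deduplicating by-path entries that resolve to the same block device. A rough sketch under that assumption (the symlink names and devices below are invented examples, not the real ms-be2083 layout):

```python
# Hypothetical /dev/disk/by-path view: some SAS controllers expose two
# symlinks per physical disk (e.g. a phy-based and an expander-based path),
# so we dedupe by the underlying device each symlink resolves to.
sample_links = {
    "pci-0000:01:00.0-sas-phy0-lun-0": "sda",
    "pci-0000:01:00.0-sas-exp0x5000000000000001-phy0-lun-0": "sda",  # dup
    "pci-0000:01:00.0-sas-phy1-lun-0": "sdb",
}

def unique_devices(links: dict[str, str]) -> list[str]:
    """Collapse duplicate by-path symlinks down to the set of real devices."""
    return sorted(set(links.values()))

print(unique_devices(sample_links))  # ['sda', 'sdb']
```

On a live host the mapping would come from resolving the symlinks under /dev/disk/by-path rather than a hardcoded dict.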
I found two issues while reimaging ms-be2083 (supermicro):
Mon, Nov 4
We do have support for UEFI in the provision cookbook and in reimage (after https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1077497 is merged), but there are a couple of things that we are still working on, see the subtasks of T373519. Nothing is a hard blocker, but we are going outside the perimeter of what is battle-tested in production; please be advised that there may be further issues to debug, and the ms-be hosts would be the first production hosts to use UEFI. If everybody is onboard with this, we can go ahead :)
I created T378944 to kick off the discussion about how to best move Kartotherian to k8s when ready.
Fixed 1044. For some reason IPv6 support was disabled, so our settings like IPv6AutoConfigEnabled: False led to an HTTP 400. I connected to the Web UI, turned IPv6 on and re-ran provision, all good. I've also set the host's status to Active in Netbox.
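In other words, the PATCH body is only valid when the BMC actually has IPv6 support enabled. A tiny sketch of the idea (the payload shape here is illustrative; the real property path depends on the Redfish schema version the BMC implements):

```python
# Hypothetical sketch: build a Redfish EthernetInterface settings body,
# including IPv6 attributes only when the BMC has IPv6 support turned on.
# Sending IPv6AutoConfigEnabled to a BMC with IPv6 disabled is what
# produced the HTTP 400 above.
def build_patch(ipv6_supported: bool) -> dict:
    body = {"DHCPv4": {"DHCPEnabled": False}}
    if ipv6_supported:
        body["IPv6AutoConfigEnabled"] = False
    return body

print(build_patch(False))  # {'DHCPv4': {'DHCPEnabled': False}}
```

The fix applied here was the other direction: turn IPv6 back on via the Web UI so the existing settings are accepted as-is.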
Fri, Nov 1
@VRiley-WMF @Jclark-ctr I fixed the issue with ganeti1044 and ran provision, all good! The rest of the nodes should be fine as well :)
Supermicro sent a new license for 1044 that worked, and I've successfully run the provision cookbook.
Thu, Oct 31
Found the issue with 1044 (see T376121#10280005), I'll post an update as soon as Supermicro replies with the correct license.
The MAC address that Supermicro provided on the server's label is not correct: the last digit on the label is 8, while the MAC address returned by Redfish ends in B, and the MAC address is used to generate the license (I am 99% sure of it). Followed up with Supermicro via email.
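The check itself is a case-insensitive comparison of the label MAC against the one Redfish reports for the BMC interface. A trivial sketch (the addresses below are made up; only the 8-vs-B last digit mirrors the real mismatch):

```python
# Hypothetical MACs illustrating the label/Redfish mismatch described above.
label_mac = "3c:ec:ef:12:34:58"    # printed on the chassis label
redfish_mac = "3C:EC:EF:12:34:5B"  # as reported by the BMC via Redfish

def same_mac(a: str, b: str) -> bool:
    """Compare two MAC addresses ignoring case and separator style."""
    return a.replace("-", ":").lower() == b.replace("-", ":").lower()

print(same_mac(label_mac, redfish_mac))  # False: last digit 8 vs B
```

Since the license is derived from the MAC, a license generated from the label MAC would be rejected by the BMC, which matches the observed behavior.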
It seems that https://puppetboard.wikimedia.org/node/ms-be2083.codfw.wmnet shows 4 entries for accounts under the swift_disks fact list, while configure_disks.pp explicitly requires two:
Wed, Oct 30
Ran provision on ganeti1045+, and fixed the ADMIN password as well.
It seems that ganeti1044+ hosts were already provisioned, and I didn't notice an error when uploading the license to 1044:
Declaring this as closed since we have tested everything that we needed :)
Finally, all the hosts without the license, which were manually configured, should be ok. The only thing left is to make a quick pass and set the ADMIN password equal to the root one, since I ran provision with --no-users and --no-dhcp to avoid a status change in Netbox (for safety, Active hosts cannot be provisioned without --no-dhcp and --no-users).
My opinion is totally not relevant, but I've read the proposal and it seems great, thanks Ben! +1
Hey folks! I've uploaded all the Redfish licenses to these ganeti nodes, and ran provision again up to ganeti1043. I tried 1044, but it seems that it still needs its first provision run, and I don't have the custom BMC password, so I cannot proceed. Feel free to go ahead, and report issues if any!
Tue, Oct 29
aux-k8s-worker1003 migrated to containerd on Bookworm, all good so far. I tested nerdctl and it seemed to work; we'll keep the cluster monitored for a few days and then I'll migrate all the other VMs.
All the licenses are applied; the last step is to run the provision cookbook on all nodes.
Mon, Oct 28
@Ottomata thanks for the replies :)
@Titore I am closing this since, as far as I can tell, there is no action item left to do; please re-open if you feel we missed anything! Thanks! :)
Oct 25 2024
Summary of an IRC chat between me, Jayme and Matthew:
Tried to upgrade the SAS3809's firmware on ms-be2083 to see if the JBOD disks would be picked up, but no luck (followed https://www.supermicro.com/support/faqs/faq.cfm?faq=35522).
Oct 24 2024
There is something strange (at least for me) in the BMC's storage view:
Oct 23 2024
@Titore do you feel that the above Wikitech change is enough? The link is visible on the MOTD, so it should be easy for a user to get the privacy info. Lemme know :)
As I understand it, IRC users are no longer visible to each other on the new ircstream service, so I suppose a cloak may not be necessary. Still, a user joining irc.wikimedia.org might think their IP is exposed. Is this intended, or should a cloak be enabled on irc1003.wikimedia.org as well?
For the blubber part, it should be as easy as:
Oct 22 2024
To keep archives happy - in T376014 we moved irc.wikimedia.org's backend to https://github.com/paravoid/ircstream, a more modern stack that still uses UDP. It also supports EventStreams, and the next step will be to test/switch to the new event source and drop the UDP support. This task has had several birthdays and is not up to date; any further enhancements will be tracked in T376014.