iBet uBet web content aggregator. Adding the entire web to your favor.
Link to original content: http://phabricator.wikimedia.org/p/elukey/
♟ elukey

elukey (Luca Toscano)
Site Reliability Engineer - Machine Learning

User Details

User Since
Jan 5 2016, 9:54 PM (463 w, 5 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
LToscano (WMF) [ Global Accounts ]

Recent Activity

Fri, Nov 22

elukey closed T370453: Q1:rack/setup/install thanos-be1005 as Resolved.

The host is fully in service now and I had a chat with Matthew to put it in production, resolving!

Fri, Nov 22, 9:48 AM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-eqiad, DC-Ops
elukey updated the task description for T370453: Q1:rack/setup/install thanos-be1005.
Fri, Nov 22, 9:45 AM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-eqiad, DC-Ops

Thu, Nov 21

elukey added a comment to T380373: Allow TLS authenticated client to write on new topics.

+1 from my side!

Thu, Nov 21, 5:16 PM · Data-Platform-SRE, Traffic
elukey added a comment to T327396: Migrate Kartotherian to node-mapnik v4.2.1 and unfork.

Merged all the patches, and we finally have http://docker-registry.wikimedia.org/wikimedia/mediawiki-services-kartotherian:2024-11-21-145831-production \o/

Thu, Nov 21, 4:04 PM · WMDE-TechWish-Sprint-2024-10-16, Patch-For-Review, Essential-Work, Content-Transform-Team-WIP, WMDE-GeoInfo-FocusArea, Maps (Kartotherian)
elukey updated subscribers of T370453: Q1:rack/setup/install thanos-be1005.

@Jclark-ctr all configured, the host has been reimaged and all the disks show up.

Thu, Nov 21, 11:06 AM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-eqiad, DC-Ops

Wed, Nov 20

elukey added a comment to T370453: Q1:rack/setup/install thanos-be1005.

Quick note about the reimage step: due to a bug in Supermicro's BMC firmware (at least, this is what we suspect), the first reimage run will likely end up in two consecutive Debian installs, which cause the reimage to fail or stall. A subsequent reimage should be enough to fix it and get the node into its final state.

Wed, Nov 20, 5:22 PM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-eqiad, DC-Ops
elukey added a comment to T370453: Q1:rack/setup/install thanos-be1005.

@Jclark-ctr the host is provisioned, next step is the number 2 in T370453#10326159, lemme know if you want me to do it or not!

Wed, Nov 20, 2:59 PM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-eqiad, DC-Ops

Mon, Nov 18

elukey added a comment to T370453: Q1:rack/setup/install thanos-be1005.

@Jclark-ctr I updated the firmware to the correct one, but I'd need the BMC label password in pvt when you are in the DC (it is needed for the factory reset that needs to happen post-firmware upgrade, sigh). Thanks for the patience!

Mon, Nov 18, 12:05 PM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-eqiad, DC-Ops
elukey added a comment to T370452: Q1:rack/setup/install thanos-be2005.

Ok I found the issue, I asked Jenn to turn off IPv6 last week for the BMC network to test if that was the issue, but it was before upgrading the firmware. With the BMC reset the network settings are preserved, so the old test/setting caused the last hiccup in running provision.

Mon, Nov 18, 12:04 PM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T370452: Q1:rack/setup/install thanos-be2005.

My bad, I misremembered that we got the firmware for config J from Supermicro already (somehow I thought it was for the ganeti nodes, too many firmware floating around :D) and I uploaded it to thanos-be2005, followed by a factory reset. The issue is the same as happened on backup1012: T371416#10216617

Mon, Nov 18, 10:56 AM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops

Fri, Nov 15

elukey added a comment to T370452: Q1:rack/setup/install thanos-be2005.

I was able to upload the firmware via the Web UI, but the issue still seems to be present (new version, 01.04.08). I need to investigate the problem further, and/or ping Supermicro to give us the same firmware that they deployed to the ms-be nodes.

Fri, Nov 15, 6:18 PM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey updated subscribers of T370453: Q1:rack/setup/install thanos-be1005.

@Jclark-ctr Hello! For this host, we have to follow a new workflow:

Fri, Nov 15, 11:38 AM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-eqiad, DC-Ops
elukey updated subscribers of T370452: Q1:rack/setup/install thanos-be2005.

@Papaul @Jhancock.wm we'd need to upgrade the firmware on this node, I think that we could use this directly instead of the custom one. I tried to connect to the BMC web UI in various ways but I failed, since the BMC network config is the one that fails while provisioning. I also tried to do it by hand via DEL/Setup at boot, but for some reason I cannot modify any value (or my client prevents me from doing it remotely, not sure why).

Fri, Nov 15, 11:20 AM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T370452: Q1:rack/setup/install thanos-be2005.

While provisioning I see the following error for the BMC NIC config:

Fri, Nov 15, 10:50 AM · Patch-For-Review, SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops

Thu, Nov 14

elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

Tried to manually set the continuous flag on sretest2001 and rebooted, but I didn't see the boot options changing like on ms-be2088. So at this point it may not be relevant, but I can't explain the above differences. Maybe we just need to reimage all of them another time and they will get the same conf?

Thu, Nov 14, 8:55 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

@jhathaway something interesting that I found on Redfish related to BIOS boot options:

Thu, Nov 14, 8:43 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops

Wed, Nov 13

elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

@jhathaway another episode of the saga, ms-be2088 :D

Wed, Nov 13, 8:30 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops

Tue, Nov 12

elukey added a comment to T379592: Unable to deploy new version of recommendation-api to production due to connectivity issues.

I think I found the issue, this is what I see from a new pod (available only for the time of the deployment, then helmfile/helm rolls it back):

Tue, Nov 12, 11:55 AM · Unplanned-Sprint-Work, LPL Essential (LPL Essential 2024 Nov-Dec), Recommendation-API
elukey added a comment to T378944: Strategy to slowly move Kartotherian's traffic from bare metal to k8s.

All action items done, now the next step is to wait for the k8s service to be deployed on Wikikube :)

Tue, Nov 12, 10:17 AM · serviceops-radar, Maps (Kartotherian)

Mon, Nov 11

elukey added a comment to T378944: Strategy to slowly move Kartotherian's traffic from bare metal to k8s.

The new kartotherian.discovery.wmnet:6543 endpoint is available.

Mon, Nov 11, 4:23 PM · serviceops-radar, Maps (Kartotherian)
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

@elukey I was able to reproduce the issue, by wiping the files from the efi partition, before kicking off another re-image. I think the problem is actually in the debian-installer, rather than on the supermicro side, which is why we don't see this issue on sretest2001.codfw.wmnet. I think the debian-installer is failing to install grub properly and create the efi boot entry, which is part of the grub install process. I think the issue is related to setting grub-installer/bootdev which is done by autoinstall/scripts/partman_early_command.sh on the ms-be boxes. On ms-be2082 this evaluated to grub-installer/bootdev /dev/sdj /dev/sdk which seems correct, but perhaps /dev/sdk needs to be first? I also tried setting grub-installer/only_debian boolean false, which we set in the raid1-2dev-efi.cfg, but that didn't seem to have any effect, so I don't think we are still hitting, "#this workarounds LP #1012629 / Debian #666974", but I'm also not sure. I am off Monday, but happy to investigate more on Tuesday.

Mon, Nov 11, 2:47 PM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey edited P70998 rec-api-ng deployment log errors.
Mon, Nov 11, 11:23 AM
elukey created P70998 rec-api-ng deployment log errors.
Mon, Nov 11, 11:23 AM

Sun, Nov 10

elukey added a project to T379491: PROBLEM - MariaDB Replica SQL: s6 on db2217 is CRITICAL: CRITICAL: Data-Persistence-SRE.

I ran optimize table archive (11M records, seemed safe enough) after stopping the slave, and it seems to have recovered. I wasn't confident enough to declare it "ready for prod" so we decided not to repool, leaving the decision to data persistence :)

Sun, Nov 10, 12:39 PM · Data-Persistence-SRE, DBA

Fri, Nov 8

elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

I also tried not configuring any special JBOD config for ms-be2087 after provision, and kicked off a reimage to see if the double d-i issue appeared (to rule out special SAS controller features/settings), but no luck: still a double d-i at the first try.

Fri, Nov 8, 12:05 PM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

So far I provisioned up to ms-be2087, and ms-be2088 was left untouched. The ADMIN/root password should already be set to the one on pwstore, so if you want to go ahead and test with 2088 please do it :)

Fri, Nov 8, 11:40 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

Another test, leading to weird results. I tried to do the following:

Fri, Nov 8, 11:38 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

This is the boot order right after provisioning:

Fri, Nov 8, 10:04 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

Very interesting - I watched the sol1 console of ms-be2086 when doing provisioning, and right after the second round of reboot (for BIOS updates) I noticed an attempt to PXE boot over HTTP, which failed and ended up in:

Fri, Nov 8, 10:01 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

I tried with ms-be2085, doing the following:

Fri, Nov 8, 9:22 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

@jhathaway thanks a ton for the tests, it was exactly what I had in mind to do today :)

Fri, Nov 8, 8:30 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops

Thu, Nov 7

elukey added a comment to T378944: Strategy to slowly move Kartotherian's traffic from bare metal to k8s.

All the maps nodes are now serving traffic from port 6543 too. The next step is to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1087422 to create the new lvs endpoint kartotherian.discovery.wmnet:6543, and after that we'll be able to switch maps.wikimedia.org to it.

Thu, Nov 7, 5:02 PM · serviceops-radar, Maps (Kartotherian)

Wed, Nov 6

elukey created P70973 (An Untitled Masterwork).
Wed, Nov 6, 3:26 PM
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

Alas :(

I think adjusting the fact is the way to go? Presumably it now needs to keep track of the targets of the symlinks in /dev/disk/by-path and only emit one symlink per target...
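
The deduplication idea above (one symlink per target) could look roughly like the sketch below. It is written in Python for illustration only and is not the actual Puppet fact code; the function name `unique_by_path_links` and the "first link name in sorted order wins" rule are assumptions.

```python
import os
from typing import Dict, List


def unique_by_path_links(by_path_dir: str) -> List[str]:
    """Return one symlink per distinct target device.

    Hypothetical helper: when several symlinks in by_path_dir resolve to
    the same device, only the first one (in sorted name order) is kept.
    """
    chosen: Dict[str, str] = {}  # resolved target -> first symlink seen
    for name in sorted(os.listdir(by_path_dir)):
        link = os.path.join(by_path_dir, name)
        if not os.path.islink(link):
            continue  # ignore anything that is not a symlink
        target = os.path.realpath(link)
        chosen.setdefault(target, link)
    return sorted(chosen.values())
```

Calling `unique_by_path_links("/dev/disk/by-path")` on a host would return at most one path per underlying block device, which is the property the fact would need under this assumption.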

Wed, Nov 6, 10:56 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T216826: Move Kartotherian to Kubernetes.

To keep archives happy, I met with @Jgiannelos and we decided to proceed in this way:

Wed, Nov 6, 10:54 AM · Content-Transform-Team, WMDE-TechWish-Sprint-2022-11-29, serviceops, WMDE-TechWish-Sprint-2022-11-09, Platform Engineering, WMDE-TechWish-Sprint-2022-10-26, WMDE-TechWish-Maintenance, WMDE-GeoInfo-FocusArea, Epic, Maps (Kartotherian)
elukey added a comment to T378944: Strategy to slowly move Kartotherian's traffic from bare metal to k8s.

I had a chat with @Jgiannelos about the plan and it seems good, we just need to verify if the old plaintext lvs:ip combination is still in use. It will be really easy to do it once the Kartotherian Docker image is deployed on k8s, since we'll see egress rules (the theory is that kartotherian may need to contact itself via plaintext).

Wed, Nov 6, 10:35 AM · serviceops-radar, Maps (Kartotherian)
elukey added a comment to T373519: Allow UEFI DHCP configs.
  • The recipe for ms-be2083 may be wrong since during boot it can't find any media present, and forces PXE again (that triggers d-i etc...).
Wed, Nov 6, 10:31 AM · Infrastructure-Foundations
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

@MatthewVernon I tried to provision/reimage ms-be2083 with UEFI but we have the same /dev/disk/by-path duplication issue, I think it is something intrinsic to how the SAS controller is supported by udev/linux. We can either wait for the new controller to be deployed or adjust the puppet fact code to take into account the new format in /dev/disk/by-path.

Wed, Nov 6, 10:26 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T373519: Allow UEFI DHCP configs.

I found two issues while reimaging ms-be2083 (supermicro):

Wed, Nov 6, 9:27 AM · Infrastructure-Foundations
elukey closed T378345: Migrate the AUX K8s cluster to containerd as Resolved.
Wed, Nov 6, 8:19 AM · Infrastructure-Foundations, Prod-Kubernetes, Kubernetes
elukey closed T378345: Migrate the AUX K8s cluster to containerd, a subtask of T362408: Migration to containerd and away from docker, as Resolved.
Wed, Nov 6, 8:18 AM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops

Mon, Nov 4

elukey triaged T378584: Evaluate hw-raid controllers for Supermicro's Config J as Medium priority.
Mon, Nov 4, 4:00 PM · SRE-swift-storage, Infrastructure-Foundations, Data-Persistence, DC-Ops
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

We do have support for UEFI in the provision cookbook and in reimage (after https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1077497 is merged), but there are a couple of things that we are still working on, see the subtasks of T373519. There is nothing incredibly blocking but we are going outside the perimeter of what is battle tested in production, please be advised that there may be further issues to debug and the ms-be hosts would be the first production hosts to use UEFI. If everybody is onboard with this, we can go ahead :)

Mon, Nov 4, 2:45 PM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T216826: Move Kartotherian to Kubernetes.

I created T378944 to kick off the discussion about how to best move Kartotherian to k8s when ready.

Mon, Nov 4, 11:39 AM · Content-Transform-Team, WMDE-TechWish-Sprint-2022-11-29, serviceops, WMDE-TechWish-Sprint-2022-11-09, Platform Engineering, WMDE-TechWish-Sprint-2022-10-26, WMDE-TechWish-Maintenance, WMDE-GeoInfo-FocusArea, Epic, Maps (Kartotherian)
elukey created T378944: Strategy to slowly move Kartotherian's traffic from bare metal to k8s.
Mon, Nov 4, 11:28 AM · serviceops-radar, Maps (Kartotherian)
elukey added a comment to T365650: Q4:rack/setup/install ganeti1039 to ganeti1052.

Fixed 1044. For some reason IPv6 support was disabled, so our settings like IPv6AutoConfigEnabled: False led to an HTTP 400. I connected to the Web UI, turned IPv6 on and re-ran provision, all good. I've also set the host's status to Active in Netbox.

Mon, Nov 4, 9:00 AM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops

Fri, Nov 1

elukey added a comment to T365650: Q4:rack/setup/install ganeti1039 to ganeti1052.

@VRiley-WMF @Jclark-ctr I fixed the issue with ganeti1044 and ran provision, all good! The rest of the nodes should be fine as well :)

Fri, Nov 1, 1:36 PM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops
elukey closed T376121: Upload redfish licenses to supermicro hosts as Resolved.

Supermicro sent a new license for 1044 that worked, and I've successfully run the provision cookbook.

Fri, Nov 1, 1:35 PM · DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack
elukey closed T376121: Upload redfish licenses to supermicro hosts, a subtask of T365372: Spicerack: expand Supermicro support in the Redfish module, as Resolved.
Fri, Nov 1, 1:35 PM · Patch-For-Review, DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack
elukey updated the task description for T376121: Upload redfish licenses to supermicro hosts.
Fri, Nov 1, 1:34 PM · DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack

Thu, Oct 31

elukey added a comment to T365650: Q4:rack/setup/install ganeti1039 to ganeti1052.

Found the issue with 1044 (see T376121#10280005), I'll post an update as soon as Supermicro replies with the correct license.

Thu, Oct 31, 9:39 AM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops
elukey added a comment to T376121: Upload redfish licenses to supermicro hosts.

The MAC address that Supermicro provided to us on the server's label is not correct: the last digit that we have is 8, whereas the MAC address returned by Redfish lists B, and the MAC address is used to generate the license (I am 99% sure of it). Followed up with Supermicro via email.
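
A comparison like the one described above is easy to script. The following is a minimal sketch, assuming the label value and the Redfish value are both available as strings; the function names and the example addresses are hypothetical and not taken from the task.

```python
import re
from typing import List


def normalize_mac(mac: str) -> str:
    """Strip separators and upper-case a MAC address (12 hex digits)."""
    digits = re.sub(r"[^0-9A-Fa-f]", "", mac).upper()
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return digits


def mac_mismatch_positions(label_mac: str, redfish_mac: str) -> List[int]:
    """Return the 0-based hex-digit positions where the two MACs differ."""
    a = normalize_mac(label_mac)
    b = normalize_mac(redfish_mac)
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
```

With made-up addresses that differ only in the last digit, for example `00:11:22:33:44:58` and `00-11-22-33-44-5B`, the result is a single position (the final hex digit).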

Thu, Oct 31, 9:38 AM · DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

It seems that https://puppetboard.wikimedia.org/node/ms-be2083.codfw.wmnet shows 4 entries for accounts under the swift_disks fact list, whereas configure_disks.pp explicitly requires two:

Thu, Oct 31, 8:35 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops

Wed, Oct 30

elukey added a comment to T376121: Upload redfish licenses to supermicro hosts.

Ran provision on ganeti1045+, and fixed the ADMIN password as well.

Wed, Oct 30, 4:42 PM · DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack
elukey updated the task description for T376121: Upload redfish licenses to supermicro hosts.
Wed, Oct 30, 4:38 PM · DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack
elukey added a comment to T376121: Upload redfish licenses to supermicro hosts.

It seems that ganeti1044+ hosts were already provisioned, and I didn't notice an error when uploading the license to 1044:

Wed, Oct 30, 3:49 PM · DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack
elukey updated the task description for T376121: Upload redfish licenses to supermicro hosts.
Wed, Oct 30, 3:48 PM · DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack
elukey reopened T376121: Upload redfish licenses to supermicro hosts, a subtask of T365372: Spicerack: expand Supermicro support in the Redfish module, as Open.
Wed, Oct 30, 3:41 PM · Patch-For-Review, DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack
elukey reopened T376121: Upload redfish licenses to supermicro hosts as "Open".
Wed, Oct 30, 3:41 PM · DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack
elukey closed T365167: Q4:rack/setup/install sretest2001 as Resolved.

Declaring this as closed since we have tested everything that we needed :)

Wed, Oct 30, 2:38 PM · SRE, Infrastructure-Foundations, ops-codfw, DC-Ops
elukey closed T376121: Upload redfish licenses to supermicro hosts as Resolved.

Finally all the hosts without the license, that were manually configured, should be ok. The only remaining thing left is to make a quick pass and set the ADMIN password equal to the root one, since I've run provision with --no-users and --no-dhcp to avoid a change of status in Netbox (Active hosts cannot get provisioned without --no-dhcp and --no-users for safety).

Wed, Oct 30, 12:00 PM · DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack
elukey closed T376121: Upload redfish licenses to supermicro hosts, a subtask of T365372: Spicerack: expand Supermicro support in the Redfish module, as Resolved.
Wed, Oct 30, 12:00 PM · Patch-For-Review, DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack
elukey updated the task description for T376121: Upload redfish licenses to supermicro hosts.
Wed, Oct 30, 11:53 AM · DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack
elukey updated the task description for T376121: Upload redfish licenses to supermicro hosts.
Wed, Oct 30, 11:34 AM · DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack
elukey updated the task description for T376121: Upload redfish licenses to supermicro hosts.
Wed, Oct 30, 11:12 AM · DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack
elukey added a comment to T371416: Q1:rack/setup/install backup1012.

Also, @elukey - the RAID controller kit that Supermicro is currently suggesting for us is BTR-CV3908-FT1 - 8GB cache module, which is non-volatile so no battery needed. Let me know if you have any initial thoughts or opinion on this one.

Wed, Oct 30, 11:09 AM · SRE, Data-Persistence-Backup, Data-Persistence, ops-eqiad, DC-Ops
elukey added projects to T378584: Evaluate hw-raid controllers for Supermicro's Config J: Data-Persistence, Infrastructure-Foundations.
Wed, Oct 30, 11:07 AM · SRE-swift-storage, Infrastructure-Foundations, Data-Persistence, DC-Ops
elukey created T378584: Evaluate hw-raid controllers for Supermicro's Config J.
Wed, Oct 30, 11:06 AM · SRE-swift-storage, Infrastructure-Foundations, Data-Persistence, DC-Ops
elukey added a comment to T352650: Migrate current-generation dumps to run from our containerized images.

My opinion is totally not relevant, but I've read the proposal and it seems great, thanks Ben! +1

Wed, Oct 30, 8:24 AM · Data-Platform-SRE, Epic, Data Products, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
elukey added a comment to T365650: Q4:rack/setup/install ganeti1039 to ganeti1052.

Hey folks! I've uploaded all the Redfish licenses to these ganeti nodes, and ran provision again up to ganeti1043. I tried 1044 but it seems that it still needs to run the first provision run, and I don't have the custom BMC password, so I cannot proceed. Feel free to go ahead and report issues if any!

Wed, Oct 30, 7:55 AM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops

Tue, Oct 29

elukey updated the task description for T376121: Upload redfish licenses to supermicro hosts.
Tue, Oct 29, 5:01 PM · DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack
elukey updated the task description for T376121: Upload redfish licenses to supermicro hosts.
Tue, Oct 29, 4:23 PM · DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack
elukey added a comment to T378345: Migrate the AUX K8s cluster to containerd.

aux-k8s-worker1003 migrated to containerd on Bookworm, all good so far. I tested nerdctl and it seemed to work, we'll keep the cluster monitored for a few days and then I'll migrate all the other VMs.

Tue, Oct 29, 4:16 PM · Infrastructure-Foundations, Prod-Kubernetes, Kubernetes
elukey updated the task description for T376121: Upload redfish licenses to supermicro hosts.
Tue, Oct 29, 4:00 PM · DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack
elukey updated the task description for T376121: Upload redfish licenses to supermicro hosts.
Tue, Oct 29, 2:25 PM · DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack
elukey added a comment to T376121: Upload redfish licenses to supermicro hosts.

All the licenses are applied, the last step is to run the provision cookbook on all nodes.

Tue, Oct 29, 11:50 AM · DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack
elukey updated the task description for T376121: Upload redfish licenses to supermicro hosts.
Tue, Oct 29, 11:40 AM · DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack

Mon, Oct 28

elukey added a comment to T376014: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps.

@Ottomata thanks for the replies :)

Mon, Oct 28, 2:26 PM · SRE-Unowned, SRE, Infrastructure-Foundations
elukey triaged T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool as Medium priority.
Mon, Oct 28, 2:23 PM · DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
elukey closed T377967: Better highlight user-facing IRC changes following migration to ircstream as Resolved.

@Titore I am closing this since, as far as I can tell, there is no action item left to do, please re-open if you feel we missed anything! Thanks! :)

Mon, Oct 28, 2:18 PM · Infrastructure-Foundations
elukey triaged T378345: Migrate the AUX K8s cluster to containerd as Medium priority.
Mon, Oct 28, 2:17 PM · Infrastructure-Foundations, Prod-Kubernetes, Kubernetes
elukey created T378345: Migrate the AUX K8s cluster to containerd.
Mon, Oct 28, 11:27 AM · Infrastructure-Foundations, Prod-Kubernetes, Kubernetes

Oct 25 2024

elukey added a comment to T377853: RAID monitoring on new hardware spec requires new or updated user space cli tool.

Summary of an IRC chat between me, Jayme and Matthew:

Oct 25 2024, 9:45 AM · DC-Ops, SRE-tools, observability, Puppet, Infrastructure-Foundations
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

Tried to upgrade the SAS3809's firmware on ms-be2083 to see if the JBOD disks would be picked up, but no luck (followed https://www.supermicro.com/support/faqs/faq.cfm?faq=35522).

Oct 25 2024, 9:40 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops

Oct 24 2024

elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

There is something strange (at least for me) in the BMC's storage view:

Oct 24 2024, 8:51 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops

Oct 23 2024

elukey added a comment to T376014: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps.

reconnecting using the last event id's timestamp may be lossy

I think it shouldn't be lossy, but it will result in re-consuming messages that have already been consumed.

Everything else you wrote about EventStreams is correct. :)
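
The re-consumption mentioned above can be absorbed on the consumer side by remembering which events were already processed. The sketch below is only an illustration: it assumes each event is a dict with a unique `id` field, which is a hypothetical shape and not the actual EventStreams or ircstream schema.

```python
from typing import Dict, Iterable, Iterator, Set


def drop_replayed(events: Iterable[Dict], seen_ids: Set[str]) -> Iterator[Dict]:
    """Yield only events whose id has not been seen before.

    seen_ids is updated in place, so the same set can be reused across
    reconnections to skip messages that are delivered a second time.
    """
    for event in events:
        event_id = event["id"]  # hypothetical unique identifier field
        if event_id in seen_ids:
            continue  # already consumed before the reconnect
        seen_ids.add(event_id)
        yield event
```

In a long-running consumer the set would need to be bounded (for example by keeping only recent ids), which this sketch leaves out.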

Oct 23 2024, 5:51 PM · SRE-Unowned, SRE, Infrastructure-Foundations
elukey added a comment to T377967: Better highlight user-facing IRC changes following migration to ircstream.

@Titore do you feel that the above Wikitech change is enough? The link is visible on the MOTD, so it should be easy for a user to get the privacy info. Lemme know :)

Oct 23 2024, 3:32 PM · Infrastructure-Foundations
elukey added a comment to T377967: Better highlight user-facing IRC changes following migration to ircstream.

Added https://wikitech.wikimedia.org/w/index.php?title=Irc.wikimedia.org&diff=2238422&oldid=1986975

Oct 23 2024, 3:31 PM · Infrastructure-Foundations
elukey added a comment to T377967: Better highlight user-facing IRC changes following migration to ircstream.

As I understand, IRC users are no longer visible to each other on the new ircstream service, so I suppose a cloak may not be necessary. Still, a user joining irc.wikimedia.org may think their IP is exposed. Is this intended, or should a cloak be enabled on irc1003.wikimedia.org as well?

Oct 23 2024, 3:23 PM · Infrastructure-Foundations
elukey added a comment to T327396: Migrate Kartotherian to node-mapnik v4.2.1 and unfork.

For the blubber part, it should be as easy as:

Oct 23 2024, 10:34 AM · WMDE-TechWish-Sprint-2024-10-16, Patch-For-Review, Essential-Work, Content-Transform-Team-WIP, WMDE-GeoInfo-FocusArea, Maps (Kartotherian)
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

I had a quick look at the web UI of the BMC, and saw the following under storage:

ms-be2082-disks.png (455×394 px, 34 KB)

...which looks to my inexpert eye like none of the drives are configured to be JBOD, which might explain why the OS can't see them?

Oct 23 2024, 8:59 AM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey added a comment to T376014: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps.

if/when we'll decide to move to Eventstreams

Are you sure you want to move to EventStreams? Would consuming directly from Kafka not be better?

Oct 23 2024, 8:31 AM · SRE-Unowned, SRE, Infrastructure-Foundations

Oct 22 2024

elukey updated the task description for T376121: Upload redfish licenses to supermicro hosts.
Oct 22 2024, 1:56 PM · DC-Ops, Infrastructure-Foundations, SRE-tools, User-Elukey, Spicerack
elukey added a comment to T371400: Q1:rack/setup/install ms-be208[1-8].

@elukey thank you for putting this together. There is one thing I am trying to understand here:
"Wait for the update and BMC reset
At this point, for some reason, the BMC's ADMIN user is configured with the custom label unique password, not calvin, same thing for the Redfish stack (the provision cookbook still fails if we run it at this point)."
What I understand from the above is that the BMC firmware update has some way of reading the label's unique password and setting the BMC password to it, which is why running the provision cookbook fails at this point: the BMC is no longer using the calvin password.

"A BMC Factory reset correctly sets the expected state (calvin password for ADMIN, correct Redfish stack)."
From the above, a second reset sets the BMC password back to calvin, so the provision cookbook now works.

Is my understanding right?

Oct 22 2024, 1:37 PM · SRE, SRE-swift-storage, Data-Persistence, ops-codfw, DC-Ops
elukey closed T234234: Port architecture of irc-recentchanges to Kafka as Resolved.

To keep archives happy - in T376014 we moved irc.wikimedia.org's backend to https://github.com/paravoid/ircstream, a more modern stack that still uses UDP. It also supports EventStreams, and the next step will be to test/switch to the new event source and drop the UDP support. This task has several birthdays and it is not up-to-date, any further enhancement will be tracked in T376014.

Oct 22 2024, 10:42 AM · Data-Engineering, MW-1.41-notes (1.41.0-wmf.10; 2023-05-23), Event-Platform, User-Elukey, Analytics
elukey closed T234234: Port architecture of irc-recentchanges to Kafka, a subtask of T128592: Add redundancy to IRC recent changes service, as Resolved.
Oct 22 2024, 10:39 AM · Sustainability, SRE, codfw-rollout
elukey added a comment to T376014: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps.

Future steps: keep improving EventStreams support in ircstream, and hopefully move Bots away from irc as much as possible (in favor of contacting the ES service directly).

Modulo fixing the root causes of the rollback; are {T240182: Create EventStream's equivalent to irc.wikimedia.org's #central channel} and {T234234: Port architecture of irc-recentchanges to Kafka} still relevant, or do we need some grooming / reboot of those tasks?

Oct 22 2024, 10:37 AM · SRE-Unowned, SRE, Infrastructure-Foundations
elukey closed T240182: Create EventStream's equivalent to irc.wikimedia.org's #central channel, a subtask of T234234: Port architecture of irc-recentchanges to Kafka, as Declined.
Oct 22 2024, 10:32 AM · Data-Engineering, MW-1.41-notes (1.41.0-wmf.10; 2023-05-23), Event-Platform, User-Elukey, Analytics
elukey closed T240182: Create EventStream's equivalent to irc.wikimedia.org's #central channel as Declined.
Oct 22 2024, 10:32 AM · Data-Engineering, Event-Platform, User-Elukey
elukey closed T242712: Deprecation (if possible) of the #central channel on irc.wikimedia.org, a subtask of T240182: Create EventStream's equivalent to irc.wikimedia.org's #central channel, as Declined.
Oct 22 2024, 10:32 AM · Data-Engineering, Event-Platform, User-Elukey