User Details
- User Since: Jan 23 2023, 12:05 PM (96 w, 3 d)
- Availability: Available
- LDAP User: EoghanGaffney
- MediaWiki User: EGaffney-WMF
Tue, Nov 26
The pipeline seems to be working correctly, and we've got documentation in place. I think we can close this out as completed!
Mon, Nov 25
We've been doing some investigating over the last week, and it's a very hard problem to track down. There's no useful information in the logs, and no exceptions or anything else that would indicate what's been happening.
Fri, Nov 22
I see what's going on. There are three checks provided by VALIDITY configured by default: RCVD_IN_VALIDITY_CERTIFIED, RCVD_IN_VALIDITY_SAFE, and RCVD_IN_VALIDITY_RPBL. We disabled the first one but not the other two, which is why we saw a small but measurable decrease in false clean messages, and why we're still seeing the other VALIDITY checks in spam messages. They can be seen configured here. I also checked that they were all responding to every query with the same 'exceeded' error message.
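For reference, disabling the remaining two rules should just be a matter of zeroing their scores. A minimal sketch, assuming the usual SpamAssassin local.cf-style override (the exact file and location are an assumption, not taken from our config):

```
# Zero the scores of the remaining two VALIDITY rules so they no longer
# contribute to the spam score, effectively disabling them.
score RCVD_IN_VALIDITY_SAFE 0
score RCVD_IN_VALIDITY_RPBL 0
```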
Thu, Nov 21
I've put in a change to disable this specific check; we're also going to look at whether we can sign up for an account with them to get a higher usage limit.
Mon, Nov 18
@jhathaway It was a rule set up to change the envelope-to of mail from a given source. When we disabled the rule, gmail started returning 550s for any address unknown in the wm.o domain, but when the rule was re-enabled, it was back to 250/ok for anything unknown. When we set the "Account types to affect" not to include the catch-all, it started returning 550s again. It's not clear whether leaving the catch-all unchecked is desirable behaviour on the ITS side; we're waiting to hear back on that. I'm sure there'll be a way to work around it if we have to.
We had a quick chat with ITS today where they disabled the change that caused the routing to change, and it did cause gmail to start returning 550 for unknown addresses again, so we have confirmed their change was what caused this to start behaving differently.
Fri, Nov 15
We've made a change to the aliases routing script which we believe has fixed the problem. I've verified that mail is delivering to vrts now, and we've seen two of our test tickets arrive.
So the issue is coming from the vrts_aliases.py cron job. Something has changed in how gmail is responding to emails here, and it's claiming they're valid. I'm going to see if we can change the script to ignore the gmail check.
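For illustration only, here's a rough Python sketch of the kind of SMTP callout check that breaks once a catch-all answers 250 for everything, and what ignoring it could look like. None of the names below come from the actual vrts_aliases.py; they're assumptions for the example.

```
import smtplib

GMAIL_MX = "aspmx.l.google.com"   # assumption: callouts go to Google's MX
SKIP_GMAIL_CHECK = True           # proposed behaviour: stop trusting the callout


def gmail_accepts(address: str) -> bool:
    """Ask the MX whether it would accept RCPT TO for this address."""
    with smtplib.SMTP(GMAIL_MX, 25, timeout=10) as smtp:
        smtp.helo("lists.wikimedia.org")
        smtp.mail("<>")
        code, _ = smtp.rcpt(address)
        # With a catch-all enabled this is 250 for *every* address, so
        # "accepted" no longer tells us whether the address really exists.
        return 200 <= code < 300


def keep_alias(address: str) -> bool:
    """Decide whether to keep generating a VRTS alias for this address."""
    if SKIP_GMAIL_CHECK:
        return True                      # purely illustrative fallback
    return not gmail_accepts(address)    # only alias what gmail won't handle
```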
Oct 24 2024
We've been going back and forth on this with znuny; they have a few suggestions, which I'm tracking here so we can see what to try:
Oct 21 2024
Silenced these alerts for 48 hours.
Oct 4 2024
The file is cleaned up, closing.
Oct 1 2024
I commented out the extra lines and rebooted, the host came back up and mail is flowing as expected.
Sep 26 2024
208.80.154.21/32 and 2620:0:861:1:208:80:154:21/128 are both from the linked puppet change, but the third address, 2620:0:861:1:208:80:154:81/64, isn't. This was generated somewhere else.
Sep 17 2024
This is no longer an issue
Sep 9 2024
This masks the service so it can't be started inadvertently, which should stop these alarms.
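For the record, masking is roughly the following; the unit name here is illustrative, not the real one:

```
# A masked unit is symlinked to /dev/null, so neither a package's postinst
# nor a manual start can bring it up until it's unmasked. --now also stops it.
sudo systemctl mask --now example-mail-relay.service
```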
Sep 6 2024
This was an after-effect of rebooting the host.
This was due to the other list host rebooting.
Sep 5 2024
This was me restarting the host related to T373980.
Yeah, that's right -- we moved from ferm to nftables, but then reverted because of T373637. I'll take a look at the cleanup later this afternoon.
@fgiunchedi pointed out that there was still some space free in the VG, so the volume could be expanded instead. There's approximately 90G of unallocated space on the disk, and since the mailman2 data will never grow (the newest file in that directory is from 2001), this should give us sufficient headroom for logrotate to take care of the rest.
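A rough sketch of the expansion, assuming an ext4 filesystem; the VG/LV names are placeholders rather than the real ones:

```
# Grow the logical volume into the ~90G of free space in the VG, and grow
# the filesystem to match (--resizefs does both in one step).
sudo lvextend --resizefs -L +90G /dev/vg0/srv
# Equivalent two-step version:
#   sudo lvextend -L +90G /dev/vg0/srv
#   sudo resize2fs /dev/vg0/srv
```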
Aug 23 2024
I've changed the backup script to tolerate a failure in the prometheus-pushgateway, so this can be closed.
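For illustration, "tolerating a failure" could look something like the snippet below in a shell backup script; the host and job names are placeholders, not the real setup:

```
# Push backup metrics, but never let a pushgateway outage fail the backup run.
if ! curl --silent --fail --max-time 10 \
    --data-binary @backup_metrics.prom \
    "http://prometheus-pushgateway.example:9091/metrics/job/backup"; then
  echo "WARNING: could not reach the pushgateway, continuing anyway" >&2
fi
```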
Aug 22 2024
This seems to have been a blip that hasn't recurred.
Aug 16 2024
I've added the user to the wmf group. @dchan, I'm going to close this now, let me know if anything seems missing!
Hi @odimitrijevic, could you please look at this as an approver for the analytics-privatedata-users group? Thanks!
Aug 15 2024
Confirmed working!
Jul 9 2024
lists1001 has been decommissioned and all current hosts are running bookworm.
This was mostly a brain-dump just before I left on PTO, so we're going to close this in favour of some of the other more detailed tasks, namely T278495: Figure out plan for mailman IP situation and T286066: Put lists.wikimedia.org web interface behind LVS
Jul 4 2024
lists1003 doesn't exist anymore, so this can probably be closed.
Jul 2 2024
I've spoken with @Ladsgroup; I think there's nothing immediate for sre-collab to do here, so I'm reassigning. Feel free to send it back to me if that changes!
I think we can close this, since the puppet module now installs mailman3 on lists2001 (albeit disabled), unless I'm missing something.
lists1001 has been powered off. It will stay off for 1 week, and then I'll decommission it fully on Tuesday, 9th July; after that we can close this ticket.
This was fixed by the patch merged on June 20th.
Puppet has been re-enabled and run successfully; sorry for the noise.
Jun 21 2024
The migration to the new host is done. The last remaining item before we can close this ticket is to decommission the old host. We're going to keep that around for two weeks after the migration, which will be Tuesday 2nd July. The host will be shut down on that date, and decommissioned on the Tuesday after.
Jun 20 2024
The patch for this was merged, so this should no longer be an issue.
Yep!
Jun 19 2024
The maintenance was completed yesterday and so far the service seems stable. I'm going to close this now, and we can re-open if we come across any issues.
Jun 18 2024
That's right -- we'll be doing that as part of the maintenance work later today. We kept them firewalled off so that the non-active host isn't writing to the database at the same time as the active one. In the future it might make more sense to allow all hosts access, but have a read/write user for the active host and a read-only user for the non-active one.
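Sketching that idea out; the database, user, and host names below are made up, not the real grants:

```
-- Active host gets a read/write account; the standby gets read-only access.
CREATE USER 'mailman3_rw'@'lists-active.example'  IDENTIFIED BY '...';
CREATE USER 'mailman3_ro'@'lists-standby.example' IDENTIFIED BY '...';
GRANT SELECT, INSERT, UPDATE, DELETE ON mailman3.* TO 'mailman3_rw'@'lists-active.example';
GRANT SELECT ON mailman3.* TO 'mailman3_ro'@'lists-standby.example';
```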
Jun 17 2024
It's possible that the grants are already covered by the proxies listed here, but it would be good to check before we start our migration.
Jun 14 2024
I've created a sub-task for the migration itself so users and community members can follow it more easily, rather than trawling through comments and patch notifications. It's been tagged with User-notice so it ends up in Tech News. The downtime will be on Tuesday 18th from 10:00 to 12:00 UTC.
Jun 10 2024
This was due to the apt package starting the service (and failing) despite the puppet recipe being set to ensure => stopped.
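A hedged sketch of one way to prevent that, assuming Puppet's systemd service provider; the service name is a placeholder. With enable => mask the unit is masked, so a package's postinst can't start it, whereas ensure => stopped on its own only stops it again on the next Puppet run:

```
# Placeholder service name, not the real unit.
service { 'example-mail-relay':
  ensure => stopped,
  enable => mask,   # mask the unit so package scripts can't start it
}
```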
Jun 6 2024
The rough outline for migration is:
May 30 2024
I've also run the upgrade on gitlab1004, leaving only the primary (gitlab2002) to be upgraded.
May 28 2024
I ran the test upgrade (sudo gitlab-ctl pg-upgrade) this afternoon on gitlab1003, and it succeeded. The total time required was 1m51s, and it generated no error messages.