User Details
- User Since
- Jul 18 2022, 2:39 PM (123 w, 3 d)
- Availability
- Available
- IRC Nick
- dhinus
- LDAP User
- FNegri
- MediaWiki User
- FNegri-WMF
Today
kinda weird behavior if you ask me
@dcaro thanks for that analysis! I had a look at the source code for Resolv::DNS and apparently ResolvTimeout is only triggered if you enable the optional :raise_timeout_errors, see also https://bugs.ruby-lang.org/issues/18151
Yesterday
+1 for reopening T374830, and maybe linking T380882: openstack network problems (November 2024) as a parent task.
Right now, the ISA tool is down. For hours, only the main page has loaded.
An additional webservice restart seems to have fixed it.
I manually edited the file www/python/src/isa/config.yaml, replacing tools-db-1 with tools.db.svc.wikimedia.cloud and restarted the worker with toolforge jobs restart celery-worker.
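For reference, the hostname swap described above can be sketched in Python. This is only an illustration: the real layout of config.yaml is not shown here, so the `db_host` key in the sample text and the `switch_db_host` helper are hypothetical. The pattern matches the old name only as a standalone hostname, so similar names such as tools-db-10 are left alone.

```python
import re


def switch_db_host(config_text: str,
                   old: str = "tools-db-1",
                   new: str = "tools.db.svc.wikimedia.cloud") -> str:
    """Replace a standalone hostname in a config file's text.

    Hypothetical helper: it rewrites plain text and does not parse YAML.
    """
    # Do not match when the name is part of a longer hostname
    # (e.g. "tools-db-10" or "tools-db-1.example").
    pattern = r"(?<![\w.-])" + re.escape(old) + r"(?![\w.-])"
    return re.sub(pattern, new, config_text)


# Sample text with an assumed key name, not the actual ISA config.
sample = "db_host: tools-db-1\n"
updated = switch_db_host(sample)
```

After a change like this the worker still needs a restart (toolforge jobs restart celery-worker) to pick up the new value.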
Cc-ing the ISA maintainers @Sebastian_Berlin-WMSE and @NavinoEvans
This server is currently undergoing maintenance (T380673) but it was not downtimed.
Tue, Nov 26
The logs have already been rotated by journald so I cannot find the message that triggered this alert, but I think they were I/O errors caused by the disk being full, as discussed in IRC:
The first is expected, the second seems harmless too:
The error that triggered this alert is:
Joanna is out sick, but I discussed this with her and we have a team-wide meeting next month to discuss if we want to use wmcs-roots or if we should create more granular permission levels.
Mon, Nov 25
Assigning this task to @Andrew as he's currently working on a patch.
The upgrade is complete. tools-db-4 is the new primary and tools-db-5 is the new replica. I shut off the old hosts tools-db-1 and tools-db-3.
Possibly the (still unclear) root cause for this is the same as T379927: Puppet removed "nameserver" line from /etc/resolv.conf
These can all be dropped, fwiw :-)
Re-opening as I also want to import them to tofu-infra.
Successfully migrated with the following commands:
This has just caused a WMCS proxy outage, because the nameserver was removed from both /etc/resolv.conf and the Nginx config files in /etc/nginx/sites-available/.
Fri, Nov 22
@aborrero I spent some time today on this refactor. The MR is now producing a working "tofu plan", but it's still a work in progress.
Thu, Nov 21
We should also make sure that these records are no longer managed by the wmcs-wikireplica-dns script: T374953: tofu-infra: replace wmcs-wikireplica-dns.py with tofu
I have added many more details to the procedure in Wikitech, and I will test it next Monday as part of T352206: [toolsdb] Upgrade to MariaDB 10.6.
@aborrero I will try to import the records to tofu-infra in the process!
Changing this specific subtask to "High" and assigning to myself as I want to do it as part of T352206: [toolsdb] Upgrade to MariaDB 10.6, which requires me to update the DNS records in this zone.
Wed, Nov 20
I create an out-of-band project X using e.g. the cli
on next tofu run project X is 'cleaned up'
simply undefined those projects which resulted in them being deleted
The reservations table looks consistent:
The "reserved" column is inconsistent again, but this time it's ram and volumes that were not reset, instead of instances and volumes:
Puppet was failing on this host because of T379927: Puppet removed "nameserver" line from /etc/resolv.conf. I manually added the missing nameserver line to resolv.conf and Puppet is now working again.
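A quick way to spot this condition is to check whether resolv.conf still contains a nameserver entry. The sketch below is an assumption-laden illustration, not part of the Puppet fix: the `nameservers` and `has_nameserver` helpers are hypothetical, and the address in the sample text is a documentation-range placeholder.

```python
def nameservers(resolv_conf_text: str) -> list[str]:
    """Return the addresses listed on 'nameserver' lines of resolv.conf text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comments ('#' or ';' at the start of the line).
        if not stripped or stripped.startswith(("#", ";")):
            continue
        fields = stripped.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


def has_nameserver(resolv_conf_text: str) -> bool:
    """True if at least one nameserver line is present."""
    return bool(nameservers(resolv_conf_text))


# Placeholder content; on a real host, read /etc/resolv.conf instead.
broken = "search example.org\noptions timeout:1\n"
fixed = broken + "nameserver 192.0.2.53\n"
```

Here `has_nameserver(broken)` is false and `has_nameserver(fixed)` is true, which mirrors the state of the file before and after the manual fix.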
having it delete things that are undefined in opentofu is generally bad.