The Year of Endless Technical Problems

  • Thread starter: Null
Status
Not open for further replies.
1757522727103.webp

Just had this happen by merely clicking "What's new" by the way.
 
this has absolutely and completely and totally slaughtered the system. iowait is 25%+ with no cpu usage.
Is
Code:
$ zcat /proc/config.gz | grep CONFIG_HZ_1000
CONFIG_HZ_1000=y
on your machine?
Sorry for not telling you about that. You need a fast tick rate. Back to the drawing board, I'll try to test with a load that creates higher IOWAIT on a system I have access to.
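In case /proc/config.gz isn't there (it only exists when the kernel was built with CONFIG_IKCONFIG_PROC, and plenty of distro kernels skip it), here's a rough sketch that falls back to the config file under /boot. The function name kernel_hz is just something I made up for this post.

```shell
# Hypothetical helper: print the CONFIG_HZ lines from whichever kernel
# config source is available. Pass a file path to check a specific config.
kernel_hz() {
    if [ -n "${1:-}" ]; then
        grep -E '^CONFIG_HZ(_[0-9]+)?=' "$1"
    elif [ -r /proc/config.gz ]; then
        zcat /proc/config.gz | grep -E '^CONFIG_HZ(_[0-9]+)?='
    else
        # Assumption: the distro installs the config next to the kernel image.
        grep -E '^CONFIG_HZ(_[0-9]+)?=' "/boot/config-$(uname -r)"
    fi
}

# Usage: kernel_hz
```

You want to see CONFIG_HZ_1000=y and CONFIG_HZ=1000 in the output.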

Do you have a particular NUMA setup? What program gets pinned to what node, that sort of thing. That might be useful to know for testing.
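If it helps, this is roughly how I'd dump the NUMA layout straight from sysfs without installing anything. numa_layout is a made-up name, and the sysfs path is the standard one as far as I know.

```shell
# Hypothetical helper: list each NUMA node and the CPUs that belong to it.
# The optional argument overrides the sysfs base directory.
numa_layout() {
    base="${1:-/sys/devices/system/node}"
    for d in "$base"/node[0-9]*; do
        [ -r "$d/cpulist" ] || continue
        printf '%s: cpus %s\n' "${d##*/}" "$(cat "$d/cpulist")"
    done
}

# Usage: numa_layout
# To see what a given process is allowed to run on: taskset -cp <pid>
# With numactl installed, "numactl --hardware" also shows per-node memory.
```

A single line for node0 means there's effectively no NUMA split to worry about.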

Ah yeah, the scheduler behaves differently when IOWAIT is high but CPU is low.
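For anyone wanting to put a number on that without extra tools, here's a sketch that computes the iowait share between two samples of the aggregate "cpu" line in /proc/stat. iowait_pct is a hypothetical name, and I'm assuming the usual field order (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
# Hypothetical helper: iowait_pct "<first cpu line>" "<second cpu line>"
# Prints iowait as a percentage of all CPU time elapsed between the samples.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total = 0; for (i = 2; i <= 9; i++) total += $i; io = $6 }
        NR == 1 { t0 = total; io0 = io }
        NR == 2 {
            dt = total - t0
            if (dt > 0) printf "%.1f\n", 100 * (io - io0) / dt
            else print "0.0"
        }'
}

# Usage:
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   iowait_pct "$a" "$b"
```

If that prints 25 or more while user and system time stay near zero, you're looking at the same pattern described above.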
 
Do you have compression enabled on any filesystems or ZFS datasets? If so, turn it off right now. I had a problem where my system would stall for a minute after using a Windows VM. Turns out it was caused by f2fs's kernel threads compressing all the writes that were done to the VM image.
A way to check for similar issues is to show kernel threads in htop (Shift+K) and look for ones with high priority and CPU usage. Also keep a window open with dmesg -w -H and watch for anything interesting to show up.
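A quick sketch of the same checks from a shell, in case htop isn't handy. compressed_datasets is a name I made up; it just filters the tab-separated output of zfs get for anything that isn't "off".

```shell
# Hypothetical helper: read "name<TAB>value" lines on stdin and print the
# datasets whose compression property is anything other than "off".
compressed_datasets() {
    awk -F '\t' '$2 != "off" { print $1 " (" $2 ")" }'
}

# Usage (ZFS):
#   zfs get -H -o name,value compression | compressed_datasets
# Other filesystems usually show compression as a mount option:
#   grep compress /proc/mounts
# Kernel threads are children of kthreadd (PID 2); busiest first:
#   ps --ppid 2 -o pid,pri,pcpu,comm --sort=-pcpu | head
```

No output from the first two commands means no compression is configured, at least not in a way these checks can see.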
 
I didn't make much progress. I don't want to bother you more. If you get a CONFIG_HZ_1000=y kernel and details on your NUMA setup (if you have one) I can try to help again.
 
I've had some more thoughts about this since yesterday. The correct way is still to add monitoring until the problem becomes apparent, but seeing as we're doin' the cowboy thing I have a few things to try that so far haven't been suggested (in this thread at least).

Have you tried turning PCIe power management off? Just add "pcie_aspm=off" to the kernel command line in GRUB (GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub on Debian), run update-grub, and reboot. I've seen a few cases where buggy power management tanks performance or imitates a flaky PCIe device or connection. And since you have NVMe drives...
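To see what ASPM is doing before and after the change, something like this should work. active_aspm_policy is a made-up helper; it assumes the kernel exposes the policy file at the usual sysfs path, with the active policy shown in square brackets.

```shell
# Hypothetical helper: print the active ASPM policy, i.e. the bracketed
# entry in the policy file. The optional argument overrides the file path.
active_aspm_policy() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "${1:-/sys/module/pcie_aspm/parameters/policy}"
}

# Usage: active_aspm_policy
# After rebooting, confirm the parameter actually reached the kernel:
#   grep -o 'pcie_aspm=[^ ]*' /proc/cmdline
# Per-device link state (needs root for full detail):
#   lspci -vv | grep ASPM
```

If the grep on /proc/cmdline prints nothing after the reboot, the GRUB change didn't take.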

I assume the server has ECC memory, but do you have rasdaemon set up so you will actually see ECC (and other machine check) errors? ECC errors will tank performance, but they can be sporadic depending on what (if anything) is using that memory, or even on memory temperature, and the machine won't necessarily crash if the ECC can recover. Since the site hasn't been down for several days recently you probably haven't run a memory test, but at this point it might be worth it. You MUST use the free version of MemTest86 from the company website (PassMark's MemTest86, not memtest86+). The one bundled with most Linux distros WILL NOT REPORT CORRECTED ECC ERRORS.
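Even without rasdaemon, the kernel's EDAC driver keeps running counters in sysfs if it's loaded for your memory controller. A sketch of reading them (edac_counts is a hypothetical name, and the path assumes the standard EDAC layout):

```shell
# Hypothetical helper: print corrected/uncorrected error counts for each
# memory controller. The optional argument overrides the EDAC base directory.
edac_counts() {
    base="${1:-/sys/devices/system/edac/mc}"
    for mc in "$base"/mc[0-9]*; do
        [ -r "$mc/ce_count" ] || continue
        printf '%s corrected=%s uncorrected=%s\n' \
            "${mc##*/}" "$(cat "$mc/ce_count")" "$(cat "$mc/ue_count")"
    done
}

# Usage: edac_counts
# With rasdaemon installed and running:
#   ras-mc-ctl --summary
#   ras-mc-ctl --error-count
```

No output at all means no EDAC driver is loaded, which is worth knowing too: in that case corrected errors are going unreported.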

I know you said you use debian but we are on a new-ish server. What kernel version are we on currently? If it's older a yolo upgrade to the newest LTS might just werk (YeeHaw!)

I presume you've checked dmesg for anything suspicious. But giving us a copy of dmesg to look at might yield some clues.

Edit: I feel like this must've been checked but during the slowness there's no packet loss right?
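If nobody has measured it yet, a rough way to check is to leave a ping running during a slow period and pull the loss figure out of the summary line. ping_loss is a made-up helper and assumes the iputils-style summary format.

```shell
# Hypothetical helper: extract the packet loss percentage from ping's
# summary line, read from stdin.
ping_loss() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Usage (replace <host> with something outside the server's network):
#   ping -c 50 -q <host> | ping_loss
# NIC-level drops show up in the interface counters as well:
#   ip -s link show
```

Anything consistently above zero during the slowdowns would point at the network rather than the disks.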
 
It's so weird, I've had this burning all consuming desire to fix the site all week, I sat down and did 6 hours of work on it today, and almost as soon as I got it working great, Charlie got shot.
 
dog-accepting-fate.gif
how it feels to finally be able to use 3000+ page threads again without the site shitting itself and doing nothing

thanks null
 
any updates on this @Null, did you manage to fix it?
did anything here help?
 
In last week’s MATI, Josh said AI suggested the issue was lots of requests each allocating and releasing lots of memory, and that reducing this happened to solve it. The problem wasn't running out of memory; it's that you can’t do an unlimited number of these memory operations at once, and apparently we hit that limit because Josh was feeling RAM rich and upped the spending limits like someone with their first credit card.

At this rate he might just abandon us and post to his AI so he can get the answers he wants, and have AI Josh post in a random thread every other day.
 
It still feels kinda slow, like half the time the reaction image icons are not even loading for me, and neither are some images
I don't disagree, but it's been reliably slow. No more random 504 errors, very few "clicked a link and it took 15 seconds to load" issues. That's a major step in the right direction.
 