Debian Bitnami WordPress VM Disk IO Spike Most Days at Same Time and Crippling the Site
-
I know the Linux expertise in ML is strong and could use some guidance on a Debian WordPress VM issue. Here's some context on the VM that is having performance issues (almost daily now) during the same time window:
-
Bitnami WordPress instance with NGINX deployed into an e2-micro instance in Google Compute Engine. It uses MariaDB specifically as its database. It's Debian running kernel 5.10.205-2. The VM was deployed within the last month or two to move the site to a more current version of the OS. We were experiencing all of the problems described below on the former instance of the site before it was migrated. The only migration work that was done was a backup and restore of the images and the database to a new VM built from the latest Bitnami WordPress image, and the problem has persisted even after that.
-
Almost every morning (with few exceptions) around 5 AM CST there is a spike in disk IO, which causes CPU IO wait to spike and sends swap usage through the roof for at least 1 - 2 hours. Disk queue length also goes way up. Visiting the site during this window eventually results in an NGINX 504 Gateway Timeout error. The only fix we have found is to log in to the VM via SSH and either reboot it or run the Bitnami service restart command (sudo /opt/bitnami/ctlscript.sh restart). Either one returns the VM to a working, responsive state until the same time window hits the next day. There seems to be no real difference between the reboot and the service restart in terms of keeping the problem at bay, other than the reboot might prevent it the following day (but not always). All Bitnami WordPress services appear to be running during the problem window (nothing seems to be failing).
-
If you run iotop during the problem window, the mariadb process is the top culprit, which seems to indicate something is hitting the database really hard during this time. Outside of the problem window you don't see IO spikes; the mariadb process may jump up and use 30% IO for a second and then drop off the top of the list. During the problem window you can guarantee mariadb and several php-fpm processes will be at the top.
-
I've also noticed that during the problem window systemd-journal-flush.service shows as loaded but failed when you run systemctl. That seems to make sense during a period of high IO, heavy swapping, and high CPU IO wait, but I would love insight from others.
-
This site is a website for a podcast and hosts the main feed for the show. The WordPress database itself is tiny (like 10 MB) but has several hundred posts in it. The only other data really stored in WordPress would be small PNG files that get used for featured images. We have at least 3 GB free on the VM's disk at this point.
-
From a plugin standpoint in WordPress we have everything disabled except Akismet, Jetpack, Blubrry PowerPress, and UpdraftPlus. We thought it might be UpdraftPlus, but every backup completes in 10 - 11 seconds with no errors, and backups run at around 6:30 PM CST. We confirmed that by looking at the timestamps of files inside the VM's OS, and we even tried deactivating UpdraftPlus to see if the problem went away (it did not).
-
Outside of the problem window the VM has plenty of free memory when you run free -m and is using little or no swap. The site works great outside of the specific time window.
-
We looked at cron jobs, and nothing seems out of the ordinary. It feels like there is some kind of scheduled task for the database specifically that is causing the problem, but I do not know how to pinpoint it or what queries are being run against the database. I tried installing sar to get some details but apparently have too much rust on my Linux chops (which were minimal) since the days of building and administering Elastix PBXs. There do not seem to be any scheduled scripts, etc. from looking at wp-config.php either.
Has anyone here seen an issue like this? If so, does it make any sense why the database would be hit with so much IO during a specific time period like this? By the way, this site isn't getting a crazy amount of traffic either. I looked at Jetpack stats, and it gets anywhere from 5 to 20 or 30 visits in a day. Any guidance is greatly appreciated.
I'll also add that looking at dashboards in Google Compute Engine confirms the time window of issue. The database process seems to show up as top usage of CPU and memory during the problem window.
-
-
Obviously look for stored procedures in the database.
SHOW PROCEDURE STATUS;
You also might want to check for a php-cron somewhere in WordPress itself. It's been long enough since I've touched WordPress that I forget where to look for that.
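If it helps, here's roughly how I'd check both. The WordPress path and the --allow-root flag are guesses based on a stock Bitnami layout, so adjust to your install:
# list stored procedures, ignoring the built-in schemas
mysql -u root -p -e "SHOW PROCEDURE STATUS WHERE Db NOT IN ('mysql', 'sys');"
# list the WP-Cron events WordPress has scheduled, with their next run times
sudo wp cron event list --path=/opt/bitnami/wordpress --allow-root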
-
1. Identify the Cause of High Disk IO and CPU Wait
- MariaDB Activity: Since mariadb is showing high IO during the problematic window, it's crucial to identify the queries causing this load. You can enable the slow query log in MariaDB to capture queries that are taking an unusually long time to execute.
- Scheduled Tasks: Check for any scheduled tasks (cron jobs) on the server that run around 5 AM CST. These could be system tasks, WordPress cron jobs, or database maintenance tasks.
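For example, something along these lines (standard MariaDB variables; tune long_query_time to taste). Keep in mind that if the VM clock is set to UTC, 5 AM CST corresponds to roughly 10:00-11:00 UTC depending on daylight saving time:
# turn on the slow query log at runtime (resets on the next MariaDB restart)
mysql -u root -p -e "SET GLOBAL slow_query_log = 1; SET GLOBAL long_query_time = 1;"
mysql -u root -p -e "SHOW VARIABLES LIKE 'slow_query_log_file';"
# look for anything scheduled near the problem window
sudo ls /etc/cron.d /etc/cron.daily /etc/cron.weekly
sudo crontab -l
systemctl list-timers --all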
2. Systemd-journald Failure
- The failure of systemd-journal-flush.service suggests that the journaling system is overwhelmed, likely due to the high IO load. Investigate the journal logs (journalctl) for any errors or warnings that occur around this time.
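For example, to pull just the warnings and errors from around the problem window (adjust the times to the VM's clock):
sudo journalctl -p warning --since "05:00" --until "07:00"
sudo systemctl status systemd-journal-flush.service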
3. Review WordPress Plugins and Activities
- Plugin Behavior: Even though plugins like Updraft Plus are scheduled for different times, they might be triggering background tasks. Verify plugin behavior and logs.
- WordPress Cron: WordPress has its own cron system (wp-cron.php) that can sometimes trigger resource-intensive tasks. Review the WordPress cron events.
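If WP-Cron turns out to be involved, a common sketch is to stop WordPress from running it on page loads and drive it from the system crontab instead (the URL below is a placeholder for your own site):
# in wp-config.php
define('DISABLE_WP_CRON', true);
# in the system crontab, e.g. every 5 minutes
*/5 * * * * curl -s "https://example.com/wp-cron.php?doing_wp_cron" > /dev/null 2>&1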
4. Server and Database Optimization
- Database Optimization: Run a check and optimization pass on your MariaDB tables. Over time, tables can accumulate fragmentation and stale index statistics that slow queries down.
- Upgrade Resources: An e2-micro instance is quite limited in resources. If this issue is related to resource constraints, consider upgrading the VM instance type.
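A simple optimization pass with the bundled client tools might look like this (run it outside the problem window, expect brief table locks, and use the full /opt/bitnami/mariadb/bin path if the tools aren't on your PATH):
# check all tables, then optimize them
mysqlcheck -u root -p --all-databases --check
mysqlcheck -u root -p --all-databases --optimize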
5. Monitoring and Logs
- Enable Enhanced Monitoring: Tools like sar, iotop, or atop can provide in-depth system metrics. Make sure they are configured correctly.
- Access and Error Logs: Review NGINX, PHP-FPM, and MariaDB logs for any anomalies during the problematic time frame.
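On Debian, sar comes from the sysstat package and only starts recording history once collection is switched on, which may be the missing step; roughly:
sudo apt install sysstat
# Debian ships collection disabled; flip the switch and start the service
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo systemctl enable --now sysstat
# after the next episode, review the history
sar -d    # per-device IO
sar -B    # paging and swap activity
sar -q    # run queue and load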
6. External Factors
- Traffic Spikes: Although Jetpack stats show low traffic, consider checking the access logs for unexpected traffic spikes, which might be bots or crawlers.
- Network Analysis: Use tools to monitor network activity. Unexpected external connections might be contributing to the load.
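A quick way to spot a crawler burst is to count requests per hour in the access log; the path below is the usual Bitnami NGINX location and assumes the default log format, so adjust if yours differs:
# tally hits per hour to see whether traffic jumps around 5 AM
awk -F'[][]' '{print substr($2, 1, 14)}' /opt/bitnami/nginx/logs/access.log | sort | uniq -c | sort -rn | head
# snapshot of current TCP connections and owning processes during the window
sudo ss -tnp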
7. Testing and Isolation
- Isolate Components: Temporarily disable certain components or plugins during the problem window to see if the issue persists.
- Test in a Staging Environment: If possible, replicate the setup in a staging environment to test without affecting the live site.