CDNWiki - KnowledgeBase / LimitNCQ

Symptom: During periods of high load, the SATA port multiplier (PMP) freezes until all outstanding requests time out, causing extremely poor performance.
Hypothesis: The Sil3826 port multiplier appears to get confused when either the disk or the driver deviates from its interpretation of the command spec, and it copes poorly with 31 outstanding commands per device across four or five devices, so it raises an error condition. The Linux driver knows enough to restart all 31 outstanding NCQ tags on the faulted device without waiting for a timeout. However, it assumes the other devices behind the PMP are isolated from the fault and does not reset and restart them, so their commands must wait out a full timeout before each of those devices is reset.
Solution: Force queue_depth to 1 for all devices connected through a SATA PMP; this reduces the likelihood of the issue occurring, and greatly limits the duration of the freeze when it does occur.
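
To check whether the limit is in effect, the current value can be read back from sysfs. The following is a minimal sketch rather than part of the original setup: the helper name show_queue_depths and its optional argument (an alternative sysfs root, useful only for testing) are illustrative assumptions.

```shell
#!/bin/sh
# Print "<disk> <queue_depth>" for every block device that exposes the attribute.
# $1 is the sysfs mount point (defaults to /sys); it is a hypothetical parameter
# added so the function can be pointed at a test tree.
show_queue_depths() {
    sysfs="${1:-/sys}"
    for attr in "${sysfs}"/block/*/device/queue_depth; do
        # Skip the unexpanded pattern when no device has the attribute
        [ -r "${attr}" ] || continue
        disk="${attr#"${sysfs}/block/"}"
        disk="${disk%%/*}"
        printf '%s %s\n' "${disk}" "$(cat "${attr}")"
    done
}
```

Calling show_queue_depths with no argument lists the disks on the running system; disks behind the PMP should report 1 once the rule below has been applied to them.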

/etc/udev/rules.d/10-limit-ncq-on-pmp.rules

SUBSYSTEM=="block", DEVTYPE=="disk", ACTION=="add", RUN+="/etc/udev/limit_ncq_on_pmp.sh"

/etc/udev/limit_ncq_on_pmp.sh

#!/bin/sh

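# Only act on "add" events for whole disks. The udev rule already filters on
# this; the check guards against the script being run from another context.
[ "${SUBSYSTEM}" = "block" ] && [ "${DEVTYPE}" = "disk" ] && [ "${ACTION}" = "add" ] || exit 0

# Don't bother if queue_depth is missing or already at its minimum
ncq="$(realpath "/sys${DEVPATH}/device/queue_depth" 2>/dev/null)"
[ -f "${ncq}" ] && [ "$(cat "${ncq}")" -gt 1 ] || exit 0

# Detect if the device is on a branching link (port multiplier): walk five
# levels up from the disk and look for an entry named like linkN.M
link="$(realpath "$(realpath "/sys${DEVPATH}")/../../../../..")"
ls -1 "${link}" | grep -q '^link.*\..*$' || exit 0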

# Force queue_depth to 1
echo 1 > "${ncq}"
logger -t 'limit_ncq_on_pmp' "${DEVNAME} probably behind PMP; queue_depth forced to 1"
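
Before deploying the rule, it can be useful to see which disks the script would treat as being behind a PMP. The sketch below isolates the detection step from the script above as a function. The name behind_pmp is hypothetical, and it uses a shell glob in place of the script's ls | grep pipeline; both are meant to match the same linkN.M names. It assumes the same sysfs layout the script relies on, with the ATA port directory five levels above the disk.

```shell
#!/bin/sh
# behind_pmp DISKDIR: succeed if the directory five levels above DISKDIR
# contains an entry named like linkN.M (a port-multiplier fan-out link).
# DISKDIR is the disk's sysfs directory, e.g. /sys/block/sda or "/sys${DEVPATH}".
behind_pmp() {
    # Resolve symlinks first, then climb to what is assumed to be the ATA port
    port="$(realpath "$(realpath "$1")/../../../../..")" || return 1
    for entry in "${port}"/link*.*; do
        # The pattern stays unexpanded when nothing matches, so test existence
        [ -e "${entry}" ] && return 0
    done
    return 1
}
```

For example, a loop such as: for d in /sys/block/sd*; do behind_pmp "$d" && echo "$d"; done prints the disks that would have their queue_depth forced to 1, without changing anything.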