The single biggest clue to the source of a performance problem will be from something referred to as the wait stats.
What are waitstats?
In short wait stats are statistics gathered on what resources SQL Server is waiting on internally whilst executing queries.
Why are they useful?
They can often provide very quick and objective evidence of a performance bottleneck by showing cumulative wait times that SQL Server spent on getting access to the CPU, memory or disk access as well as numerous other internal resources.
This post is only going to concentrate on IO related waits, as it continues on from my SQL Server Disk IO posts.
Gathering this information on at least a daily basis can provide invaluable data on a system’s performance over time and can reveal sudden changes in behaviour caused by e.g. a recent change to an application. Speaking from personal experience, using historically captured wait stats and contrasting them with sudden changes to the topmost wait types is an invaluable way of proving that recent application changes were the trigger for a change in behaviour.
How do I gather these wait stats?
That’s fairly straightforward – it’s a T-SQL statement. First, the SQL Server ‘version independent’ method:
DBCC SQLPERF (WAITSTATS)
The sample output from my local SQL Server 2005 Express installation is below:
Wait Type Requests Wait Time Signal Wait Time
MISCELLANEOUS 0 0 0 PAGEIOLATCH_NL 0 0 0 PAGEIOLATCH_KP 0 0 0 PAGEIOLATCH_SH 178 327 0 PAGEIOLATCH_UP 16 436 0 PAGEIOLATCH_EX 14 0 0 PAGEIOLATCH_DT 0 0 0 IO_COMPLETION 173 1279 0 ASYNC_IO_COMPLETION 1 1294 0 CHKPT 1 202 0 BACKUPIO 0 0 0 DISKIO_SUSPEND 0 0 0 IMPPROV_IOWAIT 0 0 0 WRITELOG 29 639 15
From SQL Server 2005, the wait stats data has been split into two separate dynamic management views (DMVs) sys.dm_latch_wait_stats and sys.dm_os_wait_stats. The latch wait stats are counters for internal waits within the SQL Server database engine and the OS wait stats accumulate data on waits for external resources such as CPU, disk and memory.
As we’re concentrating on IO waits, its the sys.dm_os_wait_stats we’re interested in.
The following query will grab the IO wait types output:
select * from sys.dm_os_wait_stats where wait_type like ‘PAGEIO%’ or wait_type like ‘%IO_COMPLETION’ or wait_type like ‘DISKIO%’ or wait_type like ‘BACKUPIO%‘or wait_type like ‘WRITE%’ order by wait_time_ms
wait_type waiting_tasks_count wait_time_ms max_wait_time_ms signal_wait_time_ms
PAGEIOLATCH_SH 20892941 110133998 17895 332331 PAGEIOLATCH_EX 392948 7270457 2395 10353 IO_COMPLETION 719058 1061552 916 6929 PAGEIOLATCH_UP 4332 81307 3635 1243 BACKUPIO 3998 16393 240 18 ASYNC_IO_COMPLETION 133 14771 805 0 PAGEIOLATCH_DT 0 0 0 0 PAGEIOLATCH_NL 0 0 0 0 PAGEIOLATCH_KP 0 0 0 0 DISKIO_SUSPEND 0
Incidentally, these queries are only meant to show how to grab the data, if you hunt around on the web you’ll find some elegant queries for processing this data.
This KB from MS discusses these wait types (and more).
SQL Server 2008 R2 has even more granular wait-types to isolate IO related waits, but the above query has been limited to allow it to work from SQL Server 2005.
wait_time_ms is the total accumulated wait time (since that SQL Server instance was last restarted) spent on accessing that resource.
max_wait_time_ms is the peak wait time for one of these requests.
signal_wait_time_ms signal wait time is how long was spent waiting to get access to the runnable queue for that resource, and can indicate CPU pressure if this is high.
If these IO related wait types are high in the list of overall waits (see the Waits and Queues article from Microsoft lower down), then you have an IO issue.
This is either down to the performance of the IO subsytem, or the physical design of the database. The design issues could be due to the file/filegroup layout of the database concerned, but also related to the layout of tempdb (in a previous role it never ceased to amaze me how much thought went into the layout of an application database at various customer sites, whilst tempdb was left e.g. at it’s 8MB starting size with (just one) data file and log file sharing the same drives and/or sharing application database drives).
Take a close look at the particular wait type concerned. If the WRITELOG wait type is dominant, you know it’s the log drive if this is backed up by sys.dm_io_virtual_file_stats. If it’s BACKUPIO, we can infer that it’s the backup drive(s).
A word of caution, don’t talk the numbers at face value. For example, if the PAGEIOLATCH_SH wait type is high in the list, we know it’s the data drive(s), but that does not necessarily mean there is an issue with the disk subsystem, merely that a lot of IO requests are being issued and this could point to database design issues – if some of the tables had better indexes the query optimizer can come up with a more efficient query plan and retrieve the same data with far fewer IOs. Needless to say (almost), but corroborating evidence from the DMVs or perfmon should always be gathered whenever possible.
These are just a few examples, if you gather this data regularly a clear picture can be built up of what performance issues are affecting your systems, and by picking upon trends you can get advance warning of impending issues.
For further information on wait types and what all the output columns mean, there is an excellent Waits and Queues document on the Microsoft website which goes through them and is highly recommended reading.
Thomas Kejser also has an excellent bottleneck analysis script.
The most telling output can be from identifying IO stalls.
SQL Server 2000 had fn_virtual_filestats and this has now been incorporated into sys.dm_io_virtual_file_stats which is coming up in the next post, Tracking SQL Server IO Performance in this SQL Server and Disk IO series.