web 2.0

SQL Performance Tuning Basics

Once you start developing enterprise scale applications you will start to notice that things perform a little bit differently. Dealing with VLDBs (Very Large Databases) force a developer to learn about performance tuning. Therefore, I thought it would be good to outline some of the high level perfmon counters for those developers out there who are new to SQL Server.

Memory Related

Buffer Manager: Buffer Cache Hit Ratio - This is easily one of the most recognized counters in the world of SQL performance tuning. This counter indicates the percentage of data pages found in the buffer cache as opposed to disk. A value of 95% indicates that pages were found in memory 95% of the time. The other 5% required physical disk access. A consistent value below 90% indicates that more physical memory is needed on the server.

If you are using SQL 2005 and above then you can also capture Buffer Cache Hit Ratio using the DMVs. This could be useful if you want to store values in a table and then graph the results.

   1:  SELECT (a.cntr_value * 1.0 / b.cntr_value) * 100.0 [BufferCacheHitRatio]
   2:  FROM (SELECT *, 1 x
   3:       FROM sys.dm_os_performance_counters
   4:      WHERE counter_name = 'Buffer cache hit ratio'
   5:        AND object_name like '%:Buffer Manager%') a
   6:  JOIN
   7:    (SELECT *, 1 x
   8:    FROM sys.dm_os_performance_counters
   9:      WHERE counter_name = 'Buffer cache hit ratio base'
  10:        AND object_name like '%Buffer Manager%') b
  11:  ON a.x = b.x

Buffer Manager: Page Life Expectancy - This counter can be helpful in determining whether you have a memory problem, giving you a reasonably accurate view of whether your server has memory pressure. According to Microsoft, 300 seconds is the minimum target for page life expectancy. If the buffer pool flushes your pages in less than 300 seconds, you probably have a memory problem. Looking at this value is particularly handy when your page life expectancy is significantly higher or lower than 300 seconds. In order for this counter to be useful, you should watch this value over time. 64 bit servers with a large amounts of memory that serve high transaction databases can have extremely high values for page life expectancy. On servers with significant load a value of 300 would be considered low.

I/O Related

Logical Disk: Disk Reads/ sec & Logical Disk: Disk Writes/ sec - Microsoft recommends that the average latency on a disk drive should be 20 milliseconds or less (.020 seconds). However, I typically follow the recommendation of 20 ms or less for data and 10 ms or less for logs.

There is a test which you can run to see how fast you can read the database. This is done by using the Backup to NUL trick. Basically this script backs up the database to "never-land" but it reads each page of the database during the process.

   1:  backup database <db name> to disk = 'NUL:' with stats = 10

Sample output:

Processed 132776 pages for database 'DBA', file 'DBA' on file 1.
100 percent processed.
Processed 1 pages for database 'DBA', file 'DBA_log' on file 1.
BACKUP DATABASE successfully processed 132777 pages 
in 7.968 seconds (136.509 MB/sec).

Two of the most common causes of high latency are partition alignment or improper disk configuration. For more information about Disk Partition Alignment please read this whitepaper. Disk configuration issues tend to be a little more tricky but I can offer some very high level suggestions:

  1. Plan for I/O throughput and not capacity. Make sure you have enough spindles.
  2. Separate the data, logs and tempdb. Create as many tempdb files as you have processors.
  3. Try to separate sequential I/O from random I/O onto different spindles.
  4. Pick good initial sizes for your data and log files. Auto-growth is an expensive operation and it is also prone to fragmentation.

CPU Related

SQL Statistics: Batch Requests/sec - This counter measures the number of batch requests that SQL Server receives per second, and generally follows in step to how busy your server's CPUs are. Generally speaking, over 1000 batch requests per second indicates a very busy SQL Server, and could mean that if you are not already experiencing a CPU bottleneck, that you may very well soon. Of course, this is a relative number, and the bigger your hardware, the more batch requests per second SQL Server can handle. From a network bottleneck approach, a typical 100Mbs network card is only able to handle about 3000 batch requests per second. If you have a server that is this busy, you may need to have two or more network cards, or go to a 1Gbs network card.

SQL Statistics: SQL Compilations/sec - The number of times per second that SQL Server compilations have occurred. This value needs to be as low as possible. If you see a high value such as over 100, then it’s an indication that there are lots of adhoc queries running. One way to identify adhoc or dynamic SQL is to look through your stored procedures for sp_executeSQL or exec() calls. You can do this with the following query:

   1:  select * from syscomments where text like '%sp_executesql%' or
   2:  text like '% exec%'

I would also recommend setting up unattended traces that capture queries over a certain read limit. The use a tool like Clear Trace to aggregate the data by average reads, average writes or CPU. This will allow you to focus on the worst performing queries first, which are typically the ad-hoc queries.

SQL Statistics: SQL Recompilations/sec - This needs to be as low possible. A recompile can cause deadlocks and compile locks that are not compatible with any locking type. These type of issue will generally drive up CPU utilization.

Finally, I want to end this brief introduction with a warning. All counters should be looked at over time. You can not simply go onto a server once over a five day period and notice that the Buffer Cache Hit Ratio is low and tell your system admin to purchase more RAM. Chances are that someone may running a bad query or doing something to the system that flushed the cache. You really need to observe your counters over time. People who give you specific values are just trying to give you a starting point. Every system is unique and you really have to get familiar with each environment to know what "normal" really is.

Related Resources:

Tags: ,

DBA | SQL

blog comments powered by Disqus