Spring 2006
Stony Brook
Database Transaction Processing Systems
Annie Liu
Assignment 6
Handout A6
Apr. 9, 2006
Due Apr. 27

Indexing and Performance

Sybase Adaptive Server's query optimizer attempts to produce an efficient query plan for each query it executes. An efficient plan minimizes the number of I/O operations required, since I/O represents the major execution cost. The optimizer generally produces several "good" plans, evaluates their cost, and chooses the best one.

For each query plan, the optimizer must choose an access path to get the data from each table. Depending on the indices you have created, the optimizer's choices include:

  1. Table scan: reading all of the data pages of a table.
  2. Index access: using an appropriate index to find and access only the data pages needed.
  3. Index covering: accessing only index pages, since the query can be answered using only the search key attributes in an index, and no access to data pages is needed.

A table scan requires that the entire table be retrieved and hence may cause a large number of disk reads and result in poor response time. Sybase attempts to reduce this time by transferring multiple pages in a single I/O operation and by prefetching pages that have not yet been requested. With such techniques, table scans can often be done efficiently, but scans of large tables are still costly. The situation can be improved if you have appropriate indices on the tables, so that the query plan can use index access or index covering and require fewer I/O operations.

It is important to keep index entries small. The page size in Sybase is 4K, and you can create an index with entries of up to 1250 bytes; with such large entries, however, few index entries fit in a page, the index tree has more levels, and each level has more pages. As a result, a search must descend more levels, and the index scan for a range search must read more pages. It is also important to choose carefully between a clustered and an unclustered index. You can create only one clustered index per table, because the data for a clustered index is ordered by the search key.

The purpose of this experiment is to demonstrate how indices affect the performance of queries.

Cache issues

It is important to consider the effect of the data cache when measuring performance. When running a sequence of experiments, the data cache at the start of one experiment contains the pages left by the previous experiment. If the next experiment uses the same pages, it finds them in the cache and no disk I/O is needed; thus no meaningful conclusion can be drawn by comparing the execution times of the two experiments. To avoid this, we must make sure that when an experiment is initiated, the pages in the cache are useless. Hence, we must flush the cache between experiments.

To do this, we need to know how Sybase uses the cache. When Sybase starts, it allocates memory for executable code and data structures. The remaining memory is used for a data cache, called the default cache. The command sp_cacheconfig returns information about the cache. Executing the command on the UnderMultiUser server yields:

Cache Name             Status    Type     Config Value Run Value   
---------------------- --------- -------- ------------ ------------
Default data cache     Active    Default  0.00 Mb      32.00 Mb

In this experiment, as in the previous experiment, statistics should be obtained by running on the SingleUser server. The default cache on that machine is 8 Mb. To flush the pages out, we create a small named cache (1 Mb is the smallest allowed), called Test, and we bind the databases used in the experiment to Test. As a result, all pages of tables used in this experiment are buffered in Test. Executing sp_cacheconfig on the SingleUser server yields:

Cache Name             Status    Type     Config Value Run Value
---------------------- --------- -------- ------------ ------------
Default data cache     Active    Default  0.00 Mb      8.00 Mb
Test                   Active    Mixed    1.00 Mb      1.00 Mb
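
The named cache and binding described above were set up for you with Sybase system procedures; for reference, a sketch of the commands (the database name here is illustrative) is:

```sql
-- Sketch: create a 1 Mb named cache called Test, then bind a database to it
-- so that all of its pages are buffered in Test. This is administrator setup;
-- you do not need to run it yourself.
sp_cacheconfig 'Test', '1M'
go
sp_bindcache 'Test', 'tppt'
go
```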

Generally, a DBMS uses some form of LRU (least recently used) policy in deciding which pages to overwrite in a cache. It would then appear that, by scanning a large dummy table between experiments, pages remaining in the cache from one experiment would be overwritten by the more recently referenced pages of the dummy table before the next experiment starts. Unfortunately, Sybase recognizes that when a query plan calls for a table scan, as opposed to using an index, a page that has been read will not be needed in the immediate future. Hence, it uses an MRU (most recently used, or fetch-and-discard) policy for scans, where pages that have just been scanned are candidates to be overwritten by pages that are subsequently scanned. Therefore, you need to force Sybase to use an index in the query plan for accessing the dummy table in order to get it to use an LRU algorithm for managing the table's pages in the cache.

The table tppt..initDummy (on UnderMultiUser server) or tppt.tppt.initDummy (on SingleUser server) can be used to flush the cache Test; it is in the TPPT database, so you do not need to create it.

CREATE TABLE initDummy (
	value		integer,
	attrib		integer,
	Filler1		char(255),
	Filler2		char(255),
	Filler3		char(255),
	Filler4		char(255),
	Filler5		char(255),
	Filler6		char(255),
	Filler7		char(255),
	Filler8		char(255),
	Filler9		char(255),
	Filler10	char(255)
)

CREATE NONCLUSTERED INDEX dummyindex ON initDummy (value)

Since a row must be entirely contained in a single page, each page of initDummy contains only one row. The query plan for the following statement uses dummyindex, not a table scan, and hence an LRU algorithm is used in Test as pages of the table are retrieved.

SELECT *
FROM tppt..initDummy
WHERE value > 100 AND attrib = 0

(Note that tppt..initDummy is for use on the UnderMultiUser server; replace it with tppt.tppt.initDummy on the SingleUser server.) The table has been initialized so that no row satisfies the condition attrib = 0; the pages are thus brought into the cache, but the result set produced by the SELECT statement is empty.

You can create a stored procedure, Flush, to be run as a transaction, that flushes the cache using the technique discussed above.
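A minimal sketch of Flush follows; the SELECT statement is the one described above, and wrapping it in a procedure is the only addition (the select list is an assumption):

```sql
-- Sketch of Flush: scanning initDummy through dummyindex pulls its pages
-- into the cache Test under an LRU policy, evicting pages left by the
-- previous experiment. The result set is empty since no row has attrib = 0.
-- (Use tppt.tppt.initDummy instead on the SingleUser server.)
CREATE PROCEDURE Flush
AS
    SELECT *
    FROM tppt..initDummy
    WHERE value > 100 AND attrib = 0
```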

Query plan

The query plan used by Sybase to execute a query can be obtained by setting the showplan option on with the command:

set showplan on

Since you don't want Sybase to output the same plan over and over again as you re-execute the same query during an experiment, to get the query plan for a particular SQL statement you can use JISQL to (1) set showplan on, (2) execute the statement (the query plan will be output), and then (3) set showplan off. Since you don't have an account on the SingleUser server, you can do this only on the UnderMultiUser server.
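
For example, the three JISQL steps might look like this (the query shown is illustrative):

```sql
-- Step 1: turn showplan on; step 2: run the statement (its plan is printed
-- along with the results); step 3: turn showplan off again.
set showplan on
go
SELECT BondOrStock FROM Invest WHERE Id = '123456789'
go
set showplan off
go
```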


The experiment has two parts and uses two tables, Invest and Deal:

CREATE TABLE Invest (
	Id		char(9)    NOT NULL,
	Name		char(50)   NOT NULL,
	BondOrStock	char(1)    NOT NULL,
	Amount		float,
	Note		char(40)   NOT NULL
)

CREATE TABLE Deal (
	InvestId	char(9)    NOT NULL,
	DealId		char(6)    NOT NULL,
	Date		char(6)    NOT NULL,
	Amount		float,
	BuyOrSell	char(1)    NOT NULL
)

Part 1 uses only the table Invest, performs two queries, and has a total of 10 experiments.

Implement a stored procedure, InvestSearch, to be run as a transaction, that searches Invest using a SELECT statement. Create a version of the procedure for each of the following two statements, for a point query and a range query, respectively, where '123456789' is the Id of an Invest row that appears after 50% of the rows, and '500' and '3000' delimit a range containing about half of the rows in the Invest table.

SELECT BondOrStock FROM Invest WHERE Id = '123456789'
SELECT Id FROM Invest WHERE Amount > '500' AND Amount < '3000'

This procedure should include the with recompile option, which in Sybase is specified just after the procedure name. This guarantees that the query plan will be rebuilt each time the procedure is called; otherwise, the query plan created when the stored procedure was first introduced into the schema would be used despite new indices that you subsequently create for different experiments. Also, the stored procedure Flush must be called before each execution of InvestSearch to empty the cache.
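
A sketch of the point-query version of InvestSearch (the procedure name is from the handout; the exact body you use may differ):

```sql
-- WITH RECOMPILE appears just after the procedure name, as required,
-- so a fresh query plan is built on every call and new indices are
-- picked up between experiments.
CREATE PROCEDURE InvestSearch
WITH RECOMPILE
AS
    SELECT BondOrStock FROM Invest WHERE Id = '123456789'
```

Remember to call Flush before each timed execution, so that the cache holds no useful pages when InvestSearch starts.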

Initialize the Invest table with 6000 rows; the Amount values should be only 500, 1000, 1500, ..., 6000, with the same number of Ids having each amount. Since each row occupies about 100 bytes and the default page size in Sybase is 4K bytes, approximately 40 rows fit in one page. Since no key is specified, the table is created as a heap storage structure.
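
One possible way to populate Invest is a simple Transact-SQL loop (a sketch; the Id, Name, BondOrStock, and Note values are placeholders) that cycles Amount through the twelve values 500, 1000, ..., 6000, giving 500 rows per amount:

```sql
DECLARE @i int
SELECT @i = 1
WHILE @i <= 6000
BEGIN
    INSERT INTO Invest (Id, Name, BondOrStock, Amount, Note)
    VALUES (right('000000000' + convert(varchar(9), @i), 9),  -- 9-char zero-padded Id
            'name' + convert(varchar(9), @i),                 -- placeholder Name
            'S',                                              -- placeholder flag
            ((@i - 1) % 12 + 1) * 500,                        -- cycles 500, 1000, ..., 6000
            'filler')                                         -- placeholder Note
    SELECT @i = @i + 1
END
```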

Run the experiment for the point query using the Invest table with the following options:

  1. use no index
  2. CREATE CLUSTERED INDEX point1 ON Invest(Id)
  3. CREATE NONCLUSTERED INDEX point2 ON Invest(Id)
  4. CREATE NONCLUSTERED INDEX point3 ON Invest(Id, BondOrStock)
  5. CREATE NONCLUSTERED INDEX point4 ON Invest(BondOrStock, Id)

Run the experiment for the range query using the Invest table with the following options:

  1. use no index
  2. CREATE CLUSTERED INDEX range1 ON Invest(Amount)
  3. CREATE NONCLUSTERED INDEX range2 ON Invest(Amount)
  4. CREATE NONCLUSTERED INDEX range3 ON Invest(Amount, Id)
  5. CREATE NONCLUSTERED INDEX range4 ON Invest(Id, Amount)

Note that it takes much longer to rebuild the table with a clustered index than with an unclustered index, because each row must be inserted into the table in sorted order of the search key.
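
When switching from one option to the next, drop the previous index before creating the new one; for example (a sketch, using Sybase's DROP INDEX table.index form):

```sql
-- Remove the index from the previous experiment, then create the next one.
DROP INDEX Invest.point1
CREATE NONCLUSTERED INDEX point2 ON Invest(Id)
```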

Part 2 uses both Invest and Deal tables, performs a join query, and has a total of 6 experiments.

Implement a stored procedure, DealIdDate, to be run as a transaction, that outputs deal Ids and dates using the following statement, where 'abcdefg' is the Name of one Invest row.

SELECT D.DealId, D.Date
FROM Invest I, Deal D
WHERE I.Id = D.InvestId AND I.Name = 'abcdefg'
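A sketch of DealIdDate, assuming the same with recompile convention as InvestSearch so that new indices are picked up between experiments:

```sql
CREATE PROCEDURE DealIdDate
WITH RECOMPILE
AS
    SELECT D.DealId, D.Date
    FROM Invest I, Deal D
    WHERE I.Id = D.InvestId AND I.Name = 'abcdefg'
```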

(a) Initialize the Invest table with 5000 rows and the Deal table with 5000 rows (1 Deal row for each Invest row). Run the experiment with the following options:

  1. use no index

(b) Initialize the Invest table with 3000 rows and the Deal table with 3000 rows (30 Invest rows each having 100 Deal rows, and the other Invest rows having none). Run the experiment with the following options:

  1. use no index

What to do

  1. Create the stored procedures Flush, InvestSearch, and DealIdDate.
  2. Prepare to run TPPT; you may start with the example files from Assignment 3, but the following is what you need to set up:
  3. Run each of the following experiments for 20 seconds using a single terminal. In each case, run the experiment 3 times to check for consistency.
  4. The query optimizer understands that the cost of using a nonclustered index in the query plan when the result set is likely to be large may exceed the cost of a table scan. Hence the query plan for the range query is likely to ignore the index range2. To investigate the circumstances under which the query optimizer uses, or does not use, a nonclustered index in a range scan, design a query involving a range of Amount values in the WHERE clause. Vary the range and determine the point at which the optimizer switches from using the index range2 to not using it. Note that you should have only the index range2 in the schema for this step; otherwise the query plan may use another index instead of a table scan.
  5. Debug and test your programs on the UnderMultiUser server, and get the final statistics on the SingleUser server. Since you will not be taking any statistics off UnderMultiUser, you only need to use the smaller version of the Invest table when testing DealIdDate. Flushing only works on SingleUser, since in UnderMultiUser we do not bind the database to the smaller cache and thus statistics there are unreliable.
  6. Prepare a report analyzing the results.
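
For step 4, one approach (a sketch; the cutoff values are illustrative) is to run a family of range queries with progressively wider ranges and check the showplan output for each:

```sql
-- Narrow range: the optimizer is likely to use the nonclustered index range2.
SELECT Id FROM Invest WHERE Amount > '500' AND Amount < '1500'
-- Wide range: beyond some width the optimizer switches to a table scan.
SELECT Id FROM Invest WHERE Amount > '500' AND Amount < '5500'
```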


Hand in your description and code electronically, using Blackboard, by midnight on the due date. Your handin should include

  1. all files required to run your experiments,
  2. timing results from TPPT running on SingleUser, and
  3. your report.


This homework will be graded based on 100 points, allocated in proportion to the required amount of work for each part. You may do this assignment in a team of two people; the two people will receive the same points.

Bonus (80% extra credit)

You will discover in the above experiments that the timing results are very hard to explain. This is caused by unpredictable extra time spent in the network when the measurements are taken in Java by TPPT. To solve this problem, you may use a pair of commands in Sybase SQL to take timings directly on the server. The bonus task is to redo all the experiments in this fashion and report the results.
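
The handout does not name the commands; one common approach in Sybase SQL (a sketch, offered as an assumption) is to capture getdate() before and after the transaction and report the difference, all on the server side:

```sql
-- Measure elapsed time entirely on the server, avoiding network noise.
DECLARE @start datetime
SELECT @start = getdate()
EXEC Flush
EXEC InvestSearch
SELECT datediff(ms, @start, getdate()) AS elapsed_ms
```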