CSE/ISE315 Spring 2006 Stony Brook
Database Transaction Processing Systems
Annie Liu                                Assignment 6
Handout A6 Apr. 9, 2006                  Due Apr. 27
Indexing and Performance
Sybase Adaptive Server's query optimizer attempts to produce an efficient query plan for each query it executes. An efficient plan minimizes the number of I/O operations required, since I/O represents the major execution cost. The optimizer generally produces several "good" plans, evaluates their cost, and chooses the best one.
For each query plan, the optimizer must choose an access path for getting the data from each table. Depending on the indices you have created, the optimizer's choices include a table scan, index access, and index covering.
A table scan requires that the entire table be retrieved and hence may cause a large number of disk reads and result in poor response time. Sybase attempts to reduce this time by transferring multiple pages in a single I/O operation and by prefetching pages that have not yet been requested. With such techniques, table scans can often be done efficiently, but scans of large tables are still costly. The situation can be improved if you have appropriate indices on the tables, so that the query plan can use index access or index covering and require fewer I/O operations.
It is important to keep index entries small. The page size in Sybase is 4K, and you can create an index with entries of up to 1250 bytes; but with large entries, few index entries fit in a page, so the index tree has more levels and each level has more pages. As a result, a search must descend more levels, and the index scan for a range search must read more pages. It is also important to choose carefully between a clustered and an unclustered index. You can create only one clustered index per table, because the data for a clustered index is ordered by the search key.
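To see why entry size matters, here is a rough, hypothetical calculation of tree height. It assumes 4K pages and takes fan-out to be page size divided by entry size, ignoring page overhead; the row count of one million is made up for illustration:

```java
// Rough B-tree height estimate: assumed 4K pages, hypothetical entry sizes.
public class IndexHeight {
    // Number of levels needed to cover `rows` rows, given how many
    // index entries fit on one page (the fan-out).
    static int levels(long rows, int pageSize, int entrySize) {
        long fanout = pageSize / entrySize;   // entries per index page
        int levels = 1;
        long reach = fanout;                  // rows reachable with this many levels
        while (reach < rows) {
            reach *= fanout;
            levels++;
        }
        return levels;
    }

    public static void main(String[] args) {
        // Small 16-byte entries: fan-out 256, a shallow 3-level tree.
        System.out.println(levels(1_000_000, 4096, 16));
        // Near-maximum 1250-byte entries: fan-out 3, a 13-level tree.
        System.out.println(levels(1_000_000, 4096, 1250));
    }
}
```

The calculation overstates real heights slightly (it ignores page headers and leaf-level pointers), but the trend it shows is the point: large entries shrink the fan-out and multiply the levels a search must descend.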
The purpose of this experiment is to demonstrate how indices affect the performance of queries.
Cache issues
It is important to consider the effect of the data cache when
measuring performance. When running a sequence of experiments, the
data cache at the start of one experiment will contain the pages left
by the previous experiment. If the next experiment uses the same
pages, it will find them in the cache and no disk I/O is needed.
Thus, no meaningful conclusion can be made by comparing the execution
time of the two experiments. To avoid this, it is necessary to make
sure that when an experiment is initiated, the pages in the cache are
useless. Hence, we must flush the cache between experiments. So we
need to know how Sybase uses the cache. When Sybase starts, it
allocates memory for executable code and data structures. The
remaining memory is used for a data cache, called the default cache.
The command sp_cacheconfig
returns information about the
cache. Executing the command on the UnderMultiUser server yields:
Cache Name              Status   Type     Config Value  Run Value
----------------------  -------  -------  ------------  ------------
Default data cache      Active   Default  0.00 Mb       32.00 Mb

In this experiment, as in the previous experiment, statistics should be obtained by running on the SingleUser server. The default cache on that machine is 8 Mb. In order to flush the pages out, we create a small (1 Mb is the smallest allowed) named cache, called Test, and we bind the databases used in the experiment to Test. As a result, all pages in tables used in this experiment are buffered in Test. Executing
sp_cacheconfig
on the SingleUser server yields:
Cache Name              Status   Type     Config Value  Run Value
----------------------  -------  -------  ------------  ------------
Default data cache      Active   Default  0.00 Mb       8.00 Mb
Test                    Active   Mixed    1.00 Mb       1.00 Mb
Generally, a DBMS uses some form of LRU (least recently used) policy in deciding which pages to overwrite in a cache. It would then appear that, by scanning a large dummy table between experiments, pages remaining in the cache from one experiment would be overwritten by the more recently referenced pages of the dummy table before the next experiment starts. Unfortunately, Sybase recognizes that when a query plan calls for a table scan, as opposed to using an index, a page will not be needed again in the immediate future once it has been read. Hence, it uses an MRU (most recently used, fetch and discard) policy for scans, where pages that have been recently scanned are candidates to be overwritten by other pages that are subsequently scanned. Therefore, you need to force Sybase to use an index in the query plan for accessing the dummy table in order to get it to use an LRU algorithm for managing the table's pages in the cache.
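The contrast between the two policies can be illustrated with a toy simulation. The page names, the 4-frame cache, and the simplified eviction rules below are all made up for illustration; Sybase's real buffer manager is far more elaborate:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy buffer-cache model contrasting LRU with MRU ("fetch and discard")
// replacement during a scan of a dummy table.
public class Replacement {
    // Scan the `dummy` pages through a cache primed with `leftover` pages.
    // Returns the cache contents after the scan. Head = next eviction victim
    // under LRU; tail = most recently fetched page.
    static Deque<String> scan(String[] leftover, String[] dummy, boolean lru) {
        Deque<String> cache = new ArrayDeque<>();
        for (String p : leftover) cache.addLast(p);
        int capacity = leftover.length;
        for (String p : dummy) {
            if (cache.size() >= capacity) {
                if (lru) cache.removeFirst();  // evict least recently used
                else     cache.removeLast();   // evict most recently used
            }
            cache.addLast(p);
        }
        return cache;
    }

    public static void main(String[] args) {
        String[] leftover = {"A", "B", "C", "D"};   // pages from the last experiment
        String[] dummy    = {"d1", "d2", "d3", "d4"};
        // LRU: the dummy scan pushes every leftover page out of the cache.
        System.out.println(scan(leftover, dummy, true));
        // MRU: each dummy page replaces the previous dummy page, so the
        // leftover pages A, B, C survive -- the scan fails to flush them.
        System.out.println(scan(leftover, dummy, false));
    }
}
```

Under LRU the cache ends up holding only dummy pages, which is exactly the flushing effect the experiment needs; under MRU the old pages survive, which is why the dummy table must be accessed through an index.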
The table tppt..initDummy
(on UnderMultiUser server)
or tppt.tppt.initDummy
(on SingleUser server) can be used
to flush the cache Test; it is in the TPPT database, so you do not
need to create it.
CREATE TABLE initDummy (
    value integer,
    attrib integer,
    Filler1 char(255),
    Filler2 char(255),
    Filler3 char(255),
    Filler4 char(255),
    Filler5 char(255),
    Filler6 char(255),
    Filler7 char(255),
    Filler8 char(255),
    Filler9 char(255),
    Filler10 char(255)
)

CREATE NONCLUSTERED INDEX dummyindex ON initDummy (value)

Since a row must be entirely contained in a single page, each page of
initDummy
contains only one row. The query plan for the
following statement uses dummyindex
, not a table scan,
and hence an LRU algorithm is used in Test as pages of the table are
retrieved.
SELECT * FROM tppt..initDummy WHERE value > 100 AND attrib = 0

(Note that
tppt..initDummy
is for use on UnderMultiUser
server; you need to replace it with tppt.tppt.initDummy
for use on SingleUser server.) The table has been initialized so that
no row satisfies the condition attrib = 0
; as a result, the pages will
be brought into the cache, but the result set produced by the SELECT
statement is empty.
You can create a stored procedure, Flush, to be run as a transaction, that flushes the cache using the technique discussed above.
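Flush could be wrapped in Java along these lines. This is a minimal sketch assuming the JDBC signature listed later in this handout; the class name FlushSketch and the row-counting body are hypothetical, and only the SELECT statement itself comes from the text above:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical JDBC wrapper around the cache-flushing SELECT on initDummy.
public class FlushSketch {
    static final String FLUSH_SQL =
        "SELECT * FROM tppt..initDummy WHERE value > 100 AND attrib = 0";

    // Returns the number of rows fetched (expected to be 0, since no row
    // satisfies attrib = 0; the point is the page reads, not the rows).
    static int flush(Connection con) throws SQLException {
        int rows = 0;
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(FLUSH_SQL)) {
            while (rs.next()) rows++;   // drain the (empty) result set
        }
        return rows;
    }

    public static void main(String[] args) {
        // No Sybase server is assumed here; just show the statement Flush runs.
        System.out.println(FLUSH_SQL);
    }
}
```

Remember to replace tppt..initDummy with tppt.tppt.initDummy when running on the SingleUser server.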
Query plan
The query plan used by Sybase to execute a query can be obtained by setting the showplan option on with the command:
SET SHOWPLAN ON

Since you don't want Sybase to output the same plan over and over again as you re-execute the same query during an experiment, use JISQL to get the query plan for a particular SQL statement: (1) set showplan on, then (2) execute the statement (the query plan will be output), and then (3) set showplan off. Since you don't have an account on the SingleUser server, you can only do this on the UnderMultiUser server.
Experiments
The experiment has two parts and uses two tables.
CREATE TABLE Invest (
    Id char(9) NOT NULL,
    Name char(50) NOT NULL,
    BondOrStock char(1) NOT NULL,
    Amount float,
    Note char(40) NOT NULL
)

CREATE TABLE Deal (
    InvestId char(9) NOT NULL,
    DealId char(6) NOT NULL,
    Date char(6) NOT NULL,
    Amount float,
    BuyOrSell char(1) NOT NULL
)

Part 1 uses only the table Invest, performs two queries, and has a total of 10 experiments.
Implement a stored procedure, InvestSearch, to be run as a
transaction, that searches Invest using a SELECT statement. Create a
version of the procedure for each of the following two statements, for
a point query and a range query, respectively, where
'123456789'
is the Id of one Invest row that lies after 50%
of the rows, and '500'
and '3000'
form a range
of about half of the rows in the Invest table.
SELECT BondOrStock FROM Invest WHERE Id = '123456789'

SELECT Id FROM Invest WHERE Amount > '500' AND Amount < '3000'

This procedure should include the
with recompile
option,
which in Sybase should be specified just after the procedure name.
This guarantees that its query plan will be rebuilt each time it is
called; otherwise, the query plan created when the stored procedure is
first introduced into the schema will be used despite new indexes that
you subsequently create for different experiments. Also, the stored
procedure Flush must be called before each execution of InvestSearch
to empty the cache.
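As a sketch, the DDL for one version of InvestSearch might look as follows, shown as the string a JDBC setup step could execute. The WITH RECOMPILE placement (right after the procedure name) follows the text above, while the class name and the procedure body are assumptions:

```java
// Hypothetical DDL string for the point-query version of InvestSearch.
public class InvestSearchDdl {
    static final String DDL =
        "CREATE PROCEDURE InvestSearch WITH RECOMPILE AS\n" +
        "    SELECT BondOrStock FROM Invest WHERE Id = '123456789'";

    public static void main(String[] args) {
        System.out.println(DDL);
    }
}
```

A second version would use the range query in place of the SELECT shown here; with recompile ensures each call re-optimizes against whatever indexes currently exist.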
Initialize the Invest table with 6000 rows; the Amount values in the rows should be only 500, 1000, 1500, ..., 6000, with the same number of Ids having each amount. Since each row occupies about 100 bytes and the default page size in Sybase is 4K bytes, approximately 40 rows fit in one page. Since no key is specified, the table will be created as a heap storage structure.
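The sizing claim can be checked with the handout's own numbers (6000 rows, roughly 100 bytes per row, 4K pages); this back-of-the-envelope calculation also gives the number of pages a full table scan must read:

```java
// Rough sizing of the Invest heap from the figures in the text.
public class HeapSizing {
    public static void main(String[] args) {
        int rows = 6000, rowBytes = 100, pageBytes = 4096;
        int rowsPerPage = pageBytes / rowBytes;           // about 40
        int pages = (rows + rowsPerPage - 1) / rowsPerPage; // ceiling division
        System.out.println(rowsPerPage);  // rows that fit on one page
        System.out.println(pages);        // pages a table scan must read
    }
}
```

The estimate ignores page and row overhead, so the real page count is somewhat higher, but it gives the order of magnitude against which index access can be compared.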
Run the experiment for the point query using the Invest table with the following options:
CREATE CLUSTERED INDEX point1 ON Invest(Id)
CREATE NONCLUSTERED INDEX point2 ON Invest(Id)
CREATE NONCLUSTERED INDEX point3 ON Invest(Id, BondOrStock)
CREATE NONCLUSTERED INDEX point4 ON Invest(BondOrStock, Id)
Run the experiment for the range query using the Invest table with the following options:
CREATE CLUSTERED INDEX range1 ON Invest(Amount)
CREATE NONCLUSTERED INDEX range2 ON Invest(Amount)
CREATE NONCLUSTERED INDEX range3 ON Invest(Amount, Id)
CREATE NONCLUSTERED INDEX range4 ON Invest(Id, Amount)
Part 2 uses both Invest and Deal tables, performs a join query, and has a total of 6 experiments.
Implement a stored procedure, DealIdDate, to be run as a
transaction, that outputs deal Ids and dates using the following
statement, where 'abcdefg'
is the Name of one Invest
row.
SELECT D.DealId, D.Date FROM Invest I, Deal D WHERE I.Id = D.InvestId AND I.Name = 'abcdefg'
(a) Initialize the Invest table with 5000 rows and the Deal table with 5000 rows (1 Deal row for each Invest row). Run the experiment with the following options:

no index

InvestId:
CREATE NONCLUSTERED INDEX InvestId ON Deal(InvestId)

InvestId,Name:
CREATE NONCLUSTERED INDEX InvestId ON Deal(InvestId)
CREATE NONCLUSTERED INDEX Name ON Invest(Name)
(b) Initialize the Invest table with 3000 rows and the Deal table with 3000 rows (30 Invest rows each having 100 Deal rows, and the other Invest rows having none). Run the experiment with the following options:

no index

InvestId:
CREATE NONCLUSTERED INDEX InvestId ON Deal(InvestId)

DealId,InvestId:
CREATE CLUSTERED INDEX DealId ON Deal(DealId)
CREATE NONCLUSTERED INDEX InvestId ON Deal(InvestId)
What to do
The routines to implement, with the Java signatures used by TPPT:

int InvestSearch(Connection con);
int Flush(Connection con);
int DealIdDate(Connection con);

For each experiment in Part 1, a run executes

Flush
InvestSearch

and for each experiment in Part 2, a run executes

Flush
DealIdDate
An additional step concerns the index range2. To
investigate the circumstances under which the query optimizer uses, or
does not use, a nonclustered index in a range scan, design a query
involving a range of InvestId's in the WHERE clause. Vary the range
and determine the point at which the optimizer switches from using the
index range2 to not using that index. Note that you should
have only the index range2 in the schema for this step;
otherwise the query plan may choose to use another index instead of a
table scan.
Part 1 point query
Index type   Average response time
----------   ---------------------
no index     number
point1       number
point2       number
point3       number
point4       number

Part 1 range query
Index type   Average response time
----------   ---------------------
no index     number
range1       number
range2       number
range3       number
range4       number

Part 2 db (a)
Index type      Average response time
----------      ---------------------
no index        number
InvestId        number
InvestId,Name   number

Part 2 db (b)
Index type        Average response time
----------        ---------------------
no index          number
InvestId          number
DealId,InvestId   number
Handins
Hand in your description and code electronically, using Blackboard, by midnight on the due date. Your handin should include
Grading
This homework will be graded based on 100 points, allocated in proportion to the required amount of work for each part. You may do this assignment in a team of two people; the two people will receive the same points.
Bonus (80% extra credit)
You will discover in the above experiments that the timing results are very hard to explain. This is caused by the extra unpredictable time spent in the network when the measurements are taken using Java in TPPT. To solve this problem, you may use a pair of commands in Sybase SQL for taking times directly on the server. The bonus task is to do all the experiments also in this fashion and report the results.