CSE/ISE315 Spring 2006 Stony Brook
Database Transaction Processing Systems
Annie Liu                                Assignment 6
Handout A6 Apr. 9, 2006                  Due Apr. 27
Indexing and Performance
Sybase Adaptive Server's query optimizer attempts to produce an efficient query plan for each query it executes. An efficient plan minimizes the number of I/O operations required, since I/O represents the major execution cost. The optimizer generally produces several "good" plans, evaluates their cost, and chooses the best one.
For each query plan, the optimizer must choose an access path for getting the data from each table. Depending on the indices you have created, the optimizer's choices include a table scan, index access, and index covering.
A table scan requires that the entire table be retrieved and hence may cause a large number of disk reads and result in poor response time. Sybase attempts to reduce this time by transferring multiple pages in a single I/O operation and by prefetching pages that have not yet been requested. With such techniques, table scans can often be done efficiently, but scans of large tables are still costly. The situation can be improved if you have appropriate indices on the tables, so that the query plan can use index access or index covering and require fewer I/O operations.
It is important to keep index entries small. The page size in Sybase is 4K, and you can create an index with entries of up to 1250 bytes; but with large entries, few index entries fit in a page, so the index tree has more levels and each level has more pages. As a result, a search must descend more levels, and the index scan for a range search must read more pages. It is also important to choose carefully between a clustered and an unclustered index. You can create only one clustered index per table, because the data for a clustered index is ordered by the search key.
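To see why entry size matters, here is a rough, hypothetical calculation of tree height. It assumes 4K pages and takes fan-out to be page size divided by entry size, ignoring page overhead; the row count of one million is made up for illustration:

```java
// Rough B-tree height estimate: assumed 4K pages, hypothetical entry sizes.
public class IndexHeight {
    // Number of levels needed to cover `rows` rows, given how many
    // index entries fit on one page (the fan-out).
    static int levels(long rows, int pageSize, int entrySize) {
        long fanout = pageSize / entrySize;   // entries per index page
        int levels = 1;
        long reach = fanout;                  // rows reachable with this many levels
        while (reach < rows) {
            reach *= fanout;
            levels++;
        }
        return levels;
    }

    public static void main(String[] args) {
        // Small 16-byte entries: fan-out 256, a shallow 3-level tree.
        System.out.println(levels(1_000_000, 4096, 16));
        // Near-maximum 1250-byte entries: fan-out 3, a 13-level tree.
        System.out.println(levels(1_000_000, 4096, 1250));
    }
}
```

The calculation overstates real heights slightly (it ignores page headers and leaf-level pointers), but the trend it shows is the point: large entries shrink the fan-out and multiply the levels a search must descend.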
The purpose of this experiment is to demonstrate how indices affect the performance of queries.
Cache issues
It is important to consider the effect of the data cache when
measuring performance. When running a sequence of experiments, the
data cache at the start of one experiment will contain the pages left
by the previous experiment. If the next experiment uses the same
pages, it will find them in the cache and no disk I/O is needed.
Thus, no meaningful conclusion can be made by comparing the execution
time of the two experiments. To avoid this, it is necessary to make
sure that when an experiment is initiated, the pages in the cache are
useless. Hence, we must flush the cache between experiments. So we
need to know how Sybase uses the cache. When Sybase starts, it
allocates memory for executable code and data structures. The
remaining memory is used for a data cache, called the default cache.
The command sp_cacheconfig
returns information about the
cache. Executing the command on the UnderMultiUser server yields:
Cache Name              Status   Type     Config Value  Run Value
----------------------  -------  -------  ------------  ------------
Default data cache      Active   Default  0.00 Mb       32.00 Mb

In this experiment, as in the previous experiment, statistics should be obtained by running on the SingleUser server. The default cache on that machine is 8 Mb. In order to flush the pages out, we create a small (1 Mb is the smallest allowed) named cache, called Test, and we bind the databases used in the experiment to Test. As a result, all pages in tables used in this experiment are buffered in Test. Executing
sp_cacheconfig
on the SingleUser server yields:
Cache Name              Status   Type     Config Value  Run Value
----------------------  -------  -------  ------------  ------------
Default data cache      Active   Default  0.00 Mb       8.00 Mb
Test                    Active   Mixed    1.00 Mb       1.00 Mb
Generally, a DBMS uses some form of LRU (least recently used) policy in deciding which pages to overwrite in a cache. It would then appear that, by scanning a large dummy table between experiments, pages remaining in the cache from one experiment would be overwritten by the more recently referenced pages of the dummy table before the next experiment starts. Unfortunately, Sybase recognizes that when a query plan calls for a table scan, as opposed to using an index, a page will not be needed again in the immediate future once it has been read. Hence, it uses an MRU (most recently used, fetch and discard) policy for scans, where pages that have been recently scanned are candidates to be overwritten by other pages that are subsequently scanned. Therefore, you need to force Sybase to use an index in the query plan for accessing the dummy table in order to get it to use an LRU algorithm for managing the table's pages in the cache.
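The contrast between the two policies can be illustrated with a toy simulation. The page names, the 4-frame cache, and the simplified eviction rules below are all made up for illustration; Sybase's real buffer manager is far more elaborate:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy buffer-cache model contrasting LRU with MRU ("fetch and discard")
// replacement during a scan of a dummy table.
public class Replacement {
    // Scan the `dummy` pages through a cache primed with `leftover` pages.
    // Returns the cache contents after the scan. Head = next eviction victim
    // under LRU; tail = most recently fetched page.
    static Deque<String> scan(String[] leftover, String[] dummy, boolean lru) {
        Deque<String> cache = new ArrayDeque<>();
        for (String p : leftover) cache.addLast(p);
        int capacity = leftover.length;
        for (String p : dummy) {
            if (cache.size() >= capacity) {
                if (lru) cache.removeFirst();  // evict least recently used
                else     cache.removeLast();   // evict most recently used
            }
            cache.addLast(p);
        }
        return cache;
    }

    public static void main(String[] args) {
        String[] leftover = {"A", "B", "C", "D"};   // pages from the last experiment
        String[] dummy    = {"d1", "d2", "d3", "d4"};
        // LRU: the dummy scan pushes every leftover page out of the cache.
        System.out.println(scan(leftover, dummy, true));
        // MRU: each dummy page replaces the previous dummy page, so the
        // leftover pages A, B, C survive -- the scan fails to flush them.
        System.out.println(scan(leftover, dummy, false));
    }
}
```

Under LRU the cache ends up holding only dummy pages, which is exactly the flushing effect the experiment needs; under MRU the old pages survive, which is why the dummy table must be accessed through an index.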
The table tppt..initDummy
(on UnderMultiUser server)
or tppt.tppt.initDummy
(on SingleUser server) can be used
to flush the cache Test; it is in the TPPT database, so you do not
need to create it.
CREATE TABLE initDummy (
    value integer,
    attrib integer,
    Filler1 char(255),
    Filler2 char(255),
    Filler3 char(255),
    Filler4 char(255),
    Filler5 char(255),
    Filler6 char(255),
    Filler7 char(255),
    Filler8 char(255),
    Filler9 char(255),
    Filler10 char(255)
)

CREATE NONCLUSTERED INDEX dummyindex ON initDummy (value)

Since a row must be entirely contained in a single page, each page of
initDummy
contains only one row. The query plan for the
following statement uses dummyindex
, not a table scan,
and hence an LRU algorithm is used in Test as pages of the table are
retrieved.
SELECT * FROM tppt..initDummy WHERE value > 100 AND attrib = 0

(Note that
tppt..initDummy
is for use on UnderMultiUser
server; you need to replace it with tppt.tppt.initDummy
for use on SingleUser server.) The table has been initialized so that
no row satisfies the condition attrib = 0
; as a result, the pages will
be brought into the cache, but the result set produced by the SELECT
statement is empty.
You can create a stored procedure, Flush, to be run as a transaction, that flushes the cache using the technique discussed above.
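Flush could be wrapped in Java along these lines. This is a minimal sketch assuming the JDBC signature listed later in this handout; the class name FlushSketch and the row-counting body are hypothetical, and only the SELECT statement itself comes from the text above:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical JDBC wrapper around the cache-flushing SELECT on initDummy.
public class FlushSketch {
    static final String FLUSH_SQL =
        "SELECT * FROM tppt..initDummy WHERE value > 100 AND attrib = 0";

    // Returns the number of rows fetched (expected to be 0, since no row
    // satisfies attrib = 0; the point is the page reads, not the rows).
    static int flush(Connection con) throws SQLException {
        int rows = 0;
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(FLUSH_SQL)) {
            while (rs.next()) rows++;   // drain the (empty) result set
        }
        return rows;
    }

    public static void main(String[] args) {
        // No Sybase server is assumed here; just show the statement Flush runs.
        System.out.println(FLUSH_SQL);
    }
}
```

Remember to replace tppt..initDummy with tppt.tppt.initDummy when running on the SingleUser server.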
Query plan
The query plan used by Sybase to execute a query can be obtained by setting the showplan option on with the command:
SET SHOWPLAN ON

Since you don't want Sybase to output the same plan over and over again as you re-execute the same query during an experiment, use JISQL to get the query plan for a particular SQL statement: (1) set showplan on, then (2) execute the statement (the query plan will be output), and then (3) set showplan off. Since you don't have an account on the SingleUser server, you can only do this on the UnderMultiUser server.
Experiments
The experiment has two parts and uses two tables.
CREATE TABLE Invest (
    Id char(9) NOT NULL,
    Name char(50) NOT NULL,
    BondOrStock char(1) NOT NULL,
    Amount float,
    Note char(40) NOT NULL
)

CREATE TABLE Deal (
    InvestId char(9) NOT NULL,
    DealId char(6) NOT NULL,
    Date char(6) NOT NULL,
    Amount float,
    BuyOrSell char(1) NOT NULL
)

Part 1 uses only the table Invest, performs two queries, and has a total of 10 experiments.
Implement a stored procedure, InvestSearch, to be run as a
transaction, that searches Invest using a SELECT statement. Create a
version of the procedure for each of the following two statements, for
a point query and a range query, respectively, where
'123456789'
is the Id of one Invest row that lies after 50%
of the rows, and '500'
and '3000'
form a range
of about half of the rows in the Invest table.
SELECT BondOrStock FROM Invest WHERE Id = '123456789'

SELECT Id FROM Invest WHERE Amount > '500' AND Amount < '3000'

This procedure should include the
with recompile
option,
which in Sybase should be specified just after the procedure name.
This guarantees that its query plan will be rebuilt each time it is
called; otherwise, the query plan created when the stored procedure is
first introduced into the schema will be used despite new indexes that
you subsequently create for different experiments. Also, the stored
procedure Flush must be called before each execution of InvestSearch
to empty the cache.
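As a sketch, the DDL for one version of InvestSearch might look as follows, shown as the string a JDBC setup step could execute. The WITH RECOMPILE placement (right after the procedure name) follows the text above, while the class name and the procedure body are assumptions:

```java
// Hypothetical DDL string for the point-query version of InvestSearch.
public class InvestSearchDdl {
    static final String DDL =
        "CREATE PROCEDURE InvestSearch WITH RECOMPILE AS\n" +
        "    SELECT BondOrStock FROM Invest WHERE Id = '123456789'";

    public static void main(String[] args) {
        System.out.println(DDL);
    }
}
```

A second version would use the range query in place of the SELECT shown here; with recompile ensures each call re-optimizes against whatever indexes currently exist.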
Initialize the Invest table with 6000 rows; the Amount values in the rows should be only 500, 1000, 1500, ..., 6000, with the same number of Ids having each amount. Since each row occupies about 100 bytes and the default page size in Sybase is 4K bytes, approximately 40 rows fit in one page. Since no key is specified, the table will be created as a heap storage structure.
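The sizing claim can be checked with the handout's own numbers (6000 rows, roughly 100 bytes per row, 4K pages); this back-of-the-envelope calculation also gives the number of pages a full table scan must read:

```java
// Rough sizing of the Invest heap from the figures in the text.
public class HeapSizing {
    public static void main(String[] args) {
        int rows = 6000, rowBytes = 100, pageBytes = 4096;
        int rowsPerPage = pageBytes / rowBytes;           // about 40
        int pages = (rows + rowsPerPage - 1) / rowsPerPage; // ceiling division
        System.out.println(rowsPerPage);  // rows that fit on one page
        System.out.println(pages);        // pages a table scan must read
    }
}
```

The estimate ignores page and row overhead, so the real page count is somewhat higher, but it gives the order of magnitude against which index access can be compared.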
Run the experiment for the point query using the Invest table with the following options:
CREATE CLUSTERED INDEX point1 ON Invest(Id)
CREATE NONCLUSTERED INDEX point2 ON Invest(Id)
CREATE NONCLUSTERED INDEX point3 ON Invest(Id, BondOrStock)
CREATE NONCLUSTERED INDEX point4 ON Invest(BondOrStock, Id)
Run the experiment for the range query using the Invest table with the following options:
CREATE CLUSTERED INDEX range1 ON Invest(Amount)
CREATE NONCLUSTERED INDEX range2 ON Invest(Amount)
CREATE NONCLUSTERED INDEX range3 ON Invest(Amount, Id)
CREATE NONCLUSTERED INDEX range4 ON Invest(Id, Amount)
Part 2 uses both Invest and Deal tables, performs a join query, and has a total of 6 experiments.
Implement a stored procedure, DealIdDate, to be run as a
transaction, that outputs deal Ids and dates using the following
statement, where 'abcdefg'
is the Name of one Invest
row.
SELECT D.DealId, D.Date FROM Invest I, Deal D WHERE I.Id = D.InvestId AND I.Name = 'abcdefg'
(a) Initialize the Invest table with 5000 rows and the Deal table with 5000 rows (1 Deal row for each Invest row). Run the experiment with the following options:

no index

InvestId:
CREATE NONCLUSTERED INDEX InvestId ON Deal(InvestId)

InvestId,Name:
CREATE NONCLUSTERED INDEX InvestId ON Deal(InvestId)
CREATE NONCLUSTERED INDEX Name ON Invest(Name)
(b) Initialize the Invest table with 3000 rows and the Deal table with 3000 rows (30 Invest rows each having 100 Deal rows, and the other Invest rows having none). Run the experiment with the following options:

no index

InvestId:
CREATE NONCLUSTERED INDEX InvestId ON Deal(InvestId)

DealId,InvestId:
CREATE CLUSTERED INDEX DealId ON Deal(DealId)
CREATE NONCLUSTERED INDEX InvestId ON Deal(InvestId)
What to do
The routines to implement, with the Java signatures used by TPPT:

int InvestSearch(Connection con);
int Flush(Connection con);
int DealIdDate(Connection con);

For each experiment in Part 1, a run executes

Flush
InvestSearch

and for each experiment in Part 2, a run executes

Flush
DealIdDate
An additional step concerns the index range2. To
investigate the circumstances under which the query optimizer uses, or
does not use, a nonclustered index in a range scan, design a query
involving a range of InvestId's in the WHERE clause. Vary the range
and determine the point at which the optimizer switches from using the
index range2 to not using that index. Note that you should
have only the index range2 in the schema for this step;
otherwise the query plan may choose to use another index instead of a
table scan.
Part 1 point query
Index type   Average response time
----------   ---------------------
no index     number
point1       number
point2       number
point3       number
point4       number

Part 1 range query
Index type   Average response time
----------   ---------------------
no index     number
range1       number
range2       number
range3       number
range4       number

Part 2 db (a)
Index type      Average response time
----------      ---------------------
no index        number
InvestId        number
InvestId,Name   number

Part 2 db (b)
Index type        Average response time
----------        ---------------------
no index          number
InvestId          number
DealId,InvestId   number
Handins
Hand in your description and code electronically, using Blackboard, by midnight on the due date. Your handin should include
Grading
This homework will be graded based on 100 points, allocated in proportion to the required amount of work for each part. You may do this assignment in a team of two people; the two people will receive the same points.
Bonus (80% extra credit)
You will discover in the above experiments that the timing results are very hard to explain. This is caused by the extra unpredictable time spent in the network when the measurements are taken using Java in TPPT. To solve this problem, you may use a pair of commands in Sybase SQL for taking times directly on the server. The bonus task is to do all the experiments also in this fashion and report the results.