CSE 515: Database Transaction Processing Systems, Fall 2008



Experiment 5: Indexing and Performance

Deadline: Dec 14, Sunday, 2008

Much of the code is already provided in ClientJava.zip, so your main task is to determine how many cases you need to handle and what to do for each case. Read the document VERY carefully (the Cache Issues section is mostly background) and follow the instructions. Since there are many cases in this experiment, please write the report in a very readable way. I suggest you use Word or PDF instead of TXT.

Introduction

Sybase Adaptive Server's query optimizer attempts to produce an efficient query plan for each query it executes.  An efficient plan minimizes the number of I/O operations required, since I/O represents the major execution cost.  The optimizer generally produces several "good" plans, evaluates their cost and chooses the best one. 

 

As part of each query plan, the optimizer must choose an access path to get the data from each table. Depending on the indices you have created, the optimizer's choices include:

 

  • Table Scan - Reads all of the data pages of a table.

  • Index Access - Uses the proper index to find only the data pages needed.

  • Index Covering - The query can be satisfied using only the search key attributes of an unclustered index. Hence, only index pages need to be accessed; no access to data pages is required.

 

A table scan requires that the entire table be retrieved and hence may cause a large number of disk reads, with resulting poor response time. Sybase attempts to reduce this time by transferring multiple pages in a single I/O operation and by prefetching pages - fetching pages that have not yet been requested.  With such techniques, table scans can often be done efficiently. However, scans of large tables are still costly.  The situation can be improved if you have the right set of indices on your tables. In that case the query plan can use index access or index covering and will require fewer I/O operations.

 

It is important to keep index entries as small as possible. The page size in Sybase is 4K, and you can create an index whose entries have up to 1250 bytes; with entries that large, however, very few index entries fit in a page, the index tree will have more levels, and each level will have more pages.  As a result a search must descend more levels, and the index scan needed to perform a range search will need to read more pages.
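The effect of entry size on tree height can be sketched numerically. This is a back-of-envelope Python model; the 4096-byte page, the 20-byte "small entry" size, and the absence of page overhead are all simplifying assumptions, not Sybase internals.

```python
import math

PAGE_SIZE = 4096  # assumed Sybase page size, per the text above

def index_levels(n_entries, entry_size):
    # Fan-out = index entries that fit on one page; the tree height is the
    # number of levels needed before a single root page suffices
    # (page overhead ignored).
    fanout = PAGE_SIZE // entry_size
    return math.ceil(math.log(n_entries, fanout))

# 60000 entries: small 20-byte keys vs. maximal 1250-byte entries.
print(index_levels(60000, 20))    # shallow tree (fan-out ~200)
print(index_levels(60000, 1250))  # fan-out only 3 -> many more levels
```

With 1250-byte entries only three fit per page, so the tree is far deeper and every search and range scan touches many more pages.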

 

Choose carefully between a clustered and an unclustered index. You can create only one clustered index per table, because the data in a table with a clustered index is stored in search-key order.

The following experiment will demonstrate how indices affect the performance of selection.

Experiment

Cache Issues

It is very important to consider the effect of the data cache when measuring performance. When running a sequence of experiments, the data cache at the start of one experiment will contain the pages left by the previous experiment. If the next experiment uses the same pages, it will find them in the cache and no disk I/O is needed. As a result, no meaningful conclusion can be drawn by comparing the execution times of the two experiments. To avoid this problem it is necessary to make sure that when an experiment is initiated, the pages in the cache are useless.  Hence, we need to flush the cache between experiments, and to do that we need to know how Sybase Adaptive Server uses the cache. When Adaptive Server starts, it allocates memory for executable code and data structures. The remaining memory is used for a data cache, called the default cache. The command sp_cacheconfig returns information about the cache. If you execute the command on RDB you will get:

 

Cache Name                     Status    Type     Config Value Run Value   

------------------------------ --------- -------- ------------ ------------

default data cache             Active    Default       0.00 Mb     32.00 Mb

 

In this experiment, as in the previous experiment, statistics should be obtained in single-user mode by running on the bench machine.  The default cache on that machine is 8 Mb.  In order to flush the pages out of it, we create a small named cache (1 Mb is the smallest allowed) - called test - and we bind the databases used in the experiment to test. As a result, all pages in tables used in this experiment are buffered in test.   Executing sp_cacheconfig on the bench machine yields:

 

Cache Name                     Status    Type     Config Value Run Value

------------------------------ --------- -------- ------------ ------------

default data cache             Active    Default       0.00 Mb      8.00 Mb

test                           Active    Mixed         1.00 Mb      1.00 Mb

 

Generally, a DBMS will use some form of LRU policy in deciding which pages to keep in a cache.  It would then appear that by scanning a large (dummy) table between experiments, pages remaining in the cache from one experiment would be overwritten by the more recently referenced pages of the dummy table before the next experiment starts.   Unfortunately, Sybase recognizes that when a query plan calls for a table scan (as opposed to using an index), once a page has been read it will not be needed in the immediate future.  Hence, an MRU (fetch and discard) policy is used for scans in which pages that have been recently scanned are candidates to be overwritten by other pages that are subsequently scanned.  You need to force Sybase to use an index in the query plan for accessing the dummy table in order to get it to use an LRU policy for managing the table's pages in the cache. 
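The difference between the two policies can be seen in a toy buffer-cache simulation. This is illustrative Python only - not Sybase's actual buffer manager; the 4-page capacity and the page names are made up.

```python
from collections import OrderedDict

def run_scan(cache, pages, policy, capacity=4):
    # Fetch each page in order; when the cache is full, evict per policy.
    for p in pages:
        if p in cache:
            cache.move_to_end(p)           # a hit refreshes recency
            continue
        if len(cache) >= capacity:
            if policy == "LRU":
                cache.popitem(last=False)  # evict least recently used
            else:  # MRU "fetch and discard"
                cache.popitem(last=True)   # evict the most recently fetched page
        cache[p] = True

old = ["old1", "old2", "old3", "old4"]  # pages left by a previous experiment
dummy = ["d1", "d2", "d3", "d4", "d5"]  # dummy-table scan

mru = OrderedDict()
run_scan(mru, old, "LRU")
run_scan(mru, dummy, "MRU")
print(sorted(mru))  # old pages survive: an MRU scan fails to flush them

lru = OrderedDict()
run_scan(lru, old, "LRU")
run_scan(lru, dummy, "LRU")
print(sorted(lru))  # all old pages evicted: an LRU scan flushes the cache
```

Under MRU, each newly fetched dummy page evicts the previous dummy page, so the old pages are never displaced - which is exactly why the dummy-table query must be forced to use an index (and hence LRU).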

 

The table tppt.tppt.initDummy should be used to flush the cache test.  The table is in the tppt database - you do not need to create it.

 

CREATE TABLE initDummy (
   value      integer,
   attrib     integer,
   Filler1    char(255),
   Filler2    char(255),
   Filler3    char(255),
   Filler4    char(255),
   Filler5    char(255),
   Filler6    char(255),
   Filler7    char(255),
   Filler8    char(255),
   Filler9    char(255),
   Filler10   char(255)
)

CREATE NONCLUSTERED INDEX dummyindex ON initDummy (value)

 

Since a row must be entirely contained in a single page, each page of initDummy contains only one row.  The query plan for the statement:

   SELECT *
   FROM tppt.tppt.initDummy
   WHERE value > 100 AND attrib = 0

uses dummyindex (not a table scan), and hence an LRU algorithm is used in test as pages of the table are retrieved. Since the table is initialized so that no row satisfies the condition attrib = 0, the pages will be brought into the cache, but the result set produced by the SELECT statement is empty.
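A quick sizing check confirms both claims. This is back-of-envelope Python; the 4-byte integer size, the 4096-byte page, and the absence of per-row and per-page overhead are assumptions, not Sybase internals.

```python
PAGE_SIZE = 4096

# initDummy row: value, attrib (two integers) plus ten char(255) fillers.
row_size = 2 * 4 + 10 * 255        # 2558 bytes
rows_per_page = PAGE_SIZE // row_size
print(rows_per_page)                # only one row fits per page

# Pages (= rows) of initDummy needed to overwrite the entire 1 Mb test cache.
cache_pages = (1 * 1024 * 1024) // PAGE_SIZE
print(cache_pages)
```

So scanning a few hundred initDummy rows via the index is enough to replace every page in the 1 Mb test cache.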

 

 

Query Plan

 

The query plan used by Sybase to execute a query can be obtained by setting the showplan option on with the command:

set showplan on

 

Since you don't want the DBMS to output the same plan over and over again as you re-execute the same query during an experiment, obtain the query plan for a particular SQL statement separately: use jisql to (1) set showplan on, (2) execute the statement (the query plan will be output), and (3) set showplan off.  Since you don't have an account on the bench machine, you will only be able to do this on the multi-user server (RDB).

 

Description

The experiment consists of two parts and uses two tables.

CREATE TABLE Student (
        Id          char(9) NOT NULL,
        Name        char(50) NOT NULL,
        Password    char(20) NOT NULL,
        Validity    char(1) NOT NULL,
        Town        char(20)
)

 

 

CREATE TABLE Transcript (
        StudId      char(9) NOT NULL,
        CourseId    char(6) NOT NULL,
        SectionNo   INT,
        Semester    char(6) NOT NULL,
        Year        INT,
        Grade       char(1) NOT NULL
)

 

Part 1.

This part of the experiment uses only the table Student.  There are 60000 rows in Student, and since each row occupies 100 bytes and the default page size in Sybase is 4k bytes, approximately 40 rows fit in one page.  Since no key is specified, the table will be created as a heap storage structure.  Initialize the table by copying rows from the read-only table tppt.tppt.initStudent using an insert statement:
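The arithmetic above can be verified directly. This sketch assumes fixed-width char columns, a 4096-byte page, and no per-row or per-page overhead.

```python
# Row size of Student from its schema (fixed-width char columns, in bytes).
columns = {"Id": 9, "Name": 50, "Password": 20, "Validity": 1, "Town": 20}
row_size = sum(columns.values())        # 100 bytes per row

PAGE_SIZE = 4096
rows_per_page = PAGE_SIZE // row_size   # ~40 rows per page
data_pages = 60000 // rows_per_page     # pages a full table scan must read

print(row_size, rows_per_page, data_pages)
```

These 1500 data pages are the baseline cost of a table scan that the indices below are meant to beat.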

insert into Student select * from tppt.tppt.initStudent

 

Run the experiment using the Student table with no index or with one of the following indexes:

1)      create clustered index point1 on Student(Id)

2)      create nonclustered index point2 on Student(Id)

3)      create nonclustered index point3 on Student(Id, Validity)

4)      create nonclustered index point4 on Student(Validity, Id)

5)      create clustered index range1 on Student(Town)

6)      create nonclustered index range2 on Student(Town)

7)      create nonclustered index range3 on Student(Town, Id)

8)      create nonclustered index range4 on Student(Id, Town)

 

 

Note that it takes much longer to build the table with a clustered index than with an unclustered index, because each row must be inserted into the table in sorted order based on the search key.

 

You should implement two stored procedures to be run as transactions.  Flush flushes the cache using the technique discussed previously and must be called before each execution of StudentSearch to empty the cache.

 

 

StudentSearch: search Student using a SELECT statement.   Create versions of the procedure for each of the statements:

select Validity from Student where Id = '100003000';               (point query)

select Id from Student where Town > 'city10' and Town < 'city26';  (range query)

 

The StudentSearch stored procedure should include the "with recompile" option. This guarantees that its query plan will be rebuilt each time it is called.  Without this option, the query plan created when the stored procedure is first introduced into the schema will continue to be used, regardless of any indexes you subsequently create.

 

Part 2.

This part of the experiment uses both Student and Transcript tables and involves a join. In addition to the Flush stored procedure, create a stored procedure CourseIdGrade.

 

CourseIdGrade:  Output course Ids and grades using the following select statement.

      SELECT T.CourseId, T.Grade

FROM Student S, Transcript T

WHERE S.Id = T.StudId AND S.Name = '100000001'

 

a)  Initialize Student and Transcript using the following commands. This creates a Student table with 5000 rows and a Transcript table with 5000 rows (one course for each student).

INSERT INTO Student
    SELECT * FROM tppt.tppt.initStudent
    WHERE StudId <= '100005000'

INSERT INTO Transcript
    SELECT S.StudId, C.CourseId, 1, 'Fall', 2003, 'A'
    FROM tppt.tppt.initStudent S, tppt.tppt.initCourse C
    WHERE S.StudId <= '100005000' AND C.CourseId = 'CSE501'

 

Run the experiment with the following options:

1) no indexes on either table

2) CREATE NONCLUSTERED INDEX StudId ON Transcript(StudId)

3) CREATE NONCLUSTERED INDEX StudId ON Transcript(StudId)

   CREATE NONCLUSTERED INDEX Name ON Student(Name)
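To see why option 2 should help, a rough nested-loop-join I/O model can be sketched. This is illustrative Python; the rows-per-page figures, the 2-level index height, and the choice of Student (filtered on Name) as the outer table are assumptions, not the optimizer's actual plan.

```python
# Part 2(a): 5000-row Student joined to 5000-row Transcript.
# Page counts are back-of-envelope: ~100-byte and ~30-byte rows on 4K pages.
STUDENT_PAGES = 5000 // 40       # ~125 pages
TRANSCRIPT_PAGES = 5000 // 136   # ~36 pages

def join_cost(outer_pages, inner_pages, outer_matches, indexed, levels=2):
    if indexed:
        # One index probe (tree descent + one data page) per qualifying
        # outer row.
        return outer_pages + outer_matches * (levels + 1)
    # No index: rescan the entire inner table for each qualifying outer row.
    return outer_pages + outer_matches * inner_pages

# One student matches Name = '100000001'.
print("no index:", join_cost(STUDENT_PAGES, TRANSCRIPT_PAGES, 1, False))
print("index on Transcript(StudId):",
      join_cost(STUDENT_PAGES, TRANSCRIPT_PAGES, 1, True))
```

With only one qualifying outer row the gap is modest; it widens quickly as the number of qualifying rows grows, which is worth keeping in mind when explaining the measured times.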

 

b)  Initialize tables using the following commands. This creates a Student table with 3000 rows and a Transcript table with 3000 rows (30 students take 100 courses and the rest take none).

INSERT INTO Student
    SELECT * FROM tppt.tppt.initStudent
    WHERE StudId <= '100003000'

INSERT INTO Transcript
    SELECT S.StudId, C.CourseId, 1, 'Fall', 2003, 'A'
    FROM tppt.tppt.initStudent S, tppt.tppt.initCourse C
    WHERE S.StudId <= '100000030' AND C.CourseId <= 'CSE600'

 

Run the experiment with the following options:

1) no index

2) CREATE NONCLUSTERED INDEX StudId ON Transcript(StudId)

3) CREATE CLUSTERED INDEX CourseId ON Transcript(CourseId)

   CREATE NONCLUSTERED INDEX StudId ON Transcript(StudId)

 

Steps:

  1. Prepare a schema file for the stored procedures. Prepare a different schema file for each index configuration, including the no-index case (the files will differ in only one line).
  2. Prepare an init file for each case (altogether 3 different init files: 1 for Part 1 and 2 for Part 2).
  3. Prepare prototype.txt containing two entries:

int StudentSearch(Connection con);

int Flush(Connection con);

or (depending on whether this is for Part 1 or Part 2):

int CourseIdGrade(Connection con);

int Flush(Connection con);

  4. Create terminal0.scp as follows:

>> 

Flush

StudentSearch

 

or (depending on whether this is for Part 1 or Part 2):

>> 

Flush

CourseIdGrade

  5. Prepare Java programs FlushClass.java, StudentSearchClass.java, and CourseIdGradeClass.java invoking the stored procedures.
  6. Test the point query using no index, index point1, index point2, index point3, and index point4. Each experiment involves a single terminal and should run for 20 seconds.  In each case, run the experiment 3 times to check for consistency.
  7. Test the range query using no index, index range1, index range2, index range3, and index range4.  Each experiment involves a single terminal and should run for 20 seconds.  In each case, run the experiment 3 times to check for consistency.
  8. The query optimizer understands that the cost of using a nonclustered index in the query plan when the result set is likely to be large may exceed the cost of a table scan.  Hence the query plan for the range query is likely to ignore the index range2.  To investigate the circumstances under which the query optimizer uses - or does not use - a nonclustered index in a range scan, design a query involving a range of StudIds in the WHERE clause. Vary the range and determine the point at which the optimizer switches from using the index range2 (not point2!) to not using that index. (You should have only the index range2 in the schema for this step; otherwise the DBMS may choose to use another index instead of a table scan.)
  9. Run the CourseIdGrade experiment with 5000 rows in the Transcript table using no index, one index, or two indices. Each experiment involves a single terminal and should run for 20 seconds.  In each case, run the experiment 3 times to check for consistency.
  10. Run the CourseIdGrade experiment with 3000 rows in the Transcript table using no index, one index, or two indices. Each experiment involves a single terminal and should run for 20 seconds.  In each case, run the experiment 3 times to check for consistency.
  11. Debug and test your program on RDB.  Get the final statistics on the single-user server (sbntbench).  Since you will not be taking any statistics off RDB, you only need to use the smaller version of the Student table when running CourseIdGrade there.  Flushing only works on sbntbench, since on RDB we do not bind the database to the smaller cache; hence, statistics there are unreliable.
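For the range-scan investigation, a back-of-envelope model suggests roughly where the crossover should lie. This is illustrative Python; the 1500-page table (as in Part 1), the 3-level index, and the one-data-page-read-per-matching-row assumption are simplifications, not measurements.

```python
# Estimated crossover point: table scan vs. unclustered index range scan.
DATA_PAGES = 1500     # 60000 rows / 40 rows per page (Part 1 figures)
INDEX_LEVELS = 3

def index_cost(matching_rows):
    # With an unclustered index, each matching row may require its own
    # data-page read (leaf-level pages and prefetch effects ignored).
    return INDEX_LEVELS + matching_rows

def breakeven():
    # Smallest result size at which the index is no longer cheaper.
    rows = 0
    while index_cost(rows) < DATA_PAGES:
        rows += 1
    return rows

print(breakeven(), "rows, about",
      round(100 * breakeven() / 60000, 1), "% of the table")
```

Sybase's actual switch point will differ (prefetching makes scans cheaper than this model assumes), but the model shows why the optimizer abandons a nonclustered index at a small fraction of the table.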

 

Report:

Since there are many cases in this experiment, please write the report in a very readable way. I suggest you use Word or PDF instead of TXT.

 

1.        Turn in a TABLE showing the average response time for StudentSearch and CourseIdGrade measured by the tool for each index type and query.

You can use tables like the following:

Average response time for StudentSearch:

Index Type   Point query         Range query
-----------  ------------------  ------------------
No index     *number*            *number*
point1       *number*            --no experiment--
point2       *number*            --no experiment--
point3       *number*            --no experiment--
point4       *number*            --no experiment--
range1       --no experiment--   *number*
range2       --no experiment--   *number*
range3       --no experiment--   *number*
range4       --no experiment--   *number*

and

Average response time for CourseIdGrade:

Index Type         DB state (a)        DB state (b)
-----------------  ------------------  ------------------
No index           *number*            *number*
StudId             *number*            --no experiment--
StudId, Name       *number*            --no experiment--
StudId             --no experiment--   *number*
CourseId, StudId   --no experiment--   *number*

 

2.        Complete what is required in Step 8: vary the range and determine the point at which the optimizer switches from using the index range2 to not using that index.

3.        Compare the response times for the point queries, the response times for the range queries, and the response times for the join queries.  In each case give the query plan used by Sybase (which you can obtain on MultiUser) and use it to explain the times you got on SingleUser.

4.        What effect does the introduction of the clustered index CourseId in part 2.b have on the response time?  Why?

5.        Include in your submission all the original timing report files from TPPT SingleUser.

 

 



Last updated on Dec 8, 2008 by  Tingbo Hou