Starting from:

$25

B561 - Assignment 6 -   Data Generation - Solved

Sorting

Indexing (Draft: Version 1)


For this assignment, you will need the material covered in the lectures

•    Lecture 15: External Merge Sorting

•    Lecture 16: Indexing

•    Lecture 17: B+ trees and Hashing

In addition, you should read the following sections in Chapter 14 and 15 in the textbook Database Systems The Complete Book by Garcia-Molina, Ullman, and Widom:

•    Section 14.1: Index-structure Basics

•    Section 14.2: B-Trees

•    Section 14.3: Hash Tables

•    Section 14.7: Bitmap Indexes

•    Section 15.8.1: Multipass Sort-Based Algorithms

In the file performingExperiments.sql supplied with this assignment, we have include several PostgreSQL functions that should be useful for running experiments. Of course, you may wish to write your own functions and/or adopt the functions in this .sql to suite the required experiments for the various problems in this assignment.

The problems that need to be included in the assignment6.sql are marked with a blue bullet •. The problems that need to be included in the assignment6.pdf are marked with a red bullet •. (You should aim to use Latex to construct your .pdf file.) In addition, submit a file assignment6Code.sql that contains all the sql code that you developed for this assignment. Practice problems should not be submitted.

1           Data generation
PostgreSQL functions and clauses In this assignment there will be a need to do simulations with various-size relations. Many of these relations will have randomized data. PostgreSQL provides useful functions and clauses that make such relations:

random()
returns a random real number between 0 and 1 using the uniform distribution
floor(random() ∗ (u−l + 1) + l) :: int
returns a random integer in the range [l,u] using the uniform distribution
generate series(m,n)
generates the set of integers in the range [m,n]
generate series(m,n,k)
generates the set of integers in the range [m,n]) separated by step-size k
order by random()
randomly orders the tuples that are the result of a query
limit(n)
returns the first n tuples from the result of a query
offset(n)
returns all but the first n tuples from the result of a query
limit(m) offset(n)
returns the first m tuples from all but the first n tuples from the result of a query
row number()
is a window function that assigns a sequential integer to each row in a result set
vacuum
is a garbage collection function to clean and reclaim secondary memory space
For more detail, consult the manual pages

https://www.postgresql.org/docs/13/functions-math.html https://www.postgresql.org/docs/current/functions-srf.html https://www.postgresql.org/docs/current/queries-limit.html https://www.postgresql.org/docs/8.4/functions-window.html https://www.postgresql.org/docs/9.5/routine-vacuuming.html

In the file performingExperiments.sql supplied with this assignment, we have include several PostgreSQL functions that should be useful for running experiments. Of course, you may wish to write your own functions and/or adopt the functions in this .sql to suite the required experiments for the various problems in this assignment.

Generating sets To generate a set, i.e., a unary relation, of n randomly selected integers in the range [l,u], you can use the following function:[1]

create or replace function SetOfIntegers(n int, l int, u int) returns table (x int) as

$$ select floor(random() * (u-l+1) + l)::int from generate_series(1,n);

$$ language sql;

Example 1 To generate a unary relation with 3 randomly selected integers in the range 5 to 10, do the following:

select x from SetofIntegers(3,5,10);

Of course, running this query multiple times, result in different sets.

Generating binary relations The idea behind generating a set can be generalized to that for the generation of a binary relation.[2] To generate a binary relation of n randomly selected pairs of integers (x,y) such x ∈ [l1,u1] and y ∈ [l2,u2], you can use the following function:

create or replace function

BinaryRelationOverIntegers(n int, l_1 int, u_1 int, l_2 int, u_2 int) returns table (x int, y int) as

$$ select floor(random() * (u_1-l_1+1) + l_1)::int as x, floor(random() * (u_2-l_2+1) + l_2)::int as y

      from           generate_series(1,n);

$$ language sql;

Example 2 To generate a binary relation with 20 randomly selected pairs with first components in the range [3,8] and second components in the range [2,11], do the following:

select x, y from BinaryRelationOverIntegers(20,3,8,2,11);

Generating functions A relation generated by BinaryRelationOverIntegers is in general not a function since it is possible that the relation has pairs (x,y1) and (x,y2) with y1 6= y2. To create a (partial) function f : [l1,u1] → [l2,u2] of n randomly selected function pairs, we can use the following function:

create or replace function

FunctionOverIntegers(n int, l_1 int, u_1 int, l_2 int, u_2 int) returns table (x int, y int) as

$$ select x, floor(random() * (u_2-l_2+1) + l_2)::int as y from generate_series(l_1,u_1) x order by random() limit(n);

$$ language sql;

Example 3 To generate a partial function [1,20] → [3,8] of 15 randomly selection function pairs, do the following:[3]select x, y from FunctionOverIntegers(15,1,20,3,8);

Generating relations with categorical (non-numeric) data Thus far, the sets, binary relations, and functions have all just involved integer ranges. It is possible to include ranges that have different types including categorical data such as text strings. The technique to accomplish this is to first associate with a categorical range an integer range that associate with each element in the categorical range a unique value of the integer range. The next example illustrates this.

Example 4 Consider the jobSkill relation and assume that it contents is

skill

 

AI

Databases

Networks

OperatingSystems

Programming

Suppose that we want to generate a personSkill(pid, skill) relation. Let us assume that the pid’s are integers in the range [1,m].

There are 5 skills in the jobSkill and it is therefore natural to associate with each skill a separate integer (index value) in the range [1,5]. This can be done with a query involving the row number() window function:

select skill, row number() over (order by skill) as index from     Skill;

The result is the following relation:

                                                    skill                                  index

 

                                                  AI                                           1

                                                  Databases                            2

                                                  Networks                             3

                                                    OperatingSystems            4

                                                  Programming                      5

Using this technique, we can write a PostgreSQL function that generates a personSkill relation with n randomly selected (pid,skill) tuples, with pid’s in the range [l,u]:

create or replace function GeneratepersonSkillRelation(n int, l int, u int) returns table (pid int, skill text) as

$$

with skillNumber(skill, index) as (select skill, row_number() over (order by skill) from       Skill),

pS as (select x, y from   BinaryRelationOverIntegers(n,l,u,1, (select count(1) from Skill)::int)) select x as pid, skill from          pS join skillNumber on y = index group by (x, skill) order by 1,2; $$ language sql;

In this function, the skillNumber view associates with each job skill an integer index in the range [1,|Skill|]. The pS view is a randomly generated binary relation with n tuples, with pid’s in the range [l,u], and skill numbers in the range [1,|Skill|]. The join operation associates the numeric range with the Skill range. The ‘group by (x, skill) order by 1,2’ clause eliminates duplicate tuples and orders the result. The query select * from GeneratepersonSkillRelation(10,1,15);

may make the personSkill relation:

pid skill

 

1            AI

2            Programming

3            Databases

4            Databases

                                                      6         Networks

                                                      6          OperatingSystems

                                                      6        Programming

9 Databases 14 Databases

14 Networks

Problems       We now turn to the problems in this section.

1.    • Given a discrete probability mass function P, as specified in a relation

P(outcome: int, probability: float), over a range of possible outcomes [u2,l2], design a PostgreSQL function

RelationOverProbabilityFunction(n,l1,u1,l2,u2) that generates a relation of up to n pairs (x,y) such that

•    x is uniformly selected in the range [l1,u1]; and

•    y is selected in accordance with the probability mass function P in the range [l2,u2].

An example of a possible P as stored in relation P is as follows:[4]

P outcome probability

 

1              0.25 2    0.10 3    0.40 4                0.10

                                                                 5                     0.15

Note that when P is the uniform probability mass function, then

RelationOverProbabilityFunction

and

BinaryRelationOverIntegers

are the same binary relation producing functions.

Hint: For insight into this problem, consult the method of Inverse Transform Sampling for discrete probability mass functions.

Test your function for the cases listed in the Assignment-Script-2021-Fall-assignment6.sql file.

2.    Practice Problem-not graded.

Use the technique in Problem 1 and the method for generating categorical data discussed above to write a PostgreSQL function that generates a personSkill relation, given a probability mass function over the Skill relation.

Your function should work for any Skill relation and any probability distribution defined over it.

Test your function for the cases listed in the Assignment-Script-2021-Fall-assignment6.sql file.

2           Sorting
We have learned about external sorting. The problems in this section are designed to look into this sorting method as it implemented in PostgreSQL.

 3. • Create successively larger sets of n randomly selected integers in the range [1,n]. You can do this using the following function and with the following experiment.[5]

create or replace function makeS (n integer) returns void as

$$ begin drop table if exists S; create table S (x int); insert into S select * from SetOfIntegers(n,1,n);

end;

$$ language plpgsql;

This function generates a bag S of size n, with randomly select integers in the range [1,n]. Now consider the following SQL statements:

select makeS(10); explain analyze select x from S; explain analyze select x from S order by 1;

•    The ‘select makeS(10)’ statement makes a bag S with 10 elements;

•    The ‘explain analyze select x from S’ statement provides the query plan and execution time in milliseconds (ms) for a simple scan of S;

•    The ‘explain analyze select x from S order by 1’ statement provides the query plan and execution time in milliseconds (ms) for sorting S.[6]

QUERY PLAN

------------------------------------------------------------------------------------------------------

Sort (cost=179.78..186.16 rows=2550 width=4) (actual time=0.025..0.026 rows=10 loops=1)

Sort Key: x

Sort Method: quicksort Memory: 25kB

-> Seq Scan on s (cost=0.00..35.50 rows=2550 width=4) (actual time=0.004..0.005 rows=10 loops=1)

Planning Time: 0.069 ms

Execution Time: 0.034 ms

Now construct the following timing table:[7]

size n of relation S
avg execution time to scan S (in ms)
avg execution time to sort S (in ms)
101

102

103

104

105

106

107

10[8]
 
 
(a)    What are your observations about the query plans for the scanning and sorting of such differently sized bags S?

(b)    What do you observe about the execution time to sort S as a function of n?

(c)    Does this conform with the formal time complexity of (external) sorting? Explain.

(d)   It is possible to set the working memory of PostgreSQL using the set work mem command. For example, set work mem = ’16MB’ sets the working memory to 16MB.8 The smallest possible working memory in postgreSQL is 64kB and the largest depends on you computer memory size. But you can try for example 1GB.

Repeat question 3a for memory sizes 64kB and 1GB and report your observations.

(e)    Now create a relation indexedS(x integer) and create a Btree index on indexedS and insert into indexedS the sorted relation S.[9]

create table indexedS (x integer); create index on indexedS using btree (x); insert into indexedS select x from S order by 1; Then run the range query select x from indexedS where x between 1 and n;

where n denotes the size of S.

Then construct the following table which contains (a) the average execution time to build the btree index and (2) the average time to run the range query.

What are your observations about the query plans and execution times to create indexedS and the execution times for sorting the differently sized bags indexedS? Compare your answer with those for the above sorting problems.

4. Practice problem-not graded.

size n
of relation S
avg execution time to create index indexedS
avg execution time to sort indexedS (in ms)
 
101

102

103

104

105

106

107

108
 
 
Typically, the makeS function returns a bag instead of a set. In the problems in this section, you are to conduct experiments to measure the execution times to eliminate duplicates.

(a)    Write a SQL query that uses the DISTINCT clause to eliminate duplicates in S and report you results in a table such as that in Problem 3a.

(b)    Write a SQL query that uses the GROUP BY clause to eliminate duplicates in S and report you results in a table such as that in Problem 3a.

(c)    Compare and contrast the results you have obtained in problems 4a and 4b. Again, consider using explain analyze to look at query plans.

3           Indexes and Indexing
Indexes on data (1) permit faster lookup on data items and (2) may speed up query processing on such data. Speedups can be substantial. The purpose of the problems in this section are to explore this. Some other problems in this section are designed to understand the workings of the B+-tree and the extensible hashing data structures.

Discussion PostgreSQL permit the creation of a variety of indexes on tables. We will review such index creation and examine their impact on data lookup and query processing. For more details, see the PostgreSQL manual:

https://www.postgresql.org/docs/13/indexes.html

Example 5 The following SQL statements create indexes on columns or combinations of columns of the personSkill relation.[10] Notice that there are

2arity(personSkill) − 1 = 22 − 1 = 3

such possible indexes.

create index pid_index on personSkill (pid);
-- index on pid attribute
create index skill_index on personSkill (skill);
-- index on skill attribute
create index pid_skill_index on personSkill (pid,skill);
-- index (pid, skill)
Example 6 It is possible to declare the type of index: btree or hash. When no index type is specified, the default is btree. If instead of a Btree, a hash index is desired, then it is necessary to specify a using hash qualifier: create index pid_hash on personSkill using hash (pid); -- hash index on pid attribute

Example 7 It is possible to create an index on a relation based on a scalar expression or a function defined over the attributes of that relation. Consider the following (immutable) function which computes the number of skills of a person:

create or replace function numberOfSkills(p integer) returns integer as

$$ select count(1)::int from    personSkill where pid = p;

$$ language SQL immutable;

Then the following is an index defined on the numberOfSkills values of persons:

create index numberOfSkills_index on personSkill (numberOfSkills(pid)); Such an index is useful for queries that use this function such as select pid, skill from personSkill where numberOfSkills(pid) > 2;

We now turn to the problems in this section.

5. • Consider a relation Student(sid text, sname text, major, year). A tuple (s,n,m,y) is in Student when s is the sid of a student and n, m, and y are that student’s name, major, and birth year. Further, consider a relation Enroll(sid text, cno text, grade text). A triple (s,c,g) is in Enroll when the student with sid s was enrolled in the course with cname c and obtained a letter grade g in that course. We are interested in answering queries of the form

select sid, sname, major, byear from Student where sid in (select sid from         Enroll sid

where cno = c [and|or|and not] grade = g);

Here c denotes a course name and g denotes a letter grade.

Read Section 14.1.7 ‘Indirection in Secondary Indexes’ in your textbook Database Systems The Complete Book by Garcia-Molina, Ullman, and Widom. Of particular interest are (a) the concept of buckets (Figure 14.7) which are sets of tids and (b) the technique of performing set operations (like intersections) on relevant buckets (Figure 14.8) to answer queries of the form as shown above.

The goal of this problem is to use object-relational SQL to simulate these concepts. To make things more concrete, consider the following Student and Enroll relations:

Student:

sid | sname | major | byear

------+--------+---------+------s100 | Eric | CS | 1988 s101 | Nick | Math | 1991 s102 | Chris | Biology | 1977

s103 | Dinska | CS
| 1978
s104 | Zanna | Math
| 2001
s105 | Vince | CS
| 2001
Enroll:

sid | cno | grade

------+------+------s100 | c200 | A s100 | c201 | B

s100 | c202 | A s101 | c200 | B s101 | c201 | A s101 | c202 | A s101 | c301 | C s101 | c302 | A s102 | c200 | B s102 | c202 | A s102 | c301 | B s102 | c302 | A s103 | c201 | A s104 | c201 | D

Now consider associating a tuple id (tid) with each tuple in Enroll:

tid | sid | cno | grade

-----+------+------+-------

1                  | s100 | c200 | A

2                  | s100 | c201 | B

3                  | s100 | c202 | A

4                  | s101 | c200 | B

5                  | s101 | c201 | A

6                  | s101 | c202 | A

7                  | s101 | c301 | C

8                  | s101 | c302 | A

9                  | s102 | c200 | B 10 | s102 | c202 | A

11   | s102 | c301 | B

12   | s102 | c302 | A

13   | s103 | c201 | A

14   | s104 | c201 | D

Use object-relational SQL to construct two secondary indexes indexOnCno and indexOnGrade on the Enroll relation. These indexes should be stored in two separate relations which you can conveniently call indexOnCno and indexOnGrade, respectively. two object-relational views in a manner that simulates the situation illustrated in Figure 14.8. In particular, do not use the ‘create index’ mechanism of SQL to construct these indexes.

Then, using the indexOnCno and indexOnGrade views and the technique of intersecting buckets, write a function FindStudents(booleanOperation text, cno text, grade text) that can be used to answer queries of the form as shown above. (Here the booleanOperation is a string which can be ’and’, ’or’, or ’and not’.) For example, the query

select * from FindStudents(’and’, ’c202’, ’A’); should return the same result as that of the query

select sid, sname, major, byear from Student where sid in (select sid from          Enroll sid where cno = ’c202’ and grade = ’A’);

Test your solution for the following cases on the Student and Enroll relation given for this problem:

(a)    select * from FindStudents(’and’, ’c202’, ’A’);

(b)    select * from FindStudents(’or’, ’c202’, ’A’);

(c)    select * from FindStudents(’and not’, ’c202’, ’A’);

6. Practice problem–not graded. Read Section 14.7 ‘Bitmap Indexes’ in your textbook Database Systems The Complete Book by Garcia-Molina, Ullman, and Widom. In particular, look at Example 14.39 for an example of a bitmap index for a secondary index.

Next, revisit Problem 5. There, we considered two secondary indexes indexOnCno and indexOnGrade. We will now consider the corresponding bitmap indexes bitmapIndexOnCno and bitmapIndexOnGrade:

bitmapIndexOnCno

               cno |             bit-vector

------+---------------c200 | 10010000100000 c201 | 01001000000011 c202 | 00100100010000 c301 | 00000010001000 c302 | 00000001000100

and

bitmapIndexOnGrade

              grade |            bit-vector

-------+----------------

A              | 10101101010110

B              | 01010000101000

C              | 00000010000000

D              | 00000000000001

Use object-relational SQL to construct two secondary indexes bitmapIndexOnCno and bitmapIndexOnGrade as two object-relational relations in a manner that simulates the situation just illustrated above.

Then, using the bitmapIndexOnCno and bitmapIndexOnGrade relations and the technique of forming the bitmap-AND, bitmap-OR, and bitmap-AND NOT of two bit-vectors, write a function FindStudents(booleanOperation text, cno text, grade text) that can be used to answer queries of the form as shown in Problem 5. For example, the query

select * from FindStudents(’and’, ’c202’, ’A’); should return the same result as that of the query

select sid, sname, major, byear from Student where sid in (select sid from          Enroll sid where cno = ’c202’ and grade = ’A’);

Test your solution for the following cases on the Student and Enroll relation given for this problem:

(a)    select * from FindStudents(’and’, ’c202’, ’A’);

(b)    select * from FindStudents(’or’, ’c202’, ’A’); (c) select * from FindStudents(’and not’, ’c202’, ’A’);

7.       • Consider the following parameters:

block size
=
4096 bytes
block-address size
=
9 bytes
block access time (I/O operation)
=
10 ms (micro seconds)
record size
=
150 bytes
record primary key size
=
8 bytes
Assume that there is a B+-tree, adhering to these parameters, that indexes 1 billion (109) records on their primary key values.

Provide answers to the following problems and show the intermediate computations leading to your answers.

(a)    Specify (in ms) the minimum time to determine if there is a record with key k in the B+-tree. (In this problem, you can not assume that a key value that appears in an non-leaf node has a corresponding record with that key value.)

(b)    Specify (in ms) the maximum time to insert a record with key k in the B+-tree assuming that this record was not already in the data file.

(c)    How large must main memory be to hold the first two levels of the B+-tree? How about the first three levels?

8.       • Consider the following B+-tree of order 2 that holds records with keys 2, 8, and 11.[11]

 

(a)    Show the contents of your B+-tree after inserting records with keys 4, 10, 12, 1, 7, and 5 in that order.

 
then n1 should be
k1|k2
and n2 should be
k3|
Strategy for splitting leaf nodes: when a leaf node n needs to be split into two nodes n1 and n2 (where n1 will point to n2), then use the rule that an even number of keys in n is moved into n1 and an odd number of keys is moved into n2. So if n becomes conceptually            and

(b)    Starting from your answer in question 8a, show the contents of your B+-tree after deleting records with keys 12, 2, and 11, in that order.

(c)    Starting from your answer in question 8b, show the contents of your B+-tree after deleting records with keys 5, 1, and 4, in that order.

9.       • Consider an extensible hashing data structure wherein (1) the initial global depth is set at 1 and (2) all directory pointers point to the same empty bucket which has local depth 0. So the hashing structure looks like this: Assume that a bucket can hold at most two records.

 

(a) Show the state of the hash data structure after each of the following insert sequences:[12]

i. records with keys 6 and 7. ii. records with keys 1 and 2. iii. records with keys 9 and 4.

iv. records with keys 8 and 3.

(b) Starting from the answer you obtained for Question 9a, show the state of the hash data structure after each of the following delete sequences:

i. records with keys 9 and 6. ii. records with keys 0 and 1. iii. records with keys 1 and 8.

As in the text book, the bit strings are interpreted starting from their left-most bit and continuing to the right subsequent bits.

The goal in the next problems is to study the behavior of indexes for various different sized instances[13] of the Person, worksFor, and Knows relations and for various queries:

10.   • Create an appropriate index on the worksFor relation that speedups the lookup query

select pid from worksFor where cname = c;

Here c is a company name.

Illustrate this speedup by finding the execution times for this query for various sizes of the worksFor relation.

11.   • Create an appropriate index on the worksFor relation that speedups the range query

select pid, cname from            worksFor where s1 <= salary and salary <= s2;

Here s1 and s2 are two salaries with s1 < s2.

Illustrate this speedup by finding the execution times for this query for various sizes of the worksFor relation.

12.   • Create indexes on the worksFor relation that speedup the multiple conditions query

select pid from          worksFor where salary = s and cname = c;

Here s is a salary and c is a company name.

Illustrate this speedup by finding the execution times for this query for various sizes of the worksFor relation.

13.   • Create indexes on the appropriate relations that speedup the semi-join [anti semi-join] query

select pid, pname from            Person

where pid [not] in (select pid from worksFor where cname = c);

Here c is a company name.

Illustrate this speedup by finding the execution times for this query for various sizes of the Person and worksFor relations.

14.   Practice problem–not graded.

Create indexes that speedup the path-discovery query

select distinct k1.pid1, k3.pid2 from     knows k1, knows k2, knows k3 where k1.pid2 = k2.pid1 and k2.pid2 = k3.pid1;

Illustrate this speedup by finding the execution times for this query for various sizes of the Knows relation.


 
[1] Typically the function SetOfIntegers returns a bag (multiset) but this is fine for this assignment. In case we want a set, we can always eliminate duplicates.
[2] Clearly, all of this can be generalized to higher-arity relations.
[3] When using this function, it is customary to use n such that n ∈ [0,u1− l1 + 1].
[4] Notice that the sum of the probabilities in the probability column in P sum to 1.
[5] You should make it a habit to use the PostgreSQL vacuum function to perform garbage collection between experiments.
[6] Recall that 1ms issecond.
[7] It is possible that you may not be able to run the experiments for the largest S. If that is the case, just report the results for the smaller sizes.
[8] The default working memory size is 4MB.
[9] For     information                    about           indexes       in                    PostgreSQL                     consult       the                 manual       page https://www.postgresql.org/docs/13/indexes.html.
[10] Incidentally, when a primary key is declared when a relation is created, PostgreSQL will create a btree index on this key for the relation.
[11] Recall that (a) an internal node of a B+-tree of order 2 can have either 1 or 2 keys values, and 2 or 3 sub-trees, and (b) a leaf node can have either 1 or 2 key values.
[12] You should interpret the key values as bit strings of length 4. So for example, key value 7 is represented as the bit string 0111 and key value 2 is represented as the bit string 0010.
[13] (Three different sizes, small, medium, large suffice for your experiment.

More products