$25
This assignment relies on the lectures
• SQL Part 1 and SQL Part 2 (Pure SQL);
• Relational Algebra (RA);
• joins and semijoins;
• Translating Pure SQL queries into RA expressions; and
• Query optimization
with particular focus on the last two lectures.
You should use the Assignment-Script-2021-Fall-Assignment3.sql file to construct the assignment3.sql file. (note that the data to be used for this assignment is included in this file.) In addition, you will need to upload a separate assignment3.txt file that contains the results of running your queries. Finally, you need to upload a file assignment3.pdf that contains the solutions to the problems that require it.
The problems that need to be included in the assignment3.sql are marked with a blue bullet •. The problems that need to be included in the assignment3.pdf are marked with a red bullet •. (You should aim to use
Latex to construct your .pdf file.)
Database schema and instances
For the problems in this assignment we will use the following database schema:[1]
Person(pid, pname, city)
Company(cname,headquarter) Skill(skill) worksFor(pid, cname, salary) companyLocation(cname,city) personSkill(pid,skill) hasManager(eid,mid) Knows(pid1,pid2)
In this database we maintain a set of persons (Person), a set of companies (Company), and a set of (job) skills (Skill). The pname attribute in Person is the name of the person. The city attribute in Person specifies the city in which the person lives. The cname attribute in Company is the name of the company. The headquarter attribute in Company is the name of the city wherein the company has its headquarter. The skill attribute in Skill is the name of a (job) skill.
A person can work for at most one company. This information is maintained in the worksFor relation. (We permit that a person does not work for any company.) The salary attribute in worksFor specifies the salary made by the person.
The city attribute in companyLocation indicates a city in which the company is located. (Companies may be located in multiple cities.)
A person can have multiple job skills. This information is maintained in the personSkill relation. A job skill can be the job skill of multiple persons. (A person may not have any job skills, and a job skill may have no persons with that skill.)
A pair (e,m) in hasManager indicates that person e has person m as one of his or her managers. We permit that an employee has multiple managers and that a manager may manage multiple employees. (It is possible that an employee has no manager and that an employee is not a manager.) We further require that an employee and his or her managers must work for the same company.
The relation Knows maintains a set of pairs (p1,p2) where p1 and p2 are pids of persons. The pair (p1,p2) indicates that the person with pid p1 knows the person with pid p2. We do not assume that the relation Knows is symmetric: it is possible that (p1,p2) is in the relation but that (p2,p1) is not.
The domain for the attributes pid, pid1, pid2, salary, eid, and mid is integer. The domain for all other attributes is text. We assume the following foreign key constraints:
• pid is a foreign key in worksFor referencing the primary key pid in Person;
• cname is a foreign key in worksFor referencing the primary key cname in Company;
• cname is a foreign key in companyLocation referencing the primary key cname in Company;
• pid is a foreign key in personSkill referencing the primary key pid in Person;
• skill is a foreign key in personSkill referencing the primary key skill in Skill;
• eid is a foreign key in hasManager referencing the primary key pid in Person;
• mid is a foreign key in hasManager referencing the primary key pid in Person;
• pid1 is a foreign key in Knows referencing the primary key pid in Person; and
• pid2 is a foreign key in Knows referencing the primary key pid in
Person
Pure SQL and RA SQL
In this assignemt, we distinguish between Pure SQL and RA SQL. Below we list the only features that are allowed in Pure SQL and in RA SQL.
In particular notice that
• join, NATURAL join, and CROSS join are not allowed in Pure SQL.
• The predicates [not] IN, SOME, ALL, [not] exists are not allowed in RA SQL.
The only features allowed in Pure SQL
select ... from ... where WITH ...
union, intersect, except operations exists and not exists predicates
IN and not IN predicates
ALL and SOME predicates
VIEWs that can only use the above RA SQL features
The only features allowed in RA SQL
select ... from ... where WITH ...
union, intersect, except operations
join ... ON ..., natural join, and CROSS join operations VIEWs that can only use the above RA SQL features commas in the from clause are not allowed
1 Theoretical problems related to query translation and optimization
1. Consider two RA expressions E1 and E2 over the same schema. Furthermore, consider an RA expression F with a schema that is not necessarily the same as that of E1 and E2. Consider the following if-then-else query:
if F = ∅ then return E1 else return E2
So this query evaluates to the expression E1 if F = ∅ and to the expression E2 if F 6= ∅.
We can formulate this query in SQL as follows[2]:
select e1.*
from E1 e1
where not exists (select distinct row() from F) union select e2.*
from E2 e1
where exists (select distinct row() from F); Remark 1 The subquery query
select distinct row() from F returns the empty set if F = ∅ and returns the tuple () if F 6= ∅.[3]In RA, this query can be written as
π()(F).
I.e., the projection of F on an empty list of attributes.
• In function of E1, E2, and F, write an RA expression in standard notation that expresses this if-then-else query.[4]
2. Let R(x) be a unary relation that can store a set of integers R.
Consider the following boolean SQL query:
select not exists(select 1
from R r1, R 2
where r1.x <> r2.x) as fewerThanTwo;
This boolean query returns the constant “true” if R has fewer than two elements and returns the constant “false” otherwise.
• Using the insights you gained from Problem 1, write an RA expression in standard notation that expresses the above boolean SQL query.[5]
3. In the translation algorithm from Pure SQL to RA we tacitly assumed that the argument of each set predicate was a (possibly parameterized) Pure SQL query that did not use a union, intersect, nor an except operation.
In this problem, you are asked to extend the translation algorithm from Pure SQL to RA such that the set predicates [not] exists are eliminated that have as an argument a Pure SQL query (possibly with parameters) that uses a union, intersect, or except operation.
More specifically, consider the following types of queries using the [not] exists set predicate.
select L1(r1,...,rn)
from R1 r1, ..., Rn rn where C1(r1,...,rn) and
[not] exists (select L2(s1,...,sm) from S1 s1,..., S1 sm
where C2(s1,...,sm,r1,...,rn) [union | intersect | except] select L3(t1,...,tk)
from T1 t1, ..., Tk tk
where C3(t1,...,tk,r1,...,rn)) Observe that there are six cases to consider:
(a) exists (... union ...)
(b) exists (... intersect ...)
(c) exists (... except ...)
(d) not exists (... union ...)
(e) not exists (... intersect ...)
(f) not exists (... except ...)
• Show how such SQL queries can be translated to equivalent RA expressions in standard notation. Be careful in the translation since you should take into account that projections do not in general distribute over intersections and over set differences.
To get practice, first consider the following special case where n = 1, m = 1, and k = 1. I.e., the following case: [6]
select L1(r)
from R r
where C1(r) [not] exists (select L2(s)
from S s
where C2(s,r) [union | intersect | except] select L3(t)
from T t
where C3(t,r))
4. • Let R be a relation with schema (a,b,c) and let S be a relation with schema (d,e).
Prove, from first principles[7], the correctness of the following rewrite rule:
πa,d(R ./c=d S) = πa,d(πa,c(R) ./c=d πd(S)).
5. • Consider the same rewrite rule
πa,d(R ./c=d S) = πa,d(πa,c(R) ./c=d πd(S)) as in problem 4.
Furthermore assume that S has primary key d and that R has foreign key c referencing this primary key in S.
How can you simplify this rewrite rule? Argue why this rewrite rule is correct.
2 Translating Pure SQL queries to RA expressions and optimized RA expressions
In this section, you are asked to translate Pure SQL queries into RA SQL queries as well as in standard RA expressions using the translation algorithm given in class. You are required to show the intermediate steps that you took during the translation. After the translation, you are asked to optimize the resulting RA expressions.
You can use the following letters, or indexed letters, to denote relation names in RA expressions:
P, P1, P2,···Person
C, C1, C2,···Company
S, S1, S2,···Skill
W, W1, W2,···worksFor cL, cL1, cL2,···companyLocation pS, pS1, pS2,···personSkill
hM, hM1, hM2,···hasManager
K, K1, K2,···Knows
We illustrate what is expected using an example.
Example 1 Consider the query “Find each (p,c) pair where p is the pid of a person who works for a company c located in Bloomington and whose salary is the lowest among the salaries of persons who work for that company.
A possible formulation of this query in Pure SQL is
select w.pid, w.cname from worksfor w where w.cname in (select cl.cname from companyLocation cl where cl.city = ’Bloomington’) and
w.salary <= ALL (select w1.salary from worksfor w1 where w1.cname = w.cname);
which is translated to[8]
select q.pid, q.cname from (select w.* from worksfor w where w.cname in (select cl.cname from companyLocation cl where cl.city = ’Bloomington’)
intersect
select w.* from worksfor w where w.salary <= ALL (select w1.salary from worksfor w1 where w1.cname = w.cname)) q;
which is translated to[9]
select q.pid, q.cname from (select w.* from worksfor w, companyLocation cl where w.cname = cl.cname and cl.city = ’Bloomington’ intersect
(select w.* from worksfor w
except select w.* from worksfor w, worksfor w1
where w.salary > w1.salary and w1.cname = w.cname)) q; which is translated to[10]
select q.pid, q.cname from (select w.* from worksfor w, (select cl.* from companyLocation cl where cl.city = ’Bloomington’) cl where w.cname = cl.cname intersect
(select w.* from worksfor w
except select w.* from worksfor w, worksfor w1
where w.salary > w1.salary and w1.cname = w.cname)) q;
which is translated to the RA SQL query[11]
select q.pid, q.cname from (select w.* from worksfor w natural join (select cl.* from companyLocation cl where cl.city = ’Bloomington’) cl intersect
(select w.* from worksfor w
except select w.*
from worksfor w join worksfor w1 on (w.salary > w1.salary and w1.cname = w.cname))) q;
This RA SQL query can be formulated as an RA expression in standard notation as follows:
πW.pid,W.cname(E ∩ (W − F))
where
E = πW.∗(W ./ σcity=Bloomington(cL))
and
F = πW.∗(W ./W.salary>W1.salary ∧W1.cname=W.cname W1).
We can now commence the optimization.
Step 1 Observe the expression E∩(W −F). This expression is equivalent with (E∩W)−F. Then observe that, in this case, E ⊆ W. Therefore E∩W = E, and therefore E∩(W −F) can be replaced by E − F. So the expression for the query becomes
πW.pid,W.cname(E − F).
Step 2 We now concentrate on the expression
E = πW.∗(W ./ σcity=Bloomington(cL)).
We can push the projection over the join and get
πW.∗(W ./ πcname(σcity=Bloomington(cL))).
Which further simplifies to
W n σcity=Bloomington(cL).
We will call this expression Eopt.
Step 3 We now concentrate on the expression
F = πW.∗(W ./W.salary>W1.salary ∧W1.cname=W.cname W1).
We can push the projection over the join and get the expression
πW.∗(W ./W.salary>W1.salary ∧W1.cname=W.cname πW1.cname,W1.salary(W1)).
We will call this expression Fopt.
Therefore, the fully optimized RA expression is
πW.pid,W.cname(Eopt − Fopt).
I.e.,
πW.pid,W.cname(W n σcity=Bloomington(cL)−
πW.∗(W ./W.salary>W1.salary ∧W1.cname=W.cname πW1.cname,W1.salary(W1))).
We now turn to the problems in this section.
6. Consider the query “Find the cname and headquarter of each company that employs persons who earn less than 55000 and who do not live in Bloomington.”
A possible way to write this query in Pure SQL is
select c.cname, c.headquarter from company c where c.cname in (select w.cname from worksfor w where w.salary < 55000 and
w.pid = SOME (select p.pid
from person p where p.city <> ’Bloomington’));
(a) • Using the Pure SQL to RA SQL translation algorithm, translate this Pure SQL query to an equivalent RA SQL query. Show the translation steps you used to obtain your solution.
(b) • Optimize this RA SQL query and provide the optimized expression in standard RA notation. Specify at least three conceptually different rewrite rules that you used during the optimization.
7. Consider the query “Find the pid of each person who has all-butone job skill.”
A possible way to write this query in Pure SQL is
select p.pid from person p where exists (select 1
from skill s
where (p.pid, s.skill) not in (select ps.pid, ps.skill from personSkill ps)) and
not exists (select 1 from skill s1, skill s2 where s1.skill <> s2.skill and
(p.pid, s1.skill) not in (select ps.pid, ps.skill from personSkill ps) and
(p.pid, s2.skill) not in (select ps.pid, ps.skill
from personSkill ps));
(a) • Using the Pure SQL to RA SQL translation algorithm, translate this Pure SQL query to an equivalent RA SQL query. Show the translation steps you used to obtain your solution.
(b) • Optimize this RA SQL query and provide the optimized expression in standard RA notation. Specify at least three conceptually different rewrite rules that you used during the optimization.
8. Consider the query “Find the pid and name of each person who works for a company located in Bloomington but who does not know any person who lives in Chicago.”
A possible way to write this query in Pure SQL is
select p.pid, p.pname from person p where exists (select 1 from worksFor w, companyLocation c where p.pid = w.pid and w.cname = c.cname and c.city = ’Bloomington’) and
p.pid not in (select k.pid1
from knows k where exists (select 1 from person p
where k.pid2 = p.pid and p.city = ’Chicago’));
(a) • Using the Pure SQL to RA SQL translation algorithm, translate this Pure SQL query to an equivalent RA SQL query. Show the translation steps you used to obtain your solution.
(b) • Optimize this RA SQL query and provide the optimized expression in standard RA notation. Specify at least three conceptually different rewrite rules that you used during the optimization.
9. Consider the query “Find the cname and headquarter of each company that (1) employs at least one person and (2) whose workers who make at most 70000 have both the programming and AI skills.”
A possible way to write this query in Pure SQL is
select c.cname, c.headquarter from company c where exists (select 1 from worksfor w where w.cname = c.cname) and not exists (select 1 from worksfor w where w.cname = c.cname and w.salary <= 70000 and
(w.pid not in (select ps.pid from personskill ps where skill = ’Programming’) or
w.pid not in (select ps.pid from personskill ps where skill = ’AI’)));
(a) • Using the Pure SQL to RA SQL translation algorithm, translate this Pure SQL query to an equivalent RA SQL query. Show the translation steps you used to obtain your solution.
(b) • Optimize this RA SQL query and provide the optimized expression in standard RA notation. Specify at least three conceptually different rewrite rules that you used during the optimization.
10. Consider the following Pure SQL query.
select p.pid, exists (select 1 from hasManager hm1, hasManager hm2 where hm1.mid = p.pid and hm2.mid = p.pid and hm1.eid <> hm2.eid)
from Person p;
This query returns a pair (p,t) if p is the pid of a person who manages at least two persons and returns the pair (p,f) otherwise.[12]
(a) • Using the insights gained from Problem 2, translate this Pure SQL query to an equivalent RA SQL query.
(b) • Optimize this RA SQL query and provide the optimized expression in standard RA notation.
[1] The primary key, which may consist of one or more attributes, of each of these relations is underlined.
[2] In this SQL query E1, E2, and F denote SQL queries corresponding to the RA expressions E1, E2, and F, respectively.
[3] The tuple () is often referred to as the empty tuple, i.e., the tuple without components. It is akin to the empty string in the theory of formal languages. I.e., the string wihout alphabet characters.
[4] Hint: consider using the Pure SQL to RA SQL translation algorithm.
[5] Hint: recall that, in general, a constant value “a” can be represented in RA by an expression of the form (A: a). (Here, A is some arbitrary attribute name.) Furthermore, recall that we can express (A: a) in SQL as “select a as A”. Thus RA expressions for the constants “true” and “false” can be the expressions (A: true) and (A: false), respectively.
[6] Once you can handle this case, the general case is a similar.
[7] In particular, do not use the rewrite rule of pushing projections over joins. Rather, use Predicate Logic or TRC to provide a proof.
[8] Translation of ‘and’ in the ‘where’ clause.
[9] Translation of ‘in’ and ‘<= ALL’.
[10] Move ’constant’ condition.
[11] Introduction of natural join and join.
[12] t represent the boolean value true and f represents the boolean value false.