$30
PROJECT 2: UNDERSTANDING YOUR FRIENDLY QUERY OPTIMIZER
CX4031 DATABASE SYSTEM PRINCIPLES
A DBMS query optimizer executes a query execution plan (QEP) to process a query. As a database administrator, you would like to understand and visualize the performance of the query optimizer for processing queries. This is a typical task an administrator may perform in an organization to comprehend why certain queries or applications are slow. Fortunately, you do not have to write a software to do it. There is a tool called PICASSO (http://dsl.cds.iisc.ac.in/projects/PICASSO/) that can aid you to understand and visualize the performance of a query optimizer.
PROJECT DESCRIPTION
Download the PICASSO tool and install it on your database. Guideline for installing it detailed in the Appendix. Note that you may need to update the JDBC driver to make it work. Specifically, you will use the PICASSO tool to achieve the following two key goals of the project:
(a) Generate and describe the plan, cost, and cardinality diagrams for a set of queries. You should use the TPC-H benchmark data and queries described below. You are encouraged to read the software documentation of PICASSO
(https://dsl.cds.iisc.ac.in/projects/PICASSO/picasso_download/doc/Picasso2Doc .pdf) as well related information in the PICASSO website to get familiarize with the concepts related to plan, cost, and cardinality estimation diagrams.
(b) Design and implement an algorithm that takes as input a query, retrieves its query execution plan, and returns as output an explanation (i.e., reason) that describes the reason why the underlying RDBMS selected this plan among many others. You should exploit the PICASSO tool to generate the explanation. Note that the structure of explanation is open-ended. It can be in the form of natural language or in a more structured form. You are encouraged to develop a visual interface that facilitates user-friendly interaction with your software.
Your goal is to ensure generality of the solution (i.e., it can handle a wide variety of queries) and the explanation should be precise and accurate. Hint: You need to transform the input query to a PICASSO query template in order to generate explanation. Check the guideline in the Appendix to see which part of the PICASSO code you need to understand.
You should use Python or Java as the host language on Windows platform for your project. The DBMS allowed in this project is PostgreSQL. The dataset and queries you should use for this project is TPC-H (see Appendix). You are free to use any off-theshelf toolkits for your project.
For students using Mac platform, you can install Windows on your Mac by following instructions in https://support.apple.com/en-sg/HT201468.
Note that several parts of the project are left open-ended (e.g., what will be the content of the explanation? how the GUI should look like? What are the functionalities we should support?) deliberately so that the project does not curb a group’s creative endeavors. You are free to make realistic assumptions to achieve these tasks.
I.
Creating TPC-H database in PostgreSQL
Follow the following steps to generate the TPC-H data:
1) Go to http://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp and download TPC-H Tools v2.18.0.zip. Note that the version may defer as the tool may have been updated by the developer.
2) Unzip the package. You will find a folder “dbgen” in it.
3) To generate an instance of the TPC-H database:
§ Open up tpch.vcproj using visual studio software.
§ Build the tpch project. When the build is successful, a command prompt will appear with “TPC-H Population Generator <Version 2.17.3>” and several *.tbl files will be generated. You should expect the following .tbl files: customer.tbl, lineitem.tbl, nation.tbl, orders.tbl, part.tbl, partsupp.tbl, region.tbl, supplier.tbl
§ Save these .tbl files as .csv files
§ These .csv files contain an extra “|” character at the end of each line. These “|” characters are incompatible with the format that PostgreSQL is expecting. Write a small piece of code to remove the last “|” character in each line. Now you are ready to load the .csv files into PostgreSQL § Open up PostgreSQL. Add a new database “TPC-H”.
§ Create new tables for “customer”, “lineitem”, “nation”, “orders”, “part”,
“partsupp”, “region” and “supplier”
§ Import the relevant .csv into each table. Note that pgAdmin4 for PostgreSQL (windows version) allows you to perform import easily. You can select to view the first 100 rows to check if the import has been done correctly. If encountered error (e.g., ERROR: extra data after last expected column) while importing, create columns of each table first before importing. Note that the types of each column has to be set appropriately. You may use the SQL commands in Appendix II to create the tables.
II.
SQL commands for creating TPC-H data tables
Region table
1) Nation table
2) Part table
3) Supplier table
4) Partsupp table
5) Customer table
6) Orders table
7) Lineitem table
III.
Installing and Using PICASSO
1) Download picasso2.1.zip from https://dsl.cds.iisc.ac.in/projects/PICASSO/picasso_download/license.htm
Follow installation instruction in:
https://dsl.cds.iisc.ac.in/projects/PICASSO/picasso_download/doc/Installation/installati on.htm Note:
Install Java 3D 1.5.1 (https://www.oracle.com/java/technologies/java-archivedownloads-java-client-downloads.html#java3d-1.5.1-oth-JPR).
Upon installation, copy the three .jar (j3dcore.jar, j3dutils.jar and vecmath.jar) and j3dcore-ogl.dll (for Windows) to ..\picasso2.1\picasso2.1\Libraries.
2) Guide on how to use the PICASSO GUI is found here: https://dsl.cds.iisc.ac.in/projects/PICASSO/picasso_download/doc/Usage/controls.h tm
Note: For Part (b) of the project, you would need to explore the code of PICASSO. A good place to start is PicassoDiagram.java inside the PicassoServer folder. The .java provides the logic of generating the diagrams used in PICASSO GUI.