Scope
The objectives of Assignment 2 include conceptual modelling of a data warehouse, implementation of 0NF tables in HQL, implementation of external tables in HQL, and querying a data cube.
This assignment is worth 30% of the total evaluation in the subject.
Only electronic submission through Moodle at:
https://moodle.uowplatform.edu.au/login/index.php
will be accepted. All email submissions will be deleted, and a mark of 0 ("zero") will be immediately granted for Assignment 2. The submission procedure is explained at the end of the Assignment 2 specification.
Only one submission of Assignment 2 per student is accepted.
A submission that has an incorrect file attached is treated as a valid submission, with all the consequences that follow from the evaluation of the attached file.
All files left on Moodle in the state "Draft (not submitted)" will not be evaluated.
A submission of compressed files (zipped, gzipped, rared, tarred, 7-zipped, lhzed, etc.) is not allowed. Compressed files will not be evaluated.
Task 1 (6 marks) Intuitive design of a data cube from a conceptual schema of an operational database
Consider the following conceptual schema of an operational database owned by a multinational real estate company. The database contains information about the real estate properties offered for sale, the owners of the properties, the potential buyers who are interested in the properties, and the real estate agents involved in the sale of the properties.
Whenever a property is put on the market by an owner, a description of the property is entered into the operational database. Whenever a property is purchased, its description is removed from the operational database.
The real estate company would like to create a data warehouse to keep information about the finalized real estate transactions, properties involved in the transactions, sellers/owners, and agents involved in the real estate transactions. The real estate company would like to use a data warehouse to implement the following classes of analytical applications.
(1) Find the total number of real estate properties sold per month, year, street, city, country, agent involved.
(2) Find the average asking price of real estate properties sold per month, year, street, city, country, agent involved.
(3) Find the average final price of real estate properties sold per month, year, street, city, country, agent involved.
(4) Find the average period of time on the market of real estate properties sold per month, year, street, city, country, agent involved.
(5) Find the total number of times each real estate property has been sold in a given period of time.
(6) Find the total number of buyers interested in purchases of real estate properties sold per day, month, year, street, city, country, agent involved.
Note that the operational database does not contain all the information necessary to implement the classes of applications listed above. The additional information must be added when data is transferred from the operational database to the data warehouse.
(1) Use the short explanation of the database domain and the conceptual schema given above to find a data cube that should be implemented by the multinational real estate company to create a data warehouse. In your specification of the data cube, list the facts, the measures, the names of the dimensions, and the hierarchies.
(2) Pick any three dimensions from the data cube found in the previous step, at least 4 values in each dimension, and one measure, and draw a sample three-dimensional data cube in a perspective view similar to the view included in presentation 09 Data Warehouse Concepts, slide 6.
Deliverables
A file solution1.pdf that contains
(1) a specification of a data cube as a list of facts, measures, dimensions, and hierarchies obtained as a result of task (1),
(2) a perspective drawing of a three-dimensional data cube obtained as a result of task (2).
Task 2 (7 marks) Conceptual modelling of a data warehouse
An objective of this task is to create a conceptual schema of a sample data warehouse domain described below. Read and analyse the following specification of a data warehouse domain.
A person is represented as either a patient, a medical worker, or an administration worker. Medical and administration workers work in medical facilities that have a name, an address, and possibly (not obligatorily) a specialization. Each medical worker is described by a unique staff number at a facility, a name, an address, and a phone number.
A patient is eligible for his or her company health care benefits. Patient data includes name, id number (social security number), address (street, city, state, zip), and phone.
A medical worker must hold one or more credentials that are granted to work in a particular medical facility. Doctors are allowed to deliver diagnoses and give treatment based on their specialization. Paramedics are allowed to deliver only emergency diagnosis and treatment for any type of life-threatening problem. Nurses do not deliver diagnoses, but they do participate in treatment, particularly if the patient must be prepared for surgery or remain at the facility overnight.
Medical facilities are located in different suburbs of different cities. A medical facility is uniquely identified by an address.
A data warehouse must be designed such that it is possible to easily implement the following classes of applications.
The management of the medical facilities would like to get from the data warehouse information about:
- the total number of medical services performed per medical facility, per year, per month, per day, per city, and per medical worker,
- the total length of medical services per medical facility, per year, per month, per day, per city, and per medical worker,
- the average length of medical services per medical facility, per year, per month, per day, per city, and per medical worker,
- the total number of doctors/paramedics/nurses involved in medical services per year, per month, per day, per medical facility, and per city,
- the average time spent on medical services per year, month, and day,
- the total costs of medical services per year, month, and day, per medical facility, and per city.
To create a conceptual schema of a sample data warehouse domain, follow the steps listed below.
Step 1 Find a fact entity, find the measures describing a fact entity.
Step 2 Find the dimensions.
Step 3 Find the hierarchies over the dimensions.
Step 4 Find the descriptions (attributes) of all entity types.
Step 5 Draw a conceptual schema.
To draw a conceptual schema, you must use the graphical notation explained to you in the presentation 11 Conceptual Data Warehouse Design.
To draw your diagram, you can use the UMLet diagram drawing tool and apply the "Conceptual modelling" notation. Selection of a drawing notation is available in the right upper corner of the main menu of the UMLet diagram drawing tool. UMLet 14.3 software can be downloaded from the subject's Moodle Web site in the section WEB LINKS. A neat hand drawing is still all right.
Deliverables
A file solution2.pdf with a drawing of a conceptual schema of a sample data warehouse domain.
Task 3 (6 marks) Implementation of a table with a complex column type (0NF table) in Hive
(1) Implement an HQL script solution3.hql that creates a relational table with complex column types to store information about the employees, the projects they participate in, and the programming languages they know (a sketch of such a script is given after step (3)). For example, the lines of data describing the employees may look like the following:
008,Harry Potter|DB/3:70,Oracle:75|
010,Robin Banks| |C,Rust
009,Robin Hood| | ...
(2) Include into the script INSERT statements that load sample data into the table. Insert at least 5 rows into the relational table created in the previous step. Two employees must participate in a few projects and must know a few programming languages. One employee must participate in a few projects and must not know any programming languages. One employee must know a few programming languages and must not participate in any projects. One employee must not know any programming languages and must not participate in any projects.
(3) Include into the script SELECT statements that list the contents of the table.
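A minimal sketch of such a script is given below. It assumes that the three '|'-separated fields of a data line hold a structure with an employee number and a name, a map from project names to participation percentages, and an array of programming languages; the table name, the column names, and the sample employee are assumptions, not part of the task.

-- a sketch only: all names and sample values below are assumptions
CREATE TABLE EMPLOYEE(
  EMP       STRUCT<ID: STRING, NAME: STRING>,
  PROJECTS  MAP<STRING, INT>,
  LANGUAGES ARRAY<STRING> )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

-- a hypothetical employee on two projects who knows two languages
INSERT INTO TABLE EMPLOYEE
SELECT named_struct('ID', '007', 'NAME', 'James Bond'),
       map('DB/3', 20, 'Oracle', 80),
       array('COBOL', 'PL/I');

-- list the contents of the table
SELECT * FROM EMPLOYEE;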
When ready, use the command line interface beeline to process the script solution3.hql and to save a report from the processing in a file solution3.rpt.
If the processing of the file returns errors, then you must eliminate the errors! The processing of your script must return NO ERRORS! A solution with errors is worth no marks!
Deliverables
A file solution3.rpt with a report from the processing of the HQL script solution3.hql. The report MUST NOT include any errors, and the report must list all SQL statements processed.
Task 4 (6 marks) Implementation of a data warehouse as a collection of external tables in Hive
Consider the following two-dimensional data cube.
The data cube contains information about the parts that can be shipped by the suppliers.
Download and unzip a file task4.zip. You should obtain a folder task4 with the following files: part.tbl, supplier.tbl, partsupp.tbl.
Use an editor to examine the contents of the *.tbl files. Note that the contents of the files can be loaded into the relational tables PART, SUPPLIER, and PARTSUPP obtained from the transformation of the two-dimensional data cube given above into relational tables.
Transfer the files into HDFS.
Implement an HQL script solution4.hql that creates the external tables obtained from the step of logical design performed earlier. The external tables must be defined over the files transferred to HDFS in the previous step. Note that a header in each *.tbl file must be removed before the external tables are created.
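As a sketch only: assuming the files have been copied into one HDFS folder per table, that the values in the files are separated by '|', and that the columns of supplier.tbl follow the usual TPC-H layout (the location and the column list below are assumptions to be verified against the actual files), one of the external tables could be created as follows.

-- a sketch only: the HDFS location and the column list are assumptions
CREATE EXTERNAL TABLE SUPPLIER(
  S_SUPPKEY   BIGINT,
  S_NAME      STRING,
  S_ADDRESS   STRING,
  S_NATIONKEY BIGINT,
  S_PHONE     STRING,
  S_ACCTBAL   DECIMAL(12, 2),
  S_COMMENT   STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/bigdata/task4/supplier';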
Include into the script solution4.hql SELECT statements that list any 5 rows from each one of the external tables implemented in the previous step and the total number of rows included in each table.
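For each external table, a pair of statements in the following style is sufficient (shown here for the hypothetical SUPPLIER table sketched above):

SELECT * FROM SUPPLIER LIMIT 5;
SELECT COUNT(*) FROM SUPPLIER;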
When ready, use the command line interface beeline to process the script solution4.hql and to save a report from the processing in a file solution4.rpt.
Deliverables
A file solution4.rpt with a report from the processing of the HQL script solution4.hql.
Task 5 (5 marks) Querying a data cube
Download a file task5.zip and unzip the file. You should obtain a folder task5 with the following files: dbcreate.hql, dbdrop.hql, partsupp.tbl, lineitem.tbl, and orders.tbl.
A file orders.tbl contains information about the orders submitted by the customers. A file lineitem.tbl contains information about the items included in the orders. A file partsupp.tbl contains information about the items and suppliers of items included in the orders.
Open a Terminal window and use the cd command to navigate to the folder with the just-unzipped files. Start Hive Server 2 in the terminal window (remember to start Hadoop first). When ready, process the script file dbcreate.hql to create the internal relational tables and to load data into the tables. You can use either beeline or SQL Developer. The script dbdrop.hql can be used to drop the tables.
The relational tables PARTSUPP, LINEITEM, and ORDERS implement a simple two-dimensional data cube. The relational tables PARTSUPP and ORDERS implement the dimensions of parts supplied by suppliers and of orders. The relational table LINEITEM implements a fact entity of the data cube.
(1) For each part list its key (PS_PARTKEY), all its available quantities (PS_AVAILQTY), the smallest available quantity, and the average available quantity. Consider only the parts with the keys 5 and 15. (A sketch of the windowing pattern used by the queries in this style is given after the list.)
(2) For each part list its key (PS_PARTKEY), all its available quantities (PS_AVAILQTY), and the sum of all its available quantities. List the results in the increasing order of the results of summarization. Consider only the parts with the keys 5, 10, 15, and 25.
(3) For each part list its key (PS_PARTKEY), all its available quantities (PS_AVAILQTY), and the sum of all its available quantities. List the results in the increasing order of the results of summarization and of the supply costs (PS_SUPPLYCOST). Consider only the parts with the keys 5, 10, 15, and 25.
(4) For each part list its key (PS_PARTKEY), all its available quantities (PS_AVAILQTY) sorted in descending order, and a rank (a position number in an ascending order) of each quantity. Consider only the parts with the keys 10 and 20. Use the analytic function ROW_NUMBER().
(5) For each part list its key (PS_PARTKEY), its available quantity (PS_AVAILQTY), and the average available quantity of the current quantity and all previous quantities in the ascending order of available quantities. Consider only the parts with the keys 15 and 25. Use a ROWS UNBOUNDED PRECEDING sub-clause within the OVER clause.
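As a minimal sketch of the windowing pattern, assuming that the PARTSUPP table created by dbcreate.hql uses the column names listed above, a query in the style of (1) and the running average of query (5) could be approached as follows.

-- query (1) style: per-part window aggregates listed next to each row
SELECT PS_PARTKEY,
       PS_AVAILQTY,
       MIN(PS_AVAILQTY) OVER (PARTITION BY PS_PARTKEY) AS MIN_QTY,
       AVG(PS_AVAILQTY) OVER (PARTITION BY PS_PARTKEY) AS AVG_QTY
FROM PARTSUPP
WHERE PS_PARTKEY IN (5, 15);

-- query (5) style: a running average over the current and all previous rows
SELECT PS_PARTKEY,
       PS_AVAILQTY,
       AVG(PS_AVAILQTY) OVER (PARTITION BY PS_PARTKEY
                              ORDER BY PS_AVAILQTY
                              ROWS UNBOUNDED PRECEDING) AS RUNNING_AVG
FROM PARTSUPP
WHERE PS_PARTKEY IN (15, 25);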
When ready, save your SELECT statements in a file solution5.hql. Then, process a script file solution5.hql and save the results in a report solution5.rpt.
If the processing of the file returns errors, then you must eliminate the errors! The processing of your script must return NO ERRORS! A solution with errors is worth no marks!
Deliverables
A file solution5.rpt with a report from the processing of the HQL script solution5.hql. The report MUST NOT include any errors, and the report must list all SQL statements processed.
Submission of Assignment 2
Note that you have only one submission, so make absolutely sure that you submit the correct files with the correct contents. No other submission is possible!
Submit the files solution1.pdf, solution2.pdf, solution3.rpt, solution4.rpt, and solution5.rpt through Moodle in the following way:
(1) Access Moodle at http://moodle.uowplatform.edu.au/
(2) To log in, use the Login link located in the right upper corner of the Web page or in the middle of the bottom of the Web page
(3) When logged in, select the site ISIT312 (SP222) Big Data Management
(4) Scroll down to the section SUBMISSIONS
(5) Click on the link In this place you can submit the outcomes of your work on the tasks included in Assignment 2
(6) Click on the button Add Submission
(7) Move a file solution1.pdf into the area You can drag and drop files here to add them. You can also use the link Add…
(8) Repeat step (7) for the remaining files solution2.pdf, solution3.rpt, solution4.rpt, and solution5.rpt
(9) Click on the button Save changes
(10) Click on the button Submit assignment
(11) Click on the checkbox with the text attached: By checking this box, I confirm that this submission is my own work, … in order to confirm the authorship of your submission
(12) Click on the button Continue
End of specification