CS5340-6340 - Project

The project for this course will be to design and build an information extraction (IE) system for the domain of corporate acquisition events.[1] You can work in a 2-person team or as a solo team, depending on your preference. Your team’s program should read short news stories about corporate acquisition events and extract several pieces of information. A sample news story is shown below:

Text 379
Four Seasons Hotels said it and VMS Realty Partners of Chicago have agreed to purchase the Santa Barbara Biltmore Hotel from Marriott Corp for an undisclosed amount.

It said the venture will rename the hotel the Four Seasons Biltmore at Santa Barbara and invest over 13 mln dlrs in improvements on the 228-room property. Reuter

Each news story will describe exactly one relevant corporate event. For each story, your IE system will have to fill out an event template with the 8 “slots” defined below. The TEXT field will contain the story’s identifier and the other 7 slots should contain information automatically extracted from the story. Most stories will not contain all of this information, in which case only a subset of these slots should be filled. Some stories may have only a couple of slots filled, while others may have most or even all of these slots filled.

TEXT: filename identifier

ACQUIRED: entities that were acquired

ACQBUS: the business focus of the acquired entities

ACQLOC: the location of the acquired entities

DLRAMT: the amount paid for the acquired entities

PURCHASER: entities that purchased the acquired entities

SELLER: entities that sold the acquired entities

STATUS: status description of the acquisition event

You will be provided with a collection of corporate acquisition news stories as well as “gold” answer key templates, which you can use to help design and develop your system.[2] The output of your system will be scored against these answer keys.

Template formatting is as follows:

•    Every template MUST have at least 8 rows corresponding to the 8 slots defined earlier. The slots must be printed in exactly the order shown.

•    If no information can (or should) be extracted for a slot, then you should print three dashes (- - -) to indicate that the slot is empty.

•    Every non-empty answer should be printed in double quotes (e.g., “this is an acceptable answer string”).

•    Every slot except the TEXT field can potentially have more than one answer. Each distinct answer should be printed on a separate line with the same slot field name. For example, if someone acquires 3 companies, then your output template should contain 3 rows for the slot type ACQUIRED. The order in which you print these answers does not matter (but you should print them in adjacent rows).

•    The answer key templates will sometimes contain a disjunction of acceptable answers for a slot, with the disjuncts separated by a slash. For example, an answer key might list “IBM” / “IBM Corp”, indicating that the strings “IBM” and “IBM Corp” are both acceptable answers. IMPORTANT: Your system should not use slashes in its output templates! Each slot should be filled by a single extracted string in your output.
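The formatting rules above can be sketched in Python as follows. This is only an illustration, not a required design: the names SLOT_ORDER, EMPTY, and format_template, and the dictionary of answers, are assumptions made for the example. The empty marker is written here as three consecutive dashes and answers are wrapped in plain ASCII double quotes; check both against the provided answer keys and scoring program before relying on them.

```python
# Slot order as listed in the project description (TEXT is handled separately).
SLOT_ORDER = ["ACQUIRED", "ACQBUS", "ACQLOC", "DLRAMT",
              "PURCHASER", "SELLER", "STATUS"]

# Assumption: three dashes with no spaces; adjust to match the answer keys.
EMPTY = "---"


def format_template(text_id, answers):
    """Return one output template as a single string.

    answers is a hypothetical dict mapping a slot name to a list of
    extracted strings; missing or empty slots are printed as EMPTY.
    """
    lines = [f"TEXT: {text_id}"]
    for slot in SLOT_ORDER:
        values = answers.get(slot) or []
        if not values:
            lines.append(f"{slot}: {EMPTY}")
        # One row per answer, on adjacent rows with the same slot name.
        for value in values:
            lines.append(f'{slot}: "{value}"')
    return "\n".join(lines)
```

Because every slot is visited in a fixed order, a template always has at least 8 rows, and multiple answers for a slot stay on adjacent rows.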

As an example, the answer key template for Text 379 is shown below:

Answer Key Template

TEXT: 379

ACQUIRED: “Biltmore Hotel” / “Santa Barbara Biltmore Hotel”

ACQBUS: - - -

ACQLOC: “Santa Barbara”

DLRAMT: “undisclosed amount”

PURCHASER: “Four Seasons Hotels”

PURCHASER: “VMS Realty Partners”

SELLER: “Marriott Corp”

STATUS: “agreed to purchase”

This answer key template indicates that the Biltmore Hotel should be extracted as the ACQUIRED entity, and that the strings “Biltmore Hotel” and “Santa Barbara Biltmore Hotel” are both acceptable answers. So if your system extracts either one of those strings, its output will be scored as correct.

The ACQBUS slot is empty in the answer key template, indicating that nothing should be extracted for that slot.

The ACQLOC, DLRAMT, SELLER, and STATUS slots each have a single answer that should be extracted. Note that the DLRAMT slot should be filled by a phrase, rather than a monetary amount, in this particular story.

The story mentions that two entities purchased the Biltmore Hotel, so both entities need to be extracted as PURCHASERs. Each one is printed on a separate line, with the same PURCHASER: slot field name. The order in which they are listed does not matter.
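When exploring the development data, it can help to read answer key rows into a structure that keeps the slash-separated alternatives together. The sketch below is one hypothetical way to do that; parse_key_row and is_acceptable are names invented for this example, and the exact-match comparison is an assumption. The scoring program provided with the project remains the authority on what counts as correct.

```python
import re


def parse_key_row(row):
    """Split a row such as 'SLOT: "a" / "b"' into (slot, [alternatives]).

    Only quoted strings are collected, so an empty slot (three dashes)
    and the unquoted TEXT identifier both give an empty list.
    """
    slot, _, rest = row.partition(":")
    # Accept straight or curly double quotes, since the files may use either.
    alternatives = re.findall(r'["“]([^"”]*)["”]', rest)
    return slot.strip(), alternatives


def is_acceptable(answer, alternatives):
    """Assumed check: the answer equals one of the listed alternatives."""
    return answer in alternatives
```

For the ACQUIRED row of Text 379, this yields two alternatives, and either string would pass the assumed check.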

 

Input
Your IE system should accept a single input file as a command-line argument, which will list the texts to be processed. If you use Python, we should be able to run your program like this:

python3 extract.py <doclist>

If you use Java, you should invoke Java similarly and be sure to accept the same argument on the command line.

The doclist file will contain a list of full pathnames for the files to be processed, one per line. For example, a doclist file might look like this:

/home/kermit/docsA/10

/home/kermit/docsA/XYZ

/home/kermit/docsB/story.txt

Each pathname should be split into a path (everything up to and including the rightmost slash /) and the filename itself (the string after the rightmost slash). For example, the pathnames in the doclist above should be split into:

              PATH = /home/kermit/docsA/         FILENAME = 10

              PATH = /home/kermit/docsA/         FILENAME = XYZ

              PATH = /home/kermit/docsB/         FILENAME = story.txt

The filename should be treated as a text’s identifier when creating its output template. You can assume that the filenames will be unique.
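The splitting rule can be sketched in Python as shown below. The function names split_pathname and read_doclist are illustrative only, and skipping blank lines in the doclist is an assumption made for robustness rather than something the description requires.

```python
def split_pathname(pathname):
    """Split at the rightmost slash.

    The path keeps the trailing slash; the filename is everything after it.
    If there is no slash at all, the path is the empty string.
    """
    cut = pathname.rfind("/") + 1
    return pathname[:cut], pathname[cut:]


def read_doclist(doclist_path):
    """Return (pathname, filename) pairs in the order listed in the doclist."""
    entries = []
    with open(doclist_path, encoding="utf-8") as handle:
        for line in handle:
            pathname = line.strip()
            if pathname:  # assumption: ignore blank lines
                entries.append((pathname, split_pathname(pathname)[1]))
    return entries
```

Keeping the pairs in a list preserves the doclist order, and the filename half of each pair is the identifier for the TEXT: slot.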

 

Output
As output, your IE system should produce a set of output templates, one template per story. Your system should always produce a template for every story, and the TEXT: slot should always be filled.

 

Your system should print the output templates to a single file that has the same name as the input file but with an added extension of “.templates”. For example, if the input file is called “doclistX” then the output file should be named “doclistX.templates”. The output templates should correspond to the stories in the doclist input file in exactly the same order. Print a blank line between the templates for different stories.

For example, if the input file lists three stories with the identifiers 10, 22, and 43 (listed in that order), then the output file should contain exactly three templates, where the first one corresponds to text 10, the second one corresponds to text 22, and the third one corresponds to text 43.
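A minimal sketch of the output step in Python follows. It assumes the templates have already been formatted as strings and are supplied in doclist order; write_templates is a hypothetical helper name, not part of the assignment.

```python
def write_templates(doclist_path, templates):
    """Write template strings to '<doclist>.templates' and return that path.

    templates is a list of already-formatted template strings, one per
    story, in the same order as the stories appear in the doclist.
    """
    out_path = doclist_path + ".templates"
    with open(out_path, "w", encoding="utf-8") as out:
        # Joining with two newlines leaves one blank line between templates.
        out.write("\n\n".join(templates) + "\n")
    return out_path
```

Appending the extension to the full input name, rather than replacing any existing extension, gives “doclistX.templates” for an input file called “doclistX”.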

 

The Data Sets
You will be given three sets of data at different points in the project.

Development Set: approximately 400 stories and answer keys

Test Set #1: 100 stories and answer keys

Test Set #2: 100 stories and answer keys

 

Project Phases
The project will involve three phases:

Development Phase: A Development Set is available on CANVAS for you to use when creating your IE system. You may use these stories and the answer keys in any way that you wish.

In addition, we will give you the scoring program that we will use to evaluate your IE system. You can use this scoring program to assess the performance of your system yourself as you experiment with different ideas. The arguments that it takes are described at the beginning of the file.
