The U.S. National Library of Medicine (NLM) maintains a database called MEDLINE that contains more than 25 million references to journal articles in biomedicine and is accessed through PubMed. In this assignment, you are asked to go through the following steps for biomedical relation extraction: (1) search the MEDLINE abstracts to collect 100 sentences, each of which contains at least one of the five verbs below (including their inflected forms), (2) annotate these sentences with the triples <X, ACTION, Y> to be extracted, (3) develop a relation extraction module based on a randomly selected set of 80 sentences and assess its performance, and (4) apply the module to the remaining (annotated but unseen) 20 sentences and assess its performance.
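The random 80/20 split in steps (3) and (4) could be made reproducible along the following lines. This is only a sketch: the function name `split_sentences`, the seed value, and the placeholder sentence IDs are assumptions for illustration, not part of the assignment.

```python
import random

def split_sentences(items, n_dev=80, seed=13):
    """Shuffle a copy of `items` with a fixed seed and split it into two parts."""
    shuffled = list(items)                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # private generator, reproducible order
    return shuffled[:n_dev], shuffled[n_dev:]

# Placeholder IDs stand in for the 100 annotated sentences.
sentence_ids = [f"S{i:03d}" for i in range(1, 101)]
dev_ids, test_ids = split_sentences(sentence_ids)
print(len(dev_ids), len(test_ids))
```

Fixing the seed means the same 20 sentences stay unseen every time the script is rerun, which keeps the step (4) scores comparable across revisions of the module.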
Additional constraints are shown below:
(1) There is no need for an automated method to collect such sentences from MEDLINE. However, you must include the verbs activate, inhibit, and bind, together with one further verb denoting a positive action, such as accelerate, augment, induce, stimulate, require, or up-regulate, and one further verb denoting a negative action, such as abolish, block, down-regulate, or prevent, with 20 sentences for each of the five distinct verbs. You should also prefer recent publications, starting with the year 2020, with at most 30 sentences per year, at most 10 sentences per journal, and at most two sentences per organization, as identified by the affiliation of the corresponding author.
(2) The following shows some guidelines for your annotation of expected triples.
A. Inorganic phosphate inhibited HPr kinase but activated HPR phosphatase.
<Inorganic phosphate, inhibited, HPr kinase>
<Inorganic phosphate, activated, HPR phosphatase>
B. All vasodilators activated K-Cl cotransport in LK SRBCs and HYZ in VSMCs, and this activation was inhibited by calyculin and genistein, two inhibitors of K-Cl cotransport.
<All vasodilators, activated, K-Cl cotransport>
<All vasodilators, activated, HYZ>
<this activation, was inhibited by, calyculin> OR <calyculin, inhibited, this activation>
<this activation, was inhibited by, genistein> OR <genistein, inhibited, this activation>
<this activation, was inhibited by, two inhibitors> OR <two inhibitors, inhibited, this activation>
(3) In developing a relation extraction module, you should not use any third-party modules for coreference resolution, NER, relation extraction, or event extraction, or any parsers made available specifically for biomedicine, except for the NER module in NLTK. You should not use any external corpora for training.
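The quotas in constraint (1) can be checked mechanically once each collected sentence is stored with its metadata. The sketch below assumes a hypothetical record layout with the keys `verb` (base form), `year`, `journal`, and `organization`; these names and the helper `check_quotas` are illustrative only.

```python
from collections import Counter

# Caps taken from constraint (1); the dictionary keys are assumed field names.
CAPS = {"year": 30, "journal": 10, "organization": 2}
PER_VERB = 20
N_VERBS = 5

def check_quotas(records):
    """Return a list of human-readable quota violations (empty if none)."""
    problems = []
    verb_counts = Counter(r["verb"] for r in records)
    if len(verb_counts) != N_VERBS:
        problems.append(f"expected {N_VERBS} distinct verbs, found {len(verb_counts)}")
    for verb, n in sorted(verb_counts.items()):
        if n != PER_VERB:
            problems.append(f"verb {verb!r}: {n} sentences, expected {PER_VERB}")
    for field, cap in CAPS.items():
        counts = Counter(r[field] for r in records)
        for value, n in sorted(counts.items()):
            if n > cap:
                problems.append(f"{field} {value!r}: {n} sentences, cap is {cap}")
    return problems

# Toy input: three sentences from one organization exceed the cap of two.
demo = [{"verb": "activate", "year": 2023, "journal": "J1", "organization": "Org A"}] * 3
problems = check_quotas(demo)
for line in problems:
    print(line)
```

The preference for recent publications is not checked here, since the brief states it as a preference rather than a hard limit.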
Report the performance in terms of Precision/Recall/F-score for (3) and (4).
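One way to compute these scores is to treat the gold and predicted triples as sets and count exact matches. The sketch below makes two assumptions that are not stated in the brief: a triple counts as correct only if X, ACTION, and Y all match exactly, and each gold annotation has been normalized to a single form (for instance, always the active alternative where an OR is allowed). The function name `prf` and the sentence ID "A" are illustrative.

```python
def prf(gold, predicted):
    """Micro-averaged precision, recall, and F1 over (sentence_id, X, ACTION, Y) tuples."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                       # exact-match true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Gold triples for example sentence A, plus one correct and one wrong prediction.
gold = [
    ("A", "Inorganic phosphate", "inhibited", "HPr kinase"),
    ("A", "Inorganic phosphate", "activated", "HPR phosphatase"),
]
predicted = [
    ("A", "Inorganic phosphate", "inhibited", "HPr kinase"),
    ("A", "HPr kinase", "activated", "HPR phosphatase"),
]
p, r, f = prf(gold, predicted)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")  # P=0.50 R=0.50 F=0.50
```

Including the sentence ID in each tuple keeps identical triples from different sentences from being merged when the sets are built.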
As before, you should use techniques that can be implemented in Python and NLTK.
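As a starting point, a purely pattern-based baseline might look like the sketch below. It is not a reference solution: it uses only the standard `re` module so that it runs without any downloads, the verb list covers activate, inhibit, and bind plus two arbitrarily chosen verbs (induce, block), and the boundary and auxiliary word lists are assumptions. NLTK's `word_tokenize` and `pos_tag` could replace the regex tokenizer and the verb pattern.

```python
import re

# Assumption: stems for five example verbs; replace with the verbs you selected.
VERB_RE = re.compile(r"^(activat|inhibit|bind|bound|induc|block)(e|es|ed|s|ing)?$", re.I)
BOUNDARIES = {",", ";", ".", "but", "and", "which", "that"}
AUXILIARIES = {"is", "are", "was", "were", "be", "been", "being"}

def tokenize(sentence):
    # Words (allowing internal hyphens, e.g. K-Cl) or single punctuation marks.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)

def extract_triples(sentence):
    """Return (X, ACTION, Y) triples found around each target verb."""
    tokens = tokenize(sentence)
    triples, last_subject = [], ""
    for i, tok in enumerate(tokens):
        if not VERB_RE.match(tok):
            continue
        # Left argument: walk back to the nearest clause boundary.
        start = i
        while start > 0 and tokens[start - 1].lower() not in BOUNDARIES:
            start -= 1
        left = tokens[start:i]
        passive = bool(left) and left[-1].lower() in AUXILIARIES
        if passive:
            left = left[:-1]
        # Right argument: walk forward to the next clause boundary.
        end = i + 1
        while end < len(tokens) and tokens[end].lower() not in BOUNDARIES:
            end += 1
        right = tokens[i + 1:end]
        if passive:
            if not right or right[0].lower() != "by":
                continue  # agentless passive: no second argument to report
            right = right[1:]
        # A verb right after "but"/"and" reuses the previous subject.
        subject = " ".join(left) or last_subject
        obj = " ".join(right)
        if not subject or not obj:
            continue
        last_subject = subject
        # Passive clauses are emitted in the active form <agent, verb, patient>.
        triples.append((obj, tok, subject) if passive else (subject, tok, obj))
    return triples

example_a = "Inorganic phosphate inhibited HPr kinase but activated HPR phosphatase."
triples_a = extract_triples(example_a)
print(triples_a)
```

On example sentence A this yields the two expected triples. It will over-generate on longer sentences such as example B, where prepositional phrases and coordinated objects end up inside the arguments, and it will treat nominal uses such as "binding site" as verbs; those are the cases where POS tags and NLTK chunking would be the natural next step.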
A. Write Python code for relation extraction.
B. Show the 100 sentences, annotated with the expected triples, together with tags for the 80 sentences.
C. Discuss your results, explaining how you addressed the goal and suggesting how the quality of the results could be improved further.