1. Task Introduction
Task: Extractive Question Answering
https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/bert_v8.pdf#page=23
Dataset: Chinese Reading Comprehension
Paragraph
新加坡、馬來西亞的華文學術界在 1970年代後開始統一使用簡體中文;然而繁體字在媒體中普遍存在著,例如華人商店的招牌、舊告示及許多非學術類中文書籍,香港和臺灣所出版的書籍也有在市場上流動。當地許多中文報章都會使用「標題繁體字,內容簡化字」的方式讓簡繁中文並存。除此之外,馬來西亞所有本地中文報章之官方網站都是以繁體中文為主要文字。
Question: 新加坡的華文學術界在哪個年代後開始使用簡體中文?
Answer: 1970
Question: 馬來西亞的華人商店招牌主要使用什麼文字 ?
Baselines                Public Score
Simple (1pt + 1pt)       0.44622
Medium (1pt + 1pt)       0.68421
Strong (0.5pt + 0.5pt)   0.79290
Boss (0.5pt + 0.5pt)     0.84897
2. Sample Code
hw7.ipynb
3. Tutorial
A Toy Example
[Link to Demo]
Paragraph: 李宏毅幾班大金。 Question: 李宏毅幾班? Answer: 大金
Paragraph: Jeanie likes Tom because Tom is good at deep learning. Question: Why does Jeanie like Tom?
Answer: Tom is good at deep learning
Why Is a Long Paragraph an Issue?
Total sequence length = question length + paragraph length + 3 (special tokens).
The maximum input sequence length of BERT is restricted to 512. Why?
Self-attention in the Transformer has O(n²) complexity in the sequence length n.
Therefore, we may not be able to process the whole paragraph.
What can we do?
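As a quick sanity check of the length formula above, here is a minimal sketch using a huggingface tokenizer (the model name "bert-base-chinese" is just one example, not necessarily what the sample code uses):

```python
from transformers import BertTokenizerFast

# Example tokenizer; any BERT-style Chinese tokenizer behaves the same way here.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")

question = "新加坡的華文學術界在哪個年代後開始使用簡體中文?"
paragraph = "新加坡、馬來西亞的華文學術界在1970年代後開始統一使用簡體中文。"

q_len = len(tokenizer(question, add_special_tokens=False)["input_ids"])
p_len = len(tokenizer(paragraph, add_special_tokens=False)["input_ids"])
pair_len = len(tokenizer(question, paragraph)["input_ids"])

# [CLS] question [SEP] paragraph [SEP] -> 3 special tokens
assert pair_len == q_len + p_len + 3
```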
Training
We know where the answer is in training!
Assumption: Info needed to answer the question can be found near the answer!
Simple solution: Just draw a window (as large as possible) around the answer!
e.g. window size = max_paragraph_len = 32
新加坡、馬來西亞的華文學術界在 1970年代後開始統一使用簡體中文;然而繁體字在媒體中普遍存在著,例如華人商店的招牌、舊告示及許多非學術類中文書籍,香港和臺灣所出版的書籍也有在市場上流動 ...
Q: 新加坡的華文學術界在哪個年代後開始使用簡體中文 ? A: 1970
Q: 馬來西亞的華人商店招牌主要使用什麼文字 ? A: 繁體字
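A minimal sketch of this training-time windowing, with toy values (variable names such as answer_start_token are assumptions for illustration, not the exact names in the sample code):

```python
# Toy values for illustration only.
paragraph_tokens = list(range(100))            # pretend these are 100 paragraph token ids
answer_start_token, answer_end_token = 40, 43  # answer span is known during training
max_paragraph_len = 32                         # window size

# Center a window of max_paragraph_len tokens on the answer,
# clipped so the window stays inside the paragraph.
mid = (answer_start_token + answer_end_token) // 2
window_start = max(0, min(mid - max_paragraph_len // 2,
                          len(paragraph_tokens) - max_paragraph_len))
window_tokens = paragraph_tokens[window_start : window_start + max_paragraph_len]

# Training labels: the answer position relative to the window.
start_label = answer_start_token - window_start
end_label = answer_end_token - window_start
```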
Testing
We do not know where the answer is at test time, so split the paragraph into windows!
e.g. window size = max_paragraph_len = 32
新加坡、馬來西亞的華文學術界在 1970年代後開始統一使用簡體中文;然而繁體字在媒體中普遍存在著,例如華人商店的招牌、舊告示及許多非學術類中文書籍 ......
Q: 新加坡的華文學術界在哪個年代後開始使用簡體中文 ? A: 1970
Q: 馬來西亞的華人商店招牌主要使用什麼文字 ? A: 繁體字
For each window, the model predicts a start score and an end score; take the window with the maximum total score as the answer!
           start score   start position   end score   end position   total score
window 1   0.5           23               0.4         26             0.9
window 2   0.3           35               0.7         37             1.0

Answer: position 35 to 37
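A minimal sketch of this selection step, using the toy numbers from the table above (in practice the scores would come from the model's start/end outputs):

```python
# (start_score, start_position, end_score, end_position) for each window.
windows = [
    (0.5, 23, 0.4, 26),   # window 1 -> total score 0.9
    (0.3, 35, 0.7, 37),   # window 2 -> total score 1.0
]

# Take the window whose start score + end score is the largest.
best = max(windows, key=lambda w: w[0] + w[2])
start_position, end_position = best[1], best[3]
print(start_position, end_position)   # 35 37 -> the answer spans positions 35 to 37
```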
4. Hints
Hints for beating baselines
Simple: Sample code
Medium: Apply linear learning rate decay; change value of "doc_stride"
Strong: Improve preprocessing; try other pretrained models
Boss: Improve postprocessing; further improve the above hints
Estimated training time

          K80    T4     T4 (FP16)   V100
Simple    40m    20m    8m          7m
Medium    40m    20m    8m          7m
Strong    2h     1h     25m         20m
Boss      10h    5h     2h          1h30m
Training Tips (Optional):
- Automatic mixed precision (fp16)
- Gradient accumulation
- Ensemble
Linear Learning rate decay
Method 1: Adjust learning rate manually
Decrement optimizer.param_groups[0]["lr"] by learning_rate / total_training_steps at every step.
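A minimal sketch of Method 1, with a placeholder model and assumed names (learning_rate, total_steps); a scheduler such as transformers' get_linear_schedule_with_warmup would achieve a similar effect:

```python
import torch

model = torch.nn.Linear(10, 2)   # placeholder for the QA model
learning_rate = 1e-4
total_steps = 1000
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(total_steps):
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
    # Manually decay the learning rate linearly towards 0 over training.
    optimizer.param_groups[0]["lr"] -= learning_rate / total_steps
```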
Doc stride
In the sample code, "doc_stride" is set to "max_paragraph_len" (i.e. the windows do not overlap). What if the answer is near the boundary of two windows, or spans across windows? Hint: overlapping windows.
Note: The mention of "doubling" (翻倍) in the video may be a slip of the tongue.
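A minimal sketch of overlapping test-time windows with a smaller doc_stride (toy values only, not the sample code itself):

```python
paragraph_tokens = list(range(100))   # toy tokenized paragraph
max_paragraph_len = 32                # window size
doc_stride = 16                       # smaller than the window size -> windows overlap

windows = [
    paragraph_tokens[i : i + max_paragraph_len]
    for i in range(0, len(paragraph_tokens), doc_stride)
]
# With doc_stride = 16 consecutive windows overlap by 16 tokens, so an answer
# that is cut by one window boundary still appears intact in a neighboring window.
```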
Preprocessing
Hint: How do you prevent the model from learning something it should not learn during training? (In the sample code, the training window is centered on the answer, but at test time answers are not always near the middle of the window.)
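One possible reading of this hint (an assumption, not necessarily the intended solution) is to place the training window at a random offset that still contains the answer, instead of always centering it on the answer. A minimal sketch:

```python
import random

# Toy values; assumes the whole answer fits inside one window.
paragraph_tokens = list(range(100))
answer_start_token, answer_end_token = 40, 43
max_paragraph_len = 32

# Any window start in [lo, hi] keeps the full answer inside the window.
lo = max(0, answer_end_token - max_paragraph_len + 1)
hi = min(answer_start_token, len(paragraph_tokens) - max_paragraph_len)
window_start = random.randint(lo, hi)   # random offset instead of always centering

start_label = answer_start_token - window_start
end_label = answer_end_token - window_start
```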
Other pretrained models
You can choose any model you like! [Link to pretrained models in huggingface]
Note 1: You are NOT allowed to use pretrained models outside huggingface!
(Violation = cheating = final grade x 0.9)
Note 2: Some models have a model card describing the details of the model.
Note 3: Changing models may lead to error messages; try to solve them yourself.
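A minimal sketch of swapping in another huggingface model; the name "hfl/chinese-roberta-wwm-ext" is just one example of a model on the Hub, not a recommendation from the assignment:

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_name = "hfl/chinese-roberta-wwm-ext"   # example only; pick any huggingface model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Tokenize with the matching tokenizer: different models may use different
# vocabularies, special tokens, or maximum input lengths.
```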
Postprocessing
Hint: Open your prediction file to see what is wrong
(e.g. what if predicted end_index < predicted start_index?)
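A minimal sketch of one possible postprocessing fix: only consider (start, end) pairs with end_index >= start_index when searching over the logits (toy logits here; real ones come from the model output):

```python
import torch

# Toy start/end logits over a window of 8 token positions.
start_logits = torch.tensor([0.1, 0.2, 2.0, 0.3, 0.1, 0.0, 0.1, 0.2])
end_logits   = torch.tensor([2.5, 0.1, 0.2, 0.4, 1.8, 0.1, 0.0, 0.1])

best_score, best_span = float("-inf"), (0, 0)
for s in range(len(start_logits)):
    for e in range(s, len(end_logits)):          # enforce end_index >= start_index
        score = (start_logits[s] + end_logits[e]).item()
        if score > best_score:
            best_score, best_span = score, (s, e)

print(best_span)   # (2, 4): the best span once invalid pairs are excluded
```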
Training Tip: Automatic mixed precision
PyTorch trains with 32-bit floating point (FP32) arithmetic by default.

Estimated training time

          T4     T4 (FP16)
Simple    20m    8m
Medium    20m    8m
Strong    1h     25m
Boss      5h     2h
Automatic Mixed Precision (AMP) enables automatic conversion of certain GPU operations from FP32 precision to half precision (FP16).
It offers about a 1.5-3.0x speed-up while maintaining accuracy.
Warning: AMP only works on some GPUs (e.g. T4, V100).
Intro to native pytorch automatic mixed precision
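A minimal training-step sketch with native PyTorch AMP (torch.cuda.amp); the model, data, and loop below are placeholders, not the sample code:

```python
import torch

device = "cuda"                                       # AMP here requires a CUDA GPU
model = torch.nn.Linear(128, 2).to(device)            # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                  # scales the loss to avoid FP16 underflow

for _ in range(10):                                   # placeholder training loop
    x = torch.randn(8, 128, device=device)
    y = torch.randint(0, 2, (8,), device=device)

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                   # run the forward pass in mixed precision
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()                     # backward pass on the scaled loss
    scaler.step(optimizer)                            # unscale gradients, then optimizer step
    scaler.update()
```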
Training Tip: Gradient accumulation
Use it when GPU memory is not enough but you want to use a larger batch size.
Split the global batch into smaller mini-batches.
For each mini-batch, accumulate the gradient without updating the model parameters; update the model parameters only after the last mini-batch of the global batch.
Reference: Gradient Accumulation in PyTorch
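A minimal sketch of gradient accumulation (placeholder model and data; accumulation_steps is an assumed name):

```python
import torch

model = torch.nn.Linear(128, 2)                      # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accumulation_steps = 4                               # one global batch = 4 mini-batches

optimizer.zero_grad()
for step in range(100):                              # placeholder training loop
    x = torch.randn(8, 128)
    y = torch.randint(0, 2, (8,))

    loss = torch.nn.functional.cross_entropy(model(x), y)
    (loss / accumulation_steps).backward()           # accumulate (averaged) gradients

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                             # update once per global batch
        optimizer.zero_grad()
```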
5. Regulations
You should NOT plagiarize; if you use any other resource, you should cite it in the reference. (*)
You should NOT modify your prediction files manually.
Do NOT share codes or prediction files with any living creatures.
Do NOT use any approaches to submit your results more than 5 times a day.
Do NOT search or use additional data.
Do NOT use any pre-trained models outside huggingface.
Your final grade will be multiplied by 0.9 if you violate any of the above rules.
Prof. Lee & TAs reserve the right to change the rules & grades.