1) It is a general form of knowledge distillation. For example, this 2016 paper describes the same technique, Sequence-Level Knowledge Distillation [0]; a rough sketch of the procedure is given after this list:
> This sequence-level approximation leads to a simple training procedure wherein the student network is trained on a newly generated dataset that is the result of running beam search with the teacher network
2) Fine-tuning is a step in the training process. Language models are first pre-trained, then fine-tuned. This is a pedantic quibble.
3) It is unsurprising that you don't understand ad hominem. Giving background information and pointing out the style of writing is relevant to the arguments being made.
It's arguable that saying "EY has a very shallow understanding of ML" is even lower than ad hominem (which is DH1) on the pg scale [4], since pg specifically gives "The author is a self-important dilettante." as an example of DH0.
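To make point 1 concrete, here is a minimal sketch of the sequence-level distillation loop the quoted paper describes: generate targets with the teacher's beam search, then train the student on those generated pairs with an ordinary supervised loss. This assumes a Hugging Face style seq2seq API; the checkpoint names are placeholders, not anything from the paper.

```python
# Sketch of sequence-level knowledge distillation (per [0]):
# the teacher's beam-search outputs become the student's training targets.
# Checkpoint names below are placeholders.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

teacher = AutoModelForSeq2SeqLM.from_pretrained("teacher-checkpoint")  # placeholder
student = AutoModelForSeq2SeqLM.from_pretrained("student-checkpoint")  # placeholder
tokenizer = AutoTokenizer.from_pretrained("teacher-checkpoint")        # placeholder


def build_distillation_dataset(source_texts, num_beams=5, max_length=128):
    """Run beam search with the teacher and keep its top hypothesis per input."""
    pairs = []
    teacher.eval()
    with torch.no_grad():
        for text in source_texts:
            inputs = tokenizer(text, return_tensors="pt")
            beam_output = teacher.generate(
                **inputs, num_beams=num_beams, max_length=max_length
            )
            target = tokenizer.decode(beam_output[0], skip_special_tokens=True)
            pairs.append((text, target))
    return pairs


def distillation_step(source_text, target_text, optimizer):
    """Ordinary supervised step: the student fits the teacher-generated target."""
    inputs = tokenizer(source_text, return_tensors="pt")
    labels = tokenizer(target_text, return_tensors="pt").input_ids
    loss = student(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

The only distillation-specific part is where the targets come from; the student's training step itself is unchanged, which is why this counts as the same technique.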
[0] https://arxiv.org/abs/1606.07947