
Next Goal Plan

We have already implemented several features, such as multi-task learning, gradient fusion, SimCSE, and PALs. This issue is to clarify our future plans.


We aim to achieve good performance on SST, QA, and STS. The three datasets are (domain / task):

  1. movie reviews / sentiment classification,
  2. frequently asked question sentences from the web / paraphrase (synonym) detection,
  3. news headlines and forum posts / similarity detection.

My idea is to combine several different methods:

[Further Pretrain] First, we need to further pre-train on corpora that are useful for all three tasks (e.g., Wiki103, which covers a broad range of knowledge and has high domain overlap with our SST and STS datasets) so that the model picks up knowledge and structure relevant to our target datasets.

In this step, we can train with the classical language modeling objective (Wiki103 + language modeling). We can also use SimCSE, which I have already implemented as a training objective together with the related datasets (e.g., Wiki_for_Sim, which I am using now, as well as NLI). I have already shown that the SimCSE approach does not hurt performance on the other tasks, which makes it better than plain further pre-training. (I will do this step and work out the best practice for us in August.)
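For reference, here is a minimal sketch of the unsupervised SimCSE objective as I understand it. The `bert` encoder, its call signature, and the argument names are placeholders for illustration, not the actual functions in our repo:

```python
import torch
import torch.nn.functional as F

def simcse_loss(bert, input_ids, attention_mask, temperature=0.05):
    """Unsupervised SimCSE: dropout noise gives two views per sentence."""
    # The encoder must be in train() mode so dropout is active; encoding
    # the same batch twice yields two slightly different embeddings.
    z1 = bert(input_ids, attention_mask)  # (batch, hidden)
    z2 = bert(input_ids, attention_mask)  # (batch, hidden)

    # Cosine similarity between every pair of views in the batch.
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1)
    sim = sim / temperature

    # The i-th sentence's positive is its own second view; every other
    # sentence in the batch acts as an in-batch negative.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```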

[Distillation Model] Second, we can use more task-specific datasets (e.g., MS, FP, etc. for sentiment classification). The problem is that training directly on these datasets would hurt our performance on the other two tasks, so we will use a distillation model (teacher-student setup) to exploit these datasets, which are closely related to one target task but would likely hurt overall performance if fed into the regular further-pre-training process. (I have written the training process for both MS and FP here; I can import other datasets if necessary. Yasir handles the distillation logic.)
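A rough sketch of the distillation loss we could plug in for this step; the function name, temperature, and mixing weight are assumptions on my side, and Yasir's implementation may differ:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Mix soft teacher targets with hard gold labels for the student."""
    # Soft targets: KL divergence between temperature-softened distributions.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    kd = kd * (temperature ** 2)  # standard scaling for soft targets

    # Hard targets: ordinary cross-entropy on the gold labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

This way the student sees the teacher's task-specific knowledge as soft targets instead of being trained directly on the narrow corpus, which is what hurt the other two tasks before.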

[Multi-task] Finally, we do multi-task learning (using the distilled model, of course). Theoretically, our teacher model (from step 2) should have at least a 5% performance advantage over our student model (further-pre-train + multi-task learning) on its specific task.
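For clarity, a sketch of what the shared-encoder multi-task architecture could look like; the head shapes, hidden size, and method names are illustrative assumptions, not our final design:

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """One shared (further-pre-trained, distilled) encoder, three task heads."""

    def __init__(self, encoder, hidden=768, num_sentiment_classes=5):
        super().__init__()
        self.encoder = encoder  # shared BERT encoder
        self.sentiment_head = nn.Linear(hidden, num_sentiment_classes)
        # Pair tasks concatenate the two sentence embeddings.
        self.paraphrase_head = nn.Linear(2 * hidden, 1)
        self.similarity_head = nn.Linear(2 * hidden, 1)

    def embed(self, ids, mask):
        return self.encoder(ids, mask)  # (batch, hidden) sentence embedding

    def predict_sentiment(self, ids, mask):
        return self.sentiment_head(self.embed(ids, mask))

    def predict_paraphrase(self, ids1, mask1, ids2, mask2):
        pair = torch.cat([self.embed(ids1, mask1), self.embed(ids2, mask2)], dim=-1)
        return self.paraphrase_head(pair)  # logit; apply sigmoid in the loss

    def predict_similarity(self, ids1, mask1, ids2, mask2):
        pair = torch.cat([self.embed(ids1, mask1), self.embed(ids2, mask2)], dim=-1)
        return self.similarity_head(pair)  # scalar score compared to the STS label
```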
