Exploiting hanja-based resources in processing Korean historic documents written by common literati
ESWA (Expert Systems with Applications)
- Impact Factor 2024: 7.5
Authors
- Suhyune Son, Chanjun Park, Jungseob Lee, Midan Shim, Chanhee Lee, Yoonna Jang, Jaehyung Seo, Jungwoo Lim, Heuiseok Lim
Abstract
As recent pre-trained language models require enormous corpora and resources, an inequality between high-resource and low-resource languages has become prominent. To mitigate this problem, studies on cross-lingual transfer learning and multilingual training have attempted to endow long-tail languages with knowledge acquired from high-resource languages. Although successful, existing work has mainly focused on experimenting with as many languages as possible, leaving targeted in-depth analysis absent. In this study, we spotlight a single low-resource language and perform extensive evaluation and probing experiments using cross-lingual post-training (XPT). To make the transfer scenario challenging, we adopt Korean as the target language, since its low linguistic similarity to English makes it well suited to demonstrating XPT's efficiency in transferring capability. Through comprehensive experiments, we observe that XPT outperforms monolingual models trained on large corpora in language understanding tasks and shows comparable performance even with limited training data. We also find that the XPT-based method is effective when transferring from a source language to a target language with low linguistic similarity.
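The abstract does not spell out the XPT procedure itself, so here is a minimal sketch of the common cross-lingual post-training recipe it refers to: swap in a target-language (Korean) embedding layer on an English-pretrained model, train only the new embeddings while the transformer body is frozen, then unfreeze everything and continue post-training. The model and tokenizer names ("roberta-base", "klue/roberta-base") and the two-stage schedule are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of cross-lingual post-training (XPT) under the assumptions above.
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Assumed English source model and Korean target tokenizer (illustrative only).
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
tokenizer = AutoTokenizer.from_pretrained("klue/roberta-base")

# Replace the vocabulary: resize to the Korean tokenizer and re-initialize
# the embedding matrix so it can be learned from scratch for the target language.
model.resize_token_embeddings(len(tokenizer))
model.get_input_embeddings().weight.data.normal_(mean=0.0, std=0.02)

# Stage 1: freeze the transformer body (the source-language knowledge) and
# train only the new target-language embeddings with an MLM objective.
for param in model.parameters():
    param.requires_grad = False
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True

# ... run masked-language-model training on the Korean corpus here ...

# Stage 2: unfreeze the whole model and continue post-training on the target corpus.
for param in model.parameters():
    param.requires_grad = True
```

The intuition behind the two stages is that stage 1 first aligns the new Korean embeddings with the frozen English-trained transformer body, after which stage 2 adapts the full network; this staged reuse of source-language knowledge is consistent with the abstract's claim that XPT remains competitive even with limited target-language training data.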
Check out this link for more information on our paper.