Publications
A complete and up-to-date list of publications is available via Google Scholar (https://scholar.google.com/citations?user=y02ojzsAAAAJ&hl=en).
2023
- A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. Bang, Yejin; Cahyawijaya, Samuel; Lee, Nayeon; Dai, Wenliang; Su, Dan; Wilie, Bryan; Lovenia, Holy; Ji, Ziwei; Yu, Tiezheng; Chung, Willy; Do, Quyet V.; Xu, Yan; and Fung, Pascale. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Nov 2023.
This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multimodal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset. We find that ChatGPT outperforms LLMs with zero-shot learning on most tasks and even outperforms fine-tuned models on some tasks. We find that it is better at understanding non-Latin script languages than generating them. It is able to generate multimodal content from textual prompts, via an intermediate code generation step. Moreover, we find that ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning, hence making it an unreliable reasoner. It is, for example, better at deductive than inductive reasoning. ChatGPT suffers from hallucination problems like other LLMs, and it generates more extrinsic hallucinations from its parametric memory as it does not have access to an external knowledge base. Finally, the interactive feature of ChatGPT enables human collaboration with the underlying LLM to improve its performance, i.e., 8% ROUGE-1 on summarization and 2% ChrF++ on machine translation, in a multi-turn prompt engineering fashion. We also release the codebase for evaluation set extraction.
@inproceedings{bang-etal-2023-multitask,
  title     = {A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity},
  author    = {Bang, Yejin and Cahyawijaya, Samuel and Lee, Nayeon and Dai, Wenliang and Su, Dan and Wilie, Bryan and Lovenia, Holy and Ji, Ziwei and Yu, Tiezheng and Chung, Willy and Do, Quyet V. and Xu, Yan and Fung, Pascale},
  editor    = {Park, Jong C. and Arase, Yuki and Hu, Baotian and Lu, Wei and Wijaya, Derry and Purwarianti, Ayu and Krisnadhi, Adila Alfa},
  booktitle = {Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = nov,
  year      = {2023},
  address   = {Nusa Dua, Bali},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2023.ijcnlp-main.45},
  doi       = {10.18653/v1/2023.ijcnlp-main.45},
  pages     = {675--718},
}
- Learn What NOT to Learn: Towards Generative Safety in Chatbots. Khalatbari, Leila; Bang, Yejin; Su, Dan; Chung, Willy; Ghadimi, Saeed; Sameti, Hossein; and Fung, Pascale. arXiv preprint arXiv:2304.11220, 2023.
Conversational models that are generative and open-domain are particularly susceptible to generating unsafe content since they are trained on web-based social data. Prior approaches to mitigating this issue have drawbacks, such as disrupting the flow of conversation, limited generalization to unseen toxic input contexts, and sacrificing the quality of the dialogue for the sake of safety. In this paper, we present a novel framework, named LOT (Learn NOT to), that employs a contrastive loss to enhance generalization by learning from both positive and negative training signals. Our approach differs from the standard contrastive learning framework in that it automatically obtains positive and negative signals from the safe and unsafe language distributions that have been learned beforehand. The LOT framework utilizes divergence to steer the generations away from the unsafe subspace and towards the safe subspace while sustaining the flow of conversation. Our approach is memory- and time-efficient during decoding and effectively reduces toxicity while preserving engagingness and fluency. Empirical results indicate that LOT reduces toxicity by up to four-fold while achieving four- to six-fold higher rates of engagingness and fluency compared to baseline models. Our findings are further corroborated by human evaluation.
@article{khalatbari2023learn,
  title   = {Learn What NOT to Learn: Towards Generative Safety in Chatbots},
  author  = {Khalatbari, Leila and Bang, Yejin and Su, Dan and Chung, Willy and Ghadimi, Saeed and Sameti, Hossein and Fung, Pascale},
  journal = {arXiv preprint arXiv:2304.11220},
  year    = {2023},
}
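As an editorial aside, the divergence-steering idea described above can be sketched compactly. The following is a minimal, hypothetical Python (PyTorch) illustration that assumes three next-token distributions are available (from the base dialogue model, a model of safe language, and a model of unsafe language); the combination rule, the alpha weight, and the helper name are illustrative assumptions, not the paper's actual objective.

import torch
import torch.nn.functional as F

def steer_next_token_logits(base_logits, safe_logits, unsafe_logits, alpha=1.0):
    """Shift the base next-token distribution toward the safe distribution
    and away from the unsafe one, then renormalize. alpha sets the strength."""
    base_logp = F.log_softmax(base_logits, dim=-1)
    safe_logp = F.log_softmax(safe_logits, dim=-1)
    unsafe_logp = F.log_softmax(unsafe_logits, dim=-1)
    # Reward tokens the safe distribution prefers over the unsafe one.
    steered = base_logp + alpha * (safe_logp - unsafe_logp)
    return F.log_softmax(steered, dim=-1)

# Toy usage with a 5-token vocabulary.
torch.manual_seed(0)
base, safe, unsafe = (torch.randn(5) for _ in range(3))
probs = steer_next_token_logits(base, safe, unsafe).exp()
print(probs, probs.sum())  # a valid distribution that sums to 1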
- InstructAlign: High-and-Low Resource Language Alignment via Continual Crosslingual Instruction Tuning. Cahyawijaya, Samuel; Lovenia, Holy; Yu, Tiezheng; Chung, Willy; and Fung, Pascale. In Proceedings of the First Workshop in South East Asian Language Processing, Nov 2023.
Large language models (LLMs) that are tuned with instructions have demonstrated remarkable capabilities in various tasks and languages. However, their ability to generalize to underrepresented languages is limited due to the scarcity of available data. Additionally, directly adapting new languages to instruction-tuned LLMs can result in catastrophic forgetting, which leads to the loss of multitasking ability. To address this issue, we propose InstructAlign, which uses continual crosslingual instruction tuning to enable LLMs to align new unseen languages with previously learned high-resource languages. Our results demonstrate the effectiveness of InstructAlign in enabling the model to understand low-resource languages with limited parallel data while preventing catastrophic forgetting. Our work contributes to the advancement of language adaptation methods, particularly for adapting instruction-tuned LLMs to underrepresented languages.
@inproceedings{cahyawijaya-etal-2023-instructalign,
  title     = {InstructAlign: High-and-Low Resource Language Alignment via Continual Crosslingual Instruction Tuning},
  author    = {Cahyawijaya, Samuel and Lovenia, Holy and Yu, Tiezheng and Chung, Willy and Fung, Pascale},
  editor    = {Wijaya, Derry and Aji, Alham Fikri and Vania, Clara and Winata, Genta Indra and Purwarianti, Ayu},
  booktitle = {Proceedings of the First Workshop in South East Asian Language Processing},
  month     = nov,
  year      = {2023},
  address   = {Nusa Dua, Bali, Indonesia},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2023.sealp-1.5},
  doi       = {10.18653/v1/2023.sealp-1.5},
  pages     = {55--78},
}
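To illustrate the data side of continual crosslingual instruction tuning, the hypothetical sketch below turns a small parallel corpus into bidirectional translation-style instructions; the template wording, field names, and example sentences are invented for illustration and are not taken from the paper.

from typing import List, Tuple

def build_alignment_instructions(pairs: List[Tuple[str, str]],
                                 src_lang: str, tgt_lang: str) -> List[dict]:
    """Expand each parallel sentence pair into two instruction examples,
    one per translation direction."""
    examples = []
    for src, tgt in pairs:
        # High-resource -> low-resource direction.
        examples.append({
            "instruction": f"Translate the following {src_lang} sentence to {tgt_lang}.",
            "input": src,
            "output": tgt,
        })
        # The reverse direction keeps the previously learned language in the
        # loop, which is the intuition behind aligning new languages to old ones.
        examples.append({
            "instruction": f"Translate the following {tgt_lang} sentence to {src_lang}.",
            "input": tgt,
            "output": src,
        })
    return examples

parallel = [("The weather is nice today.", "Cuaca hari ini cerah.")]
for ex in build_alignment_instructions(parallel, "English", "Indonesian"):
    print(ex)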
- PICK: Polished & Informed Candidate Scoring for Knowledge-Grounded Dialogue Systems. Wilie, Bryan; Xu, Yan; Chung, Willy; Cahyawijaya, Samuel; Lovenia, Holy; and Fung, Pascale. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Nov 2023.
Grounding dialogue response generation on external knowledge is proposed to produce informative and engaging responses. However, current knowledge-grounded dialogue (KGD) systems often fail to align the generated responses with human-preferred qualities due to several issues, such as hallucination and a lack of coherence. Upon analyzing multiple language model generations, we observe the presence of alternative generated responses within a single decoding process. These alternative responses are more faithful and exhibit a comparable or higher level of relevance to prior conversational turns compared to the optimal responses prioritized by the decoding processes. To address these challenges and driven by these observations, we propose Polished & Informed Candidate Scoring (PICK), a generation re-scoring framework that empowers models to generate faithful and relevant responses without requiring additional labeled data or model tuning. Through comprehensive automatic and human evaluations, we demonstrate the effectiveness of PICK in generating responses that are more faithful while keeping them relevant to the dialogue history. Furthermore, PICK consistently improves the system’s performance with both oracle and retrieved knowledge in all decoding strategies.
@inproceedings{wilie-etal-2023-pick,
  title     = {PICK: Polished \& Informed Candidate Scoring for Knowledge-Grounded Dialogue Systems},
  author    = {Wilie, Bryan and Xu, Yan and Chung, Willy and Cahyawijaya, Samuel and Lovenia, Holy and Fung, Pascale},
  editor    = {Park, Jong C. and Arase, Yuki and Hu, Baotian and Lu, Wei and Wijaya, Derry and Purwarianti, Ayu and Krisnadhi, Adila Alfa},
  booktitle = {Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = nov,
  year      = {2023},
  address   = {Nusa Dua, Bali},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2023.ijcnlp-main.63},
  doi       = {10.18653/v1/2023.ijcnlp-main.63},
  pages     = {980--995},
}
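To make the re-scoring idea concrete, here is a minimal, hypothetical Python sketch: candidates from a single decoding pass are re-ranked by a weighted mix of faithfulness to the knowledge and relevance to the dialogue history. The lexical-overlap scorer is a crude stand-in chosen to keep the example self-contained; PICK's actual scoring functions differ.

from typing import List

def overlap_score(text: str, reference: str) -> float:
    """Crude lexical-overlap stand-in for a learned faithfulness/relevance scorer."""
    a, b = set(text.lower().split()), set(reference.lower().split())
    return len(a & b) / max(len(a), 1)

def pick_response(candidates: List[str], knowledge: str, history: str,
                  w_faith: float = 0.5) -> str:
    """Re-rank candidate responses by faithfulness + relevance and return the best."""
    def score(c: str) -> float:
        return w_faith * overlap_score(c, knowledge) + (1 - w_faith) * overlap_score(c, history)
    return max(candidates, key=score)

history = "User: Who composed the Moonlight Sonata?"
knowledge = "The Moonlight Sonata was composed by Ludwig van Beethoven in 1801."
candidates = [
    "I think it was Mozart.",
    "It was composed by Ludwig van Beethoven in 1801.",
]
print(pick_response(candidates, knowledge, history))  # prefers the faithful candidate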
- Contrastive Learning for Inference in Dialogue. Ishii, Etsuko; Xu, Yan; Wilie, Bryan; Ji, Ziwei; Lovenia, Holy; Chung, Willy; and Fung, Pascale. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Dec 2023.
Inference, especially that derived from inductive processes, is a crucial component of our conversations, complementing the information implicitly or explicitly conveyed by a speaker. While recent large language models show remarkable advances in inference tasks, their performance in inductive reasoning, where not all information is present in the context, lags far behind their performance in deductive reasoning. In this paper, we analyze the behavior of the models based on the task difficulty defined by the semantic information gap, which distinguishes inductive from deductive reasoning. Our analysis reveals that the information gap between dialogue contexts and desired inferences renders the inductive inference process more challenging. To mitigate this information gap, we investigate a contrastive learning approach that feeds the model negative samples. Our experiments suggest that negative samples help models understand what is wrong and improve their inference generation.
@inproceedings{ishii-etal-2023-contrastive,
  title     = {Contrastive Learning for Inference in Dialogue},
  author    = {Ishii, Etsuko and Xu, Yan and Wilie, Bryan and Ji, Ziwei and Lovenia, Holy and Chung, Willy and Fung, Pascale},
  editor    = {Bouamor, Houda and Pino, Juan and Bali, Kalika},
  booktitle = {Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
  month     = dec,
  year      = {2023},
  address   = {Singapore},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2023.emnlp-main.631},
  doi       = {10.18653/v1/2023.emnlp-main.631},
  pages     = {10202--10221},
}
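One generic way to exploit negative samples is a margin-based contrastive objective over sequence-level log-probabilities, sketched below; this formulation and its inputs are assumptions made for illustration and are not necessarily the loss used in the paper.

import torch

def contrastive_inference_loss(pos_logprob: torch.Tensor,
                               neg_logprob: torch.Tensor,
                               margin: float = 1.0) -> torch.Tensor:
    """Penalize cases where the correct inference is not scored at least
    `margin` log-probability higher than the negative sample."""
    return torch.clamp(margin - (pos_logprob - neg_logprob), min=0.0).mean()

# Toy usage: a batch of two dialogues, with sequence log-probs from some LM.
pos = torch.tensor([-3.2, -2.8], requires_grad=True)
neg = torch.tensor([-3.0, -6.1])
loss = contrastive_inference_loss(pos, neg)
loss.backward()
print(loss.item(), pos.grad)  # only the violated pair contributes gradient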
- InstructTODS: Large Language Models for End-to-End Task-Oriented Dialogue Systems. Chung, Willy; Cahyawijaya, Samuel; Wilie, Bryan; Lovenia, Holy; and Fung, Pascale. In Proceedings of the Second Workshop on Natural Language Interfaces, Nov 2023.
Large language models (LLMs) have been used for diverse tasks in natural language processing (NLP), yet remain under-explored for task-oriented dialogue systems (TODS), especially for end-to-end TODS. We present InstructTODS, a novel off-the-shelf framework for zero-shot end-to-end task-oriented dialogue systems that can adapt to diverse domains without fine-tuning. By leveraging LLMs, InstructTODS generates a proxy belief state that seamlessly translates user intentions into dynamic queries for efficient interaction with any knowledge base (KB). Our extensive experiments demonstrate that InstructTODS achieves comparable performance to fully fine-tuned TODS in guiding dialogues to successful completion without prior knowledge or task-specific data. Furthermore, a rigorous human evaluation of end-to-end TODS shows that InstructTODS produces dialogue responses that notably outperform both the gold responses and the state-of-the-art TODS in terms of helpfulness, informativeness, and humanness. Moreover, the effectiveness of LLMs in TODS is further supported by our comprehensive evaluations on TODS subtasks: dialogue state tracking, intent classification, and response generation.
@inproceedings{chung-etal-2023-instructtods,
  title     = {InstructTODS: Large Language Models for End-to-End Task-Oriented Dialogue Systems},
  author    = {Chung, Willy and Cahyawijaya, Samuel and Wilie, Bryan and Lovenia, Holy and Fung, Pascale},
  editor    = {Chen, Kehai and Ku, Lun-Wei},
  booktitle = {Proceedings of the Second Workshop on Natural Language Interfaces},
  month     = nov,
  year      = {2023},
  address   = {Bali, Indonesia},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2023.nlint-1.1},
  doi       = {10.18653/v1/2023.nlint-1.1},
  pages     = {1--21},
}
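The proxy-belief-state step can be pictured as prompting an LLM for a structured query and filtering a knowledge base with it. Everything in the sketch below (the prompt text, the llm stub, the toy KB, and the JSON query format) is a hypothetical illustration of the flow, not InstructTODS's implementation.

import json

def llm(prompt: str) -> str:
    """Stand-in for a call to any instruction-following LLM."""
    return json.dumps({"cuisine": "italian", "area": "centre"})

KB = [
    {"name": "Caffe Uno", "cuisine": "italian", "area": "centre"},
    {"name": "Golden Wok", "cuisine": "chinese", "area": "north"},
]

def proxy_belief_state(dialogue_history: str) -> dict:
    """Ask the LLM to summarize user intent as a structured KB query."""
    prompt = ("Extract the user's constraints from the dialogue as a JSON object.\n"
              f"Dialogue: {dialogue_history}\nJSON:")
    return json.loads(llm(prompt))

def query_kb(constraints: dict) -> list:
    """Return KB entries that satisfy every constraint."""
    return [e for e in KB if all(e.get(k) == v for k, v in constraints.items())]

history = "User: I want an Italian restaurant in the centre."
print(query_kb(proxy_belief_state(history)))  # matches "Caffe Uno"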
- Cross-Lingual Cross-Age Group Adaptation for Low-Resource Elderly Speech Emotion Recognition. Cahyawijaya, Samuel*; Lovenia, Holy*; Chung, Willy*; Frieske, Rita; Liu, Zihan; and Fung, Pascale. In INTERSPEECH 2023, 24th Annual Conference of the International Speech Communication Association, Dublin, Ireland, 20-24 August 2023.
Speech emotion recognition plays a crucial role in human-computer interaction. However, most speech emotion recognition research is biased toward English-speaking adults, which hinders its applicability to other demographic groups in different languages and age groups. In this work, we analyze the transferability of emotion recognition across three different languages (English, Mandarin Chinese, and Cantonese) and two different age groups (adults and the elderly). To conduct the experiment, we develop an English-Mandarin speech emotion benchmark for adults and the elderly, BiMotion, and a Cantonese speech emotion dataset, YueMotion. This study concludes that different language and age groups require specific speech features, thus making cross-lingual inference an unsuitable method. However, cross-group data augmentation is still beneficial for regularizing the model, with linguistic distance having a significant influence on cross-lingual transferability.
@inproceedings{cahyawijaya2023cross,
  title     = {Cross-Lingual Cross-Age Group Adaptation for Low-Resource Elderly Speech Emotion Recognition},
  author    = {Cahyawijaya, Samuel* and Lovenia, Holy* and Chung, Willy* and Frieske, Rita and Liu, Zihan and Fung, Pascale},
  year      = {2023},
  booktitle = {INTERSPEECH 2023, 24th Annual Conference of the International Speech Communication Association, Dublin, Ireland, 20-24 August 2023},
  publisher = {ISCA},
}
2022
- Clozer: Adaptable Data Augmentation for Cloze-style Reading Comprehension. Lovenia, Holy*; Wilie, Bryan*; Chung, Willy*; Min, Zeng*; Cahyawijaya, Samuel; Su, Dan; and Fung, Pascale. In Proceedings of the 7th Workshop on Representation Learning for NLP, 2022.
Task-adaptive pre-training (TAPT) alleviates the lack of labelled data and provides a performance lift by adapting unlabelled data to the downstream task. Unfortunately, existing adaptations mainly involve deterministic rules that cannot generalize well. Here, we propose Clozer, a sequence-tagging-based cloze answer extraction method used in TAPT that is extendable to any cloze-style machine reading comprehension (MRC) downstream task. We experiment on multiple-choice cloze-style MRC tasks and show that Clozer significantly outperforms the oracle and the state of the art in boosting TAPT effectiveness for lifting model performance, and demonstrate that Clozer is able to recognize the gold answers independently of any heuristics.
@inproceedings{lovenia2022clozer,
  title     = {Clozer: Adaptable Data Augmentation for Cloze-style Reading Comprehension},
  author    = {Lovenia, Holy* and Wilie, Bryan* and Chung, Willy* and Min, Zeng* and Cahyawijaya, Samuel and Su, Dan and Fung, Pascale},
  booktitle = {Proceedings of the 7th Workshop on Representation Learning for NLP},
  pages     = {60--66},
  year      = {2022},
}
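The synthesis step that Clozer plugs into TAPT can be illustrated as follows: tag an answer span in raw text, then mask it to obtain a cloze question. In this hypothetical sketch the random span picker merely stands in for Clozer's learned sequence tagger, and the @placeholder convention is borrowed from standard cloze-style MRC formats.

import random

def tag_answer_span(tokens: list) -> tuple:
    """Stub for the learned sequence tagger: pick a random single-token
    content-word span as the cloze answer."""
    candidates = [i for i, t in enumerate(tokens) if t.isalpha() and len(t) > 3]
    i = random.choice(candidates)
    return i, i + 1

def make_cloze_example(passage: str) -> dict:
    """Turn an unlabeled passage into a (question, answer) cloze pair."""
    tokens = passage.split()
    start, end = tag_answer_span(tokens)
    answer = " ".join(tokens[start:end])
    masked = tokens[:start] + ["@placeholder"] + tokens[end:]
    return {"question": " ".join(masked), "answer": answer}

random.seed(7)
print(make_cloze_example("Task adaptive pretraining adapts unlabeled data to the downstream task"))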
- Every picture tells a story: Image-grounded controllable stylistic story generation. Lovenia, Holy; Wilie, Bryan; Barraud, Romain; Cahyawijaya, Samuel; Chung, Willy; and Fung, Pascale. In Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 2022.
Generating a short story out of an image is arduous. Unlike image captioning, story generation from an image poses multiple challenges: preserving the story coherence, appropriately assessing the quality of the story, steering the generated story into a certain style, and addressing the scarcity of image-story pair reference datasets, which limits supervision during training. In this work, we introduce Plug-and-Play Story Teller (PPST) and improve image-to-story generation by: 1) alleviating the data scarcity problem by incorporating large pre-trained models, namely CLIP and GPT-2, to facilitate a fluent image-to-text generation with minimal supervision, and 2) enabling a more style-relevant generation by incorporating stylistic adapters to control the story generation. We conduct image-to-story generation experiments with non-styled, romance-styled, and action-styled PPST approaches and compare our generated stories with those of previous work over three aspects, i.e., story coherence, image-story relevance, and style fitness, using both automatic and human evaluation. The results show that PPST improves story coherence and has better image-story relevance, but has yet to be adequately stylistic.
@inproceedings{lovenia2022every,
  title     = {Every picture tells a story: Image-grounded controllable stylistic story generation},
  author    = {Lovenia, Holy and Wilie, Bryan and Barraud, Romain and Cahyawijaya, Samuel and Chung, Willy and Fung, Pascale},
  booktitle = {Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature},
  pages     = {40--52},
  year      = {2022},
}
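At a high level, the plug-and-play pipeline reads as: ground the image in text, then continue that text with a style-conditioned language model. The sketch below stubs out both model calls; in the actual system the grounding is CLIP-guided and the continuation comes from GPT-2 with stylistic adapters, so every function here is a hypothetical placeholder showing only the flow.

def clip_ground(image_path: str) -> str:
    """Stub: map an image to a grounding phrase (CLIP-guided in the paper)."""
    return "a lighthouse on a stormy coast"

def styled_continue(prompt: str, style: str) -> str:
    """Stub: continue text with a style-specific adapter on the language model."""
    return f"[{style}] {prompt} The waves rose, and the keeper made his choice."

def generate_story(image_path: str, style: str = "romance") -> str:
    """Compose the two stages: image -> grounding text -> styled story."""
    grounding = clip_ground(image_path)
    prompt = f"This is a story about {grounding}."
    return styled_continue(prompt, style)

print(generate_story("coast.jpg", style="action"))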