Enhancing Vaccine Adverse Event Detection on Social Media through LLM-Driven Synthetic Data Augmentation

Authors

  • Abdusalam Nwesri, University of Tripoli
  • Mai Elbaabaa, University of Tripoli
  • Nabila Shinbir, College of Science and Technology
  • Hasan Ebrahem, University of Tripoli
  • Marwa Solla, University of Tripoli

DOI:

https://doi.org/10.65568/gujes.2026.020112

Keywords:

LLMs, Vaccine Adverse Event Reporting, Synthetic Augmentation

Abstract

This paper evaluates the impact of synthetic data augmentation on the detection of personally experienced vaccine reactions in social media posts. Our study builds on the UoT team's submission for Task 6 of the 10th Social Media Mining for Health (#SMM4H) Shared Tasks. We establish a baseline by fine-tuning six Large Language Models (LLMs), then analyze how augmenting the training set with synthetically generated examples influences classification metrics. Our experiments show that synthetic augmentation leads to substantial performance improvements across all models, with an additional benefit for smaller models.
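
The augmentation pipeline itself is not described on this page. As a rough illustration only, the sketch below shows one way synthetic examples might be merged into a real training set and how a positive-class F1 score could be computed to compare a baseline with an augmented run. The helper names (`augment`, `f1_positive`), the deduplication step, the example posts, and the choice of F1 as the metric are all assumptions, not the authors' method.

```python
from typing import List, Tuple

# (post text, label): 1 = personally experienced vaccine reaction, 0 = otherwise.
Example = Tuple[str, int]


def augment(train: List[Example], synthetic: List[Example]) -> List[Example]:
    """Append synthetic examples to the real training set.

    Hypothetical helper: exact duplicates (ignoring case and surrounding
    whitespace) are skipped so synthetic text does not repeat real posts.
    """
    seen = {text.strip().lower() for text, _ in train}
    merged = list(train)
    for text, label in synthetic:
        key = text.strip().lower()
        if key not in seen:
            seen.add(key)
            merged.append((text, label))
    return merged


def f1_positive(gold: List[int], pred: List[int]) -> float:
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up posts for illustration; not taken from the shared-task data.
real = [("My arm was sore after the shot", 1),
        ("Vaccines are available at the clinic", 0)]
generated = [("my arm was sore after the shot", 1),  # duplicate, skipped
             ("Got a fever the night after my second dose", 1)]
augmented = augment(real, generated)
```

In such a setup, each model would be fine-tuned once on `real` and once on `augmented`, and `f1_positive` would be evaluated on the same held-out set for both runs.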

Published

2026-03-15 — Updated on 2026-03-16