The rise of large language models (LLMs) makes distinguishing between human-written and LLM-generated text increasingly challenging. Detecting LLM-generated text is essential for preserving academic integrity, preventing plagiarism, and ensuring research ethics. Despite its significance, most existing methods focus on English. Leveraging language-specific features is both beneficial and necessary, yet current approaches neglect this aspect. With its unique spacing rules, rich morphology, and distinct punctuation patterns, the Korean language requires specialized approaches that English-centric methods fail to accommodate.
Meet 🐟 KatFish (KoreAn LLM-generated Text Benchmark For Identifying AuthorSHip), the first benchmark dataset for detecting LLM-generated Korean text, constructed from four LLMs across three genres. By analyzing key linguistic features (spacing patterns, part-of-speech n-gram diversity, and punctuation), we reveal fundamental differences between human-written and LLM-generated text. Building on these insights, we introduce 🐟 KatFishNet, a detection method tailored to Korean text. KatFishNet significantly outperforms existing approaches, highlighting the effectiveness of leveraging linguistic features for detection.
Our work provides a foundation for further research into LLM-generated text detection in Korean and highlights the potential of methodologies that leverage inherent linguistic features tailored to the language.
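To make the spacing and punctuation cues mentioned above concrete, here is a toy feature extractor, a minimal sketch and not KatFishNet's actual method. The function name and the specific statistics are illustrative assumptions; POS n-gram diversity is omitted because it would additionally require a Korean morphological analyzer (e.g. one of those available through KoNLPy).

```python
import re
from collections import Counter

def spacing_and_punct_features(text: str) -> dict:
    """Toy extractor of spacing and punctuation statistics.

    Illustrative only: KatFishNet's real features are defined in the
    paper. POS n-gram features would need a Korean POS tagger and are
    not included here.
    """
    # In Korean, space-delimited units are eojeol; LLMs and humans
    # tend to differ in how they apply spacing rules.
    eojeol = text.split()
    punct = re.findall(r"[.,!?;:\u2026\u00b7\"'()\[\]]", text)
    punct_counts = Counter(punct)
    return {
        # average eojeol length; spacing-rule differences shift this
        "avg_eojeol_len": sum(len(t) for t in eojeol) / max(len(eojeol), 1),
        # punctuation marks per character
        "punct_ratio": len(punct) / max(len(text), 1),
        # size of the punctuation inventory actually used
        "distinct_punct": len(punct_counts),
    }

sample = "밤하늘에 별이 빛난다. 바람이 분다, 조용히."
feats = spacing_and_punct_features(sample)
```

Feature vectors like this could then be compared between human-written and LLM-generated corpora, or fed to a simple classifier.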
🐟 KatFish: LLM-Generated Korean Text Detection Dataset
Our dataset reflects real-world challenges by including text from multiple LLMs with varying characteristics, ensuring a robust and comprehensive benchmark for detection while also facilitating analysis from multiple aspects.
KatFish spans three distinct genres, **Persuasive Essays, Poetry, and Paper Abstracts**, capturing a wide range of linguistic and structural variation. By incorporating these genres, KatFish provides a rich testing ground for detecting LLM-generated text across different writing styles.
Detecting LLM-generated text is urgent in each of our three domains:
Essays: LLM-generated essays threaten academic integrity, facilitating plagiarism and undermining critical thinking.
Poetry: The rise of AI-generated poetry raises concerns about plagiarism, copyright, and artistic authenticity.
Paper Abstracts: LLM-generated abstracts can introduce misinformation, undermining the credibility of research.