Automatic Recognition of Authors Identity in Persian based on Systemic Functional Grammar

Soltanzadeh, Fatemeh; Mirzaei, Azadeh; Bahrani, Mohammad; Modarres Khiabani, Shahram

doi:10.22126/jlw.2023.9391.1716

Article type: Research article

Authors

¹ Ph.D. in Linguistics, Department of Linguistics, Faculty of Persian Literature and Foreign Languages, Allameh Tabataba'i University, Tehran, Iran

² Associate Professor, Department of Linguistics, Faculty of Persian Literature and Foreign Languages, Allameh Tabataba'i University, Tehran, Iran

³ Assistant Professor, Department of Computer, Faculty of Statistics, Mathematics and Computer, Allameh Tabataba'i University, Tehran, Iran

⁴ Assistant Professor, Department of English Translation Teaching, Karaj Branch, Islamic Azad University, Karaj, Iran

Abstract

Automatic identification of a text's author is regarded as one of the important problems in forensic linguistics. The present study compares the effectiveness of features based on concepts from Halliday's systemic functional grammar (Halliday and Matthiessen, 2014) with that of function words in author identification. To this end, a corpus of works by seven contemporary Iranian authors was first compiled. In the second stage, a list was prepared of the function words extracted from the corpus; in addition, a lexicon was built from linguistic resources on the basis of the conjunction system network, the mood adjunct system network, and the comment adjunct system network. The relative frequency of the function words and of the features based on systemic functional grammar was then computed for each text. A multilayer perceptron classifier, a type of neural network, was used to train the system and yielded satisfactory accuracy in the evaluation stage. Examination of the evaluation results showed that the function-word frequency method outperforms the method based on systemic functional grammar in identifying the authors of Persian texts; nevertheless, when the features of Halliday's systemic functional grammar are used alongside function-word frequency, the system performs better than when function-word frequency is used alone.

DOR: 20.1001.1.23452579.1403.12.3.4.6

Article title [English]

Automatic Recognition of Authors Identity in Persian based on Systemic Functional Grammar

Authors [English]

Fatemeh Soltanzadeh ¹
Azadeh Mirzaei ²
Mohammad Bahrani ³
Shahram Modarres Khiabani ⁴

¹ Ph.D. in Linguistics, Department of Linguistics, Faculty of Persian Literature and Foreign Languages, Allameh Tabataba'i University, Tehran, Iran

² Associate Professor of Linguistics, Department of Linguistics, Faculty of Persian Literature and Foreign Languages, Allameh Tabataba'i University, Tehran, Iran

³ Assistant Professor, Department of Computer, Faculty of Statistics, Mathematics and Computer, Allameh Tabataba'i University, Tehran, Iran

⁴ Assistant Professor, Department of English Language and Translation, Islamic Azad University, Karaj, Iran

Abstract [English]

Automated author identification is one of the important fields in forensic linguistics. In this study, the effectiveness of systemic functional grammar (Halliday and Matthiessen, 2014) features in Persian authorship attribution was compared with that of function words. First, a corpus of documents written by seven contemporary Iranian authors was collected. Second, a list of function words was extracted from the corpus. In addition, the conjunction, modality, and comment adjunct system networks were used to build a lexicon from linguistic resources. The relative frequencies of the function words and of the systemic functional features were then calculated for each document. A multilayer perceptron classifier, a type of neural network, was used in the learning phase and achieved satisfactory accuracy in the evaluation phase. The results showed that the function-word method is superior to the systemic functional approach alone in Persian author identification. However, using the two methods together is more effective than using either one alone.
Introduction
Recently, automated author identification has become a key focus for forensic linguistics. Author identification involves determining the writer of a text from a set of potential authors. The text in question could be a threatening letter, an email, a literary work, or a scientific article or book. Author identification rests on the idea that different authors writing about the same topic use overlapping yet distinct lexico-grammatical units, a phenomenon referred to as idiolect (Coulthard, 2004).
The first significant attempt to identify writing styles was Mendenhall's study of Shakespeare's plays (1887). The play Henry VIII is widely recognized as a collaborative work, not solely authored by William Shakespeare. Plechac (2020) investigated the use of accent or stress to identify the contributions of other authors to the play.
In Persian, several studies have been conducted to determine authorship (Farahmandpour et al., 2013; Arefi et al., 2021). These studies used recurring features, such as lexical richness, frequency of syntactic groups, collocations, and the relative frequency of punctuation marks, to detect writing styles. Measuring the frequency of function words is one established method for author identification. Function words carry little lexical meaning of their own and instead signal the functional relationships between the components of a sentence. Golshaie (2019) and Dabagh (2007) applied the frequency of Persian function words to identify authors. This study aims to compare the efficacy of function word frequency with systemic functional grammar methods in automatically identifying writing styles.
Theoretical Framework
Systemic Functional Grammar (SFG) is a component of the social semiotic approach to language known as systemic functional linguistics (Halliday & Matthiessen, 2014). SFG conceptualizes language as a network of systems, or interrelated sets of options for creating meaning. Since the 1960s, SFG has been applied in various contexts within computational linguistics (Matthiessen & Bateman, 1991; Teich, 1995). In SFG, the clause is considered the fundamental unit of language, and it is analyzed through three perspectives, defined as the ideational, interpersonal, and textual metafunctions. The ideational function is further divided into the experiential and logical aspects (Halliday & Matthiessen, 2014).
This study employs three system networks: conjunction, modality, and comment. These networks correspond to three types of adjuncts: conjunctive adjuncts, mood adjuncts, and comment adjuncts, respectively. In the systemic environment of conjunction, conjunctions function as conjunctive adjuncts within the clause structure. They establish relationships where one segment of text elaborates on, extends, or enhances another segment (Halliday & Matthiessen, 2014).
Modal adjuncts express the speaker's or writer's judgment or attitude toward the content of the message. There are two types of modal adjuncts: (i) mood adjuncts and (ii) comment adjuncts. Mood adjuncts and comment adjuncts are categorized within the modality and comment adjunct system networks, respectively. Modality encompasses intermediate degrees between positive and negative poles, defining the region of uncertainty between 'yes' and 'no.' The modality system allows writers to qualify events or entities in terms of their probability, typicality, obligation, or inclination (Halliday & Matthiessen, 2014). The comment adjunct system provides a means for the writer to comment on the status of a message concerning the textual and interactive context of the discourse (Argamon et al., 2007). Comments can target either the ideational content of the proposition or the interpersonal aspects of the speech function (Halliday & Matthiessen, 2014).
Method
A corpus totaling 2,069,243 words was compiled from the works of seven contemporary Persian writers: Houshang Golshiri, Bozorg Alavi, Ahmad Mahmoud, Mahmoud Dolatabadi, Nader Ebrahimi, Jalal Al-e Ahmad, and Gholamhossein Saedi. From this corpus, a list of 197 function words was extracted using the Term Frequency-Inverse Document Frequency (TF-IDF) technique. The conjunction, modality, and comment adjunct system networks were then used to create a lexicon.
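The abstract does not say how the TF-IDF scores were turned into the 197-word list, so the following is only a minimal sketch of one plausible reading: words that occur in (nearly) every document have an inverse document frequency close to zero, and the most frequent of them are kept as function-word candidates. The function names, the `max_idf` threshold, and the transliterated toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def idf_scores(documents):
    """Return the plain inverse document frequency log(N / df) per word type."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def candidate_function_words(documents, max_idf=0.0, top_k=None):
    """Keep words with idf <= max_idf, ranked by total corpus frequency."""
    idf = idf_scores(documents)
    totals = Counter(token for tokens in documents for token in tokens)
    candidates = [word for word in totals if idf[word] <= max_idf]
    # Most frequent first; ties broken alphabetically for a stable order.
    candidates.sort(key=lambda word: (-totals[word], word))
    return candidates if top_k is None else candidates[:top_k]


# Hypothetical toy corpus of tokenized, transliterated Persian sentences.
docs = [
    "va man ke az khaneh raftam".split(),
    "ketab ra az u gereftam va khandam".split(),
    "az baran va bad ke migoft".split(),
]
function_word_candidates = candidate_function_words(docs)
print(function_word_candidates)
```

With the default threshold only words present in every document survive; raising `max_idf` admits words that are missing from a few documents, which is the knob one would tune against a real corpus.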
An author identification system was designed using machine learning techniques. The system tokenized the texts, extracted instances of lexical units specified in the lexicon, and computed the relative frequencies of semantic attribute values for each text, resulting in an overall "feature vector" that described each text. This approach was inspired by the method introduced by Argamon et al. (2007). For the learning phase, a multilayer perceptron classifier, a type of neural network, was utilized.
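The feature extraction step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the lexicon entries, their transliterations, and their assignment to system-network options are hypothetical, and relative frequency is taken here as count divided by the number of tokens in the text, a normalization the abstract does not spell out.

```python
from collections import Counter

# Hypothetical lexicon: transliterated item -> (system network, option).
LEXICON = {
    "va": ("conjunction", "extension"),
    "amma": ("conjunction", "extension"),
    "yani": ("conjunction", "elaboration"),
    "sepas": ("conjunction", "enhancement"),
    "shayad": ("modality", "probability"),
    "hatman": ("modality", "probability"),
    "khoshbakhtaneh": ("comment", "propositional"),
}
# Hypothetical function-word list (the study used 197 items).
FUNCTION_WORDS = ["va", "az", "ke", "amma"]

# Fixed feature order so every text maps to a vector of the same shape.
FEATURE_NAMES = [f"fw:{word}" for word in FUNCTION_WORDS] + sorted(
    {f"sfg:{system}/{option}" for system, option in LEXICON.values()}
)


def feature_vector(tokens):
    """Relative frequency of each function word and each SFG option."""
    counts = Counter()
    for token in tokens:
        if token in FUNCTION_WORDS:
            counts[f"fw:{token}"] += 1
        if token in LEXICON:
            system, option = LEXICON[token]
            counts[f"sfg:{system}/{option}"] += 1
    total = len(tokens)
    if total == 0:
        return [0.0] * len(FEATURE_NAMES)
    return [counts[name] / total for name in FEATURE_NAMES]


text = "shayad u amad va sepas raft amma khoshbakhtaneh bargasht va".split()
vector = dict(zip(FEATURE_NAMES, feature_vector(text)))
print(vector)
```

Note that an item such as "va" contributes to two features at once, its own function-word frequency and the frequency of the system option it realizes, which is what allows the two feature sets to be used separately or together.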
Results
To evaluate the system, the collected corpus was divided into five segments and 5-fold cross-validation was applied. With function words as the only features, cross-validation yielded satisfactory accuracy. The combined use of function words and SFG features achieved an accuracy of 74.47% for Persian author identification. Subsequent feature selection identified the most effective features for the machine learning phase. The results indicated that the relative frequency of function words was more effective than the SFG-based attributes.
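The evaluation protocol can be sketched in a few lines. The sketch below is not the study's system: to stay dependency-free it swaps the multilayer perceptron for a nearest-centroid classifier, and it assumes the five segments are balanced by author (stratified), which the abstract does not state. All names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def stratified_folds(labels, k=5):
    """Assign sample indices to k folds, round-robin within each author."""
    folds = [[] for _ in range(k)]
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    for indices in by_label.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def nearest_centroid_predict(train_x, train_y, test_x):
    """Stand-in classifier (NOT an MLP): pick the author with the closest mean vector."""
    grouped = defaultdict(list)
    for x, y in zip(train_x, train_y):
        grouped[y].append(x)
    centroids = {
        y: [sum(column) / len(rows) for column in zip(*rows)]
        for y, rows in grouped.items()
    }
    return [min(centroids, key=lambda y: math.dist(centroids[y], x)) for x in test_x]


def cross_validate(features, labels, k=5, predict=nearest_centroid_predict):
    """Return accuracy pooled over the k held-out folds."""
    correct = 0
    for held_out in stratified_folds(labels, k):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        predictions = predict(
            [features[i] for i in train],
            [labels[i] for i in train],
            [features[i] for i in held_out],
        )
        correct += sum(p == labels[i] for p, i in zip(predictions, held_out))
    return correct / len(labels)


# Toy data: two hypothetical authors with clearly separated feature vectors.
features = [[0.10 + 0.01 * i, 0.02] for i in range(5)] + [
    [0.02, 0.10 + 0.01 * i] for i in range(5)
]
labels = ["A"] * 5 + ["B"] * 5
accuracy = cross_validate(features, labels, k=5)
print(accuracy)
```

Because `cross_validate` takes the prediction function as a parameter, the same loop could be rerun with function-word features only, SFG features only, or both, which is the comparison the results describe.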
Discussion and Conclusions
The evaluation phase revealed that the function words-based method outperformed the systemic functional grammar (SFG) approach in identifying authors. However, the simultaneous use of both methods improved effectiveness compared to using either method alone. The superior performance of the function words-based method may be attributed to the high frequency of function words and to the fact that authors choose them largely unconsciously.
Among the SFG-based features, combining the top features from all three adjunct types (conjunctive, mood, and comment adjuncts) produced higher accuracy than any single system network alone. Additionally, the results from feature selection indicated that features derived from the modality system network were more effective than those from the conjunction and comment adjunct system networks for Persian author identification.
Overall, while the function words-based method proved to be highly effective on its own, integrating it with SFG-based methods provided a more comprehensive approach, enhancing the accuracy of author identification.

Keywords [English]

author identification
forensic linguistics
systemic functional grammar
function words
conjunctive adjunct
mood adjunct
comment adjunct

References (publication years follow the Solar Hijri calendar)

Al-e Ahmad, Jalal (1346). Nefrin-e Zamin. Tehran: Ferdows.

Al-e Ahmad, Jalal (1350). Panj Dastan. Tehran: Ferdows.

Ebrahimi, Nader (1374). Yek Asheghaneh-ye Aram. Tehran: Roozbahan.

Ebrahimi, Nader (1399). Bar Jaddeh-ha-ye Abi-ye Sorkh. Tehran: Roozbahan.

Jafari, Azita (1388). A study of adjuncts in Persian: Based on functional and formal approaches. Dastur (special issue of Nameh-ye Farhangestan), 5(1), 128-155.

Hussein Hama, Hamza; Aliakbari, Nasrin; Karimi, Yadgar (1400). A study of mood and modality in Sorani Kurdish: A functional analysis. Studies of the Languages and Dialects of Western Iran, 9(4), 1-23.

Dolatabadi, Mahmoud (1395). Ruzegar-e Separi-shodeh-ye Mardom-e Salkhordeh. Tehran: Cheshmeh.

Dolatabadi, Mahmoud (1401). Kelidar. 37th printing. Tehran: Farhang Moaser.

Saedi, Gholamhossein (1377). Ashofteh-halan-e Bidarbakht. Tehran: Negah.

Saedi, Gholamhossein (1397). Gharibeh dar Shahr. Tehran: Negah.

Shamsfard, Mehrnoush; Bijankhan, Mahmood (1401). Persian text and speech processing: A review of theoretical foundations and the latest research findings. Tehran: SAMT.

Arefi, Somayeh; Basiri, Mohammad Ehsan; Roozmand, Omid (1400). Feature selection for author identification in short online Persian texts. Information and Communication Technology of Iran, 13(47-48), 35-57.

Alavi, Bozorg (1386). Varaq-pareh-ha-ye Zendan. Tehran: Negah.

Alavi, Bozorg (1399). Gileh Mard. Tehran: Negah.

Farahmandpour, Zeinab; Nikmehr, Hooman; Mansoorizadeh, Muharram; Tabibzadeh, Omid (1391). A novel intelligent system for identifying Persian-language authors based on writing style. Soft Computing, 1(2), 26-35.

Golshaie, Ramin (1398). Function words as markers of idiolect: A corpus-based approach to author identification in Persian. Jostarha-ye Zabani, 10(3), 293-317.

Golshiri, Houshang (1350). Christine va Kid. Tehran: Ketab-e Zaman.

Golshiri, Houshang (1370). Dar Velayat-e Hava. Stockholm: Asr-e Jadid.

Golshiri, Houshang (1400). Shazdeh Ehtejab. 18th printing. Tehran: Niloofar.

Mahmoud, Ahmad (1353). Hamsayeh-ha. Tehran: Amir Kabir.

Mahmoud, Ahmad (1381). Gharibeh-ha va Pesarak-e Bumi. Tehran: Moin.

Mirzaei, Azadeh (1397). Redefining the concepts of main clause and subordinate clause based on the functional approach. Language and Linguistics, 13(26), 117-132.

Mirzaei, Azadeh (1400). The relationship between polarity and clausal modality in Persian. Studies of the Languages and Dialects of Western Iran, 9(1), 113-135.

Mirzaei, Azadeh; Safari, Pegah (1394). Constructing specialized and general Persian word-texts based on frequency counts of function and content words. Proceedings of the First National Conference on Corpus Linguistics (pp. 175-191). Tehran: Neveeseh Parsi.

Volume 12, Issue 3
Mehr 1403 (September-October 2024)
Pages 59-84