Article type: Research article
Authors
1 PhD in Linguistics, Department of Linguistics, Faculty of Persian Literature and Foreign Languages, Allameh Tabataba'i University, Tehran, Iran
2 Associate Professor, Department of Linguistics, Faculty of Persian Literature and Foreign Languages, Allameh Tabataba'i University, Tehran, Iran
3 Assistant Professor, Department of Computer Science, Faculty of Statistics, Mathematics and Computer Science, Allameh Tabataba'i University, Tehran, Iran
4 Assistant Professor, Department of English Translation Teaching, Karaj Branch, Islamic Azad University, Karaj, Iran
Abstract
Automated author identification is an important field within forensic linguistics. This study compared the effectiveness of systemic functional grammar features (Halliday and Matthiessen, 2014) with that of function words in Persian authorship attribution. First, a corpus of documents written by seven contemporary Iranian authors was collected. Second, a list of function words was extracted from the corpus. In addition, the conjunction, modality, and comment adjunct system networks were used to build a lexicon from linguistic resources. The relative frequencies of the function words and of the systemic functional features were then calculated for each document. A multilayer perceptron classifier, a type of neural network, was used in the learning phase and achieved satisfactory accuracy in the evaluation phase. The results showed that the function word method outperforms the systemic functional approach alone in Persian author identification; however, using the two methods together is more effective than using either alone.
Introduction
Recently, automated author identification has become a key focus of forensic linguistics. Author identification involves determining the writer of a text from a set of potential authors. The text in question could be a threatening letter, an email, a literary work, or a scientific article or book. Author identification rests on the idea that different authors may write about the same topic using overlapping, yet distinct, lexico-grammatical units, a phenomenon referred to as idiolect (Coulthard, 2004).
The first significant attempt to identify writing styles was Mendenhall's study of Shakespeare's plays (1887). Shakespeare's work has remained a test case since then: the play Henry VIII is widely recognized as a collaborative work rather than one authored solely by William Shakespeare, and Plechac (2020) used patterns of accent or stress to identify the contributions of other authors to the play.
In Persian, several studies have addressed authorship attribution (Farahamandpour et al., 2013; Arefi et al., 2021). These studies used recurring features, such as lexical richness, the frequency of syntactic groups, collocations, and the relative frequency of punctuation marks, to detect writing styles. Measuring the frequency of function words is another established method for author identification. Function words, which carry little lexical meaning, indicate the grammatical relationships between the components of a sentence. Golshaie (2019) and Dabagh (2007) used the frequency of Persian function words to identify authors. This study aims to compare the efficacy of function word frequency with that of systemic functional grammar features in automatically identifying writing styles.
Theoretical Framework
Systemic Functional Grammar (SFG) is a component of the social semiotic approach to language known as systemic functional linguistics (Halliday & Matthiessen, 2014). SFG conceptualizes language as a network of systems, that is, interrelated sets of options for creating meaning. Since the 1960s, SFG has been applied in various contexts within computational linguistics (Matthiessen & Bateman, 1991; Teich, 1995). In SFG, the clause is considered the fundamental unit of language and is analyzed from three perspectives: the ideational, interpersonal, and textual metafunctions. The ideational metafunction is further divided into experiential and logical components (Halliday & Matthiessen, 2014).
This study employs three system networks: conjunction, modality, and comment adjunct. These networks correspond to three types of adjuncts: conjunctive adjuncts, mood adjuncts, and comment adjuncts, respectively. Within the conjunction system, conjunctions function as conjunctive adjuncts in the clause structure. They establish relationships in which one segment of text elaborates on, extends, or enhances another segment (Halliday & Matthiessen, 2014).
Modal adjuncts express the speaker's or writer's judgment of, or attitude toward, the content of the message. There are two types of modal adjuncts: (i) mood adjuncts and (ii) comment adjuncts, which belong to the modality and comment adjunct system networks, respectively. Modality covers the intermediate degrees between the positive and negative poles, that is, the region of uncertainty between 'yes' and 'no.' The modality system allows writers to qualify events or entities in terms of their probability, typicality, obligation, or inclination (Halliday & Matthiessen, 2014). The comment adjunct system provides a means for the writer to comment on the status of a message with respect to the textual and interactive context of the discourse (Argamon et al., 2007). Comments can target either the ideational content of the proposition or the interpersonal aspects of the speech function (Halliday & Matthiessen, 2014).
Method
A corpus totaling 2,069,243 words was compiled from the works of seven contemporary Persian writers: Hoshang Golshiri, Bozorg Alavi, Ahmad Mahmoud, Mahmoud Dolatabadi, Nader Ebrahimi, Jalal Al-e Ahmad, and Gholamhossein Saedi. From this corpus, a list of 197 function words was extracted using the Term Frequency-Inverse Document Frequency (TF-IDF) technique. The conjunction, modality, and comment adjunct system networks were then used to create a lexicon.
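The study does not report how the TF-IDF scores were turned into a function word list, so the following is only a minimal sketch of one plausible reading: words that occur in nearly every document have an inverse document frequency close to zero, which makes them candidates for function words. The function names, the ranking rule, and the transliterated toy tokens are all assumptions made for illustration.

```python
import math
from collections import Counter

def idf_scores(documents):
    """Return {term: idf} with idf = log(N / df) over tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def function_word_candidates(documents, top_k):
    """Rank terms by lowest IDF, then by highest total count (assumed rule)."""
    idf = idf_scores(documents)
    totals = Counter(tok for tokens in documents for tok in tokens)
    ranked = sorted(idf, key=lambda t: (idf[t], -totals[t], t))
    return ranked[:top_k]

# Toy documents in rough transliteration (hypothetical data, not the corpus).
docs = [
    "va u be khane raft".split(),
    "u az khane be shahr raft va bargasht".split(),
    "mard be shahr amad va az u porsid".split(),
]
candidates = function_word_candidates(docs, top_k=3)
```

In this toy example the three words that appear in every document ("va", "u", "be") receive an IDF of zero and are returned as candidates. A real list would still need manual filtering, since frequent content words can also appear in every document.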
An author identification system was designed using machine learning techniques. The system tokenized the texts, extracted instances of the lexical units specified in the lexicon, and computed the relative frequencies of semantic attribute values for each text, producing a "feature vector" that described each text. This approach was inspired by the method introduced by Argamon et al. (2007). For the learning phase, a multilayer perceptron classifier, a type of neural network, was used.
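The feature extraction step can be sketched as follows. This is an illustrative simplification, not the system built in the study: the lexicon entries, the feature labels, and the function word list are hypothetical, tokens are transliterated for readability, and each frequency is taken relative to the total number of tokens in the text.

```python
from collections import Counter

# Hypothetical lexicon: lexical unit -> system-network feature label.
LEXICON = {
    "va": "conjunction:extension",
    "amma": "conjunction:extension",
    "shayad": "modality:probability",
    "hamishe": "modality:usuality",
    "khoshbakhtane": "comment:propositional",
}
# Hypothetical function word list (the study used 197 words).
FUNCTION_WORDS = ["va", "be", "az", "amma"]
SFG_LABELS = sorted(set(LEXICON.values()))

def feature_vector(tokens):
    """Relative frequencies of function words, then of SFG feature labels."""
    n = len(tokens)
    if n == 0:
        return [0.0] * (len(FUNCTION_WORDS) + len(SFG_LABELS))
    word_counts = Counter(tokens)
    label_counts = Counter(LEXICON[t] for t in tokens if t in LEXICON)
    fw_part = [word_counts[w] / n for w in FUNCTION_WORDS]
    sfg_part = [label_counts[label] / n for label in SFG_LABELS]
    return fw_part + sfg_part

tokens = "va u shayad be khane amad amma hamishe dir mikonad".split()
vector = feature_vector(tokens)
```

Each text is thus reduced to a fixed-length numeric vector that a classifier can consume. This sketch matches single tokens only; multiword lexical units would need a phrase matcher.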
Results
To evaluate the system, the corpus was divided into five segments and 5-fold cross-validation was applied. Using function words alone yielded satisfactory accuracy. Combining function words with SFG features achieved an accuracy of 74.47% for Persian author identification. Subsequent feature selection identified the most effective features for the learning phase. The results indicated that the relative frequencies of function words were more effective than the SFG-based attributes.
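The evaluation procedure can be sketched as below. To keep the example self-contained, a nearest-centroid classifier stands in for the multilayer perceptron used in the study, and the data are toy feature vectors; the function names and the contiguous fold layout are assumptions.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    if n_samples < k:
        raise ValueError("need at least k samples")
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier: label of the closest class mean (not an MLP)."""
    sums, counts = {}, {}
    for vec, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1

    def squared_distance(label):
        centroid = [s / counts[label] for s in sums[label]]
        return sum((c - v) ** 2 for c, v in zip(centroid, x))

    return min(sorted(sums), key=squared_distance)

def cross_validate(X, y, k=5):
    """Mean accuracy over k folds, each fold held out once for testing."""
    accuracies = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        correct = sum(
            nearest_centroid_predict(train_X, train_y, X[i]) == y[i]
            for i in test_idx
        )
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

# Toy data: two hypothetical authors with clearly separated feature vectors.
X, y = [], []
for i in range(5):
    X.append([0.10 + 0.01 * i, 0.01]); y.append("A")
    X.append([0.01, 0.10 + 0.01 * i]); y.append("B")
mean_accuracy = cross_validate(X, y, k=5)
```

Each text is tested exactly once by a model that never saw it during training, so the averaged accuracy estimates performance on unseen texts. Because the folds here are contiguous, real data should be shuffled or stratified by author first.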
Discussion and Conclusions
The evaluation phase revealed that the function word method outperformed the systemic functional grammar (SFG) approach in identifying authors. However, using both methods together was more effective than using either method alone. The superior performance of the function word method may be attributed to the high frequency of function words and to the fact that authors exert little conscious control over their use.
Among the SFG-based features, the combination of the top features from all three adjunct types (conjunctive, mood, and comment) produced higher accuracy than any single system network alone. Additionally, feature selection indicated that features derived from the modality system network were more effective than those from the conjunction and comment adjunct system networks for Persian author identification.
Overall, while the function words-based method proved to be highly effective on its own, integrating it with SFG-based methods provided a more comprehensive approach, enhancing the accuracy of author identification.
Keywords [English]