Automatic Recognition of Authors Identity in Persian based on Systemic Functional Grammar

Document Type : Original Article

Authors

1 Ph.D. in Linguistics, Department of Linguistics, Faculty of Persian Literature and Foreign Languages, Allameh Tabataba'i University, Tehran, Iran

2 Associate Professor of Linguistics, Department of Linguistics, Faculty of Persian Literature and Foreign Languages, Allameh Tabataba'i University, Tehran, Iran.

3 Assistant Professor, Department of Computer, Faculty of Statistics, Mathematics and Computer, Allameh Tabataba’i University, Tehran, Iran

4 Assistant Professor, Department of English Language and Translation, Islamic Azad University, Karaj, Iran

Abstract

Automated author identification is an important field within forensic linguistics. This study compared the effectiveness of systemic functional grammar features (Halliday and Matthiessen, 2014) with that of function words in Persian authorship attribution. First, a corpus of documents written by seven contemporary Iranian authors was collected. Second, a list of function words was extracted from the corpus. In addition, the conjunction, modality, and comment adjunct system networks were used to build a lexicon from linguistic resources. The relative frequencies of the function words and of the systemic functional features were then calculated for each document. A multilayer perceptron classifier, a type of neural network, was used in the learning phase and achieved satisfactory accuracy in the evaluation phase. The results showed that function words alone outperform systemic functional features alone in Persian author identification; however, using the two methods together is more effective than using either alone.
Introduction
Recently, automated author identification has become a key focus for forensic linguistics. Author identification involves determining the writer of a text from a set of potential authors. The text in question could be a threatening letter, an email, a literary work, or a scientific article or book. The basis for author identification rests on the idea that different authors may write about the same topic using overlapping, yet distinct, lexico-grammatical units, a phenomenon referred to as idiolect (Coulthard, 2004).
The first significant attempt to identify writing styles was Mendenhall's (1887) study of Shakespeare's plays. The play Henry VIII is widely recognized as a collaborative work, not solely authored by William Shakespeare. Plecháč (2021) investigated the use of accent or stress patterns to identify the contributions of other authors to the play.
In Persian, several studies have been conducted to determine authorship (Farahmandpoor et al., 2013; Arefi et al., 2021). These studies used recurring features, such as lexical richness, frequency of syntactic groups, collocations, and the relative frequency of punctuation marks, to detect writing styles. Measuring the frequency of function words is another established method for author identification. Function words, which carry little lexical meaning, indicate the functional relationships between the components of a sentence. Golshaie (2019) and Dabagh (2007) used the frequency of Persian function words to identify authors. This study aims to compare the efficacy of function word frequency with that of systemic functional grammar features in automatically identifying writing styles.
Theoretical Framework
Systemic Functional Grammar (SFG) is a component of the social semiotic approach to language known as systemic functional linguistics (Halliday & Matthiessen, 2014). SFG conceptualizes language as a network of systems, or interrelated sets of options for creating meaning. Since the 1960s, SFG has been applied in various contexts within computational linguistics (Matthiessen & Bateman, 1991; Teich, 1995). In SFG, the clause is considered the fundamental unit of language, and it is analyzed through three perspectives, defined as the ideational, interpersonal, and textual metafunctions. The ideational function is further divided into the experiential and logical aspects (Halliday & Matthiessen, 2014).
This study employs three system networks: conjunction, modality, and comment. These networks correspond to three types of adjuncts: conjunctive adjuncts, mood adjuncts, and comment adjuncts, respectively. In the systemic environment of conjunction, conjunctions function as conjunctive adjuncts within the clause structure. They establish relationships where one segment of text elaborates on, extends, or enhances another segment (Halliday & Matthiessen, 2014).
Modal adjuncts express the speaker's or writer's judgment or attitude toward the content of the message. There are two types of modal adjuncts: (i) mood adjuncts and (ii) comment adjuncts. Mood adjuncts and comment adjuncts are categorized within the modality and comment adjunct system networks, respectively. Modality encompasses intermediate degrees between positive and negative poles, defining the region of uncertainty between 'yes' and 'no.' The modality system allows writers to qualify events or entities in terms of their probability, typicality, obligation, or inclination (Halliday & Matthiessen, 2014). The comment adjunct system provides a means for the writer to comment on the status of a message concerning the textual and interactive context of the discourse (Argamon et al., 2007). Comments can target either the ideational content of the proposition or the interpersonal aspects of the speech function (Halliday & Matthiessen, 2014).
Method
A corpus was compiled from the works of seven contemporary Persian writers: Houshang Golshiri, Bozorg Alavi, Ahmad Mahmoud, Mahmoud Dowlatabadi, Nader Ebrahimi, Jalal Al-e-Ahmad, and Gholamhossein Sa'edi. The corpus totals 2,069,243 words. From this corpus, a list of 197 function words was extracted using the Term Frequency-Inverse Document Frequency (TF-IDF) technique. The conjunction, modality, and comment adjunct system networks were then used to create a lexicon.
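The text does not specify how TF-IDF scores were turned into the 197-word list. As a minimal sketch of one plausible reading, the Python snippet below treats words whose inverse document frequency is zero (words that occur in every document) as function-word candidates and ranks them by total frequency. The function names, the `max_idf` threshold, and the transliterated toy tokens are illustrative assumptions, not the authors' procedure.

```python
import math
from collections import Counter


def idf_scores(documents):
    """Return {word: idf} with idf = log(N / df) over tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def candidate_function_words(documents, max_idf=0.0):
    """Words with idf <= max_idf, ordered by descending total frequency.

    Assumption: a word spread evenly over all documents (low IDF) is more
    likely to be a function word than a content word.
    """
    idf = idf_scores(documents)
    totals = Counter(tok for tokens in documents for tok in tokens)
    candidates = [w for w, score in idf.items() if score <= max_idf]
    return sorted(candidates, key=lambda w: (-totals[w], w))


# Toy corpus of three tokenized "documents" (hypothetical data).
toy_docs = [
    ["va", "dar", "ketab", "va"],
    ["va", "dar", "khaneh"],
    ["dar", "va", "baran", "va"],
]
print(candidate_function_words(toy_docs))  # ['va', 'dar']
```

On a real corpus, a small positive `max_idf` would probably be needed, since few words appear in every single document.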
An author identification system was designed using machine learning techniques. The system tokenized the texts, extracted instances of lexical units specified in the lexicon, and computed the relative frequencies of semantic attribute values for each text, resulting in an overall "feature vector" that described each text. This approach was inspired by the method introduced by Argamon et al. (2007). For the learning phase, a multilayer perceptron classifier, a type of neural network, was utilized.
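The feature-extraction step can be sketched in Python as follows. This is a simplified illustration, not the system described here: it assumes single-token lexical units, normalizes every count by text length, and uses a hypothetical two-entry lexicon with made-up attribute labels and transliterated tokens.

```python
from collections import Counter


def feature_vector(tokens, function_words, sfg_lexicon):
    """Build one feature vector for a tokenized text.

    function_words: ordered list of function words (one feature each).
    sfg_lexicon: {attribute_label: set of lexical units}; one feature per
        label, taken in sorted label order. Both are assumed inputs.
    """
    n_tokens = len(tokens)
    n_features = len(function_words) + len(sfg_lexicon)
    if n_tokens == 0:
        return [0.0] * n_features  # avoid division by zero on empty texts
    counts = Counter(tokens)
    # Relative frequency of each function word.
    vector = [counts[w] / n_tokens for w in function_words]
    # Relative frequency of each systemic functional attribute value.
    for label in sorted(sfg_lexicon):
        hits = sum(counts[unit] for unit in sfg_lexicon[label])
        vector.append(hits / n_tokens)
    return vector


# Hypothetical toy input.
tokens = ["va", "shayad", "dar", "va"]
function_words = ["va", "dar"]
sfg_lexicon = {
    "modality:probability": {"shayad"},
    "conjunction:extension": {"va"},
}
print(feature_vector(tokens, function_words, sfg_lexicon))
# [0.5, 0.25, 0.5, 0.25]
```

Each text yields a vector of the same length, so the vectors can be passed directly to a classifier such as a multilayer perceptron.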
Results
To evaluate the system, the collected corpus was divided into five segments, and a 5-fold cross-validation method was applied. The 5-fold cross-validation demonstrated a satisfactory accuracy when focusing exclusively on function words. The combined use of function words and SFG methods achieved an accuracy of 74.47% for Persian author identification. Subsequent feature selection identified the most effective features for the machine learning phase. The results indicated that the relative frequency of function words outperformed SFG-based attributes in terms of effectiveness.
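As a rough illustration of the evaluation protocol, the Python sketch below splits a data set into five contiguous folds, holds each fold out in turn, and averages the per-fold accuracy. The `fit_predict` callback is a hypothetical placeholder for the classifier; a one-nearest-neighbour rule stands in here for the multilayer perceptron to keep the example short, and the toy vectors are invented.

```python
def k_fold_indices(n_items, k=5):
    """Split range(n_items) into k contiguous folds of near-equal size."""
    if n_items < k:
        raise ValueError("need at least k items")
    base, extra = divmod(n_items, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(vectors, labels, fit_predict, k=5):
    """Mean accuracy over k folds; each fold is the test set exactly once."""
    accuracies = []
    for test_idx in k_fold_indices(len(vectors), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(vectors)) if i not in held_out]
        predictions = fit_predict(
            [vectors[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [vectors[i] for i in test_idx],
        )
        correct = sum(p == labels[i] for p, i in zip(predictions, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


def nearest_neighbour(train_x, train_y, test_x):
    """Stand-in classifier (not an MLP): label of the closest training vector."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [
        train_y[min(range(len(train_x)), key=lambda j: sq_dist(train_x[j], x))]
        for x in test_x
    ]


# Toy data: two well-separated "authors", ten texts (hypothetical).
labels = ["A", "B"] * 5
vectors = [[(0.0 if lab == "A" else 1.0) + 0.001 * i]
           for i, lab in enumerate(labels)]
print(cross_validate(vectors, labels, nearest_neighbour))  # 1.0
```

In practice the texts would usually be shuffled or stratified by author before splitting, so that every fold contains texts from all authors.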
Discussion and Conclusions
The evaluation phase revealed that the function word-based method outperformed the systemic functional grammar (SFG) approach in identifying authors. However, using both methods together was more effective than using either method alone. The superior performance of the function word-based method may be attributed to the high frequency of function words and to the fact that authors use them largely unconsciously.
Among the SFG-based features, the combination of the top features (conjunctive, mood, and comment adjuncts) produced higher accuracy than any single system network alone. Additionally, the results of feature selection indicated that features derived from the modality system network were more effective than those from the conjunction and comment adjunct system networks for Persian author identification.
Overall, while the function words-based method proved to be highly effective on its own, integrating it with SFG-based methods provided a more comprehensive approach, enhancing the accuracy of author identification.
Keywords


Alavi, B. (2007). Scrap papers from prison. Tehran: Negah. (In Persian)
Alavi, B. (2020). Gilemard. Tehran: Negah. (In Persian)
Al-e-Ahmad, J. (1967). The cursing of the land. Tehran: Ferdous. (In Persian)
Al-e-Ahmad, J. (1971). Five stories. Tehran: Ferdous. (In Persian)
Arefi, S., Basiri, M. E., & Roozmand, O. (2021). Feature selection for author identification of Persian online short texts. Journal of Information and Communication Technology, 13(47-48), 35-57. https://dorl.net/dor/20.1001.1.27170414.1400.13.47.4.0 (In Persian)
Argamon, S., & Koppel, M. (2013). A systemic functional approach to automated authorship analysis. Journal of Law & Policy, 21(2), 299-315.
Argamon, S., Whitelaw, C., Chase, P., Hota, S. R., Garg, N., & Levitan, S. (2007). Stylistic text classification using functional lexical features. Journal of the American Society for Information Science and Technology, 58(6), 802-822. https://doi.org/10.1002/asi.20553.
Assi, S. M. (1997). Farsi linguistic database (FLDB). International Journal of Lexicography, 10(3), 5.
Bijankhan, M., Sheykhzadegan, J., Bahrani, M., & Ghayoomi, M. (2011). Lessons from building a Persian written corpus: Peykare. Language Resources and Evaluation, 45(2), 143-164. https://doi.org/10.1007/s10579-010-9132-x.
Coulthard, M. (2004). Author identification, idiolect, and linguistic uniqueness. Applied Linguistics, 25(4), 431-447. https://doi.org/10.1093/applin/25.4.431.
Dabagh, R. M. (2007). Authorship attribution and statistical text analysis. Advances in Methodology and Statistics, 4(2), 149-163. https://doi.org/10.51936/uvjx7198.
Darooneh, A. H., & Shariati, A. (2014). Metrics for evaluation of the author's writing styles: Who is the best? Chaos: An Interdisciplinary Journal of Nonlinear Science, 24(3). https://doi.org/10.1063/1.4895468.
Dowlatabadi, M. (2016). The elderly people’s elapsed time. Tehran: Cheshmeh. (In Persian)
Dowlatabadi, M. (2022). Kelidar (37th print). Tehran: Farhange Moaser. (In Persian)
Ebrahimi, N. (1995). A quiet Romance. Tehran: Rouzbahan. (In Persian)
Ebrahimi, N. (2020). On the blue and red paths. Tehran: Rouzbahan. (In Persian)
Farahmandpoor, Z., Nikmehr, H., Mansoorizade, M., & Tabibzadeh Ghamsary, O. (2013). A novel intelligent Persian authorship system based on writing style. Soft Computing Journal, 1(2), 26-35. https://dorl.net/dor/20.1001.1.23223707.1391.1.2.60.9 (In Persian)
Golshaie, R. (2019). Function words as idiolect markers: A corpus-based approach to authorship attribution in Farsi. Language Related Research, 10(3), 293-317. (In Persian)
Golshiri, H. (1971). Christine and kid. Tehran: Ketab-e Zaman. (In Persian)
Golshiri, H. (1991). Dar velayat-e Hava. Stockholm: Asr-e Jadid. (In Persian)
Golshiri, H. (2021). Prince Ehtejab (18th print). Tehran: Niloufar. (In Persian)
Halliday, M. A. K., & Matthiessen, C. M. M. (2014). Halliday’s introduction to functional grammar (4th ed.). Oxon: Routledge.
Hussein Hama, H., Ali-akbari, N., & Karimi, Y. (2022). Mood and modality in Sorani Kurdish: A functional analysis. Researches in Western Iranian Languages and Dialects, 9(4), 1-23. https://doi.org/10.22126/JLW.2021.6008.1504. (In Persian)
Jafari, A. (2008). An analysis of adjuncts in Persian: A syntacto-discoursal approach. Dastoor, 5(1), 128-155. (In Persian)
Mahmoud, A. (1974). The neighbors. Tehran: Amirkabir. (In Persian)
Mahmoud, A. (2002). The strangers and the little native boy. Tehran: Moin. (In Persian)
Mendenhall, T. C. (1887). The characteristic curves of composition. Science, 9(214s), 237-246.
Martinez-Galicia, J. A., Embarcadero-Ruiz, D., Ríos-Orduña, A., & Gómez-Adorno, H. (2022). Graph-based siamese network for authorship verification. CLEF 2022 Labs and Workshops, Notebook Papers. Italy: Bologna.
Matthiessen, C. M., & Bateman, J. A. (1991). Text generation and systemic-functional linguistics: Experiences from English and Japanese. London and New York: Frances Pinter Publishers.
Mirzaei, A. (2017). Redefining the concepts of dependent and independent clauses according to a functional approach. Language and Linguistics, 13(26), 117-132. (In Persian)
Mirzaei, A. (2018). An introduction to corpus linguistics. Tehran: Allameh Tabataba'i University Press. (In Persian)
Mirzaei, A. (2021). The relationship between polarity and clausal modality in Persian. Journal of Western Iranian Languages and Dialects, 9(1), 113-135. https://doi.org/10.22126/jlw.2020.5372.144. (In Persian)
Mirzaei, A., & Safari, P. (2014). Building specialized and general documents in Persian based on the frequency of function and content words. In Proceedings of the 1st National Conference on Corpus Linguistics (pp. 175-192). Tehran: Neviseh Parsi. (In Persian)
Mirzaei, A., & Safari, P. (2018). Persian discourse treebank and coreference corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Japan: Miyazaki.
Najafi, M., & Tavan, E. (2022). Text-to-text transformer in authorship verification via stylistic and semantical analysis. CLEF 2022 Labs and Workshops, Notebook Papers. Italy: Bologna.
Plecháč, P. (2021). Relative contributions of Shakespeare and Fletcher in Henry VIII: An analysis based on most frequent words and most frequent rhythmic patterns. Digital Scholarship in the Humanities, 36(2), 430-438. https://doi.org/10.1093/llc/fqaa032.
Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., & Webber, B. (2008). The Penn discourse treebank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08). Marrakech, Morocco: European Language Resources Association.
Radford, A. (2004). Minimalist syntax: Exploring the structure of English. Cambridge: Cambridge University Press.
Refaeilzadeh, P., Tang, L., & Liu, H. (2009). Cross-validation. Encyclopedia of database systems, 5, 532-538. Boston: Springer. https://doi.org/10.1007/978-0-387-39940-9_565.
Sa'edi, Gh. (1998). Ashoftehalan-e Bidarbakht. Tehran: Negah. (In Persian)
Sa'edi, Gh. (2018). Stranger in the town. Tehran: Negah. (In Persian)
Schütze, H., Manning, C. D., & Raghavan, P. (2008). Introduction to information retrieval (Vol. 39, pp. 234-265). Cambridge: Cambridge University Press.
Segarra, S., Eisen, M., & Ribeiro, A. (2015). Authorship attribution through function word adjacency networks. IEEE Transactions on Signal Processing, 63(20), 5464-5478. https://doi.org/10.1109/TSP.2015.2451111.
Shamsfard, M., & Bijankhan, M. (2022). Text and speech processing for the Persian language: The state of the art and a brief review of the theoretical foundations. Tehran: Samt. (In Persian)
Shamsfard, M., Hesabi, A., Fadaei, H., Mansoory, N., Famian, A., Bagherbeigi, S., & Assi, S. M. (2010). Semi automatic development of FarsNet: The Persian WordNet. In Proceedings of the 5th Global WordNet Conference (Vol. 29). India: Mumbai.
Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3), 538-556. https://doi.org/10.1002/asi.21001
Teich, E. (1995). A proposal for dependency in systemic functional grammar: Metasemiosis in computational systemic functional linguistics. Doctoral dissertation, University of the Saarland and GMD/IPSI, Darmstadt, Germany.
Uchendu, A., Le, T., Shu, K., & Lee, D. (2020). Authorship attribution for neural text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020) (pp. 8384-8395). Association for Computational Linguistics (ACL).
Weerasinghe, J., Singh, R., & Greenstadt, R. (2021). Feature vector difference based authorship verification for open-world settings. In Proceedings of working of the evaluation Forum (pp. 2201-2207). Romania: Bucharest.