Author gender identification from text using Bayesian Random Forest

Authors

Sajedi, Hedieh (University of Tehran)

Taslimi, Mahnaz (Azad Islamic University, Qazvin Branch)

Abstract:

The heavy use of virtual environments and the connections users form through social networks such as Facebook, Instagram, and Twitter make it more important than ever to identify shared topics in these environments. Several applications benefit from reliable methods for inferring the age and gender of social media users. Such applications span a wide range of fields, from personalized advertising to law enforcement to reputation management. Text posts represent a large portion of user-generated content and contain information that can be relevant to discovering undisclosed user attributes or to investigating the honesty of self-reported age and gender. Because most information is exchanged as text, identifying author attributes such as age, gender, and political and religious opinions from this content is especially important. Gender identification, which can be useful in security and marketing, answers the following question: given a short text document, can we identify whether the author is male or female? This question is motivated by recent events in which people faked their gender on the Internet. In this paper, author gender identification in blog data is investigated. Four groups of features are employed: syntactic features, word-based features, character-based features, and function words. In addition, character n-gram features are used to improve classification accuracy. To evaluate the proposed method, 3212 texts were collected from Technorati.com and blogger.com. Experimental results demonstrate that these types of features are practical. Furthermore, a new classification method called "Bayesian Random Forest" is introduced. Each tree in a Bayesian Random Forest is a Bayes tree. The experimental results show that this method compares favorably with other classification algorithms such as Naïve Bayes, Naïve Bayes Tree, and Random Forest, and that it increases the accuracy of gender identification to 89.5%.
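The feature groups named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `char_ngrams` and `extract_features`, the short `FUNCTION_WORDS` list, the use of punctuation counts as a stand-in for syntactic features, and the default n-gram length of 3 are all assumptions made for illustration, since the abstract does not specify them.

```python
import re
from collections import Counter

# Hypothetical, heavily shortened function-word list; the paper's actual
# list is not given in the abstract.
FUNCTION_WORDS = ("the", "a", "an", "and", "but", "of", "in", "to", "i", "you")


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in the lowercased text."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def extract_features(text, n=3, top_k=5):
    """Build one feature dictionary for a single document."""
    words = re.findall(r"[a-z']+", text.lower())
    n_words = len(words) or 1   # guard against division by zero
    n_chars = len(text) or 1

    features = {
        # Word-based features
        "word_count": len(words),
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(set(words)) / n_words,
        # Character-based features
        "char_count": len(text),
        "upper_ratio": sum(c.isupper() for c in text) / n_chars,
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars,
        # Punctuation counts, used here as an assumed proxy for
        # syntactic features
        "comma_count": text.count(","),
        "question_count": text.count("?"),
        "exclaim_count": text.count("!"),
    }

    # Function-word features: relative frequency of each listed word
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = words.count(fw) / n_words

    # Character n-gram features: raw counts of the most frequent n-grams
    for gram, count in char_ngrams(text, n).most_common(top_k):
        features["ng_" + gram] = count

    return features


if __name__ == "__main__":
    sample = "The cat and the dog."
    for name, value in extract_features(sample).items():
        print(name, value)
```

Because the sketch keeps only the most frequent n-grams of each document, different documents produce different feature keys. A classifier would normally need a fixed n-gram vocabulary chosen from the training set so that every document maps to the same columns.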

Similar resources

Author Identification using Random Forest and Sequential Minimal Optimization

Author identification is a significant factor in the global economic loss due to computer-related crimes. According to the Center for Strategic and International Studies (CSIS), an estimated 375 to 575 billion dollars is lost each year due to computer or cybercrimes. Recently, various techniques have been used to improve the accuracy of author identification. In this paper, we propose combining...

Author Identification: Using Text Mining, Feature Engineering & Network Embedding

Authorship analysis is a challenging area that has developed over centuries, with research widely scattered across multiple disciplines, mainly computational linguistics, text mining, data mining, stylometry and machine learning. Conventional techniques from the past relied heavily on stylometry and text-based content analysis of document text for authorship analysis. More recen...

Bayesian Multinomial Logistic Regression for Author Identification

Motivated by high-dimensional applications in authorship attribution, we describe a Bayesian multinomial logistic regression model together with an associated learning algorithm.

Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

Text Classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents in predefined categories based on documents’ contents and labeled-training samples. Since word detection is a difficult and time consuming task in Persian language, Bayesian text classifier is an appropriate approach to deal with different...

Author Identification from Citations

Machine Learning techniques can be applied to citation data from a network of papers to predict the author of a paper that is currently outside of the network. Using a series of models we have found that we can increase the accuracy from past experiments with citation data, by considering the citations as a network. This allows us to predict with confidence the author of a blind paper.

Journal title

پردازش علائم و داده ها (Signal and Data Processing)

volume 16 issue 1

pages 143-157

publication date 2019-06
