Detecting Offensive Tweets via Topical Feature Discovery over a Large Scale Twitter Corpus

Abstract

In this paper, we propose a novel approach for detecting cussing-related offensive content on Twitter. Our approach exploits the lexical collocation of swearing language via statistical topic modeling on a large Twitter corpus and detects offensive tweets with automatically generated features under a machine learning framework. The approach performed stably and competitively under a variety of machine learning algorithms. For instance, using Logistic Regression it achieved a true positive rate (TP) of 75.1% over 4029 testing tweets, significantly outperforming the popular and highly effective keyword-matching baseline, which has a TP of 69.7%, while keeping the false positive rate (FP) at the same level as the baseline, about 3.77%. In addition to its good performance, our approach also provides an alternative to the large-scale hand annotation efforts required by supervised learning approaches.
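To make the two reported metrics concrete, the following is a minimal sketch of a keyword-matching baseline and of how a true positive rate and a false positive rate could be computed for it. It is not the authors' implementation: the lexicon `OFFENSIVE_KEYWORDS`, the helper names `keyword_baseline` and `tp_fp_rates`, and the toy tweets and labels are all hypothetical placeholders chosen for illustration.

```python
import re

# Hypothetical placeholder lexicon (assumption); the paper's keyword list is not given here.
OFFENSIVE_KEYWORDS = {"darn", "heck"}


def keyword_baseline(tweet, lexicon=OFFENSIVE_KEYWORDS):
    """Flag a tweet as offensive if any whole token appears in the lexicon."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    return any(tok in lexicon for tok in tokens)


def tp_fp_rates(y_true, y_pred):
    """Return (true positive rate, false positive rate) for boolean labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Toy data (assumption): True means the tweet is labeled offensive.
tweets = [
    "what the heck is this",
    "lovely weather today",
    "darn it, missed the bus",
    "heckler in the crowd",
    "you absolute muppet",
]
labels = [True, False, True, False, True]

preds = [keyword_baseline(t) for t in tweets]
tpr, fpr = tp_fp_rates(labels, preds)
print(f"TP rate: {tpr:.3f}, FP rate: {fpr:.3f}")
```

The last toy tweet is labeled offensive but contains no lexicon word, so the baseline misses it. This is the kind of case a fixed keyword list cannot catch and that features learned from word collocations are intended to cover.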

Similar resources

Abusive Language Detection on Arabic Social Media

In this paper, we present our work on detecting abusive language on Arabic social media. We extract a list of obscene words and hashtags using common patterns used in offensive and rude communications. We also classify Twitter users according to whether they use any of these words or not in their tweets. We expand the list of obscene words using this classification, and we report results on a n...

A Model for Detecting of Persian Rumors based on the Analysis of Contextual Features in the Content of Social Networks

A rumor is a collective attempt to interpret a vague but attractive situation by using the power of words. Identifying the language of rumors can therefore help in detecting them. To address the rumor detection problem, previous research has focused more on the contextual information of reply tweets and less on the content features of the original rumor. Most of the studies have been in...

A German Twitter Snapshot

We present a new corpus of German tweets. Due to the relatively small number of German messages on Twitter, it is possible to collect a virtually complete snapshot of German Twitter messages over a period of time. In this paper, we present our collection method, which produced a 24 million tweet corpus representing a large majority of all German tweets sent in April 2013. Further, we analyze t...

Machine Translation for Twitter

We carried out a study in which we explored the feasibility of machine translation for Twitter for the language pair English and German. As a first step we created a small bilingual corpus of 1,000 tweets. Using this corpus we carried out an analysis of the linguistic features of tweets. We tested different strategies of domain adaptation and found that they improved translation performance. In...
