Predicting Tag Spam Examining Cooccurrences, Network Structures and URL Components

Authors

Nicolas Neubauer

Klaus Obermayer

Abstract

The task of the ECML/PKDD Discovery Challenge 2008 is to identify spammers in a social bookmarking system. We classify users using three different types of features, based on cooccurrences, network properties and URL parts. Cooccurrence features rest on the assumption that users associated with the same documents and tags as spammers are likely to be spammers themselves. Network-based features work on a collective scale, assuming common behavioural patterns that can be identified in the graph structures created by tagging activities. Finally, a text classification on the URLs’ components identifies terms that are frequent in spam URLs. With these features, we train an SVM for classification. Our submission run, combining all three classes of features, performed worse than expected from previous tests. In hindsight, we find that the optimal choice of features is to leave out network features entirely and to strengthen URL classification. This is, however, a side effect of wrong assumptions about the test set; network features, used alone, still yield positive results. As network features do not depend on the presence of labeled users, they should be explored further to identify structural properties of tag spam even when no ground truth exists.
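As a rough illustration of two of the feature types named in the abstract, the sketch below splits a URL into components and scores a user by how often their tags or URLs cooccur with labeled spammers. It is not the authors' implementation: the function names `url_terms` and `spam_cooccurrence`, the alphanumeric tokenization rule, the `(spam_count, ham_count)` table, and the 0.5 default for users with no known items are all assumptions made for this example. The network features and the SVM training step are not shown.

```python
import re


def url_terms(url):
    """Split a URL into lowercase alphanumeric components (assumed tokenization)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def spam_cooccurrence(user_items, labeled_items):
    """Mean spam ratio over the user's items that were seen with labeled users.

    user_items: set of tags or URLs used by the user being scored.
    labeled_items: dict mapping item -> (spam_count, ham_count), counted
        over labeled training users only (hypothetical data layout).
    Returns 0.5 when none of the user's items is known.
    """
    ratios = []
    for item in user_items:
        spam, ham = labeled_items.get(item, (0, 0))
        if spam + ham:
            ratios.append(spam / (spam + ham))
    return sum(ratios) / len(ratios) if ratios else 0.5


if __name__ == "__main__":
    # Toy counts, invented for the example.
    labeled = {"pills": (9, 1), "python": (0, 10)}
    print(url_terms("http://cheap-pills.example.com/buy?id=7"))
    print(spam_cooccurrence({"pills", "python"}, labeled))
```

Values like these could be placed in a per-user feature vector alongside term counts from `url_terms` before being passed to a classifier.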

Similar resources

On the properties of spam-advertised URL addresses

The main purpose of most spam e-mail messages distributed on Internet today is to entice recipients into visiting World Wide Web pages that are advertised through spam. In essence, e-mail spamming is a campaign that advertises URL addresses at a massive scale and at minimum cost for the advertisers and those advertised. Nevertheless, the characteristics of URL addresses and of web sites adverti...

Social network analysis of web links to eliminate false positives in collaborative anti-spam systems

The performance of today’s email anti-spam systems is primarily measured by the percentage of false positives (non-spam messages detected as spam) rather than by the percentage of false negatives (real spam messages left unblocked). One reliable anti-spam technique is the Universal Resource Locator (URL)-based filter, which is utilized by most collaborative signature-based filters. URL-based fi...

Feature-based Malicious URL and Attack Type Detection Using Multi-class Classification

Nowadays, malicious URLs are the common threat to the businesses, social networks, net-banking etc. Existing approaches have focused on binary detection i.e. either the URL is malicious or benign. Very few literature is found which focused on the detection of malicious URLs and their attack types. Hence, it becomes necessary to know the attack type and adopt an effective countermeasure. This pa...

An Empirical Study of Clustering Behavior of Spammers and Group-based Anti-Spam Strategies

We conducted an empirical study of the clustering behavior of spammers and explored the group-based anti-spam strategies. We propose to block spammers as groups instead of dealing with each spam individually. We empirically observe that, with a certain grouping criteria such as having the same URL in the spam mail, the relationship among the spammers has demonstrated highly clustering structure...

Resisting Tag Spam by Leveraging Implicit User Behaviors

Tagging systems are vulnerable to tag spam attacks. However, defending against tag spam has been challenging in practice, since adversaries can easily launch spam attacks in various ways and scales. To deeply understand users’ tagging behaviors and explore more effective defense, this paper first conducts measurement experiments on public datasets of two representative tagging systems: Del.icio...

Publication date: 2008
