Predicting Tag Spam Examining Cooccurrences, Network Structures and URL Components
نویسندگان
چکیده
The task of the ECML/PKDD Discovery Challenge 2008 is to identify spammers in a social bookmarking system. We classify users using three different types of features, based on cooccurences, network properties and url parts. Cooccurrence features are based on the assumption that users associated with similar documents and tags as spammers are likely to be spammers themselves. Network-based features work on a collective scale, assuming common behavioural patterns which can be identified in the graph structures created by tagging activities. Finally, a text classification on the URLs’ components identifies frequent terms in spam URLs. With these features, we train an SVM for classification. Our submission run, combining all three classes of features, performed worse than expected from previous tests. With the wisdom of hindsight, we find an optimal choice of features is to leave out network features entirely but to strengthen URL classification. This is, however, a side effect of wrong assumptions about the test set; network features, used alone, still yield positive results. As network features do not depend on the presence of labeled users, they should be further explored to identify structural properties of tag spam even when no ground truth exists.
منابع مشابه
On the properties of spam-advertised URL addresses
The main purpose of most spam e-mail messages distributed on Internet today is to entice recipients into visiting World Wide Web pages that are advertised through spam. In essence, e-mail spamming is a campaign that advertises URL addresses at a massive scale and at minimum cost for the advertisers and those advertised. Nevertheless, the characteristics of URL addresses and of web sites adverti...
متن کاملSocial network analysis of web links to eliminate false positives in collaborative anti-spam systems
The performance of today’s email anti-spam systems is primarily measured by the percentage of false positives (non-spam messages detected as spam) rather than by the percentage of false negatives (real spam messages left unblocked). One reliable anti-spam technique is the Universal Resource Locator (URL)-based filter, which is utilized by most collaborative signature-based filters. URL-based fi...
متن کاملFeature-based Malicious URL and Attack Type Detection Using Multi-class Classification
Nowadays, malicious URLs are the common threat to the businesses, social networks, net-banking etc. Existing approaches have focused on binary detection i.e. either the URL is malicious or benign. Very few literature is found which focused on the detection of malicious URLs and their attack types. Hence, it becomes necessary to know the attack type and adopt an effective countermeasure. This pa...
متن کاملAn Empirical Study of Clustering Behavior of Spammers and Group-based Anti-Spam Strategies
We conducted an empirical study of the clustering behavior of spammers and explored the group-based anti-spam strategies. We propose to block spammers as groups instead of dealing with each spam individually. We empirically observe that, with a certain grouping criteria such as having the same URL in the spam mail, the relationship among the spammers has demonstrated highly clustering structure...
متن کاملResisting Tag Spam by Leveraging Implicit User Behaviors
Tagging systems are vulnerable to tag spam attacks. However, defending against tag spam has been challenging in practice, since adversaries can easily launch spam attacks in various ways and scales. To deeply understand users’ tagging behaviors and explore more effective defense, this paper first conducts measurement experiments on public datasets of two representative tagging systems: Del.icio...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2008