Exploring text datasets by visualizing relevant words
نویسندگان
چکیده
When working with a new dataset, it is important to first explore and familiarize oneself with it, before applying any advanced machine learning algorithms. However, to the best of our knowledge, no tools exist that quickly and reliably give insight into the contents of a selection of documents with respect to what distinguishes them from other documents belonging to different categories. In this paper we propose to extract ‘relevant words’ from a collection of texts, which summarize the contents of documents belonging to a certain class (or discovered cluster in the case of unlabeled datasets), and visualize them in word clouds to allow for a survey of salient features at a glance. We compare three methods for extracting relevant words and demonstrate the usefulness of the resulting word clouds by providing an overview of the classes contained in a dataset of scientific publications as well as by discovering trending topics from recent New York Times article snippets.
منابع مشابه
Discovering topics in text datasets by visualizing relevant words
When dealing with large collections of documents, it is imperative to quickly get an overview of the texts’ contents. In this paper we show how this can be achieved by using a clustering algorithm to identify topics in the dataset and then selecting and visualizing relevant words, which distinguish a group of documents from the rest of the texts, to summarize the contents of the documents belon...
متن کاملAn Improved K-Nearest Neighbor with Crow Search Algorithm for Feature Selection in Text Documents Classification
The Internet provides easy access to a kind of library resources. However, classification of documents from a large amount of data is still an issue and demands time and energy to find certain documents. Classification of similar documents in specific classes of data can reduce the time for searching the required data, particularly text documents. This is further facilitated by using Artificial...
متن کاملAn Improved K-Nearest Neighbor with Crow Search Algorithm for Feature Selection in Text Documents Classification
The Internet provides easy access to a kind of library resources. However, classification of documents from a large amount of data is still an issue and demands time and energy to find certain documents. Classification of similar documents in specific classes of data can reduce the time for searching the required data, particularly text documents. This is further facilitated by using Artificial...
متن کاملUser-Directed Sentiment Analysis: Visualizing The Affective Content Of Documents
Recent advances in text analysis have led to finer-grained semantic analysis, including automatic sentiment analysis— the task of measuring documents, or chunks of text, based on emotive categories, such as positive or negative. However, considerably less progress has been made on efficient ways of exploring these measurements. This paper discusses approaches for visualizing the affective conte...
متن کاملVisualizing Spoken Interaction
This paper introduces and explores the use of IVEE, a technique for visualization and dynamic database queries applied to a database of transcriptions from various types of linguistic interaction. The use of IVEE is illustrated by presentation of data on vocabulary richness, verbal dominance, hesitance, uncertainty and liviliness, 1. Visualizing and exploring complex datasets with dynamic queri...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- CoRR
دوره abs/1707.05261 شماره
صفحات -
تاریخ انتشار 2017