Topic Cube: Topic Modeling for OLAP on Multidimensional Text Databases
نویسندگان
چکیده
As the amount of textual information grows explosively in various kinds of business systems, it becomes more and more desirable to analyze both structured data records and unstructured text data simultaneously. While online analytical processing (OLAP) techniques have been proven very useful for analyzing and mining structured data, they face challenges in handling text data. On the other hand, probabilistic topic models are among the most effective approaches to latent topic analysis and mining on text data. In this paper, we propose a new data model called topic cube to combine OLAP with probabilistic topic modeling and enable OLAP on the dimension of text data in a multidimensional text database. Topic cube extends the traditional data cube to cope with a topic hierarchy and store probabilistic content measures of text documents learned through a probabilistic topic model. To materialize topic cubes efficiently, we propose a heuristic method to speed up the iterative EM algorithm for estimating topic models by leveraging the models learned on component data cells to choose a good starting point for iteration. Experiment results show that this heuristic method is much faster than the baseline method of computing each topic cube from scratch. We also discuss potential uses of topic cube and show sample experimental results.
منابع مشابه
Topic modeling for OLAP on multidimensional text databases: topic cube and its applications
As the amount of textual information grows explosively in various kinds of business systems, it becomes more and more desirable to analyze both structured data records and unstructured text data simultaneously. While online analytical processing (OLAP) techniques have been proven very useful for analyzing and mining structured data, they face challenges in handling text data. On the other hand,...
متن کاملModeling Multidimensional Databases, Cubes and Cube Operations
On-Line Analytical Processing (OLAP) is a trend in database technology, which was recently introduced and has attracted the interest of a lot of research work. OLAP is based on the multidimensional view of data, supported either by multidimensional databases (MOLAP) or relational engines (ROLAP). In this paper we propose a model for multidimensional databases. Dimensions, dimension hierarchies ...
متن کاملiNextCube: Information Network-Enhanced Text Cube
Nowadays, most business, administration, and/or scientific databases contain both structured attributes and text attributes. We call a database that consists of both multidimensional structured data and narrative text data as multidimensional text database. Searching, OLAP, and mining such databases pose many research challenges. To enhance the power of data analysis, interesting entities and r...
متن کاملA review of text mining approaches and their function in discovering and extracting a topic
Background and aim: Four text mining methods are examined and focused on understanding and identifying their properties and limitations in subject discovery. Methodology: The study is an analytical review of the literature of text mining and topic modeling. Findings: LSA could be used to classify specific and unique topics in documents that address only a single topic. The other three text min...
متن کاملIntegrating Data Cube Computation and Emerging Pattern Mining for Multidimensional Data Analysis
Online analytical processing (OLAP) in multidimensional text databases has recently become an effective tool for analyzing text-rich data such as web documents. In this capstone project, we follow the trend of using OLAP and the data cube to analyze web documents, but want to address a new problem from the data mining perspective. In particular, we wish to find contrast patterns in documents of...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2009