Researchers at Northwestern University, the University of Bath, and the University of Sydney have developed a new network approach to topic models, machine learning strategies that can discover abstract topics and semantic structures within text documents.
"One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts," the researchers explained in their study. "Topic models are one popular machine-learning approach that infers the latent topical structure of a collection of documents."
Topic models are currently being used to identify semantically related texts and classify documents within a number of fields, including sociology, history, linguistics, and psychology. The most commonly used method, latent Dirichlet allocation (LDA), is also used for bibliometrical, psychological and political analysis, as well as for image processing.
Despite its widespread success, LDA presents several flaws in the way it represents text, such as a lack of method to choose the number of topics, discrepancies with statistical properties of real texts and a lack of justification for the Bayesian prior, which in Bayesian statistical inference is the probability distribution expressed before evidence is presented.
A large portion of recent research into topic models has focused on creating more sophisticated versions of LDA that perform better or can effectively analyze particular aspects of documents.
The approach developed by this team of researchers stems from network theory, a theory used in physics and other scientific fields that provides techniques for analyzing graphs, as well as structures in systems with different interacting agents. Their new framework for topic modeling is based on the approach used to find communities in complex networks, which, in the context of network theory, is a graph with features that occur in modeling of real-life systems.
"I was working on natural language and topic modeling from the perspective of complex systems and complex networks," Martin Gerlach, postdoctoral fellow at Northwestern University told TechXplore. "The problems seemed very similar, yet the communities of computer science (topic modeling) and complex networks seemed to work largely independently. Being trained as a physicist, we wanted to show that two seemingly different problems could be reduced to the same underlying math."
Gerlach and his colleagues devised a new approach to identifying topical structures that relates to the problem of finding communities in complex networks. Their technique represents text corpora as bipartite networks, a class of complex networks that divide nodes into sets X and Y, only allowing connections between nodes in different sets.
"We mapped the problem of topic modeling to the problem of community detection in a network consisting of words and documents showing that they are mathematically equivalent," explained Gerlach.
The researchers' approach, which adapts existing community-detection methods, was found to be more versatile and principled than other existing topic models, for instance detecting the number of topics present in texts and hierarchically grouping both words and documents. Their method used a stochastic block model (SBM), a generative model for graphs that generally maps communities, subsets of items that are connected with one another.
"We solve some of the intrinsic and known problems of popular topic modeling algorithms such as LDA (e.g. how to determine the number of topics)," said Gerlach. "In addition, our work shows how to formally relate methods from community detection and topic modeling, opening the possibility of cross-fertilization between these two fields."
The SBM approach developed by Gerlach and his colleagues could have interesting applications in other areas where machine learning is used, such as the analysis of genetic codes or images. In future, the researchers plan to continue exploring the potential of complex networks both within the context of text analysis and beyond.
"The equivalence between topic modeling and community detection allows to use insights gained in each of the communities and apply to the other domain," said Gerlach. "I hope to use these insights to gain a better understanding of these machine learning algorithms; why they work, and more importantly, under which conditions they do not work."