Generating networks from co-occurrence was proposed several decades ago, yet it still dominates the papers on network discovery. Here I want to give a simple introduction to capturing networks with conventional co-occurrence methods. The content below is loosely related to the foundations of my current work; it is dated but still useful at times.
General Introduction
- This article introduces several methods for capturing useful structured data from unstructured data sets. It only covers the basic implementation of co-occurrence on text data. After reading it, you will know how to generate a graph like the one below from a paragraph of text, a set of Internet data collected by spiders, or even a video.
- As an example, the following picture was generated from part of the script of 《Train to Busan》. I copied the script from the Internet, and it is easy to analyze with code. A small part of the picture was cropped when I published it.
Entity Identification
- A network is composed of a set of nodes and a set of edges. We name the set of nodes V and the set of edges E. The first problem we face is where to get V. This is where entity identification comes in.
- Some simple methods, such as regression (for binary classification) or SVM, can be used if you know the characteristics of the nodes. However, in most cases we cannot even describe exactly what the nodes look like. In those cases, deep learning algorithms such as Convolutional Neural Networks could be considered: you provide some nodes you already know, then ask the model to learn what nodes look like. That is a little complex, so we leave it aside. Here we only consider the best case.
- Here we make an assumption: you already have the set of all nodes. That is, you have V and the data set, and all you need to do is generate a network over V from the given data set. Does that sound too simple? Sometimes it really is the case. For example, when generating a network from a movie like the one above, very few main entities appear, so we can get their identifiers (here, their names) from the web or simply list them ourselves.
Relationship Identification
- Here we come to the second question: how do we get E? I will only introduce one simple method here, the conventional co-occurrence method mentioned above. I will introduce more methods after my current work is finished.
- The co-occurrence network, as its name suggests, uses the information that two entities occurred together. For example, in my analysis of 《Train to Busan》, I simply build an edge between two nodes if they occur in the same paragraph. If an edge already exists between the two nodes, its weight is increased. Once the data set is big enough, the main line of the data set emerges. You can choose directed or undirected edges, and whether or not to build a complete graph.
- The co-occurrence network is only useful for data sets with obvious centralization, and edges with low weight are usually redundant. Also, many nodes will barely register because they play only small roles. Co-occurrence tends to connect every node to the center node, which is unreasonable. Since we only cover the very simple case here, I will present two common ways of reducing the redundancy and fixing the network (thanks to a kind reader for pointing out a typo here).
- The first way is filtering. It is easy to understand: just filter out the edges with low weight. The threshold can be adjusted manually or learned by a specific model.
- The second way is segmenting your network. This requires clustering first to find the community centers. Then cut the edges that connect to center nodes but have low weights. The effect is hard to predict, since it depends on what your network structure looks like.
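The paragraph-level co-occurrence counting and the weight filter described above can be sketched in a few lines of Python. This is only an illustration under assumptions: edges are undirected, the weight is the number of paragraphs in which both entities appear, and the threshold is picked by hand. The function names and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(mentions):
    """Count, for each unordered pair of entities, how many paragraphs contain both."""
    weights = Counter()
    for entities in mentions:
        # sorted() gives each undirected edge one canonical (a, b) key.
        for a, b in combinations(sorted(entities), 2):
            weights[(a, b)] += 1
    return weights

def filter_edges(weights, threshold):
    """Keep only the edges whose weight is at least the threshold."""
    return {edge: w for edge, w in weights.items() if w >= threshold}

# Hypothetical per-paragraph entity sets, for illustration only.
mentions = [
    {"Alice", "Bob"},
    {"Alice", "Bob", "Carol"},
    {"Alice", "Bob"},
    {"Carol"},
]
E = build_cooccurrence(mentions)
strong = filter_edges(E, threshold=2)
```

Here E holds every weighted edge, while strong keeps only the pairs that co-occur in at least two paragraphs. For directed edges you would need some ordering signal inside each paragraph (who speaks to whom, for instance), which plain co-occurrence does not provide.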
Applicable Scope
- The co-occurrence method can be applied in many fields, for example capturing relationships between people from videos, recordings, pictures, etc. I will show how to generate a network from a video later. That will take a long time, since many frames need to be considered, which is a very time-consuming job.
I hope this article is of some help. If it contains any errors, or you have questions or suggestions, please e-mail me. I am glad for us to learn from each other.
Related Data Download Link: Script for 《Train to Busan》
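Original work. Reposting is permitted without notifying the author, but please be sure to indicate, in the form of a hyperlink, the original source of the article (https://forec.github.io/2016/10/03/co-occurrence-structure-capture/), the author (Forec), and this statement.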