The need for domain-specific link prediction algorithms
This is the first part of a series about the need for domain-specific link prediction algorithms.
What Is Link Prediction?
Given a network of friends where person A is a friend of person B and person B is a friend of person C, is there any chance that person A will become a friend of person C, given that both have person B as a friend? Absolutely. Is there any algorithm that can detect this kind of relationship? Of course. This is what link prediction tries to solve: given a graph, we seek an algorithm that can learn how relationships are created and then predict new ones.
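The friend-of-friend intuition above is exactly what the simplest link prediction heuristics capture. Here is a minimal sketch, with an invented toy graph, of scoring a candidate link by counting common neighbors:

```python
# Toy friendship graph (adjacency sets); the names are illustrative only.
friends = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B"},
}

def common_neighbors(graph, u, v):
    """Score a candidate link (u, v) by how many neighbors u and v share."""
    return len(graph[u] & graph[v])

# A and C both have B as a friend, so the candidate link (A, C)
# gets a positive score and would be suggested by this heuristic.
score = common_neighbors(friends, "A", "C")
print(score)  # 1
```

Real heuristics-based models refine this idea (for example by weighting rarer shared neighbors more heavily), but the principle is the same: structure already present in the graph hints at links that are likely to appear.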
The problem of link prediction has been studied for several years. Different approaches have been proposed, such as heuristics-based models, random-walk-based models, and autoencoder-based models. These models have shown impressive results in domains such as social networks, citation networks, and protein structures. However, they do not perform well on graphs from every domain. There is a need for link prediction models that can be applied to the many different phenomena that can be modeled as a graph. One domain that has not been studied from the point of view of graph embeddings, and consequently of link prediction, is engineering.
Engineering is a complex field that involves the integration of different disciplines such as requirements engineering, design, manufacturing, and operation. One of the goals of engineering organizations is to achieve traceability across disciplines. However, achieving traceability is not an easy task: it requires decoupling applications from their data and implementing standard APIs. Fortunately, standards exist, such as OSLC (Open Services for Lifecycle Collaboration), that we can apply as a layer on top of REST APIs to connect applications from different domains and, in consequence, achieve traceability in engineering organizations. Now, let's imagine a scenario where engineering data is connected as a global graph. What could we do with such a graph? Many interesting things.
Better Search
In engineering organizations, as in many other fields, finding the right information in the least possible time is an important need. When data is not connected, it takes too much time and effort to connect the different applications, extract the correct data, transform it, and finally find the information; it is almost impossible when organizations keep isolated data silos. When data is connected and seen as a whole, we no longer need to care about creating the right connections or transforming and aligning data. We just ask for the specific information we need, regardless of whether it comes from a pair of applications or from many more.
Better Visualizations and Reports
When we have isolated data silos, we can only create isolated reports and visualizations, which do not make much sense when we try to understand the complete lifecycle of a specific product. However, when data is connected and viewed as a global graph, it is easier to create complete reports by querying that graph, which in turn allows us to build more substantial visualizations that combine information extracted from different data sources.
Machine Learning Models
In engineering we have relationships across different domains, such as requirements being satisfied by test plans, or test plans making use of simulation models. In practice, all these relationships are created manually by engineers, which takes too much time and demands extra effort to find the right information to link. Worse, these links are often kept in Word documents or spreadsheets, formats that are not machine-processable, so it is difficult to query them to extract information and complicated to verify that the stored links are still valid. But what if a machine learning model could help engineers create these relationships automatically? This is a real link prediction problem in the context of engineering.
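The same neighborhood idea from the friendship example transfers directly to this setting. A hedged sketch, with invented artifact identifiers: rank candidate requirement-to-test-plan links by how many artifacts (here, components) the two sides already share, and suggest the top-ranked links to the engineer for confirmation.

```python
# Invented traceability data: each artifact maps to the set of
# components it already references.
neighbors = {
    "req-1":  {"comp-A", "comp-B"},
    "req-2":  {"comp-C"},
    "test-1": {"comp-A", "comp-B"},
    "test-2": {"comp-C"},
}

def rank_candidates(requirement, tests):
    """Order candidate test plans by overlap with the requirement."""
    scores = {t: len(neighbors[requirement] & neighbors[t]) for t in tests}
    return sorted(tests, key=scores.get, reverse=True)

# test-1 shares two components with req-1, so it is suggested first.
print(rank_candidates("req-1", ["test-2", "test-1"]))  # ['test-1', 'test-2']
```

A deployed model would of course use richer signals (text, structure, learned embeddings), but even this toy ranking shows how a connected graph turns manual link creation into a scoring problem.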
As I have commented, with a global graph we can build a powerful search engine, as well as better visualizations and customized reports. Beyond these solutions, we can apply machine learning algorithms to solve specific problems and automate some processes. With a global graph we can observe and manage our data as a whole; we can add, remove, or modify links and nodes. But what if a machine learning model could do these tasks by itself? That problem is known as link prediction, one of the top problems many researchers are working on, with different approaches achieving interesting results. There are more tasks that can be solved with ML algorithms, such as node classification, sub-pattern extraction, graph classification, and graph completion. At the end of the day, these approaches will help engineering organizations group artifacts by any variable, find complex patterns and sub-patterns within the graph, classify different graphs, complete a graph, and more.
The problem of link prediction has evolved in recent years. Today more data is connected and, most importantly, more data needs to be connected. This gives way to new and different graph structures for which the current models that represent a network in a Euclidean space are not enough. We need to start thinking of graphs not only as social or citation networks but also as domain-specific graphs. Different problems, different graphs, different needs, and therefore different algorithms. This is the new challenge.
The second part will come soon; it will compare traditional and state-of-the-art models for link prediction. The idea is to compare the performance of these models on common datasets, such as social networks and citation networks, versus a graph dataset from the engineering domain.