New in Python 3.9 (2)-Sort directed acyclic graphs in Python

Introduction

New changes in Python 3.9 scheduled to be released in October 2020 are summarized in the article What's new in Python 3.9 (Summary) I am. I have a separate article on a topic that seems to be a little voluminous, but this is the second one, about sorting directed acyclic graphs.

Directed acyclic graph and topological sort

First of all, the graph here is not a line graph, a bar graph, or a chart that visually represents data, but a graph of graph theory. A graph is a type of data structure that consists of nodes (vertices) and edges (branches) that connect them. By giving a node or edge some meaning, you can represent relevant information.

There are several types of graphs, and whether the first fork has a direction at the edge. A graph with a direction is called a directed graph (left side), and a graph without a direction is called an undirected graph (right side). スクリーンショット 2020-05-31 13.11.10.JPG

Edges connect nodes, but you can use a directed graph if there is some kind of master-slave relationship between the connected nodes, and an undirected graph if there is an equal relationship.

And the point is whether it is a cycle graph or not. If there is a node that returns to you by following the edge, it is called a cyclic graph, and if it is not, it is called a non-cycle graph. The Directed Acyclic Graph (DAG) has two characteristics, both directed and non-circulated, and is the graph we are dealing with this time.

This directed acyclic graph (DAG) can be used to represent a variety of information. In a strange example, the crypto asset IOTA uses DAG. Ordinary crypto assets use blockchain, and I think it can be called a one-way DAG, but IOTA connects it with a DAG that has multiple edges. it's interesting.

In a slightly more common application, there is a task scheduling problem with dependencies. Each task is a node, and the dependency of the task is represented by an edge. For example, if A → B, B cannot be started until A ends. Even if you have a list of tasks with complicated dependencies, you can see that they can be expressed in DAG by breaking them down into task-to-task relationships.

For example, B and C cannot start until A is over. And D cannot start until B and C are finished, such a dependency is expressed as follows. スクリーンショット 2020-05-31 14.29.53.JPG

A more complicated example would be something like this: スクリーンショット 2020-05-31 14.31.53.JPG

It's a little derailed, but there is a tool for visualizing DAGs called Graphviz. This describes the DAG in a unique format called the dot format, draws it, and converts it to an image format such as PNG. What's interesting is that there are several drawing algorithms available that will give you completely different outputs with the same input.

For example, if you write the second example above in dot format, it will look like this.

digraph example {
  graph [
    labeljust = "c"
  ];
  7, 5, 3, 11, 8, 2, 9, 10;
  7 -> 11;
  7 -> 8;
  5 -> 11;
  3 -> 8;
  3 -> 10;
  11 -> 2;
  11 -> 9;
  11 -> 10;
  8 -> 9;
}

If you try to output this with Graphviz's drawing tool, you can make so many variations.

it's interesting. It doesn't look like the very same DAG, but they are all drawn from the same dot file.

Now, there is an order relationship between the tasks represented by the DAG, so you want to know in what order you should put them away. That is a schedule issue. For that purpose, it is necessary to arrange them one-dimensionally while satisfying the dependencies, which is called ** topological sort **.

Although the introduction has been lengthened, Python 3.9 adds a feature called TopologicalSorter to the functools module, which allows this topological sort to be done as standard.

Try using Topological Sorter

Let's try the example with A, B, C, D nodes shown at the very beginning.

>>> from functools import TopologicalSorter
>>> ts = TopologicalSorter()
>>> ts.add('B', 'A')
>>> ts.add('C', 'A')
>>> ts.add('D', 'B', 'C')
>>> tuple(ts.static_order())
('A', 'B', 'C', 'D')

After creating an empty sorter with TopologySorter (), we are adding nodes one by one. TopologySorter.add () lists the nodes to be registered in the first argument and the nodes that have dependencies on the subsequent arguments. Nodes that do not have dependent nodes (ʻAin this case) can skip registration (because they will appear as dependent when registering other nodes). Then, the result sorted byTopologySorter.static_order ()` is returned by generator. Since it is difficult to see as it is, it is converted to tuple and displayed.

If the graph is decided from the beginning, you can give it to the constructor and make it all at once.

>>> from functools import TopologicalSorter
>>> graph = {'B': {'A'}, 'C': {'A'}, 'D': {'B', 'C'}}
>>> ts = TopologicalSorter(graph)
>>> tuple(ts.static_order())
('A', 'B', 'C', 'D')

Here, what is given to the constructor is dictionary type data, in which the key is a node and the node (a set) on which the value depends. When specifying multiple, the set type is used here, but it seems that it can be a tuple or a list.

This is all you need to sort, but there are also some methods for when you want a little more control. First of all, "what can be done at the moment". It will be a list of tasks whose dependencies are satisfied, using get_ready ().

As an example, let's use the slightly complicated example above.

>>> from functools import TopologicalSorter
>>> graph = {
...    2: {11},
...    8: {7, 3},
...    9: {11, 8},
...   10: {11, 3},
...   11: {7, 5},
... }
>>> ts = TopologicalSorter(graph)
>>> ts.prepare()
>>> ts.get_ready()
(3, 7, 5)

After creating TopologySorter by giving the dependency data of the graph in the constructor, first prepare (). This is a method that performs pre-processing for subsequent processing. We are also checking if the graph registered here is patrol, and if it is patrol, CycleError will be displayed. In the above example, this prepare () was not called, but it is automatically called in static_order () (it is actually called when fetching the first value of generator).

Subsequent calls to get_ready () will return (3, 7, 5). It is listed as "executable" among the nodes included in the graph that have no dependencies. Calling prepare () again will return () this time. It means, "(For now) there is nothing else."

I will color the graph above. Light green indicates that the node is viable.

So what if the 5 and 7 of the feasible tasks are done? The done () method teaches that to the sorter. If you call the completed tasks side by side as arguments, the internal state will change. So if you run get_ready () again, this time it will return (11,).

>>> ts.done(5,7)
>>> ts.get_ready()
(11,)

The figure is changed to a darker color to indicate that 5 and 7 are complete. And 11 is ready to run because all the dependent tasks have been completed. It matches the result of ts.get_ready () above.

We will repeat this, but let's complete it in the order of 11 → 3 → 8.

>>> ts.done(11)
>>> ts.get_ready()
(2,)
>>> ts.done(3)
>>> ts.get_ready()
(8, 10)
>>> ts.done(8)
>>> ts.get_ready()
(9,)

You can check if there are any unexecuted tasks in the graph with ʻis_active (). In the above state, 2, 9, and 10 have not been executed yet, so you can see that the return value of ʻis_active () changes before and after making them done ().

>>> ts.is_active()
True
>>> ts.done(2, 9 ,10)
>>> ts.is_active()
False

Summary

I investigated and used the new ToplogicalSorter function added to the functools module of Python 3.9. I thought it might be useful for deciding the order of batch processing with dependencies.