How Django uses topological sorting for resolving migration dependencies

Siddhartha Sehgal
The Startup
Published in
6 min readJun 16, 2020

--

Those who are familiar with Django, a Python web framework, use migrations on a regular basis to register new models and to edit existing models. Error — ‘CyclicDependencyError ’— occurs when the migration cannot be done due to interdependence of two models’ structure over each other.

Reading about graph data structures, and hence about topological sorting, it got me thinking if this is what Django uses as well to make sure all migrations are in order and there is no cyclic dependency. A simple look into the Django library code reveals that it does in fact use topological sorting. (source code below).

I explored more to find some other places it is used as well —

  • Spark uses it for making sure the directed acyclic graphs, or DAGs, are valid (source code below).
  • Maven dependency resolution, etc.

Let’s get into Topological Sorting.

What is Topological Sorting?

In computer science, a topological sort or topological ordering of a directed graph is a linear ordering of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering — wikipedia

Or in simpler terms- assuming a basic understanding of graphs, imagine a set of tasks, in which some tasks are dependant on others in the set. Topological ordering gives the order in which the tasks can be completed, making sure any task is only completed when the tasks it depends on are complete as well.

Popular use case example for this is — Course scheduling.

Students have certain number of courses that they have to take each year/semester.

Vertices as different courses

Now some courses have other courses as prerequisites. Assuming the courses as vertices of a graph and the dependency between courses as directed edges, we can make a graph that could look like:

Edges showing dependencies and its direction

In the above graph, course B has course A as prerequisite. So, course A needs to be completed before taking course B, and so on. The direction of the edge shows the order in which the courses have to be taken.

Topological sort should give us the correct order to take these courses so that all dependency conditions are met, and we take all courses as well.

There are some conditions for a graph to have a topological sort solution.

  • Firstly the graph should be directed. Since in an undirected graph, edges have no start and end point — nodes either are mutually adjacent or mutually not adjacent. There is no dependency, so to speak. So a topological ordering would not make sense.
    Given an edge {u, v} = {v, u}, it’s ambiguous which node would have to come first in the ordering, since neither one occupies a privileged position over the other.
  • Secondly the graph should not have any cycles. Cycle denotes dependencies between at least two vertices in the graph that is mutual.
    For example, if course A is required to take course B and course B is required to take course A, a student can never take any of the courses as the prerequisite would never be fulfilled. (Another analogy would be professionals needing experience for job and need job for experience.)
    Such graphs will not have a topological order.
    To check if a graph is acyclic, we make sure there exists no backward edge in the graph. Alternatively, if there is no vertex with ‘no-incoming edge’, the graph will have cycle(s) and cannot be sorted topologically.

NOTE: There is an implementation of topological ordering for graphs with cycles as well, like in UNIX and UNIX-like systems, called tsort. However it involves giving user a ‘cycle in data’ error and provides a partially correct topological order. Topic for another day. Some links to read more about it are:
-> tsort in UNIX/UNIX-like systems
-> tsort in GNU

Approach to Topological Sort

Since it requires visiting each vertex of the graph, depth first search, or DFS, is a simple choice to make while approaching topological ordering.

DFS is an algorithm for traversing or searching tree or graph data structures. The algorithm starts at the root node and explores as far as possible along each branch before backtracking — wikipedia.

The code for DFS will look like this:

  1. Select a starting node to create the DFS route from.
  2. Traverse to its neighbor and record the route.
  3. Repeat step 2 until there are no new vertex, or all vertices are visited.
  4. Backtrack and repeat step 2 for other routes.
Python code for graph Depth First Search

Running on this routine on a simpler graph, where we start at node A,will look like:

DFS in steps, dotted circles show visited vertices
Result = [ A, B, C, D, E ]

Now with some understanding of DFS, we can modify the code to find the topological sort of the graph.

From DFS -> Topological sort.

When a route from a starting node is completely visited, record that node. That will be the Topological order. For example order in which courses should be taken to satisfy all dependencies.

The code for topological sorting will look like this:

  1. Iterate over the vertices/nodes of the graph.
  2. If the vertex has no incoming edge, run the dfs_visit subroutine for the node.
  3. Once the route from a node is complete, add the node in the first position in the ordered list (result).
Python code for Topological sorting using DFS
Topological Order of courses
Result = [ A, B, D, E, C ]

There is a shortcoming with the code, it does not check for presence of cycles in the graph. We can modify the code further to use visited table, to check for cycles as well.

The visited list/dict we maintain will have values from {0, -1, 1} for each vertex.
0 : a vertex is still to be visited.
-1: vertex is currently in the recursion stack.
1 : the vertex is visited and recorded (also establishing that no new route/cycle will emerge if we start visiting from this vertex).

Code below:

Python code for Topological sorting with checks for presence of cycles

This concludes Topological Sort using DFS. There is Kahn’s Algorithm for topological sorting as well, that involves listing out all the vertices with no-incoming edges first. Details here:

To map it to the Django use case, the models are vertices, and their dependencies are the directed edges. In case of Apache Spark, the tasks (such as map, udfs, etc.) are the vertices and the order in which they need to be done are the directed edges.

Whenever you see Django figuring when two or more models depend on each other and cannot be registered as is, or when using Apache Spark DAGs, you now know topological sorting is behind them.

--

--