Skip to content

SCC problem with kosaraju’s algorithm

April 6, 2012

Problem extracted from coursera.org:

Download the text file here. Zipped version here. (Right click and save link as)

The file contains the edges of a directed graph. Vertices are labeled as positive integers from 1 to 875714. Every row indicates an edge, the vertex label in first column is the tail and the vertex label in second column is the head (recall the graph is directed, and the edges are directed from the first column vertex to the second column vertex). So for example, the 11th row looks liks : “2 47646”. This just means that the vertex with label 2 has an outgoing edge to the vertex with label 47646

Your task is to code up the algorithm from the video lectures for computing strongly connected components (SCCs), and to run this algorithm on the given graph.

Output Format: You should output the sizes of the 5 largest SCCs in the given graph, in decreasing order of sizes, separated by commas (avoid any spaces). So if your algorithm computes the sizes of the five largest SCCs to be 500, 400, 300, 200 and 100, then your answer should be “500,400,300,200,100”. If your algorithm finds less than 5 SCCs, then write 0 for the remaining terms. Thus, if your algorithm computes only 3 SCCs whose sizes are 400, 300, and 100, then your answer should be “400,300,100,0,0”.

The idea of Kosaraju’s algorithm is like this:
1. Compute the finishing time of all the vertices by doing a DFS on the reversed graph;
2. Replace the vertex index with its finishing time to get a new graph, DFS the new graph to compute each vertex’s leader vertex( in a strongly connected component(SCC) if exists any);
3. Do statistical job on the leader vertices’ index. If multiple vertices have same leader vertex, they are in the same SCC.

Here is a quick Python implementation on the problem without any optimization work:

import sys
sys.setrecursionlimit(300000)
source='SCC.txt'
N=875714
def getG():
    G={}
    Grev={}
    for i in range(1,N+1):
        G[i]=[]
        Grev[i]=[]
    fin=open(source)
    for line in fin:
        v1=int(line.split()[0])
        v2=int(line.split()[1])
        G[v1].append(v2)
        Grev[v2].append(v1)
    fin.close()
    return G, Grev

#globals
visited={}
finish={}
leader={}

def init():
    for i in range(1,N+1):
        visited[i]=0
        finish[i]=0
        leader[i]=0

def dfs(G, i):
    global t
    visited[i]=1
    leader[i]=s
    for j in G[i]:
        if(visited[j]==0): dfs(G,j)
    t=t+1
    finish[i]=t

def dfs_loop(G):
    global t
    global s
    t=0 #number of nodes processed so far
    s=0 #current source vertex
    i=N
    while(i>0):
        if(visited[i]==0):
            s=i
            dfs(G,i)
        i=i-1

if __name__ == "__main__":
    init()
    g, grev=getG()
    dfs_loop(grev) #THE FIRST LOOP ON REVERSE GRAPH

    # construct new graph
    newGraph={}
    for i in range(1,N+1):
        temp=[]
        for x in g[i]: temp.append(finish[x])
        newGraph[finish[i]]=temp

    init()    
    dfs_loop(newGraph) #THE SECOND LOOP 

    # statistics
    lst= sorted(leader.values())
    stat=[]
    pre=0
    for i in range(0,N-1):
        if lst[i]!=lst[i+1]:
            stat.append(i+1-pre)
            pre=i+1
    stat.append(N-pre)
    L= sorted(stat)
    L.reverse()
    print(L[0:5])

Here is the C/C++ code directly translated from the Python. (Unlike the Python code above, I did the statistical job externally.)

#include <iostream>
#include <fstream>
#include <cstdio>
#include <vector>
#define N 875714
using namespace std;

struct node{
    bool visited;
    int leader;
    int finish;
    vector<int> linkedVertices;
    vector<int> rLinkedVertices;
};

struct node G[N+1];
struct node newG[N+1];

int t=0;
int s=0;
void dfs(node G[], int i, bool reverse){
    G[i].visited=true;
    G[i].leader=s;
    vector<int> next;
    if (reverse) next= G[i].rLinkedVertices;
    else next= G[i].linkedVertices;
    for(int j=0; j<next.size(); j++){
        if(!G[next[j]].visited) {
            dfs(G, next[j], reverse);
        }
    }
    t++;
    G[i].finish=t;    
}

void dfs_loop(node G[], bool reverse){
    t=0;
    s=0;
    for(int i=N;i>0;--i){
        if (!G[i].visited){
            s=i;
            dfs(G,i,reverse);
        }
    }
}

int main(){
    for(int i=1;i<=N;++i){
        G[i].visited=false;
    }

    FILE* fp=freopen("SCC.txt","r",stdin);
    int head, tail;
    while (scanf("%d %d", &head, &tail) > 0) {
              G[head].linkedVertices.push_back(tail);
              G[tail].rLinkedVertices.push_back(head);
    }
    fclose(fp);

    dfs_loop(G, true);//FIRST LOOP

    for(int i=1;i<=N;++i){
        newG[i].visited=false;
        newG[i].finish=0;
        newG[i].leader=0;
        vector<int> temp;
        for(int j=0; j< G[i].linkedVertices.size(); j++){
            temp.push_back(G[G[i].linkedVertices[j]].finish);
        }
        newG[G[i].finish].linkedVertices=temp;
    }

    dfs_loop(newG,false);//SECOND LOOP

    ofstream ofs("stat.txt", ofstream::out);    
    for (int k=1;k<=N;k++) ofs<< newG[k].leader<<endl;
    ofs.close();
    return 0;
}

From → algo

50 Comments
  1. hajii permalink

    i can see the result

  2. hajii permalink

    sorry , i can not see output and result .

  3. hajii permalink

    stat.txt , what is it for ?
    it is completely empty.

    • it’s the output of all the leader nodes indexes. you do statistic on the file and get the final result, or you just sort the leader indexes in place and get the right result(like what I did in the python code).

      • angel permalink

        Are you sure?

        import sys
        sys.setrecursionlimit(300000)
        source=’stat.txt’
        N=875714
        leader=[]

        def init():
        fin=open(source, “r”)
        for line in fin:
        leader.append(line)
        fin.close()

        if __name__ == “__main__”:
        init()
        lst= sorted(leader)
        stat=[]
        pre=0
        for i in range(0,N-1):
        if lst[i]!=lst[i+1]:
        stat.append(i+1-pre)
        pre=i+1
        stat.append(N-pre)
        L= sorted(stat)
        L.reverse()
        print(L[0:5])

      • I didn’t test your code. Seems like you are doing statistic job on the output file you got with the C++ code? Keep in mind: you should change the code accordingly in order to make it work. I won’t explain how, but it should be super super easy.

      • reddy venikata permalink

        434821,968,459,313,211
        got the above but it is wrong.. i have submitted it got 0/5

  4. hajii permalink

    i am not so much familiar with c++ , i might be wrong , but i could not .
    i just need the answer .

  5. hajii permalink

    could you tell me what was you answer ?

  6. bhawesh permalink

    fatal error: no input files

    • If you got this kind of error, it’s a “segmentation fault”, means your fault 😦
      Change my code accordingly when you use it.

  7. angel permalink

    I have to congratulate you because your code in C++ es quite neat and optimize. It’s not a straightforward solution since the graph doesn’t need to use two list to representate edges and vertices. I get the segmentation fault in your python code using sys.setrecursionlimit() doesn’t change anything

  8. ashu permalink

    i ran ur python code, wat i got as an output is just a blank line!!

    • Typical YOUR FAULT. Increase the stack size.

      • rahul permalink

        and how to that i am new to python

      • rahul permalink

        and how to do that pls tell m new to python. tell me what change i have to make in ur program??

  9. ashu permalink

    i increased it using sys.setrecursionlimit(n) but to no avail

    • sys.setrecursionlimit(n), as the name implies, increases the maximum recurssion numbers, but it does NOT increase the stack size. Did you see the link I posted in the comment above? Take a look.

  10. Aman permalink

    how to overcome stack overflow? plz help fast!
    thnx

  11. Clausia permalink

    Both versions works perfectly! Thanks!

  12. kamal permalink

    whats the answer

  13. Akshay permalink

    what if we don’t know the actual node count ( in this case N) beforehand?

    • WoogaBee permalink

      You could run the part of

      for line in fin:

      first. Of course leaving the lines above (using N for the range) out.
      Then you can check the length with len(G) and set it to a variable.

      • len(G) takes more time. If you want to apply the algo to arbitrary length of data, you should use len(G).

        Sent from my iPhone

  14. formiga permalink

    Great job! Thanks.

  15. Your way to make newGraph is pretty smart.

  16. Pranav Kumar permalink

    how to increase stack size?

    • WoogaBee permalink

      Read codehiker’s post in the comments on the “ulimit” command, it is really pretty easy then.

  17. How much time it take? It’s look like it take for a week)
    I use my own code but it similar to yours)

    PS why don’t you use for visited set()? You don’t need dict becouse you don’t need value – you only need to know if you visited this vortex or not.

    • c++ version a few seconds, python longer.

      Sent from my iPhone

      • but how long? I tried to use your version but it looks like that it isn’t faster as mine

      • Well, I didn’t say it’s an optimal solution, so you are welcome to come with a better solution 🙂

        For python version I don’t quite remember, like 70 seconds or so.

        Also, the reason I use dictionary is that I know dict is implemented using hash-table, so querying would be faster. I don’t use set quite often.

  18. Vladimir permalink

    This was a home task in the coursera’s algorithms course. You’d better not published it or at least published it on the coursera’s forum.

  19. Sarma permalink

    It works.

  20. Deli Shen permalink

    Very nice, both c++ and python versions work with some configuration changes for my system.

  21. nidhi permalink

    is there any way to increase the stack size of python 3.3 in windows8…i get a error while using ulimit command

  22. Aaron permalink

    What do finish and leader do?
    Can you explain your code in C++ with some comments or add an article at the end?
    Thanks! Big Help

  23. the c++ program is creating stat.txt but not providing final output as asked

    • so stat.txt would be just “lst” as in python code then we do statistical process on it

  24. Jeff permalink

    For C++:
    Change the Stuff Starting From //SECOND LOOP

    dfs_loop(newG,false); //SECOND LOOP

    vectorscc(N+1,0);
    vectornum;

    for (int k=1;k= 0 && k < 5 ; i–){
    printf("%d",scc[i]);
    if(i != 1 && k != 4) printf(",");
    k++;
    }
    return 0;
    }

  25. Jeff permalink

    Also, remember to add #include for the sorting routine. Program returns correct solution in 10 seconds 🙂

  26. Javier permalink

    Hi there, I know that the post is old, but let’s see if anyone can help.

    I get a segmentation fault and then python quits unexpectedly.

    I believe is due to the stack size. So is it all simply solved by adding

    unlimited -s below the sys.setrecursionlimit(300000) line?

  27. Javier permalink

    done, it was

    >limit stacksize unlimited

    for a csh shell

  28. anu permalink

    answer is:
    434821, 968, 459, 313, 211

  29. TanzS permalink

    IndentationError: unexpected indent , stat=[]

Leave a comment