SCC problem with kosaraju’s algorithm

April 6, 2012

Problem extracted from coursera.org:

Download the text file here. Zipped version here. (Right click and save link as)

The file contains the edges of a directed graph. Vertices are labeled as positive integers from 1 to 875714. Every row indicates an edge, the vertex label in first column is the tail and the vertex label in second column is the head (recall the graph is directed, and the edges are directed from the first column vertex to the second column vertex). So for example, the 11th row looks liks : “2 47646”. This just means that the vertex with label 2 has an outgoing edge to the vertex with label 47646

Your task is to code up the algorithm from the video lectures for computing strongly connected components (SCCs), and to run this algorithm on the given graph.

Output Format: You should output the sizes of the 5 largest SCCs in the given graph, in decreasing order of sizes, separated by commas (avoid any spaces). So if your algorithm computes the sizes of the five largest SCCs to be 500, 400, 300, 200 and 100, then your answer should be “500,400,300,200,100”. If your algorithm finds less than 5 SCCs, then write 0 for the remaining terms. Thus, if your algorithm computes only 3 SCCs whose sizes are 400, 300, and 100, then your answer should be “400,300,100,0,0”.

The idea of Kosaraju’s algorithm is like this:
1. Compute the finishing time of all the vertices by doing a DFS on the reversed graph;
2. Replace the vertex index with its finishing time to get a new graph, DFS the new graph to compute each vertex’s leader vertex( in a strongly connected component(SCC) if exists any);
3. Do statistical job on the leader vertices’ index. If multiple vertices have same leader vertex, they are in the same SCC.

Here is a quick Python implementation on the problem without any optimization work:

import sys
sys.setrecursionlimit(300000)
source='SCC.txt'
N=875714
def getG():
    G={}
    Grev={}
    for i in range(1,N+1):
        G[i]=[]
        Grev[i]=[]
    fin=open(source)
    for line in fin:
        v1=int(line.split()[0])
        v2=int(line.split()[1])
        G[v1].append(v2)
        Grev[v2].append(v1)
    fin.close()
    return G, Grev

#globals
visited={}
finish={}
leader={}

def init():
    for i in range(1,N+1):
        visited[i]=0
        finish[i]=0
        leader[i]=0

def dfs(G, i):
    global t
    visited[i]=1
    leader[i]=s
    for j in G[i]:
        if(visited[j]==0): dfs(G,j)
    t=t+1
    finish[i]=t

def dfs_loop(G):
    global t
    global s
    t=0 #number of nodes processed so far
    s=0 #current source vertex
    i=N
    while(i>0):
        if(visited[i]==0):
            s=i
            dfs(G,i)
        i=i-1

if __name__ == "__main__":
    init()
    g, grev=getG()
    dfs_loop(grev) #THE FIRST LOOP ON REVERSE GRAPH

    # construct new graph
    newGraph={}
    for i in range(1,N+1):
        temp=[]
        for x in g[i]: temp.append(finish[x])
        newGraph[finish[i]]=temp

    init()    
    dfs_loop(newGraph) #THE SECOND LOOP 

    # statistics
    lst= sorted(leader.values())
    stat=[]
    pre=0
    for i in range(0,N-1):
        if lst[i]!=lst[i+1]:
            stat.append(i+1-pre)
            pre=i+1
    stat.append(N-pre)
    L= sorted(stat)
    L.reverse()
    print(L[0:5])

Here is the C/C++ code directly translated from the Python. (Unlike the Python code above, I did the statistical job externally.）

#include <iostream>
#include <fstream>
#include <cstdio>
#include <vector>
#define N 875714
using namespace std;

struct node{
    bool visited;
    int leader;
    int finish;
    vector<int> linkedVertices;
    vector<int> rLinkedVertices;
};

struct node G[N+1];
struct node newG[N+1];

int t=0;
int s=0;
void dfs(node G[], int i, bool reverse){
    G[i].visited=true;
    G[i].leader=s;
    vector<int> next;
    if (reverse) next= G[i].rLinkedVertices;
    else next= G[i].linkedVertices;
    for(int j=0; j<next.size(); j++){
        if(!G[next[j]].visited) {
            dfs(G, next[j], reverse);
        }
    }
    t++;
    G[i].finish=t;    
}

void dfs_loop(node G[], bool reverse){
    t=0;
    s=0;
    for(int i=N;i>0;--i){
        if (!G[i].visited){
            s=i;
            dfs(G,i,reverse);
        }
    }
}

int main(){
    for(int i=1;i<=N;++i){
        G[i].visited=false;
    }

    FILE* fp=freopen("SCC.txt","r",stdin);
    int head, tail;
    while (scanf("%d %d", &head, &tail) > 0) {
              G[head].linkedVertices.push_back(tail);
              G[tail].rLinkedVertices.push_back(head);
    }
    fclose(fp);

    dfs_loop(G, true);//FIRST LOOP

    for(int i=1;i<=N;++i){
        newG[i].visited=false;
        newG[i].finish=0;
        newG[i].leader=0;
        vector<int> temp;
        for(int j=0; j< G[i].linkedVertices.size(); j++){
            temp.push_back(G[G[i].linkedVertices[j]].finish);
        }
        newG[G[i].finish].linkedVertices=temp;
    }

    dfs_loop(newG,false);//SECOND LOOP

    ofstream ofs("stat.txt", ofstream::out);    
    for (int k=1;k<=N;k++) ofs<< newG[k].leader<<endl;
    ofs.close();
    return 0;
}

From → algo

50 Comments

hajii permalink

i can see the result

Reply
hajii permalink

sorry , i can not see output and result .

Reply
- codehiker permalink
  
  It definitely works for me. Make sure you have big enough stack to run the code.
  
  Reply
hajii permalink

stat.txt , what is it for ?
it is completely empty.

Reply
- codehiker permalink
  
  it’s the output of all the leader nodes indexes. you do statistic on the file and get the final result, or you just sort the leader indexes in place and get the right result(like what I did in the python code).
  
  Reply
  - angel permalink
    
    Are you sure?
    
    import sys
    sys.setrecursionlimit(300000)
    source=’stat.txt’
    N=875714
    leader=[]
    
    def init():
    fin=open(source, “r”)
    for line in fin:
    leader.append(line)
    fin.close()
    
    if __name__ == “__main__”:
    init()
    lst= sorted(leader)
    stat=[]
    pre=0
    for i in range(0,N-1):
    if lst[i]!=lst[i+1]:
    stat.append(i+1-pre)
    pre=i+1
    stat.append(N-pre)
    L= sorted(stat)
    L.reverse()
    print(L[0:5])
  - codehiker permalink
    
    I didn’t test your code. Seems like you are doing statistic job on the output file you got with the C++ code? Keep in mind: you should change the code accordingly in order to make it work. I won’t explain how, but it should be super super easy.
  - reddy venikata permalink
    
    434821,968,459,313,211
    got the above but it is wrong.. i have submitted it got 0/5
hajii permalink

i am not so much familiar with c++ , i might be wrong , but i could not .
i just need the answer .

Reply
hajii permalink

could you tell me what was you answer ?

Reply
bhawesh permalink

fatal error: no input files

Reply
- codehiker permalink
  
  If you got this kind of error, it’s a “segmentation fault”, means your fault 😦
  Change my code accordingly when you use it.
  
  Reply
angel permalink

I have to congratulate you because your code in C++ es quite neat and optimize. It’s not a straightforward solution since the graph doesn’t need to use two list to representate edges and vertices. I get the segmentation fault in your python code using sys.setrecursionlimit() doesn’t change anything

Reply
- codehiker permalink
  
  ulimit command
  
  Reply
ashu permalink

i ran ur python code, wat i got as an output is just a blank line!!

Reply
- codehiker permalink
  
  Typical YOUR FAULT. Increase the stack size.
  
  Reply
  - rahul permalink
    
    and how to that i am new to python
  - rahul permalink
    
    and how to do that pls tell m new to python. tell me what change i have to make in ur program??
ashu permalink

i increased it using sys.setrecursionlimit(n) but to no avail

Reply
- codehiker permalink
  
  sys.setrecursionlimit(n), as the name implies, increases the maximum recurssion numbers, but it does NOT increase the stack size. Did you see the link I posted in the comment above? Take a look.
  
  Reply
Aman permalink

how to overcome stack overflow? plz help fast!
thnx

Reply
Clausia permalink

Both versions works perfectly! Thanks!

Reply
- Aman permalink
  
  plz tell us the answer 🙂
  
  Reply
kamal permalink

whats the answer

Reply
Akshay permalink

what if we don’t know the actual node count ( in this case N) beforehand?

Reply
- WoogaBee permalink
  
  You could run the part of
```
for line in fin:
```
  first. Of course leaving the lines above (using N for the range) out.
  Then you can check the length with len(G) and set it to a variable.
  
  Reply
  - codehiker permalink
    
    len(G) takes more time. If you want to apply the algo to arbitrary length of data, you should use len(G).
    
    Sent from my iPhone
formiga permalink

Great job! Thanks.

Reply
Jing Lu permalink

Your way to make newGraph is pretty smart.

Reply
Pranav Kumar permalink

how to increase stack size?

Reply
- WoogaBee permalink
  
  Read codehiker’s post in the comments on the “ulimit” command, it is really pretty easy then.
  
  Reply
Ishayahu Lastov permalink

How much time it take? It’s look like it take for a week)
I use my own code but it similar to yours)

PS why don’t you use for visited set()? You don’t need dict becouse you don’t need value – you only need to know if you visited this vortex or not.

Reply
- codehiker permalink
  
  c++ version a few seconds, python longer.
  
  Sent from my iPhone
  
  Reply
  - Ishayahu Lastov permalink
    
    but how long? I tried to use your version but it looks like that it isn’t faster as mine
  - codehiker permalink
    
    Well, I didn’t say it’s an optimal solution, so you are welcome to come with a better solution 🙂
    
    For python version I don’t quite remember, like 70 seconds or so.
    
    Also, the reason I use dictionary is that I know dict is implemented using hash-table, so querying would be faster. I don’t use set quite often.
Vladimir permalink

This was a home task in the coursera’s algorithms course. You’d better not published it or at least published it on the coursera’s forum.

Reply
Sarma permalink

It works.

Reply
Deli Shen permalink

Very nice, both c++ and python versions work with some configuration changes for my system.

Reply
nidhi permalink

is there any way to increase the stack size of python 3.3 in windows8…i get a error while using ulimit command

Reply
Aaron permalink

What do finish and leader do?
Can you explain your code in C++ with some comments or add an article at the end?
Thanks! Big Help

Reply
am permalink

the c++ program is creating stat.txt but not providing final output as asked

Reply
- am permalink
  
  so stat.txt would be just “lst” as in python code then we do statistical process on it
  
  Reply
Jeff permalink

For C++:
Change the Stuff Starting From //SECOND LOOP

dfs_loop(newG,false); //SECOND LOOP

vectorscc(N+1,0);
vectornum;

for (int k=1;k= 0 && k < 5 ; i–){
printf("%d",scc[i]);
if(i != 1 && k != 4) printf(",");
k++;
}
return 0;
}

Reply
Jeff permalink

Also, remember to add #include for the sorting routine. Program returns correct solution in 10 seconds 🙂

Reply
Javier permalink

Hi there, I know that the post is old, but let’s see if anyone can help.

I get a segmentation fault and then python quits unexpectedly.

I believe is due to the stack size. So is it all simply solved by adding

unlimited -s below the sys.setrecursionlimit(300000) line?

Reply
Javier permalink

done, it was

>limit stacksize unlimited

for a csh shell

Reply
anu permalink

answer is:
434821, 968, 459, 313, 211

Reply
- cindy permalink
  
  how did u get that?(in c++)
  
  Reply
TanzS permalink

IndentationError: unexpected indent , stat=[]

Reply
Anthonyvuh permalink

her latest blog https://1xslots-africa.site

Reply