Saturday, May 31, 2014

Longest Common Substring Algorithms

Given strings "ABAB" and "BABA"...

A dynamic programming style algorithm constructs a matrix of the substring lengths.

    |   |   | A | B | A | B |
    |   | O | O | O | O | O |
    | B | O | O | 1 | O | 1 |
    | A | O | 1 | 0 | 2 | 0 |
    | B | O | 0 | 2 | 0 | 3 |
    | A | O | 1 | 0 | 3 | 0 |

In this matrix, the 5th column tests the substring ABA of the ABAB sequence. So cell [5, 5] compares ABA to BAB ... the lcs is BA, length 2.

The other approach is to construct a generalized suffix tree for the strings, then find the deepest internal nodes which have leaf nodes from all the strings in the subtree below it.

A Simple Suffix Tree Implementation:

"""http://goo-apple.appspot.com/article/2e8d3c6a-2c38-48b9-96c6-240b4ded253a"""
class Node:
def __init__(self, start, substr):
self.start = start
self.substr = substr
self.branches = {}
def insert_into_tree(subroot, suffix, start):
prefix_len = len(subroot.substr)
new_suffix = str(suffix[prefix_len:])
if(len(subroot.branches) == 0):
left_child = Node(subroot.start, "")
right_child = Node(start, new_suffix)
subroot.branches[""] = left_child
subroot.branches[new_suffix] = right_child
else:
for (substr, node) in subroot.branches.items():
if len(substr) > 0 and new_suffix.startswith(substr):
insert_into_tree(node, new_suffix, start)
break
else:
new_child = Node(start, new_suffix)
subroot.branches[new_suffix] = new_child
def build_suffix_tree(t_str):
len_str = len(t_str)
i = len_str - 1
root = Node(len_str, "")
while i >= 0:
insert_into_tree(root, str(t_str[i:]), i)
i -= 1
return root
def display_all_suffix(subroot, suffix_s_prefix, level = 0):
if len(subroot.branches) == 0:
print suffix_s_prefix, level
return
for (substr, node) in subroot.branches.items():
display_all_suffix(node, suffix_s_prefix + substr, level + 1)
root = build_suffix_tree("BCABC")
display_all_suffix(root, "")
view raw suffix_tree.py hosted with ❤ by GitHub

Dijkstras Algorithm for shortest path:

"""
Dijkstra's algorithm for shortest paths.
http://code.activestate.com/recipes/577343-dijkstras-algorithm-for-shortest-paths/)
- dijkstra(G, s) finds all shortest paths from s to each other vertex.
- shortest_path(G, s, t) uses dijkstra to find the shortest path from s to t.
The input graph G is assumed to have the following representation:
- A vertex can be any object that can be used as an index into a dictionary.
- G is a dictionary, indexed by vertices.
- For any vertex v, G[v] is itself a dictionary, indexed by neighbors of v.
- For any edge v->w, G[v][w] is the length of the edge.
"""
from priodict import priorityDictionary
def dijkstra(graph, start, end=None):
final_distances = {}
predecessors = {}
estimated_distances = priorityDictionary()
estimated_distances[start] = 0
for vertex in estimated_distances:
final_distances[vertex] = estimated_distances[vertex]
if vertex == end:
break
for edge in graph[vertex]:
path_distance = final_distances[vertex] + graph[vertex][edge]
if edge in final_distances:
if path_distance < final_distances[edge]:
raise ValueError("Better path to already-final vertex")
elif edge not in estimated_distances or path_distance < estimated_distances[edge]:
estimated_distances[edge] = path_distance
predecessors[edge] = vertex
return final_distances, predecessors
def shortest_path(graph, start, end):
"""
Find a single shortest path from the given start vertex to the given end,
output as list of vertices in order along the shortest path.
"""
final_distances, predecessors = dijkstra(graph, start, end)
path = []
while 1:
path.append(end)
if end == start:
break
end = predecessors[end]
path.reverse()
return path
if __name__ == "____main__":
pass
# G = {'s':{'u':10, 'x':5}, 'u':{'v':1, 'x':2}, 'v':{'y':4},
# 'x':{'u':3, 'v':9, 'y':2}, 'y':{'s':7, 'v':6}}
# The shortest path from s to v is ['s', 'x', 'u', 'v'] and has length 9.
view raw dijkstra.py hosted with ❤ by GitHub

Or, here's a more complete implementation.

No comments:

Post a Comment