Some content of this article comes from the slide of Zsuzsanna Liptak.
Introduction
Q-gram, aka n-gram, is an algorithm to compare two strings with a given alphabet.
It is used in string retrieval with and a fixed static number (in n-gram's case, this will be marked as )
Definition of q-gram
q-gram is a string of length , with a given alphabet with length .
For example, here we define :
thus , if we define , and we can choose q-grams from alphabet with a custom approach, like using lexicographic order.
Here we CAN select these strings over :AA,AC,AG,AT,CA,CC,CG,CT,GA,GC,GG,GT,TA,TC,TG,TT
Definition of occurrence count
With an array of q-grams , we can define occurrence count with a given string :
e.g. Let = 'ACAGGGCA' and , then
In another word, is the number of count that a g-gram figures in
Definition of q-gram table
With a axes of strings we want to compare, and another axes of q-grams, we can fill occurrence count into this matrix:
for = 'ACAGGGCA', ='GGGCAACA', ='AAGGACA', q-gram profiles = AA,AC,AG,AT,CA,CC,CG
:
AA | 0 | 1 | 1 |
AC | 1 | 1 | 1 |
AG | 1 | 0 | 1 |
AT | 0 | 0 | 0 |
CA | 2 | 2 | 1 |
CC | 0 | 0 | 0 |
CG | 0 | 0 | 0 |
Definition of q-gram distance
For two strings and , the q-gram distance is :
or equivalently:
which is the Manhattan distance for two vectors ,which are the mapping result of given strings of q-gram profile
Algorithm
- Use a sliding window of size over and
- Use an array of size , aka the q-gram profile
- Scan s, then scan t, assign the occurrence count into with different operators.
- Now
- Sum up the absolute value of
def get_q_gram_distance(s,t,qgrams,q):
n=len(s)
m=len(t)
d=[]
for i in range(len(qgrams)):
d.append(0)#init vector, O(q)
for i in range(1,n-q+1):#slide window, O(n)
for r in range(0,len(grams)):#get count, O(q)
if s[i:i+q-1] == qgram[r]:
d[r]+=1#first vector sends postive effect
for i in range(1,m-q+1):#slide window, O(m)
for r in range(0,len(grams)):#get count, O(q)
if s[i:i+q-1] == qgram[r]:
d[r]-=1#second vector sends negtive effect
res=0
for i in vector:
res+=abs(i)
return res
Tips:
- q-gram = 0 does not mean that two strings are same.
- , is the unit cost edit distance.