Learning Q-gram distance

Posted by Riino on

Some content of this article comes from the slide of Zsuzsanna Liptak.

Introduction

Q-gram, aka n-gram, is an algorithm to compare two strings with a given alphabet.

It is used in string retrieval with O(n)O(n) and a fixed static number qq (in n-gram's case, this will be marked as nn)

Definition of q-gram

q-gram is a string of length qq, with a given alphabet Σ\Sigma with length σ\sigma.

For example, here we define :

Σ={A,C,G,T}\Sigma=\{'A','C','G','T'\}

thus σ=4\sigma = 4, if we define q=2q=2, and we can choose q-grams from alphabet with a custom approach, like using lexicographic order.

Here we CAN select these strings over Σ\Sigma :AA,AC,AG,AT,CA,CC,CG,CT,GA,GC,GG,GT,TA,TC,TG,TT

Definition of occurrence count

With an array of q-grams , we can define occurrence count with a given string ss:

N(s,qgram)={i:sisi+q1}N(s,q-gram)=|\{i:s_i\dots s_{i+q-1}\}|

e.g. Let ss = 'ACAGGGCA' and q=2q=2, then N(s,AC)=N(s,AG)=N(s,GC)=1N(s,AC)=N(s,AG)=N(s,GC)=1

In another word, NN is the number of count that a g-gram figures in ss

Definition of q-gram table

With a axes of strings we want to compare, and another axes of q-grams, we can fill occurrence count into this matrix:

for ss= 'ACAGGGCA', tt='GGGCAACA', vv='AAGGACA', q-gram profiles = AA,AC,AG,AT,CA,CC,CG :

uuPq(s)P_q(s)Pq(t)P_q(t)Pq(v)P_q(v)
AA011
AC111
AG101
AT000
CA221
CC000
CG000

Definition of q-gram distance

For two strings ss and tt, the q-gram distance is :

distqgram(s,t)=uqN(s,u)N(t,u)dist_{q-gram}(s,t)=\sum_{u\in\sum q}|N(s,u)-N(t,u)|

or equivalently:

distqgram(s,t)=i=1σqPq(s)[i]Pq(t)[i]dist_{q-gram}(s,t)=\sum_{i=1}^{\sigma^q}|P_q(s)[i]-P_q(t)[i]|

which is the Manhattan distance for two vectors ,which are the mapping result of given strings of q-gram profile

Algorithm

  • Use a sliding window of size qq over ss and tt
  • Use an array dd of size σq\sigma^q, aka the q-gram profile
  • Scan s, then scan t, assign the occurrence count into dd with different operators.
  • Now d[r]=N(s,ur)N(t,ur)d[r]=N(s,u_r)-N(t,u_r)
  • Sum up the absolute value of dd
def get_q_gram_distance(s,t,qgrams,q):
	n=len(s)
	m=len(t)
    d=[]
    for i in range(len(qgrams)):
        d.append(0)#init vector, O(q)

    for i in range(1,n-q+1):#slide window, O(n)
    	for r in range(0,len(grams)):#get count, O(q)
        	if s[i:i+q-1] == qgram[r]:
        		d[r]+=1#first vector sends postive effect


    for i in range(1,m-q+1):#slide window, O(m)
    	for r in range(0,len(grams)):#get count, O(q)
        	if s[i:i+q-1] == qgram[r]:
        		d[r]-=1#second vector sends negtive effect
    res=0
    for i in vector:
        res+=abs(i)

    return res

Tips:

  1. q-gram = 0 does not mean that two strings are same.
  2. distqgram(s,t)2q<=dedit(s,t)\frac{dist_{q-gram}(s,t)}{2q}<=d_{edit}(s,t) , dedit(s,t)d_{edit}(s,t) is the unit cost edit distance.