Some content of this article comes from the slide of Zsuzsanna Liptak.

Q-gram, aka n-gram, is an algorithm to **compare two strings with a given alphabet.**

It is used in string retrieval with $O(n)$ and a fixed static number $q$ (in n-gram’s case, this will be marked as $n$)

q-gram is a string of length $q$, with a given alphabet $\Sigma$ with length $\sigma$.

For example, here we define :
\(\Sigma=\{'A','C','G','T'\}\)
thus $\sigma = 4$, if we define $q=2$, and we can **choose q-grams from alphabet** with a custom approach, like using lexicographic order.

Here we CAN select these strings over $\Sigma$ :`AA,AC,AG,AT,CA,CC,CG,CT,GA,GC,GG,GT,TA,TC,TG,TT`

With an array of **q-gram**s , we can define occurrence count with a given string $s$:
\(N(s,q-gram)=|\{i:s_i\dots s_{i+q-1}\}|\)
e.g. Let $s$ = ‘ACAGGGCA’ and $q=2$, then $N(s,AC)=N(s,AG)=N(s,GC)=1$

In another word, $N$ is the number of count that a g-gram figures in $s$

With a axes of **strings we want to compare**, and another axes of **q-grams**, we can fill **occurrence count** into this matrix:

for $s$= ‘ACAGGGCA’, $t$=’GGGCAACA’, $v$=’AAGGACA’, q-gram profiles = `AA,AC,AG,AT,CA,CC,CG`

:

$u$ | $P_q(s)$ | $P_q(t)$ | $P_q(v)$ |
---|---|---|---|

AA | 0 | 1 | 1 |

AC | 1 | 1 | 1 |

AG | 1 | 0 | 1 |

AT | 0 | 0 | 0 |

CA | 2 | 2 | 1 |

CC | 0 | 0 | 0 |

CG | 0 | 0 | 0 |

For two strings $s$ and $t$, the q-gram distance is :
\(dist_{q-gram}(s,t)=\sum_{u\in\sum q}|N(s,u)-N(t,u)|\)
or equivalently:
\(dist_{q-gram}(s,t)=\sum_{i=1}^{\sigma^q}|P_q(s)[i]-P_q(t)[i]|\)
which is the **Manhattan distance for two vectors ,which are the mapping result of given strings of q-gram profile**

- Use a sliding window of size $q$ over $s$ and $t$
- Use an array $d$ of size $\sigma^q$, aka the q-gram profile
- Scan s, then scan t, assign the occurrence count into $d$ with different operators.
- Now $d[r]=N(s,u_r)-N(t,u_r)$
- Sum up the absolute value of $d$

```
def get_q_gram_distance(s,t,qgrams,q):
n=len(s)
m=len(t)
d=[]
for i in range(len(qgrams)):
d.append(0)#init vector, O(q)
for i in range(1,n-q+1):#slide window, O(n)
for r in range(0,len(grams)):#get count, O(q)
if s[i:i+q-1] == qgram[r]:
d[r]+=1#first vector sends postive effect
for i in range(1,m-q+1):#slide window, O(m)
for r in range(0,len(grams)):#get count, O(q)
if s[i:i+q-1] == qgram[r]:
d[r]-=1#second vector sends negtive effect
res=0
for i in vector:
res+=abs(i)
return res
```

- q-gram = 0 does not mean that two strings are same.
- $\frac{dist_{q-gram}(s,t)}{2q}<=d_{edit}(s,t)$ , $d_{edit}(s,t)$ is the unit cost edit distance.