Multiple Pattern Matching Algorithm Based on Sequential Binary Tree


LIU Gong-shen 1, LI Ning 2

1 (School of Information Security Engineering, Shanghai Jiaotong University, Shanghai 200030, China)

2 (Department of Computer Science, University of Manchester, Manchester, England)

Abstract: By analyzing the traditional multiple pattern matching algorithm based on a tree structure, a new algorithm is proposed that substitutes a sequential binary tree for the tree. The algorithm is suitable for applications that require preprocessing the patterns dynamically. Experiments show that the algorithm has three features: its construction process is quick, its memory cost is small, and its searching process is as quick as that of the traditional algorithm.

Key words: multiple pattern matching; finite state automata; sequential binary tree


刘功申1,李 宁2





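Abstract (Chinese version): Traditional multiple pattern matching algorithms are implemented with finite automata built on a tree structure, which has many shortcomings. The multiple pattern matching algorithm proposed in this paper is based on the sequential binary tree. Experiments show that the algorithm not only searches about as fast as the traditional algorithm, but also constructs quickly and consumes little memory. The proposed algorithm is therefore particularly suitable for cases that require constructing the automaton dynamically.

Key words: multiple pattern matching; DFSA; sequential binary tree

CLC number: TP301 Document code: A Article ID: 1000-1220(2004)07-1387-06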

1 Introduction

Searching for user-specified patterns in a text file is a common requirement in information retrieval and text editing applications. At present, the deterministic finite state automaton (DFSA) is the most common method for solving the pattern matching problem [1,2]. Before the searching process, the DFSA algorithm must preprocess the pattern set and construct an automaton based on a tree structure. Then the occurrences of every pattern can be found by scanning the text file just once, so its time complexity is O(n). Jang-Jong Fan and Keh-Yih Su proposed a new algorithm by combining the DFSA algorithm with the Boyer-Moore algorithm [3]. Their algorithm does not need to inspect every character of the text file, which makes it more efficient than the DFSA algorithm.

In practical applications, some new requirements are gradually being emphasized, such as changing the pattern set to be searched dynamically (e.g. implementing search and replace in a text editing application) and saving memory (e.g. implementing the algorithm on a PDA or in embedded software). At the same time, the searching efficiency must not decrease. The traditional algorithm based on a tree does not satisfy these requirements, because its construction process is slow and its memory cost is excessive.

The concept of the sequential binary tree (see Definition 4 for details) is first proposed in this paper. A new multiple pattern matching algorithm, the multiple pattern matching algorithm based on sequential binary tree (SMA algorithm for short), is implemented by substituting a sequential binary tree for the tree. Our experiments show that the SMA algorithm has several advantages: quick construction, convenient dynamic addition and deletion of patterns, no need for the goto table, failure table and output table required by the traditional algorithm, and a searching efficiency as high as that of the traditional algorithm.

In the next section, we describe the disadvantages of the traditional multiple pattern matching algorithm. The advantages of the SMA algorithm are introduced in Section 3. In Sections 4 and 5, the construction and searching algorithms of the sequential binary tree are given. Sections 6 and 7 present the algorithm analysis and the comparison experiments. Finally, Section 8 concludes this paper with some remarks.

2 Shortcomings of traditional algorithm

The traditional algorithm is described in detail in paper [1]. For ease of description, a sample is given. The pattern set is {he, hers, his, hour, she, our}. The corresponding tree is shown in Figure 1.

Fig.1 Automata based on tree

If a node of a tree has n sub-trees, the node must have n pointers that point to its children respectively. During the construction of the tree, it is not known in advance how many sub-trees a node has. There are only two methods to handle this. The first method is to re-allocate the memory of the parent node each time a sub-tree is added. The second method is to allocate memory for m pointers (m is the maximum number of sub-trees of any node) to every node, which costs a great deal of memory.
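The cost of the second method can be illustrated with a rough count of pointer slots for the sample pattern set. The sketch below is only an illustration under assumptions: the helper `count_nodes_and_fanout` is a hypothetical name, the trie is held in nested dictionaries purely for counting, and only child pointers are counted (characters, flags and tables are ignored).

```python
def count_nodes_and_fanout(patterns):
    """Build a plain trie as nested dicts; return (node count, max children)."""
    root = {}
    for pattern in patterns:
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
    nodes, fanout, stack = 0, 0, [root]
    while stack:
        node = stack.pop()
        nodes += 1
        fanout = max(fanout, len(node))
        stack.extend(node.values())
    return nodes, fanout

patterns = ["he", "hers", "his", "hour", "she", "our"]
nodes, m = count_nodes_and_fanout(patterns)
fixed_array_slots = nodes * m   # second method: m child pointers in every node
binary_tree_slots = nodes * 2   # binary tree: one left and one right pointer
print(nodes, m, fixed_array_slots, binary_tree_slots)  # 16 3 48 32
```

With this small sample the gap is modest because m is only 3. If m approaches the alphabet size, the fixed-array layout grows in proportion while the two-pointer layout does not.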


Fig.2 Automata based on sequential binary tree

The goto table, failure table and output table of the traditional algorithm also need a large amount of memory. Their size is directly proportional to the number of nodes of the tree.

During the construction of the traditional algorithm, the patterns are not sorted in dictionary order. That is to say, if the patterns of the given pattern set are not sorted in dictionary order, the tree is not an ordered tree. The searching rate must be slow if the tree is not ordered [5]. In the given sample, the patterns "she" and "our" are not in dictionary order.

3 Advantages of SMA algorithm

In order to overcome the shortcomings of the traditional algorithm, the SMA algorithm implements the multiple pattern matching process by using a sequential binary tree instead of a tree. For the pattern set {he, hers, his, hour, she, our}, the sequential binary tree is shown in Figure 2.

The advantages of the SMA algorithm:

(1) Improve the construction rate

Because a binary tree node has only two sub-trees (left and right), we do not need to guess how many sub-trees a node has. So we can avoid consuming excessive memory or allocating and releasing memory frequently.

(2) Improve the searching efficiency

Because the sequential binary tree embodies the dictionary order of all patterns, state transitions are quicker (see Section 6.2).

(3) Add and delete patterns conveniently

Because it is convenient to add a node to or delete a node from a binary tree, the SMA algorithm can add or delete patterns conveniently.

(4) No need for the goto table, failure table and output table

The SMA algorithm uses pointers instead of the goto, failure and output tables, so it saves memory.

4 Construction of sequential binary tree

4.1 Related definitions

The sequential binary tree shown in Figure 2 is constructed from the pattern set {he, hers, his, hour, she, our}. If we visit the sequential binary tree according to a special rule, a pattern is obtained during the visit from the root node to a leaf node. The visiting rule is described as follows:

4.1.1 Definition 1 Visiting rule

A stack s is used to save patterns. A pointer p is used to keep track of the current node. When p moves to the right sub-tree of the current node and e is the edge between the current node and its right sub-tree, the item at the top of stack s is removed and then the character of edge e is added to stack s. When p moves to the left sub-tree of the current node, the character of the corresponding edge e is added to stack s directly. Repeat the above process until p points to a leaf node. Now a pattern consists of all items of stack s. For example, the visiting process of pattern "our" is described as follows (the sequential binary tree of the pattern set is shown in Figure 2):

Step 1: s = ∅; p = 0; (root node)

Step 2: s = {h}; p = 1; (left child)

Step 3: character "h" is removed, "o" is added, s = {o}; p = 10; (right child)

Step 4: character "u" is added, s = {o, u}; p = 11; (left child)

Step 5: character "r" is added, s = {o, u, r}; p = 12; (left child)

The process ends here. The pattern "our" consists of all characters of stack s.
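The visiting rule can be replayed mechanically. The following is a minimal sketch, assuming the sample tree is written out by hand as a table `TREE` (a hypothetical name) whose entries mirror the Lchar, Lchild, Rchar and Rchild fields. Node numbers 0 to 4 and 10 to 15 are the ones used in the text. Numbers 5 to 9 are an assumption, inferred from a preorder numbering of the remaining nodes.

```python
LEAF = (None, None, None, None)

# node id -> (Lchar, Lchild, Rchar, Rchild) for {he, hers, his, hour, she, our}
TREE = {
    0: ("h", 1, None, None),
    1: ("e", 2, "o", 10),
    2: ("r", 3, "i", 5),
    3: ("s", 4, None, None),
    4: LEAF,
    5: ("s", 6, "o", 7),
    6: LEAF,
    7: ("u", 8, None, None),
    8: ("r", 9, None, None),
    9: LEAF,
    10: ("u", 11, "s", 13),
    11: ("r", 12, None, None),
    12: LEAF,
    13: ("h", 14, None, None),
    14: ("e", 15, None, None),
    15: LEAF,
}

def visit(moves):
    """Apply the visiting rule along a string of 'L'/'R' moves from the root."""
    stack, p = [], 0
    for move in moves:
        lchar, lchild, rchar, rchild = TREE[p]
        if move == "L":
            stack.append(lchar)   # left move: push the edge character
            p = lchild
        else:
            stack.pop()           # right move: replace the top of the stack
            stack.append(rchar)
            p = rchild
    return p, "".join(stack)

print(visit("LRLL"))  # (12, 'our')
```

The move string "LRLL" corresponds to Steps 2 to 5 above: left to node 1, right to node 10, then left twice to node 12.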

4.1.2 Definition 2 State depth

The state depth of a node is different from the node depth in the binary tree. The state depth reflects the position of a character in the pattern. The state depth of a node can be defined recursively as follows:

The state depth of the root is 0;

If there is a node whose state depth is h, the state depth of its left child node is h+1, and the state depth of its right child node is h.

The state depth of all nodes can be figured out analogously.

According to the definition, we can figure out the state depth of every node of the sample given in this paper. The state depth of every node is shown in Figure 3. The Arabic number in each circle is the state depth of the node.

Fig.3 State depth of nodes and their paternity

Fig.4 A sequential binary tree with failure pointer

4.1.3 Definition 3 Father and child state nodes

Let node l be the left child node of node f, and let the node set R = {r | r is a node in the right child tree of l and its state depth is equal to that of l}. Node f is the father state node of node l and of every node of R. Node l and every node of R are child state nodes of node f. The paternity of state nodes is shown in Figure 3. The dotted lines with arrowheads describe the relationship between state nodes: the node at the arrowhead is the father state node of the node at the tail, and the node at the tail is the child state node of the node at the arrowhead.
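Definitions 2 and 3 can be checked on the sample tree with a few lines of code. This is a sketch under assumptions: the tables `LEFT` and `RIGHT` (hypothetical names) list the left and right child of each node, with node numbers 5 to 9 inferred from a preorder numbering rather than read from the figures.

```python
# Left and right children of the sample tree, keyed by node number.
LEFT = {0: 1, 1: 2, 2: 3, 3: 4, 5: 6, 7: 8, 8: 9, 10: 11, 11: 12, 13: 14, 14: 15}
RIGHT = {1: 10, 2: 5, 5: 7, 10: 13}

def label(root=0):
    """Return (state depth, father state node) for every node."""
    depth, father = {root: 0}, {root: None}
    stack = [root]
    while stack:
        p = stack.pop()
        if p in RIGHT:
            right = RIGHT[p]
            depth[right] = depth[p]        # right child: same state depth
            father[right] = father[p]      # and the same father state node
            stack.append(right)
        if p in LEFT:
            left = LEFT[p]
            depth[left] = depth[p] + 1     # left child: one level deeper
            father[left] = p               # p is its father state node
            stack.append(left)
    return depth, father

depth, father = label()
print(sorted(n for n in father if father[n] == 0))  # [1, 10, 13]
print(depth[15], father[15])                        # 3 14
```

The first line of output lists the child state nodes of the root, which are the nodes with state depth 1.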

4.1.4 Definition 4 Sequential binary tree

If we traverse the binary tree in NLR (node, left, right) order, a series of patterns is obtained by applying the visiting rule of this paper. If these patterns are sorted in dictionary order, the binary tree is a sequential binary tree. The binary tree in Figure 2 is a sequential binary tree.

4.2 The construction of sequential binary tree

If a pattern set whose patterns are not in dictionary order is given, the first task is to construct a sequential binary tree. Construction involves several tasks: constructing the sequential binary tree, marking the output nodes, defining the father state node of every node, and sorting the patterns in order to increase the searching efficiency.

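The data structure of a tree node:

Structure Node {
    Node *Lchild;       // left child
    Char Lchar;         // character on the edge to Lchild
    Node *Rchild;       // right child
    Char Rchar;         // character on the edge to Rchild
    Node *fatherstate;  // father state node
    Node *failstate;    // failure state node
    Boolean output;     // true if a pattern ends at this node
};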

Algorithm 1. The construction of sequential binary tree

Input: pattern set

Output: the sequential binary tree, output nodes and father state node pointers

for each pattern do {
    p = root; i = 0;
    while ((q = goto(p, pattern[i])) != NULL) { p = q; i++; }
    insert pattern[i : strlen(pattern)] at pointer p;
}
End.

Definition 5 Function goto(state, character)

According to the visiting rule, we visit the sequential binary tree from state node p to one of the child state nodes of p. If we arrive at a child state node c and char is the character added to the stack, then goto(p, char) = c; otherwise goto(p, char) = NULL. The algorithm (referred to as Algorithm 2 in Section 6.2) is described as follows:


Input: pointer p, character char

Output: child state node or NULL

...
    return NULL;
} else if (char == p.Lchar) {
    return p.Lchild;
} else {
    ...
    while ((p != NULL) && (char > p.Rchar)) p = p.Rchild;
    ...
        return p.Rchild;
    else
        return NULL;
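Algorithm 1 and the goto function can be sketched together in a few dozen lines. The code below is one possible reading, not the authors' implementation: the class keeps the field names of the node structure in lower case, while `add_child`, `build` and `nlr_patterns` are hypothetical helper names. It assumes sibling states are kept sorted along the right-child chain, so that an NLR traversal yields the patterns in dictionary order, as Definition 4 requires.

```python
class Node:
    """Node with the fields of the paper's structure (failstate left out here)."""
    def __init__(self, father=None):
        self.lchild = None         # left child
        self.lchar = None          # character on the edge to lchild
        self.rchild = None         # right child
        self.rchar = None          # character on the edge to rchild
        self.fatherstate = father
        self.output = False

def goto(p, ch):
    """Return the child state of p reached by ch, or None if there is none."""
    if p.lchild is None:
        return None
    if ch == p.lchar:
        return p.lchild
    q = p.lchild
    while q.rchild is not None and q.rchar < ch:   # right chain is sorted
        q = q.rchild
    return q.rchild if q.rchar == ch else None

def add_child(p, ch):
    """Insert a new child state of p for ch, keeping siblings in sorted order."""
    new = Node(father=p)
    if p.lchild is None or ch < p.lchar:
        new.rchild, new.rchar = p.lchild, p.lchar  # old first child moves right
        p.lchild, p.lchar = new, ch
    else:
        q = p.lchild
        while q.rchild is not None and q.rchar < ch:
            q = q.rchild
        new.rchild, new.rchar = q.rchild, q.rchar
        q.rchild, q.rchar = new, ch
    return new

def build(patterns):
    root = Node()
    for pattern in patterns:
        p = root
        for ch in pattern:
            p = goto(p, ch) or add_child(p, ch)
        p.output = True            # mark the output node
    return root

def nlr_patterns(root):
    """Traverse in NLR order with the visiting rule; list patterns at output nodes."""
    found = []
    def walk(p, stack):
        if p.output:
            found.append("".join(stack))
        if p.lchild is not None:
            walk(p.lchild, stack + [p.lchar])
        if p.rchild is not None:
            walk(p.rchild, stack[:-1] + [p.rchar])
    walk(root, [])
    return found

patterns = ["he", "hers", "his", "hour", "she", "our"]
root = build(patterns)
print(nlr_patterns(root))  # ['he', 'hers', 'his', 'hour', 'our', 'she']
```

Although "she" is inserted before "our", the traversal returns "our" first, because `add_child` places each new sibling at its sorted position.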



4.3 Marking the failure pointer

If the current state node is p and goto(p, char) = NULL, the next state node is p.failstate. The principle of the failure pointer is as follows:

(1) The failure pointer of the root is the root.

(2) The failure pointer of a state node with state depth 1 is also the root.

(3) For every state node s whose state depth is greater than or equal to 2: if the father state node of s is r and goto(r, a) = s, the following procedure marks the failure state node of s.

t = (s.fatherstate).failstate;
while ((t != root) && (goto(t, a) == NULL)) t = t.failstate;

Finally, the failure pointer of s is goto(t, a), or the root if goto(t, a) = NULL.

As Figure 4 shows, the failure pointers are drawn with dotted lines. Every state node has a failure pointer pointing to its failure state node. For example, the failure pointer of state node 2 points to state node 0. Algorithm 3 is used to mark the failure pointers.

Algorithm 3. Mark the failure pointer

Input: sequential binary tree whose root node is s

Output: sequential binary tree with failure pointers

Build_Fail_Func(Structure Node s)
Begin
    Mark the failure pointer of state node s;
    Build_Fail_Func(s.Lchild);
    Build_Fail_Func(s.Rchild);
End


4.4 Adding and deleting node

In practice, the pattern set may change slightly, for example when some patterns are added to or deleted from the set. It is not worthwhile to rebuild the whole sequential binary tree because of a slight change of the pattern set, so additions to and deletions from the sequential binary tree are required. After additions and deletions, the failure pointers must be changed accordingly.

If we add or delete a node with state depth h, the failure pointer of every node whose state depth is greater than h may have to be changed. That is to say, only part of the failure pointers need changing.

5 Searching phase

5.1 Outputting the matching result

When a node s with s.output == true is encountered in the searching process, we visit the sequential binary tree from the root to s and output the content of the stack as the matching result. Let node f be the node pointed to by the failure pointer of s. If f.output == true, we likewise visit the sequential binary tree from the root to f and output the content of the stack as a matching result. The output process involves this trace back, so it affects the searching efficiency.

5.2 The description of searching algorithm

After the construction of the sequential binary tree, we can find all patterns in the text string by scanning the text string only once. The searching process is as follows. Starting from the root of the sequential binary tree and taking the characters of the text string one by one, we determine the next state node according to the goto function and the failure pointer. We output the matching result when a node s with s.output == true is encountered in the searching process. Taking the sequential binary tree built in this paper as an example, the process of searching "ushers" is described as follows:

Starting from the root, goto(root, u) = 0 (the search stays at the root, node 0). goto(0, s) = 13. goto(13, h) = 14. goto(14, e) = 15. Now we output node 15, that is, pattern "she". At the same time, we output node 2, that is, pattern "he".

We continue searching the text string from state node 2, because the failure pointer of state node 15 is state node 2. goto(2, r) = 3. goto(3, s) = 4. Now we output node 4, that is, pattern "hers". The searching process ends here.

The following algorithm summarizes the above behavior.

Algorithm 4: Pattern matching algorithm

Input: text string "a1 a2 ... an"

The sequential binary tree

Output: patterns in the text string and the locations at which the patterns occur in the text string

p <-- root;
for i <-- 1 until n {
    while ((p != root) && (goto(p, ai) == NULL)) do p <-- p.failstate;
    if (goto(p, ai) != NULL) p <-- goto(p, ai);
    if (p.output) {
        print i; print p;
        if (p.failstate.output)
            print p.failstate;
    }
}
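The failure marking of Section 4.3 and the search loop of Algorithm 4 can be tried end to end on the "ushers" example. The sketch below rests on several assumptions that are not in the paper. Failure pointers are computed breadth-first by state depth, the standard Aho-Corasick order, instead of the recursive traversal of Algorithm 3. Matches are reported along the whole failure chain rather than with a single check. The helper names `add_child`, `build`, `children`, `build_fail`, `pattern_of` and `search` are hypothetical.

```python
from collections import deque

class Node:
    def __init__(self, father=None):
        self.lchild = self.rchild = None      # left child / right child
        self.lchar = self.rchar = None        # characters on those two edges
        self.fatherstate, self.failstate, self.output = father, None, False

def goto(p, ch):
    """Child state of p reached by ch, or None."""
    if p.lchild is None:
        return None
    if ch == p.lchar:
        return p.lchild
    q = p.lchild
    while q.rchild is not None and q.rchar < ch:
        q = q.rchild
    return q.rchild if q.rchar == ch else None

def add_child(p, ch):
    """Insert a child state of p for ch at its sorted position."""
    new = Node(father=p)
    if p.lchild is None or ch < p.lchar:
        new.rchild, new.rchar = p.lchild, p.lchar
        p.lchild, p.lchar = new, ch
    else:
        q = p.lchild
        while q.rchild is not None and q.rchar < ch:
            q = q.rchild
        new.rchild, new.rchar = q.rchild, q.rchar
        q.rchild, q.rchar = new, ch
    return new

def build(patterns):
    root = Node()
    for pattern in patterns:
        p = root
        for ch in pattern:
            p = goto(p, ch) or add_child(p, ch)
        p.output = True
    return root

def children(p):
    """Yield (character, child state) pairs of p in sorted order."""
    ch, q = p.lchar, p.lchild
    while q is not None:
        yield ch, q
        ch, q = q.rchar, q.rchild

def build_fail(root):
    """Breadth-first failure marking (swapped in for the recursive Algorithm 3)."""
    root.failstate = root
    queue = deque([root])
    while queue:
        r = queue.popleft()
        for ch, s in children(r):
            f = r.failstate
            while f is not root and goto(f, ch) is None:
                f = f.failstate
            t = goto(f, ch) if r is not root else None
            s.failstate = t if t is not None else root
            queue.append(s)

def pattern_of(s):
    """Recover the pattern of state s by climbing father state pointers."""
    chars = []
    while s.fatherstate is not None:
        chars.append(next(ch for ch, q in children(s.fatherstate) if q is s))
        s = s.fatherstate
    return "".join(reversed(chars))

def search(root, text):
    """Return (end position counted from 1, pattern) for every occurrence."""
    matches, p = [], root
    for i, ch in enumerate(text, start=1):
        while p is not root and goto(p, ch) is None:
            p = p.failstate
        p = goto(p, ch) or root
        s = p
        while s is not root:                  # report along the failure chain
            if s.output:
                matches.append((i, pattern_of(s)))
            s = s.failstate
    return matches

root = build(["he", "hers", "his", "hour", "she", "our"])
build_fail(root)
print(search(root, "ushers"))  # [(4, 'she'), (4, 'he'), (6, 'hers')]
```

The output agrees with the walk-through in the text: "she" and "he" are reported after the fourth character, and "hers" after the sixth.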



6 Analysis of algorithm

6.1 Analysis of space complexity

Compared with the traditional multiple pattern matching algorithm, the SMA algorithm does not need the additional memory space of the goto table, failure table and output table, because it uses the sequential binary tree. The memory space of the SMA algorithm is only the total space of all nodes of the sequential binary tree. Two factors determine the number of nodes of the sequential binary tree: the first is the total length of all patterns in the pattern set, and the second is how many prefixes the patterns share. So the number of sequential binary tree nodes cannot be computed by a closed formula.

6.2 Analysis of time complexity

The analysis of time complexity covers the construction phase and the searching phase. For ease of description, let k be the number of patterns in the pattern set, m the number of nodes of the sequential binary tree and n the length of the text string.

Construction phase

The time complexity of Algorithm 1 is determined primarily by the outer loop, so the time complexity is O(k).

The process of marking the failure pointer of every node is shown in Algorithm 3. The main process of Algorithm 3 is to traverse the sequential binary tree in NLR order, so its time complexity is O(m).

In a word, the time complexity of the construction phase is O(m+k).

Searching phase

The searching phase of the SMA algorithm is similar to that of the traditional multiple pattern matching algorithm.

The outer loop of Algorithm 4 is determined primarily by n (the length of the text string). The time consumed by each iteration is determined primarily by Algorithm 2. Now we analyze the time complexity of Algorithm 2.

The cost of moving from a father state node to one of its child state nodes is determined by the number of child state nodes. For example, the lengths of the searching paths from state node 0 to its child state nodes 1, 10 and 13 are 1, 2 and 3 respectively. Assuming that the average number of child state nodes of every node is h, the time complexity of Algorithm 2 is O(h/2), because the characters on the searching path are sorted in increasing order [4]. The upper limit of h is the size of the language alphabet. For example, h is smaller than or equal to 26 for English when case is ignored.

In a word, the time complexity of the searching phase is O(hn/2).
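The factor h/2 can be read as the expected length of a sequential scan over h sorted child state nodes. The derivation below is a sketch under an assumption the paper does not state, namely that each of the h children is equally likely to be the target; the symbol \bar{c} for the average path length is introduced here only for illustration.

```latex
\bar{c} \;=\; \frac{1}{h}\sum_{i=1}^{h} i \;=\; \frac{h+1}{2} \;\approx\; \frac{h}{2},
\qquad
T_{\text{search}} \;=\; O\!\left(n \cdot \frac{h}{2}\right)
```

For the root of the sample tree, h = 3 and the path lengths are 1, 2 and 3, so the average is (1+2+3)/3 = 2 = (3+1)/2.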

7 Experimental result

Table 1 Construction time of two algorithms

Dictionaries (columns): NAME, DNS, ABBR
Algorithms (rows): Traditional algorithm (T/ms), SMA algorithm (T/ms)

The experimental samples include pattern sets and texts. The pattern sets are the NAME dictionary (8843 names), the DNS dictionary (1871 Internet address suffixes) and the ABBR dictionary (1470 abbreviations). The texts are downloaded from an English news website and include text samples of 10k bytes, 20k bytes, 40k bytes and 80k bytes. Table 1 gives the construction time of the two algorithms. The construction time of SMA is half of that of the traditional multiple pattern matching algorithm, which suits the case of changing the pattern set dynamically. In this paper, we search the four texts using the three dictionaries respectively and take the average time as the searching time. As Table 2 shows, the searching efficiency of the two algorithms is about the same, with the traditional algorithm slightly slower than the SMA algorithm. It is obvious that the storage structure of the SMA algorithm does not hurt its efficiency.

Table 2 Searching time of two algorithms

Text lengths (columns): 10K, 20K, 40K, 80K
Algorithms (rows): Traditional algorithm (T/ms), SMA algorithm (T/ms)


8 Conclusions

In order to meet the requirement of constructing automata dynamically, this paper proposes a method to construct the automaton using a sequential binary tree. Experiments show that the SMA algorithm has a high construction rate and the same searching efficiency as the traditional algorithm. So the SMA algorithm has good application prospects.

The SMA algorithm inspects every character of the text in the searching process. By combining the Boyer-Moore algorithm or the Quick Search algorithm with the SMA algorithm, we can implement a skip-searching algorithm. For a language with a large alphabet, because the number of child state nodes of the root node is very large, we can construct the first-level state nodes using a hash method, that is to say, we construct a forest which consists of many sequential binary trees. This construction method can search the text more quickly.

References:

1 Aho A V, Corasick M J. Efficient string matching: an aid to bibliographic search[J]. Comm. ACM, 1975, 18(6): 333-340.

2 Lewis H R, Papadimitriou C H. Elements of the theory of computation (Second Edition)[M]. Prentice-Hall International, Inc., 1998.

3 Fan J J, Su K Y. An efficient algorithm for matching multiple patterns[J]. IEEE Trans. on Knowledge and Data Engineering, 1993, 5(2): 339-351.

4 Shaffer C A. A practical introduction to data structures and algorithm analysis[M]. New York: Prentice Hall, 1997.

5 Chen Gui-lin. Some technology research in automatic abstract[D]. Shanghai Jiaotong University, Shanghai, China, 2000, 4 (in Chinese).

6 Boyer R S, Moore J S. A fast string searching algorithm[J].

7 Sunday D M. A very fast substring search algorithm[J]. Comm.


