自动机的思维导图
自动机有很多用途,不光是编译器会用到,众所周知的还有正则表达式。其他,简单的如字符串搜索,存储等,复杂点的如模型验证,定理证明……
最简单的自动机应该算是DFA了,再往简单了说,一个DFA就是一个 map<int,map<char,int>>, 其中 int 是 state id,char是转移标号;如果有人还嫌不够简单,最最简单的那就是 int [][256] 了,就是一个256列的矩阵。就这么简单的一个东西,却蕴含着非常深刻的思想,有着非常广泛的用途。
几个月前因为项目需要,再加上自己的兴趣,实现了一系列自动机相关的算法和数据结构(用C++11写的),其结果和成就感远超我当初一开始的想象和预期,因为有好几个算法我的实现远远胜于任何已有的实现,不光运行速度快得多,内存占用也小得多。今天有点闲暇,整理一个思维导图,以后有时间再慢慢逐个描述。
mind map 最好打开新窗口,再缩放着看,在这里只能看到很小的一部分。
DFA
1 DFA Minimization
1.1 General DFA Minimization
1.1.1 Hopcroft Algorithm
1.1.1.1 Partition Refinement
1.1.1.2 A mutation to the splitter generator, for performance
1.1.1.3 Waiting set strategy
1.1.1.3.1 Stack is faster
1.1.1.3.2 Queue is slower
1.1.1.4 Intersections of Partitions are all empty
1.1.1.5 * Efficient implementation needs double linked list * Use array index as the link
1.2 Path compression This is not DFA minimization, it’s just an optimization. A sequence of states which all have just 1 transition. First state of the sequence couldn’t be a final state. Last state of the sequence couldn’t be a confluence
state.
state.
2 ADFA Acyclic DFA
2.1 Acyclic DFA Minimization General Minimization Algorithm works for this case, but there are dedicated algorithm, which needs less memory, sometimes faster
2.1.1 Sorted input
2.1.2 Random input
2.1.3 MinADFA or DAWG
2.2 Don’t support the mapping Use it as a SET This is a dynamic SET
2.3 Identify Each Word Could be used as a MAP
2.3.1 The word id is the Dictionary ordinal of the word in the DAWG The word<=>id mapping is a bidirectional mapping
2.3.2 CLTK Implementaion
2.3.2.1 Save a number in the State.
2.3.2.2 The number is the total pathes from the state to all reachable final state.
2.3.2.3 Pros: Number update for random inserting word could be efficient enough
2.3.2.4 Cons: Mapping a word to its (dictinary) ordinal is not fast enough: O(|Σ|*strlen(word))
2.3.3 Febird Implementation
2.3.3.1 Save a number in the Transition
2.3.3.2 Let s be the source of the transition. Let c be the label char of the transition. The number saved in the transition is the total paths from s which label char is less than c.
2.3.3.2.1 The unique single transition of a state was saved in the state’s self, and the number is always 0, thus it could be omited
2.3.3.2.2 In applications, there are about 50% states just have a single transition
2.3.3.2.3 For states which has multiple transitions, the number of its first transtion is always 0, this 0 is not omited, to omit this 0, implementation code would be much complex and nasty
2.3.3.3 Pros: Mapping a word to its dictionary ordinal is fast: O(strlen(word))
2.3.3.4 Cons: Number update for dynamic inserting word couldn’t be efficient
3 Aho-Corasick Multiple Pattern Matching
3.1 Combine With Numbered ADFA
3.2 Word set storage in Adjacent Difference Use less memory
4 Febird Implementation: * Many states have 1 single transition * Few states have a few transitions * Very few states have many transitions * Almost all states have a small range of label char
4.1 struct State32 { // There are other impl uint32_t spos; char_t minch; char_t maxch; // inclusive unsigned term_bit : 1; unsigned pzip_bit : 1; }; // if minch==maxch, spos is the target state_id // else spos is the offset to targets
map in mempool // the targets map consist a bitmap and targets id
map in mempool // the targets map consist a bitmap and targets id
4.2 All states are saved in a vector * State could be deleted * Deleted states are put into freelist * When state vector is full, double its capacity
4.3 Written in C++11 * Templatized, for performance & memory utilization * For concise, use lambda extensively * Use class enum, just for clear
5 Abastrac Concept: map<state,map<char,state>>