Add ALGORITHM.md

John Hawthorn authored on 19/12/2015 23:01:25 • John Hawthorn committed on 20/04/2016 00:20:50
Showing 2 changed files

1 1
new file mode 100644
... ...
@@ -0,0 +1,176 @@
1
+
2
+This document describes the scoring algorithm of fzy as well as the algorithm
3
+of other similar projects.
4
+
5
+# Matching vs Scoring
6
+
7
+I like to split the problem of fuzzy matching into two subproblems: matching and scoring.
8
+
9
+Matching determines which results are eligible for the list.
10
+All the projects here consider this to be the same problem, matching the
11
+candidate strings against the search string with any number of gaps.
12
+
13
+Scoring determines the order in which the results are sorted.
14
+Since scoring is tasked with finding what the human user intended, there is no
15
+correct solution. As a result there is a large variety of scoring strategies.
16
+
17
+# fzy's matching
18
+
19
+Generally, more time is taken in matching rather than scoring, so it is
20
+important that matching be as fast as possible. If this were case sensitive it
21
+would be a simple loop calling strchr, but since it needs to be case
22
+insensitive, strpbrk is used instead to look for either case of each character.
23
+
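A minimal sketch of such a case-insensitive matching loop is shown here. It assumes ASCII input and only illustrates the `strpbrk` approach described in this document; the function name `has_match` and its exact shape are chosen for illustration and are not taken from this commit.

```c
#include <ctype.h>
#include <string.h>

/* Returns 1 if every character of needle appears in haystack, in order,
 * ignoring case; returns 0 otherwise. */
int has_match(const char *needle, const char *haystack) {
	while (*needle) {
		unsigned char nch = (unsigned char)*needle++;
		/* Accept either case of the current needle character. */
		const char accept[3] = {(char)tolower(nch), (char)toupper(nch), 0};

		haystack = strpbrk(haystack, accept);
		if (!haystack) {
			return 0;
		}
		haystack++; /* keep searching after the matched character */
	}
	return 1;
}
```

With this sketch, `has_match("abc", "xAxBxC")` succeeds while `has_match("abc", "cba")` fails, because the characters must appear in order.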
24
+# fzy's scoring
25
+
26
+fzy treats scoring as a modified [edit
27
+distance](https://en.wikipedia.org/wiki/Edit_distance) problem of calculating
28
+the
29
+[Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance).
30
+Edit distance is the measure of how different two strings are in terms of
31
+insertions, deletions, and substitutions. This is the same problem as [DNA
32
+sequence alignment](https://en.wikipedia.org/wiki/Sequence_alignment). Fuzzy
33
+matching is a simpler problem which only accepts insertions, not deletions or
34
+substitutions.
35
+
36
+fzy's scoring is a dynamic programming algorithm similar to
37
+[Wagner–Fischer](https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm)
38
+and
39
+[Needleman–Wunsch](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm).
40
+
41
+Dynamic programming requires the observation that the result is based on the
42
+result of subproblems.
43
+
44
+Fzy borrows heavily from concepts in bioinformatics to perform scoring.
45
+
46
+Fzy builds an `n`-by-`m` matrix, where `n` is the length of the search string
47
+and `m` the length of the candidate string. Each position `(i,j)` in the matrix
48
+stores the score for matching the first `i` characters of the search with the
49
+first `j` characters of the candidate.
50
+
51
+Fzy calculates an affine gap penalty: this simply means that we assign a
52
+constant penalty for having a gap and a linear penalty for the length of the
53
+gap.
54
+Inspired by the [Gotoh algorithm
55
+(pdf)](http://www.cs.unibo.it/~dilena/LabBII/Papers/AffineGaps.pdf), fzy
56
+computes a second `D` (for diagonal) matrix in parallel with the score matrix.
57
+The `D` matrix computes the best score which *ends* in a match. This allows
58
+both computation of the penalty for starting a gap and the score for a
59
+consecutive match.
60
+
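The two-matrix idea can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not fzy's implementation: the name `sketch_score`, the `MAX_LEN` cap, and all weights are made up for the example, a single per-character gap penalty is used everywhere, and a bonus for consecutive matches stands in for the cost of opening a gap. `D[i][j]` holds the best score that ends with needle character `i` matched at haystack position `j`; `M[i][j]` holds the best score overall for those prefixes.

```c
#include <ctype.h>
#include <math.h>
#include <string.h>

#define MAX_LEN 64            /* assumed cap so the sketch needs no malloc */
#define SCORE_GAP (-0.01)     /* illustrative: per skipped candidate char */
#define SCORE_CONSECUTIVE 1.0 /* illustrative: match right after a match */
#define SCORE_BOUNDARY 0.8    /* illustrative: match at a word/path start */

/* Bonus for matching haystack[j], based on the character before it. */
static double match_bonus(const char *haystack, size_t j) {
	if (j == 0) return SCORE_BOUNDARY;
	char prev = haystack[j - 1];
	if (prev == '/' || prev == ' ' || prev == '.') return SCORE_BOUNDARY;
	if (islower((unsigned char)prev) && isupper((unsigned char)haystack[j]))
		return SCORE_BOUNDARY;
	return 0.0;
}

/* Returns the best score, or -INFINITY if needle does not match. */
double sketch_score(const char *needle, const char *haystack) {
	static double D[MAX_LEN][MAX_LEN], M[MAX_LEN][MAX_LEN];
	size_t n = strlen(needle), m = strlen(haystack);
	if (n == 0 || n > m || m > MAX_LEN) return -INFINITY;

	for (size_t i = 0; i < n; i++) {
		double prev_m = -INFINITY; /* M[i][j-1] */
		for (size_t j = 0; j < m; j++) {
			double d = -INFINITY;
			int same = tolower((unsigned char)needle[i]) ==
			           tolower((unsigned char)haystack[j]);
			if (same && i == 0) {
				/* First needle char: pay for a leading gap of length j. */
				d = j * SCORE_GAP + match_bonus(haystack, j);
			} else if (same && j > 0) {
				/* Start a fresh run after the best earlier alignment, or
				 * extend a run that ended in a match at (i-1, j-1). */
				double fresh = M[i - 1][j - 1] + match_bonus(haystack, j);
				double run = D[i - 1][j - 1] + SCORE_CONSECUTIVE;
				d = fresh > run ? fresh : run;
			}
			D[i][j] = d;
			/* Best overall: match here, or skip haystack[j] as a gap. */
			double skip = prev_m + SCORE_GAP;
			M[i][j] = prev_m = d > skip ? d : skip;
		}
	}
	return M[n - 1][m - 1];
}
```

Under these made-up weights, `sketch_score("abc", "abc")` is higher than `sketch_score("abc", "axbxc")`, since the first candidate matches consecutively and the second pays for two gaps.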
61
+Using this algorithm fzy is able to score based on the optimal match.
62
+
63
+* Gaps (negative score)
64
+  * at the start of the match
65
+  * at the end of the match
66
+  * within the match
67
+* Matches (positive score)
68
+  * consecutive
69
+  * following a slash
70
+  * following a space (the start of a word)
71
+  * capital letter (the start of a CamelCase word)
72
+  * following a dot (often a file extension)
73
+
74
+
75
+
76
+# Other fuzzy finders
77
+
78
+## TextMate
79
+
80
+TextMate deserves immense credit for popularizing fuzzy finding from inside
81
+text editors. Its influence can be found in the command-t project, the various
82
+other editors that use command-t for file finding, and the 't' command in the GitHub
83
+web interface.
84
+
85
+* https://github.com/textmate/textmate/blob/master/Frameworks/text/src/ranker.cc
86
+
87
+## command-t, ctrlp-cmatcher
88
+
89
+Command-t is a plugin first released in 2010 intending to bring TextMate's
90
+"Go to File" feature to vim.
91
+
92
+Anecdotally, this algorithm works very well. The recursive nature makes it a little hard to 
93
+
94
+The way `last_idx` is used is suspicious.
95
+
96
+* https://github.com/wincent/command-t/blob/master/ruby/command-t/match.c
97
+* https://github.com/JazzCore/ctrlp-cmatcher/blob/master/autoload/fuzzycomt.c
98
+
99
+## Length of shortest first match: fzf
100
+https://github.com/junegunn/fzf/blob/master/src/algo/algo.go
101
+
102
+fzf scores based on the size of the greedy shortest match. fzf finds its match
103
+by the first match appearing in the candidate string. It has some cleverness to
104
+find if there is a shorter match contained in that search, but it isn't
105
+guaranteed to find the shortest match in the string.
106
+
107
+Example results for the search "abc"
108
+
109
+* <tt>**AXXBXXC**xxabc</tt>
110
+* <tt>xxxxxxx**AXBXC**</tt>
111
+* <tt>xxxxxxxxx**ABC**</tt>
112
+
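The "first match, then look for a shorter one inside it" idea can be sketched as a forward pass followed by a backward pass. This is an illustration of the technique as described above, not fzf's code (which is written in Go); the name `first_match_shrunk` and its signature are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

static int eq_nocase(char a, char b) {
	return tolower((unsigned char)a) == tolower((unsigned char)b);
}

/* Finds the first match of needle in haystack, then shrinks it from the
 * left. On success stores the half-open range [*start, *end) and returns
 * its length; returns -1 if there is no match. */
int first_match_shrunk(const char *needle, const char *haystack,
                       size_t *start, size_t *end) {
	size_t n = 0, i = 0, j;
	while (needle[n]) n++;
	if (n == 0) return -1;

	/* Forward pass: greedily take the earliest occurrence of each char. */
	for (j = 0; haystack[j] && i < n; j++) {
		if (eq_nocase(needle[i], haystack[j])) i++;
	}
	if (i < n) return -1;
	*end = j;

	/* Backward pass: walk back from the end taking the latest occurrence
	 * of each char, which can only move the start to the right. */
	i = n;
	while (i > 0) {
		j--;
		if (eq_nocase(needle[i - 1], haystack[j])) i--;
	}
	*start = j;
	return (int)(*end - *start);
}
```

For the search "abc" this reports a length of 7 for `AXXBXXCxxabc`, even though a match of length 3 exists later in the string, which is the limitation described above.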
113
+## Length of first match: ctrlp, pick, selecta (`<= 0.0.6`)
114
+
115
+These score based on the length of the first match in the candidate. This is
116
+probably the simplest useful algorithm. This has the advantage that the heavy
117
+lifting can be performed by the regex engine, which is faster than implementing
118
+anything natively in ruby or Vim script.
119
+
120
+## Length of shortest match: pick
121
+
122
+Pick has a method, `min_match`, to find the absolute shortest match in a string.
123
+This will find better results than the first-match finders above, at the expense of speed, as backtracking is required.
124
+
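One way to find the absolute shortest match is to retry the greedy scan from every possible start position and keep the minimum. The sketch below shows that idea only; it is not Pick's `min_match` implementation, and the name `shortest_match_len` is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns the length of the shortest span of haystack that contains needle
 * as a case-insensitive subsequence, or -1 if there is none. */
int shortest_match_len(const char *needle, const char *haystack) {
	int best = -1;
	if (!needle[0]) return -1;

	for (size_t s = 0; haystack[s]; s++) {
		size_t i = 0, j;

		/* A match can only start where the first needle char occurs. */
		if (tolower((unsigned char)haystack[s]) !=
		    tolower((unsigned char)needle[0]))
			continue;

		/* Greedy scan from this start gives the earliest possible end. */
		for (j = s; haystack[j] && needle[i]; j++) {
			if (tolower((unsigned char)haystack[j]) ==
			    tolower((unsigned char)needle[i]))
				i++;
		}
		if (needle[i]) break; /* no match from here, so none later either */

		int len = (int)(j - s);
		if (best < 0 || len < best) best = len;
	}
	return best;
}
```

Rescanning from each start makes this quadratic in the worst case, which reflects the speed cost mentioned above.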
125
+## selecta (latest master)
126
+https://github.com/garybernhardt/selecta/commit/d874c99dd7f0f94225a95da06fc487b0fa5b9edc
127
+https://github.com/garybernhardt/selecta/issues/80
128
+
129
+Selecta doesn't compare all possible matches, but only the shortest match from the same start location.
130
+This can lead to inconsistent results.
131
+
132
+Example results for the search "abc"
133
+
134
+* <tt>x**AXXXXBC**</tt>
135
+* <tt>x**ABXC**x</tt>
136
+* <tt>x**ABXC**xbc</tt>
137
+
138
+The third result here should have been scored the same as the first, but the
139
+lower scoring but shorter match is what is measured.
140
+
141
+
142
+## others
143
+
144
+* https://github.com/joshaven/string_score/blob/master/coffee/string_score.coffee (first match + heuristics)
145
+* https://github.com/atom/fuzzaldrin/blob/master/src/scorer.coffee (modified version of string_score)
146
+* https://github.com/jeancroy/fuzzaldrin-plus/blob/master/src/scorer.coffee (Smith Waterman)
147
+
148
+
149
+# Possible fzy Algorithm Improvements
150
+
151
+## Multithreading
152
+
153
+Currently a single thread is used for finding matches. Using multiple threads
154
+would likely be faster, but would require some additional complexity.
155
+
156
+## Case sensitivity
157
+
158
+fzy currently treats all searches as case-insensitive. However, scoring prefers
159
+matches on uppercase letters to help find CamelCase candidates. It might be
160
+desirable to support a case sensitive flag or "smart case" searching.
161
+
162
+## Faster matching
163
+
164
+Matching is currently performed using the standard lib's `strpbrk`, which has a
165
+very simple implementation (at least in glibc).
166
+
167
+Glibc has an extremely clever `strchr` implementation which searches the haystack
168
+string by [word](https://en.wikipedia.org/wiki/Word_(computer_architecture)), a
169
+4 or 8 byte `long int`, instead of by byte. It tests if a word is likely to
170
+contain either the search char or the null terminator using bit twiddling.
171
+
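To give a feel for the word-at-a-time approach, here is a sketch of one well-known bit trick for testing whether an 8-byte word contains a given byte. It is not glibc's code: it uses a different, exact formulation of the test, assumes 64-bit words, and the helper names are hypothetical. The second function shows how the same test might be made case-insensitive for ASCII letters by setting bit `0x20` in every byte first; that variant is only valid when `c` is a letter.

```c
#include <stdint.h>
#include <string.h>

#define ONES  UINT64_C(0x0101010101010101)
#define HIGHS UINT64_C(0x8080808080808080)
#define CASE  UINT64_C(0x2020202020202020) /* ASCII case bit in each byte */

/* Nonzero if any byte of w is zero. */
static uint64_t has_zero_byte(uint64_t w) {
	return (w - ONES) & ~w & HIGHS;
}

/* Nonzero if any of the 8 bytes at p equals c. */
int word_has_byte(const char *p, unsigned char c) {
	uint64_t w;
	memcpy(&w, p, sizeof w); /* alignment-safe load of 8 bytes */
	/* XOR with c repeated in every byte zeroes exactly the matching bytes. */
	return has_zero_byte(w ^ (ONES * c)) != 0;
}

/* Nonzero if any of the 8 bytes at p equals the ASCII letter c in either
 * case. Assumption: c is in 'a'..'z' or 'A'..'Z'. */
int word_has_letter_nocase(const char *p, unsigned char c) {
	uint64_t w;
	memcpy(&w, p, sizeof w);
	/* Force the case bit on in both the word and the pattern. */
	return has_zero_byte((w | CASE) ^ (ONES * (c | 0x20))) != 0;
}
```

A real search loop would also need to test each word for the null terminator, for example with `word_has_byte(p, 0)`, and then fall back to a byte-by-byte scan to locate the exact position.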
172
+A similar method could probably be written to find a character in a
173
+string case-insensitively.
174
+
175
+* https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strchr.c;h=f73891d439dcd8a08954fad4d4615acac4e0eb85;hb=HEAD
176
+
... ...
@@ -36,7 +36,7 @@ void mat_print(score_t *mat, const char *needle, const char *haystack) {
36 36
 	int i, j;
37 37
 	fprintf(stderr, "    ");
38 38
 	for (j = 0; j < m; j++) {
39
-		fprintf(stderr, "     %c", haystack[j]);
39
+		fprintf(stderr, "    %c", haystack[j]);
40 40
 	}
41 41
 	fprintf(stderr, "\n");
42 42
 	for (i = 0; i < n; i++) {
... ...
@@ -44,9 +44,9 @@ void mat_print(score_t *mat, const char *needle, const char *haystack) {
44 44
 		for (j = 0; j < m; j++) {
45 45
 			score_t val = mat[i * m + j];
46 46
 			if (val == SCORE_MIN) {
47
-				fprintf(stderr, "  -inf");
47
+				fprintf(stderr, "   -\u221E");
48 48
 			} else {
49
-				fprintf(stderr, " % .2f", val);
49
+				fprintf(stderr, " % 4g", val);
50 50
 			}
51 51
 		}
52 52
 		fprintf(stderr, "\n");