| 1 | 1 |
new file mode 100644 |
| ... | ... |
@@ -0,0 +1,176 @@ |
| 1 |
+ |
|
| 2 |
+This document describes the scoring algorithm of fzy as well as the algorithm |
|
| 3 |
+of other similar projects. |
|
| 4 |
+ |
|
| 5 |
+# Matching vs Scoring |
|
| 6 |
+ |
|
| 7 |
+I like to split the problem of fuzzy matching into two subproblems: matching and scoring. |
|
| 8 |
+ |
|
| 9 |
+Matching determines which results are eligible for the list. |
|
| 10 |
+All the projects here consider this to be the same problem, matching the |
|
| 11 |
+candidate strings against the search string with any number of gaps. |
|
| 12 |
+ |
|
| 13 |
+Scoring determines the order in which the results are sorted. |
|
| 14 |
+Since scoring is tasked with finding what the human user intended, there is no |
|
| 15 |
+correct solution. As a result there is a large variety of scoring strategies. |
|
| 16 |
+ |
|
| 17 |
+# fzy's matching |
|
| 18 |
+ |
|
| 19 |
+Generally, more time is spent in matching than in scoring, so it is |
|
| 20 |
+important that matching be as fast as possible. If this were case sensitive it |
|
| 21 |
+would be a simple loop calling `strchr`, but since it needs to be case |
|
| 22 |
+insensitive, `strpbrk` is used instead. |
|
| 23 |
+ |
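As a rough illustration of the matching step, the loop below checks whether every character of the search string appears in the candidate, in order, with any number of gaps. It is a minimal sketch rather than fzy's actual source: the name `has_match_sketch` is hypothetical, and it assumes plain ASCII input.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical sketch: returns 1 if every character of needle occurs in
 * haystack, in order, ignoring case; returns 0 otherwise. */
int has_match_sketch(const char *needle, const char *haystack) {
    while (*needle) {
        unsigned char nch = (unsigned char)*needle++;
        /* Accept either case of the current needle character. */
        const char accept[3] = {(char)tolower(nch), (char)toupper(nch), 0};
        haystack = strpbrk(haystack, accept);
        if (!haystack)
            return 0; /* character not found in the rest of the candidate */
        haystack++;   /* continue searching after the matched character */
    }
    return 1;
}
```

With this sketch, `has_match_sketch("abc", "xAxbxC")` succeeds, while `has_match_sketch("abc", "acb")` fails because the characters appear out of order.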
|
| 24 |
+# fzy's scoring |
|
| 25 |
+ |
|
| 26 |
+fzy treats scoring as a modified [edit |
|
| 27 |
+distance](https://en.wikipedia.org/wiki/Edit_distance) problem of calculating |
|
| 28 |
+the |
|
| 29 |
+[Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance). |
|
| 30 |
+Edit distance is the measure of how different two strings are in terms of |
|
| 31 |
+insertions, deletions, and substitutions. This is the same problem as [DNA |
|
| 32 |
+sequence alignment](https://en.wikipedia.org/wiki/Sequence_alignment). Fuzzy |
|
| 33 |
+matching is a simpler problem which only accepts insertions, not deletions or |
|
| 34 |
+substitutions. |
|
| 35 |
+ |
|
| 36 |
+fzy's scoring is a dynamic programming algorithm similar to |
|
| 37 |
+[Wagner–Fischer](https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm) |
|
| 38 |
+and |
|
| 39 |
+[Needleman–Wunsch](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm). |
|
| 40 |
+ |
|
| 41 |
+Dynamic programming requires the observation that the result is based on the |
|
| 42 |
+result of subproblems. |
|
| 43 |
+ |
|
| 44 |
+Fzy borrows heavily from concepts in bioinformatics to perform scoring. |
|
| 45 |
+ |
|
| 46 |
+Fzy builds an `n`-by-`m` matrix, where `n` is the length of the search string |
|
| 47 |
+and `m` the length of the candidate string. Each position `(i,j)` in the matrix |
|
| 48 |
+stores the score for matching the first `i` characters of the search with the |
|
| 49 |
+first `j` characters of the candidate. |
|
| 50 |
+ |
|
| 51 |
+Fzy uses an affine gap penalty: this simply means that we assign a |
|
| 52 |
+constant penalty for having a gap and a linear penalty for the length of the |
|
| 53 |
+gap. |
|
| 54 |
+Inspired by the [Gotoh algorithm |
|
| 55 |
+(pdf)](http://www.cs.unibo.it/~dilena/LabBII/Papers/AffineGaps.pdf), fzy |
|
| 56 |
+computes a second `D` (for diagonal) matrix in parallel with the score matrix. |
|
| 57 |
+The `D` matrix computes the best score which *ends* in a match. This allows |
|
| 58 |
+both computation of the penalty for starting a gap and the score for a |
|
| 59 |
+consecutive match. |
|
| 60 |
+ |
|
| 61 |
+Using this algorithm fzy is able to score based on the optimal match. |
|
| 62 |
+ |
|
| 63 |
+* Gaps (negative score) |
|
| 64 |
+ * at the start of the match |
|
| 65 |
+ * at the end of the match |
|
| 66 |
+ * within the match |
|
| 67 |
+* Matches (positive score) |
|
| 68 |
+ * consecutive |
|
| 69 |
+ * following a slash |
|
| 70 |
+ * following a space (the start of a word) |
|
| 71 |
+ * capital letter (the start of a CamelCase word) |
|
| 72 |
+ * following a dot (often a file extension) |
|
| 73 |
+ |
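The two-matrix scheme and the bonuses listed above can be sketched as follows. This is an illustration under stated assumptions, not fzy's implementation: the function names, the `MAX_LEN` limit, and every numeric weight are made up for the example, and the start of the candidate is treated like a position following a slash.

```c
#include <ctype.h>
#include <math.h>
#include <string.h>

#define MAX_LEN 64 /* hypothetical limit to keep the sketch simple */

/* Illustrative weights only; assumptions, not fzy's real constants. */
#define GAP_LEADING       (-0.005)
#define GAP_INNER         (-0.01)
#define GAP_TRAILING      (-0.005)
#define MATCH_CONSECUTIVE 1.0
#define BONUS_BOUNDARY    0.8 /* after '/', ' ' or '.' */
#define BONUS_CAPITAL     0.7 /* capital after a lowercase letter */

/* Bonus for matching the candidate character at position j. */
static double bonus_at(const char *haystack, size_t j) {
    unsigned char prev = j ? (unsigned char)haystack[j - 1] : '/';
    if (prev == '/' || prev == ' ' || prev == '.')
        return BONUS_BOUNDARY;
    if (islower(prev) && isupper((unsigned char)haystack[j]))
        return BONUS_CAPITAL;
    return 0.0;
}

/* D[i][j]: best score for needle[0..i] in haystack[0..j] ending in a match
 * at (i, j). M[i][j]: best score overall for the same prefixes.
 * Returns -INFINITY when there is no match or the input is too long. */
double score_sketch(const char *needle, const char *haystack) {
    size_t n = strlen(needle), m = strlen(haystack);
    if (n == 0 || n > m || m > MAX_LEN)
        return -INFINITY;
    double D[MAX_LEN][MAX_LEN], M[MAX_LEN][MAX_LEN];
    for (size_t i = 0; i < n; i++) {
        double prev = -INFINITY;
        /* Gaps after the last needle character are trailing gaps. */
        double gap = (i == n - 1) ? GAP_TRAILING : GAP_INNER;
        for (size_t j = 0; j < m; j++) {
            double s = -INFINITY;
            if (tolower((unsigned char)needle[i]) ==
                tolower((unsigned char)haystack[j])) {
                if (i == 0) {
                    /* Linear penalty for the leading gap, plus any bonus. */
                    s = (double)j * GAP_LEADING + bonus_at(haystack, j);
                } else if (j > 0) {
                    double after_gap = M[i - 1][j - 1] + bonus_at(haystack, j);
                    double consecutive = D[i - 1][j - 1] + MATCH_CONSECUTIVE;
                    s = after_gap > consecutive ? after_gap : consecutive;
                }
            }
            D[i][j] = s;
            /* Either end in a match here, or extend the gap by one. */
            prev = (s > prev + gap) ? s : prev + gap;
            M[i][j] = prev;
        }
    }
    return M[n - 1][m - 1];
}
```

Under these weights a consecutive match such as `score_sketch("abc", "abc")` scores higher than a gapped one such as `score_sketch("abc", "axbxc")`, and a match after a slash beats one in the middle of a word.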
|
| 74 |
+ |
|
| 75 |
+ |
|
| 76 |
+# Other fuzzy finders |
|
| 77 |
+ |
|
| 78 |
+## TextMate |
|
| 79 |
+ |
|
| 80 |
+TextMate deserves immense credit for popularizing fuzzy finding from inside |
|
| 81 |
+text editors. Its influence can be found in the command-t project, in the various |
|
| 82 |
+other editors that use command-t for file finding, and in the 't' command in the GitHub |
|
| 83 |
+web interface. |
|
| 84 |
+ |
|
| 85 |
+* https://github.com/textmate/textmate/blob/master/Frameworks/text/src/ranker.cc |
|
| 86 |
+ |
|
| 87 |
+## command-t, ctrlp-cmatcher |
|
| 88 |
+ |
|
| 89 |
+Command-t is a plugin first released in 2010 intending to bring TextMate's |
|
| 90 |
+"Go to File" feature to vim. |
|
| 91 |
+ |
|
| 92 |
+Anecdotally, this algorithm works very well. The recursive nature makes it a little hard to |
|
| 93 |
+ |
|
| 94 |
+The way `last_idx` is used is suspicious. |
|
| 95 |
+ |
|
| 96 |
+* https://github.com/wincent/command-t/blob/master/ruby/command-t/match.c |
|
| 97 |
+* https://github.com/JazzCore/ctrlp-cmatcher/blob/master/autoload/fuzzycomt.c |
|
| 98 |
+ |
|
| 99 |
+## Length of shortest first match: fzf |
|
| 100 |
+https://github.com/junegunn/fzf/blob/master/src/algo/algo.go |
|
| 101 |
+ |
|
| 102 |
+fzf scores based on the size of the greedy shortest match. fzf finds its match |
|
| 103 |
+by the first match appearing in the candidate string. It has some cleverness to |
|
| 104 |
+find if there is a shorter match contained in that search, but it isn't |
|
| 105 |
+guaranteed to find the shortest match in the string. |
|
| 106 |
+ |
|
| 107 |
+Example results for the search "abc" |
|
| 108 |
+ |
|
| 109 |
+* <tt>**AXXBXXC**xxabc</tt> |
|
| 110 |
+* <tt>xxxxxxx**AXBXC**</tt> |
|
| 111 |
+* <tt>xxxxxxxxx**ABC**</tt> |
|
| 112 |
+ |
|
| 113 |
+## Length of first match: ctrlp, pick, selecta (`<= 0.0.6`) |
|
| 114 |
+ |
|
| 115 |
+These score based on the length of the first match in the candidate. This is |
|
| 116 |
+probably the simplest useful algorithm. This has the advantage that the heavy |
|
| 117 |
+lifting can be performed by the regex engine, which is faster than implementing |
|
| 118 |
+anything natively in ruby or Vim script. |
|
| 119 |
+ |
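The "length of first match" idea can be sketched without a regex engine as a single left-to-right scan. The function name `first_match_length` is hypothetical, and case-insensitive ASCII comparison is an assumption made for the example; the projects named above may differ in those details.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical sketch: length of the first (leftmost) match of needle in
 * haystack, measured from the first matched character to the last.
 * Returns 0 if needle is empty or does not match. */
size_t first_match_length(const char *needle, const char *haystack) {
    const char *start = NULL;
    const char *h = haystack;
    if (!*needle)
        return 0;
    for (; *needle; needle++) {
        /* Skip ahead to the next occurrence of the current needle char. */
        while (*h && tolower((unsigned char)*h) !=
                         tolower((unsigned char)*needle))
            h++;
        if (!*h)
            return 0; /* ran out of candidate: no match */
        if (!start)
            start = h; /* remember where the match begins */
        h++;
    }
    return (size_t)(h - start);
}
```

For the search "abc" against `AXXBXXCxxabc`, this returns 7 even though a 3-character match exists later in the string, which is the weakness of scoring on the first match only.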
|
| 120 |
+## Length of shortest match: pick |
|
| 121 |
+ |
|
| 122 |
+Pick has a method, `min_match`, to find the absolute shortest match in a string. |
|
| 123 |
+This will find better results than the first-match finders above, at the expense of speed, as backtracking is required. |
|
| 124 |
+ |
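One simple way to find the absolute shortest match is to try every possible start position and keep the smallest result. The sketch below uses that brute-force approach, which is not necessarily how pick's `min_match` works; the function name and the case-insensitive comparison are assumptions made for illustration.

```c
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive comparison of two ASCII characters. */
static int same_char(char a, char b) {
    return tolower((unsigned char)a) == tolower((unsigned char)b);
}

/* Hypothetical sketch: length of the shortest match of needle anywhere in
 * haystack, or 0 if there is no match. Worst case is quadratic in the
 * length of haystack. */
size_t shortest_match_length(const char *needle, const char *haystack) {
    size_t best = 0;
    if (!*needle)
        return 0;
    for (const char *s = haystack; *s; s++) {
        if (!same_char(*s, needle[0]))
            continue; /* a match must start on the first needle char */
        const char *h = s;
        const char *n = needle;
        /* Greedy scan: the earliest possible end for this start. */
        while (*n && *h) {
            if (same_char(*h, *n))
                n++;
            h++;
        }
        if (*n)
            break; /* no match from here, so none from later starts */
        size_t len = (size_t)(h - s);
        if (best == 0 || len < best)
            best = len;
    }
    return best;
}
```

For the search "abc" against `AXXBXXCxxabc`, this returns 3, where a first-match finder would report 7.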
|
| 125 |
+## selecta (latest master) |
|
| 126 |
+https://github.com/garybernhardt/selecta/commit/d874c99dd7f0f94225a95da06fc487b0fa5b9edc |
|
| 127 |
+https://github.com/garybernhardt/selecta/issues/80 |
|
| 128 |
+ |
|
| 129 |
+Selecta doesn't compare all possible matches, but only the shortest match from the same start location. |
|
| 130 |
+This can lead to inconsistent results. |
|
| 131 |
+ |
|
| 132 |
+Example results for the search "abc" |
|
| 133 |
+ |
|
| 134 |
+* <tt>x**AXXXXBC**</tt> |
|
| 135 |
+* <tt>x**ABXC**x</tt> |
|
| 136 |
+* <tt>x**ABXC**xbc</tt> |
|
| 137 |
+ |
|
| 138 |
+The third result here should have been scored the same as the first, but the |
|
| 139 |
+lower-scoring, shorter match is what is measured. |
|
| 140 |
+ |
|
| 141 |
+ |
|
| 142 |
+## others |
|
| 143 |
+ |
|
| 144 |
+* https://github.com/joshaven/string_score/blob/master/coffee/string_score.coffee (first match + heuristics) |
|
| 145 |
+* https://github.com/atom/fuzzaldrin/blob/master/src/scorer.coffee (modified version of string_score) |
|
| 146 |
+* https://github.com/jeancroy/fuzzaldrin-plus/blob/master/src/scorer.coffee (Smith Waterman) |
|
| 147 |
+ |
|
| 148 |
+ |
|
| 149 |
+# Possible fzy Algorithm Improvements |
|
| 150 |
+ |
|
| 151 |
+## Multithreading |
|
| 152 |
+ |
|
| 153 |
+Currently a single thread is used for finding matches. Using multiple threads |
|
| 154 |
+would likely be faster, but would require some additional complexity. |
|
| 155 |
+ |
|
| 156 |
+## Case sensitivity |
|
| 157 |
+ |
|
| 158 |
+fzy currently treats all searches as case-insensitive. However, scoring prefers |
|
| 159 |
+matches on uppercase letters to help find CamelCase candidates. It might be |
|
| 160 |
+desirable to support a case sensitive flag or "smart case" searching. |
|
| 161 |
+ |
|
| 162 |
+## Faster matching |
|
| 163 |
+ |
|
| 164 |
+Matching is currently performed using the standard lib's `strpbrk`, which has a |
|
| 165 |
+very simple implementation (at least in glibc). |
|
| 166 |
+ |
|
| 167 |
+Glibc has an extremely clever `strchr` implementation which searches the haystack |
|
| 168 |
+string by [word](https://en.wikipedia.org/wiki/Word_(computer_architecture)), a |
|
| 169 |
+4 or 8 byte `long int`, instead of by byte. It tests if a word is likely to |
|
| 170 |
+contain either the search char or the null terminator using bit twiddling. |
|
| 171 |
+ |
|
| 172 |
+A similar method could probably be written to find a character in a |
|
| 173 |
+string case-insensitively. |
|
| 174 |
+ |
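A word-at-a-time, case-insensitive search along these lines might look like the sketch below. It is only an illustration of the bit-twiddling idea, with several assumptions: the name `memcasechr_sketch` is hypothetical, the search character must be an ASCII letter, and the buffer length is passed explicitly so the code never reads past the end (unlike a real `strchr`-style routine, it does not look for a null terminator).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL
/* Nonzero if and only if some byte of v is zero. */
#define HAS_ZERO(v) (((v) - ONES) & ~(v) & HIGHS)

/* Hypothetical sketch: find the first occurrence of the ASCII letter c,
 * in either case, within the len bytes at s. Returns NULL if absent. */
const char *memcasechr_sketch(const char *s, char c, size_t len) {
    unsigned char lower = (unsigned char)c | 0x20; /* lowercase form */
    uint64_t pattern = ONES * lower;               /* letter in every byte */
    size_t i = 0;
    /* Test 8 bytes at a time. */
    for (; i + 8 <= len; i += 8) {
        uint64_t word;
        memcpy(&word, s + i, 8); /* avoids unaligned access problems */
        word |= ONES * 0x20;     /* fold uppercase letters to lowercase */
        if (HAS_ZERO(word ^ pattern))
            break;               /* some byte in this word equals the letter */
    }
    /* Locate the exact byte, or scan the remaining tail. */
    for (; i < len; i++)
        if (((unsigned char)s[i] | 0x20) == lower)
            return s + i;
    return NULL;
}
```

Setting bit `0x20` maps an uppercase ASCII letter to its lowercase form, so a single XOR against the broadcast pattern tests both cases at once. The same trick would give false hits for non-letter search characters, which is why the sketch restricts `c` to letters.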
|
| 175 |
+* https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strchr.c;h=f73891d439dcd8a08954fad4d4615acac4e0eb85;hb=HEAD |
|
| 176 |
+ |
| ... | ... |
@@ -36,7 +36,7 @@ void mat_print(score_t *mat, const char *needle, const char *haystack) {
|
| 36 | 36 |
int i, j; |
| 37 | 37 |
fprintf(stderr, " "); |
| 38 | 38 |
for (j = 0; j < m; j++) {
|
| 39 |
- fprintf(stderr, " %c", haystack[j]); |
|
| 39 |
+ fprintf(stderr, " %c", haystack[j]); |
|
| 40 | 40 |
} |
| 41 | 41 |
fprintf(stderr, "\n"); |
| 42 | 42 |
for (i = 0; i < n; i++) {
|
| ... | ... |
@@ -44,9 +44,9 @@ void mat_print(score_t *mat, const char *needle, const char *haystack) {
|
| 44 | 44 |
for (j = 0; j < m; j++) {
|
| 45 | 45 |
score_t val = mat[i * m + j]; |
| 46 | 46 |
if (val == SCORE_MIN) {
|
| 47 |
- fprintf(stderr, " -inf"); |
|
| 47 |
+ fprintf(stderr, " -\u221E"); |
|
| 48 | 48 |
} else {
|
| 49 |
- fprintf(stderr, " % .2f", val); |
|
| 49 |
+ fprintf(stderr, " % 4g", val); |
|
| 50 | 50 |
} |
| 51 | 51 |
} |
| 52 | 52 |
fprintf(stderr, "\n"); |