Update ALGORITHM.md
Missed a few letters
| ... | ... |
@@ -88,7 +88,7 @@ web interface. |
| 88 | 88 |
|
| 89 | 89 |
## command-t, ctrlp-cmatcher |
| 90 | 90 |
|
| 91 |
-Command is a plugin first released in 2010 intending to bring TextMate's |
|
| 91 |
+Command-t is a plugin first released in 2010 intending to bring TextMate's |
|
| 92 | 92 |
"Go to File" feature to vim. |
| 93 | 93 |
|
| 94 | 94 |
Anecdotally, this algorithm works very well. The recursive nature makes it a little hard to |
Fix link in README
| ... | ... |
@@ -59,7 +59,7 @@ both computation of the penalty for starting a gap and the score for a |
| 59 | 59 |
consecutive match. |
| 60 | 60 |
|
| 61 | 61 |
Using [this |
| 62 |
-algorithm](https://github.com/jhawthorn/fzy/blob/master/src/match.c#L58) fzy |
|
| 62 |
+algorithm](https://github.com/jhawthorn/fzy/blob/master/src/match.c#L105) fzy |
|
| 63 | 63 |
is able to score based on the optimal match. |
| 64 | 64 |
|
| 65 | 65 |
* Gaps (negative score) |
| ... | ... |
@@ -80,7 +80,7 @@ is able to score based on the optimal match. |
| 80 | 80 |
## TextMate |
| 81 | 81 |
|
| 82 | 82 |
TextMate deserves immense credit for popularizing fuzzy finding from inside |
| 83 |
-text editors. It's influence can be found in the commant-t project, various |
|
| 83 |
+text editors. It's influence can be found in the command-t project, various |
|
| 84 | 84 |
other editors use command-t for file finding, and the 't' command in the github |
| 85 | 85 |
web interface. |
| 86 | 86 |
|
| ... | ... |
@@ -101,7 +101,7 @@ The wy `last_idx` is suspicious. |
| 101 | 101 |
## Length of shortest first match: fzf |
| 102 | 102 |
https://github.com/junegunn/fzf/blob/master/src/algo/algo.go |
| 103 | 103 |
|
| 104 |
-Fzy scores based on the size of the greedy shortest match. fzf finds it's match |
|
| 104 |
+Fzy scores based on the size of the greedy shortest match. fzf finds its match |
|
| 105 | 105 |
by the first match appearing in the candidate string. It has some cleverness to |
| 106 | 106 |
find if there is a shorter match contained in that search, but it isn't |
| 107 | 107 |
guaranteed to find the shortest match in the string. |
| ... | ... |
@@ -137,7 +137,7 @@ Example results for the search "abc" |
| 137 | 137 |
* <tt>x**ABXC**x</tt> |
| 138 | 138 |
* <tt>x**ABXC**xbc</tt> |
| 139 | 139 |
|
| 140 |
-The third result here shoud have been scored the same as the first, but the |
|
| 140 |
+The third result here should have been scored the same as the first, but the |
|
| 141 | 141 |
lower scoring but shorter match is what is measured. |
| 142 | 142 |
|
| 143 | 143 |
|
This now has multithreading so it can be removed from the TODO.
| ... | ... |
@@ -150,11 +150,6 @@ lower scoring but shorter match is what is measured. |
| 150 | 150 |
|
| 151 | 151 |
# Possible fzy Algorithm Improvements |
| 152 | 152 |
|
| 153 |
-## Multithreading |
|
| 154 |
- |
|
| 155 |
-Currently a single thread is used for finding matches. Using multiple threads |
|
| 156 |
-would likely be faster, but require some additional complexity. |
|
| 157 |
- |
|
| 158 | 153 |
## Case sensitivity |
| 159 | 154 |
|
| 160 | 155 |
fzy currently treats all searches as case-insensitive. However, scoring prefers |
This made sense on paper when deciding what was a "word". However in
reality this is rarely an indication of a separate word. I've found that
this caused hexadecimal or base64 strings to be favoured in matches.
| ... | ... |
@@ -69,7 +69,7 @@ is able to score based on the optimal match. |
| 69 | 69 |
* Matches (positive score) |
| 70 | 70 |
* consecutive |
| 71 | 71 |
* following a slash |
| 72 |
- * following a space, underscore, dash, or number (the start of a word) |
|
| 72 |
+ * following a space, underscore, or dash (the start of a word) |
|
| 73 | 73 |
* capital letter (the start of a CamelCase word) |
| 74 | 74 |
* following a dot (often a file extension) |
| 75 | 75 |
|
| ... | ... |
@@ -55,10 +55,12 @@ Inspired by the [Gotoh algorithm |
| 55 | 55 |
(pdf)](http://www.cs.unibo.it/~dilena/LabBII/Papers/AffineGaps.pdf), fzy |
| 56 | 56 |
computes a second `D` (for diagonal) matrix in parallel with the score matrix. |
| 57 | 57 |
The `D` matrix computes the best score which *ends* in a match. This allows |
| 58 |
-both computation of the penlalty for starting a gap and the score for a |
|
| 58 |
+both computation of the penalty for starting a gap and the score for a |
|
| 59 | 59 |
consecutive match. |
| 60 | 60 |
|
| 61 |
-Using this algorithm fzy is able to score based on the optimal match. |
|
| 61 |
+Using [this |
|
| 62 |
+algorithm](https://github.com/jhawthorn/fzy/blob/master/src/match.c#L58) fzy |
|
| 63 |
+is able to score based on the optimal match. |
|
| 62 | 64 |
|
| 63 | 65 |
* Gaps (negative score) |
| 64 | 66 |
* at the start of the match |
| ... | ... |
@@ -67,8 +69,8 @@ Using this algorithm fzy is able to score based on the optimal match. |
| 67 | 69 |
* Matches (positive score) |
| 68 | 70 |
* consecutive |
| 69 | 71 |
* following a slash |
| 70 |
- * following a space (the start of a word) |
|
| 71 |
- * capital letter (the start of a CamlCase word) |
|
| 72 |
+ * following a space, underscore, dash, or number (the start of a word) |
|
| 73 |
+ * capital letter (the start of a CamelCase word) |
|
| 72 | 74 |
* following a dot (often a file extension) |
| 73 | 75 |
|
| 74 | 76 |
|
| 1 | 1 |
new file mode 100644 |
| ... | ... |
@@ -0,0 +1,176 @@ |
| 1 |
+ |
|
| 2 |
+This document describes the scoring algorithm of fzy as well as the algorithm |
|
| 3 |
+of other similar projects. |
|
| 4 |
+ |
|
| 5 |
+# Matching vs Scoring |
|
| 6 |
+ |
|
| 7 |
+I like to split the problem a fuzzy matchers into two subproblems: matching and scoring. |
|
| 8 |
+ |
|
| 9 |
+Matching determines which results are eligible for the list. |
|
| 10 |
+All the projects here consider this to be the same problem, matching the |
|
| 11 |
+candidate strings against the search string with any number of gaps. |
|
| 12 |
+ |
|
| 13 |
+Scoring determines the order in which the results are sorted. |
|
| 14 |
+Since scoring is tasked with finding what the human user intended, there is no |
|
| 15 |
+correct solution. As a result there are large variety in scoring strategies. |
|
| 16 |
+ |
|
| 17 |
+# fzy's matching |
|
| 18 |
+ |
|
| 19 |
+Generally, more time is taken in matching rather than scoring, so it is |
|
| 20 |
+important that matching be as fast as possible. If this were case sensitive it |
|
| 21 |
+would be a simple loop calling strchr, but since it needs to be case |
|
| 22 |
+insensitive. |
|
| 23 |
+ |
|
| 24 |
+# fzy's scoring |
|
| 25 |
+ |
|
| 26 |
+fzy treats scoring as a modified [edit |
|
| 27 |
+distance](https://en.wikipedia.org/wiki/Edit_distance) problem of calculating |
|
| 28 |
+the |
|
| 29 |
+[Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance). |
|
| 30 |
+Edit distance is the measure of how different two strings are in terms of |
|
| 31 |
+insertions, deletions, and substitutions. This is the same problems as [DNA |
|
| 32 |
+sequence alignment](https://en.wikipedia.org/wiki/Sequence_alignment). Fuzzy |
|
| 33 |
+matching is a simpler problem which only accepts insertions, not deletions or |
|
| 34 |
+substitutions. |
|
| 35 |
+ |
|
| 36 |
+fzy's scoring is a dynamic programming algorithm similar to |
|
| 37 |
+[Wagner–Fischer](https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm) |
|
| 38 |
+and |
|
| 39 |
+[Needleman–Wunsch](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm). |
|
| 40 |
+ |
|
| 41 |
+Dynamic programming requires the observation that the result is based on the |
|
| 42 |
+result of subproblems. |
|
| 43 |
+ |
|
| 44 |
+Fzy borrows heavily from concepts in bioinformatics to performs scoring. |
|
| 45 |
+ |
|
| 46 |
+Fzy builds a `n`-by-`m` matrix, where `n` is the length of the search string |
|
| 47 |
+and `m` the length of the candidate string. Each position `(i,j)` in the matrix |
|
| 48 |
+stores the score for matching the first `i` characters of the search with the |
|
| 49 |
+first `j` characters of the candidate. |
|
| 50 |
+ |
|
| 51 |
+Fzy calculates an affine gap penalty, this means simply that we assign a |
|
| 52 |
+constant penalty for having a gap and a linear penalty for the length of the |
|
| 53 |
+gap. |
|
| 54 |
+Inspired by the [Gotoh algorithm |
|
| 55 |
+(pdf)](http://www.cs.unibo.it/~dilena/LabBII/Papers/AffineGaps.pdf), fzy |
|
| 56 |
+computes a second `D` (for diagonal) matrix in parallel with the score matrix. |
|
| 57 |
+The `D` matrix computes the best score which *ends* in a match. This allows |
|
| 58 |
+both computation of the penlalty for starting a gap and the score for a |
|
| 59 |
+consecutive match. |
|
| 60 |
+ |
|
| 61 |
+Using this algorithm fzy is able to score based on the optimal match. |
|
| 62 |
+ |
|
| 63 |
+* Gaps (negative score) |
|
| 64 |
+ * at the start of the match |
|
| 65 |
+ * at the end of the match |
|
| 66 |
+ * within the match |
|
| 67 |
+* Matches (positive score) |
|
| 68 |
+ * consecutive |
|
| 69 |
+ * following a slash |
|
| 70 |
+ * following a space (the start of a word) |
|
| 71 |
+ * capital letter (the start of a CamlCase word) |
|
| 72 |
+ * following a dot (often a file extension) |
|
| 73 |
+ |
|
| 74 |
+ |
|
| 75 |
+ |
|
| 76 |
+# Other fuzzy finders |
|
| 77 |
+ |
|
| 78 |
+## TextMate |
|
| 79 |
+ |
|
| 80 |
+TextMate deserves immense credit for popularizing fuzzy finding from inside |
|
| 81 |
+text editors. It's influence can be found in the commant-t project, various |
|
| 82 |
+other editors use command-t for file finding, and the 't' command in the github |
|
| 83 |
+web interface. |
|
| 84 |
+ |
|
| 85 |
+* https://github.com/textmate/textmate/blob/master/Frameworks/text/src/ranker.cc |
|
| 86 |
+ |
|
| 87 |
+## command-t, ctrlp-cmatcher |
|
| 88 |
+ |
|
| 89 |
+Command is a plugin first released in 2010 intending to bring TextMate's |
|
| 90 |
+"Go to File" feature to vim. |
|
| 91 |
+ |
|
| 92 |
+Anecdotally, this algorithm works very well. The recursive nature makes it a little hard to |
|
| 93 |
+ |
|
| 94 |
+The wy `last_idx` is suspicious. |
|
| 95 |
+ |
|
| 96 |
+* https://github.com/wincent/command-t/blob/master/ruby/command-t/match.c |
|
| 97 |
+* https://github.com/JazzCore/ctrlp-cmatcher/blob/master/autoload/fuzzycomt.c |
|
| 98 |
+ |
|
| 99 |
+## Length of shortest first match: fzf |
|
| 100 |
+https://github.com/junegunn/fzf/blob/master/src/algo/algo.go |
|
| 101 |
+ |
|
| 102 |
+Fzy scores based on the size of the greedy shortest match. fzf finds it's match |
|
| 103 |
+by the first match appearing in the candidate string. It has some cleverness to |
|
| 104 |
+find if there is a shorter match contained in that search, but it isn't |
|
| 105 |
+guaranteed to find the shortest match in the string. |
|
| 106 |
+ |
|
| 107 |
+Example results for the search "abc" |
|
| 108 |
+ |
|
| 109 |
+* <tt>**AXXBXXC**xxabc</tt> |
|
| 110 |
+* <tt>xxxxxxx**AXBXC**</tt> |
|
| 111 |
+* <tt>xxxxxxxxx**ABC**</tt> |
|
| 112 |
+ |
|
| 113 |
+## Length of first match: ctrlp, pick, selecta (`<= 0.0.6`) |
|
| 114 |
+ |
|
| 115 |
+These score based on the length of the first match in the candidate. This is |
|
| 116 |
+probably the simplest useful algorithm. This has the advantage that the heavy |
|
| 117 |
+lifting can be performed by the regex engine, which is faster than implementing |
|
| 118 |
+anything natively in ruby or Vim script. |
|
| 119 |
+ |
|
| 120 |
+## Length of shortest match: pick |
|
| 121 |
+ |
|
| 122 |
+Pick has a method, `min_match`, to find the absolute shortest match in a string. |
|
| 123 |
+This will find better results than the finders, at the expense of speed, as backtracking is required. |
|
| 124 |
+ |
|
| 125 |
+## selecta (latest master) |
|
| 126 |
+https://github.com/garybernhardt/selecta/commit/d874c99dd7f0f94225a95da06fc487b0fa5b9edc |
|
| 127 |
+https://github.com/garybernhardt/selecta/issues/80 |
|
| 128 |
+ |
|
| 129 |
+Selecta doesn't compare all possible matches, but only the shortest match from the same start location. |
|
| 130 |
+This can lead to inconsistent results. |
|
| 131 |
+ |
|
| 132 |
+Example results for the search "abc" |
|
| 133 |
+ |
|
| 134 |
+* <tt>x**AXXXXBC**</tt> |
|
| 135 |
+* <tt>x**ABXC**x</tt> |
|
| 136 |
+* <tt>x**ABXC**xbc</tt> |
|
| 137 |
+ |
|
| 138 |
+The third result here shoud have been scored the same as the first, but the |
|
| 139 |
+lower scoring but shorter match is what is measured. |
|
| 140 |
+ |
|
| 141 |
+ |
|
| 142 |
+## others |
|
| 143 |
+ |
|
| 144 |
+* https://github.com/joshaven/string_score/blob/master/coffee/string_score.coffee (first match + heuristics) |
|
| 145 |
+* https://github.com/atom/fuzzaldrin/blob/master/src/scorer.coffee (modified version of string_score) |
|
| 146 |
+* https://github.com/jeancroy/fuzzaldrin-plus/blob/master/src/scorer.coffee (Smith Waterman) |
|
| 147 |
+ |
|
| 148 |
+ |
|
| 149 |
+# Possible fzy Algorithm Improvements |
|
| 150 |
+ |
|
| 151 |
+## Multithreading |
|
| 152 |
+ |
|
| 153 |
+Currently a single thread is used for finding matches. Using multiple threads |
|
| 154 |
+would likely be faster, but require some additional complexity. |
|
| 155 |
+ |
|
| 156 |
+## Case sensitivity |
|
| 157 |
+ |
|
| 158 |
+fzy currently treats all searches as case-insensitive. However, scoring prefers |
|
| 159 |
+matches on uppercase letters to help find CamelCase candidates. It might be |
|
| 160 |
+desirable to support a case sensitive flag or "smart case" searching. |
|
| 161 |
+ |
|
| 162 |
+## Faster matching |
|
| 163 |
+ |
|
| 164 |
+Matching is currently performed using the standard lib's `strpbrk`, which has a |
|
| 165 |
+very simple implementation (at least in glibc). |
|
| 166 |
+ |
|
| 167 |
+Glibc has an extremely clever `strchr` implementation which searches the haystack |
|
| 168 |
+string by [word](https://en.wikipedia.org/wiki/Word_(computer_architecture)), a |
|
| 169 |
+4 or 8 byte `long int`, instead of by byte. It tests if a word is likely to |
|
| 170 |
+contain either the search char or the null terminator using bit twiddling. |
|
| 171 |
+ |
|
| 172 |
+A similar method could probably be written to perform to find a character in a |
|
| 173 |
+string case-insensitively. |
|
| 174 |
+ |
|
| 175 |
+* https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strchr.c;h=f73891d439dcd8a08954fad4d4615acac4e0eb85;hb=HEAD |
|
| 176 |
+ |