Browse code

Merge pull request #150 from casprwang/patch-1

Update ALGORITHM.md

John Hawthorn authored on 23/01/2022 06:50:00 • GitHub committed on 23/01/2022 06:50:00
Showing 0 changed files
Browse code

Update ALGORITHM.md

Missed a few letters

Faris A Chugthai authored on 12/09/2021 00:30:36 • GitHub committed on 12/09/2021 00:30:36
Showing 1 changed files
... ...
@@ -88,7 +88,7 @@ web interface.
88 88
 
89 89
 ## command-t, ctrlp-cmatcher
90 90
 
91
-Command is a plugin first released in 2010 intending to bring TextMate's
91
+Command-t is a plugin first released in 2010 intending to bring TextMate's
92 92
 "Go to File" feature to vim.
93 93
 
94 94
 Anecdotally, this algorithm works very well. The recursive nature makes it a little hard to 
Browse code

Update ALGORITHM.md

Fix link in README

Casper Wang authored on 21/10/2020 16:45:11 • GitHub committed on 21/10/2020 16:45:11
Showing 1 changed files
... ...
@@ -59,7 +59,7 @@ both computation of the penalty for starting a gap and the score for a
59 59
 consecutive match.
60 60
 
61 61
 Using [this 
62
-algorithm](https://github.com/jhawthorn/fzy/blob/master/src/match.c#L58) fzy 
62
+algorithm](https://github.com/jhawthorn/fzy/blob/master/src/match.c#L105) fzy 
63 63
 is able to score based on the optimal match.
64 64
 
65 65
 * Gaps (negative score)
Browse code

Fix a few typos

Jonathan Neuschäfer authored on 23/09/2017 04:15:08
Showing 1 changed files
... ...
@@ -80,7 +80,7 @@ is able to score based on the optimal match.
80 80
 ## TextMate
81 81
 
82 82
 TextMate deserves immense credit for popularizing fuzzy finding from inside
83
-text editors. It's influence can be found in the commant-t project, various
83
+text editors. It's influence can be found in the command-t project, various
84 84
 other editors use command-t for file finding, and the 't' command in the github
85 85
 web interface.
86 86
 
... ...
@@ -101,7 +101,7 @@ The wy `last_idx` is suspicious.
101 101
 ## Length of shortest first match: fzf
102 102
 https://github.com/junegunn/fzf/blob/master/src/algo/algo.go
103 103
 
104
-Fzy scores based on the size of the greedy shortest match. fzf finds it's match
104
+Fzy scores based on the size of the greedy shortest match. fzf finds its match
105 105
 by the first match appearing in the candidate string. It has some cleverness to
106 106
 find if there is a shorter match contained in that search, but it isn't
107 107
 guaranteed to find the shortest match in the string.
... ...
@@ -137,7 +137,7 @@ Example results for the search "abc"
137 137
 * <tt>x**ABXC**x</tt>
138 138
 * <tt>x**ABXC**xbc</tt>
139 139
 
140
-The third result here shoud have been scored the same as the first, but the
140
+The third result here should have been scored the same as the first, but the
141 141
 lower scoring but shorter match is what is measured.
142 142
 
143 143
 
Browse code

Remove multithreading from TODO

This now has multithreading so it can be removed from the TODO.

Sven-Hendrik Haase authored on 07/11/2017 09:20:17 • GitHub committed on 07/11/2017 09:20:17
Showing 1 changed files
... ...
@@ -150,11 +150,6 @@ lower scoring but shorter match is what is measured.
150 150
 
151 151
 # Possible fzy Algorithm Improvements
152 152
 
153
-## Multithreading
154
-
155
-Currently a single thread is used for finding matches. Using multiple threads
156
-would likely be faster, but require some additional complexity.
157
-
158 153
 ## Case sensitivity
159 154
 
160 155
 fzy currently treats all searches as case-insensitive. However, scoring prefers
Browse code

Don't consider numbers word separators

This made sense on paper when deciding what was a "word". However in
reality this is rarely an indication of a separate word. I've found that
this caused hexadecimal or base64 strings to be favoured in matches.

John Hawthorn authored on 27/06/2016 06:10:25
Showing 1 changed files
... ...
@@ -69,7 +69,7 @@ is able to score based on the optimal match.
69 69
 * Matches (positive score)
70 70
   * consecutive
71 71
   * following a slash
72
-  * following a space, underscore, dash, or number (the start of a word)
72
+  * following a space, underscore, or dash (the start of a word)
73 73
   * capital letter (the start of a CamelCase word)
74 74
   * following a dot (often a file extension)
75 75
 
Browse code

List all conditions identifying match in docs

Henry Baughman authored on 04/07/2016 15:14:23 • GitHub committed on 04/07/2016 15:14:23
Showing 1 changed files
... ...
@@ -55,10 +55,12 @@ Inspired by the [Gotoh algorithm
55 55
 (pdf)](http://www.cs.unibo.it/~dilena/LabBII/Papers/AffineGaps.pdf), fzy
56 56
 computes a second `D` (for diagonal) matrix in parallel with the score matrix.
57 57
 The `D` matrix computes the best score which *ends* in a match. This allows
58
-both computation of the penlalty for starting a gap and the score for a
58
+both computation of the penalty for starting a gap and the score for a
59 59
 consecutive match.
60 60
 
61
-Using this algorithm fzy is able to score based on the optimal match.
61
+Using [this 
62
+algorithm](https://github.com/jhawthorn/fzy/blob/master/src/match.c#L58) fzy 
63
+is able to score based on the optimal match.
62 64
 
63 65
 * Gaps (negative score)
64 66
   * at the start of the match
... ...
@@ -67,8 +69,8 @@ Using this algorithm fzy is able to score based on the optimal match.
67 69
 * Matches (positive score)
68 70
   * consecutive
69 71
   * following a slash
70
-  * following a space (the start of a word)
71
-  * capital letter (the start of a CamlCase word)
72
+  * following a space, underscore, dash, or number (the start of a word)
73
+  * capital letter (the start of a CamelCase word)
72 74
   * following a dot (often a file extension)
73 75
 
74 76
 
Browse code

Add ALGORITHM.md

John Hawthorn authored on 19/12/2015 23:01:25 • John Hawthorn committed on 20/04/2016 00:20:50
Showing 1 changed files
1 1
new file mode 100644
... ...
@@ -0,0 +1,176 @@
1
+
2
+This document describes the scoring algorithm of fzy as well as the algorithm
3
+of other similar projects.
4
+
5
+# Matching vs Scoring
6
+
7
+I like to split the problem a fuzzy matchers into two subproblems: matching and scoring.
8
+
9
+Matching determines which results are eligible for the list.
10
+All the projects here consider this to be the same problem, matching the
11
+candidate strings against the search string with any number of gaps.
12
+
13
+Scoring determines the order in which the results are sorted.
14
+Since scoring is tasked with finding what the human user intended, there is no
15
+correct solution. As a result there are large variety in scoring strategies.
16
+
17
+# fzy's matching
18
+
19
+Generally, more time is taken in matching rather than scoring, so it is
20
+important that matching be as fast as possible. If this were case sensitive it
21
+would be a simple loop calling strchr, but since it needs to be case
22
+insensitive.
23
+
24
+# fzy's scoring
25
+
26
+fzy treats scoring as a modified [edit
27
+distance](https://en.wikipedia.org/wiki/Edit_distance) problem of calculating
28
+the
29
+[Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance).
30
+Edit distance is the measure of how different two strings are in terms of
31
+insertions, deletions, and substitutions. This is the same problems as [DNA
32
+sequence alignment](https://en.wikipedia.org/wiki/Sequence_alignment). Fuzzy
33
+matching is a simpler problem which only accepts insertions, not deletions or
34
+substitutions.
35
+
36
+fzy's scoring is a dynamic programming algorithm similar to
37
+[Wagner–Fischer](https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm)
38
+and
39
+[Needleman–Wunsch](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm).
40
+
41
+Dynamic programming requires the observation that the result is based on the
42
+result of subproblems.
43
+
44
+Fzy borrows heavily from concepts in bioinformatics to performs scoring.
45
+
46
+Fzy builds a `n`-by-`m` matrix, where `n` is the length of the search string
47
+and `m` the length of the candidate string. Each position `(i,j)` in the matrix
48
+stores the score for matching the first `i` characters of the search with the
49
+first `j` characters of the candidate.
50
+
51
+Fzy calculates an affine gap penalty, this means simply that we assign a
52
+constant penalty for having a gap and a linear penalty for the length of the
53
+gap.
54
+Inspired by the [Gotoh algorithm
55
+(pdf)](http://www.cs.unibo.it/~dilena/LabBII/Papers/AffineGaps.pdf), fzy
56
+computes a second `D` (for diagonal) matrix in parallel with the score matrix.
57
+The `D` matrix computes the best score which *ends* in a match. This allows
58
+both computation of the penlalty for starting a gap and the score for a
59
+consecutive match.
60
+
61
+Using this algorithm fzy is able to score based on the optimal match.
62
+
63
+* Gaps (negative score)
64
+  * at the start of the match
65
+  * at the end of the match
66
+  * within the match
67
+* Matches (positive score)
68
+  * consecutive
69
+  * following a slash
70
+  * following a space (the start of a word)
71
+  * capital letter (the start of a CamlCase word)
72
+  * following a dot (often a file extension)
73
+
74
+
75
+
76
+# Other fuzzy finders
77
+
78
+## TextMate
79
+
80
+TextMate deserves immense credit for popularizing fuzzy finding from inside
81
+text editors. It's influence can be found in the commant-t project, various
82
+other editors use command-t for file finding, and the 't' command in the github
83
+web interface.
84
+
85
+* https://github.com/textmate/textmate/blob/master/Frameworks/text/src/ranker.cc
86
+
87
+## command-t, ctrlp-cmatcher
88
+
89
+Command is a plugin first released in 2010 intending to bring TextMate's
90
+"Go to File" feature to vim.
91
+
92
+Anecdotally, this algorithm works very well. The recursive nature makes it a little hard to 
93
+
94
+The wy `last_idx` is suspicious.
95
+
96
+* https://github.com/wincent/command-t/blob/master/ruby/command-t/match.c
97
+* https://github.com/JazzCore/ctrlp-cmatcher/blob/master/autoload/fuzzycomt.c
98
+
99
+## Length of shortest first match: fzf
100
+https://github.com/junegunn/fzf/blob/master/src/algo/algo.go
101
+
102
+Fzy scores based on the size of the greedy shortest match. fzf finds it's match
103
+by the first match appearing in the candidate string. It has some cleverness to
104
+find if there is a shorter match contained in that search, but it isn't
105
+guaranteed to find the shortest match in the string.
106
+
107
+Example results for the search "abc"
108
+
109
+* <tt>**AXXBXXC**xxabc</tt>
110
+* <tt>xxxxxxx**AXBXC**</tt>
111
+* <tt>xxxxxxxxx**ABC**</tt>
112
+
113
+## Length of first match: ctrlp, pick, selecta (`<= 0.0.6`)
114
+
115
+These score based on the length of the first match in the candidate. This is
116
+probably the simplest useful algorithm. This has the advantage that the heavy
117
+lifting can be performed by the regex engine, which is faster than implementing
118
+anything natively in ruby or Vim script.
119
+
120
+## Length of shortest match: pick
121
+
122
+Pick has a method, `min_match`, to find the absolute shortest match in a string.
123
+This will find better results than the finders, at the expense of speed, as backtracking is required.
124
+
125
+## selecta (latest master)
126
+https://github.com/garybernhardt/selecta/commit/d874c99dd7f0f94225a95da06fc487b0fa5b9edc
127
+https://github.com/garybernhardt/selecta/issues/80
128
+
129
+Selecta doesn't compare all possible matches, but only the shortest match from the same start location.
130
+This can lead to inconsistent results.
131
+
132
+Example results for the search "abc"
133
+
134
+* <tt>x**AXXXXBC**</tt>
135
+* <tt>x**ABXC**x</tt>
136
+* <tt>x**ABXC**xbc</tt>
137
+
138
+The third result here shoud have been scored the same as the first, but the
139
+lower scoring but shorter match is what is measured.
140
+
141
+
142
+## others
143
+
144
+* https://github.com/joshaven/string_score/blob/master/coffee/string_score.coffee (first match + heuristics)
145
+* https://github.com/atom/fuzzaldrin/blob/master/src/scorer.coffee (modified version of string_score)
146
+* https://github.com/jeancroy/fuzzaldrin-plus/blob/master/src/scorer.coffee (Smith Waterman)
147
+
148
+
149
+# Possible fzy Algorithm Improvements
150
+
151
+## Multithreading
152
+
153
+Currently a single thread is used for finding matches. Using multiple threads
154
+would likely be faster, but require some additional complexity.
155
+
156
+## Case sensitivity
157
+
158
+fzy currently treats all searches as case-insensitive. However, scoring prefers
159
+matches on uppercase letters to help find CamelCase candidates. It might be
160
+desirable to support a case sensitive flag or "smart case" searching.
161
+
162
+## Faster matching
163
+
164
+Matching is currently performed using the standard lib's `strpbrk`, which has a
165
+very simple implementation (at least in glibc).
166
+
167
+Glibc has an extremely clever `strchr` implementation which searches the haystack
168
+string by [word](https://en.wikipedia.org/wiki/Word_(computer_architecture)), a
169
+4 or 8 byte `long int`, instead of by byte. It tests if a word is likely to
170
+contain either the search char or the null terminator using bit twiddling.
171
+
172
+A similar method could probably be written to perform to find a character in a
173
+string case-insensitively.
174
+
175
+* https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strchr.c;h=f73891d439dcd8a08954fad4d4615acac4e0eb85;hb=HEAD
176
+