You have 2 strings s and w, of length n and m respectively, and you want to know if string w is a substring of s.
For this task you develop a cool and intuitive algorithm with time efficiency O(nm) that just shifts the word w and compare to the substring of s in that position, this runs incredibly fast in your test cases, but when the strings get bigger your code starts being too slow and ineffective.
KMP is an algorithm that checks if w is a substring of s in O(n + m).
The KMP algorithm has basically two parts:
- Prefix function
- String Matching
"KMP’s algorithm is: whenever we detect a mismatch (after some matches), we already know some of the characters in the text (since they matched the pattern characters prior to the mismatch). We take advantage of this information to avoid matching the characters that we know will anyway match."
So, in the Prefix function what we want to do is define an array f that stores the length of the maximum matching proper prefix which is also a suffix of the sub-pattern w[0..i].
Example:
w = “AAACAAAAAC”, f[] is [0, 1, 2, 0, 1, 2, 3, 3, 3, 4]. Because:
- "A", is the first letter so f[0] = 0 (this is always true)
- "AA", the maximum proper prefix which is also a suffix is "A" that has length 1, so f[1] = 1
- "AAA", the maximum proper prefix which is also a suffix is "AA" that has length 2, so f[2] = 2
- "AAAC", the maximum proper prefix which is also a suffix is "" that has length 0, so f[3] = 0
- "AAACA", the maximum proper prefix which is also a suffix is "A" that has length 1, so f[4] = 1
- "AAACAA", the maximum proper prefix which is also a suffix is "AA" that has length 2, so f[5] = 2
- "AAACAAA", the maximum proper prefix which is also a suffix is "AAA" that has length 2, so f[6] = 3
- "AAACAAAA", the maximum proper prefix which is also a suffix is "AAA" that has length 2, so f[7] = 3
- "AAACAAAA", the maximum proper prefix which is also a suffix is "AAA" that has length 2, so f[8] = 3
- "AAACAAAAC", the maximum proper prefix which is also a suffix is "AAAC" that has length 2, so f[9] = 4
Then, the String Matching will consist of comparing s with w if a match happens (s[i] == w[j]) then continue comparing, if j == m it's a real match! if s[i] != w[j], then j = f[j], this means that if the current letters are not a match we don't need to start from the beginning of the string again we can instead start from f[j] (think a little about this).
Try to implement it! Doubts? see the code.cpp file to see the implementantion in C++ with comments 😄!