第09章文本处理TextProcessing.ppt
《第09章文本处理TextProcessing.ppt》由会员分享,可在线阅读,更多相关《第09章文本处理TextProcessing.ppt(28页珍藏版)》请在三一文库上搜索。
1、4/6/2019 1:42 PM,Text Processing,1,Chapter 9: Text Processing,4/6/2019 1:42 PM,Text Processing,2,Outline and Reading,Strings and Pattern Matching (9.1) Tries (9.2) Text Compression (9.3) Optional: Text Similarity (9.4). No Slides.,4/6/2019 1:42 PM,Text Processing,3,Texts & Pattern Matching,4/6/2019
2、1:42 PM,Text Processing,4,Strings,A string is a sequence of characters Examples of strings: Java program HTML document DNA sequence Digitized image An alphabet S is the set of possible characters for a family of strings Example of alphabets: ASCII Unicode 0, 1 A, C, G, T,Let P be a string of size m
3、A substring Pi j of P is the subsequence of P consisting of the characters with ranks between i and j A prefix of P is a substring of the type P0 i A suffix of P is a substring of the type Pi m - 1 Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring o
4、f T equal to P Applications: Text editors Search engines Biological research,4/6/2019 1:42 PM,Text Processing,5,Brute-Force Algorithm,The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either a match is found, or all pl
5、acements of the pattern have been tried Brute-force pattern matching runs in time O(nm) Example of worst case: T = aaa ah P = aaah may occur in images and DNA sequences unlikely in English text,Algorithm BruteForceMatch(T, P) Input text T of size n and pattern P of size m Output starting index of a
6、substring of T equal to P or -1 if no such substring exists for i 0 to n - m test shift i of the pattern j 0 while j m Ti + j = Pj j j + 1 if j = m return i match at i else break while loop mismatch return -1 no match anywhere,4/6/2019 1:42 PM,Text Processing,6,Boyer-Moore Heuristics,The Boyer-Moore
7、s pattern matching algorithm is based on two heuristics Looking-glass heuristic: Compare P with a subsequence of T moving backwards Character-jump heuristic: When a mismatch occurs at Ti = c If P contains c, shift P to align the last occurrence of c in P with Ti Else, shift P to align P0 with Ti + 1
8、 Example,4/6/2019 1:42 PM,Text Processing,7,The Boyer-Moore Algorithm,Algorithm BoyerMooreMatch(T, P, S) L lastOccurenceFunction(P, S ) i m - 1 j m - 1 repeat if Ti = Pj if j = 0 return i match at i else i i - 1 j j - 1 else character-jump l LTi i i + m min(j, 1 + l) j m - 1 until i n - 1 return -1
9、no match ,4/6/2019 1:42 PM,Text Processing,8,Example,4/6/2019 1:42 PM,Text Processing,9,Analysis,Boyer-Moores algorithm runs in time O(nm + s) Example of worst case: T = aaa a P = baaa The worst case may occur in images and DNA sequences but is unlikely in English text Boyer-Moores algorithm is sign
10、ificantly faster than the brute-force algorithm on English text,4/6/2019 1:42 PM,Text Processing,10,The KMP Algorithm - Motivation,Knuth-Morris-Pratts algorithm compares the pattern to the text in left-to-right, but shifts the pattern more intelligently than the brute-force algorithm. When a mismatc
11、h occurs, what is the most we can shift the pattern so as to avoid redundant comparisons? Answer: the largest prefix of P0j that is a suffix of P1j,x,j,.,.,a,b,a,a,b,.,.,.,.,.,a,b,a,a,b,a,a,b,a,a,b,a,No need to repeat these comparisons,Resume comparing here,4/6/2019 1:42 PM,Text Processing,11,KMP Fa
12、ilure Function,Knuth-Morris-Pratts algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself The failure function F(j) is defined as the size of the largest prefix of P0j that is also a suffix of P1j Knuth-Morris-Pratts algorithm modifies the brute-force a
13、lgorithm so that if a mismatch occurs at Pj Ti we set j F(j - 1),4/6/2019 1:42 PM,Text Processing,12,The KMP Algorithm,The failure function can be represented by an array and can be computed in O(m) time At each iteration of the while-loop, either i increases by one, or the shift amount i - j increa
14、ses by at least one (observe that F(j - 1) j) Hence, there are no more than 2n iterations of the while-loop Thus, KMPs algorithm runs in optimal time O(m + n),Algorithm KMPMatch(T, P) F failureFunction(P) i 0 j 0 while i 0 j Fj - 1 else i i + 1 return -1 no match ,4/6/2019 1:42 PM,Text Processing,13
- 配套讲稿:
如PPT文件的首页显示word图标,表示该PPT已包含配套word讲稿。双击word图标可打开word文档。
- 特殊限制:
部分文档作品中含有的国旗、国徽等图片,仅作为作品整体效果示例展示,禁止商用。设计者仅对作品中独创性部分享有著作权。
- 关 键 词:
- 09 文本 处理 TextProcessing
链接地址:https://www.31doc.com/p-2546863.html