Index Compression - Tennessee Tech University索引压缩-田纳西技术大学.ppt
《Index Compression - Tennessee Tech University索引压缩-田纳西技术大学.ppt》由会员分享,可在线阅读,更多相关《Index Compression - Tennessee Tech University索引压缩-田纳西技术大学.ppt(52页珍藏版)》请在三一文库上搜索。
1、Index Compression,Ferrol Aderholdt,Motivation,Uncompressed indexes are large It might be useful for some modern devices to support information retrieval techniques that would not be able to do with uncompressed indexes,Motivation (cont.),Disk I/O is slow,Types of Compression,Lossy Compression that i
2、nvolves the removal of data. Loseless Compression that involves no removal of data.,Overview,A lossy compression scheme Static Index Pruning Loseless compression Elias Codes n-s encoding Golomb encoding Variable Byte Encoding (vByte) Fixed Binary Codewords CPSS-Tree,Static Index Pruning,Goal is to r
3、educe the size of the index without reducing the precision such that a human cant tell the difference between a pruned index and non-pruned index Focuses on the top k or top results Assumes there is a scoring function Assumes the function is based off of some table A such that A(t,d) 0 if t is withi
4、n d and A(t,d) = 0 otherwise,Static Index Pruning (cont.),Two approaches Defined as Uniform pruning. The removal of “all posting entries whose corresponding table values are bounded above by some fixed cutoff threshold” Could have a terms entire posting list pruned Defined as Term based pruning An a
5、pproach that attempts to guarantee that every term will have at least some entries remaining in the index,Static Index Pruning (cont.),Scoring functions are fuzzy Only need to find some scoring function S such that S is within a factor of epsilon of S Carmel et al proved this mathematically for both
6、 uniform and term-based methods,Static Index Pruning (cont.),Static Index Pruning results,Found that the idealized top k pruning algorithm did not work very well The smallest value in the posting list was almost always above their threshold so little pruning was done Modified the algorithm to apply
7、a shift Subtracted the smallest value from all positive scores with the list Greatly increased the pruning,Static Index Pruning results (cont.),Static Index Pruning results (cont.),Static Index Pruning results (cont.),Overview,Loseless Compression,Elias Codes,Non-parameterized bitwise method of codi
8、ng integers Gamma Codes Represent a positive integer k with stored as a unary code. This is followed by the binary representation of the number without the most significant bit Not efficient for numbers larger than 15,Elias Codes (cont.),Delta Codes Represent a positive integer k with stored as a ga
9、mma code. This is followed by the binary representation of the number without the most significant bit Not efficient for small values,n-s coding,Parameterized, bitwise encoding Uses a block of n bits followed by s stop bits. Also contains a parameter b which refers to the base of the number. Meaning
10、, the numbers represented in the blocks of n size cannot be greater than or equal to b.,n-s coding example,Let n=3, s=2, and the base be 6. Valid data blocks are 000, 001, 010, 011, 100, and 101. 101 100 001 11 would have the value of 5416,n-s coding (cont.),2 used n-s codes with prefix omission and
11、 run-length encoding Ex.,n-s coding (cont.),Run-length encoding is the process of replacing non-initial elements of a sequence with differences between adjacent elements. E.g.,n-s coding results,Golomb coding,Better compression and faster retrieval than Elias codes Is parameterized This is usually s
12、tored separate using some other compression scheme,vByte coding,A very simple bytewise compression scheme Uses 7 bits to code the data portion and the most significant bit is reserved as a flag bit.,Scholer et. al.,Defined an inverted list to be the following: Where the list is Example inverted list
13、 for term “Matthew”: Uses different coding schemes per part E.g. Golomb for freq, Gamma for doc, and vByte for offset,Scholer et al. (cont.),One optimization is to require encoding to be byte aligned so that decompression can be faster Another optimization when referring to Boolean or ranked queries
14、 is to ignore the offsets and only take into account flag bits within the offset. Referred to as scanning,Scholer et al. (cont.),Third optimization is called signature blocks. An eight bit block that stores the flag bits of up to eight blocks that follow. For example: 11100101 Represents 5 integers
- 配套讲稿:
如PPT文件的首页显示word图标,表示该PPT已包含配套word讲稿。双击word图标可打开word文档。
- 特殊限制:
部分文档作品中含有的国旗、国徽等图片,仅作为作品整体效果示例展示,禁止商用。设计者仅对作品中独创性部分享有著作权。
- 关 键 词:
- Index Compression Tennessee Tech University索引压缩-田纳西技术大学 University 索引 压缩 田纳西 技术 大学
链接地址:https://www.31doc.com/p-2433211.html