A Parallel Decoder for Lossless Image Compression by Block Matching

Luigi Cinque and Sergio De Agostino
Comp. Sci. Dept., La Sapienza U., Via Salaria 113, 00198 Roma, Italy

email: [email protected]; phone: +390649918529; fax: +39068841964

Abstract. A work-optimal O(log n log M) time PRAM-EREW algorithm for lossless image compression by block matching was shown in [1], where n is the size of the image and M is the maximum size of the match. The design of a parallel decoder was left as an open problem. By slightly modifying the parallel encoder, in this paper we show how to implement the decoder in O(log n log M) time with O(n/log n) processors on the PRAM-EREW. With the realistic assumption that the size of the compressed image is O(n^{1/2}), the parallel decoder requires O(log^2 n) time and O(n/log n) processors on the mesh of trees.

1. Introduction

Storer suggested that fast encoders are possible for two-dimensional lossless compression by showing a square greedy matching heuristic for bi-level images, which can be implemented by a simple hashing scheme [5]. The technique is a two-dimensional extension of LZ1 compression [4]. Rectangle matching improves the compression performance, but it is slower since it requires O(M log M) time for a single match, where M is the size of the match [6]. Therefore, the sequential time to compress an image of size n by rectangle matching is Ω(n log M).

Among the different ways of reading an image, we assume the rectangle matching compression heuristic scans an m x m′ image row by row (raster scan). A 64K table with one position for each possible 4x4 subarray is the only data structure used. All-zero and all-one rectangles are handled differently. The encoding scheme is to precede each item with a flag field indicating whether there is a monochromatic rectangle, a match or raw data. When there is a match, the 4x4 subarray in the current position is hashed to yield a pointer to a copy. This pointer is used for the current rectangle greedy match and then replaced in the hash table by a pointer to the current position. As mentioned above, the procedure for computing the largest rectangle match with left upper corners in positions (i, j) and (k, h) takes O(M log M) time, where M is the size of the match. Obviously, this procedure can be used for computing the largest monochromatic rectangle in a given position (i, j) as well. If the 4x4 subarray in position (i, j) is monochromatic, then we compute the largest monochromatic rectangle in that position. Otherwise, we compute the largest rectangle match in the position provided by the hash table and update the table with the current position. If the subarray is not hashed to a pointer, then it is left uncompressed and added to the hash table with its current position. The positions covered by matches are skipped in the linear scan of the image. Therefore, the sequential time to compress an image of size n by rectangle matching is Ω(n log M). We want to point out that, besides the proper matches, we call a match every rectangle of the parsing of the image produced by the heuristic.
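
To make the control flow of the heuristic concrete, the following minimal Python sketch renders the raster scan just described. It is an illustration, not the authors' implementation: the naive growing procedures stand in for the O(M log M) largest-match computation of [6], raw data is reduced to a single pixel for brevity, and all names (rect_is, grow_mono, grow_copy, compress) are hypothetical.

def rect_is(I, i, j, w, l, value):
    """True if every pixel of the l x w rectangle at (i, j) equals value."""
    return all(I[r][h] == value
               for r in range(i, i + l) for h in range(j, j + w))

def grow_mono(I, i, j):
    """Naive stand-in for the largest monochromatic rectangle at (i, j)."""
    m, mp, c = len(I), len(I[0]), I[i][j]
    w = 1
    while j + w < mp and I[i][j + w] == c:
        w += 1
    l = 1
    while i + l < m and rect_is(I, i + l, j, w, 1, c):
        l += 1
    return w, l

def grow_copy(I, i, j, k, h):
    """Naive stand-in for the largest rectangle at (i, j) matching the copy at (k, h)."""
    m, mp = len(I), len(I[0])
    w = 1
    while j + w < mp and h + w < mp and I[i][j + w] == I[k][h + w]:
        w += 1
    l = 1
    while (i + l < m and k + l < m and
           all(I[i + l][j + d] == I[k + l][h + d] for d in range(w))):
        l += 1
    return w, l

def compress(I):
    m, mp = len(I), len(I[0])
    table, out = {}, []                      # 4x4 subarray -> position of a copy
    covered = [[False] * mp for _ in range(m)]
    for i in range(m):
        for j in range(mp):
            if covered[i][j]:
                continue                     # positions covered by matches are skipped
            key = (tuple(tuple(I[r][j:j + 4]) for r in range(i, i + 4))
                   if i + 4 <= m and j + 4 <= mp else None)
            if key and rect_is(I, i, j, 4, 4, I[i][j]):
                w, l = grow_mono(I, i, j)        # monochromatic rectangle
                out.append(('mono', I[i][j], w, l))
            elif key and key in table:
                k, h = table[key]                # pointer to a copy
                w, l = grow_copy(I, i, j, k, h)  # rectangle greedy match
                out.append(('match', k, h, w, l))
                table[key] = (i, j)              # update with the current position
            else:
                w, l = 1, 1                      # raw data (one pixel, for brevity)
                out.append(('raw', I[i][j]))
                if key:
                    table[key] = (i, j)
            for r in range(i, i + l):
                for c in range(j, j + w):
                    covered[r][c] = True
    return out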

In [1], we showed a work-optimal parallel algorithm using the rectangle greedy matching technique requiring O(log M log n) time on the PRAM EREW model. A work-optimal implementation of this algorithm, still in O(log n log M) time, was shown in [2] on a mesh of trees, which is a hybrid network architecture based on arrays and trees [3]. The design of a parallel decoder was left as an open problem in [1]. In this paper, we slightly modify the encoder in order to obtain a parallel decoder. In contrast with the encoding case, the sequential decoder requires only O(n) time, while the parallel decoder needs O(log n log M) time and O(n/log n) processors on the PRAM-EREW. With the realistic assumption that the size of the compressed image is O(n^{1/2}), the parallel decoder requires O(log^2 n) time and O(n/log n) processors on the mesh of trees.

In Section 2, we explain the parallel compression algorithm and provide the parallel decoder. The implementations on the mesh of trees are given in Section 3. Conclusions and future work are given in Section 4.

2. The PRAM EREW Encoder and Decoder

To achieve sublinear time, we partition an m x m′ image I in x x y rectangular areas, where x and y are Θ(log^{1/2} n) and n is the size of the image. In parallel for each area, one processor applies the sequential parsing algorithm so that in O(log n log M) time each area will be parsed in rectangles, some of which are monochromatic. We do not allow overlapping of the monochromatic rectangles when we apply the sequential algorithm to each area. Before encoding we wish to compute larger monochromatic rectangles.
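
A minimal sketch of this partition step follows, with a thread pool playing the role of the PRAM processors and a hypothetical parse_area procedure standing in for the sequential parsing algorithm; the side length Θ(log^{1/2} n) is taken from the text.

import math
from concurrent.futures import ThreadPoolExecutor

def partition_and_parse(I, parse_area):
    m, mp = len(I), len(I[0])
    n = m * mp
    x = y = max(1, math.isqrt(int(math.log2(n)) or 1))  # sides Theta(log^{1/2} n)
    areas = [(i, j, min(y, mp - j), min(x, m - i))      # (row, col, width, length)
             for i in range(0, m, x) for j in range(0, mp, y)]
    with ThreadPoolExecutor() as pool:                  # one logical processor per area
        return list(pool.map(lambda a: parse_area(I, *a), areas))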

2.1. Computing the Monochromatic Rectangles

In the implementation of the algorithm, we use four m x m′ matrices RC, CC, W and L which are determined by the parsing procedure on each area. RC[i, j] and CC[i, j] are equal to zero if I[i, j] is not covered by a monochromatic rectangle; otherwise, they are equal to the row and column coordinate of the left upper corner of the monochromatic rectangle. W[i, j] and L[i, j] are equal to zero if I[i, j] is not covered by a monochromatic rectangle; otherwise, they are equal to the width and length of the monochromatic rectangle. We also use four matrices TRC, TCC, TW and TL to store temporary values needed for the computation of the final parsing, which are initially set to RC, CC, W and L. At each step, we merge monochromatic rectangles adjacent on the horizontal boundaries and then on the vertical boundaries. If we index the areas, it is always a rectangle of an odd-indexed area, with respect either to the vertical or the horizontal order, which tries to merge with an adjacent rectangle in the next area. Generally, this merging operation causes the rectangles to split into two or three subrectangles. A rectangle in the odd-indexed area is split in at most two subrectangles, since we want to preserve the upper left corner. The merging is realized by updating the temporary matrices. If we obtain a larger rectangle, then we update the original matrices; otherwise, we continue merging by working with the temporary values to see if we can get a larger rectangle later.
1  in parallel for
2    i and j: i = ax, a odd, j = TCC[i, j] (≠ 0), TRC[i+1, j] ≠ 0 and I[i, j] = I[i+1, j] {
3    in parallel for
4      TRC[i, j] ≤ k ≤ i, j ≤ h ≤ min{TCC[i+1, j] + TW[i+1, j] - 1, j + TW[i, j] - 1} {
5      MIN = min{TCC[i+1, j] + TW[i+1, j] - 1, j + TW[i, j] - 1} - j + 1;
6      TW[k, h] = MIN; TL[k, h] = TL[k, h] + TL[i+1, j]; }
7    in parallel for
8      i+1 ≤ k ≤ i + TL[i+1, j] and
9      j ≤ h ≤ j + MIN - 1
10     { TCC[k, h] = j; TRC[k, h] = TRC[i, j]; TW[k, h] = TW[i, j]; TL[k, h] = TL[i, j]; } }
11 in parallel for
12   i and j: TCC[i, j-1] + TW[i, j-1] = j, TCC[i, j] ≠ j, TCC[i, j] ≠ 0
13   or TCC[i, j] = j {
14   compute the smallest k with k ≥ j: TCC[i, k] + TW[i, k] > k + 1, TCC[i, k+1] = k + 1
15     or TCC[i, k] + TW[i, k] = k + 1
16   in parallel for j ≤ h ≤ k { TCC[i, h] = j; TW[i, h] = k - j + 1; } }
17 in parallel for i, j: TRC[i, j] = i and TCC[i, j] = j
18   if TW[i, j] · TL[i, j] > W[i, j] · L[i, j]
19     in parallel for i ≤ k < i + TL[i, j], j ≤ h < j + TW[i, j] {
20       RC[k, h] = TRC[k, h]; CC[k, h] = TCC[k, h]; W[k, h] = TW[k, h]; L[k, h] = TL[k, h]; }

Figure 1: Merging monochromatic rectangles on the horizontal boundaries.

We describe the procedure of merging monochromatic rectangles on the horizontal boundaries in detail in Figure 1. At line 2, we consider in parallel the left lower corners of monochromatic rectangles of the areas in odd positions which are adjacent to a monochromatic rectangle with the same color in the next area below. Then, at lines 3, 4, 5, 6 we change in parallel the width and the length of such rectangles, where the length is the sum of the lengths of the two adjacent rectangles and the width changes if the right corners of the rectangle in the next area below are to the left of the right corners of the rectangle above. At lines 7, 8, and 9 the values in the temporary matrices are changed also for the pixels of the rectangles in the next area below, since they merged. As mentioned above, the merging causes a splitting of rectangles into subrectangles, and the values in the temporary matrices must be redefined also for the pixels covered by the other rectangles produced by the merging. This is done from line 11 to 16. In parallel, we consider all the pixels for which, according to the temporary values, either they are not on the leftmost column of a rectangle and the adjacent pixels in front of them turn out to be on the rightmost column of a rectangle (line 12), or they are on the leftmost column (line 13). For each of them, we compute the closest pixel to the right for which, according to the temporary values, either it is not on the rightmost column of a rectangle and the next pixel turns out to be on the leftmost column of a rectangle (line 14), or it is on the rightmost column of a rectangle (line 15). Observe that the property at line 14 is verified if and only if the property at line 13 was verified by the other pixel, while the property at line 15 is verified if and only if the other pixel verifies the property at line 12. This is because the pixel is the closest to the one computed at lines 12 or 13 which satisfies the conditions defined at lines 14 or 15. Therefore, these two pixels must be on the rightmost and leftmost column of the same monochromatic rectangle. Such monochromatic rectangles are redefined in the temporary matrices at line 16. At this point, for each left upper corner of a monochromatic rectangle (line 17), if we obtained a larger rectangle after the merging (line 18) we can overwrite the information on the new rectangle on the original matrices (lines 19-20). A similar procedure can be designed to merge rectangles on the vertical boundaries. The algorithm alternates a vertical merge with a horizontal one, doubling the sides of the areas. We want to point out that, after doubling the areas, the rectangles which try to merge are still the ones with the upper left corner in one of the original odd-indexed areas of the first step of the main loop. This loop takes O(log M) time, where M is the maximum size of a monochromatic rectangle. It is easy to see that all the operations for merging vertically or horizontally can be executed in O(log n) time with O(n/log n) processors (lines 14-15 by parallel prefix) on the PRAM EREW. Therefore, the matrices are computed with optimal parallel work in O(log n log M) time.
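
To illustrate the prefix computation invoked for lines 14-15, the following sketch computes, for every column of a row, the nearest position k ≥ j where a rectangle boundary falls. Sequentially it is a right-to-left scan; on the PRAM the same result is obtained by a prefix (here, suffix) computation with min as the operator. The boolean input is_boundary, assumed precomputed from the TCC and TW conditions of lines 14-15, is illustrative.

def next_boundary(is_boundary):
    """next_boundary[j] = smallest k >= j with is_boundary[k], else None."""
    res, nearest = [None] * len(is_boundary), None
    for k in range(len(is_boundary) - 1, -1, -1):   # a suffix-min on the PRAM
        if is_boundary[k]:
            nearest = k
        res[k] = nearest
    return res

# e.g. next_boundary([False, True, False, False, True]) == [1, 1, 4, 4, 4]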

2.2. The Parallel Encoder

We derive the sequence of pointers from the image parsing computed above. In O(log n) time with O(n/log n) processors we can identify every upper left corner of a match (proper, monochromatic or raw) by assigning a different segment of logarithmic length on a row to each processor. Each processor detects the upper left corners on its segment. Then, by parallel prefix we obtain a sequence of pointers decodable by the decompressor paired with the sequential heuristic. However, the decoding of such a sequence seems hard to parallelize. In order to design a parallel decoder, it is more suitable to produce the sequence of pointers by a raster scan of each of the areas into which the image was originally partitioned, where the areas are ordered by a raster scan themselves. Then, again we can easily derive the sequence of pointers in O(log n) time with O(n/log n) processors by detecting in each of the areas the upper left corners of a match and producing the sequence of pointers by parallel prefix.
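
A minimal sketch of this compaction step follows, assuming per-position corner flags and pointer payloads are already available; the parallel prefix is rendered with a sequential running sum, and each flagged position writes its pointer to a distinct output slot, as one processor per logarithmic segment would.

from itertools import accumulate

def compact_pointers(corner_flags, pointers):
    """corner_flags[p] says a match starts at p; pointers[p] is its pointer."""
    offsets = list(accumulate(int(f) for f in corner_flags))  # parallel prefix
    out = [None] * (offsets[-1] if offsets else 0)
    for p, f in enumerate(corner_flags):      # done in parallel, one write each
        if f:
            out[offsets[p] - 1] = pointers[p]
    return out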

As mentioned in the introduction, the encoding scheme for the pointers uses a flag field indicating whether there is a monochromatic rectangle (0 for the white ones and 10 for the black ones), a proper match (110) or raw data (111). For the feasibility of the parallel decoder, we want to indicate the end of the encoding of matches with the upper left corner in a specific logarithmic area. Therefore, we change the encoding scheme by associating the flag field 1110 with the raw match, so that we can indicate with 1111 the end of the sequence of pointers corresponding to a given area. Moreover, since some areas could be entirely covered by a monochromatic match, 1111 is followed by the index associated with the next area by the raster scan. The pointer of a monochromatic match has fields for the width and the length, while the pointer of a proper match also has fields for the coordinates of the left upper corner of the copy in the window. In order to save bits, the value represented by any of these fields is the binary value stored plus 1 (so, the zero value is employed as well). Obviously, there are four different pointer sizes in the coding for the four different cases (white, black, raw or proper match). This coding technique is more redundant than others previously designed, but its compression effectiveness is still better than that of the square greedy matching technique (and the rectangle greedy matching technique is more suitable for a parallel implementation, anyway).
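
The following sketch renders this coding scheme under stated assumptions: the field widths wbits, lbits, cbits and kbits are illustrative parameters, raw data is a single pixel as in the earlier sketch, and width and length fields are stored decreased by one so that the zero codeword is employed.

def put(bits, value, width):
    """Append value as a fixed-width big-endian binary field."""
    bits.extend(int(b) for b in format(value, '0%db' % width))

def encode_pointer(bits, item, wbits, lbits, cbits):
    if item[0] == 'mono':                  # ('mono', color, w, l)
        _, color, w, l = item
        bits.extend([0] if color == 0 else [1, 0])   # 0 white, 10 black
        put(bits, w - 1, wbits)            # stored decreased by one
        put(bits, l - 1, lbits)
    elif item[0] == 'match':               # ('match', k, h, w, l): proper match
        _, k, h, w, l = item
        bits.extend([1, 1, 0])
        put(bits, k, cbits)                # copy coordinates (handling illustrative)
        put(bits, h, cbits)
        put(bits, w - 1, wbits)
        put(bits, l - 1, lbits)
    else:                                  # ('raw', pixel)
        bits.extend([1, 1, 1, 0, item[1]])

def end_of_area(bits, next_area_index, kbits):
    bits.extend([1, 1, 1, 1])              # end of the pointers of an area
    put(bits, next_area_index, kbits)      # index of the next area in raster order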

2.3. The Parallel Decoder

The parallel decoder has three phases. Observe that on each position of the binary sequence encoding the image we read a subsequence of bits that is either 1111 (recall that the k bits following 1111 provide the area index, where k is the number of bits used to encode it) or can be interpreted as a flag field of a pointer. Then, in the first phase we reduce the binary sequence to a doubly linked structure and apply the well-known Euler tour technique in order to identify for each area the corresponding pointers. The reduction works as follows: link each position p of the sequence to the position next to the end of the subsequence starting in position p that either can be interpreted as a pointer or is equal to 1111 followed by k bits. For those suffixes of the sequence which can be interpreted as pointers, their first positions are linked to a special node denoting the end of the coding. For those suffixes of the sequence which cannot be interpreted as pointers, their first positions are not linked to anything. The linked structure is a forest, with one tree rooted in the special node denoting the end of the coding and the other trees rooted in the first position of a suffix of the encoding sequence not interpretable as a pointer. The first position of the binary sequence is a leaf of the tree rooted in the special node. The sequence of pointers encoding the image is given by the path from the first position to the root. In order to compute such a path we need the children to be doubly linked to the parent. Then, we need to reserve space for each node to store the links to the children. Each node has at most five children, since there are only four different pointer sizes. So, for each position p of the binary sequence we set aside five locations [p, 1] · · · [p, 5], initially set to zero. When a link is added from position p′ to p, depending on whether the subsequence starting in position p′ is 1111 or can be interpreted as a pointer to a raw, white, black or proper match, the value p′ is written to location [p, 1], [p, 2], [p, 3], [p, 4] or [p, 5], respectively. Then, by means of the well-known Euler tour technique we can linearize the linked structure and apply list ranking to obtain the path from the first position of the sequence to the root of its tree. It is well-known that all this can be computed in O(log n) time with O(n/log n) processors on the PRAM EREW, since the raw image size is greater than the size of the sequence. Then, still in O(log n) time with O(n/log n) processors, we can identify the positions on the path corresponding to 1111.
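
A sketch of the link construction underlying this first phase is given below, in flat-array form, before the doubly linked structure and the Euler tour are built: from every position p it tries to read either 1111 plus a k-bit area index or a whole pointer, and links p to the position just past it. The helper pointer_len, returning the length of the pointer whose flag field starts at a given position (or None), is hypothetical, as is the END sentinel.

END = -1   # special node denoting the end of the coding

def build_links(B, k, pointer_len):
    n = len(B)
    link = [None] * n                  # None: no pointer starts here
    for p in range(n):                 # one processor per O(log n) block
        if B[p:p + 4] == [1, 1, 1, 1] and p + 4 + k <= n:
            link[p] = p + 4 + k        # end-of-area marker plus area index
        else:
            size = pointer_len(B, p)
            if size is not None and p + size <= n:
                link[p] = p + size
    for p in range(n):
        if link[p] == n:
            link[p] = END              # suffix completing the coding
    return link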

In the second phase of the parallel decoder, a different processor decodes the sequence of pointers corresponding to a different area. As for the pointers to monochromatic matches, each processor decompresses the portion of the match inside the corresponding area if it contains the left upper corner. Therefore, after the second phase an area might not be entirely decompressed (or not decompressed at all). In order to finish decompressing, the decoder also works with the matrices RC, CC, W and L to represent the monochromatic rectangles of the parsing. The matrices are initially null and, after this phase, only the components corresponding to the upper left decompressed portions of the monochromatic rectangles store nonnull values. Obviously, such a phase requires O(log n) time and O(n/log n) processors on the PRAM EREW.

The third phase completes the decoding. We denote with A_{i,j}, for 1 ≤ i ≤ ⌈m/x⌉ and 1 ≤ j ≤ ⌈m′/y⌉, the areas into which the image was partitioned. At the first step of the third phase, one processor for each area A_{2i-1,j} updates the components of the matrices RC, CC, W and L corresponding to A_{2i,j}, if there is any monochromatic match with upper left corner in A_{2i-1,j} which covers pixels of A_{2i,j}. The same is done horizontally for A_{i,2j-1} and A_{i,2j}. At the k-th step, one processor for each of the areas A_{(i-1)2^{k-1}+1,j}, A_{(i-1)2^{k-1}+2,j}, ..., A_{i2^{k-1},j}, with i odd, updates the components of the matrices RC, CC, W and L corresponding to A_{i2^{k-1}+1,j}, A_{i2^{k-1}+2,j}, ..., A_{(i+1)2^{k-1},j}, respectively. The same is done horizontally for A_{i,(j-1)2^{k-1}+1}, A_{i,(j-1)2^{k-1}+2}, ..., A_{i,j2^{k-1}}, with j odd, and A_{i,j2^{k-1}+1}, A_{i,j2^{k-1}+2}, ..., A_{i,(j+1)2^{k-1}}. After O(log M) steps the image is entirely decompressed. Each step takes O(log n) time with O(n/log n) processors, since there is one processor for each area of logarithmic size. Therefore, the decoder is realized in O(log n log M) time with O(n/log n) processors on the PRAM EREW.
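
The schedule of this phase can be sketched as follows, for the vertical direction only; update_area is a hypothetical stand-in for copying the RC, CC, W and L components from an area of an odd-indexed strip to the corresponding area of the next strip, and the symmetric horizontal pass is omitted.

def third_phase(rows, cols, steps, update_area):
    """rows x cols areas (0-based here); steps = O(log M)."""
    for t in range(1, steps + 1):
        half = 1 << (t - 1)                           # strip height at step t
        for i in range(1, rows // half + 1, 2):       # odd-indexed strips
            for r in range((i - 1) * half, i * half): # one processor per area
                for j in range(cols):
                    if r + half < rows:
                        update_area((r, j), (r + half, j))
        # the horizontal pass over column strips is symmetric (omitted)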

3. The Mesh of Trees Implementations

A mesh of trees is a network of 3N^2 - 2N processors, with N a power of two, consisting of an N x N grid where a complete binary tree of processors is formed on each row and each column, as shown in Figure 2. First, we wish to give a detailed description of the mesh of trees compression algorithm presented in [2]. Then, we provide the implementation of the parallel decoder.

Figure 2: The construction of a mesh of trees with a 4x4 processor grid.

Let max be equal to max{m, m′}. We assume m and m′ have the same order of magnitude, as is the case in practice for the height and the width of an image. Let N be the smallest power of two greater than ⌈max/log^{1/2} n⌉, where n is the size of an m x m′ image. Then, the number of processors of the mesh of trees is O(n/log n) and we can store the logarithmic rectangular areas, into which the parallel algorithm partitions the image, into the N x N grid. Starting from the upper left corner of the grid, a different processor stores a different rectangular area and applies the sequential compression heuristic to such area. The remaining processors are inactive. From now on, we will refer only to the active processors.
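
The sizing just described amounts to the following small computation; the function name is illustrative.

import math

def mesh_side(m, mp):
    n = m * mp
    bound = math.ceil(max(m, mp) / math.sqrt(math.log2(n)))
    N = 1
    while N <= bound:                 # smallest power of two greater than bound
        N *= 2
    return N                          # grid is N x N; 3N^2 - 2N processors in all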

3.1. Computing the Monochromatic Rectangles

After the compression heuristic has been executed on each area, each processor stores the values of the matrices RC, CC, W, L and TRC, TCC, TW, TL corresponding to the pixels of its rectangular area. To implement the first ten lines of the procedure in Figure 1, each processor storing an area in odd position (or a portion of it, after it doubles) detects on the last row the left lower corners of monochromatic rectangles which are adjacent to monochromatic rectangles with the same color in the next area below, and receives from the adjacent processor on the grid the length and the width of such rectangles. The width, length and left upper corner of each new monochromatic rectangle after the merging are sent along the trees to all processors storing pixels of it, and the temporary matrix values corresponding to its pixels are updated in logarithmic time. Then, for the pixels (i, j) for which the conditions at line 12 or 13 are verified, we need to compute by prefix computation the smallest k ≥ j on row i such that the conditions at line 14 or 15 are verified. Actually, it is enough that the prefix computation is performed only on one row of the newly formed rectangles. Now, each of these rectangles has either its first row equal to a segment of the first row of an even-indexed area or its last row equal to a segment of the last row of an odd-indexed area. Then, k is computed in logarithmic time for the pixels on the first row of an even-indexed area and for the pixels on the last row of an odd-indexed area, using the processors on the corresponding lines of the grid together with the processors which are internal nodes of the trees over these lines. Afterwards, the value k is sent in logarithmic time along the trees to all processors that need it to update the temporary matrices as in line 16. The last four lines are trivially executed on the grid. Therefore, it is easy to see that the PRAM EREW procedure to compute larger monochromatic rectangles can be implemented on a mesh of trees with the same number of processors without slowing it down.

3.2. The Parallel Encoder

The sequence of pointers for each area can be trivially produced on the grid. If we assume the possibility of a parallel output, the sequences can be put together by parallel prefix. This can be realized in O(log n) time on a mesh of trees with O(n/log n) processors.

3.3. The Parallel Decoder

Let B = b_1 · · · b_ℓ be the binary sequence encoding the image. In practice, experimental results show that ℓ is O(n^{1/2}) on realistic image data. Then, each root of a row tree of a mesh with O(n/log n) processors could store a block of length O(log^{1/2} n) of B. However, it is convenient for the decoder to guarantee that a root stores a block containing the whole coding of an area. So we assume that the first k roots, for some appropriate k, store a block of logarithmic length and the other roots remain inactive. Moreover, the last four bits of a block overlap the first four bits of the next one. From now on, we will refer only to the active roots. In logarithmic time the root of the i-th row tree sends its block down to the leaf on the i-th column of the grid. Then, in logarithmic time the block is sent to the root of the i-th column tree. So, we have two copies of B stored in the roots of the row and column trees, respectively. At this point, two copies of the linked structure described in subsection 2.3 must be realized and stored as well in the roots of the row and column trees. Each root links the positions up to the fifth-to-last one of its block. So, in O(log n) time two copies of such linked structure are stored in the roots of the row and column trees, respectively. As we know, such linked structure is a forest and we want to compute the path from the leaf representing the first position of B to the root representing the end of B. On the mesh of trees we can apply the pointer jumping technique, which is explained in Figure 3 for the PRAM CREW model. With the mesh of trees, each root of a column tree sends each link of a marked position of its block down to the leaf on the row corresponding to the block which covers the position equal to the link value. Then, if such position is not marked, it will be marked in logarithmic time on the copy stored in the roots of the row trees and, consequently, on the copy stored in the roots of the column trees. Afterwards, each root of a row tree sends the links of its block down to each leaf. So each node of the k-th row of the grid stores the links of the positions in the k-th block. Then, each root of a column tree receives, for each link it stores, the new link to replace it with from the row corresponding to the block which covers the position equal to the old link value. The new linked structure obtained in the roots of the column trees is finally sent to the roots of the row trees. All this is computed in logarithmic time. Then, completing the pointer jumping procedure takes O(log^2 n) time. As we know, the positions which are marked in this linked structure denote the beginning of a new pointer or the sequence 1111 indicating the end of the encoding of an area, followed by the k bits indicating the area index. Then, the root of the i-th row tree detects the rightmost sequence 1111 of its block which is on the path.

in parallel for each node i of the forest
    set link(i) equal to the parent of i
mark leaf ℓ
until the root is marked
    in parallel for each node i such that link(i) is not marked
        if i is marked then mark link(i)
        link(i) = link(link(i))

Figure 3: How to compute the path from a leaf to the root in parallel.

Then, it sends to the root of the (i+1)-th column tree the block suffix starting with such sequence. In this way, each root of a column tree stores the encodings of one or more areas, and each encoding starts with 1111 followed by the area index, which provides the row and column coordinates of the area on the grid of the mesh of trees. Then, each root of a column tree sends the encodings of its areas down to the corresponding row coordinates. Afterwards, the encodings move to the corresponding column coordinates along the row trees. It can be seen that this allocation of the area encodings on the grid, in the positions corresponding to the locations of the areas in the image, takes logarithmic time. At this point, each processor on the grid completes the second phase of the decoder described in subsection 2.3. Then, it is easy to see that the third and last phase of the PRAM decoder is implementable on a mesh of trees with the same number of processors and no slowdown. In conclusion, the decoder takes O(log^2 n) time on a mesh of trees with O(n/log n) processors.

4. Conclusions

The design of a parallel decoder for lossless compression of bi-level images by block matching was left as an open problem in [1]. Here, we provide a parallel decoder by changing the way the pointers are ordered. The PRAM decoder is not work-optimal. The mesh of trees implementation has the same performance as the PRAM one if the size of the largest match has the same order of magnitude as the size of the image, under the realistic assumption that the order of magnitude of the compressed image size is the square root of the original one. On the other hand, a mesh of trees implementation of the work-optimal PRAM EREW compression algorithm was presented in [2] with the same number of processors and without slowing it down. Such implementation has two phases: a first phase where the sequential algorithm is applied in parallel on different small parts of the image, and a second phase where monochromatic rectangles parsed by the algorithm expand by merging with portions of adjacent monochromatic rectangles. The PRAM and the mesh of trees work identically in the first phase. The second phase has two subphases: a first subphase where monochromatic rectangles expand by merging with portions of adjacent monochromatic rectangles, and a second subphase where the information about the portions which have not merged is updated. In the second phase, the PRAM and the mesh of trees compute exactly the same rectangles. While each processor of the mesh of trees must keep working with its part of the image, the PRAM is more flexible, and each processor can be reassigned a segment of logarithmic size over a row or over a column, depending on whether a vertical or horizontal merging of monochromatic rectangles is computed. The transmission of data among processors in the network does not add a logarithmic slowdown to the first subphase, since the operations of the main loop require constant time and the logarithmic time required by the PRAM depends only on the fact that we want to use optimal parallel work. In the second subphase, a prefix computation is executed in the main loop. Prefix computation takes logarithmic time. Since each processor of the PRAM can have assigned a segment of logarithmic size over a row (or column), it is natural that the PRAM algorithm performs it for every row for which the conditions at line 12 or 13 of the procedure of Figure 1 are verified (and similarly for every column). Since each processor of the mesh of trees must keep working on the area it compressed in the first phase, it is a key point that such prefix computation can be performed only on the first or last row of a newly formed rectangle, in order to keep the running time of the mesh of trees implementation O(log n log M) and the parallel work optimal.

As future work, it would be interesting to study implementations of the PRAM compression algorithm on even simpler architectures than meshes of trees which still have small diameter and large bisection width, such as pyramids and multigrids. A work-optimal O(log n log M) time realization on these architectures seems more challenging, and some practical assumptions to make it possible might be needed.

References

[1] Cinque L., De Agostino S., Liberati F., A Work-Optimal Parallel Implementation of Lossless Image Compression by Block Matching, Nordic Journal of Computing 10 (2003) 13–20.

[2] De Agostino S., Lossless Image Compression by Block Matching on a Mesh of Trees, Proceedings IEEE Data Compression Conference (2006) 443.

[3] Leighton F. T., Introduction to Parallel Algorithms and Architectures, Morgan Kaufmann (1992).

[4] Lempel A. and Ziv J., A Universal Algorithm for Sequential Data Compression, IEEE Transactions on Information Theory 23 (1977) 337–343.

[5] Storer J. A., Lossless Image Compression using Generalized LZ1-Type Methods, Proceedings IEEE Data Compression Conference (1996) 290–299.

[6] Storer J. A. and Helfgott H., Lossless Image Compression by Block Matching, The Computer Journal 40 (1997) 137–145.
