Image Deduplication

图像去重算法的学习整理。

图像去重和相似性查找可以当做一件事情，当找到相似图像之后执行删除操作就是去重，找到相似图像之后返回给前端，那么就是相似性查找。为了方便起见，下文都是写成图像去重。

图像去重算法分成两大类别：基于 hash 的算法去重；基于 CNN 的算法去重。

If the hashes are different, then the data is different. And if the hashes are the same, then the data is likely the same. (Since there is a possibility of a hash collision, having the same hash values does not guarantee the same data.) In contrast, perceptual hashes can be compared – giving you a sense of similarity between the two data sets.

hash 冲突是可能发生的

基于 hash 的算法有三种，下面逐个展开一下。

(1) Average Hash algorithm 的流程

Reduce size. The fastest way to remove high frequencies and detail is to shrink the image. In this case, shrink it to 8x8 so that there are 64 total pixels. Don’t bother keeping the aspect ratio, just crush it down to fit an 8x8 square. This way, the hash will match any variation of the image, regardless of scale or aspect ratio.

（这里resize 是直接 crush，不会考虑原来的 ratio，所以图像的大小得到的 AHash 是基本不变的）

Reduce color. The tiny 8x8 picture is converted to a grayscale. This changes the hash from 64 pixels (64 red, 64 green, and 64 blue) to 64 total colors.

Average the colors. Compute the mean value of the 64 colors.

Compute the bits. This is the fun part. Each bit is simply set based on whether the color value is above or below the mean.

Construct the hash. Set the 64 bits into a 64-bit integer. The order does not matter, just as long as you are consistent. (I set the bits from left to right, top to bottom using big-endian.)

特点

The resulting hash won’t change if the image is scaled or the aspect ratio changes. Increasing or decreasing the brightness or contrast, or even altering the colors won’t dramatically change the hash value.

Ahash 的特点，如果改变了颜色，是识别不出来的。

And best of all: this is FAST!

(2) phash 的算法流程

Reduce size. Like Average Hash, pHash starts with a small image. However, the image is larger than 8x8; 32x32 is a good size. This is really done to simplify the DCT computation and not because it is needed to reduce the high frequencies. （相比于 ahash，这个尺寸稍微大一些）

Reduce color. The image is reduced to a grayscale just to further simplify the number of computations.

Compute the DCT. The DCT separates the image into a collection of frequencies and scalars. While JPEG uses an 8x8 DCT, this algorithm uses a 32x32 DCT. （DCT 是频率和相应的向量）

Reduce the DCT. While the DCT is 32x32, just keep the top-left 8x8. Those represent the lowest frequencies in the picture. （注意这里取得的是左上角的子图）

Further reduce the DCT. This is the magic step. Set the 64 hash bits to 0 or 1 depending on whether each of the 64 DCT values is above or below the average value. The result doesn’t tell us the actual low frequencies; it just tells us the very-rough relative scale of the frequencies to the mean. The result will not vary as long as the overall structure of the image remains the same; this can survive gamma and color histogram adjustments without a problem. （从这里知道 DCT 的计算保留的是 overall structure 的信息，计算得到的值是相对于 mean 的值， relative scale ）

Construct the hash. Set the 64 bits into a 64-bit integer. The order does not matter, just as long as you are consistent. To see what this fingerprint looks like, simply set the values (this uses +255 and -255 based on whether the bits are 1 or 0) and convert from the 32x32 DCT (with zeros for the high frequencies) back into the 32x32 image:

特点

For example, if I have a small thumbnail of an image and I know that the big one exists somewhere in my collection, then Average Hash will find it very quickly.

如果是相同的图，使用大图查找小图，那么使用 ahash

if there are modifications – like text was added or a head was spliced into place, then Average Hash probably won’t do the job. While pHash is slower, it is very tolerant of minor modifications (minor being less than 25% of the picture).

如果修改少于 1/4，那么查找（结构）相似，那么使用phash

(3) dhash 算法

Reduce size. The fastest way to remove high frequencies and detail is to shrink the image. In this case, shrink it to 9x8 so that there are 72 total pixels. (I’ll get to the “why” for the odd 9x8 size in a moment.) By ignoring the size and aspect ratio, this hash will match any similar picture regardless of how it is stretched. （缩小比例）

Reduce color. Convert the image to a grayscale picture. This changes the hash from 72 pixels to a total of 72 colors. (For optimal performance, either reduce color before scaling or perform the scaling and color reduction at the same time.)

Compute the difference. The dHash algorithm works on the difference between adjacent pixels. This identifies the relative gradient direction. In this case, the 9 pixels per row yields 8 differences between adjacent pixels. Eight rows of eight differences becomes 64 bits.

Assign bits. Each bit is simply set based on whether the left pixel is brighter than the right pixel. The order does not matter, just as long as you are consistent. (I use a “1” to indicate that P < P[x+1] and set the bits from left to right, top to bottom using big-endian.)

通过比较 hamming distance，如果值大于 10 那么是 different image，否则是 a variation。

The hash values represent the relative change in brightness intensity. To compare two hashes, just count the number of bits that are different. (This is the Hamming distance.) A value of 0 indicates the same hash and likely a similar picture. A value greater than 10 is likely a different image, and a value between 1 and 10 is potentially a variation.

关于上述三个算法的总结

aHash (also called Average Hash or Mean Hash). This approach crushes the image into a grayscale 8x8 image and sets the 64 bits in the hash based on whether the pixel’s value is greater than the average color for the image. （比较看重color value）

pHash (also called “Perceptive Hash”). This algorithm is similar to aHash but use a discrete cosine transform (DCT) and compares based on frequencies rather than color values.

（phash 是based on frequency 而不是 color value）

As an implementation, dHash is nearly identical to aHash but it performs much better. While aHash focuses on average values and pHash evaluates frequency patterns, dHash tracks gradients.

Dhash 和 Ahash 都具有的性质

As with aHash, the resulting hash won’t change if the image is scaled or the aspect ratio changes. Increasing or decreasing the brightness or contrast, or even altering the colors won’t dramatically change the hash value. Even complex adjustments like gamma corrections and color profiles won’t impact the result. And best of all: this is FAST! Seriously – the slowest part of the algorithm is the size reduction step.

ahash 不会检测到 scale, ratio, brightness, contrast, color 方面的变化，特点就是快。

aHash. This algorithm takes about 3.75 hours to run. In other words, it takes more time to load and scale the image than to run the algorithm. Unfortunately, aHash generates a huge number of false positives. It matched all of the expected images, but also matched about 10x more false-positives. For example, the test picture that should have matched 32 times actually matched over 400 images. Worse: some of the misses had a difference of less than 2 bits. In general, aHash is fast but not very accurate.

pHash. This algorithm definitely performed the best with regards to accuracy. No false positives, no false negatives, and every match has a score of 2 or less. I’m sure that a bigger data set (or alternate needle image) will generate false matches, but the number of misses will likely be substantially less than aHash.

The problem with pHash is the performance. It took over 7 hours to complete. This is because the DCT computation uses lots of operations including cosine and sine. If I pre-compute the DCT constants, then this will drop 1-2 hours from the overall runtime. But applying the DCT coefficients still takes time. pHash is accurate, but not very fast.

dHash. Absolutely amazing… Very few false positives. For example, the image with two known matches ended up matching 6 pictures total (4 false positives). The scores were: 10, 0, 8, 10, 0, and 10. The two zeros were the correct matches; all of the false-positive matches had higher scores. As speed goes, dHash is as fast as aHash. Well, technically it is faster since it doesn’t need to compute the mean color value. The dHash algorithm has all the speed of aHash with very few false-positives.

查找相似的图像。速度上考虑，ahash 是最快的，但是准确性不好；准确性上 phash 很好，但速度比较慢；dhash 效果和phash 更高，并且速度和 ahash 是一样的快。

项目地址：http://123.56.8.10:9988/jeng/imagededup

这个是仿照同名 imagededup 进行搭建的。简单学习一下其中的算法。去重算法分为基于hash 的去重和基于 cnn 的去重。

得到 hash 值有现成的算法，基于 hash 的检索是 BK-Tree

BK Tree或Burkhard Keller Tree是一种数据结构，用于根据编辑距离（Levenshtein距离）概念执行拼写检查。 BK树也用于近似字符串匹配。基于该数据结构，可以实现许多软件中的各种自动校正特征。

正好 hash 数值也是 01 的字符串，其实是可以使用这种方法进行检索的。

基于深度学习 CNN 的算法

比如开源项目 imagededup 中也是先使用 CNN 网络提取特征向量，然后计算 cosine 距离。优化点：使用更好的网络结构去提取embedding、embedding 的压缩检索方面的优化。

CNN 稠密向量的检索？

通过模型求解 embedding 比较好的接口：提供 single image 和 batch image 的接口

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38


    def _get_cnn_features_single(self, image_array: np.ndarray) -> np.ndarray:
        """
        Generate CNN encodings for a single image.
        Args:
            image_array: Image typecast to numpy array.
        Returns:
            Encodings for the image in the form of numpy array.
        """
        image_pp = self.preprocess_input(image_array)
        image_pp = np.array(image_pp)[np.newaxis, :] # 这里设置 batch size =1
        return self.model.predict(image_pp)
    
    def _get_cnn_features_batch(self, image_dir: PurePath) -> Dict[str, np.ndarray]:
        """
        Generate CNN encodings for all images in a given directory of images.
        Args:
            image_dir: Path to the image directory.
        Returns:
            A dictionary that contains a mapping of filenames and corresponding numpy array of CNN encodings.
        """
        self.logger.info('Start: Image encoding generation')
        self.data_generator = self.DataGenerator(
            image_dir=image_dir,
            batch_size=self.batch_size,
            target_size=self.target_size,
            basenet_preprocess=self.preprocess_input,
        )

        feat_vec = self.model.predict_generator(
            self.data_generator, len(self.data_generator), verbose=self.verbose
        )
        self.logger.info('End: Image encoding generation')

        filenames = [i.name for i in self.data_generator.valid_image_files]

        self.encoding_map = {j: feat_vec[i, :] for i, j in enumerate(filenames)}
        return self.encoding_map

计算使用 sklearn 中的 cosine 距离，做的优化是使用多进程并行运算。

1
2
3
4
5
6
7
8


def parallelise(function: Callable, data: List, verbose: bool) -> List:
    pool = Pool(processes=cpu_count())
    results = list(
        tqdm.tqdm(pool.imap(function, data, 100), total=len(data), disable=not verbose)
    )
    pool.close()
    pool.join()
    return results

pool 中 map 和 imap 的區別

map 會將可迭代對象轉換成列表（假設不是列表），並且將對象分解成不同的塊。所以佔內容，並且運行得快。

imap 是不會轉換成可迭代的列表，也不會分解成塊。（默認情況下，可以設置大於1 的 chunksize 來緩解）。如果內存中放不下，那麼適合使用這種方式。

结论：整体来说，这个 repo 的代码格式和命名都是很好的，有时间的时候可以细看。

参考资料

http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html

http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html

文章目录