How to identify similar images using hashing

Hi 👋,

In this article I would like to talk about image hashing.

Image hashing algorithms are specialized hashing functions that output the hash of an image based on the image’s visual properties. Duplicate images produce the same hash value, and visually similar images produce hash values that differ only slightly.

To simplify:

hash("white_cat") = "aaaa"
hash("brown_cat") = "aaba"
hash("car") = "xkjwe"

Some use cases for image hashing are:

  • Duplicate image detection
  • Anti-impersonation / detecting stolen images
  • Image filtering
  • Reverse image search

Let’s play around with image hashing techniques using Python and the ImageHash library. Install the library with:

pip install imagehash
pip install six

To obtain some sample images I’ve used Pexels and searched for terms like “white cat” and “firetruck”.

Here are the images that I’m using: cat1, cat2, cat3 and firetruck1.

I’m going to import the necessary modules and add a function that converts the hexadecimal string representation of an image hash to an integer.

from PIL import Image
import imagehash


def hash_to_int(img_hash: imagehash.ImageHash):
    return int(str(img_hash), 16)

The reason for the hash_to_int function is that it’s much easier to do computations with integers than with strings. If, in the future, we build a service that uses image hashing and computes Hamming distances, we can store the integer hashes in an OLAP database such as ClickHouse and use bitHammingDistance to compute the Hamming distance.
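
For illustration only, here’s a hypothetical sketch in Go of what that service-side comparison could look like once the hashes are stored as integers (the image names and hash values are made up; the Python version of the same XOR-and-popcount idea follows below):

package main

import (
	"fmt"
	"math/bits"
)

// hammingDistance counts the bits that differ between two 64-bit integer hashes.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	// Made-up stored hashes, keyed by image name.
	stored := map[string]uint64{
		"cat1.jpg":       0xF0F0F0F0F0F0F0F0,
		"firetruck1.jpg": 0x0123456789ABCDEF,
	}
	newImageHash := uint64(0xF0F0F0F0F0F0F0F1)

	for name, h := range stored {
		fmt.Println(name, "distance:", hammingDistance(newImageHash, h))
	}
}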

The next snippet of code opens the images, computes the average and color hashes, and then, for every image in the dataset, computes its distance to a source image: the Hamming distance between the average hashes plus the Hamming distance between the color hashes.

The lower the Hamming distance, the more similar the images. A Hamming distance of 0 means the images are identical.

def main():
    images = [
        Image.open("cat1.jpg"),
        Image.open("cat2.jpg"),
        Image.open("cat3.jpg"),
        Image.open("firetruck1.jpg")
    ]

    average_hashes = [hash_to_int(imagehash.average_hash(image)) for image in images]
    color_hashes = [hash_to_int(imagehash.colorhash(image)) for image in images]

    image_hashes = list(zip(images, average_hashes, color_hashes))

    source = image_hashes[0]

    for image in image_hashes:
        hamming_average_hash = bin(source[1] ^ image[1]).count("1")
        hamming_color_hash = bin(source[2] ^ image[2]).count("1")
        hamming_distance = hamming_average_hash + hamming_color_hash
        print("Hamming Distance between", source[0].filename, "and", image[0].filename, "is", hamming_distance)


if __name__ == '__main__':
    main()

To compute the Hamming distance, you XOR the two integers and then count the number of 1 bits: bin(source[1] ^ image[1]).count("1"). That’s it.

If we run the program with the source variable set to cat1.jpg (source = image_hashes[0]), we get the following result:

Hamming Distance between cat1.jpg and cat1.jpg is 0
Hamming Distance between cat1.jpg and cat2.jpg is 36
Hamming Distance between cat1.jpg and cat3.jpg is 39
Hamming Distance between cat1.jpg and firetruck1.jpg is 33

If we look at our dataset, the first image, cat1, is somewhat visually similar to the image of the firetruck, which explains why its distance to firetruck1 (33) is lower than its distances to the other cats (36 and 39).

If we run the program with the source variable set to cat2.jpg, we can see that cat2 is most similar to cat3, since both images contain white cats.

Hamming Distance between cat2.jpg and cat1.jpg is 36
Hamming Distance between cat2.jpg and cat2.jpg is 0
Hamming Distance between cat2.jpg and cat3.jpg is 23
Hamming Distance between cat2.jpg and firetruck1.jpg is 47

Conclusion

We used a Python image hashing library to compute the average and color hashes of some images, and then we determined which images are similar to each other by computing the Hamming distance of the hashes.

Thanks for reading and build something fun! 🔨

Full Code

"""
pip install imagehash
pip install six
"""
from PIL import Image
import imagehash


def hash_to_int(img_hash: imagehash.ImageHash):
    return int(str(img_hash), 16)


def main():
    images = [
        Image.open("C:\\Users\\nutiu\\PycharmProjects\\imageHash\\cat1.jpg"),
        Image.open("C:\\Users\\nutiu\\PycharmProjects\\imageHash\\cat2.jpg"),
        Image.open("C:\\Users\\nutiu\\PycharmProjects\\imageHash\\cat3.jpg"),
        Image.open("C:\\Users\\nutiu\\PycharmProjects\\imageHash\\firetruck1.jpg")
    ]

    average_hashes = [hash_to_int(imagehash.average_hash(image)) for image in images]
    color_hashes = [hash_to_int(imagehash.colorhash(image)) for image in images]

    image_hashes = list(zip(images, average_hashes, color_hashes))

    source = image_hashes[0]

    for image in image_hashes:
        hamming_average_hash = bin(source[1] ^ image[1]).count("1")
        hamming_color_hash = bin(source[2] ^ image[2]).count("1")
        hamming_distance = hamming_average_hash + hamming_color_hash
        print("Hamming Distance between", source[0].filename, "and", image[0].filename, "is", hamming_distance)


if __name__ == '__main__':
    main()

Object Pool Pattern

Hi 👋

In this article we’ll talk about the Object Pool pattern in Golang.

The Object Pool pattern is a design pattern used when constructing objects is a costly operation; for example, building an HTTPClient or DatabaseClient object can take some time.

With a pool of resources, a resource is requested from the pool when needed and returned when it’s no longer needed, so it can be reused later.

Programs benefit from this pattern because, once an object has been constructed, the next time you need it you just grab an instance from the pool instead of constructing a new one from scratch.

In Golang this pattern is easily implemented with sync.Pool. Given a Resource struct, to implement an object pool we pass the NewResource constructor function to the pool.

To track how many Resource instances have been created, we use the counter variable.

Resource

var logger = log.Default()
var counter = 0

type Resource struct {
	id string
}

func NewResource() *Resource {
	logger.Printf("NewResource called")
	counter += 1
	return &Resource{id: fmt.Sprintf("Resource-%d", counter)}
}

func (r *Resource) doWork() {
	logger.Printf("%s doing work", r.id)
}

Let’s demo sync.Pool!

Demo 1️⃣

In the first demo, we get the resource from the pool, do some work and then put it back. Because this happens one step at a time, we end up with just one Resource instance.

func demo1() {
	println("demo1")
	theResourcePool := sync.Pool{New: func() any {
		return NewResource()
	}}

	for i := 0; i < 10; i++ {
		item := theResourcePool.Get().(*Resource)
		item.doWork()
		theResourcePool.Put(item)
	}

	println("done", counter)
}

Output

demo1
2022/08/17 22:38:59 NewResource called
2022/08/17 22:38:59 Resource-1 doing work
2022/08/17 22:38:59 Resource-1 doing work
2022/08/17 22:38:59 Resource-1 doing work
2022/08/17 22:38:59 Resource-1 doing work
2022/08/17 22:38:59 Resource-1 doing work
2022/08/17 22:38:59 Resource-1 doing work
2022/08/17 22:38:59 Resource-1 doing work
2022/08/17 22:38:59 Resource-1 doing work
2022/08/17 22:38:59 Resource-1 doing work
2022/08/17 22:38:59 Resource-1 doing work
done 1

Resource-1 is the only instance that does work.

Demo 2️⃣

In demo2 we spawn 10 goroutines that use the pool. Since all the goroutines start at roughly the same time and each requires a resource to doWork, the pool ends up creating 10 Resource instances.

func demo2() {
	println("demo2")
	wg := sync.WaitGroup{}
	theResourcePool := sync.Pool{New: func() any {
		return NewResource()
	}}

	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			item := theResourcePool.Get().(*Resource)
			item.doWork()
			theResourcePool.Put(item)
		}()

	}
	wg.Wait()

	println("done", counter)
}

Output

demo2
2022/08/17 22:41:12 NewResource called
2022/08/17 22:41:12 NewResource called
2022/08/17 22:41:12 NewResource called
2022/08/17 22:41:12 Resource-3 doing work
2022/08/17 22:41:12 NewResource called
2022/08/17 22:41:12 Resource-4 doing work
2022/08/17 22:41:12 NewResource called
2022/08/17 22:41:12 Resource-5 doing work
2022/08/17 22:41:12 NewResource called
2022/08/17 22:41:12 Resource-6 doing work
2022/08/17 22:41:12 NewResource called
2022/08/17 22:41:12 Resource-7 doing work
2022/08/17 22:41:12 NewResource called
2022/08/17 22:41:12 Resource-8 doing work
2022/08/17 22:41:12 NewResource called
2022/08/17 22:41:12 NewResource called
2022/08/17 22:41:12 Resource-1 doing work
2022/08/17 22:41:12 Resource-2 doing work
2022/08/17 22:41:12 Resource-9 doing work
2022/08/17 22:41:12 Resource-10 doing work
done 10

Demo 3️⃣

In demo3 we do the same thing we did in demo2, but with some random sleeps in between, so some goroutines are faster and others are slower. The faster goroutines return their resource to the pool sooner, and the slower goroutines, which start working later, reuse a pooled resource instead of creating a new one.

func demo3() {
	println("demo2")
	wg := sync.WaitGroup{}
	theResourcePool := sync.Pool{New: func() any {
		return NewResource()
	}}

	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			time.Sleep(time.Duration(rand.Intn(900)+100) * time.Millisecond)
			item := theResourcePool.Get().(*Resource)
			item.doWork()
			time.Sleep(time.Duration(rand.Intn(100)+100) * time.Millisecond)
			theResourcePool.Put(item)
		}()

	}
	wg.Wait()

	println("done", counter)
}

Output

demo3
2022/08/17 22:42:35 NewResource called
2022/08/17 22:42:35 Resource-1 doing work
2022/08/17 22:42:35 NewResource called
2022/08/17 22:42:35 Resource-2 doing work
2022/08/17 22:42:35 NewResource called
2022/08/17 22:42:35 Resource-3 doing work
2022/08/17 22:42:36 Resource-1 doing work
2022/08/17 22:42:36 Resource-2 doing work
2022/08/17 22:42:36 Resource-3 doing work
2022/08/17 22:42:36 Resource-1 doing work
2022/08/17 22:42:36 NewResource called
2022/08/17 22:42:36 Resource-4 doing work
2022/08/17 22:42:36 NewResource called
2022/08/17 22:42:36 Resource-5 doing work
2022/08/17 22:42:36 Resource-2 doing work
done 5

Only 5 Resource instances were created in this run; the exact number varies from run to run because of the random sleeps.

Conclusion

The Object Pool pattern is a great fit when you need to reuse instances of an object whose construction is slow.

In Go we have sync.Pool, which implements the Object Pool pattern for us; we just need to give it a New function that returns a pointer.
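
As a closing illustration, a very common real-world use of sync.Pool is reusing bytes.Buffer instances on hot paths. This is a minimal sketch, not part of the demos above:

package main

import (
	"bytes"
	"fmt"
	"sync"
)

var bufferPool = sync.Pool{New: func() any {
	return new(bytes.Buffer)
}}

func render(name string) string {
	// Grab a buffer from the pool; New is only called when the pool is empty.
	buf := bufferPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset()         // clear the contents before returning the buffer
		bufferPool.Put(buf) // give it back so other callers can reuse it
	}()

	fmt.Fprintf(buf, "hello, %s!", name)
	return buf.String()
}

func main() {
	fmt.Println(render("gopher"))
}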

Thanks for reading! 📚

Full Code

package main

import (
	"fmt"
	"log"
	"math/rand"
	"sync"
	"time"
)

var logger = log.Default()
var counter = 0

type Resource struct {
	id string
}

func NewResource() *Resource {
	logger.Printf("NewResource called")
	counter += 1
	return &Resource{id: fmt.Sprintf("Resource-%d", counter)}
}

func (r *Resource) doWork() {
	logger.Printf("%s doing work", r.id)
}

func demo1() {
	println("demo1")
	theResourcePool := sync.Pool{New: func() any {
		return NewResource()
	}}

	for i := 0; i < 10; i++ {
		item := theResourcePool.Get().(*Resource)
		item.doWork()
		theResourcePool.Put(item)
	}

	println("done", counter)
}

func demo2() {
	println("demo2")
	wg := sync.WaitGroup{}
	theResourcePool := sync.Pool{New: func() any {
		return NewResource()
	}}

	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			item := theResourcePool.Get().(*Resource)
			item.doWork()
			theResourcePool.Put(item)
		}()

	}
	wg.Wait()

	println("done", counter)
}

func demo3() {
	println("demo2")
	wg := sync.WaitGroup{}
	theResourcePool := sync.Pool{New: func() any {
		return NewResource()
	}}

	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			time.Sleep(time.Duration(rand.Intn(900)+100) * time.Millisecond)
			item := theResourcePool.Get().(*Resource)
			item.doWork()
			time.Sleep(time.Duration(rand.Intn(100)+100) * time.Millisecond)
			theResourcePool.Put(item)
		}()

	}
	wg.Wait()

	println("done", counter)
}

func main() {
	demo1()
	//demo2()
	//demo3()
}

Testing Tips: Avoid sleep in tests

Hi 👋,

In this article I want to share a testing tip that I recently learned by reading Software Engineering at Google: Lessons Learned from Programming Over Time. The technique improved the way I write unit tests.

When I’m writing bigger unit tests, I often have to execute something in the background, for example publishing a message to a message broker, waiting for the message to be published and then consuming it to verify that what I published is correct.

When waiting for the message to be published, or for any other operation that requires waiting in tests, I used to call a sleep function for a second or two. This is fine for a few tests, but it does not scale well as the test suite grows. Imagine having 50 tests, each sleeping for one second: it would take at least 50 seconds to run the test suite, which is a lot of wasted time.

The better approach is to use a timeout and polling: instead of sleeping, poll every millisecond to check whether the code under test has done what you expect. This keeps the tests correct and reduces the execution time by a lot!

Let’s demonstrate this with a small example in the Golang programming language. I won’t use any external dependencies to demonstrate the technique, but you can apply it anywhere you call something that blocks or need to wait for something.

What we’re going to test is a simple struct with a method that modifies a field from a background goroutine, after a random delay.

import (
	"math/rand"
	"time"
)

type SystemUnderTest struct {
	Result string
}

func (s *SystemUnderTest) SetResult() {
	go func() {
		time.Sleep(time.Duration(rand.Intn(3000)) * time.Millisecond)
		s.Result = "the_result"
	}()
}

func (s *SystemUnderTest) GetData() string {
	time.Sleep(time.Duration(rand.Intn(3000)) * time.Millisecond)
	return "the_data"
}

This is a not-so-ideal way of testing it:

// A not very ideal way to test SetResult
func Test_SystemUnderTest_SetResult_NotIdeal(t *testing.T) {
	sut := SystemUnderTest{}
	sut.SetResult()

	time.Sleep(4 * time.Second)

	if sut.Result != "the_result" {
		t.Fatalf("Result not equal, want %s got %s", "the_result", sut.Result)
	}
}

SetResult takes between 0 and 3 seconds to run; since we’re waiting for its result, we sleep for 4 seconds.

=== RUN   Test_SystemUnderTest_SetResult_NotIdeal
--- PASS: Test_SystemUnderTest_SetResult_NotIdeal (4.00s)
PASS

A better way is to write a simple loop and poll for the result:

// A better way of testing the code
func Test_SystemUnderTest_SetResult(t *testing.T) {
	sut := SystemUnderTest{}
	sut.SetResult()

	passedMilliseconds := 0
	for {
		if passedMilliseconds > 4000 {
			t.Fatalf("timeout reached")
		}
		passedMilliseconds += 1
		time.Sleep(1 * time.Millisecond)
		if sut.Result != "" {
			break
		}
	}
	if sut.Result != "the_result" {
		t.Fatalf("Result not equal, want %s got %s", "the_result", sut.Result)
	}
}

Writing a loop and polling for the result will make the test more complex but it will execute faster. In this case the benefits outweigh the downsides.

=== RUN   Test_SystemUnderTest_SetResult
--- PASS: Test_SystemUnderTest_SetResult (2.08s)
PASS
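
The polling loop can also be extracted into a small reusable helper so that individual tests stay short. waitFor below is a hypothetical helper, not part of the original example (it assumes the testing and time imports):

// waitFor polls a condition until it returns true or the timeout expires.
func waitFor(t *testing.T, timeout time.Duration, condition func() bool) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if condition() {
			return
		}
		time.Sleep(1 * time.Millisecond)
	}
	t.Fatal("timeout reached while waiting for condition")
}

// Usage, equivalent to the test above:
func Test_SystemUnderTest_SetResult_WithHelper(t *testing.T) {
	sut := SystemUnderTest{}
	sut.SetResult()

	waitFor(t, 4*time.Second, func() bool { return sut.Result != "" })

	if sut.Result != "the_result" {
		t.Fatalf("Result not equal, want %s got %s", "the_result", sut.Result)
	}
}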

If the language permits, we can also use channels. Let’s test the GetData function defined earlier, which returns a result after a random amount of time.

func Test_SystemUnderTest_GetData(t *testing.T) {
	sut := SystemUnderTest{}

	timeoutTicker := time.NewTicker(5 * time.Second)
	result := make(chan string)

	// Get result when ready
	go func() {
		result <- sut.GetData()
	}()

	select {
	case <-timeoutTicker.C:
		t.Fatal("timeout reached")
	case actual := <-result:
		if actual != "the_data" {
			t.Fatalf("Data not equal, want: %s, got %s", "the_data", actual)
		}
	}
}

We avoided writing a loop by using a ticker and select.
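
As a design note, the ticker could also be replaced with time.After, which returns a channel that fires once after the given duration; the select would then look like this (just a sketch of the alternative):

	select {
	case <-time.After(5 * time.Second):
		t.Fatal("timeout reached")
	case actual := <-result:
		if actual != "the_data" {
			t.Fatalf("Data not equal, want: %s, got %s", "the_data", actual)
		}
	}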

In other cases you may need to test HTTP calls to the local machine, or calls into some other library. Look for timeout options.

Go’s HTTP library lets you specify a custom timeout for every call you make:

	client := http.Client{
		Timeout: 50 * time.Millisecond,
	}
	response, err := client.Get("http://localhost:9999/metrics")
	...

In Conclusion

Avoid the use of sleep in tests. Try polling for the result instead, or check whether the blocking functions have parameters, or can be configured, to stop execution after a timeout period.
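
For blocking calls that accept a context.Context, the idiomatic way to bound the wait is context.WithTimeout. Here’s a minimal sketch, assuming a hypothetical doSomething function that respects context cancellation (imports: context, time):

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// doSomething is a hypothetical blocking call that stops early when ctx is cancelled.
	if err := doSomething(ctx); err != nil {
		t.Fatalf("doSomething failed or timed out: %v", err)
	}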

Thanks for reading and I hope you’ve enjoyed this article! 🍻

Go Pattern: Sorting a slice on multiple keys

Hi 👋

In this article I want to highlight a simple pattern for sorting a slice in Go on multiple keys.

Given the following structure, let’s say we want to sort a slice of it in ascending order by Version, Generation and Time.

type TheStruct struct {
	Generation int
	Time       int
	Version    int
}

The way we sort slices in Go is either by implementing sort.Interface or by using one of the sort.Slice functions. To sort the slice by the above criteria, we’ll call sort.Slice with the following less function.

	sort.Slice(structs, func(i, j int) bool {
		iv, jv := structs[i], structs[j]
		switch {
		case iv.Version != jv.Version:
			return iv.Version < jv.Version
		case iv.Generation != jv.Generation:
			return iv.Generation < jv.Generation
		default:
			return iv.Time < jv.Time
		}
	})

The slice will be sorted by the following fields, in that order: Version, Generation and Time. The trick is the switch statement: the case expression case iv.Version != jv.Version runs return iv.Version < jv.Version only when the versions differ; when they are equal, the next case compares Generation, and the default finally compares Time.

You can use this pattern whenever you want to sort slices over multiple fields in Go.
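
The pattern also adapts easily to mixed directions. As a sketch (not part of the source code below), sorting by Version ascending and then by Time descending only changes the comparison in the relevant case:

	sort.Slice(structs, func(i, j int) bool {
		iv, jv := structs[i], structs[j]
		switch {
		case iv.Version != jv.Version:
			return iv.Version < jv.Version // Version ascending
		default:
			return iv.Time > jv.Time // Time descending
		}
	})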

Thanks for reading! 🍻

Source Code

package main

import (
	"fmt"
	"sort"
)

type TheStruct struct {
	Generation int
	Time       int
	Version    int
}

func main() {
	var structs = []TheStruct{
		{
			Generation: 1,
			Time:       150,
			Version:    0,
		},
		{
			Generation: 1,
			Time:       200,
			Version:    0,
		},
		{
			Generation: 1,
			Time:       200,
			Version:    2,
		},
		{
			Generation: 1,
			Time:       500,
			Version:    0,
		},
		{
			Generation: 1,
			Time:       100,
			Version:    0,
		},
		{
			Generation: 1,
			Time:       400,
			Version:    0,
		},
		{
			Generation: 2,
			Time:       400,
			Version:    0,
		},
		{
			Generation: 2,
			Time:       100,
			Version:    2,
		},
		{
			Generation: 1,
			Time:       300,
			Version:    0,
		},
	}

	fmt.Printf("%v\n", structs)

	sort.Slice(structs, func(i, j int) bool {
		iv, jv := structs[i], structs[j]
		switch {
		case iv.Version != jv.Version:
			return iv.Version < jv.Version
		case iv.Generation != jv.Generation:
			return iv.Generation < jv.Generation
		default:
			return iv.Time < jv.Time
		}
	})
	fmt.Printf("%v\n", structs)

}

Output

[{1 150 0} {1 200 0} {1 200 2} {1 500 0} {1 100 0} {1 400 0} {2 400 0} {2 100 2} {1 300 0}]
[{1 100 0} {1 150 0} {1 200 0} {1 300 0} {1 400 0} {1 500 0} {2 400 0} {1 200 2} {2 100 2}]

Also, special thanks to RP.💖