Parallel Processing in Python with concurrent.futures

Hello 👋,

In this short article I want to talk about parallel processing in Python.

Introduction

Sometimes you need to process things in parallel. If you’re using Python you may know about the global interpreter lock (GIL). The GIL is a lock that allows only a single thread to control the Python interpreter at a time (per process). This means that your multi-threaded Python program will have only one thread executing Python code at any given moment.

To work around the GIL for CPU-bound work, I highly recommend the concurrent.futures module and its ProcessPoolExecutor, which has been available since Python 3.2.

The ProcessPoolExecutor comes with a limitation: the functions you submit, their arguments and the values they return must all be picklable. The pickle module is used for serializing and deserializing Python objects.
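
To get a feel for the pickling limitation, here is a minimal sketch that checks whether an object can be pickled. The helper name is_picklable is my own and only used for illustration, it is not part of any library.

```python
import math
import pickle


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A function that can be imported by name is pickled by reference.
print(is_picklable(math.sqrt))        # True
# Plain data such as tuples, strings and lists is fine too.
print(is_picklable((1, "a", [2.0])))  # True
# A lambda has no importable name, so it cannot be pickled.
print(is_picklable(lambda x: x))      # False
```

This is why the functions we hand to the process pool later in this article are regular functions defined at module level.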

Computing Sums

To demonstrate the use of the ProcessPoolExecutor I wrote a simple program that computes the sum of a list with roughly 1_000_000_000 elements. On my machine the program executes in 20713 ms.

from time import time


def main():
    elements = 1_000_000_000
    arr = [i for i in range(1, elements)]

    start = time()
    print(sum(arr))
    end = time() - start
    print("Duration", end * 1000, "ms.")


if __name__ == '__main__':
    main()

To speed up the program we can run the code in parallel in multiple processes. Instead of computing the sum in a single step, we can split the range into 100 chunks and use the ProcessPoolExecutor to compute the sum of each chunk.
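
The splitting idea can be tried on a small input first, without any processes involved. The sketch below uses a hypothetical helper, chunk_ranges, that cuts a range into consecutive (start, stop) pairs and checks that the partial sums add up to the full sum.

```python
def chunk_ranges(total, chunks):
    """Split range(0, total) into `chunks` consecutive (start, stop) pairs."""
    size = total // chunks
    # The last boundary is `total`, so the final chunk absorbs any remainder.
    bounds = [i * size for i in range(chunks)] + [total]
    return list(zip(bounds, bounds[1:]))


ranges = chunk_ranges(1_000, 10)
partial_sums = [sum(range(start, stop)) for start, stop in ranges]

print(ranges[:3])                              # [(0, 100), (100, 200), (200, 300)]
print(sum(partial_sums) == sum(range(1_000)))  # True
```

Each (start, stop) pair is independent of the others, which is what makes the work easy to hand out to separate processes.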

The following code does exactly that:

from concurrent.futures import ProcessPoolExecutor
from time import time


def compute_sum(start, stop):
    arr = [i for i in range(start, stop)]
    return sum(arr)


def main():
    start = time()
    elements = 1_000_000_000

    with ProcessPoolExecutor(max_workers=20) as executor:
        # steps is the list [0, 10000000, 20000000, ..., 990000000, 1000000000]
        steps = list(range(0, elements + 1, elements // 100))
        # results will store our futures
        results = []
        # Pair consecutive boundaries: (0, 10000000) up to (990000000, 1000000000)
        for index, (chunk_start, chunk_stop) in enumerate(zip(steps, steps[1:])):
            print("Submitting", index)
            # Sum the integers in range(chunk_start, chunk_stop); submit returns a future
            future = executor.submit(compute_sum, chunk_start, chunk_stop)
            # save the future
            results.append(future)

        # Retrieve the results and add up the sums.
        total_sum = 0
        for r in results:
            total_sum += r.result()
        print("Sum", total_sum)

    end = time() - start
    print("Duration", end * 1000, "ms.")


if __name__ == '__main__':
    main()

This starts a pool of up to 20 Python worker processes, and each of the 100 tasks computes the sum over one start/stop range:

def compute_sum(start, stop):
    arr = [i for i in range(start, stop)]
    return sum(arr)
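
As a sanity check on the printed total, the sum of consecutive integers also has a closed form. The sketch below is not part of the benchmark. The helper names range_sum and expected_total are mine and only serve to verify the result.

```python
def range_sum(start, stop):
    """Closed-form value of sum(range(start, stop)) for start <= stop."""
    return (start + stop - 1) * (stop - start) // 2


def expected_total(n):
    """Closed-form value of sum(range(0, n)), i.e. 0 + 1 + ... + (n - 1)."""
    return n * (n - 1) // 2


print(range_sum(0, 10))               # 45
print(range_sum(3, 7))                # 18
print(expected_total(1_000_000_000))  # 499999999500000000
```

If the parallel program prints a different total, a chunk boundary was most likely skipped or counted twice.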

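Compared to the previous code, the version that uses the process pool executor runs in only ~7 seconds. That’s roughly a 3x improvement! Note that the two measurements are not taken in exactly the same way: the first program times only the sum(arr) call, while the second also times building the lists and starting the processes.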

Conclusion

Running the program in parallel improved the running time by almost 3X on my machine.

Dividing a problem into sub-problems and solving each problem in a parallel manner is a good way to improve the performance of your programs.

In Python, if we run CPU-bound code on multiple threads, only one thread executes Python code at a time, which defeats the purpose of using multiple threads. To overcome this limitation we used the ProcessPoolExecutor from the concurrent.futures module to run the code in multiple Python processes, and finally we combined the results.

Thanks for reading!

How to identify similar images using hashing and Python

Hi 👋,

In this article I would like to talk about image hashing.

Image hashing algorithms are specialized hashing functions that output the hash of an image based on the image’s properties. Duplicate images output the same hash value, and visually similar images output hash values that differ only slightly.

To simplify:

hash("white_cat") = "aaaa"
hash("brown_cat") = "aaba"
hash("car") = "xkjwe"
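
To make the idea a bit more concrete, here is a toy version of an average hash on tiny 2x2 grayscale "images" written as nested lists. It is a simplified sketch for illustration, not the ImageHash implementation, and toy_average_hash is a made-up name.

```python
def toy_average_hash(pixels):
    """Return a bit string with a 1 for every pixel brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if p > mean else "0" for p in flat)


original = [[10, 200], [220, 30]]
brighter = [[40, 230], [250, 60]]  # same picture, every pixel 30 brighter
inverted = [[245, 55], [35, 225]]  # light and dark areas swapped

print(toy_average_hash(original))  # 0110
print(toy_average_hash(brighter))  # 0110
print(toy_average_hash(inverted))  # 1001
```

The brighter copy keeps the same hash because the hash only records which pixels are above the average, while the inverted picture flips every bit.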

Some use cases for image hashing are:

  • Duplicate Image Detection
  • Anti-Impersonation / Image Stealing
  • Image filtering
  • Reverse image search

Let’s play around with image hashing techniques using Python and the ImageHash library. Install the library with:

pip install imagehash
pip install six

To obtain some sample images I’ve used Pexels and searched for words like “white cat”, “firetruck”.

Here are the images that I’m using: cat1, cat2, cat3 and firetruck1.

I’m going to import the necessary modules and add a function that converts the hexadecimal string given by ImageHash to an integer.

from PIL import Image
import imagehash


def hash_to_int(img_hash: imagehash.ImageHash):
    return int(str(img_hash), 16)

The reason for the hash_to_int function is that it is much easier to do computations on integers than on strings. If we later build a service that uses image hashing and computes Hamming distances, we can store the integer hashes in an OLAP database such as ClickHouse and use bitHammingDistance to compute the Hamming distance.
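
One detail to keep in mind when storing hashes as integers: leading zeros of the hexadecimal string are lost, so you need to know the hash width to turn the integer back into the same string. The sketch below uses a made-up hash value and assumes a 64-bit hash, which is 16 hexadecimal digits.

```python
# Made-up 64-bit hash, written as 16 hexadecimal digits.
hex_hash = "00ffe0c080000103"

value = int(hex_hash, 16)

# Formatting without a width drops the two leading zeros.
print(f"{value:x}")          # ffe0c080000103

# Padding to 16 digits restores the original string.
restored = f"{value:016x}"
print(restored == hex_hash)  # True
```

For computing Hamming distances this does not matter, because the missing leading zeros are zero bits on both sides of the XOR.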

The next snippet of code opens the images and computes the average hash and the color hash of each one. Then, for every image in the dataset, it computes the Hamming distance to the source image for both hashes and adds the two distances together.

The lower the Hamming distance, the more similar the images. A Hamming distance of 0 means the hashes are equal, which is what we get for duplicate images.

def main():
    images = [
        Image.open("cat1.jpg"),
        Image.open("cat2.jpg"),
        Image.open("cat3.jpg"),
        Image.open("firetruck1.jpg")
    ]

    average_hashes = [hash_to_int(imagehash.average_hash(image)) for image in images]
    color_hashes = [hash_to_int(imagehash.colorhash(image)) for image in images]

    image_hashes = list(zip(images, average_hashes, color_hashes))

    source = image_hashes[0]

    for image in image_hashes:
        hamming_average_hash = bin(source[1] ^ image[1]).count("1")
        hamming_color_hash = bin(source[2] ^ image[2]).count("1")
        hamming_distance = hamming_average_hash + hamming_color_hash
        print("Hamming Distance between", source[0].filename, "and", image[0].filename, "is", hamming_distance)


if __name__ == '__main__':
    main()

To compute the Hamming distance, you XOR the two integers and then count the number of 1 bits: bin(source[1] ^ image[1]).count("1"). That’s it.
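
Here is the same computation as a small standalone function, so it can be tried without any images. The input values are arbitrary numbers I picked for the example.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which a and b differ."""
    # XOR leaves a 1 bit exactly where the two inputs disagree.
    return bin(a ^ b).count("1")


print(hamming_distance(0b1011, 0b1001))                    # 1
print(hamming_distance(int("ff00", 16), int("0f0f", 16)))  # 8
print(hamming_distance(42, 42))                            # 0
```

On Python 3.10 and newer, (a ^ b).bit_count() gives the same number without building a string.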

If we run the program with the source variable set to cat1.jpg, source = image_hashes[0], we get the following result:

Hamming Distance between cat1.jpg and cat1.jpg is 0
Hamming Distance between cat1.jpg and cat2.jpg is 36
Hamming Distance between cat1.jpg and cat3.jpg is 39
Hamming Distance between cat1.jpg and firetruck1.jpg is 33

If we look at our dataset, the first image, cat1, is somewhat visually similar to the image of the firetruck, which is why firetruck1 gets the lowest distance (33).

If we run the program with the source variable set to cat2.jpg we can see that cat2 is similar to cat3 since both images contain white cats.

Hamming Distance between cat2.jpg and cat1.jpg is 36
Hamming Distance between cat2.jpg and cat2.jpg is 0
Hamming Distance between cat2.jpg and cat3.jpg is 23
Hamming Distance between cat2.jpg and firetruck1.jpg is 47
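
Picking the closest match from such a result is a matter of ignoring the source image itself and taking the smallest distance. The sketch below reuses the distances printed above for cat2.jpg. The function name most_similar is my own.

```python
# Distances from cat2.jpg, copied from the program output above.
distances_from_cat2 = {
    "cat1.jpg": 36,
    "cat2.jpg": 0,
    "cat3.jpg": 23,
    "firetruck1.jpg": 47,
}


def most_similar(source, distances):
    """Return the name with the smallest distance, skipping the source itself."""
    candidates = {name: d for name, d in distances.items() if name != source}
    return min(candidates, key=candidates.get)


print(most_similar("cat2.jpg", distances_from_cat2))  # cat3.jpg
```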

Conclusion

We used a Python image hashing library to compute the average and color hash of some images and then we determined which images are similar to each other by computing the hamming distance of the hashes.

Thanks for reading and build something fun! 🔨

Full Code

"""
pip install imagehash
pip install six
"""
from PIL import Image
import imagehash


def hash_to_int(img_hash: imagehash.ImageHash):
    return int(str(img_hash), 16)


def main():
    images = [
        Image.open("cat1.jpg"),
        Image.open("cat2.jpg"),
        Image.open("cat3.jpg"),
        Image.open("firetruck1.jpg")
    ]

    average_hashes = [hash_to_int(imagehash.average_hash(image)) for image in images]
    color_hashes = [hash_to_int(imagehash.colorhash(image)) for image in images]

    image_hashes = list(zip(images, average_hashes, color_hashes))

    source = image_hashes[0]

    for image in image_hashes:
        hamming_average_hash = bin(source[1] ^ image[1]).count("1")
        hamming_color_hash = bin(source[2] ^ image[2]).count("1")
        hamming_distance = hamming_average_hash + hamming_color_hash
        print("Hamming Distance between", source[0].filename, "and", image[0].filename, "is", hamming_distance)


if __name__ == '__main__':
    main()

Multiple Python versions on Windows

Hi 👋

In this short article I will show you two ways of changing Python versions on Windows. It is useful when you have installed multiple Python versions on your system and want to run a specific version from the terminal.

For example, with Python 3.10 and Python 3.7 installed, we can run Python either through the Python Launcher (py) or through the python command.

Python Launcher

To list installed Python versions with Python launcher we can use the py -0 command.

@nutiu ➜ ~ py -0
Installed Pythons found by C:\WINDOWS\py.exe Launcher for Windows
 -3.10-64 *
 -3.7-64

@nutiu ➜ ~ py
Python 3.10.3 (tags/v3.10.3:a342a49, Mar 16 2022, 13:07:40) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>

The default version has a star next to it. If we run a plain py command, we’ll get a Python 3.10 prompt. To change the default version, all we need to do is set the environment variable PY_PYTHON to the desired version. Setting it with $env: in PowerShell only affects the current session.

@nutiu ➜ ~ $env:PY_PYTHON = "3.7"
@nutiu ➜ ~ py -0
Installed Pythons found by C:\WINDOWS\py.exe Launcher for Windows
 -3.10-64
 -3.7-64 *
@nutiu ➜ ~ py
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>
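
To check from a script which interpreter the launcher picked, a few lines of Python are enough. This is a small sketch. The file name which_python.py used below is just an example.

```python
import sys


def describe_interpreter():
    """Return the running Python version and the path of its executable."""
    v = sys.version_info
    return f"Python {v.major}.{v.minor}.{v.micro} at {sys.executable}"


print(describe_interpreter())
```

Saved as which_python.py, it can be run with py which_python.py to see the default version, or with py -3.7 which_python.py to request a specific version for a single run.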

Using the Python command

If you prefer running Python using the python command, you’ll get the Python version that comes first in your PATH. For example, if I run python on my machine I get:

@nutiu ➜ ~ python
Python 3.10.3 (tags/v3.10.3:a342a49, Mar 16 2022, 13:07:40) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>

We can change the order by going to: My PC -> Advanced System Settings -> Environment Variables

Select Path from User variables and click Edit…

Python 3.10 has higher precedence in Path because it is listed above Python 3.7. If we want to change the order, we need to select the folders referencing Python37 and click Move Up until they are above the Python 3.10 folders.

Restarting your terminal and running python again should run your desired Python version.
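
To confirm the new order, you can ask Python itself which executable the python command resolves to and in what order the Python folders appear in PATH. A minimal sketch, where python_entries_on_path is a helper name of my own:

```python
import os
import shutil


def python_entries_on_path(path_value=None):
    """Return the PATH entries that mention "python", in lookup order."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    entries = path_value.split(os.pathsep)
    return [entry for entry in entries if "python" in entry.lower()]


# The executable that a plain `python` command would start (None if not found).
print(shutil.which("python"))

# Earlier entries win over later ones.
for entry in python_entries_on_path():
    print(entry)
```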

Thanks for reading! 🍻