Research:A System for Large-scale Image Similarity

Created
13:41, 4 August 2021 (UTC)
Collaborators
no affiliation
Duration:  2022-01 – 2022-06
Category:2022 projects

This page documents a research project in progress.
Information may be incomplete and change as the project progresses.
Please contact the project lead before formally citing or reusing results from this page.

Category:Active research projects

After several requests from different parts of the movement, the Research team is working on an image similarity tool. The tool will take as input an image and return the most similar images in Wikimedia Commons.

Methods

This project is done in two parts:

  1. Calculate the embeddings of the images stored as string(base64) in csv file.
  2. Create the index tree using embeddings generate in above step, for recommendations. For this step we will be using AnnoyIndex from annoy.

Step 1: Calculating embeddings from the image base64 strings


[Pipeline used for embedding generation


As, image strings are stored in csv format so we decided to go with pandas and loaded the csv file into pandas Dataframe. After this we extracted the column of image base64 representation. Now, after getting Image base64 strings we passed the strings to generate_image function which used these strings to generate image array. We then finished out final step to calculate image embeddings which then again stored in csv file with one additional column of image urls.

Step 2: Creating Index tree using embeddings

After getting the embedding of the images, we created index tree using AnnoyIndex from annoy. This index tree will we used to get similar images at the time of APIs calling.

Timeline

  1. Investigate the best tools and methods to efficiently compute image similarity at scale
  2. Implement a first prototype for large-scale image similarity
  3. Iterate on the prototype and implement a public-facing tool

Experiments

Experiment 1: Model selection for Embedding generation

We mainly want to test EfficientNet against the ResNet50, that is why we only took variants of these models.
Observation: While, doing experiment we observe one strange thing variants of EfficientNet taking much time to load on memory otherwise they are faster in terms of model inference.

Main Result of Experiment
nameparametersimagestime_took(sec)testing_imageoverall_acc
EfficientNetV2B312930622101733298.12573432922363118048.42%
EfficientNetB417673816101733543.248486995697118042.33%
ResNet50V123561152101733187.57541871070862118040.51%
EfficientNetB764097680101733867.1762790679932118041.60%
ResNet50V223564800101733365.39129114151118040.14%

Experiment 2: Finding the scalability of AnnoyIndex tree

Experiment 3: Selecting the n_components of PCA for embeddings

To selected the best PCA n_components, we experimented with different models by selecting different n_components. To compare different models with different PCA components we used to matrix
1. class_accuracy and
2. overlap_acc
.

Columns of Table

1. name: Name of the model used in the experiment.
2. n_components: It defines the number of principle components we used for PCA.
3. variance_ratio: It is the ratio of the total information to the information represented by principle components of PCA on images.
4. class_accuracy: It the the accuracy of the predicted class of top 10 predicted image of belongs to same class as the source image. It is done for all images in testing dateset.
5. overlap_acc: It is quite interesting, start with assumption that images predicted by tree(annoy) of original images(let's say ideal_list) without any PCA is ideal. Now, images predicted after applying PCA with different n_components, if image present in ideal_list, we will call it Hit otherwise Miss, then sum of all hits divided by total predicted images(we took 100) gave us this accuracy for that n_component of PCA.
6. tree_size: We also keep track of size of tree(annoy) created after applying PCA.

EfficientNetB3V2 PCA Experiment
namen_componentsvariance ratioclass_accuracyoverlap_acctree_size
EfficientNetB3V2-1100.00%48.42%100.00%726.94MB
EfficientNetB3V2102494.47%48.25%81.85%525.67MB
EfficientNetB3V251283.62%47.96%80.22%334.65MB
EfficientNetB3V225673.22%47.50%77.42%240.83MB
EfficientNetB3V212860.28%46.32%69.26%194.31MB
EfficientNetB3V26446.31%40.97%52.70%166.54MB
EfficientNetB4 PCA Experiment
namen_componentsvariance ratioclass_accuracyoverlap_acctree_size
EfficientNetB4-1100.00%42.33%100.00%825.95MB
EfficientNetB4102493.45%42.32%85.10%526.46MB
EfficientNetB451285.37%42.25%84.09%336.73MB
EfficientNetB425674.33%42.08%76.84%241.28MB
EfficientNetB412858.49%40.48%67.35%194.53MB
EfficientNetB46443.50%34.98%50.88%168.76MB
EfficientNetB7 PCA Experiment
namen_componentsvariance ratioclass_accuracyoverlap_acctree_size
EfficientNetB7-1100.00%41.60%100.00%1132.17MB
EfficientNetB7102492.78%41.69%85.67%532.97MB
EfficientNetB751287.51%41.90%85.05%343.51MB
EfficientNetB725676.62%40.76%75.54%250.43MB
EfficientNetB712860.11%38.07%63.64%201.94MB
EfficientNetB76444.96%36.08%53.85%174.78MB
ResNet50V1 PCA Experiment
namen_componentsvariance ratioclass_accuracyoverlap_acctree_size
ResNet50V1-1100.00%40.51%100.00%917.29MB
ResNet50V1102493.57%40.31%83.25%523.87MB
ResNet50V151284.63%40.10%80.77%331.79MB
ResNet50V125673.91%39.24%74.97%238.33MB
ResNet50V112861.83%38.88%69.67%190.20MB
ResNet50V16449.22%35.58%55.98%162.61MB
ResNet50V2 PCA Experiment
namen_componentsvariance ratioclass_accuracyoverlap_acctree_size
ResNet50V2-1100.00%40.14%100.00%916.58MB
ResNet50V2102495.73%40.25%82.71%526.60MB
ResNet50V251286.23%39.92%80.06%334.51MB
ResNet50V225674.25%39.01%73.46%239.71MB
ResNet50V212860.57%38.21%66.84%192.16MB
ResNet50V26446.84%35.47%55.34%164.16MB

Experiment 4: End-to-End(in reference to Time) testing of models

In this experiment what we want to test, is mainly timing of the model. Like how much time it will take to generate result. In this experiment time is calculated from passing to url to getting the recommendation of similar images.

PIPELINE FOLLOWED

1. Downloading of image from the URL and loading it into numpy array.
2. Loading of the PCA.
3. Loading of the tree(annoy).
4. Calculation of the embeddings of image.
5. Applying of the PCA to embeddings.
6. Getting the ids of recommended images.
7. Fetching the URL of recommended image from CSV file.
NOTE: We preloaded the model into RAM.

EfficientNetB3V3 End-to-End
namen_componentsalready_loadeditertime_took(sec)
EfficientNetB3V2-1True12.993394613265991
EfficientNetB3V2-1True20.8071532249450684
EfficientNetB3V2-1True30.8314507007598877
EfficientNetB3V21024True12.900543689727783
EfficientNetB3V21024True20.8321168422698975
EfficientNetB3V21024True30.8280463218688965
EfficientNetB3V2512True12.4594342708587646
EfficientNetB3V2512True20.8171608448028564
EfficientNetB3V2512True30.8010208606719971
EfficientNetB3V2256True11.8577890396118164
EfficientNetB3V2256True20.8222873210906982
EfficientNetB3V2256True30.8436849117279053
EfficientNetB3V2128True10.8022160530090332
EfficientNetB3V2128True20.7558486461639404
EfficientNetB3V2128True30.7694368362426758
EfficientNetB3V264True10.8248932361602783
EfficientNetB3V264True20.8313982486724854
EfficientNetB3V264True30.7825984954833984
EfficientNetB4 End-to-End
namen_componentsalready_loadeditertime_took(sec)
EfficientNetB4-1True131.089104413986206
EfficientNetB4-1True20.8255059719085693
EfficientNetB4-1True30.7638154029846191
EfficientNetB41024True10.8171260356903076
EfficientNetB41024True20.8361170291900635
EfficientNetB41024True30.8306858539581299
EfficientNetB4512True10.7886874675750732
EfficientNetB4512True20.8270878791809082
EfficientNetB4512True30.7964653968811035
EfficientNetB4256True10.7913780212402344
EfficientNetB4256True20.7964553833007812
EfficientNetB4256True30.7966964244842529
EfficientNetB4128True10.7879147529602051
EfficientNetB4128True20.8378396034240723
EfficientNetB4128True30.8303694725036621
EfficientNetB464True10.7877256870269775
EfficientNetB464True20.7898170948028564
EfficientNetB464True30.7847003936767578
EfficientNetB7 End-to-End
namen_componentsalready_loadeditertime_took(sec)
EfficientNetB7-1True17.930605411529541
EfficientNetB7-1True20.835761308670044
EfficientNetB7-1True30.8150243759155273
EfficientNetB71024True10.813317060470581
EfficientNetB71024True20.8263413906097412
EfficientNetB71024True30.8415734767913818
EfficientNetB7512True10.8173937797546387
EfficientNetB7512True20.8228459358215332
EfficientNetB7512True30.8680799007415771
EfficientNetB7256True10.8364121913909912
EfficientNetB7256True20.8222959041595459
EfficientNetB7256True30.8135280609130859
EfficientNetB7128True10.787550687789917
EfficientNetB7128True20.8863904476165771
EfficientNetB7128True30.8605077266693115
EfficientNetB764True10.8497884273529053
EfficientNetB764True21.692915678024292
EfficientNetB764True30.8910524845123291
ResNet50V1 End-to-End
namen_componentsalready_loadeditertime_took(sec)
ResNet50V1-1True11.8402047157287598
ResNet50V1-1True20.8064224720001221
ResNet50V1-1True30.7804932594299316
ResNet50V11024True10.8325328826904297
ResNet50V11024True20.7556729316711426
ResNet50V11024True30.7284393310546875
ResNet50V1512True10.7692885398864746
ResNet50V1512True20.7979185581207275
ResNet50V1512True30.7842895984649658
ResNet50V1256True10.8078567981719971
ResNet50V1256True20.7786800861358643
ResNet50V1256True30.7741687297821045
ResNet50V1128True10.8202202320098877
ResNet50V1128True20.8022265434265137
ResNet50V1128True30.7970542907714844
ResNet50V164True10.7963719367980957
ResNet50V164True20.8078024387359619
ResNet50V164True30.8760688304901123
ResNet50V2 End-to-End
namen_componentsalready_loadeditertime_took(sec)
ResNet50V2-1True11.9190337657928467
ResNet50V2-1True20.8073186874389648
ResNet50V2-1True30.8345324993133545
ResNet50V21024True10.8155877590179443
ResNet50V21024True20.8098921775817871
ResNet50V21024True30.8261690139770508
ResNet50V2512True10.8170630931854248
ResNet50V2512True20.7819721698760986
ResNet50V2512True30.7720069885253906
ResNet50V2256True10.7744569778442383
ResNet50V2256True20.8012771606445312
ResNet50V2256True30.7616298198699951
ResNet50V2128True10.7649922370910645
ResNet50V2128True20.7748057842254639
ResNet50V2128True30.7836484909057617
ResNet50V264True10.7733273506164551
ResNet50V264True20.7945435047149658
ResNet50V264True30.8264520168304443

We also observe same thing which we discussed in Experiment 1.
RESULT: We decided to go with EfficientNetB3V2 with PCA(n_components=256).

Experiment 5: accuracy of selected model with different classes

After selecting the model we decided to do one class accuracy check with model.

EfficientNetB3V2: Classes-Accuracy
categoryaccuracy
Architecture12.40%
Drawings32.80%
Flags42.50%
Fossils61.10%
Paintings32.20%
Politician48.10%
Sculptures21.80%
aircraft69.20%
apple70.30%
banner27.10%
bed49.10%
bicycle56.90%
bird54.50%
boat45.70%
book39.50%
bottle62.90%
brick25.40%
bridge49.80%
building13.80%
bus67.60%
cake46.90%
car73.90%
cat73.60%
chair43.50%
clothes34.90%
cow54.30%
dog51.30%
door43.50%
fence18.10%
fire_hydrant53.00%
flower37.90%
fog26.50%
food20.80%
fruit40.00%
furniture37.10%
hill39.80%
horse56.30%
house24.20%
leaves37.80%
light60.10%
motorcycle82.80%
mountain33.60%
mud38.20%
paper83.30%
person13.80%
plastic35.80%
potted_plant69.70%
river31.60%
road48.30%
rock28.40%
roof25.00%
rug57.70%
sand31.40%
sea29.70%
sheep65.00%
shoe62.60%
sky15.50%
snow30.80%
stairs23.80%
stone28.60%
street_sign74.50%
table34.20%
textile50.90%
tile41.10%
toilet74.40%
train73.50%
tree22.10%
truck55.30%
vase62.10%
vegetable49.30%
wall19.20%
water21.80%
water_droplets53.10%
window31.30%
wood30.20%
Overall Accuracy43.80%

After seeing these low accuracy of classes, we were confident that there is something wrong class labels, to confirm this we perform one more experiment Experiment 6

Experiment 6: Testing model with critical classes(having low accuracy)

These are some sample images of this experiment where we can easily observe that although images are similar to each other content-wise but they are present under different categories.

Food
Similar Images to Food category using EfficientNetB3

Above (source) image belongs to Food category, notice last three images contains food but they belongs to different category which was the cause of low score in preview experiment.

Similar Images to House category using EfficientNetB3
Similar Images to House category using EfficientNetB3

This is second sample of recommendation (source)image belongs to House category.

Spark: Recommendation Tree Generation

1. PCA generation for mapping embeddings to [256,] vector

When we were starting with this step there were mainly two choices available -
1. pyspark.ml.feature.PCA
2. sklearn.decomposition.IncrementalPCA
and both were have there problems -
The problem with pyspark.ml.feature.PCA was that it required spark context everytime even at the time of inference which was not possible in our case. And, the problem with sklearn.decomposition.IncrementalPCA was that it can't be distributed with spark.
We choose sklearn.decomposition.IncrementalPCA because we could fill it with batches of data, which will reduce the spark efficiency but it would not be required spark context at the time of inference which was required in our case.

# Created the instance of PCA with 256 components
pca = IncrementalPCA(n_components=256, batch_size=50000)

# this function will fill the batches of data to pca
def fill_pca(df):
    global pca
    df_list = df.collect()
    features = []
    for row in tqdm(df_list):
        features.append(list(row['features']))
    arr = np.array(features)
    pca = pca.partial_fit(arr)

These are the part of code which is used to fill pca with required data batch by batch

with open('pca256.pkl', 'wb') as f:
    pickle.dump(pca, f)

At the end we just dumped the pca instance for using it later for PCA generation of embeddings.

2. Index to URL map file generation

OK, we had the small problem is that annoy will return index of the matched embeddings of image, but the limitation with annoy was that only integers can be used as index but we needed URL of the matched image.
SO, to solve this problem we came up with a map which will accept index of the image and return URL of the image. To generate this map we used following code -

tree = AnnoyIndex(256, 'angular')

# will work as map file for index->URL
d = dict()

ite = 0
def fill_tree(df, ite):
    df_list = df.collect()
    # following will iterate batch row by row
    for row in tqdm(df_list):
        data = np.array(list(row['features'])).reshape(1, -1)

        # this will map image_url from index  
        d[ite] = row['image_url']

        tree.add_item(ite, pca.transform(data).reshape(-1))
        ite += 1
    return ite

At last we dumped the map file to use at the time of inferencing the API.

with open('./idx2url.pkl', 'wb') as f:
    pickle.dump(d, f)

3. Annoy tree generation for recommendation

After generating pca and idx2url map, it was time to complete the final step of generating annoy tree for making recommedation of similar image. Nothing much left to tell about annoy as we discussed every thing about this in this document, we will direct jump to code we use to fill embedding to tree:

# this is a instance of annoy tree
tree = AnnoyIndex(256, 'angular')
d = dict()
ite = 0

# this function will fill embeddings to tree row by row
def fill_tree(df, ite):
    df_list = df.collect()
    for row in tqdm(df_list):
        data = np.array(list(row['features'])).reshape(1, -1)
        d[ite] = row['image_url']
        
        # notice we are adding embeddings to tree after applying pca on it 
        tree.add_item(ite, pca.transform(data).reshape(-1))

        ite += 1
    return ite

In order to understand why 100 is passed in build method, you can refer to following paragraph which was taken from the annoy github page.
a.build(n_trees, n_jobs=-1) builds a forest of n_trees trees. More trees gives higher precision when querying. After calling build, no more items can be added. n_jobs specifies the number of threads used to build the trees. n_jobs=-1 uses all available CPU cores.

We saved tree to use it in API

tree.build(100)
tree.save('./tree.ann')

4. Testing everything till now, before moving further

# We took url of image
url = 'https://www.malaysia-today.net/wp-content/uploads/2019/11/Taj-Mahal-Agra-India.jpg'
# downloading image data form URL
img_data = urllib.request.urlopen(url).read()
# image data to image conversion
image = generate_image(img_data)

# first we passed image to model(efficientNetB3)
features = model.predict(np.expand_dims(image, axis=0))
# here we are applying pca to embeddings
feature_pca = loaded_pca.transform(features)

# using tree to generate index of similar images which then mapped to image_url
res = [d[i] for i in tree.get_nns_by_vector(feature_pca.reshape(-1), n=10)]
Similar Images
Similar Images

Codes

API Codes Please refer to this page
UI Codes Check this out

Results

You can test result by yourself: []

On Phabricator

Follow this task

References

    Category:2022 projects Category:Active research projects