Commons:Wise

Overview

WISE is a semantic search tool for images and videos on Wikimedia Commons.

It currently searches only within Media of the Day (around 5,000 videos). The search works purely on the visual content of the media files, and does not use metadata such as filenames, descriptions, or structured data.

The tool also supports face detection and recognition, allowing users to find where a person appears across videos.


Features

Semantic visual search using a natural-language query
Multilingual search example returning relevant visual results

Search for images and videos using natural language descriptions. WISE interprets the visual content of media files directly.

Examples:

  • man at a train station
  • horse in an airplane
  • man with a flower
  • pirate with a pistol

Queries can be written in multiple languages. Try queries in your own language. For example:

  • Hindi: एक व्यक्ति रेलवे स्टेशन पर (means "man at a train station")
  • Telugu: విమానంలో గుర్రం (means "horse in an airplane")

Face search across Commons media using a reference image

Locate a specific person across images and videos using a reference image.

How to use

  1. Select Faces mode
  2. Click the green Image button
  3. Paste an image URL containing a clear face

The system detects faces, matches identities, and returns results including timestamps for videos.
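As a rough illustration of how per-frame face matches could be turned into the time ranges where a person appears, here is a minimal sketch. It assumes matches arrive as frame timestamps in seconds and that frames are sampled at a fixed 0.5-second interval; the function name `merge_into_segments` and the sample data are hypothetical and not taken from the WISE code.

```python
def merge_into_segments(timestamps, interval=0.5):
    """Merge frame timestamps into (start, end) segments.

    Frames that are at most one sampling interval apart are treated as
    the same continuous appearance (an assumption for this sketch).
    """
    segments = []
    for t in sorted(timestamps):
        if segments and t - segments[-1][1] <= interval + 1e-9:
            segments[-1][1] = t          # extend the current segment
        else:
            segments.append([t, t])      # start a new segment
    return [tuple(s) for s in segments]


# Hypothetical timestamps (seconds) of frames where the reference face matched.
matches = [12.0, 12.5, 13.0, 40.5, 41.0]
segments = merge_into_segments(matches)
print(segments)
```

With this sample input, the person would be reported in two ranges of the video rather than as five separate frames.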


Search within audio files to locate relevant segments based on spoken content or sound.

Note: Coverage is currently limited and under active development.


How it works

WISE uses multimodal machine learning to generate embeddings (numerical representations of content) and performs similarity search over them. The components are configurable; the instance currently running at wise.wmcloud.org uses the following:

Component          Details
Frame extraction   ~1 frame every 0.5 seconds from video
Visual embeddings  OpenCLIP (ViT-B-16-SigLIP2-512/WebLI)
Face embeddings    InsightFace (buffalo_l)
Similarity search  Index IVFFlat (Inverted File Flat) with FAISS
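The idea behind an inverted-file (IVF) flat index can be sketched in a few lines of plain Python. This is a toy illustration, not the FAISS API and not the WISE implementation: the class `TinyIVFFlat`, the fixed centroids, the two-dimensional vectors, and the frame IDs are all assumptions made for the example. A real index learns its centroids by clustering and works with much higher-dimensional embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class TinyIVFFlat:
    """Toy inverted-file index: vectors are bucketed by their nearest centroid."""

    def __init__(self, centroids):
        self.centroids = [normalize(c) for c in centroids]
        self.lists = [[] for _ in centroids]  # one inverted list per centroid

    def _nearest_centroids(self, v, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda i: dot(v, self.centroids[i]), reverse=True)
        return order[:n]

    def add(self, item_id, vector):
        v = normalize(vector)
        cell = self._nearest_centroids(v, 1)[0]
        self.lists[cell].append((item_id, v))

    def search(self, query, k=3, nprobe=1):
        """Compare the query only against vectors in the nprobe closest cells."""
        q = normalize(query)
        candidates = []
        for cell in self._nearest_centroids(q, nprobe):
            candidates.extend(self.lists[cell])
        scored = sorted(((dot(q, v), item_id) for item_id, v in candidates),
                        reverse=True)
        return [(item_id, score) for score, item_id in scored[:k]]


# Hypothetical frame embeddings, keyed by "video@timestamp".
index = TinyIVFFlat(centroids=[[1.0, 0.0], [0.0, 1.0]])
index.add("video_a@0.5s", [0.9, 0.1])
index.add("video_a@1.0s", [0.8, 0.3])
index.add("video_b@2.0s", [0.1, 0.95])

results = index.search([1.0, 0.05], k=2, nprobe=1)
print([item_id for item_id, _ in results])
```

Searching only the closest cells is what makes this kind of index faster than comparing the query against every vector, at the cost of occasionally missing a match stored in a cell that was not probed.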


Known limitations

  • Searches only Media of the Day (~5,000 videos)
  • Results may sometimes be inaccurate or unexpected
  • Some videos (e.g., OGV format) may not play in all browsers
    • Workaround: open the file page on Wikimedia Commons
  • Search results are a ranking, not a set of confirmed matches. If no file truly matches the query, you still get results ordered by how close they are to it. Similarly, even when correct matches exist, results further down the list will eventually stop being relevant.
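The ranking behaviour described in the last point can be illustrated with a short sketch. The scores, the frame names, and the `min_score` cut-off are hypothetical; nothing here implies that WISE applies such a threshold.

```python
def top_k(scored_items, k):
    """Return the k highest-scoring items, regardless of how good they are."""
    return sorted(scored_items, key=lambda item: item[1], reverse=True)[:k]


def filter_by_score(ranked, min_score):
    """Drop results below a caller-chosen similarity threshold."""
    return [(name, score) for name, score in ranked if score >= min_score]


# Hypothetical similarity scores for a query with only one good match.
scores = [("frame_1", 0.31), ("frame_2", 0.12), ("frame_3", 0.09), ("frame_4", 0.05)]

ranked = top_k(scores, k=3)                          # always 3 results
confident = filter_by_score(ranked, min_score=0.25)  # only the strong match
print(ranked)
print(confident)
```

A plain top-k ranking returns three frames here even though two of them are poor matches, which is why the lower part of a result list should be read with caution.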

Planned improvements

  • Index all Wikimedia Commons images
  • Suggest similar images during upload
  • Assist with metadata (categories, filenames)
  • Improve video playback experience

Community interest and support

This section is intended to gauge community interest in the project and help guide future development.

Currently, WISE indexes only Media of the Day (~5,000 videos). Expanding this to the full Wikimedia Commons collection (millions of images) requires significant computational resources.

In particular:

  • Processing ~5+ million files requires large-scale GPU computation
  • This is a costly operation in terms of infrastructure and maintenance
  • Scaling also involves storage, indexing, and ongoing updates

Because of this, we would like to understand:

  • Is this tool useful to the community?
  • Should effort and resources be invested to scale it further?
  • Are there concerns that should be addressed before expanding?

Support

  1.  Support The tool improves content discovery and fills an important gap in Commons search. Gopavasanth (talk) 23:12, 2 May 2026 (UTC)
  2.  Support can't wait to try it with all pictures Nat (WDU) (talk) 09:09, 3 May 2026 (UTC)
  3.  Support obviously very useful. Not sure what this vote is for though. Also doubtful whether facial search is a good idea to include. Only truly useful if covers nearly all files (or videos) on Commons. --Prototyperspective (talk) 17:13, 4 May 2026 (UTC)
    @Prototyperspective, I’ve updated the section to clarify why we are requesting support and interest. Gopavasanth (talk) 08:14, 5 May 2026 (UTC)
  4.  Support First of all it's already pretty cool to use and quite impressive, in particular the frame-wise overview of the matching video stills! Being able to more easily find relevant scenes in video frames is super useful, considering videos are often only described as a whole, but not scene-wise. Furthermore, it could be quite useful for suggesting categories/depicts for media that are either uncategorized or poorly categorized, or for new uploads. Also it could possibly help people identify locations/art/objects via the image-input, by returning depicted items (P180) of the matching images (assuming it is accurate enough). Ultimately this would be great to have available as an API-action as well for the same reasons. It could be useful, if technically feasible, to include structured data (P180, P1071) in the training or otherwise correlate them, so that searching by wikidata item would be possible. I can't say anything about the involved costs and if it would be worth it, but I would find the feature definitely useful and interesting to have. Nylki (talk) 18:22, 5 May 2026 (UTC)
  5.  Support interesting project Amrit Sufi (talk) 06:16, 6 May 2026 (UTC)

Oppose


Neutral / Questions

  1. Perhaps it would be a good idea to make the project, and this page too, multilingual, so we can gather a wider range of opinions and raise awareness of the project.
P.S. This text was generated using deepl.com. AnBuKu (talk) 12:47, 5 May 2026 (UTC)

General feedback

I must say I'm quite impressed. I tried to come up with some intentionally weird queries (e.g. "window in the ceiling") and got surprisingly good results. That said, I noticed a few things that can be improved.

  • I cannot link to search results. https://wise.wmcloud.org/?q=example for example doesn't work.
  • My browser history is also broken. I cannot go back to a previous search result.
  • From the perspective of a Wikipedia editor my first idea would be to find good illustrations and photos for articles. Not videos. While being able to visually search video content is an absolutely amazing feature, I think it's not as important as visually searching the 120 million photos we have.
  • In some of my queries (e.g. "the oldest and most weird woodworking tool ever") I noticed a lot of frames that are taken in the exact middle of a cross-fade. This is arguably not terribly useful as such blended frames are not really part of the "content" of a video but merely a visual glitch as a result of how the content is presented. I believe it should be possible to detect cross-fades and exclude them.

--Thiemo Kreuz (WMDE) (talk) 09:49, 13 May 2026 (UTC)



Category:Commons tools Category:Wikimedia tools Category:Wikimedia Commons search