
for older archives, see User talk:Fæ/2021, User talk:Fæ/2020, User talk:Fæ/2019, et seq.

Old projects

Scan released last year at the Internet Archive

New upload of 1887 address book for Riga, Latvia. (800mb).

I had a look back at the IA upload project and realized that none of these scripts run because of Python, Pywikibot and internetarchive changes. It turns out I find it quite difficult to remember almost anything about these projects too. So don't be surprised if I'm testing it out and there are flaws. I'll do my best to repair any oddities. I'm seeing '+99' notices from my account which don't clear; this could be a wm bug for big numbers, so I might not notice changes. It's not deliberate. For the moment please don't ask me to take on large projects, I'd rather pace myself at 'slow'. Fæ (talk) 13:07, 2 June 2026 (UTC)

Thanks, and welcome back! — 🇺🇦Jeff G. ツ please ping or talk to me🇺🇦 08:11, 3 June 2026 (UTC)

Another new "feature" of the Wikimedia API is ratelimiting. Added a couple of slow down precautions including slapping down multiprocessing, but it's definitely dogging uploads despite being visible to the API as an established user. It may be necessary to revisit the throttling system rather than bumping into it. It's sad this creates extra work for volunteers.
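As a rough illustration of the kind of slow-down precaution mentioned above, here is a minimal sketch of retrying with a growing pause. It is not the actual upload script: `RateLimited` and `call_with_backoff` are hypothetical names, and a real run would need to map whatever error the upload library raises onto this pattern.

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised by an upload call when the API says 'slow down'."""


def call_with_backoff(action, max_tries=5, base_delay=2.0, sleep=time.sleep):
    """Run action(); on RateLimited wait base_delay, then double it, and retry."""
    delay = base_delay
    for attempt in range(1, max_tries + 1):
        try:
            return action()
        except RateLimited:
            if attempt == max_tries:
                raise  # give up and let the caller decide what to do
            sleep(delay)
            delay *= 2
```

Passing `sleep` as a parameter is only there so the waits can be replaced with a stub when testing.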

For the first time the queries found restricted items at Internet Archive like 1826histoirenumismatiquedelare, where the Washington University appears to be claiming copyright in a 200 year old publication. Good grief, I hope this is not a trend that IA is tolerating. --Fæ (talk) 07:09, 5 June 2026 (UTC)

I have updated a process for finding the text of Google cover pages in the recent uploads. The 'pending' queue is at Category:OCR detected cover page. Yet to update the page removal process. There's no hurry and I will get to this slowly. --Fæ (talk) 06:29, 8 June 2026 (UTC)

Note that larger files, like the 800mb PDF transcluded, are now possible thanks to limits changing. The larger files might cause things to break, sometimes in predictable ways like the SHA check from mediawiki taking some time to process and be available on the system. Behind the scenes, these uploads do not behave well and invariably fail to report back that they are successfully uploaded. Fæ (talk) 14:26, 17 June 2026 (UTC)

Welcome back

As you probably have worked out, I've been following this talk page, checking whether DRs of your uploads are valid or not, and responding to the DR with a rationale if I think something should be kept. May I assume you are sufficiently "back" that I can let go of monitoring that?

Welcome back, in any case. - Jmabel ! talk 14:06, 3 June 2026 (UTC)

If you wish, though I'll probably continue to ignore most DRs and let others chip in. After your first million images, the DRs that matter are ones that are a meaningful case study for thousands of others and could be collated with some automated method.

I have noticed your patient work and very much appreciate it! Fæ (talk) 15:21, 3 June 2026 (UTC)

Yes, welcome back! So glad to see this. Krok6kola (talk) 16:06, 3 June 2026 (UTC)
It's been almost 5 years. I hope you had a good time! Welcome back. - Alexis Jazz ^{ping plz} 10:03, 5 June 2026 (UTC)

@Fæ Yes, happy to see you here again :) --PantheraLeo1359531 😺 (talk) 09:32, 8 June 2026 (UTC)

Good to see you again, also from me. -- Deadstar (msg) 10:27, 9 June 2026 (UTC)

Marvellous to see you editing again. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 15:15, 9 June 2026 (UTC)

IA uploads

Hi, I see that you have restarted uploading PDF from Internet Archive. FYI, I created a bot which fixed the 21,000+ files you uploaded and were in Category:Book scans with Google Books cover sheets (to remove). Could you please add files with Google cover pages to this category, so that the cover pages can be removed?

I think that all PDF files without any meaningful category are not very useful. If you can't add category, you should at the very least add them to a "PDF files needing category". Thanks, Yann (talk) 19:53, 8 June 2026 (UTC)

See Commons:IA_books#Automatic_detection_and_deletion_of_cover_pages and the section above on this talk page where I mention OCR. The use of pytesseract is slightly updated to be more accurate and has already detected many pdfs with secondary pages that were missed. The idea is to automate secondary detection when stripping each cover, as some have an English google cover followed by a German one. This is in testing, so a complete run and automated stripping are not ready yet due to false positives and it would be nice not to rely on manual double checking.
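To make the secondary-page idea concrete, here is a small sketch of the text check only. It assumes the OCR step (pytesseract, as mentioned above) has already produced one string per page. The marker phrases, `cover_language` and `count_leading_covers` are illustrative assumptions, not the phrases or functions used in the real process.

```python
# Marker phrases are assumptions for illustration; a real run would need
# phrases checked against actual OCR output in each language.
COVER_MARKERS = {
    'en': ('digital copy of a book', 'scanned by google'),
    'de': ('digitales exemplar eines buches', 'von google'),
}


def cover_language(ocr_text):
    """Return the language code whose markers all appear in the text, else None."""
    text = ' '.join(ocr_text.lower().split())  # fold case and collapse whitespace
    for lang, markers in COVER_MARKERS.items():
        if all(marker in text for marker in markers):
            return lang
    return None


def count_leading_covers(page_texts, max_check=3):
    """Count consecutive cover pages at the start of a PDF (e.g. English then German)."""
    count = 0
    for text in page_texts[:max_check]:
        if cover_language(text) is None:
            break
        count += 1
    return count
```

Requiring every marker for a language to match is one way to cut down on false positives, at the cost of missing pages where the OCR is poor.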

Automated categorization needs careful handling due to IA inconsistencies. Though a collection may have standard 'tags', this varies wildly. This is discussed at the IA book project page and it makes sense to revisit those same formulas to both encompass new releases at IA to those collections and think of including more. The IA uploads have several hidden categories, like 'Old books from American libraries' which acts as a default uncategorized category, and as anyone can use Petscan to list those with no visible categories, it seemed unnecessary. As you suggest it, we can add Category:Uncategorized images but it may lead to complaints about burdening others by appearing to leave hundreds of thousands in the queue, or duplicate categories being created because the categories available for old books are already complex.

None of this will be fast, weeks not days, time at the keyboard has to be in small doses these days. Fæ (talk) 08:22, 9 June 2026 (UTC)

Hi, Thanks for your answer. Yes, cover pages detection is a bit tricky. I did have to make a lot of trial-and-error tests. My script now detects cover pages properly for most cases, except some books in Portuguese, Scandinavian, and Slavic languages. In these cases, I put back the file in the category, and the second page is then removed. Hopefully this only concerns a few books (maybe around 1%). Please see Special:ListFiles/YannBot.

For the categories, there is Category:IA uploads needing categories. Regards, Yann (talk) 09:31, 9 June 2026 (UTC)

BTW, I don't know if you have noticed, there is a nasty bug affecting PDF files for several months: phab:T420341. You may want to add your opinion there. Thanks, Yann (talk) 16:23, 9 June 2026 (UTC)

The zero sized PDF was a long running problem. Were one to be debugging it, a start could be pulling the list of all pdfs with zero size matches, or pageid==0. At the time the presumption was a fundamental problem with the way the WMF servers were using off-the-shelf pdf handling and even if WMF devs looked at it, the solution might be a dirty work around rather than a rewrite of anything. The numbers involved were so small in proportion that it would be okay to shove these in a housekeeping category for manual attempts at reuploading or recoding the source pdf. --Fæ (talk) 08:28, 10 June 2026 (UTC)
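A sketch of what pulling that list might look like once file metadata has been fetched. It assumes each record is a plain dict with `title`, `width` and `height` keys (for example flattened from an image info query); that record shape and the function name `zero_size_pdfs` are assumptions for illustration.

```python
def zero_size_pdfs(records):
    """Return titles of PDF records whose reported width and height are both 0."""
    return [
        rec['title']
        for rec in records
        if rec['title'].lower().endswith('.pdf')
        and rec.get('width', 0) == 0
        and rec.get('height', 0) == 0
    ]
```

The resulting titles could then feed a housekeeping category or a purge-and-recheck pass.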

This issue is never with the files themselves. They have changed the configuration, so now purging always displays them properly. I did a kind of off-the-shelf survey among the 20,000+ files I have processed. There is no discernible pattern, but it appears more often with big files than smaller ones. Yann (talk) 08:36, 10 June 2026 (UTC)

Pleasingly, the Google cover page detection is going well behind the scenes. There's no rush, trying to stick to my keyboard time limits, so making sure this can run using the equivalent of a local ramdisk for processing and reuploading and preferably with one touch on the files. Seeing German secondary pages fairly rarely and want to ensure there are not other variations to detect. It might make sense to invert the teapot and prove there's no secondary cover pages and process those first anyway. --Fæ (talk) 09:10, 10 June 2026 (UTC)

In related questions: perhaps you have useful insights for this topic Commons:Village_pump/Proposals#Leveraging_SD_for_book_categorizations_(PDFs,_déjavus,_categories,_). I wouldn't know where to start. -- Deadstar (msg) 10:36, 10 June 2026 (UTC)

See Commons:Bots/Work_requests#Book_renaming and the section on categorization or diffusion projects at COM:IA books. The biggest impact would be collection (often specialist library) related categories, though author might be practical if the variations in name could be corrected for by a bio database or wikidata. Taxonomies are a rabbit hole worth avoiding, so trying to extract topics and using those for categories might be a mistake unless there are specific obvious needs with large results, like medical texts of the 19th century. This might be too big a project for automation right now, so avoiding giving thoughts in the discussion, but it is of interest longer term. --Fæ (talk) 11:50, 10 June 2026 (UTC)

@Yann: Having run into the 0x0 bug for trimmed pdfs like this one (it looks okay now, but after overwriting this was returning a 0x0 size), it is correct that it can be fixed manually using the url based purge parameter. However when automated by using a pywikibot call, this returns the error "Fæ" does not have required user right "purge"; which seems weird when nothing stops this being done manually. Is this a right that can be added or readded? It could be bundled with something else as it does not appear in the list of groups. Meanwhile that job is paused out of caution. --Fæ (talk) 18:56, 10 June 2026 (UTC)

OMG, this is a complex area. The 'purge' right is implicit, but the invocation within pywikibot seems the issue. Don't worry about it, but it shows the 0x0 thing is muddy. Fæ (talk) 19:08, 10 June 2026 (UTC)

I do it with my bot with curl -s -d "action=purge" -d "titles=File:$filename" -d "format=json" "https://commons.wikimedia.org/w/api.php". Yann (talk) 19:16, 10 June 2026 (UTC)

Just in case anyone else runs into this, in pywikibot speak it looks like:

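from pywikibot.data import api

# Assumes site (a pywikibot Site for Commons) and file_page (a FilePage
# for the overwritten PDF) have already been created earlier in the script.
purge_req = api.Request(
    site=site,
    parameters={
        'action': 'purge',
        'titles': file_page.title(),
        'forcelinkupdate': True,
    },
)
purge_req.submit()  # sends action=purge straight to the API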

This fix seems to work and I don't have an insight into why large pdfs might be unaffected. It's annoying that the size verification loop has to be on the (volunteer) uploader side, rather than reliably being immediately flagged and repaired on the (WMF) server side. It does seem the error is rare and has not appeared on pdfs even over 800 pages. --Fæ (talk) 19:57, 10 June 2026 (UTC)
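For anyone wanting the shape of an uploader-side size verification loop, here is a minimal sketch. It is not the script in use: `ensure_nonzero_size` is a hypothetical helper, and `get_size` and `purge` are placeholders for whatever calls fetch the current dimensions and trigger the purge.

```python
import time


def ensure_nonzero_size(get_size, purge, max_tries=3, wait=5.0, sleep=time.sleep):
    """Return True once get_size() reports a non-zero (width, height).

    After each 0x0 reading, call purge() and wait before checking again.
    """
    for _ in range(max_tries):
        width, height = get_size()
        if width > 0 and height > 0:
            return True
        purge()
        sleep(wait)
    return False
```

A `False` result would be the point to park the file in a housekeeping category rather than keep retrying.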

Mathematics Journals

Hi, Do you have in your list of future uploads these journals: Annals of Mathematics, American Journal of Mathematics? If not, could you please add them? Thanks, Yann (talk) 07:39, 11 June 2026 (UTC)

Maybe, it looks like some thought on copyright is needed. In the Annals there are multiple index only prints, which arguably could be uncopyrightable, so best to be cautious anyway, but the scans of content with papers are likely to have copyright with mathematicians either still alive or having died within the last 70 years. Only flicked through that set of digitized microfiche but not found a copyright statement and the uploader has not made a statement about copyright being expired or similar. Taking a literal interpretation might mean only taking volumes up to 1920? Perhaps a later date could be agreed on as uncontroversial in the absence of copyright statements. Nice to see some names I recognize from my time studying last century.

So long as mentioned here, it probably will not be forgotten, though the uploads might be after the current update uploads, retrospective renames and coverpage housekeeping. --Fæ (talk) 07:56, 11 June 2026 (UTC)

Category:Annals of Mathematics, Category:American Journal of Mathematics, selected before 1931 as published in USA. Not sure what the best parent cat would be, so left that for you to think about. For some reason the links in the IA description have not been beautified, but as this is a small collection (300 odd matching files), parked that for now. Rusty, forgotten entirely how this worked until reading my own breadcrumbs.

Let me know if any are missing. For some reason the location test was getting flagged as non-US and not sure if the root cause could be bad data on the IA side, or a bug on my side. Fæ (talk) 08:17, 12 June 2026 (UTC)

Thanks for uploading these. FYI: s:American Journal of Mathematics. Yes, I have found more: . Also volumes after 1900 should be PD-US-expired rather than PD-old-100-expired. Idem for s:Annals of Mathematics: . Yann (talk) 15:19, 14 June 2026 (UTC)

There was a hard coding of ignoring everything after 1925, in a super precautionary way. Now set to 1930. 18th C. is so much quieter to handle. Fæ (talk) 19:41, 14 June 2026 (UTC)

The Cambridge History of English Literature

Hi, Could you please upload all these? You can add them to Category:The Cambridge History of English Literature. We already have some books, but not a complete set in good quality. Thanks, Yann (talk) 15:27, 17 June 2026 (UTC)

The advanced IA query is inaccurate as there is no collection defined. There are a couple of mismatches that can be moved to the old American books general category unless there is some obvious better one. Fæ (talk) 19:49, 17 June 2026 (UTC)

Hi, I am not sure I understand. You mean you only upload files from US libraries? But there are books from different libraries there: California, Cornell, Princeton, etc. Hopefully there will be a complete good set among them. Do you intend to upload from other sources (Internet Archive, Toronto?). Thanks, Yann (talk) 20:16, 17 June 2026 (UTC)

Welcome Back.. Even if your time is limited..

Ongoing projects, you may be able to assist with :

Long term slow - reviewing IA files. Optionally, I got one of the link templates set up to indicate a PD status, so contributors can quickly eliminate clearly PD works from the ongoing general upload category. :)

(I will also note here, that sometimes it was for Wikisource purposes, necessary to upload a new djvu for a PDF file already present on Commons. This is not a general request but an observation. It's entirely a case by case basis when scans were examined at Wikisource, and found to be too low res in scans, due to limitations in the PDF render in Mediawiki.)

You may wish to review DR's for PDF/DJVU files I filed during your absence ( I had concerns about materials that might have been Standard Reference Data, which is not straightforwardly PD-US-Gov NIST)

You also used to run a bot that tried to match up previously uploaded image sequences with the appropriate scans?

ShakespeareFan00 (talk) 08:15, 10 June 2026 (UTC)

As said above large projects are not sensible for me to take on for now. Vaguely, there's a memory of fiddling with DJVU versions but the decision was to drop that option. Details are elusive, apart from 'djvu' being unsatisfactory for some reason, but the clue may be if the problem is MW rendering that's a WMF bug to fix rather than a source file problem. Behind the scenes there were several unpublished tests for using JPEG2000 original scans in archive quality lossless ways, but the WMF was... uninterested.

It could be possible to revisit image hashing though the use case would need to be significant and focused, as random walking hashing images from large PDFs would be painfully expensive in manual programming and machine processing. Reading this there's a memory of the BM's Flickr stream of clippings but that's thousands of clips rather than millions of archived images. --Fæ (talk) 08:38, 10 June 2026 (UTC)

Your previous bot was matching by IA identifier in source fields IIRC, it wasn't hashing in-file images :).

ShakespeareFan00 (talk) 08:44, 10 June 2026 (UTC)

That sounds much easier. The sort of thing that the tools could do without programming needed? (No expert in the 'tools', much is forgotten.) --Fæ (talk) 08:52, 10 June 2026 (UTC)

Some tools :), my approach before you wrote the previous bot was to use the search field in FlickrCommons uploaded images. When you uploaded those, I think you had a specific {{Information}} block that allowed for other images with the same IA identifier origin to be searched for :) .

ShakespeareFan00 (talk) 10:15, 10 June 2026 (UTC)

WB! ^__^

hey Fae;

if you are back in business, please could you "raid" this site:

https://xoax.net/sub_art/ref_jlg_ferris/

for the images of the paintings (all pd now), & dump them into here (primarily):

https://commons.wikimedia.org/wiki/Category:The_Pageant_of_a_Nation_(Jean_Leon_Gerome_Ferris)

& if possible here (secondarily):

https://commons.wikimedia.org/wiki/Category:Jean_Leon_Gerome_Ferris

& i'll take it from there... (just let me know when it's there? small, long term project i've been working on; & i DO NOT have your bot skills. )

best, & good to have you back.

Lx 121 (talk) 14:18, 10 June 2026 (UTC)

Had a quick look, but sorry to say the quality of these seem thumbnail sized and some are worse than what is already on Commons. I suggest double checking each using reverse image search to see if higher resolutions or better prints exist. Most of my upload projects are based on large catalogues and archives, such as the Rijksmuseum database, and for automation to be worth it, collections are several thousand archive quality scans. --Fæ (talk) 17:42, 10 June 2026 (UTC)

if you go to each painting's individual page, the main image is about "postcard-sized". - not ideal, but there are 78 paintings in the series (plus a handful of extra, related works), ALL on this one site, & we have less than 1/4 of the paintings from this set presently @ commons. hunting down better likenesses would be stage 2. having some representation of every painting in the series is stage 1 (both for commons & for the wikipedia article on the artist, this was their major "oeuvre"). xD if you can just "scrape" it & upload the files here @ wmc, i can take it from there; doing 78 (+) file saves & uploads manually would be a huge "grind". Lx 121 (talk) 18:34, 10 June 2026 (UTC)

@BMacZero: any idea whether you could easily do this, or know who could? - Jmabel ! talk 02:54, 11 June 2026 (UTC)

@Jmabel and Lx 121: I could handle that, if Lx 121 could (a) produce a list of the URLs (like https://xoax.net/sub_art/ref_jlg_ferris/pt1_abduction_poca_1612/) that don't already have a better version on Commons, perhaps on a user subpage and (b) add Artist, categories, and other metadata to the pages afterward, since it isn't available here in a structured way. – BMacZero (🗩) 15:21, 11 June 2026 (UTC)

@BMacZero: i can add the file info no problem (try to save the site's file names though, please/if possible, they seem to be based on the painting titles). i've been "low key" working on this subject for some time; just getting a complete list of the paintings in the series was difficult (until i found this one site, & i'm still trying to cross-check the scholarship/sourcing). it would be nice to have the full set/series of images in place for the big july 4th/250 (2026-07-04).

re: list - it would be easier to list the FEW that we do have here @ wmc (~18:78); or just scrape the whole collection? the handful of redundancies won't matter, i can just put them in subcats under "painting by title" (as i already have for the cases where we had more than 1 file image for a given work). Lx 121 (talk) 23:25, 15 June 2026 (UTC)

Category:Fleurons_by_year

Commons:Categories for discussion/2026/06/Category:Fleurons_by_year Themightyquill (talk) 11:25, 12 June 2026 (UTC)

File:Mural on the old Union Pacific train station in Giddings, the seat of Lee County, Texas, east of Austin LCCN2014630926.tif

Commons:Deletion requests/File:Mural on the old Union Pacific train station in Giddings, the seat of Lee County, Texas, east of Austin LCCN2014630926.tif Nv8200pa (talk) 12:33, 12 June 2026 (UTC)

File:Peanut and Texas flag art on a peanut factory in Giddings, the seat of Lee County, Texas, east of Austin LCCN2014630929.tif

Commons:Deletion requests/File:Peanut and Texas flag art on a peanut factory in Giddings, the seat of Lee County, Texas, east of Austin LCCN2014630929.tif Nv8200pa (talk) 12:34, 12 June 2026 (UTC)

Notification about possible deletion

Bundle DR:
Commons:Deletion requests/Files in Category:Lightning McQueen at Disney California Adventure Park

Affected:

File:The Disneyland Resort's "Destination- Cars Land" float in the 124th Rose Parade in Pasadena, California LCCN2013631345.tif

Yours sincerely, Grand-Duc (talk) 00:31, 13 June 2026 (UTC)

File:Brand Addiction Adicción a las Marcas (3152511010).jpg

File:Hurricane Ivan, Natural Hazards DVIDS726877.jpg

Duplicate but with a different DVIDS ID. Not sure how we should best handle the merge. - Jmabel ! talk 00:46, 15 June 2026 (UTC)

Category:Faebot identified duplicates

Category:Images from DoD uploaded by Fæ (duplicate)

Unfortunately it is a rabbit hole. As the SHA values do not match, there's no mediawiki way to find these 'almost but not quite' digitally identical copies. A lot of time was spent automating image hashing, but it is an expensive process and does not eliminate human intervention to decide the best way to merge.

Yes, the backlog really is over 20,000 files and 9 years old. However it remains a low priority issue compared to other enigmatic Commons puzzles.

WRT this example, they were released by the military one week apart according to the metadata. Picking the first one officially released on their system seems a fine choice considering neither is currently in use. Fæ (talk) 02:00, 15 June 2026 (UTC)

Semi-related: do you know about Commons:International Standard Content Code? Might be relevant for some of the work you do. - Jmabel ! talk 19:05, 15 June 2026 (UTC)

This is interesting, thank you. As it seems to work as a Hamming space, so gives "distance", this is presumably a type of image hash but seems to be used as a fingerprint, which may not be guaranteed with other methods. Looking at the database website though, it might be on an indefinite beta status, but worth a bit of research and reading for my education. Fæ (talk) 08:13, 16 June 2026 (UTC)
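To show what "distance" means for hashes of this kind, here is a minimal sketch comparing hex-encoded fingerprints bit by bit. It is a generic Hamming-distance comparison, not the ISCC algorithm itself; the function names and the `max_distance` threshold are assumptions for illustration.

```python
def hamming_distance(hash_a, hash_b):
    """Count differing bits between two equal-length hex-encoded hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError('hashes must have the same length')
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count('1')


def near_duplicates(hashes, max_distance=4):
    """Return pairs of names whose hashes differ by at most max_distance bits."""
    names = sorted(hashes)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if hamming_distance(hashes[a], hashes[b]) <= max_distance
    ]
```

Comparing every pair is quadratic in the number of files, which is part of why this gets expensive on a backlog of tens of thousands, and a human still has to choose which copy to keep.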

File:Arthur and Fritz Kahn Collection 1889-1932 (20345633841).jpg

Thank you

I just wanted to say, "Thank you," for all of the images you have made accessible. I count on them to complete family history books for my clients and each time I see "Fæ" pop up, I send a bit of gratitude your way. With full attribution to you, your time and effort show up in the analog world, too. ~2026-35350-40 (talk) 14:31, 17 June 2026 (UTC)

File:Arthur and Fritz Kahn Collection 1889-1932 (20345633841).jpg