Commons:Bots/Requests/Smallbot 9

Smallbot (talk · contribs)


Operator: Smallman12q (talk · contributions)

Bot's tasks for which permission is being sought: To fulfill Commons:Batch_uploading#VOA_pronunciation_sound_files by uploading ~6500 pronunciation files from http://names.voa.gov

Automatic or manually assisted: Automatic

Edit type (e.g. Continuous, daily, one time run): Initial one-time run, followed by monthly runs

Maximum edit rate (e.g. edits per minute): 10-15, as fast as it uploads

Bot flag requested: (Y/N): No

Programming language(s): Python 3.2 with requests and beautifulsoup4; ffmpeg for conversion (the script below invokes avconv).

Source
#!/usr/bin/env python3.2
# -*- coding: utf-8 -*-

#For uploading files from names.voa.gov to commons

from bs4 import BeautifulSoup
import requests
from subprocess import call
import os.path
import traceback
from PyRWiki import Wiki #Requests based wrapper for api
from p import p

DEBUG=False

#http://stackoverflow.com/questions/1752662/beautifulsoup-easy-way-to-to-obtain-html-free-contents
#http://stackoverflow.com/questions/10993612/python-removing-xa0-from-string
#http://stackoverflow.com/questions/2077897/substitute-multiple-whitespace-with-single-whitespace-in-python
def textOf(soup):
    return ' '.join(''.join(soup.findAll(text=True)).replace('\xa0', ' ').strip().split())

#Capitalize the first letter of each word (start of string, or after a space or "-") and lowercase the rest; assumes separators are single
def fixname(oldname):
    oldname=oldname.strip()
    fixed=""
    lastwasspace=True#Make first letter upper
    for i in oldname:
        if lastwasspace:
            fixed += i.upper()
            lastwasspace=False
        else:
            if i == " " or i == "-":
                lastwasspace = True
            else:
                lastwasspace = False
            fixed += i.lower()
    return fixed

def log(stuff):
    print(stuff)

#Log in
commons = Wiki("https://commons.wikimedia.org/w/api.php","Smallbot")
commons.login('Smallbot',p.bP)
commons.setEditToken()

counter= 2 #starts at 2
log('Checking for last id.')
if os.path.isfile('last.txt'):
    with open('last.txt', 'r') as content_file:
        counter = int(content_file.read())
        log('Last id found: ' + str(counter))
else:
    log('No prior id found. Starting at 2.')

session=requests.session()
if DEBUG:
    session.proxies = {'http': 'http://localhost:8888'}
session.headers.update({'Referer': 'https://commons.wikimedia.org/wiki/Commons:Batch_uploading/VOA_pronunciation_sound_files'}) #update() keeps the default headers

lastsuccess=counter
reached404=0
try:
    while reached404 < 25: # stop after 25 consecutive missing ids
        r = session.get('http://names.voa.gov/modal.phrasedetail.php?id=' + str(counter))

        #Missing ids are detected from the page text rather than from r.status_code == 404
        if "Cannot find the requested name" in r.text:
            reached404 += 1
            log('404 reached for ' + str(counter))
        else:
            reached404=0#reset 404 counter
            lastsuccess=counter
            soup=BeautifulSoup(r.content)
            soupbody=soup.select('div.modal-body')[0]
            if textOf(soupbody) != "How do you say ?":
                name=textOf(soupbody.select("h2")[0])[15:-1] # remove "How do you say" and '?'
                name=fixname(name)
                pronounce= textOf(soupbody.select('p')[0])
                region=textOf(soup.select('h4')[0].findNext('p')) #first <p> after the first <h4> (alternative: select('h4 + p')[0])
                if textOf(soup.select('h4')[0]) != 'Region': #no region listed for this entry
                    region=''

                r=session.get('http://names.voa.gov/sounds/' + str(counter) + '.mp3')
                r.raise_for_status() #should be no errors

                log('---------------------------------')
                log('ID: ' + str(counter))
                log('Name: ' + name)
                log('Pronounce: ' + repr(pronounce))
                log('Region: ' + region)
                log(str(len(r.content)) + ' bytes')

                with open('data.mp3','wb') as voamp3:
                    voamp3.write(r.content)

                filedesc="{{Information\n" +\
                            "|description= {{VOA pronunciation|term=" + name + "|region=" + region + "|transliteration=" +  pronounce + "}}\n" +\
                            "|date= 2013\n" +\
                            "|source= VOA pronunciation guide: [http://names.voa.gov/modal.phrasedetail.php?id="  + str(counter) + " " + name + "]\n" +\
                            "|author= Jim Tedder\n" +\
                            "|permission= {{PD-USGov-VOA}}\n" +\
                            "|other_versions=\n" +\
                            "}}\n"
                if os.path.exists('data.ogg'):
                    os.remove('data.ogg')
                #call(['avconv', '-i', 'data.mp3', '-acodec', 'libvorbis', '-aq', '7', 'data.webm'])
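                if call(['avconv', '-i', 'data.mp3', '-acodec', 'libvorbis', '-aq', '7', 'data.ogg']) != 0: #use .ogg instead
                    raise RuntimeError('avconv failed for id ' + str(counter)) #do not upload after a failed conversion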

                if region != '':
                    region = ' from ' + region
                commons.upload(title="En-us-" + name + region + ' pronunciation (Voice of America).ogg',
                               filelocation='data.ogg',
                               text=filedesc,
                               comment='[[Commons:Bots/Requests/Smallbot 9]]: Uploading Voice of America pronunciation files from http://names.voa.gov',
                               uploadifduplicate=False)
                #TODO-upload data.webm as file
            else:
                log('Empty at ' + str(counter))
        counter += 1

except Exception:
    traceback.print_exc()
finally: #always record the last successful id so the next run can resume from it
    with open('last.txt','w') as lastfp:
        lastfp.write(str(lastsuccess))
    log('Done.')

The script also needs a 'last.txt' file containing the value 6937.
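
The resume mechanism that 'last.txt' provides can be sketched in isolation. This is a minimal illustration, not part of the posted script: the helper names read_last_id and write_last_id are hypothetical, and the default of 2 mirrors the script's starting id.

```python
import os

def read_last_id(path="last.txt", default=2):
    """Return the id stored in the checkpoint file, or `default` if the file is absent."""
    if not os.path.isfile(path):
        return default
    with open(path, "r") as fp:
        return int(fp.read().strip())

def write_last_id(last_id, path="last.txt"):
    """Overwrite the checkpoint file with the last successfully processed id."""
    with open(path, "w") as fp:
        fp.write(str(last_id))
```

Seeding the file with 6937 before a run would make the loop start at that id instead of 2.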

Smallman12q (talk) 20:45, 2 May 2013 (UTC)

Discussion

What should the file description be? Should I use {{Pronunciation}}? Smallman12q (talk) 20:45, 2 May 2013 (UTC)

Looks like this template is not popular, but it's a good idea to standardize descriptions for each class of media files. BTW, is this source so unique that Commons doesn't already have such pronunciations? :-) --EugeneZelenko (talk) 14:39, 3 May 2013 (UTC)
I don't believe Commons has these pronunciations. Is there some standard pronunciation template? I'll probably make one for the VOA files. Smallman12q (talk) 03:09, 4 May 2013 (UTC)
The template would read:

Voice of America pronunciation of <term> from the region of <region>. Transliteration: <transliteration>

Is that fine? It'll also auto-categorize by region and by the first letter of the first name, so "AL-HALQI, WAEL" would become "WAEL AL-HALQI" and be categorized under W. Is the letter/region categorization needed? Smallman12q (talk) 19:51, 4 May 2013 (UTC)
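
The reordering and letter categorization described above could look roughly like this. It is only a sketch: the function names reorder_name and category_letter are hypothetical, and it assumes names arrive either as "LAST, FIRST" or as a single term with no comma.

```python
def reorder_name(raw):
    """Turn 'LAST, FIRST' into 'FIRST LAST'; names without a comma are only stripped."""
    if "," in raw:
        last, first = raw.split(",", 1)
        # Drop empty parts so 'LAST,' does not produce a leading space.
        return " ".join(part for part in (first.strip(), last.strip()) if part)
    return raw.strip()

def category_letter(name):
    """Return the upper-cased first character of the reordered name."""
    return name[:1].upper()

name = reorder_name("AL-HALQI, WAEL")
print(name, "->", category_letter(name))
```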

Well... this should clearly be marked as an American pronunciation recommendation. At least for the few German names I have checked, this is certainly not the gold standard for pronunciation (Erik Honnecker, Frantz Muntefering, and many more). --Dschwen (talk) 16:49, 3 May 2013 (UTC)

There is contact info; I'm sure they would be open to your feedback (or refer them to our local Goethe institute). The value is that it is a currently maintained public domain source of pronunciations. Slowking4 †@1₭ 13:03, 4 May 2013 (UTC)

I've uploaded a few to Category:Terms from Voice of America pronunciation guide. Is it good to go? Smallman12q (talk) 14:03, 8 May 2013 (UTC)

It'd be a good idea to include these files in some pronunciation categories. --EugeneZelenko (talk) 14:27, 8 May 2013 (UTC)
I could add them to Category:English pronunciation and also prepend the names with En-us, so it'd be "File:En-us Abadilla from Philippines pronunciation (Voice of America).webm". Would that be all? Smallman12q (talk) 23:17, 8 May 2013 (UTC)
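
The proposed naming scheme can be sketched as a small helper. The function build_title and its parameters are hypothetical, and the sketch joins the prefix with a space as in the example title above, whereas the posted script concatenates "En-us-" with a hyphen.

```python
def build_title(name, region="", ext="ogg", prefix="En-us"):
    """Build a file title like 'En-us <name> from <region> pronunciation (Voice of America).<ext>'."""
    title = prefix + " " + name
    if region:
        # The region part is omitted when the source lists no region.
        title += " from " + region
    return title + " pronunciation (Voice of America)." + ext

print(build_title("Abadilla", "Philippines", ext="webm"))
```

Switching the container format is then a matter of changing the ext argument.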
Adding a language code prefix is definitely a good idea. BTW, why not upload in Ogg format? The majority of pronunciations, at least, use this format. --EugeneZelenko (talk) 14:37, 9 May 2013 (UTC)
I've asked at w:Wikipedia:Village_pump_(technical)#Preferred_format_for_pronunciations whether it should be .webm or .ogg. Is there a reason you prefer one over the other? I can do either; it's only a one-line change. Smallman12q (talk) 17:56, 9 May 2013 (UTC)
The bot is now uploading everything as .ogg. Could you delete:
  • File:Egil Aarvik from Norway pronunciation (Voice of America).webm
  • File:Sani Abacha from Nigeria pronunciation (Voice of America).webm
  • File:Jorge Abadia from Panama pronunciation (Voice of America).webm
  • File:Abadilla from Philippines pronunciation (Voice of America).webm
  • File:Leonid Abalkin from Russia pronunciation (Voice of America).webm
  • File:Domingo Iturbe Abasolo from Spain pronunciation (Voice of America).webm

Smallman12q (talk) 00:36, 10 May 2013 (UTC)

You could just add {{Superseded}} or {{Delete}} to the files.

If there are no other objections, I think the task should be approved. --EugeneZelenko (talk) 14:31, 10 May 2013 (UTC)

Initial run is done. Will run monthly or so in the future. Smallman12q (talk) 23:20, 10 May 2013 (UTC)