Commons:Batch uploading/Brooklyn Museum/HowTo
How I process this batch job
First, sorry for my bad English ;) This is a short documentation of how I do this job. I hope it helps a bit if you start your own bot. You should be a Linux or Unix user to understand this. Currently I have to use Xubuntu (though I dislike it), so the following was done with Xubuntu.
I currently work in these steps:
- analyse the website and find the best way to extract the images and metadata
- write some scripts with a lot of loops, sed and grep commands, and then download everything I need (images & metadata)
- if needed, parse the metadata again, check and format it, then create a script to upload everything
- test the upload, then upload everything (from another headless machine)
First step: I tested the API and found that I can get all information about an object by its itemId. The simplest way to get all itemIds is to parse the search results. I wrote this simple bash script:
#!/bin/bash
outfile=itemIds.txt
# item count / 30 + 1
for i in {0..136} ; do
index="`expr $i \* 30`"
echo "Page #${i} ..."
lynx --source "http://www.brooklynmuseum.org/opencollection/search/?type=object&start_index=${index}&q=africa*&prev_q=&x=25&y=14"|tr '[\n\r\t]' ' '|sed 's/<div /\n<div /g'|grep 'item-info' |grep -v 'item-info-no-image'|grep '/opencollection/objects/[0-9]*/'|cut -d '"' -f 2| cut -d '/' -f 4 >> ${outfile}.tmp
done
cat ${outfile}.tmp | sort | uniq > ${outfile}
rm ${outfile}.tmp
After this I found only 1568 objects with images. OK, I made a script to download the XML data for each object into its own file (I suggest using a sub-folder per object). You will need an API key, which you can get here. I tuned the parameters a little to get the highest resolution and all the other useful information.
#!/bin/bash
apikey="<insert your api key>"
cat itemIds.txt | while read item ; do
# create folder in 'files'
mkdir "files/${item}"
echo "ItemId: ${item} ..."
# get xml as-is
lynx -source "http://www.brooklynmuseum.org/opencollection/api/?method=collection.getItem&version=1&api_key=${apikey}&item_type=object&item_id=${item}&image_results_limit=20&include_html_style_block=true&max_image_size=1536" > files/${item}/${item}.xml
done
To analyse the available licences I wrote this script (extractRights.sh) and then piped its output through sort and uniq.
#!/bin/bash
find ./files -type f -name '*.xml' | while read file ; do
cat "${file}" | tr '[\r\n]' ' '|sed '/rightstype/s/\(.*rightstype=\)\(.*\)/\2/' |awk -F\" '{print $2}'
done
bash extractRights.sh | sort | uniq -c
This is the result:
1 1.0
80 copyright_artist_or_artists_estate
1450 creative_commons_by_nc
37 no_known_copyright_restrictions
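If you want to verify that the sed/awk extraction pipeline behaves as expected, you can run it on a single known input first. The XML attribute layout in the sample line below is only an assumed example, not real API output:

```shell
#!/bin/bash
# Sanity-check the rightstype extraction on a sample line.
# The sample XML below is an assumption for illustration only.
extract_rightstype() {
  echo "$1" | tr '[\r\n]' ' ' | sed '/rightstype/s/\(.*rightstype=\)\(.*\)/\2/' | awk -F\" '{print $2}'
}
extract_rightstype '<item rightstype="creative_commons_by_nc" collection="Arts of Africa">'
# prints: creative_commons_by_nc
```

This kind of one-line test saves a lot of time before letting the script loose on 1500 files.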
Now that I know the keywords for the licences I can use (creative_commons_by_nc and no_known_copyright_restrictions), I wrote a script to remove all files that do not carry one of these licences.
Hint: There is currently a mistake on the museum's side: they mark images as CC-BY on the website but the same images as CC-BY-NC in the API. We are sure they mean CC-BY in the API too, so 'creative_commons_by_nc' in the API means 'creative_commons_by'.
#!/bin/bash
find ./files -type f -name '*.xml' | while read file ; do
rightstype="`cat \"${file}\" | tr '[\r\n]' ' '|sed '/rightstype/s/\(.*rightstype=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
if [ "$rightstype" != "creative_commons_by_nc" ] && [ "$rightstype" != "no_known_copyright_restrictions" ] ; then
rm "$file"
rmdir "`dirname $file`"
fi
done
I did the same with the attribute 'collection' to keep only items from the Arts of Africa collection.
#!/bin/bash
find ./files -type f -name '*.xml' | while read file ; do
collection="`cat \"${file}\" | tr '[\r\n]' ' '|sed '/collection/s/\(.*collection=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
if [ "$collection" != "Arts of Africa" ] ; then
rm "$file"
rmdir "`dirname $file`"
fi
done
OK, let's count:
find ./files -type f -name '*.xml' | wc -l
OK, that leaves 1392 objects.
I wrote a bash script that extracts all information from the XML files, puts each piece into its own file, and downloads and renames the images. I know bash is not the perfect scripting language for this job, but I like to play around with it and it is easy to develop with. Don't be surprised that I put every piece of information into its own file (I love working with single files); this makes the upload script small and easy to develop.
#!/bin/bash
find ./files -type f -name '*.xml' | while read file ; do
echo "Process ${file} ..."
xml="`cat \"${file}\" | tr '[\r\n]' ' '`"
# extract all information with some grep and sed magic
# please do not try to understand this unless you are a little bit crazy ;)
id="`echo \"${xml}\" | sed 's/id=/\nid=/g'| grep "id=" | head -n 1|sed '/id/s/\(.*id=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
title="`echo \"${xml}\" | sed 's/title=/\ntitle=/g'| grep "title=" | head -n 1|sed '/title/s/\(.*title=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
uri="`echo \"${xml}\" | sed 's/uri=/\nuri=/g'| grep "uri=" | head -n 1|sed '/uri/s/\(.*uri=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
accession_number="`echo \"${xml}\" | sed 's/accession_number=/\naccession_number=/g'| grep "accession_number=" | head -n 1|sed '/accession_number/s/\(.*accession_number=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
object_date="`echo \"${xml}\"| sed 's/object_date=/\nobject_date=/g'| grep "object_date=" | head -n 1|sed '/object_date/s/\(.*object_date=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
medium="`echo \"${xml}\" | sed 's/medium=/\nmedium=/g'| grep "medium=" | head -n 1|sed '/medium/s/\(.*medium=\)\(.*\)/\2/' |awk -F\\" '{print $2}'|sed 's/&lt;/</g'|sed 's/&gt;/>/g'|sed -e 's/<[^>]*>//g'`"
dimensions="`echo \"${xml}\"| sed 's/dimensions=/\ndimensions=/g'| grep "dimensions=" | head -n 1|sed '/dimensions/s/\(.*dimensions=\)\(.*\)/\2/' |awk -F\\" '{print $2}'|sed 's/&lt;/</g'|sed 's/&gt;/>/g'|sed -e 's/<[^>]*>//g'`"
credit_line="`echo \"${xml}\" | sed 's/credit_line=/\ncredit_line=/g'| grep "credit_line=" | head -n 1|sed '/credit_line/s/\(.*credit_line=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
classification="`echo \"${xml}\" | sed 's/classification=/\nclassification=/g'| grep "classification=" | head -n 1|sed '/classification/s/\(.*classification=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
description="`echo \"${xml}\" | sed 's/description=/\ndescription=/g'| grep "description=" | head -n 1|sed '/description/s/\(.*description=\)\(.*\)/\2/' |awk -F\\" '{print $2}'|sed 's/&lt;/</g'|sed 's/&gt;/>/g'|sed -e 's/<[^>]*>//g'`"
location="`echo \"${xml}\" | sed 's/location=/\nlocation=/g'| grep "location=" | head -n 1|sed '/location/s/\(.*location=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
label="`echo \"${xml}\" | sed 's/label=/\nlabel=/g'| grep "label=" | head -n 1|sed '/label/s/\(.*label=\)\(.*\)/\2/' |awk -F\\" '{print $2}'|sed 's/&lt;/</g'|sed 's/&gt;/>/g'|sed -e 's/<[^>]*>//g'`"
#collection="`echo \"${xml}\" | sed 's/collection=/\ncollection=/g'| grep "collection=" | head -n 1|sed '/collection/s/\(.*collection=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
#rightstype="`echo \"${xml}\" | sed 's/rightstype=/\nrightstype=/g'| grep "rightstype=" | head -n 1|sed '/rightstype/s/\(.*rightstype=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
markings="`echo \"${xml}\" | sed 's/markings=/\nmarkings=/g'| grep "markings=" | head -n 1|sed '/markings/s/\(.*markings=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
dynasty="`echo \"${xml}\" | sed 's/dynasty=/\ndynasty=/g'| grep "dynasty=" | head -n 1|sed '/dynasty/s/\(.*dynasty=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
signed="`echo \"${xml}\" | sed 's/signed=/\nsigned=/g'| grep "signed=" | head -n 1|sed '/signed/s/\(.*signed=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
period="`echo \"${xml}\" | sed 's/period=/\nperiod=/g'| grep "period=" | head -n 1|sed '/period/s/\(.*period=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
if [ "$id" != "" ] ; then
echo -n "$id" > $file.id
fi
if [ "$title" != "" ] ; then
echo -n "$title" > $file.title
fi
if [ "$uri" != "" ] ; then
echo -n "$uri" > $file.uri
fi
if [ "$accession_number" != "" ] ; then
echo -n "$accession_number" > $file.accession_number
fi
if [ "$object_date" != "" ] ; then
echo -n "$object_date" > $file.object_date
fi
if [ "$medium" != "" ] ; then
echo -n "$medium" > $file.medium
fi
if [ "$dimensions" != "" ] ; then
echo -n "$dimensions" > $file.dimensions
fi
if [ "$credit_line" != "" ] ; then
echo -n "$credit_line" > $file.credit_line
fi
if [ "$classification" != "" ] ; then
echo -n "$classification" > $file.classification
fi
if [ "$description" != "" ] ; then
echo -n "$description" > $file.description
fi
if [ "$label" != "" ] ; then
echo -n "$label" > $file.label
fi
if [ "$location" != "" ] ; then
echo -n "$location" > $file.location
fi
#if [ "$collection" != "" ] ; then
# echo -n "$collection" > $file.collection
#fi
#if [ "$rightstype" != "" ] ; then
# echo -n "$rightstype" > $file.rightstype
#fi
# others
###################################################
if [ "$markings" != "" ] ; then
echo "* Markings: $markings" >> "$file.other"
fi
if [ "$signed" != "" ] ; then
echo "* Signed: $signed" >> "$file.other"
fi
if [ "$dynasty" != "" ] ; then
echo "* Dynasty: $dynasty" >> "$file.other"
fi
if [ "$period" != "" ] ; then
echo "* Period: $period" >> "$file.other"
fi
# artists (different values)
echo "${xml}" | sed 's/<artist /\n<artist /g' | grep '<artist ' | while read artist ; do
artist_role="`echo \"${artist}\" | sed '/role/s/\(.*role=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
artist_name="`echo \"${artist}\" | sed '/name/s/\(.*name=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
echo "* ${artist_role}: ${artist_name}" >> "$file.other"
done
# geolocations (different values)
echo "${xml}" | sed 's/<geolocation /\n<geolocation /g' | grep '<geolocation ' | while read geolocation ; do
geolocation_name="`echo \"${geolocation}\" | sed '/name/s/\(.*name=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
geolocation_type="`echo \"${geolocation}\" | sed '/location_type/s/\(.*location_type=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
echo "* ${geolocation_type}: ${geolocation_name}" >> $file.other
done
# images
image_count=0
echo "${xml}" | sed 's/<image uri=/\n<image uri=/g' | grep '<image uri=' | sed 's/\/size[0-9]\//\/size4\//g' |while read image ; do
image_link="`echo \"${image}\" | sed 's/uri=/\nuri=/g'| grep "uri=" | head -n 1|sed '/uri/s/\(.*uri=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
image_color="`echo \"${image}\" | sed 's/is_color=/\nis_color=/g'| grep "is_color=" | head -n 1|sed '/is_color/s/\(.*is_color=\)\(.*\)/\2/' |awk -F\\" '{print $2}'|grep 'true'`"
image_xray="`echo \"${image_link}\" | grep '_xrs_\|_xray_' &> /dev/null && echo \"true\"`"
image_name="`basename \"${image_link}\"`"
image_ext="`echo \"${image_name}\" | rev | cut -d '.' -f 1 | rev | tr '[A-Z]' '[a-z]'`"
image_count=`expr ${image_count} + 1`
if [ "$image_count" -gt "1" ] ; then
upload_name="Brooklyn_Museum_${accession_number}_`basename \"${uri}\"`_(${image_count}).${image_ext}"
else
upload_name="Brooklyn_Museum_${accession_number}_`basename \"${uri}\"`.${image_ext}"
fi
echo "> Download ${image_name} ..."
wget "${image_link}" -O "files/${id}/${upload_name}" &> "files/${id}/${upload_name}.log" || echo "ERROR!" >> "files/${id}/${upload_name}.log"
echo "File:${upload_name}" >> "$file.gallery"
if [ "${image_link}" != "" ] ; then
echo -n "$image_link" > "files/${id}/${upload_name}.link"
fi
if [ "${image_name}" != "" ] ; then
echo -n "$image_name" > "files/${id}/${upload_name}.name"
fi
if [ "${image_color}" != "" ] ; then
echo -n "$image_color" > "files/${id}/${upload_name}.color"
fi
if [ "${image_xray}" != "" ] ; then
echo -n "$image_xray" > "files/${id}/${upload_name}.xray"
fi
done
done
This results in a file listing like the following for each XML file.
$ ls -l
total 564
-rw-rw-r-- 1 xxx xxx   2238 Oct 15 20:27 2910.xml
-rw-rw-r-- 1 xxx xxx      6 Oct 20 13:05 2910.xml.accession_number
-rw-rw-r-- 1 xxx xxx      9 Oct 20 13:05 2910.xml.classification
-rw-rw-r-- 1 xxx xxx     14 Oct 20 13:05 2910.xml.collection
-rw-rw-r-- 1 xxx xxx     56 Oct 20 13:05 2910.xml.credit_line
-rw-rw-r-- 1 xxx xxx    547 Oct 20 13:05 2910.xml.description
-rw-rw-r-- 1 xxx xxx     52 Oct 20 13:05 2910.xml.dimensions
-rw-rw-r-- 1 xxx xxx     70 Oct 20 13:05 2910.xml.gallery
-rw-rw-r-- 1 xxx xxx      4 Oct 20 13:05 2910.xml.id
-rw-rw-r-- 1 xxx xxx     18 Oct 20 13:05 2910.xml.medium
-rw-rw-r-- 1 xxx xxx     31 Oct 20 13:05 2910.xml.object_date
-rw-rw-r-- 1 xxx xxx     87 Oct 20 13:05 2910.xml.other
-rw-rw-r-- 1 xxx xxx     22 Oct 20 13:05 2910.xml.rightstype
-rw-rw-r-- 1 xxx xxx      5 Oct 20 13:05 2910.xml.title
-rw-rw-r-- 1 xxx xxx     63 Oct 20 13:05 2910.xml.uri
-rw-rw-r-- 1 xxx xxx 161155 Mar 10  2012 Brooklyn_Museum_22.233_Stool_(2).jpg
-rw-rw-r-- 1 xxx xxx     80 Oct 20 13:05 Brooklyn_Museum_22.233_Stool_(2).jpg.link
-rw-rw-r-- 1 xxx xxx    927 Oct 20 13:05 Brooklyn_Museum_22.233_Stool_(2).jpg.log
-rw-rw-r-- 1 xxx xxx     13 Oct 20 13:05 Brooklyn_Museum_22.233_Stool_(2).jpg.name
-rw-rw-r-- 1 xxx xxx 162477 Mar 15  2012 Brooklyn_Museum_22.233_Stool.jpg
-rw-rw-r-- 1 xxx xxx      4 Oct 20 13:05 Brooklyn_Museum_22.233_Stool.jpg.color
-rw-rw-r-- 1 xxx xxx     91 Oct 20 13:05 Brooklyn_Museum_22.233_Stool.jpg.link
-rw-rw-r-- 1 xxx xxx    932 Oct 20 13:05 Brooklyn_Museum_22.233_Stool.jpg.log
-rw-rw-r-- 1 xxx xxx     24 Oct 20 13:05 Brooklyn_Museum_22.233_Stool.jpg.name
Hint: Depending on the source there can be unusable filenames containing %XX characters, double dots or double underscores. You should find these before uploading and rename all dependent files correctly, otherwise the upload fails silently with the upload script. (You can pipe all output to a logfile and analyse it afterwards, e.g. python pywikipedia/upload.py ... &>> alluploads.log)
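A quick way to catch such names before uploading is a small pattern check. This is only a sketch; the pattern list covers the cases mentioned above and you should extend it for your own source:

```shell
#!/bin/bash
# Flag upload names that the upload script would silently choke on:
# percent-escapes (%XX), double dots and double underscores.
check_name() {
  case "$1" in
    *%[0-9A-Fa-f][0-9A-Fa-f]*|*..*|*__*) echo "BAD: $1" ;;
    *) echo "OK: $1" ;;
  esac
}
check_name "Brooklyn_Museum_22.233_Stool.jpg"
check_name "Brooklyn_Museum_22.233__Mask%20front.jpg"
```

Run it over all downloaded images with something like `find ./files -name '*.jpg' | while read f ; do check_name "$(basename "$f")" ; done | grep '^BAD'` and rename everything it flags (including the matching .link/.name/.gallery files).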
Now we can upload the files, using pywikipedia and this upload script:
find ./files/ -name '*.jpg' | while read file ; do
if ! grep -m 1 "^${file}$" upload.log &> /dev/null ; then
path="`dirname \"${file}\"`"
number="`basename \"${path}\"`"
filename="`basename \"${file}\"`"
id="`cat \"${path}/${number}.xml.id\" 2> /dev/null`"
uri="`cat \"${path}/${number}.xml.uri\" 2> /dev/null`"
accession_number="`cat \"${path}/${number}.xml.accession_number\" 2> /dev/null`"
medium="`cat \"${path}/${number}.xml.medium\" 2> /dev/null`"
dimensions="`cat \"${path}/${number}.xml.dimensions\" 2> /dev/null`"
credit_line="`cat \"${path}/${number}.xml.credit_line\" 2> /dev/null`"
image_link="`cat \"${file}.link\" 2> /dev/null`"
image_name="`cat \"${file}.name\" 2> /dev/null`"
# prepare title
if test -e "${path}/${number}.xml.title" ; then
title="{{en|`cat \"${path}/${number}.xml.title\" 2> /dev/null`}}"
else
title=""
fi
# prepare date
if test -e "${path}/${number}.xml.object_date" ; then
if grep "^[0-9]*th century$" "${path}/${number}.xml.object_date" &> /dev/null ; then
yy="`cat \"${path}/${number}.xml.object_date\"| sed 's/[a-zA-Z]//g' | sed 's/[ ]*//g'`"
object_date="{{other_date|century|${yy}}}"
else
object_date="{{en|`cat \"${path}/${number}.xml.object_date\" 2> /dev/null`}}"
fi
else
object_date=""
fi
# prepare description (the line break and the empty line in the environment variable are important)
description="`cat \"${path}/${number}.xml.description\" 2> /dev/null`"
label="`cat \"${path}/${number}.xml.label\" 2> /dev/null`"
if [ "${description}" != "" ] && [ "${label}" != "" ] ; then
description="{{en|${description}}}
{{en|${label}}}"
else
if [ "${description}" == "" ] && [ "${label}" == "" ] ; then
description="${title}"
else
description="{{en|${description}${label}}}"
fi
fi
# prepare location
location="`cat \"${path}/${number}.xml.location\" 2> /dev/null`"
if test -e "${path}/${number}.xml.location" ; then
location="{{Brooklyn Museum location|collection=africa}} ${location}"
else
location="{{Brooklyn Museum location|collection=africa}}"
fi
# prepare additional notes
notes=""
if test -e "${path}/${number}.xml.other" 2> /dev/null ; then
notes="`cat \"${path}/${number}.xml.other\" | sed 's/ place / Place /g' 2> /dev/null`"
else
notes=""
fi
# add gallery if more than one image (the line breaks in the environment variables are important)
image_count="`cat \"${path}/${number}.xml.gallery\" 2> /dev/null | wc -l`"
if [ "${image_count}" -gt "1" ] ; then
gallery="<gallery>
`cat \"${path}/${number}.xml.gallery\" 2> /dev/null`
</gallery>"
else
gallery=""
fi
# add categories for b&w or x-ray (the line breaks in the environment variables are important)
add_categories=""
if test -e "${file}.xray" ; then
add_categories="
[[Category:X-rays of objects]]"
else
if ! test -e "${file}.color" ; then
add_categories="
[[Category:Black and white photographs]]"
fi
fi
# upload...
echo "Uploading $filename => "
starttime=$(date +"%s")
yes N | python pywikipedia/upload.py -simulate -keep -filename:${filename} -noverify ${file} "{{Artwork
| Artist = {{unknown}}
| Title = ${title}
| Year = ${object_date}
| Description = ${description}
| Technique =
| Dimensions = ${dimensions}
| Institution = {{Institution:Brooklyn Museum}}
| Location = ${location}
| Credit_line = ${credit_line}
| Inscriptions =
| Notes = ${notes}
| Source = [http://www.brooklynmuseum.org/opencollection/objects/${id} Online Collection] of [[w:Brooklyn Museum|Brooklyn Museum]]; Photo: Brooklyn Museum, [${image_link} ${image_name}]
| accession number = [http://www.brooklynmuseum.org/opencollection/objects/${id} ${accession_number}]
| Permission = {{WikiAfrica/Brooklyn Museum}}
| Other_versions = ${gallery}
}}
[[Category:African art in the Brooklyn Museum]]
[[Category:Import by User:Slick-o-bot/Brooklyn Museum]]${add_categories}" && echo "${file}" >> upload.log
# set throttle (means: $throttle uploads per minute)
throttle=4
stoptime=$(date +"%s")
uploadtime=$(($stoptime-$starttime))
sleep=`expr \( 60 - ${throttle} \* ${uploadtime} \) / \( ${throttle} - 1 \)`
if [[ ${sleep} -lt 0 ]] ; then sleep=0 ; fi
echo "-----------------------------------------------------------------"
echo ">> upload time was ${uploadtime} seconds, sleeping ${sleep} seconds"
echo "-----------------------------------------------------------------"
sleep ${sleep}
fi
done
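A note on the throttle arithmetic at the end: with ${throttle} uploads per minute, each taking ${uploadtime} seconds, the pauses have to satisfy throttle*uploadtime + (throttle-1)*sleep = 60, which rearranges to the formula in the script. A quick check of the expression:

```shell
#!/bin/bash
# sleep = (60 - throttle * uploadtime) / (throttle - 1)
# e.g. 4 uploads per minute, 3 seconds per upload: (60 - 12) / 3 = 16
throttle_sleep() {
  expr \( 60 - $1 \* $2 \) / \( $1 - 1 \)
}
throttle_sleep 4 3
# prints: 16
```

If an upload takes longer than its share of the minute the expression goes negative, which is why the script clamps the result to 0 before sleeping.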