Commons:Batch uploading/Brooklyn Museum/HowTo
How I process this batch job
First, sorry for my bad English ;) This is a short documentation of how I do this job. I hope it helps a bit if you start your own bot. You should be a Linux or Unix user to understand this. Currently I have to use Xubuntu (though I dislike it), so the following was done with Xubuntu.
I currently work in these steps:
- analyse the website and find the best way to extract the images and metadata
- write some scripts with a lot of loops, sed and grep commands, and then download everything I need (images & metadata)
- if needed, parse the metadata again, check and format it, then create a script to upload everything
- test the upload, then upload everything (from another headless machine)
First step: I tested the API and found that I can get all information about an object by its itemId. The simplest way to get all itemIds is to parse the search results. I wrote this simple bash script:
#!/bin/bash
outfile=itemIds.txt
# item count / 30 + 1
for i in {0..136} ; do
index="`expr $i \* 30`"
echo "Page #${i} ..."
lynx --source "http://www.brooklynmuseum.org/opencollection/search/?type=object&start_index=${index}&q=africa*&prev_q=&x=25&y=14"|tr '[\n\r\t]' ' '|sed 's/<div /\n<div /g'|grep 'item-info' |grep -v 'item-info-no-image'|grep '/opencollection/objects/[0-9]*/'|cut -d '"' -f 2| cut -d '/' -f 4 >> ${outfile}.tmp
done
cat ${outfile}.tmp | sort | uniq > ${outfile}
rm ${outfile}.tmp
After this I found only 1568 objects with images. OK, I made a script to download the XML data for each object into its own file (I suggest using a sub-folder per object). You will need an API key, which you can get here. I tuned the parameters a little to get the highest resolution and all the other useful information.
#!/bin/bash
apikey="<insert your api key>"
cat itemIds.txt | while read item ; do
# create folder in 'files'
mkdir "files/${item}"
echo "ItemId: ${item} ..."
# get xml as-is
lynx -source "http://www.brooklynmuseum.org/opencollection/api/?method=collection.getItem&version=1&api_key=${apikey}&item_type=object&item_id=${item}&image_results_limit=20&include_html_style_block=true&max_image_size=1536" > files/${item}/${item}.xml
done
To analyse the available licences I wrote this script (extractRights.sh) and then piped its output through sort and uniq.
#!/bin/bash
find ./files -type f -name '*.xml' | while read file ; do
cat "${file}" | tr '[\r\n]' ' '|sed '/rightstype/s/\(.*rightstype=\)\(.*\)/\2/' |awk -F\" '{print $2}'
done
bash extractRights.sh | sort | uniq -c
This is the result:
1 1.0
80 copyright_artist_or_artists_estate
1450 creative_commons_by_nc
37 no_known_copyright_restrictions
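If you want to verify that the sed/awk extraction pipeline behaves as expected, you can run it on a single known input first. The XML attribute layout in the sample line below is only an assumed example, not real API output:

```shell
#!/bin/bash
# Sanity-check the rightstype extraction on a sample line.
# The sample XML below is an assumption for illustration only.
extract_rightstype() {
  echo "$1" | tr '[\r\n]' ' ' | sed '/rightstype/s/\(.*rightstype=\)\(.*\)/\2/' | awk -F\" '{print $2}'
}
extract_rightstype '<item rightstype="creative_commons_by_nc" collection="Arts of Africa">'
# prints: creative_commons_by_nc
```

This kind of one-line test saves a lot of time before letting the script loose on 1500 files.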
Now that I know the keywords for the licences I can use (creative_commons_by_nc and no_known_copyright_restrictions), I wrote a script to remove all files that do not carry one of these licences.
Hint: There is currently a mistake on the museum's side: they mark images as CC-BY on the website but the same images as CC-BY-NC in the API. We are sure they mean CC-BY in the API too, so 'creative_commons_by_nc' in the API means 'creative_commons_by'.
#!/bin/bash
find ./files -type f -name '*.xml' | while read file ; do
rightstype="`cat \"${file}\" | tr '[\r\n]' ' '|sed '/rightstype/s/\(.*rightstype=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
if [ "$rightstype" != "creative_commons_by_nc" ] && [ "$rightstype" != "no_known_copyright_restrictions" ] ; then
rm "$file"
rmdir "`dirname $file`"
fi
done
I did the same with the attribute 'collection' to keep only items from the Arts of Africa collection.
#!/bin/bash
find ./files -type f -name '*.xml' | while read file ; do
collection="`cat \"${file}\" | tr '[\r\n]' ' '|sed '/collection/s/\(.*collection=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
if [ "$collection" != "Arts of Africa" ] ; then
rm "$file"
rmdir "`dirname $file`"
fi
done
OK, let's count:
find ./files -type f -name '*.xml' | wc -l
OK, that leaves 1392 objects.
I wrote a bash script that extracts all information from the XML files, puts each piece into its own file, and downloads and renames the images. I know bash is not the perfect scripting language for this job, but I like to play around with it and it is easy to develop with. Don't be surprised that I put every piece of information into its own file (I love working with single files); this makes the upload script small and easy to develop.
#!/bin/bash
find ./files -type f -name '*.xml' | while read file ; do
echo "Process ${file} ..."
xml="`cat \"${file}\" | tr '[\r\n]' ' '`"
# extract all information with some grep and sed magic
# please do not try to understand this unless you are a little bit crazy ;)
id="`echo \"${xml}\" | sed 's/id=/\nid=/g'| grep "id=" | head -n 1|sed '/id/s/\(.*id=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
title="`echo \"${xml}\" | sed 's/title=/\ntitle=/g'| grep "title=" | head -n 1|sed '/title/s/\(.*title=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
uri="`echo \"${xml}\" | sed 's/uri=/\nuri=/g'| grep "uri=" | head -n 1|sed '/uri/s/\(.*uri=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
accession_number="`echo \"${xml}\" | sed 's/accession_number=/\naccession_number=/g'| grep "accession_number=" | head -n 1|sed '/accession_number/s/\(.*accession_number=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
object_date="`echo \"${xml}\"| sed 's/object_date=/\nobject_date=/g'| grep "object_date=" | head -n 1|sed '/object_date/s/\(.*object_date=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
medium="`echo \"${xml}\" | sed 's/medium=/\nmedium=/g'| grep "medium=" | head -n 1|sed '/medium/s/\(.*medium=\)\(.*\)/\2/' |awk -F\\" '{print $2}'|sed 's/&lt;/</g'|sed 's/&gt;/>/g'|sed -e 's/<[^>]*>//g'`"
dimensions="`echo \"${xml}\"| sed 's/dimensions=/\ndimensions=/g'| grep "dimensions=" | head -n 1|sed '/dimensions/s/\(.*dimensions=\)\(.*\)/\2/' |awk -F\\" '{print $2}'|sed 's/&lt;/</g'|sed 's/&gt;/>/g'|sed -e 's/<[^>]*>//g'`"
credit_line="`echo \"${xml}\" | sed 's/credit_line=/\ncredit_line=/g'| grep "credit_line=" | head -n 1|sed '/credit_line/s/\(.*credit_line=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
classification="`echo \"${xml}\" | sed 's/classification=/\nclassification=/g'| grep "classification=" | head -n 1|sed '/classification/s/\(.*classification=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
description="`echo \"${xml}\" | sed 's/description=/\ndescription=/g'| grep "description=" | head -n 1|sed '/description/s/\(.*description=\)\(.*\)/\2/' |awk -F\\" '{print $2}'|sed 's/&lt;/</g'|sed 's/&gt;/>/g'|sed -e 's/<[^>]*>//g'`"
location="`echo \"${xml}\" | sed 's/location=/\nlocation=/g'| grep "location=" | head -n 1|sed '/location/s/\(.*location=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
label="`echo \"${xml}\" | sed 's/label=/\nlabel=/g'| grep "label=" | head -n 1|sed '/label/s/\(.*label=\)\(.*\)/\2/' |awk -F\\" '{print $2}'|sed 's/&lt;/</g'|sed 's/&gt;/>/g'|sed -e 's/<[^>]*>//g'`"
#collection="`echo \"${xml}\" | sed 's/collection=/\ncollection=/g'| grep "collection=" | head -n 1|sed '/collection/s/\(.*collection=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
#rightstype="`echo \"${xml}\" | sed 's/rightstype=/\nrightstype=/g'| grep "rightstype=" | head -n 1|sed '/rightstype/s/\(.*rightstype=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
markings="`echo \"${xml}\" | sed 's/markings=/\nmarkings=/g'| grep "markings=" | head -n 1|sed '/markings/s/\(.*markings=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
dynasty="`echo \"${xml}\" | sed 's/dynasty=/\ndynasty=/g'| grep "dynasty=" | head -n 1|sed '/dynasty/s/\(.*dynasty=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
signed="`echo \"${xml}\" | sed 's/signed=/\nsigned=/g'| grep "signed=" | head -n 1|sed '/signed/s/\(.*signed=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
period="`echo \"${xml}\" | sed 's/period=/\nperiod=/g'| grep "period=" | head -n 1|sed '/period/s/\(.*period=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
if [ "$id" != "" ] ; then
echo -n "$id" > $file.id
fi
if [ "$title" != "" ] ; then
echo -n "$title" > $file.title
fi
if [ "$uri" != "" ] ; then
echo -n "$uri" > $file.uri
fi
if [ "$accession_number" != "" ] ; then
echo -n "$accession_number" > $file.accession_number
fi
if [ "$object_date" != "" ] ; then
echo -n "$object_date" > $file.object_date
fi
if [ "$medium" != "" ] ; then
echo -n "$medium" > $file.medium
fi
if [ "$dimensions" != "" ] ; then
echo -n "$dimensions" > $file.dimensions
fi
if [ "$credit_line" != "" ] ; then
echo -n "$credit_line" > $file.credit_line
fi
if [ "$classification" != "" ] ; then
echo -n "$classification" > $file.classification
fi
if [ "$description" != "" ] ; then
echo -n "$description" > $file.description
fi
if [ "$label" != "" ] ; then
echo -n "$label" > $file.label
fi
if [ "$location" != "" ] ; then
echo -n "$location" > $file.location
fi
#if [ "$collection" != "" ] ; then
# echo -n "$collection" > $file.collection
#fi
#if [ "$rightstype" != "" ] ; then
# echo -n "$rightstype" > $file.rightstype
#fi
# others
###################################################
if [ "$markings" != "" ] ; then
echo "* Markings: $markings" >> "$file.other"
fi
if [ "$signed" != "" ] ; then
echo "* Signed: $signed" >> "$file.other"
fi
if [ "$dynasty" != "" ] ; then
echo "* Dynasty: $dynasty" >> "$file.other"
fi
if [ "$period" != "" ] ; then
echo "* Period: $period" >> "$file.other"
fi
# artists (different values)
echo "${xml}" | sed 's/<artist /\n<artist /g' | grep '<artist ' | while read artist ; do
artist_role="`echo \"${artist}\" | sed '/role/s/\(.*role=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
artist_name="`echo \"${artist}\" | sed '/name/s/\(.*name=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
echo "* ${artist_role}: ${artist_name}" >> "$file.other"
done
# geolocations (different values)
echo "${xml}" | sed 's/<geolocation /\n<geolocation /g' | grep '<geolocation ' | while read geolocation ; do
geolocation_name="`echo \"${geolocation}\" | sed '/name/s/\(.*name=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
geolocation_type="`echo \"${geolocation}\" | sed '/location_type/s/\(.*location_type=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
echo "* ${geolocation_type}: ${geolocation_name}" >> $file.other
done
# images
image_count=0
echo "${xml}" | sed 's/<image uri=/\n<image uri=/g' | grep '<image uri=' | sed 's/\/size[0-9]\//\/size4\//g' |while read image ; do
image_link="`echo \"${image}\" | sed 's/uri=/\nuri=/g'| grep "uri=" | head -n 1|sed '/uri/s/\(.*uri=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
image_color="`echo \"${image}\" | sed 's/is_color=/\nis_color=/g'| grep "is_color=" | head -n 1|sed '/is_color/s/\(.*is_color=\)\(.*\)/\2/' |awk -F\\" '{print $2}'|grep 'true'`"
image_xray="`echo \"${image_link}\" | grep '_xrs_\|_xray_' &> /dev/null && echo \"true\"`"
image_name="`basename \"${image_link}\"`"
image_ext="`echo \"${image_name}\" | rev | cut -d '.' -f 1 | rev | tr '[A-Z]' '[a-z]'`"
image_count=`expr ${image_count} + 1`
if [ "$image_count" -gt "1" ] ; then
upload_name="Brooklyn_Museum_${accession_number}_`basename \"${uri}\"`_(${image_count}).${image_ext}"
else
upload_name="Brooklyn_Museum_${accession_number}_`basename \"${uri}\"`.${image_ext}"
fi
echo "> Download ${image_name} ..."
wget "${image_link}" -O "files/${id}/${upload_name}" &> "files/${id}/${upload_name}.log" || echo "ERROR!" >> "files/${id}/${upload_name}.log"
echo "File:${upload_name}" >> "$file.gallery"
if [ "${image_link}" != "" ] ; then
echo -n "$image_link" > "files/${id}/${upload_name}.link"
fi
if [ "${image_name}" != "" ] ; then
echo -n "$image_name" > "files/${id}/${upload_name}.name"
fi
if [ "${image_color}" != "" ] ; then
echo -n "$image_color" > "files/${id}/${upload_name}.color"
fi
if [ "${image_xray}" != "" ] ; then
echo -n "$image_xray" > "files/${id}/${upload_name}.xray"
fi
done
done
This results in a file listing like the following for each XML file.
$ ls -l
total 564
-rw-rw-r-- 1 xxx xxx   2238 Oct 15 20:27 2910.xml
-rw-rw-r-- 1 xxx xxx      6 Oct 20 13:05 2910.xml.accession_number
-rw-rw-r-- 1 xxx xxx      9 Oct 20 13:05 2910.xml.classification
-rw-rw-r-- 1 xxx xxx     14 Oct 20 13:05 2910.xml.collection
-rw-rw-r-- 1 xxx xxx     56 Oct 20 13:05 2910.xml.credit_line
-rw-rw-r-- 1 xxx xxx    547 Oct 20 13:05 2910.xml.description
-rw-rw-r-- 1 xxx xxx     52 Oct 20 13:05 2910.xml.dimensions
-rw-rw-r-- 1 xxx xxx     70 Oct 20 13:05 2910.xml.gallery
-rw-rw-r-- 1 xxx xxx      4 Oct 20 13:05 2910.xml.id
-rw-rw-r-- 1 xxx xxx     18 Oct 20 13:05 2910.xml.medium
-rw-rw-r-- 1 xxx xxx     31 Oct 20 13:05 2910.xml.object_date
-rw-rw-r-- 1 xxx xxx     87 Oct 20 13:05 2910.xml.other
-rw-rw-r-- 1 xxx xxx     22 Oct 20 13:05 2910.xml.rightstype
-rw-rw-r-- 1 xxx xxx      5 Oct 20 13:05 2910.xml.title
-rw-rw-r-- 1 xxx xxx     63 Oct 20 13:05 2910.xml.uri
-rw-rw-r-- 1 xxx xxx 161155 Mar 10  2012 Brooklyn_Museum_22.233_Stool_(2).jpg
-rw-rw-r-- 1 xxx xxx     80 Oct 20 13:05 Brooklyn_Museum_22.233_Stool_(2).jpg.link
-rw-rw-r-- 1 xxx xxx    927 Oct 20 13:05 Brooklyn_Museum_22.233_Stool_(2).jpg.log
-rw-rw-r-- 1 xxx xxx     13 Oct 20 13:05 Brooklyn_Museum_22.233_Stool_(2).jpg.name
-rw-rw-r-- 1 xxx xxx 162477 Mar 15  2012 Brooklyn_Museum_22.233_Stool.jpg
-rw-rw-r-- 1 xxx xxx      4 Oct 20 13:05 Brooklyn_Museum_22.233_Stool.jpg.color
-rw-rw-r-- 1 xxx xxx     91 Oct 20 13:05 Brooklyn_Museum_22.233_Stool.jpg.link
-rw-rw-r-- 1 xxx xxx    932 Oct 20 13:05 Brooklyn_Museum_22.233_Stool.jpg.log
-rw-rw-r-- 1 xxx xxx     24 Oct 20 13:05 Brooklyn_Museum_22.233_Stool.jpg.name
Hint: Depending on the source there can be unusable filenames containing %XX characters, double dots or double underscores. You should find these before uploading and rename all dependent files correctly, otherwise the upload fails silently with the upload script. (You can pipe all output to a logfile and analyse it afterwards, e.g. python pywikipedia/upload.py ... &>> alluploads.log)
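A quick way to catch such names before uploading is a small pattern check. This is only a sketch; the pattern list covers the cases mentioned above and you should extend it for your own source:

```shell
#!/bin/bash
# Flag upload names that the upload script would silently choke on:
# percent-escapes (%XX), double dots and double underscores.
check_name() {
  case "$1" in
    *%[0-9A-Fa-f][0-9A-Fa-f]*|*..*|*__*) echo "BAD: $1" ;;
    *) echo "OK: $1" ;;
  esac
}
check_name "Brooklyn_Museum_22.233_Stool.jpg"
check_name "Brooklyn_Museum_22.233__Mask%20front.jpg"
```

Run it over all downloaded images with something like `find ./files -name '*.jpg' | while read f ; do check_name "$(basename "$f")" ; done | grep '^BAD'` and rename everything it flags (including the matching .link/.name/.gallery files).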
Now we can upload the files, using pywikipedia and this upload script:
find ./files/ -name '*.jpg' | while read file ; do
if ! grep -m 1 "^${file}$" upload.log &> /dev/null ; then
path="`dirname \"${file}\"`"
number="`basename \"${path}\"`"
filename="`basename \"${file}\"`"
id="`cat \"${path}/${number}.xml.id\" 2> /dev/null`"
uri="`cat \"${path}/${number}.xml.uri\" 2> /dev/null`"
accession_number="`cat \"${path}/${number}.xml.accession_number\" 2> /dev/null`"
medium="`cat \"${path}/${number}.xml.medium\" 2> /dev/null`"
dimensions="`cat \"${path}/${number}.xml.dimensions\" 2> /dev/null`"
credit_line="`cat \"${path}/${number}.xml.credit_line\" 2> /dev/null`"
image_link="`cat \"${file}.link\" 2> /dev/null`"
image_name="`cat \"${file}.name\" 2> /dev/null`"
# prepare title
if test -e "${path}/${number}.xml.title" ; then
title="{{en|`cat \"${path}/${number}.xml.title\" 2> /dev/null`}}"
else
title=""
fi
# prepare date
if test -e "${path}/${number}.xml.object_date" ; then
if grep "^[0-9]*th century$" "${path}/${number}.xml.object_date" &> /dev/null ; then
yy="`cat \"${path}/${number}.xml.object_date\"| sed 's/[a-zA-Z]//g' | sed 's/[ ]*//g'`"
object_date="{{other_date|century|${yy}}}"
else
object_date="{{en|`cat \"${path}/${number}.xml.object_date\" 2> /dev/null`}}"
fi
else
object_date=""
fi
# prepare description (the line break and the empty line in the environment variable are important)
description="`cat \"${path}/${number}.xml.description\" 2> /dev/null`"
label="`cat \"${path}/${number}.xml.label\" 2> /dev/null`"
if [ "${description}" != "" ] && [ "${label}" != "" ] ; then
description="{{en|${description}}}
{{en|${label}}}"
else
if [ "${description}" == "" ] && [ "${label}" == "" ] ; then
description="${title}"
else
description="{{en|${description}${label}}}"
fi
fi
# prepare location
location="`cat \"${path}/${number}.xml.location\" 2> /dev/null`"
if test -e "${path}/${number}.xml.location" ; then
location="{{Brooklyn Museum location|collection=africa}} ${location}"
else
location="{{Brooklyn Museum location|collection=africa}}"
fi
# prepare additional notes
notes=""
if test -e "${path}/${number}.xml.other" 2> /dev/null ; then
notes="`cat \"${path}/${number}.xml.other\" | sed 's/ place / Place /g' 2> /dev/null`"
else
notes=""
fi
# add gallery if more than one image (the line breaks in the environment variables are important)
image_count="`cat \"${path}/${number}.xml.gallery\" 2> /dev/null | wc -l`"
if [ "${image_count}" -gt "1" ] ; then
gallery="<gallery>
`cat \"${path}/${number}.xml.gallery\" 2> /dev/null`
</gallery>"
else
gallery=""
fi
# add categories for b&w or x-ray (the line breaks in the environment variables are important)
add_categories=""
if test -e "${file}.xray" ; then
add_categories="
[[Category:X-rays of objects]]"
else
if ! test -e "${file}.color" ; then
add_categories="
[[Category:Black and white photographs]]"
fi
fi
# upload...
echo "Uploading $filename => "
starttime=$(date +"%s")
yes N | python pywikipedia/upload.py -simulate -keep -filename:${filename} -noverify ${file} "{{Artwork
| Artist = {{unknown}}
| Title = ${title}
| Year = ${object_date}
| Description = ${description}
| Technique =
| Dimensions = ${dimensions}
| Institution = {{Institution:Brooklyn Museum}}
| Location = ${location}
| Credit_line = ${credit_line}
| Inscriptions =
| Notes = ${notes}
| Source = [http://www.brooklynmuseum.org/opencollection/objects/${id} Online Collection] of [[w:Brooklyn Museum|Brooklyn Museum]]; Photo: Brooklyn Museum, [${image_link} ${image_name}]
| accession number = [http://www.brooklynmuseum.org/opencollection/objects/${id} ${accession_number}]
| Permission = {{WikiAfrica/Brooklyn Museum}}
| Other_versions = ${gallery}
}}
[[Category:African art in the Brooklyn Museum]]
[[Category:Import by User:Slick-o-bot/Brooklyn Museum]]${add_categories}" && echo "${file}" >> upload.log
# set throttle (means: $throttle uploads per minute)
throttle=4
stoptime=$(date +"%s")
uploadtime=$(($stoptime-$starttime))
sleep=`expr \( 60 - ${throttle} \* ${uploadtime} \) / \( ${throttle} - 1 \)`
if [[ ${sleep} -lt 0 ]] ; then sleep=0 ; fi
echo "-----------------------------------------------------------------"
echo ">> upload time was ${uploadtime} seconds, sleeping ${sleep} seconds"
echo "-----------------------------------------------------------------"
sleep ${sleep}
fi
done
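A note on the throttle arithmetic at the end: with ${throttle} uploads per minute, each taking ${uploadtime} seconds, the pauses have to satisfy throttle*uploadtime + (throttle-1)*sleep = 60, which rearranges to the formula in the script. A quick check of the expression:

```shell
#!/bin/bash
# sleep = (60 - throttle * uploadtime) / (throttle - 1)
# e.g. 4 uploads per minute, 3 seconds per upload: (60 - 12) / 3 = 16
throttle_sleep() {
  expr \( 60 - $1 \* $2 \) / \( $1 - 1 \)
}
throttle_sleep 4 3
# prints: 16
```

If an upload takes longer than its share of the minute the expression goes negative, which is why the script clamps the result to 0 before sleeping.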