User:Tbayer (WMF)/Converting Google Docs to wikitext
- Export the document from Google Drive as ODT (File -> Download as -> OpenDocument Format (.odt))
- Install LibreOffice (if it is not installed already) and the "Wiki Publisher" extension for LibreOffice (available e.g. in the Ubuntu software store)
- Export the document from LibreOffice as MediaWiki wikitext (File -> Export -> File type: MediaWiki)
- The resulting wikitext file should have most of the formatting preserved, even tables. But there is an annoying bug/feature: a single link in GDocs can come out in the exported wikitext as several adjacent links to the same URL, e.g. [http://www.example.com look][http://www.example.com like][http://www.example.com this]. To fix these, and also remove extraneous blank lines, run this little script (Python needs to be installed):
python gdocodtwikimultilfix.py gdocodtwiki.txt gdocodtwiki_fixed.txt
where gdocodtwikimultilfix.py is the following (save it locally as a text file in the directory where you want to do the conversion):
#!/usr/bin/python
# Short script to take wikitext generated by the "Wiki Publisher" extension
# for LibreOffice from an ODT file exported from Google Docs
# and fix duplicated external links, as well as remove extraneous blank lines
# By T. Bayer ([[user:HaeB]])
import sys
import re
import codecs

class gdocodtmultilfixerror(Exception):
    def __init__(self, value):
        self.value = value
    def __str__(self):
        return repr(self.value)

if len(sys.argv) != 3:
    raise gdocodtmultilfixerror('needs exactly two command line arguments: 1. input file (non-fixed wikitext, output of Wiki Publisher) 2. output file (fixed wikitext)')

# A URL is taken to be everything from the scheme up to the next space
urlpattern = r'https?://[^\ ]*'
inputfilename = sys.argv[1]
outputfilename = sys.argv[2]
inputfile = codecs.open(inputfilename, mode='r', encoding='utf-8')
outputfile = codecs.open(outputfilename, mode='w', encoding='utf-8')
precedingline = '\n'
for line in inputfile:
    m = line
    urls = set(re.findall(urlpattern, m))
    for url in urls:
        # Somehow, the space before an external link gets moved into the link during the export process.
        # Replace this ('[http://www.example.com ]' --> ' ')
        urle = re.escape(url)
        old = r'([^\]])\['+urle+r' \]\['+urle
        new = r'\1 ['+url
        m = re.sub(old, new, m)
        # Collapse duplicated links:
        old = r'\['+urle+r'( [^\]]*)]\['+urle+' '
        new = u'['+url+r'\1'
        # One pass merges only non-overlapping pairs, so repeat until nothing matches
        while re.search(old, m):
            m = re.sub(old, new, m)
    # Collapse multiple blank lines to one:
    if not (precedingline == '\n' and m == '\n'):
        outputfile.write(m)
    precedingline = m
inputfile.close()
outputfile.close()
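To see what the two substitutions and the blank-line filter do, here is a minimal sketch that applies the same regular expressions to in-memory strings instead of files. The function names fix_line and collapse_blank_lines and the sample URL are made up for this illustration and are not part of the script above.

```python
import re

URL_PATTERN = r'https?://[^\ ]*'  # same pattern as in the script


def fix_line(line):
    """Apply the script's two link substitutions to one line of wikitext."""
    for url in set(re.findall(URL_PATTERN, line)):
        urle = re.escape(url)
        # Step 1: drop the empty leading link and restore the space it swallowed.
        line = re.sub(r'([^\]])\[' + urle + r' \]\[' + urle,
                      r'\1 [' + url, line)
        # Step 2: merge adjacent links to the same URL, one pair per match,
        # repeating until no adjacent pair is left.
        pattern = r'\[' + urle + r'( [^\]]*)]\[' + urle + ' '
        while re.search(pattern, line):
            line = re.sub(pattern, '[' + url + r'\1', line)
    return line


def collapse_blank_lines(lines):
    """Yield lines, skipping any blank line that follows another blank line.

    As in the script, blank lines at the very start are dropped entirely.
    """
    preceding = '\n'
    for line in lines:
        if not (preceding == '\n' and line == '\n'):
            yield line
        preceding = line


if __name__ == '__main__':
    u = 'https://example.org/r'  # hypothetical URL for the demonstration
    broken = ('See[' + u + ' ][' + u + ' the ][' + u + ' annual ]['
              + u + ' report] for details.\n')
    print(fix_line(broken), end='')
    print(list(collapse_blank_lines(['\n', 'a\n', '\n', '\n', 'b\n'])))
```

Note that step 2 uses the space after the repeated URL as the separator and discards it, so words only stay apart if the link fragments themselves carry the spaces (as in "the " and "annual " here).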
There may still be other formatting errors (e.g. bolded text that is not bolded in the original, or vice versa), but for longer documents this solution can save a lot of time compared to manual conversion.
One may consider turning off "smart quotes" in Google Docs ("Tools" -> "Preferences" -> uncheck "Use smart quotes").
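For a document that was already written with smart quotes enabled, the quotes could instead be straightened in the exported wikitext. The following is only a sketch of that idea, not something the script above does; the function name straighten_quotes is made up, and it replaces every curly quote, including any that were intentional.

```python
# Map the four common typographic ("smart") quotes to their ASCII equivalents.
SMART_QUOTES = {
    '\u2018': "'",  # left single quotation mark
    '\u2019': "'",  # right single quotation mark (also used as apostrophe)
    '\u201c': '"',  # left double quotation mark
    '\u201d': '"',  # right double quotation mark
}
_QUOTE_TABLE = str.maketrans(SMART_QUOTES)


def straighten_quotes(text):
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(_QUOTE_TABLE)


if __name__ == '__main__':
    print(straighten_quotes('\u201cIt\u2019s fine\u201d'))
```

It could be applied to each line before it is written to the output file, or run separately over the fixed file.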
See also
- mw:User:JHernandez (WMF)/How to migrate from g00gle docs to wikitext
- en:Wikipedia:Tools#Importing (converting) content to Wikipedia (MediaWiki) format
- en:Wikipedia:Tools/Editing_tools#From_OpenOffice_and_LibreOffice
- Writer2MediaWiki for OpenOffice (doesn't seem to work with LibreOffice 4.1. though)
- https://github.com/rampradeepk/google-docs-to-wiki ?
- mw:Extension:Html2Wiki#Features