Initial spreadsheet importer for Arkhéia
Pretty crude implementation of a LibreOffice spreadsheet => Arkhéia DB importer. The Arkhéia format is totally bonkers. This implementation has been tested with a pretty small sample file. While it does seem to work, I'm still not 100% sure it will scale correctly to a larger import sample. Let's hope for the best and fix stuff along the way :)
commit 8dd5cb8dbb
21
butcher-xml
Executable file
@ -0,0 +1,21 @@
#!/usr/bin/env bash

# Arkeia expects a BOM at the front of a UTF-8 file, AKA exactly what
# the Unicode spec recommends against... (W-T-F!!!!)

# If you miss the BOM, the file will be treated as ASCII, which
# screws up your accents...

# Add BOM
printf '\xEF\xBB\xBF' > "$2"

# Ah, yeah. Arkeia is also not expecting valid XML but a
# *really* weird format instead. Basically they expect a set of XML
# elements for each entry, the entries being separated by a newline.

# Butchering the XML file into something Arkeia will ingest...
# In no particular order:
# - Removing the <root> node (and the XML declaration).
# - Removing the <entry> nodes.
# - Separating the entries by a newline (each entry must start with <numseque>).
xmllint --format "$1" | sed -E -e '/^<\?xml /d' -e '/^ *<\/?(root|entry)>$/d' | awk '{$1=$1};1' | tr -d '\n' | sed 's/<numseque>/\n<numseque>/g' | tail -n +2 >> "$2"
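For review purposes, here is a minimal Python sketch of what the pipeline above is meant to produce: a UTF-8 BOM followed by one line per entry, each line being that entry's field elements concatenated. It is not part of this commit. The function name `to_arkeia_bytes` and the `nom` field are hypothetical; only `root`, `entry`, and `numseque` come from the scripts. It also assumes the target format is exactly what the comments in butcher-xml describe.

```python
import codecs
import xml.etree.ElementTree as ET


def to_arkeia_bytes(xml_text: str) -> bytes:
    """Flatten <root><entry>...</entry></root> into BOM + one line per entry.

    Hypothetical helper: a sketch of the format described in butcher-xml,
    not a tested replacement for it.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for entry in root.findall("entry"):
        fields = []
        for field in entry:
            # Drop any pretty-printing whitespace that follows the element.
            field.tail = None
            # Keep <x></x> rather than <x /> for empty fields.
            fields.append(
                ET.tostring(field, encoding="unicode", short_empty_elements=False)
            )
        lines.append("".join(fields))
    # codecs.BOM_UTF8 is the three bytes EF BB BF written by the printf call.
    return codecs.BOM_UTF8 + "\n".join(lines).encode("utf-8")


sample = (
    "<root>"
    "<entry><numseque>1</numseque><nom>Élodie</nom></entry>"
    "<entry><numseque>2</numseque><nom></nom></entry>"
    "</root>"
)
payload = to_arkeia_bytes(sample)
# Decoding with "utf-8-sig" strips the BOM again, which is handy for checks.
print(payload.decode("utf-8-sig"))
```

Unlike the sed/awk pipeline, this does not depend on the first field being named `numseque`, and it leaves whitespace inside field values untouched.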
11
import-spreadsheet
Executable file
@ -0,0 +1,11 @@
#!/usr/bin/env nix-shell
#!nix-shell -i bash -p libxml2

if [[ -z $1 || -z $2 ]]; then
    echo "usage: import-spreadsheet SPREADSHEET OUTPUT_FILE" >&2
    exit 1
fi

tmpFile=$(mktemp)
./bin/python import.py "$1" "$tmpFile"
./butcher-xml "$tmpFile" "$2"
26
import.py
Normal file
@ -0,0 +1,26 @@
from pyexcel_ods import get_data
import sys
import xml.etree.ElementTree as ET

def process_line(field_names, line, root):
    """Append an <entry> to root, with one child element per field
    name; cells missing from a short row become empty strings."""
    if not line:
        return
    line_dict = dict(enumerate(line))
    xml_line_node = ET.SubElement(root, "entry")
    for field_index in range(len(field_names)):
        # Python lists do not have a safe get.
        # Converting the row to a dict gives us one.
        ET.SubElement(xml_line_node, field_names[field_index]).text = \
            str(line_dict.get(field_index, ""))

if __name__ == '__main__':
    spreadsheet_path = sys.argv[1]
    out_path = sys.argv[2]
    table = get_data(spreadsheet_path)['Sheet1']
    root = ET.Element("root")
    for line in table[1:]:
        process_line(table[0], line, root)
    tree = ET.ElementTree(root)
    tree.write(out_path, encoding="utf8")
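To make the row handling in import.py easier to check without an .ods file, here is a small self-contained sketch that runs the same logic on an in-memory table. The function name `rows_to_root` and the `titre` column are hypothetical. It assumes, as the safe-get comment in `process_line` suggests, that rows can come back shorter than the header row.

```python
import xml.etree.ElementTree as ET


def rows_to_root(table):
    """Build <root><entry>...</entry></root> from rows; table[0] holds field names.

    Hypothetical helper mirroring process_line, minus the pyexcel_ods dependency.
    """
    field_names = table[0]
    root = ET.Element("root")
    for row in table[1:]:
        if not row:
            continue  # blank spreadsheet rows produce no <entry>
        entry = ET.SubElement(root, "entry")
        for index, name in enumerate(field_names):
            # Pad cells missing from a short row with an empty string.
            value = row[index] if index < len(row) else ""
            ET.SubElement(entry, str(name)).text = str(value)
    return root


# Header row, a full row, a blank row, and a row missing its last cell.
table = [["numseque", "titre"], [1, "Registre"], [], [2]]
root = rows_to_root(table)
print(ET.tostring(root, encoding="unicode"))
```

The blank row is skipped entirely, while the short row still gets a `titre` element with empty text, so every entry carries the same set of fields.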