Easily convert XML data to a Python dict with xmltodict

September 1, 2019

I hate working with XML mainly because I find it difficult to read.

There's a popular Python package for parsing XML called lxml which is very flexible, but I found xmltodict much better suited and easier to use for what I needed to do.

We have an inventory management app for Shopify, EZ Inventory, and one of the features we've added recently is XML support. A few of our customers have suppliers who can only provide an XML stock feed (they don't have CSV feeds), so we felt we needed to support it.

The xmltodict package allows us to easily convert that XML data to a Python dictionary which is much nicer to work with. It has pretty much handled all the scenarios we've seen so far. This includes XML feeds that use attributes instead of elements and also those that use CDATA sections. These were handled automatically by the package.

The usage is really simple too.

Example 1

Here's a typical XML data format:

xml_data = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
  <Products>
    <Product>
      <Code>2941</Code>
      <StockQty>65</StockQty>
      <Barcode>49020570284087</Barcode>
    </Product>
    <Product>
      <Code>2778</Code>
      <StockQty>200</StockQty>
      <Barcode>72020570064306</Barcode>
    </Product>
    <Product>
      <Code>2838</Code>
      <StockQty>140</StockQty>
      <Barcode>8802057003726</Barcode>
    </Product>
  </Products>
</root>'''

Let's parse this and convert it to a dict using xmltodict:

import xmltodict

xmltodict.parse(xml_data)

Yup, it's a one-liner. The output will look something like this:

OrderedDict([('root',
              OrderedDict([('Products',
                            OrderedDict([('Product',
                                          [OrderedDict([('Code', '2941'),
                                                        ('StockQty', '65'),
                                                        ('Barcode',
                                                         '49020570284087')]),
                                           OrderedDict([('Code', '2778'),
                                                        ('StockQty', '200'),
                                                        ('Barcode',
                                                         '72020570064306')]),
                                           OrderedDict([('Code', '2838'),
                                                        ('StockQty', '140'),
                                                        ('Barcode',
                                                         '8802057003726')])])]))]))])

Note that the output is an OrderedDict, which ensures that the elements keep their original order. Regular dicts did not preserve insertion order before Python 3.6 (and the language only guarantees it from 3.7 on).
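If the nested OrderedDict output is hard to read, one option is to round-trip the result through the standard json module, which gives back plain dicts and lists. This is only a minimal sketch: the to_plain helper name and the shortened XML are made up for illustration, and depending on your xmltodict version parse may already return plain dicts, in which case the helper changes nothing.

```python
import json

import xmltodict

xml_data = '<root><Product><Code>2941</Code><StockQty>65</StockQty></Product></root>'


def to_plain(parsed):
    # Hypothetical helper: serialising to JSON and loading it back
    # replaces every OrderedDict with a regular dict.
    return json.loads(json.dumps(parsed))


plain = to_plain(xmltodict.parse(xml_data))
print(plain)
```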

In our app specifically, we just want the <Product> elements. We can pull just that section with this code:

xmltodict.parse(xml_data)['root']['Products']['Product']

Which outputs:

[OrderedDict([('Code', '2941'),
              ('StockQty', '65'),
              ('Barcode', '49020570284087')]),
 OrderedDict([('Code', '2778'),
              ('StockQty', '200'),
              ('Barcode', '72020570064306')]),
 OrderedDict([('Code', '2838'),
              ('StockQty', '140'),
              ('Barcode', '8802057003726')])]
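One thing to watch for with this kind of lookup: xmltodict only produces a list when an element repeats. If a feed happens to contain a single <Product>, the same expression returns one dict rather than a list with one item. Here's a sketch of one way to guard against that with the force_list argument of xmltodict.parse. It also converts StockQty to int, since every value comes back as a string. The single-product feed and the variable names are made up for illustration.

```python
import xmltodict

single_product_feed = '''<root>
  <Products>
    <Product>
      <Code>2941</Code>
      <StockQty>65</StockQty>
    </Product>
  </Products>
</root>'''

# force_list makes 'Product' a list even when it occurs only once.
parsed = xmltodict.parse(single_product_feed, force_list=('Product',))
products = parsed['root']['Products']['Product']

# Values are strings, so convert quantities explicitly.
stock_by_code = {p['Code']: int(p['StockQty']) for p in products}
print(stock_by_code)
```

Without force_list, the comprehension above would iterate over the keys of a single dict and fail, so it's worth setting whenever the number of products in a feed can vary.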

Example 2

Sometimes the XML data has CDATA sections. CDATA is short for "Character Data" and tells the parser that the characters inside the section should always be treated as regular text (in case the data contains characters that could be interpreted as XML markup):

xml_data = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
  <Products>
    <Product>
      <Code><![CDATA[ 2941 ]]></Code>
      <StockQty><![CDATA[ 65 ]]></StockQty>
      <Barcode><![CDATA[ 49020570284087 ]]></Barcode>
    </Product>
    <Product>
      <Code><![CDATA[ 2778 ]]>2778</Code>
      <StockQty><![CDATA[ 200 ]]></StockQty>
      <Barcode><![CDATA[ 72020570064306 ]]></Barcode>
    </Product>
  </Products>
</root>'''

This is handled automatically by xmltodict, so the code is unchanged:

xmltodict.parse(xml_data)['root']['Products']['Product']

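This outputs the following (note that the second Code, which mixes a CDATA section with plain text, has the two pieces joined together):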

[OrderedDict([('Code', '2941'),
              ('StockQty', '65'),
              ('Barcode', '49020570284087')]),
 OrderedDict([('Code', '2778 2778'),
              ('StockQty', '200'),
              ('Barcode', '72020570064306')])]
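The padding spaces inside the CDATA sections disappear because xmltodict strips leading and trailing whitespace from text by default. If the padding matters for your feed, the strip_whitespace argument can turn that off. Here's a small sketch on a one-element document, kept on a single line so that indentation between elements doesn't show up in the result:

```python
import xmltodict

snippet = '<Code><![CDATA[ 2941 ]]></Code>'

# Default behaviour: surrounding whitespace is stripped from the text.
stripped = xmltodict.parse(snippet)

# With strip_whitespace=False the text is kept exactly as written.
unstripped = xmltodict.parse(snippet, strip_whitespace=False)

print(repr(stripped['Code']))
print(repr(unstripped['Code']))
```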

Example 3

Sometimes the XML feed might use a mix of attributes and elements like this:

xml_data = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
  <Products>
    <Product CODE="2941" BARCODE="49020570284087">
      <StockQty>65</StockQty>
    </Product>
    <Product CODE="2778" BARCODE="72020570064306">
      <StockQty>200</StockQty>
    </Product>
  </Products>
</root>'''

Again, the code doesn't change. Let's use the same one-liner to pull the Product data:

xmltodict.parse(xml_data)['root']['Products']['Product']

This outputs:

[OrderedDict([('@CODE', '2941'),
              ('@BARCODE', '49020570284087'),
              ('StockQty', '65')]),
 OrderedDict([('@CODE', '2778'),
              ('@BARCODE', '72020570064306'),
              ('StockQty', '200')])]

The only difference in this case is that the dict keys for attributes are prefixed with an "@" symbol.
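If downstream code expects the same style of keys whether a feed uses attributes or elements, the prefix can be changed with the attr_prefix argument of xmltodict.parse. Here's a sketch, assuming we want attribute names with no prefix at all. Be aware that an attribute and a child element with the same name would then end up under the same key.

```python
import xmltodict

xml_data = '''<root>
  <Products>
    <Product CODE="2941" BARCODE="49020570284087">
      <StockQty>65</StockQty>
    </Product>
    <Product CODE="2778" BARCODE="72020570064306">
      <StockQty>200</StockQty>
    </Product>
  </Products>
</root>'''

# attr_prefix='' drops the default "@" from attribute keys.
products = xmltodict.parse(xml_data, attr_prefix='')['root']['Products']['Product']

codes = [p['CODE'] for p in products]
print(codes)
```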

This package has really saved us quite a bit of time parsing XML data. We've basically set up our EZ Inventory app to have methods for reading different types of data (currently CSV, XLS, XLSX, and XML) and have all of them converted to a Python dict as the final form before processing them. So with just a few lines of code using xmltodict, we were able to add support for XML data fairly quickly.
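To give a rough idea of that "everything ends up as a dict" approach, here's a simplified sketch. It is not the actual EZ Inventory code: the function names, the READERS lookup, and the assumption that XML feeds follow the root/Products/Product layout from the examples above are all made up for illustration.

```python
import csv
import io

import xmltodict


def read_csv_feed(text):
    # Hypothetical reader: each CSV row becomes a dict keyed by the header row.
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def read_xml_feed(text):
    # Hypothetical reader: assumes the root/Products/Product layout used above.
    parsed = xmltodict.parse(text, force_list=('Product',))
    return [dict(p) for p in parsed['root']['Products']['Product']]


# One reader per supported format; each returns a list of plain dicts.
READERS = {'csv': read_csv_feed, 'xml': read_xml_feed}


def load_feed(text, fmt):
    # Pick the reader for the format; everything downstream sees the same shape.
    return READERS[fmt](text)


csv_rows = load_feed('Code,StockQty\n2941,65\n', 'csv')
xml_rows = load_feed(
    '<root><Products><Product><Code>2941</Code>'
    '<StockQty>65</StockQty></Product></Products></root>',
    'xml',
)
print(csv_rows == xml_rows)
```

Because both readers return the same shape, the processing code after this point doesn't need to know which format the supplier sent.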
