Python XML Parser Tutorial

On January 12, 2010, in Web Coding, by admin

Developers often consume REST services or other data feeds that deliver data as XML. I tried doing this today and noticed the lack of a simple, easy-to-follow tutorial on how to parse XML using Python.

#import library to do http requests (Python 2; in Python 3 this lives in urllib.request):
import urllib2

#import easy to use xml parser called minidom:
from xml.dom.minidom import parseString
#both modules ship with the Python 2 standard library

#download the file:
file = urllib2.urlopen('http://www.somedomain.com/somexmlfile.xml')
#read the response body into a string:
data = file.read()
#close the connection because we don't need it anymore:
file.close()
#parse the xml you downloaded
dom = parseString(data)
#retrieve the first xml tag (<tag>data</tag>) that the parser finds with name tagName:
xmlTag = dom.getElementsByTagName('tagName')[0].toxml()
#strip off the tag (<tag>data</tag>  --->   data):
xmlData = xmlTag.replace('<tagName>', '').replace('</tagName>', '')
#print out the xml tag and data in this format: <tag>data</tag>
print xmlTag
#just print the data
print xmlData

There you have it: that is all you need to do to get a value out of a simple web-based XML file using Python.
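
For readers on Python 3, where urllib2 is gone and print is a function, a rough equivalent might look like the sketch below. The helper names fetch_xml and first_tag_text and the sample document are made up for illustration; they are not part of the original listing. The sketch also reads the element's text nodes directly instead of stripping tags with replace(), which is one possible alternative rather than the only way to do it.

```python
from urllib.request import urlopen
from xml.dom.minidom import parseString


def fetch_xml(url):
    """Download a document and return its raw bytes (defined but not called here)."""
    with urlopen(url) as response:
        return response.read()


def first_tag_text(xml_data, tag_name):
    """Return the text inside the first <tag_name> element, or None if absent."""
    dom = parseString(xml_data)
    matches = dom.getElementsByTagName(tag_name)
    if not matches:
        return None
    # Join the element's direct text nodes instead of stripping tags by hand
    return "".join(
        node.data for node in matches[0].childNodes
        if node.nodeType == node.TEXT_NODE
    )


# Hypothetical sample document standing in for a downloaded feed
sample = b"<feed><tagName>data</tagName></feed>"
print(first_tag_text(sample, "tagName"))
```

With a real feed, the call would be along the lines of first_tag_text(fetch_xml('http://www.somedomain.com/somexmlfile.xml'), 'tagName').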

If you want to read the XML from a local file instead, you can do it in a similar fashion:

#import easy to use xml parser called minidom:
from xml.dom.minidom import parseString
#minidom ships with the Python standard library

#open the xml file for reading:
file = open('somexmlfile.xml', 'r')
#read the contents into a string:
data = file.read()
#close file because we don't need it anymore:
file.close()
#parse the xml you got from the file
dom = parseString(data)
#retrieve the first xml tag (<tag>data</tag>) that the parser finds with name tagName:
xmlTag = dom.getElementsByTagName('tagName')[0].toxml()
#strip off the tag (<tag>data</tag>  --->   data):
xmlData = xmlTag.replace('<tagName>', '').replace('</tagName>', '')
#print out the xml tag and data in this format: <tag>data</tag>
print xmlTag
#just print the data
print xmlData

And that's all you need to know to do the same with a file on your local machine.
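
As a hedged Python 3 sketch of the local-file case: minidom also offers parse(), which takes a filename directly, so the open/read/close steps can be skipped. The temporary file below is only there to make the example self-contained; in practice you would pass your own path.

```python
import os
import tempfile
from xml.dom.minidom import parse

# Write a small stand-in file so the sketch runs on its own
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as handle:
    handle.write("<config><tagName>data</tagName></config>")
    xml_path = handle.name

# parse() accepts a filename (or an open file object) directly
dom = parse(xml_path)
element = dom.getElementsByTagName("tagName")[0]
xml_tag = element.toxml()          # the element including its tags
xml_data = element.firstChild.data  # just the text inside it

print(xml_tag)
print(xml_data)

# Clean up the stand-in file
os.remove(xml_path)
```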

If you have any questions, leave a comment on this post. Thanks.


36 Responses to Python XML Parser Tutorial

  1. Roseele Dahang says:

    great :)

  2. Gustaf says:

    Great! Just what I was looking for.

    You could make it a bit shorter though if you skip .toxml() and .replace() and replace those lines with this one:
    xmlData = dom.getElementsByTagName('tagName')[0].firstChild.data
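
A small Python 3 sketch of why reading firstChild.data can be more robust than the replace() approach, and of one case where it needs a guard. The sample document, including the id attribute and the empty element, is invented for illustration.

```python
from xml.dom.minidom import parseString

dom = parseString('<root><tagName id="1">data</tagName><empty/></root>')

element = dom.getElementsByTagName("tagName")[0]
# replace() only matches the bare tag, so an opening tag with attributes survives
stripped = element.toxml().replace("<tagName>", "").replace("</tagName>", "")
# firstChild.data reads the text node directly
text = element.firstChild.data

# An empty element has no children, so firstChild is None; guard before using .data
empty = dom.getElementsByTagName("empty")[0]
empty_text = empty.firstChild.data if empty.firstChild is not None else ""

print(stripped)
print(text)
```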

  3. Swilky says:

    THANKYOU plain and simple! exactly what I am looking for.

  4. Snic says:

    Not sure if anyone is still watching this post, but it’s always worth a shot to ask. I love your example; it’s nice, clean and to the point. However, what I’m wondering is how I can expand this to make it parse all the XML docs in a single directory.

  5. Snic says:

    Nevermind, I was having a dumb moment, got it.
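
The thread does not say how this was solved. One possible approach, sketched in Python 3, is to loop over the matching files with glob and parse each one. The function name first_values and the throwaway files are assumptions made for the example.

```python
import glob
import os
import tempfile
from xml.dom.minidom import parse


def first_values(directory, tag_name):
    """Map each .xml filename in directory to the text of its first <tag_name>."""
    results = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.xml"))):
        matches = parse(path).getElementsByTagName(tag_name)
        # Skip files without the tag, or where the tag is empty
        if matches and matches[0].firstChild is not None:
            results[os.path.basename(path)] = matches[0].firstChild.data
    return results


# Self-contained demo with two throwaway files
with tempfile.TemporaryDirectory() as demo_dir:
    for name, value in (("a.xml", "one"), ("b.xml", "two")):
        with open(os.path.join(demo_dir, name), "w") as handle:
            handle.write("<doc><tagName>%s</tagName></doc>" % value)
    values = first_values(demo_dir, "tagName")

print(values)
```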

  6. Thomas says:

    Exactly what I was searching for.

    Thanks Travis and Gustav

  7. Thomas says:

    I read my gpsies track with this:


    import urllib, time, os.path
    import xml.dom.minidom

    PATH = u"D:\\Data\\IGC"
    if not os.path.exists(PATH):
        os.makedirs(PATH)
    GPSIES_XML_FILE_PATH = PATH+"\\"+u"GPSies.xml"

    URL = "http://www.gpsies.com/api.do?key=youreownapikey&username=youreusername&limit=50&filetype=gpxTrk"
    urllib.urlretrieve(URL, GPSIES_XML_FILE_PATH)

    gpsies_xml_file = open(GPSIES_XML_FILE_PATH,'r')

    gpsies_xml_file_data = gpsies_xml_file.read()
    #close file because we dont need it anymore:
    gpsies_xml_file.close()

    dom = xml.dom.minidom.parseString(gpsies_xml_file_data)

    def getText(nodelist):
        rc = []
        for node in nodelist:
            if node.nodeType == node.TEXT_NODE:
                rc.append(node.data)
        return ''.join(rc)

    tracks = dom.getElementsByTagName("track")
    n = 0
    mytracks = []

    for track in tracks:
        km = float((getText(track.getElementsByTagName("trackLengthM")[0].childNodes)))/1000
        trackname = getText(track.getElementsByTagName("title")[0].childNodes)
        localfilename = "%s.gpx" % trackname.replace(" ","")
        link = getText(track.getElementsByTagName("downloadLink")[0].childNodes)
        urllib.urlretrieve(link, PATH+"\\"+localfilename)

    i hope it is readable, i put it in the code tag.

  8. Thomas says:

    indentation are gone, some linefeed to much…

  9. ahmed says:

    Thanks for the help.
    If you have nested levels, you can use something like this:
    latTag = dom.getElementsByTagName('item')[i].getElementsByTagName('geo:lat')[0].firstChild.data
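
A hedged Python 3 sketch of that nested lookup, looping over every item instead of indexing with i. The feed below is invented for the example; note that a prefixed tag such as geo:lat only parses if the document declares the geo prefix.

```python
from xml.dom.minidom import parseString

# Hypothetical feed; the geo prefix must be declared for the document to parse
feed = """<rss xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <item><geo:lat>48.85</geo:lat><geo:long>2.35</geo:long></item>
  <item><geo:lat>51.50</geo:lat><geo:long>-0.12</geo:long></item>
</rss>"""

dom = parseString(feed)
latitudes = []
for item in dom.getElementsByTagName("item"):
    # Calling getElementsByTagName on an element searches only inside that element
    lat = item.getElementsByTagName("geo:lat")[0]
    latitudes.append(lat.firstChild.data)

print(latitudes)
```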

  10. veeresh says:

    I am new to Python and I need to comment out part of an XML file using Python. Could anyone help me out? An example is given below:

    <!-- --> # I need to comment out the country element
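
The example in the comment above did not survive posting, so the following Python 3 sketch assumes a document with a country element. One way to comment an element out with minidom is to build a comment node from the element's markup and swap it in.

```python
from xml.dom.minidom import parseString

# Assumed document; the original example was lost
dom = parseString("<data><country>Panama</country><city>Colon</city></data>")
country = dom.getElementsByTagName("country")[0]

# Build a comment node holding the element's markup, then swap it in
comment = dom.createComment(country.toxml())
country.parentNode.replaceChild(comment, country)

result = dom.documentElement.toxml()
print(result)
```

One caveat: XML comments may not contain a double hyphen, so this would fail at serialization time if the element's markup contained "--".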

  11. Andres says:

    For some reason, my XML file gets printed twice. Do you guys have any idea why? I’m looking at the Python code and there doesn’t seem to be anything wrong.

  12. Andres says:

    Oh, never mind. I figured it out. The python script is replacing the given tag with spaces and printing the xml again.

  13. Danny says:

    Cool! Got me started.

  14. jeffa says:

    Cool! Saved me some RTFM. Changed the tag names and the url. Had what I needed.

    Thanks!

  15. Orcris says:

    Thanks. Just what I was looking for. With a few tweaks, I was able to get it to display the feed as HTML and convert it to Python 3. Great job.

  16. Gary says:

    Hi, I’ve a word doc that I’ve saved as an xml file, I’m looking to parse this file and take relevant info out of tables. So the tags I have are for table headings eg Interface. When I insert my own tags, I get the error:

    xmlTag = dom.getElementsByTagName('Interface')[0].toxml()
    IndexError: list index out of range

    I’m a student and very new to parsing and xml, so my apologies if the above seems idiotic….I’m just going out of my mind at this stage.

    Thanks in advance for any help and advice provided.
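
That IndexError means getElementsByTagName returned an empty list, so indexing it with [0] fails. A Python 3 sketch of guarding against this, and of listing the element names a document really contains, follows. The sample markup is a made-up stand-in, not real Word output, and assumes that "Interface" is text inside an element rather than an element name.

```python
from xml.dom.minidom import parseString

# Hypothetical stand-in for an exported document with prefixed element names
dom = parseString(
    '<w:doc xmlns:w="urn:example"><w:tbl><w:t>Interface</w:t></w:tbl></w:doc>'
)

matches = dom.getElementsByTagName("Interface")
if not matches:
    # "Interface" is text inside an element here, not an element name
    print("no <Interface> element found")

# "*" matches every element, which reveals the names that do exist
tag_names = sorted({node.tagName for node in dom.getElementsByTagName("*")})
print(tag_names)
```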

  17. Gary says:

    I’ve run it on a basic xml file

    Tove
    Jani
    Reminder
    Don’t forget me this weekend!

    So the issue are my tags..

    and it works fine, however, the word doc I’ve converted is pretty data heavy, I have been trying to use VB but this seems like a much simpler solution…

    I must mention that I have to parse 100’s of word docs, extract the data and then export it to excel…

  18. Gary says:

    Hi… I’m just having some issues with this. I’ve converted a Word document to an XML doc and have tried the above on my XML file. The problem is that the information I’m looking to extract is in tables and the tags I have are table headings. It’s not as straightforward as “Gary”, which would pull out Gary; the tags I have are “Product”. I apologize in advance if this question is idiotic, but I am looking for something just like this, a quick, simple, easy parser. I’ve been researching this for weeks and am pulling my hair out; any assistance would be greatly appreciated…

    Gary

  19. danish says:

    OK guys, I want to ask a question. I am having a problem parsing data using this code.
    I experimented with a simple XML file that has this data:

    0
    2

    now when i run the code it gives me this error..

    Traceback (most recent call last):
    File "C:\Python25\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
    exec codeObject in __main__.__dict__
    File "C:\Users\Danish Muneer\xmlparser.py", line 15, in
    dom = parseString(data)
    File "C:\Python25\lib\xml\dom\minidom.py", line 1923, in parseString
    return expatbuilder.parseString(string)
    File "C:\Python25\lib\xml\dom\expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
    File "C:\Python25\lib\xml\dom\expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
    ExpatError: junk after document element: line 2, column 0

    But when I remove the second line and keep only one line in the data file, the code runs and gives the tags and data correctly. Please help: what am I doing wrong? Am I making the XML file wrong?

  20. danish says:

    the data 0 and 2 are between and tags
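
The "junk after document element" error is what the parser reports when a file has more than one top-level element; a well-formed XML document has exactly one root. The tag names in the comment above were lost, so this Python 3 sketch assumes elements called value and a wrapper called values.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Two top-level elements: not a well-formed document (tag names are assumed)
two_roots = "<value>0</value>\n<value>2</value>"
try:
    parseString(two_roots)
    error_message = None
except ExpatError as exc:
    error_message = str(exc)
print(error_message)

# Wrapping them in a single root element makes the data parseable
dom = parseString("<values>" + two_roots + "</values>")
values = [node.firstChild.data for node in dom.getElementsByTagName("value")]
print(values)
```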

  21. pythonkid says:

    Hallo All,

    Nice tutorial for parsing, but could anyone suggest how to read the tags? I have an XML file and want to extract the tags and their values as a list of pairs.

    Many thanks for hints.
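
One way this could be done, sketched in Python 3: ask minidom for every element with the "*" wildcard and pair each tag name with the text directly inside it. The sample document and its tag names are assumptions for the example.

```python
from xml.dom.minidom import parseString

# Hypothetical document; the tag names are invented for illustration
dom = parseString(
    "<note><to>Tove</to><from>Jani</from><heading>Reminder</heading></note>"
)

pairs = []
for element in dom.getElementsByTagName("*"):
    # Collect only the text that sits directly inside this element
    text = "".join(
        child.data for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    ).strip()
    if text:
        pairs.append((element.tagName, text))

print(pairs)
```

Elements that only contain other elements, like note here, are skipped because they have no text of their own.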

  22. Sean says:

    Thanks! I am going to be testing this out on some of my own stuff.

  23. Dale says:

    Thanks so much for publishing this. It got me looking at Python’s documentation on minidom. :)

    Also note that if you don’t want to put the file into a string first, you can use this instead:
    from xml.dom.minidom import parse
    dom = parse('someXMLFilename.xml')

  24. tarun says:

    Hey, can you please tell me how to write the code if I want to open a window by clicking the Open option in the menu of a GUI made with Boa Constructor?

  25. Anand says:

    Hi,

    I am getting an error like this :

    Traceback (most recent call last):
    File "htmlstrip.py", line 10, in
    dom = xml.dom.minidom.parseString(data)
    File "C:\Python27\lib\xml\dom\minidom.py", line 1930, in parseString
    return expatbuilder.parseString(string)
    File "C:\Python27\lib\xml\dom\expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
    File "C:\Python27\lib\xml\dom\expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
    xml.parsers.expat.ExpatError: syntax error: line 1, column 0
    Press any key to continue . . .

    my code is a copycat of above example :

    import urllib2

    from xml.dom.minidom import parseString

    file1 = open('testcasesbig.txt','r')
    data = file1.read()
    print data
    file1.close()
    dom = parseString(data)
    xmlTag = dom.getElementsByTagName('tagName')[0].toxml()
    print xmlTag

  26. Hoshpak says:

    Thank you very much for this really helpful post. I looked for an understandable tutorial how to parse xml in python for a long time and finally found it here. It’s actually pretty easy if you know how to do it, thanks!

    @Anand: I believe you need to replace ‘tagName’ in line 10 with the actual tag you are looking for, unless you are really using an XML file that uses that tag.

  27. Elwood Downey says:

    This is helpful, thank you.

    However, in my case I do not have documents in files. I have a socket stream that contains endless self-contained “documents”, or top level tags, one after another. I would appreciate ideas on how to handle these. Thanks in advance.
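
One possible way to handle a stream of back-to-back documents, sketched in Python 3. It swaps in a different standard-library tool, xml.etree.ElementTree.XMLPullParser, rather than minidom, and feeds it a synthetic root tag so the whole stream looks like one long document. It assumes the individual documents carry no XML declaration or DOCTYPE of their own; the function name iter_documents and the msg tags are invented for the example.

```python
import xml.etree.ElementTree as ET


def iter_documents(chunks):
    """Yield each top-level element from an iterable of text chunks."""
    parser = ET.XMLPullParser(events=("start", "end"))
    # A synthetic root lets the parser treat the stream as one long document
    parser.feed("<stream>")
    depth = 0
    for chunk in chunks:
        parser.feed(chunk)
        for event, element in parser.read_events():
            if event == "start":
                depth += 1
            else:
                depth -= 1
                # depth 1 means a direct child of <stream> has just closed
                if depth == 1:
                    yield element


# Chunks standing in for successive socket reads; note the split mid-tag
chunks = ["<msg>one</msg><ms", "g>two</msg>"]
texts = [doc.text for doc in iter_documents(chunks)]
print(texts)
```

For a truly endless stream, the finished elements would also need to be released (for example with element.clear()), since the synthetic root otherwise keeps growing.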

  28. Dan Roy says:

    I have a question on how to generate XML to the PARCA schema to the UN/CEFACT standard, published on the PARCA website (http://dcarc.cape.osd.mil/Files/Training/EV_Training/PARCA%20EVM%20Policies%20and%20IPMR%20DID%20102012.pdf). I have to apply a capability to produce XML output per this standard on the AMS RTP (RealTime Planning) software, described at http://www.amsusa.com. This software has internal Python capabilities. PARCA has provided various schemas that must be used to produce the extracts.
    How would one go about producing xml extracts of this database? Are there commercial products out there that can do it externally?

  29. Rocz says:

    Hmm, does it work with Python 3.2?

    Is the xml.dom.minidom library included by default in Python 3.2?

  30. Rocz says:

    Here’s the big deal! When you’re trying to execute the code, you have to pay attention to your script’s name!

    Do not use “xml” as your script’s name! “xml.py” will cause lots of errors when Python tries to import the real xml library.

    ;-)

  31. Jonatas C D says:

    Hey! Well done, I’d suggest one thing only.
    When you do the #strip off the tag, I’d add:
    .replace(” ,”)

    I bumped into a case where I needed it, and it works. Just some feedback.

    =]
    cheers

  32. Jonatas C D says:

    Oops, my last comment was posted without the tagName inside. My bad.

  33. Daniel says:

    Gr8 post! I was wondering how I can update an XML file from Python. Let’s suppose my XML file has the configuration for a game, like difficulty level and player’s name. If I want to change the difficulty level and name, how can I do this?

    Thanks in advance
    Daniel
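
A minimal Python 3 sketch of one way to update values with minidom: change the data of the element's text node, then serialize the document again. The settings document, tag names and the set_text helper are all assumptions made for the example.

```python
from xml.dom.minidom import parseString

# Hypothetical game settings; in practice this could come from parse() on a file
dom = parseString(
    "<settings><difficulty>easy</difficulty><player>Daniel</player></settings>"
)


def set_text(dom, tag_name, new_value):
    """Replace the text of the first <tag_name> element."""
    element = dom.getElementsByTagName(tag_name)[0]
    # Assumes the element already contains a single text node
    element.firstChild.data = new_value


set_text(dom, "difficulty", "hard")
set_text(dom, "player", "Dana")

updated = dom.documentElement.toxml()
print(updated)
```

To save the result, you could open the file for writing and call dom.writexml() with the file object.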

  34. gsavix says:

    Yes, it is a good example. Based on your description I built my own.

    This Python code:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import urllib, urllib2
    from xml.dom.minidom import parseString
    axurlp1 = "http://maps.googleapis.com/maps/api/geocode/xml?sensor=false&address="
    axurlp2 = "avenida paulista 1000, São Paulo, SP, Brasil"
    axurlformat = urllib.quote(axurlp2)
    axurl2 = axurlp1 + axurlformat
    print axurl2
    axresp = urllib2.urlopen(axurl2)
    axdados = axresp.read()
    print axdados
    axdom = parseString(axdados)
    xmlTag = axdom

    produces this output:

    OK

    street_address
    Paulista Avenue, 1000 – Bela Vista, São Paulo, 01310-100, Brazil

    1000
    1000
    street_number

    Paulista Avenue
    Paulista Avenue
    route

    Bela Vista
    Bela Vista
    sublocality
    political

    São Paulo
    São Paulo
    locality
    political

    São Paulo
    SP
    administrative_area_level_1
    political

    Brazil
    BR
    country
    political

    01310-100
    01310-100
    postal_code

    -23.5649897
    -46.6520078

    RANGE_INTERPOLATED

    -23.5663450
    -46.6533629

    -23.5636470
    -46.6506650

    -23.5650023
    -46.6520201

    -23.5649897
    -46.6520078

    Paulista Avenue, 1000 – Bela Vista, São Paulo, 01310-100, Brazil

  35. Jessie says:

    Thanks, this was exactly what I was doing with a Perl script but I need to convert the functionality over to Python.

  36. Shane says:

    Awesome, exactly what I was looking for.
