Developers often work with REST services or other data feeds that deliver data as XML. I tried doing this today and noticed the lack of a simple, easy-to-follow tutorial on how to parse XML using Python.
#import library to do http requests:
import urllib2

#import easy to use xml parser called minidom:
from xml.dom.minidom import parseString
#all these imports are standard on most modern python implementations

#download the file:
file = urllib2.urlopen('http://www.somedomain.com/somexmlfile.xml')
#convert to string:
data = file.read()
#close file because we dont need it anymore:
file.close()
#parse the xml you downloaded
dom = parseString(data)
#retrieve the first xml tag (<tag>data</tag>) that the parser finds with name tagName:
xmlTag = dom.getElementsByTagName('tagName')[0].toxml()
#strip off the tag (<tag>data</tag> ---> data):
xmlData=xmlTag.replace('<tagName>','').replace('</tagName>','')
#print out the xml tag and data in this format: <tag>data</tag>
print xmlTag
#just print the data
print xmlData
There you have it: that is all you need to do to get a value out of a simple web-based XML file using Python.
If you want to do it from a file you can do it in a similar fashion:
#import easy to use xml parser called minidom:
from xml.dom.minidom import parseString
#all these imports are standard on most modern python implementations

#open the xml file for reading:
file = open('somexmlfile.xml','r')
#convert to string:
data = file.read()
#close file because we dont need it anymore:
file.close()
#parse the xml you got from the file
dom = parseString(data)
#retrieve the first xml tag (<tag>data</tag>) that the parser finds with name tagName:
xmlTag = dom.getElementsByTagName('tagName')[0].toxml()
#strip off the tag (<tag>data</tag> ---> data):
xmlData=xmlTag.replace('<tagName>','').replace('</tagName>','')
#print out the xml tag and data in this format: <tag>data</tag>
print xmlTag
#just print the data
print xmlData
And that's all you need to know to do the same with a file on your local machine.
If you have any questions, leave a comment on this post. Thanks.





great
Great! Just what I was looking for.
You could make it a bit shorter though if you skip .toxml() and .replace() and replace those lines with this one:
xmlData = dom.getElementsByTagName('tagName')[0].firstChild.data
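A minimal sketch of that shortcut, written for Python 3 and run against a made-up inline document (the `SAMPLE` string and its `tagName` element are placeholders, not part of the original post):

```python
from xml.dom.minidom import parseString

# Placeholder document; 'tagName' stands in for whatever tag you need.
SAMPLE = "<root><tagName>data</tagName></root>"

dom = parseString(SAMPLE)
element = dom.getElementsByTagName("tagName")[0]

# firstChild is the text node inside <tagName>...</tagName>;
# .data is its string content, so there are no tags to strip.
xmlData = element.firstChild.data
print(xmlData)
```

Note that an empty element such as `<tagName/>` has no children, so `firstChild` would be `None` and `.data` would fail; a check for that case is worth adding when the input is not under your control.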
Thank you, plain and simple! Exactly what I was looking for.
Not sure if anyone is still watching this post. But, always worth a shot to ask. I love your example, it’s nice, clean and to the point. However, what I’m wondering is how can I expand this and make it parse all the XML docs in a single directory?
Nevermind, I was having a dumb moment, got it.
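For anyone with the same question, one possible approach is sketched below for Python 3. The function name `first_tag_values` is made up for illustration, and the choice to skip malformed files silently is an assumption about what you want:

```python
import glob
import os
from xml.dom.minidom import parse
from xml.parsers.expat import ExpatError


def first_tag_values(directory, tag_name):
    """Map each .xml file name in `directory` to the text of its first `tag_name` element."""
    results = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.xml"))):
        try:
            dom = parse(path)  # parse() accepts a file name directly
        except ExpatError:
            continue  # assumption: files that are not well-formed are skipped
        matches = dom.getElementsByTagName(tag_name)
        if not matches:
            continue  # this file does not contain the tag at all
        first = matches[0].firstChild
        if first is not None and first.nodeType == first.TEXT_NODE:
            results[os.path.basename(path)] = first.data
    return results
```

Calling `first_tag_values("some/folder", "tagName")` would return a dictionary keyed by file name; the folder path here is only a placeholder.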
Exactly what I was searching for.
Thanks Travis and Gustav
I read my GPSies tracks with this:
import urllib, time, os.path
import xml.dom.minidom

PATH = u"D:\\Data\\IGC"
if not os.path.exists(PATH):
    os.makedirs(PATH)
GPSIES_XML_FILE_PATH = PATH+"\\"+u"GPSies.xml"
URL = "http://www.gpsies.com/api.do?key=youreownapikey&username=youreusername&limit=50&filetype=gpxTrk"
urllib.urlretrieve(URL, GPSIES_XML_FILE_PATH)
gpsies_xml_file = open(GPSIES_XML_FILE_PATH,'r')
gpsies_xml_file_data = gpsies_xml_file.read()
#close file because we dont need it anymore:
gpsies_xml_file.close()
dom = xml.dom.minidom.parseString(gpsies_xml_file_data)

def getText(nodelist):
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

tracks = dom.getElementsByTagName("track")
n = 0
mytracks = []
for track in tracks:
    km = float((getText(track.getElementsByTagName("trackLengthM")[0].childNodes)))/1000
    trackname = getText(track.getElementsByTagName("title")[0].childNodes)
    localfilename = "%s.gpx" % trackname.replace(" ","")
    link = getText(track.getElementsByTagName("downloadLink")[0].childNodes)
    urllib.urlretrieve(link, PATH+"\\"+localfilename)
I hope it is readable; I put it in the code tag.
Thanks for the help.
If you have nested levels, you can use something like this:
latTag = dom.getElementsByTagName('item')[i].getElementsByTagName('geo:lat')[0].firstChild.data
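The nested lookup above can be sketched as a loop over every item, written here for Python 3. The sample feed is made up for illustration; only the `item` and `geo:lat` names come from the comment. Note that the `geo` prefix has to be declared in the document, otherwise the parser rejects it:

```python
from xml.dom.minidom import parseString

# Made-up feed; the xmlns:geo declaration is required for the geo: prefix to parse.
SAMPLE = """<rss xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#">
  <item><geo:lat>52.5</geo:lat></item>
  <item><geo:lat>48.1</geo:lat></item>
</rss>"""

dom = parseString(SAMPLE)
latitudes = []
for item in dom.getElementsByTagName("item"):
    # Searching from `item` only looks at that item's descendants.
    lat_nodes = item.getElementsByTagName("geo:lat")
    if lat_nodes:
        latitudes.append(lat_nodes[0].firstChild.data)
print(latitudes)
```

Looping over the items avoids keeping a separate index `i` and skips items that have no latitude.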
I am new to Python and need to comment out part of an XML file using Python. Could anyone help me out? An example is given below:
<!-- --> # I need to comment out the country element
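One way this might be done with minidom, sketched for Python 3: serialize the element, wrap that text in a comment node, and swap the comment in for the element. The `country` element name is a guess based on the comment above, and the sample document and the `comment_out` helper are made up:

```python
from xml.dom.minidom import parseString

# Made-up document; 'country' is assumed to be the element to comment out.
SAMPLE = "<data><country>Norway</country><city>Oslo</city></data>"


def comment_out(dom, tag_name):
    """Replace every `tag_name` element with a comment holding its markup."""
    for element in list(dom.getElementsByTagName(tag_name)):
        comment = dom.createComment(element.toxml())
        element.parentNode.replaceChild(comment, element)


dom = parseString(SAMPLE)
comment_out(dom, "country")
result = dom.documentElement.toxml()
print(result)
```

A limitation of this sketch: XML comments may not contain `--`, so an element whose markup includes that sequence (for example a nested comment) cannot be serialized this way.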
For some reason, my XML file gets printed twice. Do you guys have any idea why? I’m looking at the Python code and there doesn’t seem to be anything wrong.
Oh, never mind. I figured it out. The Python script strips the given tag and then prints the XML again.
Cool! Got me started.
Cool! Saved me some RTFM. Changed the tag names and the url. Had what I needed.
Thanks!
Thanks. Just what I was looking for. With a few tweaks, I was able to get it to display the feed as HTML and convert it to Python 3. Great job.
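The commenter's Python 3 version was not posted, so the following is only one possible conversion of the tutorial code, not theirs. The function names are made up; the main changes are `urllib.request.urlopen` in place of `urllib2.urlopen` and `print()` as a function:

```python
from urllib.request import urlopen
from xml.dom.minidom import parseString


def first_tag_text(xml_data, tag_name):
    """Return the text inside the first `tag_name` element of `xml_data` (str or bytes)."""
    dom = parseString(xml_data)
    return dom.getElementsByTagName(tag_name)[0].firstChild.data


def fetch_first_tag_text(url, tag_name):
    """Download `url` and return the text of its first `tag_name` element."""
    with urlopen(url) as response:  # the with block closes the connection
        data = response.read()      # bytes in Python 3; parseString accepts them
    return first_tag_text(data, tag_name)


if __name__ == "__main__":
    # Parsing an inline document so the example runs without network access.
    print(first_tag_text(b"<root><tagName>data</tagName></root>", "tagName"))
```

With a real feed you would call `fetch_first_tag_text('http://www.somedomain.com/somexmlfile.xml', 'tagName')`, using the placeholder URL from the tutorial.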
Hi, I’ve a Word doc that I’ve saved as an XML file, and I’m looking to parse this file and take relevant info out of its tables. The tags I have are table headings, e.g. Interface. When I insert my own tags, I get the error:
xmlTag = dom.getElementsByTagName('Interface')[0].toxml()
IndexError: list index out of range
I’m a student and very new to parsing and xml, so my apologies if the above seems idiotic….I’m just going out of my mind at this stage.
Thanks in advance for any help and advice provided.
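That `IndexError` means `getElementsByTagName` returned an empty list, so `[0]` had nothing to index: no element in the document is literally named `Interface`. A likely explanation, offered here as an assumption, is that the heading is text inside a table cell rather than a tag name. The Python 3 sketch below uses a made-up stand-in document (real Word exports use different, namespaced element names) to show how to guard the lookup and list the tag names that actually exist:

```python
from xml.dom.minidom import parseString

# Made-up stand-in for a converted document; not real Word markup.
SAMPLE = "<table><row><cell>Interface</cell></row></table>"


def first_element_or_none(dom, tag_name):
    """Return the first `tag_name` element, or None instead of raising IndexError."""
    matches = dom.getElementsByTagName(tag_name)
    return matches[0] if matches else None


dom = parseString(SAMPLE)

# 'Interface' is cell text here, not an element name, so nothing matches.
missing = first_element_or_none(dom, "Interface")

# The text is reachable through the element that contains it.
cell = first_element_or_none(dom, "cell")
heading = cell.firstChild.data

# "*" matches every element, which helps discover the real tag names.
tag_names = sorted({element.tagName for element in dom.getElementsByTagName("*")})
print(missing, heading, tag_names)
```

Printing the tag names of your own file first is a quick way to find out which element to ask for.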
I’ve run it on a basic xml file
Tove
Jani
Reminder
Don’t forget me this weekend!
So the issue is my tags.
It works fine on that file; however, the Word doc I’ve converted is pretty data heavy. I have been trying to use VB, but this seems like a much simpler solution.
I must mention that I have to parse hundreds of Word docs, extract the data, and then export it to Excel.
Hi, I’m just having some issues with this. I’ve converted a Word document to an XML doc and have tried the above on my XML file. The problem is that the information I’m looking to extract is in tables and the tags I have are table headings. It’s not as straightforward as “Gary”, which would pull out Gary; the tags I have are “Product”. I apologize in advance if this question is idiotic, but I am looking for something just like this, a quick, simple, easy parser. I’ve been researching this for weeks and am pulling my hair out, so any assistance would be greatly appreciated.
Gary
OK guys, I want to ask a question. I am having a problem parsing data using this code.
I experimented with a simple XML file that has this data:
0
2
now when i run the code it gives me this error..
Traceback (most recent call last):
  File "C:\Python25\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
    exec codeObject in __main__.__dict__
  File "C:\Users\Danish Muneer\xmlparser.py", line 15, in <module>
    dom = parseString(data)
  File "C:\Python25\lib\xml\dom\minidom.py", line 1923, in parseString
    return expatbuilder.parseString(string)
  File "C:\Python25\lib\xml\dom\expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "C:\Python25\lib\xml\dom\expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
ExpatError: junk after document element: line 2, column 0
But when I remove the second line and keep only one line in the data file, the code runs and gives the tags and data correctly. Please help: what am I doing wrong? Am I making a wrong XML file?
the data 0 and 2 are between and tags
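The "junk after document element" error is what the parser reports when a second top-level element follows the first, because a well-formed XML document has exactly one root element. The Python 3 sketch below reproduces the problem and shows one fix, wrapping both lines in a single root. The `value` and `values` tag names are made up, since the original tags were lost from the comment:

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Two sibling elements with no common root, one per line (tag names are hypothetical).
TWO_ROOTS = "<value>0</value>\n<value>2</value>"

try:
    parseString(TWO_ROOTS)
    failed = False
except ExpatError as error:
    failed = True
    print("parse failed:", error)

# Wrapping the lines in one root element makes the document well-formed.
wrapped = parseString("<values>" + TWO_ROOTS + "</values>")
values = [node.firstChild.data for node in wrapped.getElementsByTagName("value")]
print(values)
```

If you control the file, adding the root element to the file itself is the cleaner fix; wrapping the string in code is a workaround for input you cannot change.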
Hello all,
Nice tutorial for parsing, but could anyone suggest how to read the tags themselves? If I have an XML file, I want to extract the tags and their values into a pairwise list.
Many thanks for any hints.
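One possible answer, sketched for Python 3: ask minidom for every element with the `"*"` wildcard and pair each tag name with the text directly inside it. The `tag_value_pairs` function and the sample document are made up for illustration, and skipping elements that hold no text of their own is a design assumption:

```python
from xml.dom.minidom import parseString

# Made-up sample document.
SAMPLE = "<note><to>Tove</to><from>Jani</from></note>"


def tag_value_pairs(xml_text):
    """Return (tag name, text) pairs for every element that directly contains text."""
    dom = parseString(xml_text)
    pairs = []
    for element in dom.getElementsByTagName("*"):  # "*" matches all elements
        text = "".join(
            child.data
            for child in element.childNodes
            if child.nodeType == child.TEXT_NODE
        ).strip()
        if text:  # containers such as <note> have no text of their own
            pairs.append((element.tagName, text))
    return pairs


pairs = tag_value_pairs(SAMPLE)
print(pairs)
```

The pairs come back in document order, and a tag that appears several times yields several pairs, which is why a list is used here rather than a dictionary.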
Thanks! I am going to be testing this out on some of my own stuff.
Thanks so much for publishing this. It got me looking at Python’s documentation on minidom.
Also note that if you don’t want to put the file into a string first, you can use this instead:
from xml.dom.minidom import parse
dom = parse('someXMLFilename.xml')
…
Hey, can you please tell me how to write the code to open a window when I click the Open option in the menu of a GUI made with Boa Constructor?
Hi,
I am getting an error like this :
Traceback (most recent call last):
  File "htmlstrip.py", line 10, in <module>
    dom = xml.dom.minidom.parseString(data)
  File "C:\Python27\lib\xml\dom\minidom.py", line 1930, in parseString
    return expatbuilder.parseString(string)
  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: syntax error: line 1, column 0
Press any key to continue . . .
my code is a copycat of above example :
import urllib2
from xml.dom.minidom import parseString
file1 = open('testcasesbig.txt','r')
data = file1.read()
print data
file1.close()
dom = parseString(data)
xmlTag = dom.getElementsByTagName('tagName')[0].toxml()
print xmlTag
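An error at line 1, column 0 is raised inside `parseString`, before any tag lookup happens, which suggests the parser rejected the very start of the input. One plausible cause, stated here as an assumption, is that the file does not begin with XML markup at all. The Python 3 sketch below (the `try_parse` helper is made up) catches the error and reports its position together with the first characters of the input, which usually makes the cause visible:

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def try_parse(data):
    """Return (dom, None) on success, or (None, message) describing the failure."""
    try:
        return parseString(data), None
    except ExpatError as error:
        # ExpatError carries the position where the parser gave up.
        message = "line %d, column %d: %s; input starts with %r" % (
            error.lineno,
            error.offset,
            error,
            data[:40],
        )
        return None, message


dom, problem = try_parse("just some plain text, not XML")
print(problem)
```

If the reported preview shows plain text, stray characters, or nothing at all before the first `<`, the file itself needs fixing rather than the parsing code.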
Thank you very much for this really helpful post. I looked for an understandable tutorial how to parse xml in python for a long time and finally found it here. It’s actually pretty easy if you know how to do it, thanks!
@Anand: I believe you need to replace 'tagName' in line 10 with the actual tag you are looking for, unless you are really using an XML file that uses that tag.
This is helpful, thank you.
However, in my case I do not have documents in files. I have a socket stream that contains endless self-contained “documents”, or top level tags, one after another. I would appreciate ideas on how to handle these. Thanks in advance.
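One idea for the socket case, sketched for Python 3 and offered as an assumption-laden starting point rather than a proven solution. It swaps minidom for the incremental `XMLPullParser` from `xml.etree.ElementTree` and feeds it a synthetic root tag first, so that each real document arrives as a child element instead of triggering a "junk after document element" error. The `DocumentStream` class and the `msg` documents are made up, and the sketch assumes the stream carries no `<?xml ...?>` declarations between documents:

```python
import xml.etree.ElementTree as ET


class DocumentStream:
    """Collect each complete top-level element from an endless run of XML documents."""

    def __init__(self):
        self._parser = ET.XMLPullParser(events=("start", "end"))
        # Synthetic root that is never closed; real documents become its children.
        self._parser.feed("<stream>")
        self._root = None
        self._depth = 0

    def feed(self, text):
        """Feed a chunk of decoded text; return the documents it completed."""
        self._parser.feed(text)
        completed = []
        for event, element in self._parser.read_events():
            if event == "start":
                self._depth += 1
                if self._depth == 1:
                    self._root = element  # the synthetic <stream> element
            else:
                self._depth -= 1
                if self._depth == 1:
                    completed.append(element)
                    # Detach finished documents so the root does not grow forever.
                    self._root.remove(element)
        return completed


stream = DocumentStream()
first = stream.feed("<msg id='1'>hel")          # document not finished yet
second = stream.feed("lo</msg><msg id='2'/>")   # finishes one, delivers another
print([doc.get("id") for doc in second])
```

In a real program you would decode each chunk received from the socket and pass it to `feed`, handling the returned elements as they arrive. Chunks may end anywhere, even in the middle of a tag, because the parser keeps the incomplete part until more data is fed.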