Parsing data couldn't be easier with XML::Dataset
It’s hard to believe that when it comes to XML parsing CPAN hasn’t already got you covered, but XML::Dataset is a new module that fills a useful void. XML::Dataset let’s you declare a plaintext data collection schema, and then goes and extracts the data for you, super fast. Read on to see how it works.
Requirements
The CPAN Testers results show that XML::Dataset v0.06 will run on any platform with Perl (down to 5.8.9). To install the module with CPAN, open up the terminal and type:
$ cpan XML::Dataset
Your data, extracted
To use XML::Dataset you’ll need some stringified XML source data and a data profile. A profile is just a plaintext schema which specifies the data you’d like to extract. Let’s look at an example:
use strict;
use warnings;
use XML::Dataset;
use Data::Printer;
my $sample_data = q(<?xml version="1.0"?>
<colleagues>
<colleague>
<title>The Boss</title>
<phone>+1 202-663-9108</phone>
</colleague>
<colleague>
<title>Admin Assistant</title>
<phone>+1 347-999-5454</phone>
<email>inbox@the_company.com</email>
</colleague>
<colleague>
<title>Minion</title>
<phone>+1 792-123-4109</phone>
</colleague>
</colleagues>);
my $sample_data_profile
= q(colleagues
colleague
title = dataset:colleagues
email = dataset:colleagues
phone = dataset:colleagues);
p parse_using_profile($sample_data, $sample_data_profile);
The code above declares a simple XML dataset ($sample_data) and a data profile to extract the required data ($sample_data_profile). XML::Dataset requires every indented newline in the data profile to map to another nested level of the data set. Once we reach the data attributes we want to extract, we simply assign a dataset to them (dataset:colleagues).
XML::Dataset exports the “parse_using_profile” function which extracts the data using our data profile and returns a Perl data structure. We use Data::Printer to print out the results. Running this code we get this output:
\ {
colleagues [
[0] {
phone "+1 202-663-9108",
title "The Boss"
},
[1] {
email "inbox@the_company.com",
phone "+1 347-999-5454",
title "Admin Assistant"
},
[2] {
phone "+1 792-123-4109",
title "Minion"
},
]
}
Note that XML::Dataset had no problem extracting the one email address that was present in the data, even though the other colleagues did not have that attribute. What if we wanted to collect emails and phone numbers, but in separate datasets? All we need to do is update $sample_data_profile with two datasets:
my $sample_data_profile
= q(colleagues
colleague
title = dataset:emails dataset:phones
email = dataset:emails
phone = dataset:phones);
Re-running the code, XML::Dataset now produces two datasets for us:
\ {
emails [
[0] {
title "The Boss"
},
[1] {
email "inbox@the_company.com",
title "Admin Assistant"
},
[2] {
title "Minion"
}
],
phones [
[0] {
phone "+1 202-663-9108",
title "The Boss"
},
[1] {
phone "+1 347-999-5454",
title "Admin Assistant"
},
[2] {
phone "+1 792-123-4109",
title "Minion"
}
]
}
A real example
Let’s write a program to parse a a more realistic data set. Many websites provide a sitemap that lists all of the content on the website, and when it was last updated. This information is used by search engines to optimize their crawling routines. The sitemap has a defined xml format, so it’s a cinch to parse it with XML::Dataset:
use strict;
use warnings;
use XML::Dataset;
use Data::Printer;
use HTTP::Tiny;
my $url = 'http://perltricks.com/sitemap.xml';
my $sitemap_data
= HTTP::Tiny->new->get($url)->{content};
my $sitemap_data_profile
= q(urlset
url
loc = dataset:sitemap_locations_modified
lastmod = dataset:sitemap_locations_modified);
p parse_using_profile($sitemap_data, $sitemap_data_profile);
The code above downloads the PerlTricks.com sitemap using HTTP::Tiny and extracts every URL and last modified timestamp from the sitemap. Running the code, we get this output:
\ {
sitemap_locations_modified [
[0] {
lastmod "2014-05-09",
loc "http://perltricks.com/"
},
[1] {
lastmod "2013-03-24",
loc "http://perltricks.com/article/1/2013/3/24/3-quick-ways-to-find-out-the-version-number-of-an-installed-Perl-module-from-the-terminal"
},
[2] {
lastmod "2013-03-27",
loc "http://perltricks.com/article/3/2013/3/27/How-to-cleanly-uninstall-a-Perl-module"
},
...
]
}
No problem! We could re-use that same program to download and parse any sitemap on the Internet.
Conclusion
XML::Dataset is fantastic for extracting fixed data schemas from XML. The plaintext data profiles are so easy to use, a non-programmer could write them. XML::Dataset is also fast: under the hood it uses XML::LibXML (and a few optimizations) and could be adapted for well-formatted HTML. It has great documentation and offers some advanced features like partial dataset parse dispatching. Module author James Spurin deserves credit for producing a quality module and a welcome addition to CPAN’s XML namespace.
Do you have a much-loved CPAN module that you’d like us to cover? Drop us an email
Cover image © Duncun Hull
This article was originally posted on PerlTricks.com.
Tags
David Farrell
David is the editor of Perl.com. An organizer of the New York Perl Meetup, he works for ZipRecruiter as a software developer, and sometimes tweets about Perl and Open Source.
Browse their articles
Feedback
Something wrong with this article? Help us out by opening an issue or pull request on GitHub