Parse and process XML (and HTML) with xml2
Please note that the information presented in this post reflects the package as it stood when initially released, and may now be outdated. For the most up-to-date information, kindly refer to https://xml2.r-lib.org/.
I’m pleased to announce that the first version of xml2 is now available on CRAN. Xml2 is a wrapper around the comprehensive libxml2 C library that makes it easier to work with XML and HTML in R:
- Read XML and HTML with `read_xml()` and `read_html()`.
- Navigate the tree with `xml_children()`, `xml_siblings()` and `xml_parent()`. Alternatively, use XPath to jump directly to the nodes you’re interested in with `xml_find_one()` and `xml_find_all()`. Get the full path to a node with `xml_path()`.
- Extract various components of a node with `xml_text()`, `xml_attrs()`, `xml_attr()`, and `xml_name()`.
- Convert to list with `as_list()`.
- Where appropriate, functions support namespaces with a global url -> prefix lookup table. See `xml_ns()` for more details.
- Convert relative urls to absolute with `url_absolute()`, and transform in the opposite direction with `url_relative()`. Escape and unescape special characters with `url_escape()` and `url_unescape()`.

Support for modifying and creating XML documents is planned for a future version.
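To make the functions listed above a little more concrete, here is a minimal sketch that touches most of them in turn. The `<library>` document, its element names and the `example.com` URL are made up purely for illustration; only the xml2 functions themselves come from the package.

```{r}
library(xml2)

# Hypothetical inline document, written without whitespace between tags
doc <- read_xml(paste0(
  "<library>",
  "<book id='b1' lang='en'><title>First</title><year>2001</year></book>",
  "<book id='b2' lang='fr'><title>Second</title><year>2015</year></book>",
  "</library>"
))

# Navigate the tree: children, parent and siblings
books <- xml_children(doc)                    # the two <book> nodes
first_title <- xml_children(books[[1]])[[1]]  # <title> of the first book
parent_name <- xml_name(xml_parent(first_title))
parent_name
#> [1] "book"
sibling_names <- xml_name(xml_siblings(first_title))
sibling_names
#> [1] "year"

# Extract components of nodes: text and attributes
titles <- xml_text(xml_find_all(doc, ".//title"))
titles
#> [1] "First"  "Second"
ids <- xml_attr(books, "id")   # one attribute from every node in the set
ids
#> [1] "b1" "b2"
xml_attrs(books[[1]])          # all attributes of a single node

# Convert a node to a nested R list
str(as_list(books[[1]]))

# URL helpers: resolve a relative url against a (hypothetical) base
abs_url <- url_absolute("img/logo.png", "http://example.com/docs/index.html")
abs_url
url_unescape(url_escape("a b"))
```

The document is built with `paste0()` only to keep the string free of whitespace between tags, so that `as_list()` returns just the element content.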
This package owes a debt of gratitude to Duncan Temple Lang, whose XML package has made it possible to use XML with R for almost 15 years!
Usage
You can install it by running:
```{r}
install.packages("xml2")
```

(If you’re on a mac, you might need to wait a couple of days - CRAN is busy rebuilding all the packages for R 3.2.0 so it’s running a bit behind.)
Here’s a small example working with an inline XML document:
```{r}
library(xml2)
x <- read_xml("<foo>
<bar>text <baz id = 'a' /></bar>
<bar>2</bar>
<baz id = 'b' />
</foo>")
xml_name(x)
#> [1] "foo"
xml_children(x)
#> {xml_nodeset (3)}
#> [1] <bar>text <baz id="a"/></bar>
#> [2] <bar>2</bar>
#> [3] <baz id="b"/>
# Find all baz nodes anywhere in the document
baz <- xml_find_all(x, ".//baz")
baz
#> {xml_nodeset (2)}
#> [1] <baz id="a"/>
#> [2] <baz id="b"/>
xml_path(baz)
#> [1] "/foo/bar[1]/baz" "/foo/baz"
xml_attr(baz, "id")
#> [1] "a" "b"
```

Development
Xml2 is still under active development. If you notice any problems (including crashes), please try the development version, and if that doesn’t work, file an issue.