Non-ASCII Characters and HTTP Requests with Tomcat/WebLogic
Decoding non-ASCII characters
One problem that has come up in production concerns URL encoding when "special" characters are involved. Special characters are those outside the set of characters allowed in URLs, as defined in RFC 2396 (Appendix A), namely:
pchar      = unreserved | escaped |
             ":" | "@" | "&" | "=" | "+" | "$" | ","
unreserved = alphanum | mark
mark       = "-" | "_" | "." | "!" | "~" | "*" | "'" |
             "(" | ")"
escaped    = "%" hex hex
hex        = digit | "A" | "B" | "C" | "D" | "E" | "F" |
             "a" | "b" | "c" | "d" | "e" | "f"
alphanum   = alpha | digit
alpha      = lowalpha | upalpha
lowalpha   = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
             "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
             "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
upalpha    = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
             "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
             "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
digit      = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
             "8" | "9"
In other words, a very limited subset of the US-ASCII characters.
But how can we pass a GET parameter whose value contains, for example, accented characters, and also one of the reserved characters such as "+"?
Caffè+Zuccherato
We take as a test case the string "caffè+zuccherato", and try to encode it with the JavaScript function "encodeURI", which percent-encodes the URL using "UTF-8". Our test case is converted to "caff%C3%A8+zuccherato", but when it arrives, Tomcat's parameter parser (Tomcat will be used for these tests) runs into some obstacles.
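The conversion is easy to reproduce in any JavaScript console. The snippet below is only a quick check, not part of the original tests; the variable names are made up for illustration.

```javascript
// The test case from the text.
const testCase = "caffè+zuccherato";

// encodeURI percent-encodes "è" as its two UTF-8 bytes (C3 A8)
// but leaves "+" untouched, because "+" is a legal URI character.
const viaEncodeURI = encodeURI(testCase);

console.log(viaEncodeURI); // caff%C3%A8+zuccherato
```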
The first thing we notice is that the "+" is not encoded. In a query string a "+" is parsed as a space, so the original character is lost. The character "è", instead, is encoded as the two-byte (16-bit) UTF-8 sequence %C3%A8, and Tomcat (default behaviour) is not able to reassemble multi-byte characters. In fact, looking at the parameter decoder (Parameters.urlDecode), we see this line of code, which builds the Java string from the GET parameters:
cbuf[i] = (char) (bbuf[i + start] & 0xff);
That is, each of the two bytes of "è" is converted into a separate character, and the resulting pair ("Ã¨") is meaningless.
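To see what that line does to a multi-byte character, here is a minimal JavaScript sketch of the same byte-per-character idea. The helper name bytePerChar is hypothetical, and the code only imitates the quoted Java line; it is not Tomcat's implementation.

```javascript
// UTF-8 bytes of "è": 0xC3 0xA8 (what %C3%A8 becomes after percent-decoding).
const bbuf = new TextEncoder().encode("è");

// Hypothetical helper mirroring the quoted line: one char per byte.
function bytePerChar(bytes) {
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i] & 0xff);
  }
  return out;
}

const asSingleBytes = bytePerChar(bbuf);                 // two chars: "Ã¨"
const asUtf8 = new TextDecoder("utf-8").decode(bbuf);    // one char: "è"

console.log(asSingleBytes.length, asUtf8.length); // 2 1
```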
Tomcat and encoding
Actually, Tomcat can be told to decode parameters differently. We can set the "URIEncoding" attribute on the HTTP Connector (in server.xml), so that parsing is done with "UTF-8" instead of the default (do not use "UTF-16", or you'll read an ancient Chinese proverb). But this is of little use, especially if the Tomcat instance is shared between multiple applications or, even worse, runs on an external hosting service.
Continuing our search for the best encoding method, we note that JavaScript also has "escape". For characters like "è" this one actually uses ISO-8859-1, the same encoding used by Tomcat, but it doesn't encode our "+", which then ends up in the GET as is and is parsed by Tomcat as a space.
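Again, this is easy to check in a console. The snippet is just an illustration with made-up variable names; note that "escape" is a legacy function, although current engines still provide it.

```javascript
// The same test case, this time through the legacy escape() function.
const escapeInput = "caffè+zuccherato";

// escape() writes "è" (U+00E8) as the single byte %E8,
// but "+" is on its list of characters that are never encoded.
const viaEscape = escape(escapeInput);

console.log(viaEscape); // caff%E8+zuccherato
```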
The good old form
In fact the solution is right under our eyes, and uses neither a built-in function of the JavaScript engine nor any of the thousands of encoding functions found on the Net. Just use "application/x-www-form-urlencoded" as the Content-Type: it encodes both the accented character "è" (in ISO-8859-1, assuming the page itself uses that charset) and the "+", giving "caff%E8%2Bzuccherato", which is precisely what we needed from the beginning.
Well, really the Content-Type is only a side effect: the real reason the encoding now works properly is that we are letting the browser do the work through a form, and the browser knows how to do it right. So we can send well-encoded parameters at any time with a function that uses a hidden form with dynamic input parameters (toy code, don't use it as-is!):
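To compare the three encodings seen so far, the sketch below imitates a server-side parser that reads form data as ISO-8859-1. The function decodeLatin1Form is a hypothetical stand-in written for this comparison, not Tomcat code: it turns "+" into a space and then maps every %XX to the character with that code.

```javascript
// Hypothetical stand-in for a parser that reads parameters as ISO-8859-1.
function decodeLatin1Form(raw) {
  // "+" means space in form encoding; unescape() maps each %XX to one char.
  return unescape(raw.replace(/\+/g, " "));
}

const fromForm = decodeLatin1Form("caff%E8%2Bzuccherato");       // browser form
const fromEscape = decodeLatin1Form("caff%E8+zuccherato");       // escape()
const fromEncodeURI = decodeLatin1Form("caff%C3%A8+zuccherato"); // encodeURI()

console.log(fromForm);      // caffè+zuccherato
console.log(fromEscape);    // caffè zuccherato
console.log(fromEncodeURI); // caffÃ¨ zuccherato
```

Only the form-encoded string survives the round trip with both the accent and the "+" intact.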
function encodeAndSend(value) {
    // Assumes the page contains a form named "encodeForm".
    document.encodeForm.action = '/encoder-test/encoder'; // (1)
    var encoded = document.createElement("input"); // (2)
    encoded.setAttribute('type', 'text');
    encoded.setAttribute('name', 'encoded');
    encoded.setAttribute('value', value); // (3)
    document.encodeForm.appendChild(encoded);
    document.encodeForm.submit(); // (4)
    return false;
}
Here you can see how:
1) We set the path to the service that takes the input parameters.
2) We dynamically create a container element for our parameter; it must be of "type" text and have the "name" that we want to give the parameter.
3) We set the value to send.
4) Finally we give the browser the submit command, which "packs" the request and sends it to the server. "Packs" means that the various parameters are encoded as expected.
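What "packing" means can be approximated outside a page with the standard URLSearchParams class, which serializes name/value pairs as application/x-www-form-urlencoded. This is only an approximation under one assumption worth stating: URLSearchParams always encodes in UTF-8, whereas a real form uses the page's charset, so here "è" comes out as %C3%A8 rather than %E8. The point is that the "+" is encoded as %2B.

```javascript
// Build the same single parameter that the hidden form would send.
const params = new URLSearchParams();
params.append("encoded", "caffè+zuccherato");

// Serialize it the way a form submission body/query string is written.
const packed = params.toString();

console.log(packed); // encoded=caff%C3%A8%2Bzuccherato
```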
And WebLogic?
Obviously the above also applies to WebLogic.
But WebLogic also provides a way to map, in the application deployment descriptor (weblogic.xml), the character set used for decoding. Unlike the Tomcat server configuration, this is part of each application, so it can be used safely without applications interfering with each other. It also has the advantage of being configurable at path level, which means that, if needed, we can use a different encoding for each path. An example of such a configuration, adapted to this use case, is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<wls:weblogic-web-app xmlns:wls="http://www.bea.com/ns/weblogic/90"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/j2ee http://java.sun.com/xml/ns/j2ee/web-app_2_4.xsd http://www.bea.com/ns/weblogic/90 http://www.bea.com/ns/weblogic/90/weblogic-web-app.xsd">
    <wls:weblogic-version>10.0</wls:weblogic-version>
    <wls:context-root>encoder-test</wls:context-root>
    <wls:charset-params>
        <wls:input-charset>
            <wls:resource-path>/encoder</wls:resource-path>
            <wls:java-charset-name>UTF-8</wls:java-charset-name>
        </wls:input-charset>
        <wls:input-charset>
            <wls:resource-path>/other</wls:resource-path>
            <wls:java-charset-name>iso-8859-1</wls:java-charset-name>
        </wls:input-charset>
    </wls:charset-params>
</wls:weblogic-web-app>
This way, when we pass the URL "/encoder-test/encoder?encoded=caff%C3%A8", the engine will parse it properly and the Java String returned by request.getParameter("encoded") will be "caffè".
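The effect of UTF-8 decoding can be imitated in JavaScript with URLSearchParams, which parses query strings as UTF-8. It is used here only as a stand-in for a container configured for UTF-8, not as a model of WebLogic itself; the second line shows that a raw "+" still comes back as a space.

```javascript
// Parse the query string from the example URL as UTF-8.
const accented = new URLSearchParams("encoded=caff%C3%A8").get("encoded");

// The accent is now fine, but an unencoded "+" is still read as a space.
const withPlus = new URLSearchParams("encoded=caff%C3%A8+zuccherato").get("encoded");

console.log(accented); // caffè
console.log(withPlus); // caffè zuccherato
```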
But is this approach worth using?
Not in my opinion. Primarily because we come to depend on a non-standard feature of the container (something I accept only when there is no alternative), and also because we still have the problem with the "+", which is parsed as a space.
There may be some use cases where it is convenient, for example if for some obscure reason we cannot make the request with a form, so it is good to know that it exists.
The Maven project I used for the tests is attached.