Unescaping UTF-8 Strings in Ruby 1.9
Today Radar and I encountered a little issue with String encodings in Ruby 1.9. In this project, some Merb UTF-8 params needed to be unescaped. We spent some time trying to force the strings into UTF-8 encoding, but for some reason, while the encoding was reported as UTF-8, the actual contents of the strings were getting massacred.
Long story short:
# In Ruby 1.9
require 'uri'
require 'cgi'

# Broken
puts URI::unescape("Baden-W%C3%BCrttemberg") # => "Baden-Württemberg"

# Working
puts CGI::unescape("Baden-W%C3%BCrttemberg") # => "Baden-Württemberg"
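Printing a string does not tell you much about how it is tagged, so here is a minimal sketch of how the result of unescaping could be checked. The helper name `unescape_utf8` is hypothetical (it is not part of Merb or the standard library), and the sketch assumes the incoming value really is percent-encoded UTF-8. It uses only `CGI.unescape`, because `URI.unescape` was removed in Ruby 3.0.

```ruby
require 'cgi'

# Hypothetical helper: unescape a percent-encoded value, make sure the result
# is tagged as UTF-8, and reject it if the bytes are not valid UTF-8.
def unescape_utf8(value)
  text = CGI.unescape(value)
  # Retag (without converting any bytes) if the result came back with
  # another encoding label.
  text = text.dup.force_encoding(Encoding::UTF_8) unless text.encoding == Encoding::UTF_8
  raise ArgumentError, "not valid UTF-8: #{value.inspect}" unless text.valid_encoding?
  text
end

name = unescape_utf8("Baden-W%C3%BCrttemberg")
puts name                   # the unescaped text
puts name.encoding          # the encoding tag attached to the string
puts name.valid_encoding?   # whether the bytes match that tag
```

Checking `valid_encoding?` matters because `force_encoding` only changes the label on the string. It never converts or validates the underlying bytes.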
Another lesson we learnt on our adventures today is the following:
# In Ruby 1.9
m1 = "Munich"
puts m1.encoding # let's suppose it is something other than UTF-8, such as ASCII-8BIT
m2 = "München"
puts m2.encoding # let's suppose we have a UTF-8 encoded string

# The following will raise an exception about mismatched encodings
m3 = m1 << m2

# The easiest way to get around this is to do the concatenation like so:
m3 = m1 << m2.force_encoding("UTF-8")
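One caveat worth sketching: as far as I can tell, Ruby only raises on concatenation when both strings contain non-ASCII bytes and carry different encoding tags. A string holding nothing but ASCII characters is compatible with any ASCII-compatible encoding. The variable names below are made up for illustration, and the strings are built from escapes so the example does not depend on the encoding of the source file.

```ruby
utf8   = "M\u00FCnchen"                                          # tagged UTF-8
binary = "M\xC3\xBCnchen".dup.force_encoding(Encoding::BINARY)   # same bytes, tagged ASCII-8BIT
ascii  = "Munich".dup.force_encoding(Encoding::BINARY)           # ASCII-only, tagged ASCII-8BIT

# Encoding.compatible? returns the encoding a concatenation would have,
# or nil when the two strings cannot be combined.
puts Encoding.compatible?(ascii, utf8).inspect
puts Encoding.compatible?(binary, utf8).inspect

begin
  binary.dup << utf8
rescue Encoding::CompatibilityError => e
  puts "concatenation failed: #{e.message}"
end

# Retagging the mislabelled string as UTF-8 makes the concatenation work.
# This is only safe because we know its bytes really are UTF-8.
fixed = binary.dup.force_encoding(Encoding::UTF_8) << utf8
puts fixed.encoding
puts fixed.valid_encoding?
```

Note that the string to retag is the one carrying the wrong label. Calling `force_encoding("UTF-8")` on a string that is already tagged UTF-8 changes nothing.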
Comments
-
http://yob.id.au says:
Encodings on 1.9 are awesome powerful, but they are guaranteed to get in your way unless *every single* library in your environment is encoding aware.
As an example, in a Rails 2.3 app, strings coming from the MySQL driver, Rack and ERB templates are generally encoded as ASCII-8BIT. If your source files are encoded as utf-8 (and most are), hello exceptions in previously working code.
nasty++
-
http://yob.id.au says:
Also, unescaping params from HTTP GET/POST requests is tricky.
There are no recommendations in the HTTP spec for GET or POST requests to contain the encoding of the request in the headers.
In the case of a URL encoded string like 'Baden-W%C3%BCrttemberg', how do you know to treat the decoded string as utf-8?
-
Bodaniel Jeanes says:
Yes this is tricky. I hear you are somewhat experienced in this area. Are there any tricks for trying to determine an encoding? If so, I would imagine that's what strings would do anyway.
Wouldn't UTF-8 be the safest bet for most sites, though?