String encoding is something that we don't really think about until we see:

     Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT 

Or when users complain about missing special characters like "’" (an apostrophe copied from Microsoft Word), or when "菜医生" becomes "иЏњеЊ»з”џ".

Before we go into encoding problems, let's understand what encoding is.

A string can be considered as an array of bytes:

irb(main):001:0> "world".bytes
=> [119, 111, 114, 108, 100]

Here 119 means w, 111 means o, and so on. This mapping between bytes and characters is defined by the string's encoding.
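
As a small sketch of the same idea (the variable names are only illustrative), the bytes can be turned back into the string once they are tagged with an encoding, and a non-ASCII character can take more than one byte:

```ruby
# Rebuild "world" from its bytes and tag the result as UTF-8.
bytes = [119, 111, 114, 108, 100]
word  = bytes.pack("C*").force_encoding("UTF-8")

puts word            # world
puts word.encoding   # UTF-8

# In UTF-8, "é" is a single character stored as two bytes.
accent = "é"
p accent.bytes       # [195, 169]
p accent.length      # 1 (characters)
p accent.bytesize    # 2 (bytes)
```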

Let's see what happens when we change the encoding:

irb(main):001:0> str = "Café"
=> "Café"
irb(main):002:0> str.bytes
=> [67, 97, 102, 195, 169]
irb(main):003:0> str.force_encoding("windows-1251").encode("utf-8")
=> "CafГ©"
irb(main):004:0> str.bytes
=> [67, 97, 102, 195, 169]
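
The session above can be condensed into a short script that contrasts relabeling with transcoding. This is only a sketch: ISO-8859-1 is an assumption picked for illustration because it contains "é", and the variable names are made up.

```ruby
original = "Café"                         # UTF-8 string literal

# force_encoding only relabels; work on a copy so the original stays UTF-8.
relabeled = original.dup.force_encoding("windows-1251")
p relabeled.bytes == original.bytes       # true, the bytes are untouched

# encode transcodes: the characters stay the same, the bytes change.
transcoded = original.encode("ISO-8859-1")
p transcoded.bytes                        # [67, 97, 102, 233]
p transcoded.encode("UTF-8") == original  # true, the round trip is lossless
```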

Relabeling a string with force_encoding changes how its bytes are interpreted and printed, without changing the bytes themselves. Transcoding with encode is different: it raises an error when a character in one encoding doesn't exist in the other, or when Ruby can't figure out how to translate a character between the two encodings.

irb(main):001:0> str = "Café"
=> "Café"
irb(main):002:0> str.encode("windows-1251")
Encoding::UndefinedConversionError: U+00E9 to WINDOWS-1251 in conversion from UTF-8 to WINDOWS-1251
        from (irb):8:in `encode'
        from (irb):8

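To prevent the error, we can pass the extra options invalid and undef to encode. With invalid: :replace, invalid byte sequences are replaced; with undef: :replace, characters that have no equivalent in the target encoding are replaced. The replacement is a ? by default (U+FFFD when the target is a Unicode encoding), or whatever string is passed in the replace option.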

irb(main):016:0> str = "Café"
=> "Café"
irb(main):017:0> str.encode("windows-1251", invalid: :replace, undef: :replace, replace: " ")
=> "Caf "

Unfortunately, we lose information when encode replaces characters: afterwards we have no idea which characters were replaced. But losing some data can be better than having things break in the new encoding.
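
If knowing what was lost matters, one possible approach (not something our import code does) is the fallback option of encode, which is consulted for each character that has no equivalent in the target encoding. A minimal sketch, where `replaced` and `record_and_replace` are hypothetical names:

```ruby
replaced = []   # hypothetical log of characters that could not be converted

# Called once per character that has no windows-1251 equivalent.
record_and_replace = lambda do |char|
  replaced << char
  "?"           # what to write in place of the lost character
end

str = "Café"
converted = str.encode("windows-1251", fallback: record_and_replace)

p converted.bytes   # [67, 97, 102, 63]
p replaced          # ["é"]
```

The fallback only covers undefined characters; invalid byte sequences in the input would still need invalid: :replace.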

Encoding problems we faced

Recently we stumbled upon string encoding while implementing a CSV import feature. We first open the CSV file at a remote location and read it. While reading one of the CSV files we got Encoding::CompatibilityError. This error is raised when Ruby has to combine two strings whose encodings are incompatible, in our case UTF-8 and ASCII-8BIT. So we needed to encode the CSV string to UTF-8.

     require 'open-uri'
     URI.open(file_url).read.encode('UTF-8', invalid: :replace, undef: :replace, replace: ' ')
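
For context, here is one common way an Encoding::CompatibilityError shows up: joining a UTF-8 string with raw binary data that contains non-ASCII bytes. This is a sketch with made-up values, not the exact code path from our import.

```ruby
utf8_text  = "Café"              # UTF-8, contains a non-ASCII character
raw_binary = "\xE2\x80\x99".b    # ASCII-8BIT bytes, as raw input might arrive

begin
  utf8_text + raw_binary
rescue Encoding::CompatibilityError => e
  puts e.message   # names the two encodings involved
end

# Telling Ruby what the raw bytes really are removes the conflict.
puts utf8_text + raw_binary.force_encoding("UTF-8")   # Café’
```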

Because of the replace option, the apostrophe ("’") copied from Microsoft Word was being replaced by a space. To fix this, we had to first mark the string as windows-1251 with force_encoding and then encode it to UTF-8.

irb(main):001:0> str = "Ruby\x92s string encoding"
=> "Ruby\x92s string encoding" 
irb(main):002:0> str.force_encoding("windows-1251").encode("utf-8", invalid: :replace, undef: :replace,  replace: ' ' )
=> "Ruby’s string encoding"

Then we encountered Chinese characters. They were converted to weird characters by the same trick of forcing windows-1251 and encoding back to UTF-8.

irb(main):001:0> str = "菜医生"
=> "菜医生"
irb(main):002:0> str.force_encoding("windows-1251").encode("utf-8")
=> "иЏњеЊ»з”џ"

To fix this, we had no option but to replace \x92 separately using gsub and then process the CSV files. While replacing, make sure both strings (the pattern and the substitution) use the same encoding; otherwise Ruby raises an error.
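
A minimal sketch of that replacement, assuming the file content arrives as raw bytes that are otherwise valid UTF-8 (the sample data and variable names are hypothetical):

```ruby
# Hypothetical raw CSV bytes: UTF-8 text plus a stray 0x92 apostrophe.
raw = "Ruby\x92s ".b + "菜医生".b   # .b gives ASCII-8BIT (binary) copies

# Pattern and substitution are both binary, so gsub never mixes encodings.
stray_apostrophe = "\x92".b
utf8_apostrophe  = "’".b            # the three UTF-8 bytes of U+2019

fixed = raw.gsub(stray_apostrophe, utf8_apostrophe).force_encoding("UTF-8")

puts fixed                   # Ruby’s 菜医生
puts fixed.valid_encoding?   # true
```

One caveat: the byte 0x92 can also appear as a continuation byte inside a valid multi-byte UTF-8 character, so a blanket byte replacement like this is only safe when the data is known not to contain such characters.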

Fixing encoding issues is a brain-consuming task. To become comfortable with encodings, just play around with the encode and force_encoding methods in an irb console.

Do let me know if there is a better solution for fixing encoding problems.

