Latin1 encoded tables or databases with UTF8 data stored: convert them to native mysql UTF8 Tables/Database

Submitted by fabio on Thu, 2007-08-02 01:42.

In the years before MySQL 4.1, the first MySQL version which provided a complete and good support to different character encodings, developers who needed their databases to store UTF8 encoded data used different encodings (usually latin1) for table and databases while still storing real UTF8 data inside.

While the DB and tables were, as an example latin1 encoded, inside there were actually UTF8 encoded data.

So.. I had to play with a latin1 encoded database which actually contained UTF8 encoded data inside. As I was converting the data stored in the database to Drupal, I had to convert it to a fully UTF8 based database.

The simple way of simply changing the table/db encoding using an ALTER TABLE t MODIFY col1 CHAR(50) CHARACTER SET utf8; will not work. MySQL, expecting latin1 data, will convert the data to UTF8. But, as the data is already UTF8, this will generate lot of strange characters (especially on accented letters).

Fixing this one wasn't easy .. and I had to play a bit with Google, the MySQL docs and my mysql client.

Finally I've been able to do it.. The trick is not to convert directly to UTF8. The first step is to convert to a BLOB (or BINARY), then convert the BLOB field to UTF8.
Using this approach mysql will not automatically convert between character encodings and your stored data will not be changed.

Following some php/mysql code which could help you if you need to convert a DB/Tables using the above approach.

/**
 * Converts fields contained into $fields array on the table $table on the db $dbname to blob than to utf8.
 * This is the procedure to follow when convert the field encoding when it's different than utf8 (eg latin1) while it already contains utf8 data inside
*/
function convert_to_utf8($dbname, $tables_fields) {

  foreach($tables_fields as $table => $fields) {
    foreach($fields as $field) {
      // convert each field to BLOB
      myq('ALTER TABLE ' . $dbname . '.' . $table . ' MODIFY ' . $field . ' BLOB');
    }
  }
  myq("ALTER DATABASE $dbname charset=utf8");
  
  foreach($tables_fields as $table => $fields) {
    myq('ALTER TABLE ' . $dbname . '.' . $table . ' charset utf8');
  }
  
  foreach($tables_fields as $table => $fields) {
    foreach($fields as $field) {
      // convert each field to utf8
      myq('ALTER TABLE ' . $dbname . '.' . $table . ' MODIFY ' . $field . ' TEXT CHARACTER SET utf8');
    }
  }
  
  print "converted to utf8 db" . $dbname ."\n";
}

function myq($query) {
  $result = mysql_query($query) or die("error: " . mysql_error() . "\nQuery: " . $query);
  return $result;
}

The above function can be invoked using a syntax similar to the following:

$table_fields = array(
  'blog_entry' => array('body_html', 'title'),
  'blog_entry_comment' => array('comment', 'name'),
  'media_file' => array('comment', 'name'),
  'media_file_comment' => array('comment', 'name'),
  'media_folder' => array('comment', 'name'),
);

convert_to_utf8('database_name', $table_fields);

Based on:
MySQL Documentation: Column Character Set Conversion
Worpress Documentation: Converting Database Character Sets

UTF convertion does not happen

Submitted by Ratheesh (not verified) on Sat, 2008-05-31 13:27.

Hello,
Require your help pls.

I have a table with a column of charset latin1.
I purposely stored utf data into the column.

As expected the data is visible as ????????? due to the latin1 encoding

Now as per your article I converted the column from varchar to varbinary and then back to varchar with utf8 encoding.

But still the data in the column is displayed as ?????????
I expected it to be displayed properly as the charset has been changed now.

Whats the solution for this?

Thanks

well.. If you followed the

Submitted by fabio on Sun, 2008-06-01 19:01.

well.. If you followed the instruction correctly the data should be displayed correctly now. Check that the configuration you have is actually the same I had and that you made each step correctly.

Also check your db connection encoding.

Thanks!

Submitted by Andreas (not verified) on Thu, 2007-08-02 14:09.

Thanks for this informations! Really helpful!

Post new comment

The content of this field is kept private and will not be shown publicly.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Allowed HTML tags: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd> <pre> <img> <h2> <h3> <h4> <b>
  • Lines and paragraphs break automatically.
  • Images can be added to this post.
  • You may use [inline:xx] tags to display uploaded files or images inline.

More information about formatting options

2 + 13 =
Solve this simple math problem and enter the result. E.g. for 1+3, enter 4.