Latin1 encoded tables or databases with UTF8 data stored: convert them to native mysql UTF8 Tables/Database

In the years before MySQL 4.1, the first MySQL version which provided a complete and good support to different character encodings, developers who needed their databases to store UTF8 encoded data used different encodings (usually latin1) for table and databases while still storing real UTF8 data inside.

While the DB and tables were, as an example latin1 encoded, inside there were actually UTF8 encoded data.

So.. I had to play with a latin1 encoded database which actually contained UTF8 encoded data inside. As I was converting the data stored in the database to Drupal, I had to convert it to a fully UTF8 based database.

The simple way of simply changing the table/db encoding using an ALTER TABLE t MODIFY col1 CHAR(50) CHARACTER SET utf8; will not work. MySQL, expecting latin1 data, will convert the data to UTF8. But, as the data is already UTF8, this will generate lot of strange characters (especially on accented letters).

Fixing this one wasn’t easy .. and I had to play a bit with Google, the MySQL docs and my mysql client.

Finally I’ve been able to do it.. The trick is not to convert directly to UTF8. The first step is to convert to a BLOB (or BINARY), then convert the BLOB field to UTF8.
Using this approach mysql will not automatically convert between character encodings and your stored data will not be changed.

Following some php/mysql code which could help you if you need to convert a DB/Tables using the above approach.

/**
 * Converts fields contained into $fields array on the table $table on the db $dbname to blob than to utf8.
 * This is the procedure to follow when convert the field encoding when it's different than utf8 (eg latin1) while it already contains utf8 data inside
*/
function convert_to_utf8($dbname, $tables_fields) {

  foreach($tables_fields as $table => $fields) {
    foreach($fields as $field) {
      // convert each field to BLOB
      myq('ALTER TABLE ' . $dbname . '.' . $table . ' MODIFY ' . $field . ' BLOB');
    }
  }
  myq("ALTER DATABASE $dbname charset=utf8");
  
  foreach($tables_fields as $table => $fields) {
    myq('ALTER TABLE ' . $dbname . '.' . $table . ' charset utf8');
  }
  
  foreach($tables_fields as $table => $fields) {
    foreach($fields as $field) {
      // convert each field to utf8
      myq('ALTER TABLE ' . $dbname . '.' . $table . ' MODIFY ' . $field . ' TEXT CHARACTER SET utf8');
    }
  }
  
  print "converted to utf8 db" . $dbname ."\n";
}

function myq($query) {
  $result = mysql_query($query) or die("error: " . mysql_error() . "\nQuery: " . $query);
  return $result;
}

The above function can be invoked using a syntax similar to the following:

$table_fields = array(
  'blog_entry' => array('body_html', 'title'),
  'blog_entry_comment' => array('comment', 'name'),
  'media_file' => array('comment', 'name'),
  'media_file_comment' => array('comment', 'name'),
  'media_folder' => array('comment', 'name'),
);

convert_to_utf8('database_name', $table_fields);

Based on:
MySQL Documentation: Column Character Set Conversion
Worpress Documentation: Converting Database Character Sets

Scroll to Top