Tech-Tidbit: HTML Entity to Decimal Translation

This week I needed a way to convert HTML entities into their decimal equivalents. I’m not very fond of dealing with character encoding and entity issues, but I was finally able to get things working. In the end we modified the data in the database instead, which meant I never got to use this script, but it was still a very nice find.

The nice thing about this approach is that it leaves any existing decimal-format characters alone and translates named HTML entities into the decimal format.

After searching the net for a while I found a piece of code that seemed like a good solution. The original needed a bit of cleanup and tweaking, but here it is:

[php]
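function htmlentities2unicodeentities ($input) {
    // Ask for the ISO-8859-1 table explicitly so every decoded character is a
    // single byte and ord() returns its code point (PHP 5.4+ defaults to UTF-8).
    $table = get_html_translation_table (HTML_ENTITIES, ENT_QUOTES, 'ISO-8859-1');
    $htmlEntities = array_values ($table);   // named entities, e.g. &oacute;
    $entitiesDecoded = array_keys ($table);  // the characters they stand for
    $utf8Entities = array();
    $num = count ($entitiesDecoded);
    for ($u = 0; $u < $num; $u++) {
        // Build the decimal reference for each character, e.g. &#243;
        $utf8Entities[$u] = '&#'.ord($entitiesDecoded[$u]).';';
    }
    return str_replace ($htmlEntities, $utf8Entities, $input);
}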
[/php]

You can test the translation with this:

[php]
// One string already in decimal format, one using a named entity.
$test_strs = array('K&#363;lob', 'K&oacute;pavogur');

foreach ($test_strs as $test_str) {
    print 'ORIGINAL: ' . $test_str . "\n";
    print 'TRANSLATED: ' . htmlentities2unicodeentities($test_str) . "\n\n";
}
[/php]
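
Because the function relies on ord(), it can only give correct values for characters that fit in a single byte. If you need named entities outside that range, here is a rough sketch of an alternative. It is not part of the code I found: the name named_entities_to_decimal is made up for illustration, and it assumes PHP 7.2 or later with the mbstring extension (for mb_ord) and the HTML5 entity table.

[php]
// Hypothetical helper: convert every named entity that html_entity_decode()
// recognises into decimal references, leaving everything else untouched.
function named_entities_to_decimal ($input) {
    return preg_replace_callback ('/&[A-Za-z][A-Za-z0-9]*;/', function ($m) {
        $decoded = html_entity_decode ($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
        if ($decoded === $m[0]) {
            return $m[0]; // unknown entity name: leave it as it is
        }
        $out = '';
        // A few HTML5 entities decode to more than one code point.
        foreach (preg_split ('//u', $decoded, -1, PREG_SPLIT_NO_EMPTY) as $char) {
            $out .= '&#' . mb_ord ($char, 'UTF-8') . ';';
        }
        return $out;
    }, $input);
}

print named_entities_to_decimal ('K&umacr;lob, K&oacute;pavogur, K&#363;lob') . "\n";
// K&#363;lob, K&#243;pavogur, K&#363;lob
[/php]

The pattern only matches names that start with a letter, so existing decimal references never reach the callback.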

If you’re interested in converting things the other way around, good luck. I’ve been searching around for a while and I just can’t seem to find anything that works well enough without using some long array to translate things. Let me know if you find one!
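
For what it’s worth, here is one untested-in-production idea for the reverse direction that avoids a hand-written array by letting htmlentities() do the lookup. Treat it as a sketch only: decimal_to_named_entities is a made-up name, it assumes PHP 7.2 or later with mbstring (for mb_chr), and it only knows the names in PHP’s HTML 4.01 table, so anything without a name there stays in decimal format.

[php]
// Hypothetical helper: turn decimal references back into named entities
// wherever PHP's HTML 4.01 table has a name for the character.
function decimal_to_named_entities ($input) {
    return preg_replace_callback ('/&#(\d+);/', function ($m) {
        $char = mb_chr ((int) $m[1], 'UTF-8');
        if ($char === false) {
            return $m[0]; // not a valid code point
        }
        $named = htmlentities ($char, ENT_QUOTES | ENT_HTML401, 'UTF-8');
        // Keep the decimal form unless a named entity came back.
        return preg_match ('/^&[A-Za-z]/', $named) ? $named : $m[0];
    }, $input);
}

print decimal_to_named_entities ('K&#243;pavogur, K&#363;lob') . "\n";
// K&oacute;pavogur, K&#363;lob
[/php]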