Kamil Dudka

CharsetDetector - Tutorial

This short tutorial walks you through the interface of the CharsetDetector component. If you only want to use the component in the simplest way, read just Stage 1.

Stage 1 - Simple interface

Imagine you have three text files that use different charsets. You need their text in UTF-8, but you don't know their original charsets. You can use the static method file_get_contents of the CharsetDetector class, which is the equivalent of the PHP function file_get_contents. Consider the following example:

$text1 = CharsetDetector::file_get_contents('file1.iso8859-2.txt');
$text2 = CharsetDetector::file_get_contents('file1.cp1250.txt');
$text3 = CharsetDetector::file_get_contents('file1.utf-8.txt');

If you just need to convert a string of unknown charset to UTF-8, use the static method convert of the CharsetDetector class:

$stringUTF8 = CharsetDetector::convert($string);

Once you have a UTF-8 string, you can simply convert it to another charset using the iconv PHP extension:

$stringCP1250 = iconv('UTF-8', 'CP1250', $stringUTF8);
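
If the UTF-8 text contains characters that the target charset cannot represent, iconv may fail instead of returning a string. The following is a minimal sketch of checking for that. The helper toCharset is hypothetical (it is not part of CharsetDetector), and the exact failure behavior depends on the underlying iconv implementation:

```php
<?php
// Hypothetical helper for illustration; not part of CharsetDetector.
function toCharset(string $utf8, string $target): string
{
    // iconv() returns false on failure, for example when a character
    // has no representation in the target charset (implementation dependent).
    // The @ operator only silences the accompanying notice.
    $converted = @iconv('UTF-8', $target, $utf8);
    if ($converted === false) {
        throw new RuntimeException("Cannot represent the text in $target");
    }
    return $converted;
}

// Czech sample text; every character exists in CP1250.
$stringUTF8 = 'Příliš žluťoučký kůň';
$stringCP1250 = toCharset($stringUTF8, 'CP1250');

// Converting back should reproduce the original UTF-8 string.
$roundTrip = iconv('CP1250', 'UTF-8', $stringCP1250);
```

CP1250 is a single-byte charset, so the converted string has one byte per character, and the round trip is a cheap sanity check that nothing was lost.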

This was the most common usage of the component. If it is not enough for you, look at Stage 2.

Stage 2 - Advanced component usage

To change the component's behavior, you need to create an instance of the CharsetDetector class and use it as an object. First you give it a piece of text to analyze. Then you can transform the text using the convertIfRelevant method:

$detector = new CharsetDetector;
$detector->analyze($string);
$stringUTF8 = $detector->convertIfRelevant($string);

Feel free to call these methods repeatedly. If you experience a bad charset match (probably on damaged text files or the like), you can try adjusting the relevance threshold:

$detector->setMinRelevance(0.2);
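
To give an intuition for what such a threshold can do, here is a purely illustrative sketch. It is not the component's actual algorithm: the function pickCharset, the shape of the weights array, and the comparison rule are all assumptions made for this example. The idea is that a charset is accepted only if its score reaches the minimum relevance; otherwise no charset is chosen.

```php
<?php
// Hypothetical illustration only; NOT CharsetDetector's real implementation.
// $weights maps charset names to scores, $minRelevance is the threshold.
function pickCharset(array $weights, float $minRelevance): ?string
{
    if (count($weights) === 0) {
        return null; // nothing was analyzed
    }
    // Find the highest score and the charset it belongs to.
    $bestWeight = max($weights);
    $best = array_search($bestWeight, $weights, true);

    // Accept the best candidate only if it is relevant enough.
    return $bestWeight >= $minRelevance ? $best : null;
}

$clear = pickCharset(array('CP1250' => 0.6, 'ISO-8859-2' => 0.3), 0.2);
$unclear = pickCharset(array('CP1250' => 0.1, 'ISO-8859-2' => 0.05), 0.2);
```

Under these assumptions, lowering the threshold makes the sketch accept weaker matches, while raising it makes it leave doubtful text alone.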

Using the setTargetCharset method of CharsetDetector, you can change the target charset. However, this is only useful with extensions, so go on to Stage 3.

Stage 3 - Extensions

CharsetDetector is highly extensible. You can add your own charsets to its configuration or replace the built-in charsets. To use extensions, CharsetDetector provides a parameterized constructor. It expects an initialized instance of the CharsetStreamAnalyzer class. To create such an instance, use CharsetStreamAnalyzerFactory:

// Use default charsets as base
$analyzer = CharsetStreamAnalyzerFactory::createDefault();
 
// Define data for charset
$charWeightMap = array();
// ...
 
// Add charset as extension
$analyzer->addWeightMap('ISO-XXXX', $charWeightMap);
 
$detector = new CharsetDetector($analyzer);
// ...

For other languages (such as Czech), it is good to begin with an empty analyzer instance. You can get such an instance using the createEmpty method of CharsetStreamAnalyzerFactory:

// Create an empty analyzer instance
$analyzer = CharsetStreamAnalyzerFactory::createEmpty();

// Define data for charsets
// ...

// Add charsets, each under its own name
$analyzer->addWeightMap('ISO-XXXX', $charWeightMap1);
$analyzer->addWeightMap('ISO-YYYY', $charWeightMap2);
 
$detector = new CharsetDetector($analyzer);

Before writing your own extensions, look at the implementation of the CharsetStreamAnalyzerFactory::createDefault method. It contains the charset definitions for ISO8859-2 and CP1250.

Stage 4 - Working with analysis results

The result of text analysis is a set of weights for the individual charsets. You can let the user decide which charset is the right one, and you can advise them using these weights. You can read them using the getCharsetWeightMap method of the CharsetStreamAnalyzer class:

$weightMap = $analyzer->getCharsetWeightMap();
$iterator = $weightMap->createSortedIterator();
while (null !== ($item = $iterator->next())) {
  // Process the result item...
}
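
The structure of the items returned by the iterator is not described here, so the following sketch uses a plain PHP array as a stand-in for the weight map. The charset names and weight values are made up for illustration. It shows one way to order the candidates and prepare a list from which the user can choose:

```php
<?php
// Made-up weights; in real use they would come from the analyzer.
$weights = array('ISO-8859-2' => 0.35, 'CP1250' => 0.6, 'UTF-8' => 0.05);

// Sort by weight, highest first, keeping the charset names as keys.
arsort($weights);

// The first key after sorting is the most likely charset.
reset($weights);
$best = key($weights);

// Build one display line per charset, e.g. for a selection dialog.
$lines = array();
foreach ($weights as $charset => $weight) {
    $lines[] = sprintf('%-12s %5.1f %%', $charset, $weight * 100);
}

echo implode("\n", $lines), "\n";
```

Presenting the best candidate first while still listing the others keeps the final decision with the user.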

To understand this code, consider the following class diagram:

src/files/uml/CharsetDetector.png