CharsetDetector - Tutorial
This short tutorial walks you through the interface of the CharsetDetector component. If you just want to use the component in the simplest way, read only stage 1.
Stage 1 - Simple interface
Imagine you have three text files that use different charsets. You need their text in UTF-8, but you don't know the original charset of each file. You can use the static method file_get_contents of the class CharsetDetector, which is the equivalent of the PHP function file_get_contents. Consider the following example:
    $text1 = CharsetDetector::file_get_contents('file1.iso8859-2.txt');
    $text2 = CharsetDetector::file_get_contents('file1.cp1250.txt');
    $text3 = CharsetDetector::file_get_contents('file1.utf-8.txt');
If you just need to convert a string of unknown charset to UTF-8, use the static method convert of the class CharsetDetector:
$stringUTF8 = CharsetDetector::convert($string);
Once you have a UTF-8 string, you can easily convert it to another charset using the iconv PHP extension:
$stringCP1250 = iconv('UTF-8', 'CP1250', $stringUTF8);
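The iconv call above can fail when the text contains a character that the target charset cannot represent. The following is a minimal sketch of one way to handle that, not part of CharsetDetector: the helper name convertFromUtf8 and the sample sentence are assumptions made for illustration, and the source file itself is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper, not part of CharsetDetector.
function convertFromUtf8(string $utf8, string $targetCharset): string
{
    // iconv() typically returns false (and raises a notice) when a character
    // cannot be represented in the target charset. The @ operator silences
    // the notice so the failure can be reported as an exception instead.
    $converted = @iconv('UTF-8', $targetCharset, $utf8);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert text to $targetCharset");
    }
    return $converted;
}

// Sample Czech sentence; every character in it exists in CP1250.
$stringUTF8 = "Příliš žluťoučký kůň";
$stringCP1250 = convertFromUtf8($stringUTF8, 'CP1250');

// Converting back to UTF-8 is a cheap sanity check of the round trip.
var_dump(iconv('CP1250', 'UTF-8', $stringCP1250) === $stringUTF8);
```

Throwing an exception is only one option; returning the original string or appending //TRANSLIT to the target charset name are alternatives, depending on how much loss is acceptable.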
This was the most common usage of the component. If it is not enough for you, look at stage 2.
Stage 2 - Advanced component usage
To change the component's behavior, you need to create an instance of the class CharsetDetector and use it as an object. First you give it a piece of text to analyze. Then you can transform the text using the convertIfRelevant method:
    $detector = new CharsetDetector;
    $detector->analyze($string);
    $stringUTF8 = $detector->convertIfRelevant($string);
Feel free to call these methods repeatedly. If you experience a bad charset match (probably on damaged text files or the like), you can try adjusting the relevance threshold:
$detector->setMinRelevance(0.2);
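This tutorial does not show how the component applies the threshold internally. As a rough illustration of what a relevance threshold generally means, the sketch below accepts the best-scoring charset only when its weight reaches the minimum. The function pickCharset, the plain array of weights and the score values are all hypothetical and are not taken from CharsetDetector.

```php
<?php
// Hypothetical helper, not part of CharsetDetector: returns the charset with
// the highest weight, or null when even the best weight is below the threshold.
function pickCharset(array $weights, float $minRelevance): ?string
{
    if ($weights === []) {
        return null;
    }
    arsort($weights);                    // highest weight first, keys preserved
    $best = array_key_first($weights);   // requires PHP 7.3 or newer
    return $weights[$best] >= $minRelevance ? $best : null;
}

// Made-up scores, for illustration only.
$weights = ['ISO8859-2' => 0.15, 'CP1250' => 0.35];

var_dump(pickCharset($weights, 0.2));    // CP1250: 0.35 reaches the threshold
var_dump(pickCharset($weights, 0.5));    // null: no charset is relevant enough
```

Under this reading, lowering the threshold makes a match more likely on damaged input, at the cost of more false matches.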
Using the setTargetCharset method of CharsetDetector, you can change the target charset. But this is only useful with extensions, so go to stage 3.
Stage 3 - Extensions
CharsetDetector is highly extensible. You can add your own charsets to its configuration or replace the built-in charsets. To use extensions, CharsetDetector provides a parameterized constructor. It expects an initialized instance of the CharsetStreamAnalyzer class. To create such an instance, use CharsetStreamAnalyzerFactory:
    // Use default charsets as base
    $analyzer = CharsetStreamAnalyzerFactory::createDefault();
    // Define data for charset
    $charWeightMap = Array();
    // ...
    // Add charset as extension
    $analyzer->addWeightMap('ISO-XXXX', $charWeightMap);
    $detector = new CharsetDetector($analyzer);
    // ...
For languages other than Czech, it is a good idea to begin with an empty analyzer instance. You can get such an instance using the createEmpty method of CharsetStreamAnalyzerFactory:
    // Create clear analyzer instance
    $analyzer = CharsetStreamAnalyzerFactory::createEmpty();
    // Define data for charsets
    // ...
    // Add charsets
    $analyzer->addWeightMap('ISO-XXXX', $charWeightMap1);
    $analyzer->addWeightMap('ISO-XXXX', $charWeightMap2);
    $detector = new CharsetDetector($analyzer);
Before writing your own extensions, look at the implementation of the CharsetStreamAnalyzerFactory::createDefault method. It contains the charset definitions for ISO8859-2 and CP1250.
Stage 4 - Working with analysis results
The result of text analysis is a set of weights for the individual charsets. You can let the user decide which charset is the right one, and you can advise them using these weights. You can read the weights using the getCharsetWeightMap method of the CharsetStreamAnalyzer class:
    $weightMap = $analyzer->getCharsetWeightMap();
    $iterator = $weightMap->createSortedIterator();
    while (null != ($item = $iterator->next())) {
        // Process result's item...
    }
To understand this code, consider the following class diagram: