How to Detect Character Encoding in Text Files Using Java, Apache Tika, and ICU4J

This guide will explore the importance of character encoding, common encoding types, and how to leverage Java’s capabilities to identify and work with the correct encoding for text files. Whether you’re developing a web application, a data processing tool, or any system that interacts with text data, mastering character encoding detection will enhance your ability to manage and manipulate text files reliably and accurately.

Scenario

A client uploads text files in formats like CSV, XML, or JSON to a Java web application, but the content contains special (non-ASCII) characters.

Recommendation

The developer should agree with the client on a single encoding for the text files. If the client needs to upload files in multiple encodings, however, the application must detect the correct encoding to prevent unexpected characters from appearing in the file’s content.

Automatic Encoding Detection

Java offers several libraries and techniques for automatically detecting the character encoding of text data. One popular approach is to use libraries such as Apache Tika or ICU4J, which provide robust APIs for encoding detection. These libraries analyze the byte patterns of text data and apply heuristics to determine the most likely encoding.
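To make “analyzing byte patterns” concrete, here is a minimal sketch using only the JDK: it checks the first bytes of a file for a known byte-order mark (BOM). This only covers the easiest case; real detectors like Tika and ICU4J apply statistical heuristics when no BOM is present. The class and method names here are illustrative, not part of any library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

public class BomSniffer {

    // A minimal byte-pattern check: look at the leading bytes for a
    // byte-order mark. Returns empty when no BOM is found, in which
    // case a heuristic detector (Tika, ICU4J) is needed.
    static Optional<Charset> fromBom(byte[] bytes) {
        if (bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFF && (bytes[1] & 0xFF) == 0xFE) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty(); // no BOM: fall back to heuristic detection
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(fromBom(utf8WithBom)); // prints Optional[UTF-8]
    }
}
```

Note that many UTF-8 files carry no BOM at all, which is exactly why the heuristic libraries below exist.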

Apache Tika

Apache Tika is a powerful toolkit for content analysis and detection, including character encoding detection. It provides encoding-detector classes that can automatically determine the encoding of text data. Developers can use these classes to analyze text streams and accurately determine the encoding.

ICU4J

ICU4J (International Components for Unicode for Java) is another comprehensive library for Unicode and globalization support in Java applications. It offers encoding detection capabilities through its CharsetDetector class. ICU4J uses sophisticated algorithms to analyze text data and infer the correct encoding, even for multilingual content.

Implementing ICU4J in Java Projects

1. This example demonstrates detecting character encoding with the ICU4J library.

    <!-- https://mvnrepository.com/artifact/com.ibm.icu/icu4j -->
    <dependency>
     <groupId>com.ibm.icu</groupId>
     <artifactId>icu4j</artifactId>
     <version>74.2</version>
    </dependency>
    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Arrays;
    
    public class CharsetDetectionExample {
    
      public static void main(String[] args) throws IOException {
          String filePath = args[0]; // path of the text file to analyze
          CharsetDetector detector = new CharsetDetector();
          // Read all bytes from the file into a byte array
          byte[] fileBytes = Files.readAllBytes(Path.of(filePath));
          detector.setText(fileBytes);
    
          // Best single match
          CharsetMatch charsetMatch = detector.detect();
          System.out.println("charsetMatch=" + charsetMatch.getName());
          // Every candidate encoding with its confidence score
          Arrays.stream(detector.detectAll())
                .forEach(match -> System.out.println(match.getName() + " - Confidence: " + match.getConfidence()));
      }
    
    }

    2. Example text file encoded with Shift_JIS:

    おはよう。(おはようございます)

    3. After executing CharsetDetectionExample, the console shows:

    charsetMatch=Shift_JIS
    Shift_JIS - Confidence: 100
    windows-1252 - Confidence: 18
    windows-1250 - Confidence: 17
    Big5 - Confidence: 10
    GB18030 - Confidence: 10
    UTF-16LE - Confidence: 10
    UTF-16BE - Confidence: 10
    ISO-8859-5 - Confidence: 9

    4. The result shows the best charset match and the confidence score for each candidate encoding.
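Since lower-confidence matches can be wrong, a practical pattern is to accept the detected charset only above a confidence threshold and otherwise fall back to an agreed default. The sketch below assumes ICU4J is on the classpath; the class name, helper method, and the threshold value of 50 are illustrative choices, not ICU4J conventions.

```java
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetWithFallback {

    // Returns the detected charset only if ICU4J's confidence score
    // (0-100) meets the threshold; otherwise falls back to UTF-8.
    static Charset detectOrDefault(byte[] data, int minConfidence) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);
        CharsetMatch match = detector.detect();
        if (match != null && match.getConfidence() >= minConfidence) {
            return Charset.forName(match.getName());
        }
        return StandardCharsets.UTF_8; // agreed default encoding
    }

    public static void main(String[] args) {
        byte[] sample = "おはよう".getBytes(StandardCharsets.UTF_8);
        System.out.println(detectOrDefault(sample, 50));
    }
}
```

The right threshold depends on your data; measure it against representative client files before relying on it.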

    Implementing Apache Tika in Java Projects

    1. This example demonstrates detecting character encoding with the Apache Tika library.

      <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-core -->
      <dependency>
       <groupId>org.apache.tika</groupId>
       <artifactId>tika-core</artifactId>
       <version>2.9.1</version>
      </dependency>
      <dependency>
       <groupId>org.apache.tika</groupId>
       <artifactId>tika-parsers-standard-package</artifactId>
       <version>2.9.1</version>
      </dependency>
      import org.apache.tika.detect.EncodingDetector;
      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.txt.UniversalEncodingDetector;
      import java.io.*;
      import java.nio.charset.Charset;
      import java.nio.file.Files;
      import java.nio.file.Path;
      
      public class CharsetDetectionExample {
      
        public static void main(String[] args) throws IOException {
            String filePath = args[0]; // path of the text file to analyze
            EncodingDetector encodingDetector = new UniversalEncodingDetector();
            byte[] fileBytes = Files.readAllBytes(Path.of(filePath));
            // Create a ByteArrayInputStream from the byte array
            ByteArrayInputStream inputStream = new ByteArrayInputStream(fileBytes);
            Charset detectedCharset = encodingDetector.detect(inputStream, new Metadata());
            System.out.println("detectedCharset=" + detectedCharset.displayName());
        }
      
      }

      2. Example text file encoded with Shift_JIS:

      おはよう。(おはようございます)

      3. After executing CharsetDetectionExample, the console shows:

      detectedCharset=Shift_JIS

      4. The result shows the detected charset.
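Detection is usually only half the job: the detected charset is then used to decode the bytes into a readable string. The following sketch combines the two steps, assuming tika-core and tika-parsers-standard-package are on the classpath; the class name, helper method, and UTF-8 fallback are illustrative choices.

```java
import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.txt.UniversalEncodingDetector;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DetectThenRead {

    // Detects the charset of the raw bytes, then decodes them with it.
    // Falls back to UTF-8 when the detector cannot decide (returns null).
    static String decode(byte[] fileBytes) throws IOException {
        EncodingDetector detector = new UniversalEncodingDetector();
        // ByteArrayInputStream supports mark/reset, which the detector needs
        Charset charset = detector.detect(new ByteArrayInputStream(fileBytes), new Metadata());
        if (charset == null) {
            charset = StandardCharsets.UTF_8;
        }
        return new String(fileBytes, charset);
    }
}
```

Decoding from the same in-memory byte array avoids reading the file twice, once for detection and once for the actual read.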

      Troubleshooting

      What happens when the developer uses the wrong or default encoding to read a text file?

      public void readTextFile(String filePath){
         try (FileInputStream fis = new FileInputStream(filePath);
               // No charset given: the JVM's default encoding is used
               InputStreamReader isr = new InputStreamReader(fis);
               BufferedReader reader = new BufferedReader(isr)) {
      
              String line;
              while ((line = reader.readLine()) != null) {
                  System.out.println(line);
              }
          } catch (IOException e) {
              e.printStackTrace();
          }
      }
      ���͂悤�B�i���͂悤�������܂�)

      After executing the “readTextFile” method, the console shows the output above: the text is unreadable. If the developer creates the “InputStreamReader” with an explicit charset parameter and executes it again:

      public void readTextFile(String filePath){
          Charset charset = Charset.forName("Shift_JIS"); // Specify the charset encoding here
          try (FileInputStream fis = new FileInputStream(filePath);
               InputStreamReader isr = new InputStreamReader(fis, charset);
               BufferedReader reader = new BufferedReader(isr)) {
      
              String line;
              while ((line = reader.readLine()) != null) {
                  System.out.println(line);
              }
          } catch (IOException e) {
              e.printStackTrace();
          }
      }
      おはよう。(おはようございます)

      After executing the “readTextFile” method again, the console shows readable text.

      Conclusion

      Automatic character-encoding detection isn’t 100% accurate, so treat it as an optional validation step. The best approach is still to agree with the client on which encoding to use for the text files, preventing unrecognized characters or unreadable portions of the file’s content.
