This guide will explore the importance of character encoding, common encoding types, and how to leverage Java’s capabilities to identify and work with the correct encoding for text files. Whether you’re developing a web application, a data processing tool, or any system that interacts with text data, mastering character encoding detection will enhance your ability to manage and manipulate text files reliably and accurately.
Scenario
A client uploads text files in formats like CSV, XML, or JSON to a Java web application, but the content contains special (non-ASCII) characters.
Recommendation
The developer should agree with the client on which encoding to use for the text files. Still, if the client needs to upload files in multiple encodings, the application must detect the correct one to prevent unexpected characters in the file’s content.
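One lightweight way to enforce such an agreement, using only the JDK, is to decode the uploaded bytes with a strict CharsetDecoder that rejects malformed input. The following is a minimal sketch; the class and method names are illustrative, not part of the article’s application:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingValidator {

    // Returns true if the bytes decode cleanly under the agreed charset.
    public static boolean isValidFor(byte[] content, Charset agreed) {
        try {
            agreed.newDecoder()
                  .onMalformedInput(CodingErrorAction.REPORT)      // fail on invalid byte sequences
                  .onUnmappableCharacter(CodingErrorAction.REPORT) // fail on unmappable characters
                  .decode(ByteBuffer.wrap(content));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8Bytes = "おはよう".getBytes(StandardCharsets.UTF_8);
        // UTF-8 bytes decode cleanly as UTF-8 but not as US-ASCII
        System.out.println(isValidFor(utf8Bytes, StandardCharsets.UTF_8));   // true
        System.out.println(isValidFor(utf8Bytes, StandardCharsets.US_ASCII)); // false
    }
}
```

Rejecting an upload that fails this check is cheaper and more deterministic than heuristic detection, which is why agreeing on an encoding first is the recommended path.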
Automatic Encoding Detection
Java offers several libraries and techniques for automatically detecting the character encoding of text data. One popular approach is to use libraries such as Apache Tika or ICU4J, which provide robust APIs for encoding detection. These libraries analyze the byte patterns of text data and apply heuristics to determine the most likely encoding.
Apache Tika
Apache Tika is a powerful toolkit for content analysis and detection, including character encoding detection. It provides a CharsetDetector class that can automatically detect the encoding of text data. Developers can use this class to analyze text streams and accurately determine the encoding.
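As a sketch, assuming the tika-parsers-standard-package artifact is on the classpath (which provides org.apache.tika.parser.txt.CharsetDetector, Tika’s repackaging of the ICU detector), detection on an in-memory byte array looks like this; the Shift_JIS bytes are hard-coded only to keep the demo self-contained:

```java
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

public class TikaCharsetSketch {
    public static void main(String[] args) {
        // Shift_JIS bytes for "おはよう" (hard-coded for a self-contained demo)
        byte[] sjisBytes = {(byte) 0x82, (byte) 0xA8, (byte) 0x82, (byte) 0xCD,
                            (byte) 0x82, (byte) 0xE6, (byte) 0x82, (byte) 0xA4};
        CharsetDetector detector = new CharsetDetector();
        detector.setText(sjisBytes);
        CharsetMatch match = detector.detect();
        System.out.println(match.getName() + " - Confidence: " + match.getConfidence());
    }
}
```

Note that confidence on such a short input is low; real inputs should be larger for the heuristics to work well.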

ICU4J
ICU4J (International Components for Unicode for Java) is another comprehensive library for Unicode and globalization support in Java applications. It offers encoding detection capabilities through its CharsetDetector class. ICU4J uses sophisticated algorithms to analyze text data and infer the correct encoding, even for multilingual content.

Implementing ICU4J in Java Projects
1. This example demonstrates detecting character encoding with the ICU4J library. Add the Maven dependency:
<!-- https://mvnrepository.com/artifact/com.ibm.icu/icu4j -->
<dependency>
    <groupId>com.ibm.icu</groupId>
    <artifactId>icu4j</artifactId>
    <version>74.2</version>
</dependency>
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class CharsetDetectionExample {
    public static void main(String[] args) throws IOException {
        String filePath = args[0]; // path to the text file to analyze
        CharsetDetector detector = new CharsetDetector();
        // Read all bytes from the file into a byte array
        byte[] fileBytes = Files.readAllBytes(Path.of(filePath));
        detector.setText(fileBytes);
        // Best match first, then every candidate with its confidence score
        CharsetMatch charsetMatch = detector.detect();
        System.out.println("charsetMatch=" + charsetMatch.getName());
        Arrays.stream(detector.detectAll())
              .forEach(match -> System.out.println(match.getName()
                      + " - Confidence: " + match.getConfidence()));
    }
}
2. An example text file encoded with Shift_JIS:
おはよう。(おはようございます)
3. After executing the CharsetDetectionExample, the console shows:
charsetMatch=Shift_JIS
Shift_JIS - Confidence: 100
windows-1252 - Confidence: 18
windows-1250 - Confidence: 17
Big5 - Confidence: 10
GB18030 - Confidence: 10
UTF-16LE - Confidence: 10
UTF-16BE - Confidence: 10
ISO-8859-5 - Confidence: 9
4. The result shows the best charset match, followed by every candidate encoding with its confidence score (0–100).
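A CharsetMatch can also decode the input directly via its getString() method. The following sketch uses that to decode with the best match, falling back to an agreed default when confidence is low; the decode helper and the threshold of 50 are illustrative choices, not part of the ICU4J API:

```java
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

import java.io.IOException;

public class DetectAndDecode {

    // Decode bytes using the best detected charset; fall back when confidence is low.
    public static String decode(byte[] bytes, String fallbackCharset) throws IOException {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(bytes);
        CharsetMatch match = detector.detect();
        // Treat low-confidence matches as unreliable (50 is an arbitrary threshold)
        if (match == null || match.getConfidence() < 50) {
            return new String(bytes, fallbackCharset);
        }
        return match.getString(); // decodes the input using the matched charset
    }
}
```

This keeps detection as a best-effort step while guaranteeing the application always produces a decoded string.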
Implementing Apache Tika in Java Projects
1. This example demonstrates detecting character encoding with the Apache Tika library. Add the Maven dependencies:
<!-- https://mvnrepository.com/artifact/org.apache.tika/tika-core -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.9.1</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.9.1</version>
</dependency>
import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.txt.UniversalEncodingDetector;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class CharsetDetectionExample {
    public static void main(String[] args) throws IOException {
        String filePath = args[0]; // path to the text file to analyze
        EncodingDetector encodingDetector = new UniversalEncodingDetector();
        byte[] fileBytes = Files.readAllBytes(Path.of(filePath));
        // Wrap the bytes in a ByteArrayInputStream, which supports mark/reset
        ByteArrayInputStream inputStream = new ByteArrayInputStream(fileBytes);
        Charset detectedCharset = encodingDetector.detect(inputStream, new Metadata());
        System.out.println("detectedCharset=" + detectedCharset.displayName());
    }
}
2. An example text file encoded with Shift_JIS:
おはよう。(おはようございます)
3. After executing the CharsetDetectionExample, the console shows:
detectedCharset=Shift_JIS
4. The result shows the detected charset.
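Since Tika returns a java.nio.charset.Charset, detection composes naturally with reading the file. A minimal sketch, assuming the same Tika dependencies as above; the UTF-8 fallback is an illustrative choice for the case where detection is inconclusive and detect() returns null:

```java
import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.txt.UniversalEncodingDetector;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class DetectThenRead {

    // Read a text file using the charset Tika detects, falling back to UTF-8.
    public static String readDetected(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        EncodingDetector detector = new UniversalEncodingDetector();
        Charset charset = detector.detect(new ByteArrayInputStream(bytes), new Metadata());
        if (charset == null) {
            charset = StandardCharsets.UTF_8; // fallback when detection is inconclusive
        }
        return new String(bytes, charset);
    }
}
```

This is the shape the upload scenario from the beginning of the guide calls for: detect once per file, then decode with the result.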
Troubleshooting
What happens when the developer uses the wrong or default encoding to read a text file?
public void readTextFile() {
    try (FileInputStream fis = new FileInputStream(filePath);
         InputStreamReader isr = new InputStreamReader(fis); // uses the platform default charset
         BufferedReader reader = new BufferedReader(isr)) {
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
���͂悤�B�i���͂悤�������܂�)
After executing the readTextFile method, the console shows the garbled output above; the text is unreadable because the reader used the platform default charset. If the developer constructs the InputStreamReader with an explicit charset parameter and executes again:
public void readTextFile() {
    Charset charset = Charset.forName("Shift_JIS"); // Specify the charset encoding here
    try (FileInputStream fis = new FileInputStream(filePath);
         InputStreamReader isr = new InputStreamReader(fis, charset);
         BufferedReader reader = new BufferedReader(isr)) {
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
おはよう。(おはようございます)
Then, after executing the method again, the console shows the output above, and the text is readable.
Finally
Automatic character encoding detection isn’t 100% accurate, so treat it as an optional validation step. The best approach is to agree with the client on which encoding to use for the text files, which prevents unrecognized characters or parts of the file’s content from becoming unreadable.