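When a C++ program reads or writes files, text encoding matters in two places: the encoding of the file's text content, and the encoding of the file path. If either is mishandled, the program may produce garbled text or fail to find the file. This article examines both issues using Visual C++ as the example, then extends the discussion to the corresponding issues on Linux.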

Encoding of text file content

C++ can read and write files with fstream or wfstream. fstream is an alias for basic_fstream<char, char_traits<char>>, and its character type is char; wfstream is an alias for basic_fstream<wchar_t, char_traits<wchar_t>>, and its character type is wchar_t.

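According to Microsoft's Visual C++ documentation page "<iostream> | Microsoft Docs", the standard input/output streams whose character type is char (cin, cout, cerr) transfer data one byte at a time, whereas the standard input/output streams whose character type is wchar_t (wcin, wcout, wcerr, and so on) convert between the external data and the "wide characters that the program manipulates internally". The experiment below shows that the corresponding file streams handle text encoding in the same way.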

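Experiment: encoding conversion when reading files

First, create a "Windows Console Application" project in Visual Studio 2017, and in the project root directory create a UTF-8 encoded text file "UTF-8.txt" and a GBK encoded text file "GBK.txt". Judging by the output values below, both files contain the single line 你好. On the "Properties" - "C/C++" - "Precompiled Headers" property page of the project, set "Precompiled Header" to "Not Using Precompiled Headers".

In the source file that contains the main function, enter the following code, then compile and run it. The results are shown in the comments.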

#include <iostream>
#include <fstream>
#include <string>

using namespace std;

template<typename T>
void inspect(T *str) {
	while (*str != 0) {
		cout << (int)*str++ << ' ';
	}

	cout << endl;
}

int main() {
	string str;
	wstring wstr;
	ifstream f;
	wifstream wf;

	f.open("UTF-8.txt");
	getline(f, str);
	inspect(str.c_str()); // Output: -28 -67 -96 -27 -91 -67
	f.close();

	f.open("GBK.txt");
	getline(f, str);
	inspect(str.c_str()); // Output: -60 -29 -70 -61
	f.close();

	f.open("UTF-8.txt");
	f.imbue(locale(".65001")); // https://docs.microsoft.com/en-us/cpp/c-runtime-library/locale-names-languages-and-country-region-strings?view=vs-2019
	getline(f, str);
	inspect(str.c_str()); // Output: -28 -67 -96 -27 -91 -67
	f.close();

	f.open("GBK.txt");
	f.imbue(locale(".936"));
	getline(f, str);
	inspect(str.c_str()); // Output: -60 -29 -70 -61
	f.close();

	wf.open("GBK.txt");
	getline(wf, wstr);
	inspect(wstr.c_str()); // Output: 196 227 186 195
	wf.close();

	wf.open("UTF-8.txt");
	getline(wf, wstr);
	inspect(wstr.c_str()); // Output: 228 189 160 229 165 189
	wf.close();

	wf.open("UTF-8.txt");
	wf.imbue(locale(".65001"));
	getline(wf, wstr);
	inspect(wstr.c_str()); // Output: 20320 22909
	wf.close();

	wf.open("GBK.txt");
	wf.imbue(locale(".936"));
	getline(wf, wstr);
	inspect(wstr.c_str()); // Output: 20320 22909
	wf.close();
}

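The table below summarizes how the program handles encoding when text files are read through the two kinds of file stream.

	UTF-8 encoded file	GBK encoded file
fstream (default)	raw data	raw data
fstream (locale set)	raw data	raw data
wfstream (default)	raw data	raw data
wfstream (locale set)	converted to UTF-16 (the program's wide character encoding)	converted to UTF-16 (the program's wide character encoding)

As the table shows, wfstream performs encoding conversion once a locale matching the file's encoding has been imbued: the file content is converted to the "wide characters that the program manipulates internally". Without such a locale, wfstream merely widens each byte to one wchar_t.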

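Conclusion

In Visual C++, accessing files through wfstream with a suitable locale imbued, and storing strings in wide character types such as wstring or const wchar_t*, is a fairly convenient way to handle encoding conversion of file content.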

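Extension: reading a UTF-8 encoded file on Linux

The following sample, which compiled and ran successfully with g++, reads a UTF-8 file into a wstring. Note that the argument passed to the locale constructor differs from the one used on Windows.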

	wf.open("UTF-8.txt");
	wf.imbue(locale("en_US.utf-8"));
	getline(wf, wstr);
	inspect(wstr.c_str()); // Output: 20320 22909
	wf.close();

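Encoding of file paths

When reading or writing files in Visual C++, encoding must be taken into account if the path contains non-ASCII characters (Chinese, for example).

The file-related Win32 APIs come in a W version and an A version. The W version takes wide character strings encoded in UTF-16; the A version takes narrow character strings in the ANSI encoding, that is, the local text encoding determined by the system's regional settings. (See "Working with Strings | Microsoft Docs".)

With fstream or wfstream from the C++ standard library in Visual C++, the _Filename parameter of the constructor and of the open function likewise accepts both const char* and const wchar_t*. The experiment below shows that these two forms are interpreted in the same way as the narrow and wide strings of the Win32 API: the wide version takes UTF-16, and the narrow version takes the ANSI encoding.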

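Experiment: encoding of file paths

First, create a "Windows Console Application" project in Visual Studio 2017, and in the project root directory create a file named "文件.txt". On the "Properties" - "C/C++" - "Precompiled Headers" property page of the project, set "Precompiled Header" to "Not Using Precompiled Headers".

In the source file that contains the main function, enter the following code, then compile and run it. The results are shown in the comments. Note that Visual Studio saves cpp files in the ANSI encoding by default.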

#include <iostream>
#include <fstream>
#include <string>

using namespace std;

template<typename T>
void inspect(T *str) {
	while (*str != 0) {
		cout << (int)*str++ << ' ';
	}

	cout << endl;
}

int main() {
	string fileName;
	wstring wFileName;
	ifstream f;

	fileName = "文件.txt";
	inspect(fileName.c_str()); // -50 -60 -68 -2 46 116 120 116
	f.open(fileName);
	cout << f.is_open() << endl; // 1
	f.close();

	fileName = u8"文件.txt";
	inspect(fileName.c_str()); // -26 -106 -121 -28 -69 -74 46 116 120 116
	f.open(fileName);
	cout << f.is_open() << endl; // 0
	f.close();

	wFileName = L"文件.txt";
	inspect(wFileName.c_str()); // 25991 20214 46 116 120 116
	f.open(wFileName);
	cout << f.is_open() << endl; // 1
	f.close();
}

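As the output shows, the unprefixed narrow string keeps the original ANSI encoded data (GBK in this case), and it opens the file successfully when used as a path. The u8 string is stored in UTF-8 but fails to open the file. The L string is stored in UTF-16 and opens the file correctly. The way fstream in Visual C++ handles path encoding is therefore consistent with the Win32 API convention.

In principle, for a narrow string literal without any prefix, the compiler keeps the bytes of the source file's encoding and performs no conversion. The program therefore runs correctly only if the regional settings of the build environment and of the runtime environment match; otherwise the ANSI encoding at build time and the ANSI encoding at run time are two different encodings.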

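Conclusion

When opening a file in Visual C++ with a narrow string, make sure the string is in the ANSI encoding. Because the ANSI encoding depends on the regional settings and can represent fewer characters than the Unicode family of encodings, it is best to pass an L-prefixed wide string as the file path argument.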

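Extension: file paths on Linux

fstream in g++ does not accept wchar_t path arguments, so only narrow character strings can be used, encoded in UTF-8. A wide character string can be converted to a UTF-8 string with the std::codecvt<wchar_t, char, std::mbstate_t> facet of a UTF-8 locale.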

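In C++17, the <filesystem> header allows a wide character string (wstring or const wchar_t*) to be converted to the system's native path type through the constructor of std::filesystem::path; the resulting path can then be passed to the constructor or the open function of fstream. Sample code follows. It was tested successfully in Visual C++ 2017 with the "ISO C++17 Standard" setting (/std:c++17).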

	filesystem::path p(L"文件.txt");
	f.open(p);
	cout << f.is_open() << endl; // 1
	f.close();

Text file read/write encoding issues in Visual C++