How to write Unicode to console?



  • Hello.

    I made an image-processing program (with a console interface). It takes a file named, for example, filename.ext and produces filenameCR2.png . The file name is passed as a command-line argument. File names can contain any characters (for example, Greek or Cyrillic), so I decided to make my program Unicode-enabled:

    • I declared _UNICODE macro
    • included <tchar.h>
    • changed char to TCHAR
    • main to _tmain
    • strcmp to _tcscmp
    • std::string to std::wstring
    • std::cout to std::wcout
    • and so on...

    I have read somewhere that on the Windows platform wchar_t strings actually store text in the UTF-16 encoding, rather than one Unicode code point per wchar_t . So I looked over the code to check that it will work with multiple elements per code point.

    Now the program works fine and, given DSC01150αβ.JPG , produces a file named DSC01150αβCR2.png .

    THE PROBLEM IS that I cannot output the name of the file being processed to the console!

    The code is:

    ... //Some includes here
    
    using namespace std;
    
    int _tmain(int const argC, TCHAR const *const *const argV)
    {
    	try
    	{
    		if( argC < 2 ) throw runtime_error("You should specify file name in the command line");
    
    		wstring const inputFileName( argV[1] );
    
    		...
    
    		wcout << _T("Processing file ") << inputFileName << _T(" ; it may take several days to complete...");
    
    		...
    	}
    	catch(exception const &e)
    	{
    		...
    	}
    }
    

    If I pass a file named DSC01150.JPG , then the output is:
    Processing file DSC01150.JPG ; it may take several days to complete...
    But if I pass a file named DSC01150Σ.JPG , then the output is:
    Processing file DSC01150
    After that the program cannot output any characters to the console!

    I searched the Internet for information on my problem and read every site I could find, but none of the «solutions» and «workarounds» actually work!

    I rewrote a complex program to use Unicode and yet cannot do the simplest thing: print the file name to the screen 😞

    P.S.: MS Win Vista, MS VS 2008.



  • Have you considered using the WinAPI output functions to do the Unicode output? I've had a lot of problems with iostreams as well when it came to Unicode (or possibly any other encoding...). Besides that, I'm not sure about Vista's command-line support for Unicode characters; I don't have a lot of experience with Vista, but judging from what I've seen in XP and all the other versions of Windows, I'd guess that Unicode support is rather limited. You can change the code page, but that's not really what you are looking for. What you could do is redirect your iostreams to your own console-like interface...



  • I don't know if the console has Unicode support, but at least you can specify the codepage it works with. Maybe take a look at SetConsoleCP and SetConsoleOutputCP.
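    One way to try this codepage route is to switch the console to UTF-8 (codepage 65001) and print UTF-8 bytes through plain cout . Whether the glyphs then actually show up still depends on the console font, and Vista-era consoles are known to be picky about 65001. A minimal sketch; the Windows call is compiled only on Windows:

```cpp
#include <iostream>
#ifdef _WIN32
#include <windows.h>
#endif

// Switch the console to UTF-8 and write UTF-8 encoded bytes through the
// narrow stream. Outside Windows the SetConsoleOutputCP call is skipped
// and the raw bytes are printed as-is.
void print_utf8_greek()
{
#ifdef _WIN32
    SetConsoleOutputCP(CP_UTF8);       // 65001; affects this console window
#endif
    std::cout << "\xCE\xB1\xCE\xB2\n"; // the UTF-8 byte sequence for "αβ"
}
```
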



  • I have found that the console uses codepage 866 on my Russified Win Vista.
    This is a codepage with Russian and English letters.

    I decided to test it.
    I created files with different names (Russian, English, Greek, Chinese) and typed the DIR command in the command line to see how these names would be displayed. Interestingly, the English and Russian characters were displayed, but the other letters were replaced by question marks (?).

    So, I don't want MY program to be better than DIR in this case. I just want to display all characters that can be displayed in the current codepage. Because if the user's files are named in Russian, then (probably) he is Russian and has also set the appropriate codepage.

    Also, I read in MSDN that it is not good practice to change the console codepage, because the console does not belong to the application and may be shared between applications.

    Now, the questions are:

    1. How do I convert a wide string from UTF-16 to the codepage used in the console?
    2. Why is this conversion not performed automatically by wcout ? Why does the output stop at the first non-English character?
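    For question 1, a sketch of an explicit conversion: on Windows, WideCharToMultiByte with GetConsoleOutputCP() converts UTF-16 to the console's current codepage, replacing unconvertible characters with a default character (usually ? , just like DIR does). The non-Windows branch below is only a stand-in so the sketch stays self-contained; it is not part of the answer.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#else
#include <clocale>
#include <cstdlib>
#endif

// Convert a UTF-16 wide string to a narrow string in the console's
// current codepage (Windows), or in the environment's locale encoding
// (stand-in branch for other platforms).
std::string to_console_encoding(std::wstring const &s)
{
#ifdef _WIN32
    // First call computes the required buffer size (incl. terminator,
    // because cchWideChar is -1); second call does the conversion.
    int const n = WideCharToMultiByte(GetConsoleOutputCP(), 0,
                                      s.c_str(), -1, NULL, 0, NULL, NULL);
    std::string out(n, '\0');
    WideCharToMultiByte(GetConsoleOutputCP(), 0,
                        s.c_str(), -1, &out[0], n, NULL, NULL);
    out.resize(n - 1);                 // drop the counted terminator
    return out;
#else
    std::setlocale(LC_ALL, "");
    std::string out(s.size() * 4 + 1, '\0');
    std::size_t const n = std::wcstombs(&out[0], s.c_str(), out.size());
    out.resize(n == static_cast<std::size_t>(-1) ? 0 : n);
    return out;
#endif
}
```
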


  • I know it doesn't directly fit your needs, but at least I can give you a hint on how to do it: Boost provides a facet that performs a similar task. You should look for something like that.



  • This thread was moved by moderator HumeSikkins from the C++ forum to the DOS and Win32 console forum.

    In case of doubt, please also note the following:
    C/C++ Forum :: FAQ - Sonstiges :: Wohin mit meiner Frage?

    This posting was generated automatically.



  • I have found the explanation and a solution (by myself).

    Explanation
    The console uses the default codepage 866 (a DOS-era legacy) on my Russian Windows.
    The char strings ( "Text" ) in the program use codepage 1251.
    The wchar_t strings ( L"Text" ) in the program use the UTF-16 encoding.

    The simplest code, cout << "Text" , should not even work! It works only by sheer luck: the characters 'T' , 'e' , 'x' and 't' occupy the same positions in codepages 866 and 1251!

    This explains why cout << "[Russian text]" produces mojibake: the Russian letters occupy different positions in codepages 866 and 1251.

    Now about wcout . By default it does not really support Unicode! With the default "C" locale its conversion facet only handles 7-bit ASCII characters. If wcout encounters a character with a code > 127, the conversion fails and the stream goes into an error state; in my case wcout becomes unusable after that (until its state is cleared). And even if wcout converted correctly, the console codepage still could not display every character!

    Since cout passes character codes 0..255 straight through, while wcout with the default locale only handles 0..127, wcout in its default state is almost useless to me!

    Now the Solution (Microsoft-specific :)).
    I have found the great WinAPI function CharToOemA and its counterpart CharToOemW . What do they do?

    CharToOemA converts a char string from the program's default codepage to the default console codepage. If I compile my program with char s[] = "[Russian text]" in it, then the program codepage will be 1251 (stored in the executable's resources), and [Russian text] will be stored in this codepage. Now I can convert and output it: CharToOemA(s, buf); cout<<buf<<endl; . This outputs Russian text on every system that has Russian letters in its console codepage (not only 866)! If a system has no Russian symbols in its console codepage, the symbols become question marks (?). At least the user will know that some text is being output which his console codepage cannot represent.

    Since my default system codepage is 1251, I cannot use char s[] = "αβ" : I get the compile warning "warning C4566: character represented by universal-character-name '\u03B1' cannot be represented in the current code page (1251)". How to fix it? Read on!

    CharToOemW is more interesting. It always treats its input as UTF-16, the same encoding as in L"Text here" , so it does not depend on the system default codepage (1251 in my case). So I can do it this way: wchar_t s[] = L"αβ"; CharToOemW(s, buf); cout<<buf<<endl; . Notice that I use cout here, not wcout . Now everyone who runs this program and has Greek letters in his codepage will see Greek characters in the console! If the codepage has no Greek characters, underscores (_) are output instead.

    I have overloaded the minus operator (the << operator is already taken :() to do all the conversions automatically for me:

    #include <tchar.h>
    #include <windows.h>
    
    #include <iostream>
    #include <string>
    #include <sstream>
    #include <cstring>
    
    #ifdef _UNICODE
    	#define tString std::wstring
    	#define tStringStream std::wstringstream
    #else
    	#define tString std::string
    	#define tStringStream std::stringstream
    #endif
    
    //The output stream is always char stream (not wchar_t stream)
    std::ostream &operator-(std::ostream &stream, std::wstring const &s)
    {
    	char *const buf( new char[ s.length()*2+1 ] ); //Up to two bytes per wchar_t in DBCS console codepages, plus zero terminator
    	CharToOemW(s.c_str(), buf);
    	stream << buf;
    	delete[] buf;
    
    	return stream;
    }
    
    std::ostream &operator-(std::ostream &stream, wchar_t const *const s)
    {
    	char *const buf( new char[ std::wcslen(s)*2+1 ] ); //Up to two bytes per wchar_t in DBCS console codepages, plus zero terminator
    	CharToOemW(s, buf);
    	stream << buf;
    	delete[] buf;
    
    	return stream;
    }
    
    std::ostream &operator-(std::ostream &stream, std::string const &s)
    {
    	char *const buf( new char[ s.length()+1 ] );
    	CharToOemA(s.c_str(), buf);
    	stream << buf;
    	delete[] buf;
    
    	return stream;
    }
    
    std::ostream &operator-(std::ostream &stream, char const *const s)
    {
    	char *const buf( new char[ std::strlen(s)+1 ] );
    	CharToOemA(s, buf);
    	stream << buf;
    	delete[] buf;
    
    	return stream;
    }
    
    template<class C> std::ostream &operator-(std::ostream &stream, C const &c)
    { //Other types
    	tStringStream s; s << c; //First convert to string (or wstring, depending of _UNICODE)
    	stream - s.str(); //Output string with conversion
    
    	return stream;
    }
    

    Now I can write as follows:

    cout - "[Russian Text]" << endl; //1251 -> 866 conversion here
    cout - L"[Russian Text]" << endl; //UTF-16 -> 866 conversion here
    cout << 10 << endl; //Actually safe: the digits are in the ASCII range, which these codepages share.
    cout - 10 << endl; //Works too, and keeps all output going through one conversion path.
    cout - _T("[Russian text]") << endl; //1251 -> 866 or UTF-16 -> 866 conversion, depending on the _UNICODE project setting
    

    The code above works with _UNICODE defined and with _UNICODE undefined alike.

    The following code works only with _UNICODE defined:

    cout - _T("αβ") << endl;
    

    As it should be on my system.

    Does anybody know how to make my own cout -like object, so that I can write << instead of - ?

    P.S.: Why can I not write Russian text in the forum? This is unfair. For example, in most Russian forums visitors can write in English (and in German too).



  • I have noticed the Latex tag...
    \left|\frac{1}{\zeta-z-h}-\frac{1}{\zeta-z}\right| = \left|\frac{(\zeta-z)-(\zeta-z-h)}{(\zeta-z-h)(\zeta-z)}\right| = \left|\frac{h}{(\zeta-z-h)(\zeta-z)}\right| \leq \frac{2|h|}{|\zeta-z|^2}
    What I am doing wrong?



  • AFAIK it is broken at the moment.



  • Hey, this is really cool, and CharToOem seems really ingenious! Thank you for this little article, SAn 🙂 To be honest, I don't really like the minus workaround for cout , but I can't think of a better solution myself (maybe plain functions for output?).

    To create your own stream object, AFAIK you should derive from basic_ostream<charT,traits> . Maybe you could take a look here.
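    Deriving from basic_ostream is one route; a lighter alternative is to wrap an existing ostream and intercept wide strings, so the familiar << syntax works. In the sketch below, the CharToOemW branch compiles only on Windows; the fallback is deliberately lossy and exists only to keep the sketch self-contained.

```cpp
#include <iostream>
#include <sstream>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Thin wrapper: forwards everything to a real ostream, but narrows wide
// strings through the OEM conversion first, restoring the << syntax.
class oem_ostream
{
    std::ostream &out_;
public:
    explicit oem_ostream(std::ostream &out) : out_(out) {}

    oem_ostream &operator<<(std::wstring const &s)
    {
#ifdef _WIN32
        std::string buf(s.size() * 2 + 1, '\0'); // up to 2 bytes per wchar_t in DBCS codepages
        CharToOemW(s.c_str(), &buf[0]);
        out_ << buf.c_str();
#else
        // Stand-in for non-Windows builds: keep ASCII, replace the rest.
        for (std::wstring::size_type i = 0; i < s.size(); ++i)
            out_ << static_cast<char>(s[i] < 128 ? s[i] : '?');
#endif
        return *this;
    }

    template<class T> oem_ostream &operator<<(T const &v)
    {
        out_ << v;                     // everything else passes straight through
        return *this;
    }
};

// Small demonstration against an in-memory stream.
std::string oem_ostream_demo()
{
    std::ostringstream os;
    oem_ostream oc(os);
    oc << std::wstring(L"abc") << 10;
    return os.str();
}
```

One caveat of this wrapper: manipulators such as endl do not bind to the catch-all template, so '\n' has to be used instead.
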



  • Where can I find a full description of facets ( std::codecvt )?
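    Until a full facet reference turns up, the short version: a codecvt facet travels inside a locale object, and imbuing a stream replaces the facet it converts with. A sketch of pointing wcout at the environment's native encoding follows; whether this helps on a given console still depends on its codepage.

```cpp
#include <iostream>
#include <locale>
#include <stdexcept>

// Imbue the wide streams with the environment's native locale, whose
// codecvt facet knows how to narrow non-ASCII characters (unlike the
// classic "C" locale, which fails on anything above 127).
bool imbue_environment_locale()
{
    try
    {
        std::locale env("");       // the environment's native locale
        std::locale::global(env);  // also affects the C library
        std::wcout.imbue(env);     // wcout now converts through env's codecvt
        return true;
    }
    catch (std::runtime_error const &)
    {
        return false;              // environment locale not recognized
    }
}
```
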


