[Perl] Trying to write text to a new unicode file (28)

1 Name: #!/usr/bin/anonymous : 2006-09-19 01:51 ID:fr52Zfmu

I'm trying to use an HTML form and script to write UTF-8 characters to a UTF-8 encoded text file. When I copy and paste "⊂二二二( ^ω^)二二二⊃" into the text area, the text file ends up as this: http://sageru.org/stuff/test.txt. If I put the latter into an HTML file it would show up as the original text, but that's not what I want.

#!/usr/bin/perl
use CGI qw/:standard/;
use Encode;
use utf8;
print header;
if ( param('do') eq "write") {
    my $text = param("text");
    print "$text ";
    open(OUT, '>:utf8', "test.txt");
    print OUT Encode::encode('utf8', $text);
    close(OUT);
    print "done";
}
else {
    print "<html><body><form method='post' action='go.pl'><input type='hidden' name='do' value='write'><textarea name='text'>";
    print "</textarea><input type='submit'></form></body></html>";
}

I have tried replacing

    Encode::encode('utf8', $text);

with plain old

    $text;

but the result is the same. Can anyone offer help?

2 Name: #!/usr/bin/anonymous : 2006-09-19 03:28 ID:Fq9K21WI

you're sending the page as iso-8859-1, so your browser converts the characters it can't represent in that charset to html numeric entities.
replacing print header; with print header(-charset => "utf-8"); should do what you want.
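
A minimal sketch of that suggestion, using only core Perl. Both helper names are made up for illustration, and the sketch assumes the submitted value reaches the script as raw UTF-8 bytes. It decodes those bytes once and lets the output layer encode once; note that the script in the first post both opens the file with a :utf8 layer and calls Encode::encode, which would encode twice.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: a response header with an explicit charset,
# written by hand so the sketch does not depend on CGI.pm.
sub utf8_html_header {
    return "Content-Type: text/html; charset=utf-8\r\n\r\n";
}

# Hypothetical helper: $raw is ASSUMED to be the submitted form value
# as raw UTF-8 bytes. Decode once on input, encode once on output.
sub save_submitted_text {
    my ($raw, $path) = @_;
    my $text = decode('UTF-8', $raw);
    open(my $out, '>:encoding(UTF-8)', $path)
        or die "Cannot open $path: $!";
    print {$out} $text;
    close($out) or die "Cannot close $path: $!";
    return $text;
}
```

In a CGI script this would probably be used as print utf8_html_header(); followed by save_submitted_text($bytes_from_the_form, "test.txt");.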

3 Name: #!/usr/bin/anonymous : 2006-09-19 10:34 ID:MZS1kl67

>>2

No, it's not that simple. The charset header only controls what the page is DISPLAYED as, not what forms are submitted as. The two can be different.

The form charset is set by a <meta> tag in the header:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

4 Name: #!/usr/bin/anonymous : 2006-09-19 11:15 ID:BCFyAg5p

Don't >>2 and >>3 have the same effect?

IIRC, setting the charset of the document to UTF-8 (either through HTTP or pretend HTTP in the META element) ought to push the form's encoding to UTF-8. If that doesn't work, you should set the accept-charset attribute of the FORM element to UTF-8 too.
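
As a rough illustration, the form page from the first post could carry both hints at once: the META declaration in the head and accept-charset on the FORM. The helper name form_page is hypothetical, and whether a given browser honours accept-charset is not something this sketch can show.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the form page from the first post with
# the charset declared in a META element and on the FORM element.
sub form_page {
    return join "\n",
        '<html><head>',
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">',
        '</head><body>',
        '<form method="post" action="go.pl" accept-charset="UTF-8">',
        '<input type="hidden" name="do" value="write">',
        '<textarea name="text"></textarea>',
        '<input type="submit">',
        '</form></body></html>';
}

print form_page(), "\n";
```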

5 Name: #!/usr/bin/anonymous : 2006-09-19 14:20 ID:MZS1kl67

>>4

No. The HTTP header affects the document display only, and the <meta> tag affects the forms only.

I think some web servers might parse the HTML to find the <meta> tag and change the HTTP header accordingly, but that is not standard behaviour.

6 Name: #!/usr/bin/anonymous : 2006-09-19 17:31 ID:Heaven

>>5
that meta tag should have the same effect as the header. if it doesn't, your browser is very badly broken.

7 Name: #!/usr/bin/anonymous : 2006-09-19 19:04 ID:tpDBkp+v

>>6

Go try it in any browser. Also note that the meta tag can't have the same effect as the HTTP header - because the meta tag can't even be parsed until the browser knows both the Content-Type and the charset, and then it's too late to change its mind even if it does see the tag.

8 Name: #!/usr/bin/anonymous : 2006-09-19 19:35 ID:Heaven

>>7
http://www.w3.org/TR/html4/struct/global.html#edef-META

> The http-equiv attribute can be used in place of the name attribute and has a special significance when documents are retrieved via the Hypertext Transfer Protocol (HTTP). HTTP servers may use the property name specified by the http-equiv attribute to create an [RFC822]-style header in the HTTP response. Please see the HTTP specification ([RFC2616]) for details on valid HTTP headers.
>
> The following sample META declaration:
>
> <META http-equiv="Expires" content="Tue, 20 Aug 1996 14:25:27 GMT">
>
> will result in the HTTP header:
>
> Expires: Tue, 20 Aug 1996 14:25:27 GMT

9 Name: #!/usr/bin/anonymous : 2006-09-19 19:43 ID:tpDBkp+v

> HTTP servers may use the property name specified by the http-equiv attribute to create an [RFC822]-style header in the HTTP response.

Note the emphasis on the words "servers" and "may".

10 Name: #!/usr/bin/anonymous : 2006-09-19 19:50 ID:Heaven

>>9
i would assume that if browsers do anything with http-equiv meta tags, they should emulate what would happen if the server did send the header.

11 Name: #!/usr/bin/anonymous : 2006-09-19 21:05 ID:tpDBkp+v

>>10

First, as I already said, they can't. Second, they don't, so your assumption is wrong.

And none of that is "broken".

12 Name: #!/usr/bin/anonymous : 2006-09-20 03:31 ID:Heaven

>>11
they can. and using a content-type http-equiv meta tag for anything except the content type and encoding for the entire document makes no sense at all.

13 Name: #!/usr/bin/anonymous : 2006-09-20 11:25 ID:Heaven

>>12

First, using it for the content type makes no sense either, because you have to assume it is text/html to parse it, and if it says anything else, what the hell is THAT supposed to mean?

As for the character set, that would mean you first have to guess the character set used, parse the file, find out the real character set (assuming you guessed right the first time and could parse it), and then re-parse if you got it wrong. This is not a behaviour you should be relying on.

14 Name: #!/usr/bin/anonymous : 2006-09-20 12:50 ID:Af+S+4Le

>>13
i'm saying that the browser should ignore the meta tag completely. form submissions should use the same character set as the document the form is in.

15 Name: #!/usr/bin/anonymous : 2006-09-20 15:13 ID:Heaven

>>14

Ah, I see.

The problem with that is, most pages are sent without a character set, and the browser guesses. However, this guess is not communicated forwards when the form is submitted, and the script receiving it must be sure about that character set. So there has to be some mechanism to make sure that you know the character set of the POST data, even when you have no control over the headers the original page is sent in.

Now, if you want to argue that the current solution is bad, I would agree. However, it is a workable solution, it's just kind of silly.

(If you ask me, <form> elements should have a charset attribute.)

16 Name: #!/usr/bin/anonymous : 2006-09-20 15:15 ID:6In1HFud

>>13
You don't have to guess. A standard compliant (X)HTML file will normally not have anything that cannot be decoded as plain ascii before the meta tag that specifies the charset.

Read http://www.joelonsoftware.com/articles/Unicode.html, that explains it better than I could.

>>1
If you can't solve it, just try to work around it: Decode the HTML entities and save the result. There should be a module in the CPAN that can do this, HTML::Entities or something.

17 Name: #!/usr/bin/anonymous : 2006-09-20 17:55 ID:tpDBkp+v

>>16

Unless it's UTF-16. Or UTF-32. Or UTF-7. And so on. Even disregarding that, your assumption is awful shaky - the file can easily have comments, a <title> tag, and whatever else before that.

Also, using a module to decode entities is ridiculous overkill for something that can be done in two regexps.

$text=~s/&#([0-9]+);/chr($1)/ge;
$text=~s/&#x([0-9a-fA-F]+);/chr(hex($1))/ge;

Or just one regexp, if you want to be tricky:

$text=~s/&#(([0-9]+)|x([0-9a-fA-F]+));/chr($2 or hex($3))/ge;
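
For what it's worth, here is the single-regexp version wrapped in a sub and run on a made-up sample: U+2282 (the first character of the text in the first post) as a decimal reference and U+03C9 as a hex one. The sub name and the sample string are assumptions for illustration. The decoded characters are encoded to UTF-8 bytes explicitly before printing.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The single-regexp version from the post, wrapped in a sub.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#(([0-9]+)|x([0-9a-fA-F]+));/chr($2 or hex($3))/ge;
    return $text;
}

# Assumed sample input: one decimal and one hexadecimal reference.
my $submitted = '&#8834; ^&#x3c9;^';
my $decoded   = decode_numeric_entities($submitted);

# STDOUT has no encoding layer here, so encode to UTF-8 bytes by hand.
binmode(STDOUT);
print encode('UTF-8', $decoded), "\n";
```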

18 Name: #!/usr/bin/anonymous : 2006-09-20 19:05 ID:Heaven

>>17
what about stuff like #? the two regex version would turn that into #, the second would turn it into #... and what if the user actually types #?

19 Name: #!/usr/bin/anonymous : 2006-09-20 19:07 ID:Heaven

>>18
oh crap kareha messed up my post...
trying again...
what about stuff like & & # 3 5 ; x 2 3 ;? the two regex version would turn that into #, the second would turn it into & # x 2 3 ;... and what if the user actually types & # 3 5 ;?
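
The difference can be checked with a small sketch, assuming the input meant here is the string &&#35;x23; with no spaces. The sub names are made up; the regexps are the ones from >>17. The two-pass version decodes its own output a second time, while the single regexp touches each reference only once.

```perl
use strict;
use warnings;

# Two-regexp version: decimal references first, then hexadecimal ones.
sub decode_two_pass {
    my ($text) = @_;
    $text =~ s/&#([0-9]+);/chr($1)/ge;
    $text =~ s/&#x([0-9a-fA-F]+);/chr(hex($1))/ge;
    return $text;
}

# Single-regexp version: each reference is decoded exactly once.
sub decode_one_pass {
    my ($text) = @_;
    $text =~ s/&#(([0-9]+)|x([0-9a-fA-F]+));/chr($2 or hex($3))/ge;
    return $text;
}

my $input = '&&#35;x23;';
printf "two-pass: %s\n", decode_two_pass($input);   # prints "two-pass: #"
printf "one-pass: %s\n", decode_one_pass($input);   # prints "one-pass: &#x23;"
```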

20 Name: #!/usr/bin/anonymous : 2006-09-20 20:30 ID:tpDBkp+v

>>19

True. Use the one-regexp version. And there is no way to know if the user typed in an entity by hand or if the browser auto-converted it. This is a shortcoming of HTTP, and there is no solution.

21 Name: #!/usr/bin/anonymous : 2006-09-20 21:46 ID:6In1HFud

>>17
Here, have a spec:
http://www.w3.org/TR/html4/charset.html#doc-char-set

"The META declaration must only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters (at least until the META element is parsed). META declarations should appear as early as possible in the HEAD element."

So, you can use this tag that way, but using it when everything before the tag isn't ascii decodable is not valid HTML. It's normal to put the charset tag as the first element in HEAD anyways, or not? That means that you are not allowed to use some charsets together with the meta tag, but for when you have a useable charset (Say, UTF-8), it's a simple solution that works very well.

>>1
Even more spec stuff. Try setting accept-charset on your form. Don't know if browsers implement this, though.
http://www.w3.org/TR/html4/interact/forms.html#adef-accept-charset

>>20

>This is a shortcoming of HTTP, and there is no solution.

Wrong. The guys who invented the protocol/language thought of that, of course. That's just basic escaping, you know. Try typing "&amp;" into some box, e.g. the google search box. The browser will convert it to "%26amp%3B" and if you type "%26amp;", it will turn into "%2526amp%3B". So, unless the user was talking to the server directly with telnet or something, everything will be properly escaped and the server will get exactly what the user typed.

22 Name: #!/usr/bin/anonymous : 2006-09-20 23:16 ID:BCFyAg5p

>>13

>As for the character set, that would mean you first have to guess the character set used, parse the file, find out the real character set (assuming you guessed right the first time and could parse it), and then re-parse if you got it wrong. This is not a behaviour you should be relying on.

And yet, I'm pretty sure that's what happens (I certainly recall reading that this is the behaviour used by Firefox). No, it isn't behaviour you should be relying on. This is why the W3C recommends specifying the charset in the Content-type header. The META solution is for those cases where you can't change the server's behaviour (i.e. .htaccess is disabled).

>>17

>the file can easily have comments, a <title> tag, and whatever else before that.

It can. The Content-type META tag ought to really appear as soon as possible since the browser will throw out the whole parse tree and change the charset as soon as it gets there. Of course a lot of people don't understand this. As long as the browser can read the META tag, it's fine though.

23 Name: #!/usr/bin/anonymous : 2006-09-20 23:49 ID:Heaven

>it isn't behaviour you should be relying on.

It's slightly hack-ish, but it is part of the standard, so any browser should be able to do it. It certainly beats the "Hey lets guess the charset based on character count!" approach. Though I agree, if you can you should avoid it.

>As long as the browser can read the META tag, it's fine though.

IIRC, you should not have any non-ascii-interpretable stuff even before the meta tag (So you shouldn't place <title> tags with chars > 127 before the meta tag), but most browsers will probably manage to figure everything out anyways. Simple solution/best practice: If you use the meta tag, make it the first thing in the header.
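
Why the position matters can be sketched with a toy prescan. This is not how any particular browser does it; the helper name, the 1024-byte limit and the rule of giving up at the first non-ASCII byte are all assumptions for illustration. It reads the start of the document as ASCII and only trusts a META charset found before anything non-ASCII.

```perl
use strict;
use warnings;

# Hypothetical helper: look for a charset declaration in the first
# $limit bytes of a document, treating those bytes as ASCII only.
sub sniff_meta_charset {
    my ($bytes, $limit) = @_;
    $limit = 1024 unless defined $limit;
    my $head = substr($bytes, 0, $limit);
    # Drop everything from the first non-ASCII byte onwards: it cannot
    # be interpreted safely before the charset is known.
    $head =~ s/[^\x00-\x7F].*//s;
    if ($head =~ /<meta\b[^>]*\bcharset\s*=\s*["']?([A-Za-z0-9._:-]+)/i) {
        return lc $1;
    }
    return;
}

my $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
my $good = "<html><head>$meta<title>\xE2\x8A\x82</title></head>";
my $bad  = "<html><head><title>\xE2\x8A\x82</title>$meta</head>";

print defined sniff_meta_charset($good) ? "found\n" : "not found\n";
print defined sniff_meta_charset($bad)  ? "found\n" : "not found\n";
```

With the META element first, the toy prescan finds utf-8; with a non-ASCII title in front of it, it gives up before reaching the declaration.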

24 Name: #!/usr/bin/anonymous : 2006-09-21 03:44 ID:Heaven

>Wrong. The guys who invented the protocol/language thought of that, of course. That's just basic escaping, you know. Try typing "&amp;" into some box, e.g. the google search box. The browser will convert it to "%26amp%3B" and if you type "%26amp;", it will turn into "%2526amp%3B". So, unless the user was talking to the server directly with telnet or something, everything will be properly escaped and the server will get exactly what the user typed.

the google search box isn't a good example, since it uses utf-8...
make a form in a page that uses us-ascii as the charset and try typing æ and & # 230 ; (without the spaces, of course) into it. they'll both give you the exact same result.

25 Name: #!/usr/bin/anonymous : 2006-09-21 12:17 ID:Heaven

> Wrong. The guys who invented the protocol/language thought of that, of course. That's just basic escaping, you know. Try typing "&amp;" into some box, e.g. the google search box. The browser will convert it to "%26amp%3B" and if you type "%26amp;", it will turn into "%2526amp%3B". So, unless the user was talking to the server directly with telnet or something, everything will be properly escaped and the server will get exactly what the user typed.

No, not wrong. You are talking about URL encoding in HTTP GET, I am talking about encoding characters outside the current character set in form data. These are two different layers of encoding, applied on top of each other.

If you type a non-ASCII character into a form that is using ASCII encoding, it will first be converted to &#nnnn;. Then, if it is submitted over HTTP GET, it will be further encoded as %26%23nnnn%3B. Thus, you still can't tell if the user typed &#nnnn;, or if the browser auto-converted.
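
The two layers can be imitated in a few lines. This is a simulation of the browser behaviour described above, not a full implementation of form encoding, and both sub names are made up. Note that the semicolon gets percent-encoded along with the ampersand and the hash sign.

```perl
use strict;
use warnings;

# Layer 1 (ASSUMED browser behaviour, as described in this post):
# characters outside ASCII become decimal numeric references.
sub to_ascii_form_value {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# Layer 2: percent-encode everything except a small safe set.
# Simplified: real form encoding sends a space as "+", not "%20".
sub url_encode {
    my ($bytes) = @_;
    $bytes =~ s/([^A-Za-z0-9\-_.*])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

my $typed_char   = "\x{e6}";    # the user typed the character U+00E6
my $typed_entity = '&#230;';    # the user typed the reference by hand

print url_encode(to_ascii_form_value($typed_char)),   "\n";
print url_encode(to_ascii_form_value($typed_entity)), "\n";
```

Both lines come out as %26%23230%3B, so the receiving script has nothing to tell the two inputs apart.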

26 Name: #!/usr/bin/anonymous : 2006-09-21 12:20 ID:MZS1kl67

>>21

Back when I was experimenting with this, a year or two ago, browsers completely ignored <meta> tags for the display character set. They only used it for the forms. This may have been improved since, but usually, you should not rely on the <meta> tag applying to anything but the forms.

27 Name: #!/usr/bin/anonymous : 2006-09-21 13:24 ID:Heaven

>>25
>>24
I see... hm. Another good reason to use UTF-8 or something similar, I guess.

28 Name: #!/usr/bin/anonymous : 2006-09-21 15:17 ID:Heaven

>>27

That is probably the closest you can get to a fix of this problem, yes.
