Problem with national characters in XHTML


Problem with national characters in XHTML

Lennart Borgman
I have run into a problem with Swedish national characters in an XHTML document. The document header looks like this:

  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
            "http://www.w3.org/TR/REC-html40/loose.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">

The Swedish character ä is displayed as \344 in CVS Emacs (2005-09-23). It looks OK in Internet Explorer, but not in Firefox. Viewing the file in Notepad also shows the Swedish characters as expected.

I would be glad for some hints and pointers! I am using nxml-mode, if that matters here.



_______________________________________________
Emacs-devel mailing list
[hidden email]
http://lists.gnu.org/mailman/listinfo/emacs-devel

Re: Problem with national characters in XHTML

Jason Rumney-4
LENNART BORGMAN wrote:

>  <?xml version="1.0" encoding="utf-8"?>
>...
>The swedish character ä looks like \344 in CVS Emacs (2005-09-23). It looks ok in Internet Explorer, but not in Firefox. Looking at the file with Notepad also shows the swedish characters as expected.
Emacs and Firefox are doing the right thing. The byte \344 by itself is
not a valid UTF-8 sequence. Replace it with ä in Emacs, and it should
appear correctly.
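
To make the byte-level point concrete, here is a minimal Python sketch (Python is not used anywhere in this thread; it is only a convenient way to poke at raw bytes). It shows that the single byte \344 decodes fine as Latin-1 but is rejected by a UTF-8 decoder:

```python
# Octal 344 is the single byte 0xE4.
raw = bytes([0o344])

# Interpreted as ISO 8859-1 (Latin-1), that byte is the character ä.
as_latin1 = raw.decode("latin-1")

# Interpreted as UTF-8, 0xE4 is a lead byte announcing a three-byte
# sequence, so on its own it cannot be decoded.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_latin1), utf8_ok)
```

A strict UTF-8 reader has to complain about such a byte in some way; showing it as an escape, as Emacs does, is one way of doing so.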




Re: Problem with national characters in XHTML

David Hansen-2
In reply to this post by Lennart Borgman
On Wed, 28 Sep 2005 10:29:21 +0200 LENNART BORGMAN wrote:

> I have run into a problem with swedish national characters in
> an XHTML document. The header of the document is like this:
>
>   <?xml version="1.0" encoding="utf-8"?>
>   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
>             "http://www.w3.org/TR/REC-html40/loose.dtd">
>   <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
>
> The swedish character ä looks like \344 in CVS Emacs
> (2005-09-23).

\344 is a Latin-1 encoded ä, not UTF-8.

David




Re: Problem with national characters in XHTML

Paul Pogonyshev
In reply to this post by Lennart Borgman
LENNART BORGMAN wrote:

> I have run into a problem with swedish national characters in an XHTML
> document. The header of the document is like this:
>
>   <?xml version="1.0" encoding="utf-8"?>
>   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
>             "http://www.w3.org/TR/REC-html40/loose.dtd">
>   <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
>
> The swedish character ä looks like \344 in CVS Emacs (2005-09-23). It looks
> ok in Internet Explorer, but not in Firefox. Looking at the file with
> Notepad also shows the swedish characters as expected.
>
> I would be glad for some hints and pointers! I am using nxml-mode if that
> matters here.

There is probably a conflict of encodings.  Note that the encoding is often
duplicated in a <meta ... /> tag:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
          PUBLIC "-//W3C//DTD XHTML 1.1//EN"
          "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html>

<head>
  ...  
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  ...

Check that you have UTF-8 there too.  Finally, check that your non-ASCII characters
are indeed encoded in UTF-8.
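
One possible way to do that last check outside the editor is sketched below in Python. The function name and the sample word are made up for illustration; for a real file you would read it in binary mode and pass the bytes in.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None

# Hypothetical sample data: the same word saved as Latin-1 and as UTF-8.
latin1_bytes = "räksmörgås".encode("latin-1")
utf8_bytes = "räksmörgås".encode("utf-8")

print(first_invalid_utf8_offset(latin1_bytes))  # offset of the Latin-1 ä
print(first_invalid_utf8_offset(utf8_bytes))    # None: the data is valid UTF-8
```

A result of None only says the bytes are well-formed UTF-8; it does not prove that they mean what the author intended.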

Paul




Re: Problem with national characters in XHTML

Tomas Zerolo
In reply to this post by Lennart Borgman
On Wed, Sep 28, 2005 at 10:29:21AM +0200, LENNART BORGMAN wrote:
> I have run into a problem with swedish national characters in an XHTML document. The header of the document is like this:
>
>   <?xml version="1.0" encoding="utf-8"?>
>   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
>             "http://www.w3.org/TR/REC-html40/loose.dtd">
>   <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">

Hm. Note that the header says of itself that it's encoded in utf-8. I
don't know whether it's relevant.

> The swedish character ä looks like \344 in CVS Emacs (2005-09-23).

If Emacs honors the header above, then this won't work: Octal 344 is an
a-with-dieresis, but in iso 8859-1 encoding, not utf-8.

> It looks ok in Internet Explorer, but not in Firefox.

I'd say Firefox is right on this one ;-)

Seriously: you can force the browser to assume an encoding, so what the
browser shows depends on settings which may vary from time to time. On
Firefox, it's under View -> Character Encoding. No idea about IE (and
I'm glad not to know ;-).

>                                                       Looking at the
> file with Notepad also shows the swedish characters as expected.

Notepad uses whatever encoding its font has; I guess an 8-bit fixed
encoding.

> I would be glad for some hints and pointers! I am using nxml-mode if
> that matters here.

You may try two things: changing the utf-8 in the header to iso-8859-1
or (better) insert your a-dieresis as an utf8-encoded char.
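
The second option amounts to re-encoding the file contents. A minimal Python sketch of that conversion follows; the function name and the sample fragment are hypothetical, and it assumes the input really is Latin-1:

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Reinterpret Latin-1 bytes as text and re-encode that text as UTF-8."""
    return data.decode("latin-1").encode("utf-8")

# Hypothetical fragment of a document that was saved as Latin-1.
original = b"<p>\xe4</p>"
converted = latin1_to_utf8(original)

print(converted)
```

Running this on data that is already UTF-8 would double-encode the non-ASCII characters, so check the input first.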

Regards
-- tomás


Re: Problem with national characters in XHTML

Juanma Barranquero
In reply to this post by Lennart Borgman
On 9/28/05, LENNART BORGMAN <[hidden email]> wrote:

> I have run into a problem with swedish national characters in an XHTML document. The header of the document is like this:
>
>   <?xml version="1.0" encoding="utf-8"?>
>   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
>             "http://www.w3.org/TR/REC-html40/loose.dtd">
>   <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
>
> The swedish character ä looks like \344 in CVS Emacs (2005-09-23).

Hmm. An XHTML document with encoding="utf-8" should not have "swedish
national characters" in it, should it? Upon reading the file, Emacs
will set its coding system to mule-utf-8, so it's no surprise that
high-bit, non-valid utf8 byte sequences appear as \xxx...

I've created a document with your header, and put an "É" in it with
notepad. Emacs shows this char as \311. I would not consider this an
error :)
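
The \xxx numbers are simply the octal values of the Latin-1 bytes. A small Python sketch (the helper name is made up for illustration) reproduces the two escapes mentioned in this thread:

```python
def octal_escape(ch: str) -> str:
    """Return the backslash-octal form of a character's Latin-1 byte."""
    byte = ch.encode("latin-1")[0]
    return "\\" + format(byte, "o")

print(octal_escape("É"), octal_escape("ä"))
```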

--
                    /L/e/k/t/u



Re: Problem with national characters in XHTML

Lennart Borgman
In reply to this post by Lennart Borgman
Ok, thanks for help to all that replied. I tried to learn a bit;-)

Putting iso-8859-1 in the header instead of utf-8 as Tomas Zerolo suggested solved the problem.






Re: Problem with national characters in XHTML

Kenichi Handa
In reply to this post by Lennart Borgman
In article <[hidden email]>, LENNART BORGMAN <[hidden email]> writes:

> I have run into a problem with swedish national characters in an XHTML document. The header of the document is like this:
>   <?xml version="1.0" encoding="utf-8"?>
>   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
>             "http://www.w3.org/TR/REC-html40/loose.dtd">
>   <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">

> The swedish character ä looks like \344 in CVS Emacs (2005-09-23). It looks ok in Internet Explorer, but not in Firefox. Looking at the file with Notepad also shows the swedish characters as expected.

> I would be glad for some hints and pointers! I am using nxml-mode if that matters here.

Could you please send me the whole file?

---
Kenichi Handa
[hidden email]



Re: Problem with national characters in XHTML

Lennart Borgman
Kenichi Handa wrote:

>In article <[hidden email]>, LENNART BORGMAN <[hidden email]> writes:
>
>  
>
>>I have run into a problem with swedish national characters in an XHTML document. The header of the document is like this:
>>  <?xml version="1.0" encoding="utf-8"?>
>>  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
>>            "http://www.w3.org/TR/REC-html40/loose.dtd">
>>  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
>>    
>>
>
>  
>
>>The swedish character ä looks like \344 in CVS Emacs (2005-09-23). It looks ok in Internet Explorer, but not in Firefox. Looking at the file with Notepad also shows the swedish characters as expected.
>>    
>>
>
>  
>
>>I would be glad for some hints and pointers! I am using nxml-mode if that matters here.
>>    
>>
>
>Could you please send me the whole file?
>  
>
I have attached two test files in XHTML, one using utf-8 in the header and
the other iso-8859-1. Those files tell what is displayed in IE and
Firefox and how the Swedish character ä was entered (though I guess some
info might be missing for the experts here).

I still find this a bit confusing. What character is entered by Emacs
when I type ä on my Swedish keyboard? When I look at the character ä in
Emacs with (following-char), it returns 2276 in both test files. Is that
what I would expect in the iso-8859-1 test file? (It starts with <?xml
version="1.0" encoding="iso-8859-1"?>)


Testing National Characters in Emacs, IE and Firefox

Testing National Characters in Emacs, IE 6.0 SP 1 and Firefox 1.0.7

Using GNU Emacs 22.0.50.1 (i386-mingw-nt5.0.2195) of 2005-09-28

The header in this file contains <?xml version="1.0" encoding="iso-8859-1"?>

Character and context                                               Internet Explorer   Firefox
This is the swedish character ä entered in a new iso-8859-1 file.   Correct             Correct
This is swedish ä entered in a new utf-8 file.

Compare this with using UTF-8


Testing National Characters in Emacs, IE and Firefox

Using GNU Emacs 22.0.50.1 (i386-mingw-nt5.0.2195) of 2005-09-28

The header in this file contains <?xml version="1.0" encoding="utf-8"?>

Character and context                                               Internet Explorer   Firefox
This is swedish ä entered in a new utf-8 file.                      Wrong               Correct
This is swedish ä entered after opening the file again.             Wrong               Correct
This is the swedish character ä entered in a new iso-8859-1 file.

Compare this with using ISO-8859-1

Testing Emacs display

If <?xml version="1.0" encoding="utf-8"?> is changed to use 8859-1, Emacs still displays the entered characters as if they were correct.

Conclusion

Emacs and Firefox seem to handle this correctly. However, due to bugs in Internet Explorer, currently only ISO-8859-1 works in both browsers.



Re: Problem with national characters in XHTML

Lennart Borgman
Lennart Borgman wrote:

> Kenichi Handa wrote:
>
>> Could you please send me the whole file?
>>  
>>
> I have attached to test files in XHTML, one user utf-8 in the header
> and the other iso-8859-1. Those files tells what is displayed in IE
> and Firefox and how the swedish character ä was entered (though I
> guess some info might be missing for the experts here).
>
> I find this a bit confusing still. What character is entered by Emacs
> when I type ä on my swedish keyboard? When I look at the character ä
> in Emacs with (following-char) it in both test files returns 2276. Is
> that what I would expect in the iso-8859-1 test file? (It starts with
> <?xml version="1.0" encoding="iso-8859-1"?>)

I have placed the files I attached last time at
http://ourcomments.org/Emacs/char/ and added some more comments. I have
tried different ways to add the Swedish character ä and all of them
seem to result in a character with value 2276 being added. Even C-q 3
4 4 RET results in this, which surprises me. Should it be that way?



Re: Problem with national characters in XHTML

Tomas Zerolo
On Wed, Sep 28, 2005 at 09:12:45PM +0200, Lennart Borgman wrote:
> Lennart Borgman wrote:

[...]

> >I find this a bit confusing still. What character is entered by Emacs
> >when I type ä on my swedish keyboard? When I look at the character ä
> >in Emacs with (following-char) it in both test files returns 2276. Is
> >that what I would expect in the iso-8859-1 test file? (It starts with
> ><?xml version="1.0" encoding="iso-8859-1"?>)

Ah. You have to distinguish between Emacs's internal representation
(that's possibly the 2276 you mention), which doesn't change (at least
unless you try hard ;) and what is in the file (how Emacs writes or
interprets what it reads). You can change the latter by changing the
coding system (look for something like `multilingual environment').

You can see what coding system is active by doing
`M-x describe-coding-system'.

HTH
-- tomás


Re: Problem with national characters in XHTML

Mathias Dahl-4
In reply to this post by Juanma Barranquero
Juanma Barranquero <[hidden email]> writes:

> On 9/28/05, LENNART BORGMAN <[hidden email]> wrote:
>
>> I have run into a problem with swedish national characters in an
>> XHTML document. The header of the document is like this:
>>
>>   <?xml version="1.0" encoding="utf-8"?>
>>   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
>>             "http://www.w3.org/TR/REC-html40/loose.dtd">
>>   <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
>>
>> The swedish character ä looks like \344 in CVS Emacs (2005-09-23).
>
> Hmm. An XHTML document with encoding="utf-8" should not have
> "swedish national characters" in it, should it? Upon reading the
> file, Emacs will set its coding system to mule-utf-8, so it's no
> surprise than high-bit, non-valid utf8 byte sequences appear as
> \xxx...

I might be wrong here, but doesn't UTF-8 encode all characters in
Latin-1 (ISO 8859-1) exactly as they are *in* Latin-1 encoding?




Re: Problem with national characters in XHTML

Piet van Oostrum
>>>>> Mathias Dahl <[hidden email]> (MD) wrote:

>MD> I might be wrong here, but doesn't UTF-8 encode all characters in
>MD> Latin-1 (ISO 8859-1) exactly as they are *in* Latin-1 encoding?

No. Iso 8859-1 uses 1 byte for all characters, while UTF-8 uses two bytes
for the iso-8859-1 characters above 127 (the ASCII range still takes one
byte). What you probably mean is that the Unicode value (code point) for
each iso-8859-1 character is the same as its encoding in iso-8859-1.
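
The distinction between code point and encoded bytes can be checked with a few lines of Python (used here purely as an illustration, not as anything Emacs does):

```python
ch = "ä"

code_point = ord(ch)             # the Unicode code point, an integer
latin1 = ch.encode("latin-1")    # one byte in ISO 8859-1
utf8 = ch.encode("utf-8")        # two bytes in UTF-8
ascii_utf8 = "a".encode("utf-8") # an ASCII character stays one byte

print(hex(code_point), latin1.hex(), utf8.hex(), ascii_utf8.hex())
```

The code point 0xE4 and the Latin-1 byte E4 have the same numeric value, while the UTF-8 form of the same character is the pair C3 A4.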

--
Piet van Oostrum <[hidden email]>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: [hidden email]



Re: Problem with national characters in XHTML

Piet van Oostrum
In reply to this post by Tomas Zerolo
>>>>> [hidden email] (Tomas Zerolo) (TZ) wrote:

>TZ> Ah. You have to distinguish between Emacs's internal representation
>TZ> (that's possibly the 2276 you mention), which doesn't change (al least
>TZ> unless you try hard ;) and what is in the file (how Emacs writes or
>TZ> interprets what it reads). You can change those things changing the
>TZ> coding system (look for something like `multilingual environment').

By default Emacs uses different internal representations for the "same"
character in different coding systems. So a iso-8859-1 "ä" is a different
thing than a utf-8 "ä". This difference will disappear when Emacs switches
to Unicode internally. For the time being the OP could use Unicode
unification, if his Emacs version is young enough. I have used this for
some years now without any problems. Maybe it solves the original problem.

(require 'ucs-tables)
(unify-8859-on-encoding-mode 1)
(unify-8859-on-decoding-mode 1)

--
Piet van Oostrum <[hidden email]>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: [hidden email]



Re: Problem with national characters in XHTML

Lennart Borgman
In reply to this post by Piet van Oostrum
Piet van Oostrum wrote:

>>>>>>Mathias Dahl <[hidden email]> (MD) wrote:
>>>>>>            
>>>>>>
>
>  
>
>>MD> I might be wrong here, but doesn't UTF-8 encode all characters in
>>MD> Latin-1 (ISO 8859-1) exactly as they are *in* Latin-1 encoding?
>>    
>>
>
>No. Iso 8859-1 uses 1 byte for all characters, while UTF-8 uses two bytes
>for those characters that are in iso-8859-1. What you probably mean is that
>the Unicode value (code point) for each iso-8859-1 character is the same as
>its encoding in iso-8859-1.
>  
>
This is not easy. What you say makes it even more interesting why C-q 3 4
4 RET is stored as 2276 (or whatever it was) in the XHTML files. How can
that be? (For the context see my earlier mails.)



Re: Problem with national characters in XHTML

Lennart Borgman
In reply to this post by Piet van Oostrum
Piet van Oostrum wrote:

>>>>>>[hidden email] (Tomas Zerolo) (TZ) wrote:
>>>>>>            
>>>>>>
>
>  
>
>>TZ> Ah. You have to distinguish between Emacs's internal representation
>>TZ> (that's possibly the 2276 you mention), which doesn't change (al least
>>TZ> unless you try hard ;) and what is in the file (how Emacs writes or
>>TZ> interprets what it reads). You can change those things changing the
>>TZ> coding system (look for something like `multilingual environment').
>>    
>>
>
>By default Emacs uses different internal representations for the "same"
>character in different coding systems. So a iso-8859-1 "ä" is a different
>thing than a utf-8 "ä". This difference will disappear when Emacs switches
>to Unicode internally. For the time being the OP could use Unicode
>unification, if his Emacs version is young enough. I have used this for
>some years now without any problems. Maybe it solves the original problem.
>
>(require 'ucs-tables)
>(unify-8859-on-encoding-mode 1)
>(unify-8859-on-decoding-mode 1)
>  
>
The values I have in CVS emacs.exe -Q are

  (featurep 'ucs-tables)  = t
  unify-8859-on-encoding-mode = t
  unify-8859-on-decoding-mode = nil

Though I do not understand what it means right now ;-)

Evaling (unify-8859-on-decoding-mode 1) does not change the behaviour of
C-q 3 4 4 RET. It still enters a character that (following-char) reports as

  2276 (04344, 0x8e4)

I did not notice before that there only seems to be one bit that differs
(see the second figure) - if that matters in some way.
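
The arithmetic behind that observation can be verified with a short Python sketch. It only checks the numbers quoted above and makes no claim about why Emacs uses this particular internal value:

```python
internal = 2276        # value reported by (following-char) in this thread
latin1_byte = 0o344    # the byte \344, i.e. 0xE4

print(oct(internal), hex(internal))

# XOR leaves exactly the bits in which the two values differ.
difference = internal ^ latin1_byte
differing_bits = bin(difference).count("1")

print(hex(difference), differing_bits)
```

So 2276 is the Latin-1 value 0xE4 with the single extra bit 0x800 set.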




Re: Problem with national characters in XHTML

Piet van Oostrum
>>>>> Lennart Borgman <[hidden email]> (LB) wrote:

>LB> Evaling (unify-8859-on-decoding-mode 1) does not change the behaviour of
>LB> C-q 3 4 4 RET. It still enters a character that (following-char) reports as

>LB>   2276 (04344, 0x8e4)

That is just the internal representation of the character in Emacs. It's
not important. What matters is what Emacs writes to your file. When you
write out utf-8 (for example by giving the command
(set-buffer-file-coding-system 'utf-8)) it will write out C3 A4,
whereas if you use (set-buffer-file-coding-system 'latin-1) it will write
out E4.
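
Where the C3 A4 comes from can be shown by doing the UTF-8 bit packing by hand. The following Python sketch is only an illustration of the two-byte case; the function name is made up, and it is not how Emacs implements its coding systems:

```python
def encode_two_byte_utf8(code_point: int) -> bytes:
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only code points U+0080..U+07FF take two bytes")
    lead = 0b11000000 | (code_point >> 6)           # 110xxxxx: top five bits
    trail = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx: low six bits
    return bytes([lead, trail])

encoded = encode_two_byte_utf8(0xE4)  # 0xE4 is the code point of ä
print(encoded.hex())
```

The Latin-1 form needs no packing at all: the code point 0xE4 is written out as the single byte E4.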
--
Piet van Oostrum <[hidden email]>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: [hidden email]



Re: Problem with national characters in XHTML

Lennart Borgman
Piet van Oostrum wrote:

>>LB> Evaling (unify-8859-on-decoding-mode 1) does not change the behaviour of
>>LB> C-q 3 4 4 RET. It still enters a character that (following-char) reports as
>>    
>>
>
>  
>
>>LB>   2276 (04344, 0x8e4)
>>    
>>
>
>That is just the internal representation of the character in Emacs. It's
>not important. What matters is what Emacs writes to your file. When you
>write out utf-8 (for example by giving the command
>(set-buffer-file-coding-system 'utf-8) it will write out C3 A4,
>whereas if you use (set-buffer-file-coding-system 'latin-1) it will write
>out E4.
>  
>
So you mean that at a - what should I call it? - "text semantic level"
the utf-8 char and the latin-1 char have the same meaning?



Re: Problem with national characters in XHTML

Tomas Zerolo
On Sat, Oct 01, 2005 at 01:02:31AM +0200, Lennart Borgman wrote:
> Piet van Oostrum wrote:
[...]
> >That is just the internal representation of the character in Emacs. It's
> >not important. What matters is what Emacs writes to your file. When you
> >write out utf-8 (for example by giving the command
[...]
> So you mean that at a - what should I call it? - "text semantic level"
> the utf-8 char and the latin-1 char has the same meaning?

Yes. You put that nicely. The *character* (a dieresis) stays the same.
The *representation* (loosely referred to as `encoding') changes.

I said loosely, because with more complex things such as utf-8 there are
actually two layers: the `character set', mapping each character to an
integer (aka `code point'; in this case the character set would be UNICODE
or ISO-10646, which nowadays are equivalent), and the representation in a
file, which may be utf-8 (most common), utf-16 or whatnot.

Now the advantage of utf-8: it is a variable-width encoding, and uses up
just one byte for one ASCII character (on ASCII it uses the same code
points). So you can interpret an ASCII file ``as-is'' as an utf-8 file.

For higher characters (the ones, for example, with codes >127 in
iso-8859-1 (aka Latin1)), you need more than one byte in utf-8: up to
4 bytes under RFC 3629 (the original design allowed up to 6).

The disadvantage is: it is a variable-width encoding, so you have to
process a text sequentially, byte-for-byte to get the character
boundaries right (it's designed to re-synchronize gracefully, though).
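
The re-synchronization property comes from the bit pattern of continuation bytes. A minimal Python sketch, assuming well-formed UTF-8 input (the function name is made up for illustration):

```python
def char_start_offsets(data: bytes):
    """Return the offsets of the bytes that start a character in UTF-8 data."""
    # Continuation bytes have the bit pattern 10xxxxxx; every other byte
    # starts a new character.
    return [i for i, b in enumerate(data) if (b & 0b11000000) != 0b10000000]

sample = "aäb".encode("utf-8")  # the bytes 61 C3 A4 62
print(char_start_offsets(sample))
```

Starting from an arbitrary byte offset, a reader only has to skip the bytes that match 10xxxxxx to land on the next character boundary.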

If you want the whole story (on UNICODE, ISO10646, UTF8), see here:

  <http://www.cl.cam.ac.uk/~mgk25/unicode.html>

(very recommended). From the perspective of a web slave, see:

  <http://www.w3.org/TR/REC-html40/charset.html>

HTH
-- tomas


Re: Problem with national characters in XHTML

Piet van Oostrum
In reply to this post by Lennart Borgman
>>>>> Lennart Borgman <[hidden email]> (LB) wrote:

>LB> So you mean that at a - what should I call it? - "text semantic level" the
>LB> utf-8 char and the latin-1 char has the same meaning?

With Unicode unification yes. Without unification (input unification, I
think), an iso-8859-1 "ä" and an iso-8859-9 "ä" and others all use
different codes, which can be quite annoying.
--
Piet van Oostrum <[hidden email]>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: [hidden email]

