Date:      Fri, 20 May 2011 03:37:25 -0700
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        Frank Bonnet <f.bonnet@esiee.fr>
Cc:        freebsd-apache@freebsd.org
Subject:   Re: Where to define HTTP_ACCEPT_LANGUAGE=fr-fr ???
Message-ID:  <20110520103725.GA19494@icarus.home.lan>
In-Reply-To: <4DD63698.3030907@esiee.fr>
References:  <4DD624E4.5000408@esiee.fr> <20110520092755.GA18041@icarus.home.lan> <4DD63698.3030907@esiee.fr>

On Fri, May 20, 2011 at 11:38:32AM +0200, Frank Bonnet wrote:
> On 05/20/2011 11:27 AM, Jeremy Chadwick wrote:
> >On Fri, May 20, 2011 at 10:23:00AM +0200, Frank Bonnet wrote:
> >>How and WHERE to define this variable in apache22 configuration ???
> >>I need the web server to understand French characters in filenames
> >I haven't worked with this before, but what does "need the webserver to
> >understand French characters in filenames" mean exactly?  More details
> >are needed, particularly technical ones.  How is Apache "not working"
> >with French characters in filenames?
> 
> 
> Apache is working BUT if a filename contains a "french" character
> I get a 404 error from apache ( file not found)
> 
> here is such error message
> 
> xxx.xxx.xxx.xxx - - [20/May/2011:10:55:06 +0200] "GET /cv/ESIEE_ENGINEERING/CV_electronique/11_EE_APP_FE_CV_CISSE_Kaliss%C3%A9.docx HTTP/1.1" 404 1221
> 
> in fact the file does exist
> 
> -rw-r--r--  1 www-data  www-data    15494 20 mai 03:00
> 11_EE_APP_FE_CV_CISSE_Kaliss?.docx
>                             ^
>                    here is the problem

This looks like a character set issue of the browser vs. the filename on
the server.  Specifically: the browser is requesting to download a
filename that's in utf-8 (Unicode), while what's on the actual server is
a filename encoded in iso-8859-1.

I'm also assuming that the character shown in your Email above is
actually "é" (Latin small letter e with an acute accent).  I hope the
examples below therefore render correctly for you.

Let me explain the difference between the two:

utf-8
=======
- Filename (visually):  11_EE_APP_FE_CV_CISSE_Kalissé.docx
- Filename (literally): 11_EE_APP_FE_CV_CISSE_Kaliss<0xc3><0xa9>.docx
- Filename (as URL):    11_EE_APP_FE_CV_CISSE_Kaliss%C3%A9.docx

iso-8859-1
============
- Filename (visually):  11_EE_APP_FE_CV_CISSE_Kalissé.docx
- Filename (literally): 11_EE_APP_FE_CV_CISSE_Kaliss<0xe9>.docx
- Filename (as URL):    11_EE_APP_FE_CV_CISSE_Kaliss%E9.docx
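
To make the two listings above concrete, here is a minimal Python
sketch that encodes the same visible filename both ways and
percent-encodes the result.  The variable names are my own, and it
assumes the accented character really is "é" (U+00E9), as I guessed
above.

```python
from urllib.parse import quote

# Same visible filename; \u00e9 is "é" (assumed, see lead-in).
name = "11_EE_APP_FE_CV_CISSE_Kaliss\u00e9.docx"

# The literal bytes differ depending on the character set.
utf8_bytes = name.encode("utf-8")         # ...Kaliss<0xc3><0xa9>.docx
latin1_bytes = name.encode("iso-8859-1")  # ...Kaliss<0xe9>.docx

# Percent-encode each byte string the way it would appear in a URL.
utf8_url = quote(utf8_bytes)
latin1_url = quote(latin1_bytes)

print(utf8_url)    # 11_EE_APP_FE_CV_CISSE_Kaliss%C3%A9.docx
print(latin1_url)  # 11_EE_APP_FE_CV_CISSE_Kaliss%E9.docx
```

The two URLs name different byte sequences, so a server holding only
the iso-8859-1 filename has nothing that matches the utf-8 request.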

URLs, per RFC 1738, may not contain raw bytes above 0x7f; any such byte
(which includes every accented iso-8859-1 character) has to be
percent-encoded.  So, technically speaking, a URL like this:

http://somesite/11_EE_APP_FE_CV_CISSE_Kalissé.docx

is invalid and should not work.  Some browsers may try to "be smart" and
turn the accented small e into %E9, which would give:

http://somesite/11_EE_APP_FE_CV_CISSE_Kaliss%E9.docx

That would work just fine, since %E9 is the byte used by an iso-8859-1
filename on the server.
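
As a rough illustration of the rule above, the following Python sketch
checks a URL for raw characters above 0x7f and percent-encodes a path
under a chosen character set.  Both helper names (has_raw_non_ascii,
escape_path) are hypothetical, made up for this example.

```python
from urllib.parse import quote

def has_raw_non_ascii(url: str) -> bool:
    """Return True if the URL contains any character above 0x7f."""
    return any(ord(ch) > 0x7F for ch in url)

def escape_path(path: str, charset: str) -> str:
    """Percent-encode a path, encoding non-ASCII characters in `charset`."""
    return quote(path, safe="/", encoding=charset)

raw_url = "http://somesite/11_EE_APP_FE_CV_CISSE_Kaliss\u00e9.docx"
path = "/11_EE_APP_FE_CV_CISSE_Kaliss\u00e9.docx"

print(has_raw_non_ascii(raw_url))       # True
print(escape_path(path, "iso-8859-1"))  # /11_EE_APP_FE_CV_CISSE_Kaliss%E9.docx
print(escape_path(path, "utf-8"))       # /11_EE_APP_FE_CV_CISSE_Kaliss%C3%A9.docx
```

Which of the two escaped forms the server can resolve depends entirely
on the bytes actually stored in the filename on disk.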

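I'm not sure that HTTP_ACCEPT_LANGUAGE would fix this problem.  That
variable reflects the browser's Accept-Language request header (the
preferred language for content), not the character set used for
filenames.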

If you have a CGI, PHP script, web software, etc. which is generating
filenames and things like that, and is using utf-8 as its character set
(declared either via an HTTP header or via an HTML <meta http-equiv>
tag), then the links it produces will be utf-8 while the filenames on
disk are iso-8859-1, and they will not match.  You need to be using the
iso-8859-1 character set instead.  A good browser will be able to show
you which character set a page is being rendered in.

What's the alternative?  Simple: you start using utf-8 in your
filenames.  I should note, however, that FreeBSD (including 8.2-STABLE)
does not have very good Unicode support.  It's hit-or-miss, and using
things like LANG/LC_CTYPE results in some serious problems with
utilities that rely on locale(7).  So, I would be very careful going
this route on FreeBSD.

The short version is this: if you're going to use utf-8, you need to use
it absolutely 100% of the time.  You cannot reliably mix and match
character sets like that.
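
If you do decide to go all-utf-8, the existing filenames have to be
converted too.  Below is a minimal Python sketch of one way to do
that, not a tested migration tool.  It assumes every name that is not
already valid utf-8 is iso-8859-1; the function names and the dry_run
parameter are my own invention.  Take a backup before running
anything like it for real.

```python
import os

def to_utf8_name(raw: bytes, legacy_charset: str = "iso-8859-1") -> bytes:
    """Return `raw` re-encoded as utf-8 if it is not valid utf-8 already."""
    try:
        raw.decode("utf-8")
        return raw  # already valid utf-8 (or plain ASCII): leave it alone
    except UnicodeDecodeError:
        # Assumption: anything that is not utf-8 is in the legacy charset.
        return raw.decode(legacy_charset).encode("utf-8")

def convert_directory(directory: bytes, dry_run: bool = True) -> list:
    """List (old, new) renames for one directory; rename unless dry_run."""
    changes = []
    # Passing a bytes path makes os.listdir return raw bytes names.
    for raw in os.listdir(directory):
        new = to_utf8_name(raw)
        if new != raw:
            changes.append((raw, new))
            if not dry_run:
                os.rename(os.path.join(directory, raw),
                          os.path.join(directory, new))
    return changes

old = b"11_EE_APP_FE_CV_CISSE_Kaliss\xe9.docx"
print(to_utf8_name(old))  # b'11_EE_APP_FE_CV_CISSE_Kaliss\xc3\xa9.docx'
```

Calling convert_directory(b"/some/dir") with the default dry_run=True
only reports what would be renamed.  Note that an iso-8859-1 name which
happens to also be valid utf-8 is left untouched, so the result should
be reviewed by hand.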

Hope this helps.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |



