nginx + apache + mod_wsgi + python: how to make dynamic pages expire

When writing dynamic web applications, we use nginx as a front-end web server and apache+mod_wsgi as an application server.

It is the job of nginx to:

  1. Handle SSL, and domain-level rewriting/redirects
  2. Handle static content (.jpeg, .png, .css, .js, .txt, .ico, .pdf, etc….)
  3. Handle dynamic downloads through X-Accel-Redirect (see the sketch after this list)
  4. Proxy other requests to apache
  5. Set the proper cache-control and expires headers on content
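
Regarding item 3: the application replies with an X-Accel-Redirect header naming an internal nginx location, and nginx then streams the file itself.  A minimal WSGI sketch (the paths here are illustrative, not from our configuration):

def application(environ, start_response):
   # Authorization checks would go here; then hand the file off to nginx.
   start_response('200 OK', [
      ('Content-Type', 'application/pdf'),
      ('X-Accel-Redirect', '/protected/report.pdf'),   # an `internal` location in nginx
   ])
   return [b'']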

Ever run into the situation where you click log out, then click the back button, and can still see the pages?  That is bad.  They are dynamic pages and should not be cached.

However, images, etc… SHOULD be cached. It is important that any references to images have a way to invalidate the cache. We append a number as a query string:

/path/to/script.js?192012129

This number is updated from time to time (via Python variable) when we need to invalidate the cache.
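
The Python variable itself is just a constant plus a tiny helper; something like this (the names here are ours, not from the post):

# Bump this value to invalidate every cached static reference at once.
ASSET_VERSION = 192012129

def AssetURL(Path):
   # AssetURL('/path/to/script.js') -> '/path/to/script.js?192012129'
   return '%s?%s' % (Path, ASSET_VERSION)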

Anyway, here are some helpful nginx configuration directives.

# Send static requests directly back to the client
location ~ \.(gif|jpg|png|ico|xml|html|css|js|txt|pdf)$
{
    root  /path/to/document/root;
    expires max;
}

# Send the rest to apache
location /
{
    add_header Cache-Control 'no-cache, no-store, max-age=0, must-revalidate';
    add_header Expires 'Thu, 01 Jan 1970 00:00:01 GMT';
    proxy_pass http://127.0.0.1:8123;
}
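
A quick way to confirm the headers are actually being sent (standard library only; substitute your own URL):

from urllib.request import urlopen

Response = urlopen('http://example.com/')
print(Response.headers.get('Cache-Control'))   # expect: no-cache, no-store, max-age=0, must-revalidate
print(Response.headers.get('Expires'))         # expect: Thu, 01 Jan 1970 00:00:01 GMT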

A brief introduction to AppStruct

We have been very busy at work lately.  About a month ago we made the decision to switch (most|all) new projects over to Python 3 with Apache, mod_wsgi, and AppStruct.  You may know what the first 3 are, but the 4th??

Special thanks go to Graham Dumpleton, the man behind mod_wsgi, and James William Pye, the man behind Python>>Postgresql.  They are not involved or affiliated with AppCove or AppStruct (aside from great mailing list support), BUT if it were not for them, this framework would not exist.

AppStruct is a component of Shank, a meta-framework.  A stand-alone component in its own right, it represents the AppCove approach to web-application development.  Most of it is in planning, but the parts that have materialized are really, really cool.

Briefly, I’ll cover the two emerging areas of interest:

AppStruct.WSGI

This is a very pythonic (in my opinion) web application framework targeted toward Python 3.1 (a challenge in itself at this point).  We really wanted to base new development on Python 3.1+, as well as PostgreSQL using the excellent Python3/PostgreSQL library at http://python.projects.postgresql.org/.  However, none of the popular frameworks that I am aware of support Python 3, and most (if not all) of them have a lot of baggage I do not want to bring to the party.

Werkzeug was the most promising, but alas, I could not find Python 3 support for it either.  In fact, I intend to utilize a good bit of code from Werkzeug in finishing off AppStruct.WSGI.  (Don't you just love good OSS licenses?  AppStruct will be released under one of those also.)

HTTP is just not that complicated.  It doesn't need to be bundled up into n layers of indecipherable framework upon framework.  It doesn't need to be abstracted to death.  It just needs to be streamlined a bit (with regard to request routing, reading headers, etc…).

Python is an amazing OO language.  Its object model (or data model) is one of (if not the) most well conceived of any similar language, ever.  We want to use that to our advantage…

Inheritance, including multiple inheritance, has very simple rules in Python.  We want to use that as well.

We wish to provide developers with access to the low-level guts they need for 2% of requests, without making them do extra work for the other 98%.

Speed is of the essence.  Servers are not cheap, and if you can increase your throughput by 5x, then that's a lot fewer servers you need to pay for.

So, how does it work?

Well, those details can wait for another post.  But at this point the library is < 1000 lines of code, and does a lot of interesting things.

  • A fully compliant WSGI application object
  • 1 to 1 request routing to Python packages/modules/classes
  • All request classes derived from AppStruct.WSGI.Request

The application maps requests like /some/path/here to objects, like Project.WSGI.some.path.here.  If there is a trailing slash, then the application assumes that the class is named Index.  The object that is found is verified to be a subclass of AppStruct.WSGI.Request, and then…

Wait!  What about security?

Yes, yes, very important.  A couple of things to point out.  First, the URLs are passed through a regular expression that ensures they contain only characters valid in Python identifiers (not starting with _), delimited by "/".  Second, the import and attribute lookup verify that any object (eg class) found is a subclass of the right thing.  And so on and so forth…
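
The routing code itself isn't shown in this post, but the description above suggests something along these lines (the regex and the Project.WSGI names are illustrative):

import re
import importlib
from AppStruct.WSGI import Request

# One Python identifier per path segment, never starting with an underscore.
ValidPath = re.compile(r'^(/[A-Za-z][A-Za-z0-9_]*)*/?$')

def Resolve(Path):
   # /some/path/here -> class `here` in module Project.WSGI.some.path
   # /some/path/     -> class `Index` in module Project.WSGI.some.path
   if not ValidPath.match(Path):
      raise LookupError('invalid path')
   Parts = [P for P in Path.split('/') if P]
   if Path.endswith('/'):
      Parts.append('Index')
   Module = importlib.import_module('.'.join(['Project', 'WSGI'] + Parts[:-1]))
   Class = getattr(Module, Parts[-1])
   if not (isinstance(Class, type) and issubclass(Class, Request)):
      raise LookupError('not a Request subclass')
   return Class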

But you may say "wait, what about my/fancy-shmancy/urls/that-i-am-used-to-seeing?"  Ever hear of mod_rewrite?  Yep.  Not trying to re-invent the wheel.  Use apache for what it was made for, not just a dumb request handler.

What about these request objects?

They are quite straightforward.  There are several attributes which represent request data, the environment, query string variables, post variables, and more.  There is a .Response attribute which maps to a very lightweight response object.

Speaking of it, the response object has only 4 attributes: [Status, Header, Iterator, Length].  As you see, it's pretty low-level WSGI stuff.  But the developer would rarely interact with it, other than to call a method like .Response.Redirect('http://somewhere-else.com', 302)
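
The Response class itself isn't shown in this post, but those four attributes imply a shape roughly like this (the Redirect details are assumptions):

class Response:
   def __init__(self):
      self.Status = '200 OK'                           # WSGI status line
      self.Header = [('Content-Type', 'text/html')]    # list of (name, value) pairs
      self.Iterator = []                               # WSGI body iterable
      self.Length = None                               # becomes Content-Length

   def Redirect(self, URL, Code=302):
      # A redirect is just a status line plus a Location header.
      self.Status = {301: '301 Moved Permanently', 302: '302 Found'}[Code]
      self.Header.append(('Location', URL))
      self.Iterator = [b'']
      self.Length = 0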

Once the application finds the class responsible for handling the URL, it simply does this (sans exception catching code):

RequestObject = Request(...)   # the class resolved from the URL
with RequestObject:            # __enter__/__exit__ presumably handle per-request setup/teardown
   RequestObject.Code()        # page logic
   RequestObject.Data()        # builds the response body
return RequestObject.Response

Wow, that’s simple.  Let me point out one more detail.  The default implementation of Data() does this:

def Data(self):
   self.Response.Iterator = [self.Text()]
   self.Response.Length = len(self.Response.Iterator[0])

So the only thing really required of this Request class is to override Text() and return some text?  Yep, that simple.

But typically, you would at some point mixin a template class that would do something like this:

class GoodLookingLayout:
   def Text(self):
      return ( 
         """<html><head>...</head><body><menu>""" +
         self.Menu() +
         """</menu><div>""" +
         self.Body() +
         """</div></body></html>"""
         )

And then it would be up to the developer to override Menu and Body (each returning the appropriate content for the page).

Ohh, you may say.  What about templating engine X?  Well, it didn't support Python 3, and I probably didn't want it anyway (for 9 reasons)…  If it's really, really good and fits this structure, drop me a line, please.

What about that Code() method?

Yeah, that's the place where any "logic" of UI interaction should go.  I'm not advocating mixing logic and content here, but you could do that if you wanted.  In our code you will find a separate package for application business logic and data access, which the request classes call upon.  But again, if you are writing a one page wonder, why go to all the trouble?

The only requirement for the Code() method is that it calls super().Code() at the top.  Since the idea is that the class .Foo.Bar.Baz.Index will inherit from the class .Foo.Bar.Index, this gives you a very flexible point of creating .htaccess style initialization/access-control code in one place.  So in /Admin/Index, you could put a bit of code in Code() which ensures the user is logged in.  This code will be run by all sub-pages, therefore ensuring that access control is maintained.  Relative imports are important for this task.

from .. import Index as PARENT       # the parent page class, via relative import
from AppStruct.WSGI.Util import *    # helpers; HS() is presumably an HTML-escaper
class Index(PARENT):
   def Code(self):
      super().Code()                 # run the parent's init/access-control first
      self.Name = self.Post.FirstName + " " + self.Post.LastName
      # some other init stuff
   def Body(self):
      return "Hello there " + HS(self.Name) + "!"

Summary of AppStruct.WSGI…

To get up and running with a web-app using this library:

  1. A couple mod_wsgi lines in httpd.conf
  2. A .wsgi file that has 1 import and 1 line of code
  3. A request class in a package that matches an expected URI (myproject.wsgi.admin.foo == /admin/foo)
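
Item 2's .wsgi file isn't shown here, but given the description it is presumably something like this (the Application name is a guess):

# Admin.wsgi: one import and one line of code (names are illustrative)
import AppStruct.WSGI
application = AppStruct.WSGI.Application('MyProject.WSGI')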

Speed?

With no database calls, it’s pushing over 2,000 requests per second on a dev server.  With a couple PostgreSQL calls, it is pushing out ~ 800 per second.

AppStruct.Database.PostgreSQL

This is not nearly as deep as the WSGI side of AppStruct, but still really cool.  To start off, I'd like to say that James William Pye has created an amazing PostgreSQL connection library in Python 3.  I mean just amazing.  In fact, almost so amazing that I didn't want to change it (but then again…)

What we did here was subclass the Connection, PreparedStatement, and Row classes (well, actually we replaced the Row class).

Once these were subclassed, we simply added a couple useful features.  Keep in mind that all of the great functionality of the underlying library is retained.

Connection.CachePrepare(SQL)
Connection._PS_Cache

Simply a dictionary of {SQL: PreparedStatementObject} that resides on the connection object.  When you directly or indirectly invoke CachePrepare, it says "if this SQL is in the cache, return the associated object.  Otherwise, prepare it, cache it, and return it."  This approach really simplifies the storing of prepared statement objects in a multi-threaded environment, where there is otherwise no good place to keep them (and be thread safe, connection-failure safe, etc…)
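
A minimal sketch of that mechanism (py-postgresql prepared statements are callable; the wrapper class here is illustrative, since AppStruct actually subclasses the library's Connection):

class CachingConnection:
   def __init__(self, Backend):
      self.Backend = Backend    # a py-postgresql connection object
      self._PS_Cache = {}       # {SQL: PreparedStatementObject}

   def CachePrepare(self, SQL):
      # Prepare each distinct statement once per connection, then reuse it.
      # With one connection per thread (as the post implies), no locking is needed.
      if SQL not in self._PS_Cache:
         self._PS_Cache[SQL] = self.Backend.prepare(SQL)
      return self._PS_Cache[SQL]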

Connection.Value(SQL, *args, **kwargs)
Connection.Row(SQL, *args, **kwargs)

These simple functions take (SQL, *args, **kwargs) and return either a single value or a single row.  They will raise an exception if != 1 row is found.  They make use of CachePrepare(), so you get the performance benefits of prepared statements without the hassle.  More on *args, **kwargs later under PrePrepare()
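
Continuing the sketch above, a method like Value might look roughly like this (assumed semantics; PrePrepare is sketched further below):

   def Value(self, SQL, *args, **kwargs):
      # Exactly one row, one column; anything else is an error.
      SQL, Params = PrePrepare(SQL, args, kwargs)
      Rows = list(self.CachePrepare(SQL)(*Params))
      if len(Rows) != 1:
         raise ValueError('expected exactly 1 row, got %d' % len(Rows))
      return Rows[0][0]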

Connection.ValueList(SQL, *args, **kwargs)
Connection.RowList(SQL, *args, **kwargs)

Same as above, except they return an iterator (or list, not sure) of zero or more values or rows.

Row class

Simply a dictionary that also supports attribute-style access of items.  After evaluating the library's Tuple-that-behaves-like-a-mapping, I decided that, for our needs, a simple dict would be a better representation of a row.
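
That behavior fits in a few lines; a sketch:

class Row(dict):
   # A dict whose items can also be read and written as attributes.
   def __getattr__(self, Name):
      try:
         return self[Name]
      except KeyError:
         raise AttributeError(Name)

   def __setattr__(self, Name, Value):
      self[Name] = Value

# Row(id=1, name='joe').name  ->  'joe'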

Connection.PrePrepare(SQL, args, kwargs) -> (SQL, args)

Ok, so what's the big deal?  Well, the big deal is that we don't like positional parameters.  They are confusing to write, read, and analyze; it is hard to see what the heck is going on once you get about 30 fields in an INSERT statement.  Feel free to argue, but maybe I'm not as young as I once was.

We do like keyword parameters.  But the PostgreSQL prepared statement API uses numeric positional parameters.  We're not changing that…

The most simple use case is to pass SQL and keyword arguments.

"SELECT a,b,c FROM table WHERE id = $id AND name = $name"
dict(id=100, name='joe')

It returns

"SELECT a,b,c FROM table WHERE id = $1 AND name = $2"
(100, "joe")

Which is suitable for passing directly into the Connection.prepare() method that came with the underlying library.

I don’t know about you, but we find this to be very, very useful.

If you pass tuples of (field, value) as positional arguments, then they will replace [Field], [Value], and [Field=Value] (in the SQL) with lists of fields, lists of values (eg $1, $2), or lists of field=values (eg name=$1, age=$2).  That really takes the verbosity out of INSERT and UPDATE statements with long field lists.
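
Here is a rough sketch of both behaviors, assuming the semantics described above (illustrative, not the actual AppStruct code):

import re

def PrePrepare(SQL, args=(), kwargs=None):
   kwargs = kwargs or {}
   Params = []

   # (field, value) tuples passed positionally fill the [Field], [Value],
   # and [Field=Value] placeholders.
   if args:
      Fields = [F for F, V in args]
      Positions = []
      for F, V in args:
         Params.append(V)
         Positions.append('$%d' % len(Params))
      SQL = SQL.replace('[Field=Value]', ', '.join('%s=%s' % FP for FP in zip(Fields, Positions)))
      SQL = SQL.replace('[Field]', ', '.join(Fields))
      SQL = SQL.replace('[Value]', ', '.join(Positions))

   # $name placeholders are numbered in order of appearance, and the
   # matching keyword values are appended to the parameter list.
   def Repl(Match):
      Params.append(kwargs[Match.group(1)])
      return '$%d' % len(Params)
   SQL = re.sub(r'\$([A-Za-z_][A-Za-z0-9_]*)', Repl, SQL)

   return SQL, tuple(Params)

# The example above:
#   PrePrepare("SELECT a,b,c FROM table WHERE id = $id AND name = $name",
#              kwargs=dict(id=100, name='joe'))
#   -> ("SELECT a,b,c FROM table WHERE id = $1 AND name = $2", (100, 'joe'))
#
# And an INSERT built from (field, value) tuples:
#   PrePrepare("INSERT INTO person ([Field]) VALUES ([Value])",
#              args=(('name', 'joe'), ('age', 30)))
#   -> ("INSERT INTO person (name, age) VALUES ($1, $2)", ('joe', 30))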

Conclusion

This is just in the early stages, and has a good deal of polishing to be done (especially on the WSGI side).  My purpose here was to introduce you to what you can expect to get with AppStruct, some of the rationale behind it, and to show that it's really not that hard to take the bull by the horns and make software do what you want it to (especially if you use Python).

Feel free to comment.

Python 3.1 and mod_wsgi performance notes

We’re researching the use of Python and mod_wsgi running under apache for developing some extensive web applications.  Here are some notes on a performance test that we recently ran.
==================================================================
Server:

x86_64
Python 3.1.1
mod_wsgi 3.0c5
apache 2.2
RHEL 5.3
quad-core Xeon
8 GB ram

Development system – not in production use.

==================================================================
Application:

import time

def application(environ, start_response):
    status = '200 OK'

    output = b'hello world!'    # WSGI under Python 3 requires a bytes body

    #time.sleep(1)

    response_headers = [
        ('Content-type', 'text/plain'),
        ('Content-Length', str(len(output))),
        ]

    start_response(status, response_headers)

    return [output]

==================================================================
Apache Configuration:

WSGISocketPrefix run/wsgi
<VirtualHost *>
ServerName shankproject.jason.star.ionzoft.net
DocumentRoot /home/jason/Code/ShankProject/Web
WSGIScriptAlias /Admin /home/jason/Code/ShankProject/WSGI/Admin.wsgi
WSGIDaemonProcess shankproject.jason.star.ionzoft.net threads=15
WSGIProcessGroup shankproject.jason.star.ionzoft.net
</VirtualHost>

==================================================================
Tests:

—————————————————–
# Baseline with one process and 15 threads
# 15 threads total

threads=15
no process definition

WITHOUT time.sleep(1)
concurrency = 1  >> 1800 / second
concurrency = 100 >> 3900 / second

WITH time.sleep(1)
concurrency = 1  >> 1 / second
concurrency = 100  >> 14 / second

—————————————————–
# Get a marginal improvement by doubling the threads to 30
# 30 threads total

threads=30
no process definition

WITHOUT time.sleep(1)
concurrency = 1  >> 1680 / second
concurrency = 100 >> 3500 / second

WITH time.sleep(1)
concurrency = 1  >> 1 / second
concurrency = 100  >> 30 / second

—————————————————–
# Take processes from 1 to 3
# 90 threads total

threads=30
processes=3

WITHOUT time.sleep(1)
concurrency = 1  >> 1770 / second
concurrency = 100 >> 3500 / second

WITH time.sleep(1)
concurrency = 1  >> 1 / second
concurrency = 100  >> 88 / second

—————————————————–
# Take processes from 3 to 6
# Take threads from 30 to 15
# 90 threads total

threads=15
processes=6

WITHOUT time.sleep(1)
concurrency = 1  >> 1550 / second
concurrency = 100 >> 3300 / second

WITH time.sleep(1)
concurrency = 1  >> 1 / second
concurrency = 100  >> 88 / second

==================================================================
Conclusion:

mod_wsgi performance is outstanding.  Even running slower requests, it can still handle significant concurrency in daemon mode without any apparent issues.  Note how the time.sleep(1) numbers track the worker count: with 15, 30, and 90 total threads, throughput tops out at roughly 14, 30, and 88 requests per second, which is about one request per thread per second.

Questions:
Is there any information on the balance between more processes with fewer threads, versus fewer processes with more threads?

Thanks!

Interesting Thoughts on Cloud Server Performance

Apache load testing on a Cloud Server – Jason – 7/31/2009

I recently created a cloud server for a wordpress blog, and configured it to the point that the blog was working OK.  Then I decided to check the performance aspects of the server, as it was a small 256 MB RAM + 10 GB disk machine.
Using apachebench (ab), I ran some load tests on the blog home page.  The server choked to death.  It was swapping so badly that RackSpace Cloud sent me this email:

This is an automatic notification to let you know that your Cloud Server, city.appcove.com, is showing a considerable amount of consistent swapping activity. Quite often this is an indicator that your application or database are not as efficient as they could be. It also may indicate that you need to upgrade your Cloud Server for more RAM.

That’s strange…
I found that the response rate was:

4 requests per second, 10 concurrent connections

When the concurrency was raised to 50, the server died.  It took 10 minutes for it to calm down enough that I could LOG IN and KILL apache.
Upon further investigation, I found that the default httpd.conf configuration was WAY TOO LARGE.
We're only working with 256 MB of RAM here, so if each apache process takes up any appreciable amount of memory at all, we have a low limit:

<IfModule prefork.c>
StartServers       8
MinSpareServers    5
MaxSpareServers   20
ServerLimit      256
MaxClients       256
MaxRequestsPerChild  4000
</IfModule>

Only after drastically reducing the configuration to the following did we get reasonable performance:

<IfModule prefork.c>
StartServers       4
MinSpareServers    2
MaxSpareServers   4
ServerLimit      4
MaxClients       4
MaxRequestsPerChild  4000
</IfModule>
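
Why 4?  Some back-of-the-envelope arithmetic (the per-process and system figures below are assumptions for illustration, not measurements):

ram_mb         = 256   # total RAM on the cloud server
system_mb      = 100   # OS, MySQL, and everything else (assumed)
per_process_mb = 35    # a typical mod_php prefork process (assumed)

# Apache should never spawn more processes than fit in physical RAM,
# or the box starts swapping exactly as described above.
max_clients = (ram_mb - system_mb) // per_process_mb
print(max_clients)     # -> 4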

As it turns out, the performance went up considerably:

16 requests per second, 50 concurrent connections

Still, I thought that it could get better.  So I looked into installing some PHP opcode caching software.

http://www.php.net/manual/en/intro.apc.php

The Alternative PHP Cache (APC) is a free and open opcode cache for PHP. Its goal is to provide a free, open, and robust framework for caching and optimizing PHP intermediate code.

As it turns out, it was easy to install.

# yum install php-pecl-apc

And after restarting apache:

47 requests per second, 50 concurrent connections

Even during this load test, the site was still responsive from a web browser.
Not bad for a cheap little Cloud Server, eh?

I highly recommend yum + createrepo + rpmbuild

As I briefly discussed before, I have recently been involved in building quite a few RPMs for our server clusters at AppCove.


Where we have arrived:

Our (new) primary production cluster consists of multiple RedHat Enterprise Linux 5 boxes in different capacities (webserver, appserver, database master, database slave, etc…).

Each machine is registered with 3 yum repositories:

  1. RHEL (RedHat Enterprise Linux)
  2. EPEL (Extra Packages for Enterprise Linux)
  3. ACN (AppCove Network)

All of our custom software packages and custom builds of open source software are placed into individual RPMs, and entered into our ACN repository.

From there, it is a snap to update any given server with the correct version of the software that server needs.

We have a dedicated build area, versioned with git, that is used to build and package all of the custom software that is needed.

(note, RPMs are not used for web application deployment — rsync via ssh is used for that)


Recommendation:

Having worked through the process from start to finish, I must say that I would highly recommend the following tools to anyone who is responsible for RedHat Enterprise, Centos, or Fedora system administration.

  • git – to keep your .spec files versioned
  • rpmbuild – to build the rpms
  • createrepo – to create your very own yum repository
  • apache – to serve the yum repository
  • yum – to obtain, install, and upgrade your rpms

Additionally, if you are using RedHat Enterprise or Centos, I would highly recommend using Extra Packages for Enterprise Linux (EPEL) to get a few of those “other” packages that don’t come with your OS (git, for example).


Learning how to build RPMs was a fairly steep curve.  But it wasn't long.  It is one of those things where if you know it, you say "that's easy", and if you don't, you say "what the ???"

yum+rpm was invented (I assume) to make life easier for countless system administrators and software publishers.  So it’s not the kind of thing that everyone is involved in.

It was a bit tough to figure out the caveats of how to correctly build RPMs that work.  The documentation is a bit sparse: a bit here and a bit there.


What are the benefits?

Many.  Let me list a few.

Your system stays really clean. With RPMs, you can uninstall everything you installed without leaving extra files laying around.

Upgrades are a snap. Once you have registered your own yum repository on a system, you can upgrade a given package by running:

yum upgrade your-package

All your systems can be on the same “page”. It is very easy, using yum, to ensure that all of your systems are using the exact same version of software.

Custom builds are super easy to maintain. We custom-compile php, python, and various other software.  Once the .spec files are in place, all of your software can be re-packaged with a single command.

In our specific case, we wanted to have the memcached client statically compiled into PHP.  With a few extra commands in the .spec file, it was a snap to pull in the source from pecl, and update `configure` to take it into account.

All builds can take place in one place. With one set of documentation, one consistent set of development tools, etc…  We have a user called `build` on one of the hosts that is specifically used for building all of the RPMs.


Where to learn?

The best way to learn, as usual, is to jump in and figure it out.  There is some really good documentation buried in the rpm.org site.  It is a book called Maximum RPM, originally published by RedHat.  The current snapshot of the book is available online.

http://www.rpm.org/max-rpm-snapshot/

Google is another good resource, depending on what it is you are looking for.

Basics of telnet and HTTP

Say you want to request a webpage…  Normally, one would use a web browser, right?  But sometimes you just need to see what is really going on…  In this blog post I will show the basics of using the telnet command to work with the HTTP protocol.

For reference: http://www.w3.org/Protocols/rfc2616/rfc2616.html

Most of these commands were run on Linux, but telnet on Windows should work too.

telnet <ip-or-host> <port>

Background…

If you are using the HTTP protocol, which runs on port 80 by default, then you must follow the HTTP protocol conventions (which are simple).  HTTP has two primary versions at this point: 1.0 and 1.1.

In the HTTP 1.0 days, a single website was bound to a single IP address.  What this means is that an HTTP request sent to a given IP address would return content from only one site.  This is quite limiting and inconvenient.  To have to assign a new IP for every different domain name… What a bother.  Not to mention that the current internet protocol standard, IPv4, is limited to several billion addresses and quickly running out.

More recently, HTTP 1.1 has become the standard.  This enables something called Name Based Virtual Hosting.  By requiring a “Host” header to be sent along with the request, HTTP servers can in turn “look up” the correct website and return it based on the name.  Hundreds or even thousands of different domains can now be hosted on a single IP address.

(keep in mind that SSL certificates each require a separate IP address.  Due to encryption issues, the IP address is needed to determine which SSL certificate to use…)

So with that introduction, allow me to show you the basics of HTTP…

Using HTTP over Telnet

The telnet utility is a simple (but useful) tool that allows one to establish connections to a remote server.  From my perspective, it is most useful with plain-text protocols (like HTTP), but my knowledge of telnet is not very deep…

Here is an example (the lines you type are marked with <press enter>):

[jason@neon ~]$ telnet gahooa.com 80
Trying 74.220.208.72...
Connected to gahooa.com (74.220.208.72).
Escape character is '^]'.
GET /       <press enter>
<html>
   <body>
      Hi, you have reached Gahooa!
   </body>
</html>
Connection closed by foreign host.

Because it was an HTTP 1.0 request, the server DID NOT wait for additional headers.  Again, quite limiting: the entire request is a single line.

And… HTTP 1.1

Here is an example of an Apache Virtual Host configuration directive.

<VirtualHost 74.220.208.72:80>
   # Defines the main name this VirtualHost responds to
   ServerName gahooa.com

   # Additional names (space delimited) which this VirtualHost will respond to.
   ServerAlias www.gahooa.com 

   # Apache will append the requested URI to this path in order to find the resource to serve.
   DocumentRoot /home/gahooa/sites/gahooa.com/docroot

</VirtualHost>

When we issue the following HTTP 1.1 request, we are in effect asking for the file at:

/home/gahooa/sites/gahooa.com/docroot/index.html

Keep in mind that because this is HTTP 1.1, the web server will continue to accept header lines until it encounters a blank line:

[jason@neon ~]$ telnet gahooa.com 80
Trying 74.220.208.72...
Connected to gahooa.com (74.220.208.72).
Escape character is '^]'.
GET /index.html HTTP/1.1       <press enter>
Host: www.gahooa.com           <press enter>
                               <press enter again>
HTTP/1.1 200 OK
Date: Wed, 03 Sep 2008 21:00:46 GMT
Server: Apache/2.2.9 (Unix)
Transfer-Encoding: chunked
Content-Type: text/html
                               <take note of blank line here>
<html>
   <body>
      Hi, you have reached Gahooa!
   </body>
</html>
Connection closed by foreign host.

A couple notes:

  • HTTP 1.1 continues to accept header lines until it receives a blank line
  • HTTP 1.1 sends a number of header lines in the response.  Then a blank line.  Then the response content.

Redirects

One of the main points of writing this article was to describe how to debug strange redirect problems.   Redirects are done by sending a “Location” header in the response.  For more information on the Location header, please see http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.30

[jason@neon ~]$ telnet gahooa.com 80
Trying 74.220.208.72...
Connected to gahooa.com (74.220.208.72).
Escape character is '^]'.
GET /test-redirect.php HTTP/1.1 <press enter>
Host: www.gahooa.com            <press enter>
                                <press enter again>
HTTP/1.1 302 Found
Date: Wed, 03 Sep 2008 21:00:46 GMT
Server: Apache/2.2.9 (Unix)
Transfer-Encoding: chunked
Content-Type: text/html
Location: http://www.google.com <take note of this line>

The Location header in the response instructs the requestor to re-request the resource, but from the URI specified in the Location header.  In the above example, if you were debugging redirect issues, you would simply initiate another HTTP request to  http://www.google.com

Python instead of telnet

Finally, I’d like to illustrate a really simple python program that would facilitate playing around with the same:

import socket

# Open a TCP connection to the web server.
S = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
S.connect(("www.gahooa.com", 80))

# Send the request line, the Host header, and the blank line that ends
# the headers.  (Sockets carry bytes, hence the b'' literals in Python 3.)
S.send(b"GET / HTTP/1.1\r\n")
S.send(b"Host: www.gahooa.com\r\n")
S.send(b"\r\n")

# Print the first 1000 bytes of the response: the status line, the
# headers, and the beginning of the body.
print(S.recv(1000))

S.close()

Conclusion

When you are not familiar with protocols such as HTTP, understanding “how things work” can be daunting.  But like many technologies out there, they really are simple (once understood).

The more truth and understanding you can fit into your perspective, the better you will be able to make informed decisions.

Gahooa!

File Extensions and Apache, a win-win solution

Here is the problem…  Either the developer loses, or the end user loses.  What could I possibly be talking about?  Allow me to explain…

Long ago, websites were authored using .html files.  Developers would hand code them to make sites which served their purposes quite nicely.  But as time went on, more was demanded of the web.  Server side languages, such as PHP, ASP, Java, Perl, Python, and more began to surface and become quite popular.

The file extension shown in the browser *usually* matches the file extension used on the server.  At least under Apache's default configuration (and IIS's, I believe).

http://www.site.com/home/index.html

But now, it is quite common to see the same sort of page end in .php, .asp, or something stranger still (the screenshots that appeared here showed a few examples).

But in reality…

They are all really returning a file with:

Content-type: text/html

That’s a pretty common approach to using server side languages.  There are a couple other approaches also, such as:

  1. Don’t use files at all, only directories:
    http://www.example.com/about
  2. Auto generate the files on the site (but then you lose the “interactive” nature of a server-side language)
    http://www.example.com/about.html

The problems with the above are:

  • It gives the developers an “incorrect” file extension to work with (ie, embedding PHP in a .html file)
  • Or, it gives the end user a file like “about.asp”, but in reality, there is not a single character of ASP in the file they receive.

(“quit complaining”, you may say…  oh well… I do like things to be “optimal” when possible)

So I identified a way to suit both purposes nicely. We now name our scripts names like:

  • /home/about.html.php
  • /render/image.jpg.php
  • /foo/bar.xhtml.php

HOWEVER, when they are referenced via HTTP, the last extension is always omitted.

  • /home/about.html
  • /render/image.jpg
  • /foo/bar.xhtml

(doesn’t that look nice?)

To pull it off, we implemented an interesting Apache mod_rewrite rule:

RewriteCond %{REQUEST_FILENAME} (\.html|\.xhtml)$
RewriteCond %{DOCUMENT_ROOT}%{REQUEST_FILENAME}.php -f
RewriteRule ^(.*)$ $1.php

‘if the request ends in “.html” or “.xhtml”, and the file (REQUEST + “.php”) exists, then use that file instead.’

In this way, the end user simply receives an “.html” file.  The developers are still looking at a “.php” file.  And everyone is happy.

Observations and Questions:

Developers at AppCove have taken to this quite readily.  There was a little confusion at first about linking to “.html.php”, but that was quickly resolved.

Does it impact performance?  I'm sure it has an impact, however small, but I have not tested it.  It would be an interesting benchmark.  My opinion is that it would be negligible.

Useful?  Sure!  I think it is more “correct” to return a file with an extension that appropriately describes its content type.


Thoughts?