logo

Scaling Unix to infinity with SQL

PLEASE NOTE – This document is still very much a work in progress. I posted it to my Web site because I feel like the information it contains is valuable and not found anywhere else. But it is currently unfinished, and I'm unfortunately unable to dedicate any resources to finishing it up at the moment. Sorry! You're welcome to ask specific questions, I'll try to answer them to the best of my knowledge.

Introduction

Unix is well known for its text-based approach to system administration. While in the recent years efforts have been made to develop GUI frontends for most system tasks, (with mixed success as any Unix system admin will tell), at the core Unix still heavily relies on text configuration files.

Such an approach has obvious advantages when compared to other OSes with GUI-based administration only, such as Windows NT and 2000. For example, text configuration makes it extremely easy to document and duplicate a system's setup; such files can be processed by external tools, and edited even when the system is running in a minimal configuration, or under duress, at times of repair.

On the other hand text-based configuration does not scale very well. Growing configuration data will usually result in huge configuration files, hard to parse and understand by a human operator. Big configuration files can often be split in multiple, smaller files, but at the expense of readibility.

A more structured way is needed to store configuration data in order to scale services efficiently to potentially millions of users. This article discusses a method increasingly deployed by Internet Service Providers and IT organizations alike: structuring Unix systems around a SQL database.

Using a SQL database to store configuration data solves many of the problems mentionnned earlier. With a SQL database, the data:

  • is structured,
  • is accessed in SQL, a powerful language, allowing for consistency checks and XXX
  • can be processed by external, SQL-aware, tools,
  • is kept in ASCII format.

This article uses FreeBSD and MySQL as example system and database, but it should apply equally well to other combinations of operating system and database server.

1. Web Service With Apache

1.1. Virtual Hosts Configuration

Apache is by all accounts a great piece of software: it complies to all standards, is fast, very stable and its open-source license lets you modify and distribute it at will. Unfortunately its configuration file has grown from being a very simple text file to a complex hierarchical model allowing the configuration of multiple virtual hosts. In the process was lost a lot of the manageability that a simple text file offers.

In this article I would like to introduce you to a new way of configuring Apache, directly from a SQL database, with the help of mod_perl: instead of creating and maintaining by hand a single text file with (potentially) hundreds of virtual hosts definitions, you will be able to store your all of your configuration in a database table.

Quick reminder: mod_perl, initially developed by Doug MacEachern, embeds a Perl interpreter in the Apache server. It is used mostly to speed up CGI execution (with the Apache::Registry module) and to write Apache modules in Perl. But one neat possibility offered by mod_perl is often overlooked: <Perl> sections can be used to embed Perl code inside of the Apache configuration file itself and thus configure the server according to anything accessible from Perl. See example 1 for a very simple example of what <Perl> sections can do.

Standard method With a <Perl> section

User nobody
Group nogroup

Listen 10.0.0.1:80
Listen 10.0.0.2:81
<Perl>
$User = "nobody";
$Group = "nogroup";

@Listen = qw( 10.0.0.1:80 10.0.0.2:81 );
</Perl>
Example 1. Comparing Apache configuration methods.

Example 2 is slightly more advanced and unveils the true power of Perl sections, configuring the Apache server as the user who started it. All Apache directives are accessible from within Perl sections as Perl variables.

<Perl>
#!perl

$User = $>;	# current UID
$Group = $);	# current GID

$Port = 8888;
$ServerRoot = "$ENV{HOME}/www";

$ServerName = `hostname`;

__END__
</Perl>
Example 2. A more advanced example of Apache configuration with a Perl section: run the server as the current user.

Even virtual host configuration can be done with Perl sections: the magic %VirtualHost variable is a hash referencing the virtual hosts configured. This is where the database comes into play: instead of manually referencing all virtual hosts in the Apache config file, we are going to have them in a database table. A <Perl> section in the Apache config file will read this table and instantiate the corresponding VirtualHost sections for the configuration.

Since VirtualHost directives merely override values inherited from the main configuration, every <VirtualHost> section only needs to change whatever is needed for this specific virtual host. This is the main idea behind mod_vhost_alias, where only a limited number of changes can be made per virtual host. On the other hand, using mod_perl and a database to configure your Apache means that you are only limited by your imagination.

See example 3 for the database table used to store virtual hosts. Because we will use the DBI module to access the database, almost any database can be used. I'm using and showing MySQL in the examples because of its ubiquity. The "rank" column is used to order the VirtualHost definitions. Apache may depend on their order for virtual host matching, see An In-Depth Discussion of Virtual Host Matching for more details.

Field Type Null Key Default Extra
vhostID int(10) unsigned PRI 0 auto_increment
rank int(10) unsigned 999
VirtualHost varchar(63)
Port int(10) unsigned 80
ServerName varchar(63) YES
DocumentRoot varchar(127) YES
ScriptAlias varchar(127) YES
ErrorLog varchar(127) YES
TransferLog varchar(127) YES
HostnameLookups enum('Off','On') Off
Example 3. The database table storing virtual hosts configuration.

Notice how in example 2 we delimited the actual Perl code between #!perl and __END__. These special tags will allow us to test the code entered in the configuration file with:

$ perl -cx apache.conf
It's a useful trick.

[program] here

Note how we make sure a new value is present to override the value inherited from the main server configuration; otherwise we could resetting or erasing a configuration value.

The ScriptAlias directive is treated differently: instead of providing the full syntax of the original Apache directive, we choose to treat it similarly to the DocumentRoot, which specifies only a directory.

Once the configuration is complete, we insert it into the special %VirtualHost variable. Because there might be many virtual hostssharing the same address and port, we have to push this configuration in the hash. In other words, %VirtualHost is a hash of arrays of hashes, see example 4.

%VirtualHost = (
   "10.0.0.1:80" => [ { ServerName => "www1.example.com",
                        DocumentRoot => "/www/example1/htdocs",
                        ... 
                      },
                      { ServerName => "www2.example.com",
                        DocumentRoot => "/www/example2/htdocs", 
                        ...
                      },
                    ],
   "10.0.0.2:81" => ...
);
Example 4. The %VirtualHost variable.

Extensions

I've shown only a limited subset of what can be done with this technique. Since almost any ...

Why would you want to do this

A new module mod_vhost_alias distributed with Apache 1.3 makes the task much easier and almost infinitely scalable, allowing you to have potentially thousands of virtual hosts running off of one machine. Although mod_vhost_alias is a great addition to Apache, I find it to be of limited flexibility

Installing mod_perl just for the just the ease of configuring your Apache from a database might not be worth the extra XXX to your site. But, if you are already using mod_perl for all its other benefits (eg. CGI performance, SQL logging, etc.), this approach suddenly makes a lot more sense. Also, any site beyond a certain size is already running some kind of database.

For example, I believe it to be perfect for an Web-hosting environment where lots of sites are used but require more configuration that what can be done with mod_vhost_alias.

1.2. Web Logs

Have Apache log directly to the database with the Apache::DBILogger Perl module (requires mod_perl). See example 6 for the directives to add to your Apache configuration file.

PerlLogHandler Apache::DBILogger

PerlSetVar DBILogger_data_source    DBI:driver:dsn
PerlSetVar DBILogger_username       httpd_user
PerlSetVar DBILogger_password       secret
PerlSetvar DBILogger_table          requests
Example 6. Apache::DBILogger configuration.

Apache::DBILogger will log all requests to the database, bringing the power of SQL to your log files. For example:

  • hit count and total bytes transfered from a specific virtual server,
  • hit count and total bytes from all servers, ordered by number of hits,
  • count of hits from macintosh users,
  • etc.

Crafting your own analysis of the log files is only limited by the power of your SQL engine.

Note that in practice you will probably want to use specialized software for the task of Web log analysis. Thanks to logging to a database, the software will not have to go through the lengthy process of importing the log files into its format.

Field Type Null Key Default Extra
server varchar(127) MUL
bytes mediumint(9) 0
user varchar(15)
filename varchar(200)
remotehost varchar(150)
remoteip varchar(15)
status smallint(6) 0
timeserved datetime MUL 0000-00-00 00:00:00
contenttype varchar(50)
urlpath varchar(200)
referer varchar(250)
useragent varchar(250)
usertrack varchar(100)
SQL table for logging.

1.3. Authentication & Authorization

Web passwords can also be stored in the database, instead of using the text-based htpasswd and htgroup files. The mod_perl modules Apache::AuthenDBI and Apache::AuthzDBI can be used fo this purpose; based on the DBI module, they totally free the site administrator from any dependance to the database engine used.

Why two different modules for something that was traditionnally bundled together in the Apache configuration? Because each of these two modules actually answer a different need:

  • authentication is the process of asserting a user's identity, answering the question: "is the user who they claim to be?".
  • authorization has a quite different purpose, that is: "is this specific user allowed to access the resource?".

Apache::AuthenDBI provides password-based authentication, which in practice is often deemed sufficient. On the other hand, Apache::AuthzDBI does only basic authorization, by user name and group name. Quite often you will want more fine-grained control, for example by giving access to users from a specific country, etc. Many more Apache authorization modules are available, ready for use, on CPAN, the Perl module repository.

Sample .htaccess in example X.

AuthName Private Web Area
AuthType Basic

PerlAuthenHandler Apache::AuthenDBI
PerlAuthzHandler Apache::AuthzDBI

PerlSetVar Auth_DBI_data_source   dbi:driver:dsn
PerlSetVar Auth_DBI_username      db_username
PerlSetVar Auth_DBI_password      db_password
# DBI->connect($data_source, $username, $password)

PerlSetVar Auth_DBI_pwd_table     users
PerlSetVar Auth_DBI_uid_field     username

PerlSetVar Auth_DBI_pwd_field     password
# SELECT pwd_field FROM pwd_table WHERE uid_field=$user
# for authentication

PerlSetVar Auth_DBI_grp_field     groupname
# SELECT grp_field FROM pwd_table WHERE uid_field=$user
# for authorization

require valid-user	# for authentication only

require user   user_1  user_2 ... # for authorization 
require group group_1 group_2 ...
Example X. Sample .htaccess file for authentication & authorization.

AuthDBI commands.

Directive Explanation
Auth_DBI_pwd_whereclause Use this option for specifying more constraints to the SQL SELECT statement.
Auth_DBI_grp_whereclause Use this option for specifying more constraints to the SQL SELECT statement.
Auth_DBI_authoritative Default is 'on'. When set 'on', there is no fall-through to other authorization methods if the authorization check fails. When this directive is set to 'off', control is passed on to any other authorization modules. Be sure you know what you are doing when you decide to switch it off.
Auth_DBI_expeditive Default is 'off'. When set to 'on', the result of an authorization failure is an 'Access Forbidden' code instead of 'Authentication Required'. This is less convenient in a few cases because it doesn't allow users to 'switch identities' w/o closing the browser, but is formally more correct and allows support persons to easily diagnose whether the problem is in authentication (wrong password) or in authorization (wrong permissions).
Auth_DBI_uidcasesensitive Default is 'on'. When set 'off', the entered userid is converted to lower case. Also the userid in the password select-statement is converted to lower case.
Auth_DBI_pwdcasesensitive Default is 'on'. When set 'off', the entered password is converted to lower case.
Auth_DBI_nopasswd Default is 'off'. When set 'on' the password comparison is skipped if the Auth_DBI_pwd_field is empty, i.e. allow any password. This is 'off' by default to ensure that an empty Auth_DBI_pwd_field does not allow people to log in with a random password. Be sure you know what you are doing when you decide to switch it on.
Auth_DBI_encrypted Default is 'on'. When set 'on', the value in the Auth_DBI_pwd_field is assumed to be crypted using perl's crypt() function and the incoming password is crypted before comparison. When this directive is set to 'off', the comparison is done directly with the plain-text entered password.
Auth_DBI_cache_time Default is 0 = off. When set to any value n > 0, the groups of all users will be cached for n seconds. After finishing the request, a special handler skips through the cache and deletes all outdated entries (entries, which are older than the cache_time).
Auth_DBI_placeholder Default is 'off'. When set 'on', the select statement is prepared using a placeholder for the username. This may result in improved performance for databases supporting this method.
Example X. Extra configuration for authentication & authorization.

2. E-mail Service With Postfix

One piece of advice to the reader: trash sendmail; it is old technology. Postfix on the other hand comes secure out of the box, is more flexible, configurable by a normal human being, and supports external lookup tables, called maps. The default types of maps supported by default in Postfix are text-based and hash-based, but Scott Cotton and Joshua Marcus of IC Group, Inc., added a new mysql map type, allowing the storing of e-mail aliases, forwarding addresses, etc. in a database.

Postfix installations which may benefit from using mysql map types include sites that have a need for instantaneous updates of forwarding, and sites that may benefit from having mail exchangers reference a networked database, possibly working in conjunction with a customer database of sorts.

#
# postfix config file for looking up aliases in a MySQL database
#

# the user name and password to log into the mysql server
user = some_user
password = some_password

# the database name on the servers
dbname = customer_database

# the table name
table = mxaliases

# the SQL query
select_field = forw_addr
where_field = alias
#
# you may specify additional_conditions here
additional_conditions = and status = 'paid'
#
# the above variables will result in the following query:
# SELECT forw_addr FROM mxaliases WHERE alias = '$lookup' AND status = 'paid'

# the hosts that postfix will try to connect to
# and query from (in the order listed)
hosts = db1.example.com db2.example.com
Example X. Postfix configuration for mysql map.

Note how this configuration interface setup allows for multiple databases: you can use one for a virtual table, one for an access table, and one for an aliases table if you want.

Since sites that have a need for multiple mail exchangers may enjoy the convenience of using a networked mailer database, but do not want to introduce a single point of failure to their system, we've included the ability to have postfix reference multiple hosts for access to a single mysql map. This will work if sites set up mirrored mysql databases on two or more hosts. Whenever queries fail with an error at one host, the rest of the hosts will be tried in order. Each host that is in an error state will undergo a reconnection attempt every so often, and if no mysql server hosts are reachable, then mail will be deferred until at least one of those hosts is reachable.

3. Domain Name Service With Bind

bind 8 versus bind 9

B. References

  • to be completed

C. About the author

Renaud Waldura is a freelance software consultant specializing in Internet applications. Visit Renaud's Web site at http://renaud.waldura.com/ to learn more about how he can help you with your business.


Copyright © 2000-2007 Renaud Waldura <renaud@waldura.com>