Search beyond MySQL with Sphinx

MySQL is a powerful database, but its search capabilities are most efficient for searches based on a single database index. If your application requires complex searching, consider adding Sphinx to the mix to gain search speed. Sphinx's main selling point is its full-text search capability, but it can also shine with regular queries.

To show how Sphinx can improve searching performance, I needed a database with at least a few million records. For test data, I turned to the World Cities data I used in my recent Apache and RESTful Web Services article. The "world" database schema holds data about a couple of hundred countries, their four thousand regions, and more than three million towns and cities. Countries have an ID (two letters) and a name; regions in a country have an ID (two letters) and a name; cities in a region have an ID (integer), a name, an accented version of the name with possibly foreign characters, population (if known, zero otherwise), and geographic coordinates (latitude, longitude). Searches in the country and region tables don't require much processing power, but joins involving cities can, so I used Sphinx mainly for such searches.

Sphinx (a sort of acronym for "SQL Phrase Index") is an open source full-text search engine that natively supports MySQL databases (and others). Its development began in 2001, and its latest version, 2.0.6, was released last month. Sphinx uses its own tuned-for-speed index files, and can thus achieve much faster search speeds than MySQL alone. It is being used for databases with billions of documents, spanning terabytes of disk space; for scalability, you can set up a cluster of Sphinx machines.

Installation of Sphinx is simple. Binary versions are available for a few Linux distributions, but installation from source isn't that complicated, so we'll take that route. After downloading the source tarball from the Sphinx download page, extract its files and run the usual ./configure, make, make install sequence, but specify the path for Sphinx's binaries; I opted for bin/sphinx under my home directory. Try ./configure --help to get a full list of available options.

> tar -zxf sphinx-2.0.6-release.tar.gz
> cd sphinx-2.0.6-release
> ./configure --prefix=
> make
> make install

After you install Sphinx, you must configure it so it can access your MySQL database, create whatever index files you request, index your data, and run a daemon that actually performs searches.

You configure Sphinx by editing its etc/sphinx.conf file in the installation directory. The configuration file is divided into stanzas. For each index you want to create, you must provide a source, which tells Sphinx how to procure the data it must index, its data types, and more, and an index definition, which specifies how to index the data, where to store the index files, and so on:

source citiesSource
{
  type              = mysql
  sql_host          =
  sql_port          = 3306
  sql_user          = fkereki_user
  sql_pass          = fkereki_pass
  sql_db            = world

  sql_query    =        \
    SELECT              \
      cityCode id,      \
      countryName,      \
      regionName,       \
      cityName,         \
      population,       \
      latitude,         \
      longitude         \
    FROM cities         \
      JOIN regions   USING (countryCode,regionCode) \
      JOIN countries USING (countryCode)

  sql_field_string  = countryName
  sql_field_string  = regionName
  sql_field_string  = cityName
  sql_attr_bigint   = population
  sql_attr_float    = latitude
  sql_attr_float    = longitude
}

We want to index cities data. The first attribute tells Sphinx it will work with a MySQL database, and the five following ones (host, port, user, password, and schema name) tell it how to access it. The sql_query definition tells Sphinx how to get the "documents" it will index; in our case, it will use the results of joining together countries, regions, and cities, using most of their fields. A "document ID," a unique unsigned positive integer number, must always be the first field, followed by as many as 32 full-text fields, and any number of other attributes. The last sql_ lines define the data types of your document fields; see the table below for common allowed data types. You could also have sql_query_pre and sql_query_post options, to specify SQL commands to be run before or after indexing; these could set up temporary tables, mark records to be indexed, update counters or totals, and clean up afterward.
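For instance, a hypothetical pre/post pair could flag the rows to index before the run and clear the flag afterward; the to_be_indexed column below is invented for this sketch, as the "world" schema has no such field:

  sql_query_pre   = UPDATE cities SET to_be_indexed = 1 WHERE population > 0
  sql_query_post  = UPDATE cities SET to_be_indexed = 0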

Sphinx frequently used data types

Data Type           MySQL equivalent
sql_attr_bool       Boolean, TINYINT
sql_attr_bigint     BIGINT, signed, 64 bits wide
sql_attr_float      FLOAT, 32 bits wide
sql_attr_uint       UNSIGNED INTEGER, up to 32 bits wide
sql_attr_timestamp  UNIX timestamps, 32 bits wide
sql_attr_string     VARCHAR fields, not indexed
sql_field_string    VARCHAR fields, full-text indexed

The corresponding index entry specifies where the index files and attribute files are to be stored, whether you can do '*' queries (similar to LIKE queries with the '%' special character), and whether you are working with single-byte data (we are; the "world" schema is in ISO-8859-1) or multi-byte (UTF-8, for example) data. Setting enable_star=1 makes for larger index files and slower indexing times, so don't specify it unless your business logic actually requires '*' queries.

index citiesIndex
{
  source            = citiesSource
  path              = /home/fkereki/bin/sphinx/var/data/world
  docinfo           = extern
  enable_star       = 1
  min_infix_len     = 3
  charset_type      = sbcs
}

Finally, you must also specify some parameters for indexer, the Sphinx module that creates and rotates index files, and searchd, the Sphinx search daemon, which actually runs searches. For the former, you can specify how much RAM it will use (the more, the faster) and some other parameters, such as throttling conditions and buffer sizes. For the latter, you must specify the port that the Sphinx daemon will listen on, some file paths for data and logs, and some parameters specifying, for example, how many queries to run in parallel and whether to preopen index files on startup; the default values usually work.

indexer
{
  mem_limit         = 1024M
}

searchd
{
  listen            = 9306:mysql41

  binlog_path       = /home/fkereki/bin/sphinx/var/data
  log               = /home/fkereki/bin/sphinx/var/log/searchd.log
  pid_file          = /home/fkereki/bin/sphinx/var/log/
  query_log         = /home/fkereki/bin/sphinx/var/log/query.log
}

You can query Sphinx through an API, but for MySQL programmers, using SphinxQL, a subset of SQL that supports the SELECT statement for running queries, is a better option. You must pick a port for SphinxQL to listen on; 9306 (similar to MySQL's 3306) is suggested. You'll be able to work with MySQL and Sphinx at the same time, each on its own port; we'll see that in a moment, but first let's index and reindex our data.

Indexing the data

Sphinx can recreate all of its index files even while it's running, because it produces an alternate set of files, which it then rotates into place. Indexing is very fast, and the more RAM available, the faster it goes. On my machine it can index the three million city documents in about 25 seconds. Enabling star searches (enable_star=1) can drastically and negatively impact that; when I made that change, indexing took six times as long, about 150 seconds, and the index files were also bigger. If required, you can reindex just a single index at a time, or specify --all to reindex everything.

> ../bin/indexer --all --rotate
Sphinx 2.0.6-release (r3473)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (
using config file '/home/fkereki/bin/sphinx/etc/sphinx.conf'...
indexing index 'citiesIndex'...
collected 3033147 docs, 89.2 MB
sorted 13.2 Mhits, 100.0% done
total 3033147 docs, 89186515 bytes
total 25.680 sec, 3472869 bytes/sec, 118108.93 docs/sec
total 3 reads, 0.228 sec, 59220.6 kb/call avg, 76.0 msec/call avg
total 552 writes, 3.968 sec, 791.8 kb/call avg, 7.1 msec/call avg
rotating indices: successfully sent SIGHUP to searchd (pid=30416).

You should also think about how you want to keep index files up to date. If your data doesn't change frequently and changes involve a small percent of all data (a common situation) you can have a "main+delta" structure, with a main index for data that rarely (if ever) changes, and a delta index for new data. You can reindex the smaller delta data set frequently, but you'll have to query both index files in order to get a full answer. The Sphinx documentation talks at length about this type of solution.
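A rough sketch of such a main+delta setup, using Sphinx's stanza inheritance; the sphinx_counter table that tracks the highest document ID already indexed is hypothetical, invented for this example:

source citiesDeltaSource : citiesSource
{
  sql_query  =        \
    SELECT            \
      cityCode id,    \
      countryName, regionName, cityName, \
      population, latitude, longitude    \
    FROM cities       \
      JOIN regions   USING (countryCode,regionCode) \
      JOIN countries USING (countryCode)            \
    WHERE cityCode > (SELECT max_id FROM sphinx_counter)
}

index citiesDeltaIndex : citiesIndex
{
  source = citiesDeltaSource
  path   = /home/fkereki/bin/sphinx/var/data/world_delta
}

A search would then name both indexes, as in SELECT ... FROM citiesIndex,citiesDeltaIndex; SphinxQL lets you list several indexes in the FROM clause.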

Sphinx also supports RT (real-time) index files, which can be updated on the fly by using INSERT, DELETE, and REPLACE statements. Of course, this means that every operation that updates your database must also be replicated for Sphinx's benefit, so that every involved index file stays current. If you require searching the very latest data, you'll have to decide between RT index files and delta files, which will call for some further research. If not, reindexing periodically can be a solution, as long as you can live with somewhat out-of-date data for a little while.
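A minimal RT index sketch follows; unlike the disk indexes above, an RT index takes its schema from the configuration itself and needs no source stanza (the path and field choices here are just illustrative):

index citiesRT
{
  type           = rt
  path           = /home/fkereki/bin/sphinx/var/data/cities_rt
  rt_field       = cityname
  rt_attr_bigint = population
}

Your application would then push changes through SphinxQL as they happen, for example INSERT INTO citiesRT (id, cityname, population) VALUES (153585, 'darwin', 93081); over the same kind of connection it uses for searches.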

Trying out some searches

Using SphinxQL isn't that different from using MySQL; mostly it's just a matter of specifying port 9306 instead of 3306. No matter what language you use, if you have an available MySQL client, you'll be able to easily access SphinxQL. However, don't plan on a line-for-line replacement; SphinxQL isn't full MySQL, and you have to work around some restrictions.

As an example of the quirks you have to live with, you cannot directly use string attributes in WHERE conditions, so instead of cityname='MONTEVIDEO', you have to write MATCH('@cityname ^MONTEVIDEO$'); the Sphinx site suggests other alternatives. You also need workarounds for DATE and DATETIME fields; Sphinx doesn't provide those types, so you have to transform DATE fields into integers with TO_DAYS(), and DATETIME fields into timestamps with UNIX_TIMESTAMP(). Each SELECT can include at most one MATCH() call. If you don't specify a LIMIT clause, LIMIT 0,20 is assumed, whereas MySQL would return all matching records. Check the latest documentation to see how these restrictions evolve; it's likely that most will be removed over time, but for now they may prove a problem, depending on your needs.
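The string-attribute restriction, as a sketch against our citiesIndex (the ^ and $ anchors keep only whole-field matches):

-- Not allowed in SphinxQL: a string attribute in WHERE
-- SELECT * FROM citiesIndex WHERE cityname = 'MONTEVIDEO';
-- Instead, anchor a full-text match to the whole field:
SELECT * FROM citiesIndex WHERE MATCH('@cityname ^MONTEVIDEO$');

The date transformations belong in sql_query on the MySQL side; the founded_date and updated_at columns below are hypothetical, just to show the idea:

sql_query = SELECT cityCode id,                        \
              TO_DAYS(founded_date)     founded_days,  \
              UNIX_TIMESTAMP(updated_at) updated_ts,   \
              ...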

Despite the differences, the speed Sphinx achieves can make it worth using. For example, suppose I want data on countries having cities that include "DARWIN" in their name, with known population; Sphinx can return this information in less than 1/100th of a second. As a second query, to show grouping, I'd like to know which countries have cities with a population greater than that of Uruguay, my home country; this takes Sphinx 0.01 seconds. Note the @count notation instead of a straight COUNT(*), and also note that queries always return two extra attributes, id and weight; the first identifies the original record, and the second measures how well it matches your query.

>mysql -h192.168.1.200 -P9306 -ufkereki_user -pfkereki_pass


mysql> SELECT countryname,regionname,cityname,population
    -> FROM citiesIndex
    -> WHERE MATCH('DARWIN') AND population>0
    -> ORDER BY countryname ASC, regionname ASC;
| id      | weight | countryname | regionname          | cityname     | population |
|  153585 |   1685 | Australia   | Northern Territory  | darwin       |      93081 |
| 3173417 |   1685 | Zimbabwe    | Mashonaland Central | mount darwin |      11963 |
2 rows in set (0.00 sec)

mysql> SELECT countryName,@count
    -> FROM citiesIndex
    -> WHERE population>3500000
    -> GROUP BY countryName
    -> ORDER BY @count DESC, countryname ASC
    -> LIMIT 10;
| id      | weight | countryname        | @count |
|  437596 |      1 | China              |      8 |
| 1322104 |      1 | India              |      8 |
| 1995219 |      1 | Nigeria            |      3 |
|  156938 |      1 | Australia          |      2 |
|  313817 |      1 | Brazil             |      2 |
|  840438 |      1 | Egypt              |      2 |
| 2200162 |      1 | Pakistan           |      2 |
| 2461969 |      1 | Russian Federation |      2 |
| 2769407 |      1 | Turkey             |      2 |
| 2920653 |      1 | United States      |      2 |
10 rows in set (0.01 sec)

Working with MySQL produces the same results, but at a higher cost: the query for "DARWIN" cities, which requires joining three tables, takes almost 10 seconds, and the query for countries with cities more populated than Uruguay takes almost 1 second.
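For reference, the plain-MySQL version of the first query looks roughly like this, built from the "world" schema described earlier; without Sphinx's full-text index, MySQL has to join the three tables and scan every city name:

SELECT countryName, regionName, cityName, population
FROM cities
  JOIN regions   USING (countryCode, regionCode)
  JOIN countries USING (countryCode)
WHERE cityName LIKE '%darwin%'
  AND population > 0
ORDER BY countryName, regionName;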
