GOOGLE PROGRAMMING CONTEST SUBMISSION: GEO SEARCH

Daniel Egnor (egnor@ofb.net)
30 April, 2002

OVERVIEW

This software detects street addresses (such as "2400 Bayshore Parkway,
Mountain View, CA 94043") in a corpus of text and converts them into
geographical coordinates (such as "122.09720 W, 37.42532 N").  These
coordinates are indexed in a two-dimensional index along with a
conventional keyword index of the corpus.  A query processor is then able
to rapidly process queries which ask for documents that match certain
keywords and/or contain addresses within a certain radius of a specified
target address.  (Think "bookstores near me".)

Address detection and conversion is done using the TIGER/Line digital
mapping data published by the US Census Bureau (free of charge, public
domain).  The TIGER/Line data is converted into an indexed form suitable
for very rapid address lookup (and, more importantly, very fast rejection
of anything that is not a valid street address).

DEPENDENCIES

The main dependency (other than GNU make, a C/C++ compiler, etc.) required
to use this software is the TIGER/Line data, which may be downloaded free
of encumbrance from .  You will want every .zip file from every
subdirectory of that URL; a recursive download tool will be helpful (I
used "wget").  This is a large download.

You will also need the FIPS 55 geographical name data file, available
freely from .

The CGI interface is a simple Python script, so you will need Python to
use it (any moderately recent version should suffice).  It generates
references to MapBlast, a commercial Web-based mapping service.  If
MapBlast is unavailable, everything will work, but you will not see a map
of the results for a geographical query.  Since MapBlast is a commercial
service, feel free to ignore the mapping functionality for evaluation.

QUICK START

1. Download TIGER/Line and FIPS data.

   Start this first, because it will take a while.  Download everything
   under .  The important files are the .zip files in each state's
   directory.  Place these files somewhere on your system so that you can
   find them again.

   Also download the FIPS 55 geographical name database, and save it
   somewhere.  Make sure that you have enough space for all this data,
   plus a few extra gigabytes for intermediate files during indexing.

2. Build my code (run "make").

   Invoke GNU make in my source directory (the one with this README in
   it).  This assumes a UNIX-like system.  (See PORTABILITY below for
   notes about porting to other systems.)  Hopefully all will go well at
   this point, and it will build a happy little stable of executables.
   (Among other things, it will build a modified version of the Google
   "ripper", which will be called "goo-ripper".)

3. Index the TIGER/Line data.

   Once you've finished downloading the TIGER/Line data, use the
   "geo-map.sh" shell script in my source directory to index it.
   Specifically, in my source directory, run this:

      find /your/tiger/files -name '*.zip' | ./geo-map.sh fip55all.txt ./alias.txt ./all.map

   This should chug for a few hours, occasionally emitting complaints
   that not every file is in every .zip file, and perhaps one or two
   complaints about invalid input lines.  (Blame the Census Bureau.)  If
   all goes well, you should end up with an 'all.map' file that's a few
   hundred MB in size.

4. Test the TIGER/Line index.

   In my source directory, run this:

      ./geo-client ./all.map

   Input an address, on a single line.  For example, type "2400 Bayshore
   Parkway, Mountain View, CA 94043" and press Enter.
   The "geo-client" program should indicate success and offer location
   information, something like this:

      SUCCESS: 2400/Bayshore Pky/Mountain View/CA/94043
      Side = L
      Start = 2100 @ -122.09452, 37.42271
      At = 2400 @ -122.0972, 37.42532
      End = 2642 @ -122.09937, 37.42743

   Feel free to try more addresses, and experiment with address formats
   (I've tried to make the detector as lenient as possible).  When you're
   done, press ^C or ^D.

   If this works, you're in good shape.  If this doesn't work... *gulp*.
   "It works for me, really it does."  Feel free to contact me.

5. Index some documents!

   The indexer consumes pre-parsed repository files (*.pprepos.bz2).
   Collect, let's say, the 900K-document sample corpus.  (Significantly
   larger corpuses will require special processing; see the
   implementation notes for more information.)  Have all the .bz2 files
   stored somewhere, and:

      find /your/pprepos/files -name '*.bz2' | ./idx-index.sh ./all.map ./all.index

   After some hours, this should generate a file 'all.index'.  (For the
   sample corpus, it's a few hundred MB.)  It may give a "read: EOF"
   message, which you may ignore.

6. Test some queries.

   In the source directory, run this:

      ./idx-client ./all.index ./all.map -5 '2400 Bayshore Pky 94043' robots

   With the 900K-document sample corpus, you should see four results,
   formatted something like this:

      ok 4 results found
      5:37.42532:-122.0972:2400 Bayshore Pky 94043
      http://www.frc.ri.cmu.edu/robotics-faq/8.html
      What companies sell or build robots?
      0.77411696:37.418939:-122.09368:2133 Leghorn St, Mountain View, CA 94043
      http://www.frc.ri.cmu.edu/robotics-faq/10.html
      What Robotics related products are there?
      2.2514967:37.41528:-122.07504:1235 Pear Ave, Mountain View, CA 94043
      http://soe.stanford.edu/programs/research/labscenters.html
      SUSOE - Laboratories, Research Centers, and Affiliate Programs
      3.9878246:37.427179:-122.14233:430 Sherman Ave, Palo Alto, CA 94306
      http://mason.gmu.edu/~plepage/Pubs98.html
      Pamela LePage -- Publications
      4.825617:37.40464:-122.14528:3801 Miranda Ave, Palo Alto, CA 94304

   Feel free to try other queries, with different addresses (instead of
   2400 Bayshore), different distances (instead of 5km), different
   keywords (instead of 'robots'; each keyword should be its own
   argument), etc.

7. Set up the CGI interface.

   Edit the "query.cgi" script and set the pathnames hardcoded at the
   top.  Place this script somewhere such that a Web server will execute
   it.  Visit the script; the user interface should be self-explanatory.

CAVEATS

The current implementation has a number of weaknesses that should be
pointed out.  None of these flaws are believed to be insuperable.

1. The sample corpus provided is not an interesting demonstration of this
   technology.  Educational pages are not address-rich; those addresses
   that do appear are usually the main address of a university.  I
   believe a more general corpus would have more addresses of businesses
   and service providers that would yield more interesting query results,
   especially in combination with good ranking (see below).

2. The keyword search facility is extremely primitive.  There are no
   boolean queries, phrase queries, or proximity search.  There is no
   ranking; pages are simply sorted by increasing distance from the
   center address specified (if any).  The usual problems with a simple
   conjunctive keyword query are apparent; hits often match lengthy
   documents that happen to contain the words in the query and an address
   in the area of interest, but are not actually relevant to the query.
   Simple TF/IDF ranking would probably work wonders (see the sketch
   after this list).  Google obviously has very good keyword search
   relevance ranking algorithms, so I didn't invest any time in that
   area.  The best way to combine keyword relevance with geographical
   distance into a unified result presentation order remains unknown,
   however.

3. The query client collects all the results for a query (no matter how
   many), sorts them (by distance), and outputs them all on a single
   page.  Pathological queries like "the" (no noise word removal here!)
   will run for a minute and then generate an enormous result page.
   Incremental result generation is very possible, simply unimplemented.
   Even with distance ranking, the geographical index could be used to
   find closer results before farther ones.

4. The address detector is not perfect.  With one test set of addresses
   (extracted by hand from a corpus subset), it was only able to
   recognize 80% of them as valid addresses.  Location accuracy is
   unknown; I have done no more than anecdotal testing to make sure it
   identifies the correct location (but in all the anecdotal tests it
   works).  See the section on address recognition below for more on
   this.

5. The I/O framework is unable to handle files larger than 4GB (and is
   probably unhappy well before that).  It could be adapted to support
   64-bit files, but only on operating systems with large file support.
   This isn't as bad as it seems, as a corpus can be easily partitioned
   into multiple index files; the query results against each index could
   be easily merged.  (The indexes could even be on different hosts,
   which might be desirable anyway.)  While this is simple, the
   infrastructure to scatter queries and gather results against a
   distributed set of indexes hasn't been implemented.

6. The intermediate format used during indexing is a human-readable text
   format which is processed by Unix 'sort'.  This isn't the most
   efficient way to merge indexes, and the resulting binary search index
   isn't the most efficient disk-based data storage system.  A production
   implementation would do things differently.  (That said, indexing time
   isn't all that ridiculous.)

7. The address processing code and the data which feeds it is wholly
   U.S.-centric.  Significant changes must be made to handle other
   countries (for example, most have non-numeric postal codes), and I
   have no idea where to get public domain street address location data
   for other countries.
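The "simple TF/IDF ranking" mentioned in caveat 2 is just the standard
textbook term weighting.  The sketch below is purely illustrative -- it is
not code from this submission, and all of the names in it are invented:

    // Textbook TF/IDF scoring, sketched only to show what "simple
    // ranking" would mean here.  Not part of the submission.
    #include <cmath>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    // score(d, q) = sum over query terms t of tf(t, d) * log(N / df(t))
    static double tfidf_score(const std::map<std::string, int> &doc_term_counts,
                              const std::vector<std::string> &query,
                              const std::map<std::string, int> &doc_freq,
                              int total_docs) {
        double score = 0;
        for (const std::string &t : query) {
            auto tf = doc_term_counts.find(t);   // term count in this document
            auto df = doc_freq.find(t);          // documents containing the term
            if (tf == doc_term_counts.end() || df == doc_freq.end()) continue;
            score += tf->second * std::log((double)total_docs / df->second);
        }
        return score;
    }

    int main() {
        // Invented figures, roughly in the spirit of the 900K-document corpus.
        std::map<std::string, int> doc = {{"robots", 3}, {"bookstore", 1}};
        std::map<std::string, int> df  = {{"robots", 1200}, {"bookstore", 800}};
        std::vector<std::string> query = {"robots", "bookstore"};
        std::printf("score = %.3f\n", tfidf_score(doc, query, df, 900000));
    }

How best to fold such a score together with distance from the target
address is exactly the open question noted in caveat 2.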
IMPLEMENTATION NOTES

This software is built from 1830 lines of C code (the TIGER/Line and FIPS
indexing and address processing system, written in C to facilitate
eventual reuse elsewhere), 818 lines of C++ code (the ripper module,
keyword/geographical index, and query system, not including the Google
source code provided with the contest materials), a 136-line CGI script
written in Python, and 61 lines of Bourne shell glue.  (You can compile
the C code with a C++ compiler if you want.)

Both the TIGER/Line/FIPS index and the actual corpus index are built by
generating an intermediate text file.  Each line of the text file is a key
in the index, and the lines are designed to compare lexicographically in
key order.  The Unix 'sort' utility then performs the heavy lifting of
actually ordering the keys.  Conveniently, segments of the source data can
be processed independently, their output sorted independently, and the
final results combined with "sort -m".  (The Bourne shell scripts automate
this process on a single machine.)  The sorted rows are then converted
directly into a sorted fixed-record-length table which may be queried by
binary search.  The details are more complex; TIGER/Line processing
requires four separate processing stages, but the basic idea is the same.
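To make the "sorted text to binary-searchable table" step concrete, here
is a minimal sketch, assuming an invented "key<TAB>value" line format and
fixed field sizes (the real tables store different fields):

    // Illustrative sketch only: turn sorted "key<TAB>value" lines into a
    // table of fixed-length records that can later be probed by binary
    // search.  Record layout and sizes are invented for this example.
    #include <cstring>
    #include <fstream>
    #include <iostream>
    #include <string>

    struct Record {
        char key[64];      // zero-padded; comparable with memcmp()
        char value[64];
    };

    int main(int argc, char **argv) {
        if (argc != 2) {
            std::cerr << "usage: build-table OUTPUT < sorted-input\n";
            return 1;
        }
        std::ofstream out(argv[1], std::ios::binary);
        std::string line;
        while (std::getline(std::cin, line)) {        // input is already sorted
            std::string::size_type tab = line.find('\t');
            if (tab == std::string::npos) continue;   // skip malformed lines
            Record r = {};
            std::strncpy(r.key, line.substr(0, tab).c_str(), sizeof(r.key) - 1);
            std::strncpy(r.value, line.substr(tab + 1).c_str(), sizeof(r.value) - 1);
            out.write(reinterpret_cast<const char *>(&r), sizeof(r));
        }
        return 0;
    }

Because the records are fixed-length and already in key order, a reader
can find any key with a plain binary search (for example, std::lower_bound
over the record array) and needs no in-memory index at all.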
THE ADDRESS MAP

The address map is a file built from TIGER/Line and FIPS data which is
used to quickly convert addresses in text form into geographic
coordinates.  The address map is used during document indexing and also
during query (to look up the address the user enters as the target
location).

Street addresses have five basic components (some optional): number
(1234), street (Main Street), city (Philadelphia), state (PA), and ZIP
(12345).  In the address map, addresses of even parity are treated
separately from addresses with odd parity (since a range of odd addresses
often appears in a different location from the interleaved range of even
addresses), so we add a sixth component, parity (even or odd).

In the address map, all addresses with the same parity, street, city,
state and ZIP are combined into a "zone".  For efficiency, not all
individual addresses will be represented in a zone; if a range of
addresses lies along a line (usually a city block), only the endpoints (or
"control points") will be stored; the locations of addresses in the middle
are interpolated.  A zone is stored as a series of these control points,
each of which is a latitude/longitude location, an address, and the side
of the street the address is on (left or right, if you are facing the
direction of increasing addresses).  These control points are stored in
increasing order, so that a search for a particular address within a zone
could use binary search (though the current implementation uses a linear
search; zones aren't big).  A delta compression scheme is used to
compactly store series of control points which differ (in location and
address number) only slightly from their neighbors.

Given a zone, then, we can easily find an address.  To find the right
zone, each unique name (such as "Main Street" or "Philadelphia") is
associated with a list of zones.  Note that the "Main Street" name will be
associated with the zones for Main Street in every city.  The names are
tagged with a type:

    E   Street name for even parity addresses on the street
    O   Street name for odd parity addresses on the street
    C   City name
    S   State name

Zones are ordered in the address map by ZIP (and alphabetically by street
within the ZIP), so zones corresponding to the same name are often
adjacent.  For each name, a list of the offsets to each such contiguous
region is stored in order by offset (this is the "zone range" for the
name).  A lookup table holds the zone offsets for each ZIP code (since
zones are ordered by ZIP, the zones within a ZIP are always in a single
contiguous range).

Because zone ranges are internally sorted, multiple zone ranges can easily
be combined to find the intersection -- the zones which are shared by all
the ranges.  When an address is processed, the components of the address
(except the number) are each converted into a zone range, the intersection
is taken, and the resulting zone range is searched for the target number.
Note that addresses may be ambiguous, and the system could return all the
matching possibilities (but currently it is not set up to do this).
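To illustrate the control-point scheme, here is a minimal sketch of
interpolating an address number between two control points.  The type and
field names are invented for the example; the real map stores zones in the
delta-compressed form described above and reaches them through the name
and ZIP tables:

    // Minimal sketch of zone lookup by interpolation between control
    // points.  Types and field names are invented for illustration.
    #include <cstdio>
    #include <vector>

    struct ControlPoint {
        int number;        // street address number at this point
        double lon, lat;   // location of this point
        char side;         // 'L' or 'R', facing increasing addresses
    };

    // One zone: all addresses of one parity on one street/city/state/ZIP,
    // with control points stored in increasing address order.
    struct Zone {
        std::vector<ControlPoint> points;
    };

    // Linear search (zones are small), then interpolate between neighbors.
    bool locate(const Zone &z, int number, double *lon, double *lat) {
        for (size_t i = 0; i + 1 < z.points.size(); ++i) {
            const ControlPoint &a = z.points[i], &b = z.points[i + 1];
            if (number < a.number || number > b.number) continue;
            double t = (b.number == a.number)
                           ? 0.0
                           : double(number - a.number) / (b.number - a.number);
            *lon = a.lon + t * (b.lon - a.lon);
            *lat = a.lat + t * (b.lat - a.lat);
            return true;
        }
        return false;
    }

    int main() {
        // Control points taken from the geo-client example in the quick start.
        Zone bayshore{{{2100, -122.09452, 37.42271, 'L'},
                       {2642, -122.09937, 37.42743, 'L'}}};
        double lon, lat;
        if (locate(bayshore, 2400, &lon, &lat))
            std::printf("2400 => %.5f, %.5f\n", lon, lat);
    }

Fed the two control points from the geo-client example above, the
interpolation reproduces the location reported there for number 2400.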
BUILDING THE ADDRESS MAP

The source TIGER/Line data is a set of what are effectively relational
database tables, with one set for each county.  The first step of the
processing pipeline, "geo-tiger-to-1", reads the TIGER data for each
county, joins these tables against each other in memory (counties aren't
big enough for this to be a problem), and outputs a series of records
which identify the ZIP code, street name, address parity, a control
point's street address (TIGER/Line shares the notion of using control
points and interpolation), and the geographical location of the control
point.

These records, after sorting (with 'sort'), are used directly to construct
the zone data and ZIP code tables by the second step of the pipeline,
"geo-1-to-2".  Whenever this step encounters a new ZIP code, parity, or
street name, it writes a record to output indicating the address of the
beginning and end of the corresponding zone.  These zones are combined
here with FIPS data ("geo-fips-to-2"), which supplies names for cities and
states.  The FIPS data is keyed to ZIP codes.

The names are also run through a canonicalizer ("geo-2-to-2a") which
rewrites every name according to a set of rules.  These rules reduce
variant terms ("Avenue", "Ave", and "Av", or "North East", "Northeast",
"N.E." and "NE", or "Tenth Street", "10th St" and "10 St") to a single
canonical form.  The rules are also stored in the index so the query
processor can access them.

These names -- now tagged with their type -- are then sorted and run
through the penultimate step, "geo-2-to-3".  This step constructs the zone
range lists associated with each name, and writes out records associating
each name with the address of its zone range list.  Finally, the last
step, "geo-3-to-index", writes a table of names (suitable for binary
searching) which associates each name with its corresponding zone range
list.

All of these steps are invoked by the master "geo-index.sh" shell script.

ADDRESS RECOGNITION AND PROCESSING

Addresses come in a variety of forms.  Even after variant terms are
canonicalized using alias rules, addresses still vary.  Some will include
city, state, and ZIP; some will be missing one or more of those.  Some
will include extra text, such as apartment or "suite" numbers or corporate
routing information, between the street and the city.  In any case, it is
often very difficult to find the dividers between the street name, city
name, and state.  Punctuation or typography may be used; if physical
whitespace such as a newline is used, it may be represented by HTML which
we can't really access.

So, the address parser is a recursive descent parser with backtracking.
It is able to accept up to several "garbage words" between the number and
the street as well as between the street and the city; it will try to
"greedily" parse as many words of the input as find a match in the name
table.  If multiple possible matches exist, then they will be tried in
order until one results in a valid parse.

The result is usually quite good; you can use the "geo-client" tool, which
will show you what it believes the "proper" form of the address is (after
aliases are reduced and components separated).  Bear in mind that the
parser completely ignores punctuation.  Note that while "2400 Bayshore
Parkway, Mountain View, CA 94043" is a correct address, so is "2400
Bayshore Pky 94043".

When parsing documents, every number which occurs in the source text is
considered to potentially mark the beginning of an address, and the number
and subsequent text is fed to the address parser.  Most numbers, of
course, are not addresses, and they are generally rejected very quickly
(since they're not followed by a valid street name, much less city and/or
state and/or ZIP).
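The scanning step can be pictured roughly as follows.  parse_address() is
only a stand-in for the real recursive-descent parser (its actual name and
interface differ), so the stub below rejects everything; the real parser
consults the address map and backtracks over garbage words as described
above:

    // Rough sketch of the document-scanning step: every number in the
    // token stream is treated as a potential start of an address and
    // handed to the address parser, which rejects non-addresses quickly.
    #include <cctype>
    #include <cstdio>
    #include <string>
    #include <vector>

    struct Location { double lon = 0, lat = 0; };

    // Stand-in for the real parser: a real implementation would consult
    // the address map's name tables; this stub simply rejects everything.
    static bool parse_address(const std::vector<std::string> &tokens,
                              size_t start, Location *where) {
        (void)tokens; (void)start; (void)where;
        return false;
    }

    static std::vector<Location>
    scan_document(const std::vector<std::string> &tokens) {
        std::vector<Location> found;
        for (size_t i = 0; i < tokens.size(); ++i) {
            if (tokens[i].empty() || !std::isdigit((unsigned char)tokens[i][0]))
                continue;                         // only a number can start an address
            Location where;
            if (parse_address(tokens, i, &where)) // usually fails very fast
                found.push_back(where);
        }
        return found;
    }

    int main() {
        std::vector<std::string> tokens = {"Call", "us", "at", "2400",
                                           "Bayshore", "Pky", "94043"};
        std::printf("%zu address(es) found\n", scan_document(tokens).size());
    }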
Nevertheless, a number of issues remain in address detection.  A test
shows that it only recognizes 80% of a human-culled set of 200 addresses
from one of the raw corpuses.  Commercial Web mapping engines do
substantially better.  In particular, they are able to deal with addresses
that have missing or incorrect "street types" (Street vs. Avenue vs.
Blvd...) and missing or incorrect directional qualifiers (prefix or suffix
East, West, ...).  My recognizer considers those to be part of the street
name, which must match.  They are broken out in the original TIGER/Line
data, so we could be more lenient here.

Street names are also present in a variety of forms; while TIGER does
store alternate names, it doesn't always get them all.  One address failed
because it referred to "Sandpoint Way" rather than "Sand Point Way".
Furthermore, some addresses are given by intersection ("the corner of
Canal and Lexington"), which my recognizer cannot handle at all; the
document scanner won't even pick it up as a potential address.

Sadly, of the four addresses in , only two are successfully recognized.
Luckily, the Mountain View headquarters' address is one of them; the New
York address is also fine.  However, the Chicago address is missing the
"Avenue" suffix (it just says "401 N. Michigan"), and the Atlanta address
is missing a directional suffix (it says "1200 Abernathy Road" instead of
"1200 Abernathy Road NE").

Finally, the TIGER/Line data itself is incomplete and sometimes
inaccurate; probably half of the unrecognized 20% of addresses were due to
omissions in the TIGER/Line database.  Who knows how many addresses are
recognized but plotted incorrectly.  (Anecdotally, they're generally at
least close, which is what matters for this purpose.)  It's a messy world
out there.

DOCUMENT INDEXING

The keyword indexing system is an extremely simple inverted index of which
I will not speak further.  Geographical indexing is more interesting.

Coordinates are converted to the Mercator projection.  The Mercator grid
is covered multiple times by a set of "regions".  Regions come in a
variety of sizes; at each size, the Mercator plane is divided into a
regular grid of that size, and each grid square is a "region".  There are
16 sizes, in powers of two between a hemisphere of the globe and 1/65536
thereof.  Each point, therefore, "belongs" to 16 different regions.  Each
region has a grid coordinate.

The Mercator projection extends to infinity at the poles; if at a given
size the Y grid coordinate is greater than 65536, it is ignored.  (This
has the effect of keeping the minimum resolution roughly constant with
latitude.)  Luckily, there are no streets in TIGER/Line near the poles,
otherwise I would have to deal with some irksome singularities.

When an address is recognized in the corpus, all 16 associated region
codes are associated with the document, as if they were a special kind of
inverted index term.  ("R13000123000456" might be "region size 13, grid
coordinate (123,456)".)  When a geographical query is performed (with a
center point and a radius), a 3x3 set of regions of appropriate size is
picked that will cover the circle completely but not include too much
else.  These regions are then combined into the query process as if they
were terms being OR'd together (and then AND'd with the rest of the user's
search terms).
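To make the region scheme concrete, here is a rough sketch of turning a
point into its per-size region codes.  The Mercator formula, the grid
constants, and the exact "R..." code format are assumptions for
illustration, not the submission's actual encoding:

    // Illustrative sketch: compute region codes for a point at 16 grid
    // sizes.  The projection, constants, and code format are assumptions.
    #include <cmath>
    #include <cstdio>
    #include <string>
    #include <vector>

    // Project latitude/longitude (degrees) onto a unit-width Mercator
    // plane: x in [0,1), y growing without bound toward the poles.
    static void mercator(double lon_deg, double lat_deg, double *x, double *y) {
        const double pi = 3.14159265358979323846;
        double lat = lat_deg * pi / 180.0;
        *x = (lon_deg + 180.0) / 360.0;
        *y = 0.5 - std::log(std::tan(pi / 4.0 + lat / 2.0)) / (2.0 * pi);
    }

    // One region code per size: size 0 is hemisphere-scale, and each
    // following size halves the cell edge.  Cells whose y index would
    // overflow the grid (very near the poles) are simply skipped.
    static std::vector<std::string> region_codes(double lon, double lat) {
        std::vector<std::string> codes;
        double x, y;
        mercator(lon, lat, &x, &y);
        for (int size = 0; size < 16; ++size) {
            long cells = 2L << size;            // grid cells per axis
            long gx = (long)std::floor(x * cells);
            long gy = (long)std::floor(y * cells);
            if (gy < 0 || gy > 65536) continue; // ignore polar overflow
            char buf[64];
            std::snprintf(buf, sizeof buf, "R%02d%06ld%06ld", size, gx, gy);
            codes.push_back(buf);
        }
        return codes;
    }

    int main() {
        for (const std::string &c : region_codes(-122.0972, 37.42532))
            std::printf("%s\n", c);
    }

At indexing time each of these codes becomes an extra "term" for the
document; at query time, the size whose 3x3 block of cells just covers the
query circle is chosen and those nine codes are OR'd together, as
described above.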
The index stores the address locations found for each document, so when
the documents are retrieved the addresses can be checked to see if they
are actually within the specified radius of the specified center.  The
results can also be sorted by distance.  Note that a single document may
of course have many addresses.

As with the TIGER/Line processing, document indexing uses an intermediate
text file format.  However, there is only a single such format, which
associates document IDs with URLs, titles and addresses, and associates
search terms (keywords and region codes) with documents.  This file is
read twice due to certain index layout constraints, but the index itself
is quite simple by comparison to the address map.  It contains a list of
documents, a list of locations for each document, and two binary search
tables (one for keywords, one for region codes) that point to zones which
contain lists of document identifiers.

The query client "idx-client" outputs a machine-readable (but also
human-readable) results form; the CGI script "query.cgi" runs this program
and parses its output.  A Web server running a Python CGI script running a
search program isn't the most efficient way to do things; in production,
the whole thing would be tied in to some big crazy application server.

PORTABILITY

With one exception, the included code is designed to be completely
portable, using only standard C and C++ libraries.  The exception is the
raw I/O layer, which comes in two flavors (both of which conform to the
same header file); one flavor uses UNIX mmap() for performance, the other
uses only C stdio for maximum portability.  The system has been tested
with both, but is configured to use the mmap() layer by default.

The code has, however, only been tested on a Debian Linux 2.4 x86 system,
using gcc/g++ 2.95.  Your mileage may vary on other platforms.  The
Makefile, shell scripts, and Python CGI script would have to be replaced
if porting to a system without these facilities.  These are all very
simple and would be easy to reimplement.
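To illustrate the two-flavor raw I/O layer, here is a minimal sketch.  The
names, the interface, and the USE_MMAP switch are invented for this
example (the real header is richer); it only shows how the same
declarations can be backed either by mmap() or by plain stdio:

    // Minimal sketch of a two-flavor raw I/O layer: one small interface,
    // backed either by UNIX mmap() (fast) or by C stdio (portable).
    // Names and interface shape are invented for this example.
    #include <cstddef>
    #include <cstdio>
    #include <cstring>

    struct RawFile;                               // opaque handle
    RawFile *raw_open(const char *path);          // returns 0 on failure
    bool raw_read(RawFile *f, size_t off, void *buf, size_t len);
    void raw_close(RawFile *f);

    #ifdef USE_MMAP  /* ---- mmap() flavor: whole file mapped read-only ---- */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    struct RawFile { const char *data; size_t size; };

    RawFile *raw_open(const char *path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return 0;
        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return 0; }
        void *p = mmap(0, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);                                // the mapping outlives the fd
        if (p == MAP_FAILED) return 0;
        return new RawFile{(const char *)p, (size_t)st.st_size};
    }

    bool raw_read(RawFile *f, size_t off, void *buf, size_t len) {
        if (off + len > f->size) return false;
        std::memcpy(buf, f->data + off, len);     // copy straight out of the map
        return true;
    }

    void raw_close(RawFile *f) { munmap((void *)f->data, f->size); delete f; }

    #else            /* ---- stdio flavor: maximally portable ---- */

    struct RawFile { std::FILE *fp; };

    RawFile *raw_open(const char *path) {
        std::FILE *fp = std::fopen(path, "rb");
        return fp ? new RawFile{fp} : 0;
    }

    bool raw_read(RawFile *f, size_t off, void *buf, size_t len) {
        // long offsets echo the few-GB ceiling mentioned in the caveats
        if (std::fseek(f->fp, (long)off, SEEK_SET) != 0) return false;
        return std::fread(buf, 1, len, f->fp) == len;
    }

    void raw_close(RawFile *f) { std::fclose(f->fp); delete f; }

    #endif

    int main(int argc, char **argv) {
        if (argc != 2) return 1;
        RawFile *f = raw_open(argv[1]);
        if (!f) { std::perror("raw_open"); return 1; }
        char head[16] = {0};
        raw_read(f, 0, head, sizeof head);
        std::printf("first bytes: %.16s\n", head);
        raw_close(f);
    }

Selecting the flavor at build time keeps the rest of the code oblivious to
which implementation is underneath.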
SCALABILITY

Performance measurements are taken on my computer ("gleep"), a modest home
Linux PC with a 1GHz Athlon CPU, ordinary IDE disk subsystem, and 256MB
RAM.  All tests are run with "cold cache" (input files had not been read
recently).

On this machine, conversion of the entire US TIGER/Line data set and FIPS
names into a map file takes a few hours.  This time is invariant; the data
is only published once every 10 years, and is constant with respect to the
corpus size.  We can therefore ignore this phase, considering only the
actual document indexing.

As described above, the indexing process happens in two phases.  The first
phase independently converts portions of the corpus into intermediate
indexes which are each sorted with UNIX 'sort'.  The second phase merges
these files with 'sort -m' and feeds the output into a program which
generates a binary index.  Finally, another level of segmentation may be
achieved by building many binary indexes and querying them in parallel at
run time:

                                             query                                    (5)
                     --------------------------+--------------------------
                     |                         |                         |
             +--------------+          +--------------+          +--------------+
run time - - | binary index | - - - - -| binary index | - - - - -| binary index | - - - - -
             +--------------+          +--------------+          +--------------+
build time           |                         |                         |
                 idx-index                 idx-index                 idx-index        (4)
                     |                         |                         |
                  sort -m                   sort -m                   sort -m         (3)
               ------+------             ------+------             ------+------
              /      |      \           /      |      \           /      |      \
           sort    sort    sort      sort    sort    sort      sort    sort    sort   (2)
             |       |       |         |       |       |         |       |       |
          ripper  ripper  ripper    ripper  ripper  ripper    ripper  ripper  ripper  (1)
             |       |       |         |       |       |         |       |       |
          pprepos pprepos pprepos   pprepos pprepos pprepos   pprepos pprepos pprepos
            / \     / \     / \       / \     / \     / \       / \     / \     / \
          doc doc doc doc doc doc   doc doc doc doc doc doc   doc doc doc doc doc doc (0)

Note: the infrastructure for query distribution is not included (though it
would be simple).  All other forms of distribution pictured here are
represented in the submission.

0. Parsing

   I know nothing about the conversion from raw documents to pre-parsed
   form, but I can only assume that is a scalable process.

1. Ripper

   The "ripper" step's execution time is linearly proportional to the
   size of its input (it remembers nearly nothing from one document to
   the next); the corpus can be subdivided as desired to run any number
   of "ripper" processes concurrently.  To measure its performance, I
   processed an arbitrarily chosen segment of the 900K-document corpus:

      time bzip2 -dc ~/google/data/pprepos.03.bz2 | ./goo-ripper --geocode all.map 0 - > 03.rip
      Welcome to the Google Programming Contest ripper.
      Please see the file LICENSE for terms of use of the data and code.
      bzip2 -dc ~/google/data/pprepos.03.bz2  34.05s user 0.54s system 13% cpu 4:20.35 total
      ./goo-ripper --geocode all.map 0 - > 03.rip  115.79s user 2.59s system 45% cpu 4:20.36 total

   This segment contains 16,419 documents.  Even with the overhead of
   decompression, the ripper can process around 60 documents per second.
   With 1,000 machines equivalent to my own, a two-billion-document
   corpus could be run through the ripper in under 10 hours.

2. Sort

   The task of sorting each ripper's output can be parallelized as
   appropriate by adjusting the size of the portions individually sorted.
   Using the same example segment gives the following time on gleep:

      time sort -f < 03.rip > 03.rip.sort
      sort -f < 03.rip > 03.rip.sort  22.24s user 8.95s system 38% cpu 1:21.21 total

   With this particular segment size (16,419 documents), sorting is
   faster than ripping; with 1,000 machines like my own, the
   two-billion-document corpus would take under 3 hours.

3. Merge

   This step's running time is harder to estimate.  It depends on how
   many indexes are constructed, and on the complexity, performance, and
   parallelization opportunities inherent in merging large collections of
   large sorted files.  The entire sample 900K corpus merges relatively
   quickly:

      time sort -mf rip/*.rip > all.rip
      sort -mf rip/*.rip > all.rip  316.15s user 162.13s system 26% cpu 30:22.19 total

   If each binary index built were the same size as this corpus, then the
   two-billion-document corpus would have 2,222 binary indexes.  With
   1,000 machines equivalent to my own, the merging step would take an
   hour and a half.

   However, it is almost certainly not optimal to use over 2,000
   individual indexes.  I will leave it to you to estimate the scaling
   factors when fewer, larger indexes are built.  (If nothing else,
   parallelization is harder.)  Nevertheless, it is clearly a feasible
   task.  (The arithmetic behind the estimates in steps 1-3 is collected
   in the sketch below.)
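For reference, the back-of-the-envelope arithmetic behind the estimates in
steps 1-3 can be redone mechanically; the figures below are simply the
measurements quoted above:

    // Back-of-the-envelope check of the scaling estimates quoted above,
    // using the timings measured on "gleep".
    #include <cstdio>

    int main() {
        const double corpus_docs = 2e9;      // target corpus size
        const double machines = 1000;        // machines comparable to gleep

        const double rip_docs = 16419, rip_secs = 260.35;    // 4:20.35
        const double sort_docs = 16419, sort_secs = 81.21;   // 1:21.21
        const double merge_docs = 900e3, merge_secs = 1822;  // 30:22 for 900K docs

        std::printf("ripper: %.0f docs/s -> %.1f hours on %g machines\n",
                    rip_docs / rip_secs,
                    corpus_docs / (rip_docs / rip_secs) / machines / 3600, machines);
        std::printf("sort:   %.0f docs/s -> %.1f hours on %g machines\n",
                    sort_docs / sort_secs,
                    corpus_docs / (sort_docs / sort_secs) / machines / 3600, machines);
        std::printf("merge:  %.0f indexes of 900K docs, ~%.1f hours on %g machines\n",
                    corpus_docs / merge_docs,
                    (corpus_docs / merge_docs) * merge_secs / machines / 3600, machines);
    }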
4. Index

   The index builder must be run twice on the merged input, but each
   run's time is proportional only to the length of the input.  Indexing
   our favorite 16K-document segment is very fast:

      time ./idx-index 03.index < 03.rip.sort
      read: EOF
      notice: indexing stage 1
      ./idx-index 03.index < 03.rip.sort  4.54s user 0.56s system 68% cpu 7.420 total

      time ./idx-index 03.index < 03.rip.sort
      notice: indexing stage 2
      ./idx-index 03.index < 03.rip.sort  5.62s user 0.65s system 98% cpu 6.342 total

   The second part of this test was with a 'warm cache', which does not
   accurately reflect the situation for a much larger index.
   Nevertheless, the time for the second run should be no worse than the
   time for the first.

   That's 1,193 documents per second.  If the two-billion-document corpus
   were built into 100 individual binary indexes on 100 machines
   equivalent to my own, the final indexing step would take under five
   hours.  Using more indexes and more machines will reduce the time, at
   the expense of trouble when running queries.  (It probably makes sense
   to perform this step on a smaller number of machines, each of which
   has a very fast I/O system.)

5. Query

   Query scalability is slightly harder to judge.  Individual queries
   vary, and cache effects are critical (do most queries share a common
   'working set' of 'root pages', or does each query thrash the cache?).
   Nevertheless, anecdotal evidence suggests that queries run in a few
   tens of milliseconds:

      time ./idx-client all.index all.map \
          -5 '2461 E Madison St 98112' books > /dev/null
      ./idx-client all.index all.map \
          -5 '2461 E Madison St 98112' books > /dev/null  0.04s user 0.00s system 134% cpu 0.030 total

   If the two-billion-document corpus is divided into 900,000-document
   indexes, each of which gets its own machine comparable to mine, then
   the response time for queries against the large corpus should be
   similar.  With fewer, larger indexes, the time will be worse.  With
   machines that have more memory, the time will be better; with some
   revisions to improve cache locality (using B*tree indexes rather than
   simple binary search tables), the time will be better still.
   Nevertheless, I believe query times are rapid enough for scaling to be
   quite feasible.

I argue, therefore, that this submission could be used almost as-is to
index and query two billion documents on a farm of 1,000 relatively cheap
machines in well under 24 hours.  Relatively simple improvements (like
using a custom intermediate format rather than highly redundant text
files) would likely improve performance even further.

None of this should be surprising.  The techniques I use are similar to
those used in any full-text indexing system; even the geographical index
is basically an inverted index with some special terms.  These systems are
known to scale quite well with some effort; if Google itself can run with
two billion documents, then my system should also be able to.

AUTHORS

I am the sole author of this submission:

    Daniel Egnor              egnor@ofb.net
    424 West End Ave. #4G     http://ofb.net/~egnor/
    New York, NY 10024        +1 212 579-8562