GOOGLE PROGRAMMING CONTEST SUBMISSION: GEO SEARCH

Daniel Egnor (egnor@ofb.net)
30 April, 2002

OVERVIEW

This software detects street addresses (such as "2400 Bayshore Parkway,
Mountain View, CA 94043") in a corpus of text and converts them into
geographical coordinates (such as "122.09720 W, 37.42532 N").  These
coordinates are indexed in a two-dimensional index along with a
conventional keyword index of the corpus.  A query processor is then able
to rapidly process queries which ask for documents that match certain
keywords and/or contain addresses within a certain radius of a specified
target address.  (Think "bookstores near me".)

Address detection and conversion is done using the TIGER/Line digital
mapping data published by the US Census Bureau (free of charge, public
domain).  The TIGER/Line data is converted into an indexed form suitable
for very rapid address lookup (and, more importantly, very fast rejection
of anything that is not a valid street address).

DEPENDENCIES

The main dependency (other than GNU make, a C/C++ compiler, etc.) required
to use this software is the TIGER/Line data, which may be downloaded free
of encumbrance from .  You will want every .zip file from every
subdirectory of that URL; a recursive download tool will be helpful (I
used "wget").  This is a large download.

You will also need the FIPS 55 geographical name data file, available
freely from .

The CGI interface is a simple Python script, so you will need Python to
use it (any moderately recent version should suffice).  It generates
references to MapBlast, a commercial Web-based mapping service.  If
MapBlast is unavailable, everything will work, but you will not see a map
of the results for a geographical query.  Since MapBlast is a commercial
service, feel free to ignore the mapping functionality for evaluation.

QUICK START

1. Download TIGER/Line and FIPS data.

   Start this first, because it will take a while.  Download everything
   under .  The important files are the .zip files in each state's
   directory.  Place these files somewhere on your system so that you can
   find them again.

   Also download the FIPS 55 geographical name database, and save it
   somewhere.  Make sure that you have enough space for all this data,
   plus a few extra gigabytes for intermediate files during indexing.

2. Build my code (run "make").

   Invoke GNU make in my source directory (the one with this README in
   it).  This assumes a UNIX-like system.  (See PORTABILITY below for
   notes about porting to other systems.)  Hopefully all will go well at
   this point, and it will build a happy little stable of executables.
   (Among other things, it will build a modified version of the Google
   "ripper", which will be called "goo-ripper".)

3. Index the TIGER/Line data.

   Once you've finished downloading the TIGER/Line data, use the
   "geo-map.sh" shell script in my source directory to index it.
   Specifically, in my source directory, run this:

      find /your/tiger/files -name '*.zip' | ./geo-map.sh fip55all.txt ./alias.txt ./all.map

   This should chug for a few hours, occasionally emitting complaints
   that not every file is in every .zip file, and perhaps one or two
   complaints about invalid input lines.  (Blame the Census Bureau.)  If
   all goes well, you should end up with an 'all.map' file that's a few
   hundred MB in size.

4. Test the TIGER/Line index.

   In my source directory, run this:

      ./geo-client ./all.map

   Input an address, on a single line.  For example, type "2400 Bayshore
   Parkway, Mountain View, CA 94043" and press Enter.
   The "geo-client" program should indicate success and offer location
   information, something like this:

      SUCCESS: 2400/Bayshore Pky/Mountain View/CA/94043
      Side = L
      Start = 2100 @ -122.09452, 37.42271
      At = 2400 @ -122.0972, 37.42532
      End = 2642 @ -122.09937, 37.42743

   Feel free to try more addresses, and experiment with address formats
   (I've tried to make the detector as lenient as possible).  When you're
   done, press ^C or ^D.

   If this works, you're in good shape.  If this doesn't work... *gulp*.
   "It works for me, really it does."  Feel free to contact me.

5. Index some documents!

   The indexer consumes pre-parsed repository files (*.pprepos.bz2).
   Collect, let's say, the 900K-document sample corpus.  (Significantly
   larger corpuses will require special processing; see the
   implementation notes for more information.)  Have all the .bz2 files
   stored somewhere, and:

      find /your/pprepos/files -name '*.bz2' | ./idx-index.sh ./all.map ./all.index

   After some hours, this should generate a file 'all.index'.  (For the
   sample corpus, it's a few hundred MB.)  It may give a "read: EOF"
   message, which you may ignore.

6. Test some queries.

   In the source directory, run this:

      ./idx-client ./all.index ./all.map -5 '2400 Bayshore Pky 94043' robots

   With the 900K-document sample corpus, you should see four results,
   formatted something like this:

      ok 4 results found
      5:37.42532:-122.0972:2400 Bayshore Pky 94043
      http://www.frc.ri.cmu.edu/robotics-faq/8.html
      What companies sell or build robots?
      0.77411696:37.418939:-122.09368:2133 Leghorn St, Mountain View, CA 94043
      http://www.frc.ri.cmu.edu/robotics-faq/10.html
      What Robotics related products are there?
      2.2514967:37.41528:-122.07504:1235 Pear Ave, Mountain View, CA 94043
      http://soe.stanford.edu/programs/research/labscenters.html
      SUSOE - Laboratories, Research Centers, and Affiliate Programs
      3.9878246:37.427179:-122.14233:430 Sherman Ave, Palo Alto, CA 94306
      http://mason.gmu.edu/~plepage/Pubs98.html
      Pamela LePage -- Publications
      4.825617:37.40464:-122.14528:3801 Miranda Ave, Palo Alto, CA 94304

   Feel free to try other queries, with different addresses (instead of
   2400 Bayshore), different distances (instead of 5km), different
   keywords (instead of 'robots'; each keyword should be its own
   argument), etc.

7. Set up the CGI interface.

   Edit the "query.cgi" script and set the pathnames hardcoded at the
   top.  Place this script somewhere such that a Web server will execute
   it.  Visit the script; the user interface should be self-explanatory.

CAVEATS

The current implementation has a number of weaknesses that should be
pointed out.  None of these flaws are believed to be insuperable.

1. The sample corpus provided is not an interesting demonstration of this
   technology.  Educational pages are not address-rich; those addresses
   that do appear are usually the main address of a university.  I
   believe a more general corpus would have more addresses of businesses
   and service providers that would yield more interesting query results,
   especially in combination with good ranking (see below).

2. The keyword search facility is extremely primitive.  There are no
   boolean queries, phrase queries, or proximity search.  There is no
   ranking; pages are simply sorted by increasing distance from the
   center address specified (if any).  The usual problems with a simple
   conjunctive keyword query are apparent; hits often match lengthy
   documents that happen to contain the words in the query and an address
   in the area of interest, but are not actually relevant to the query.
   Simple TF/IDF ranking would probably work wonders (see the sketch
   after this list).  Google obviously has very good keyword search
   relevance ranking algorithms, so I didn't invest any time in that
   area.  The best way to combine keyword relevance with geographical
   distance into a unified result presentation order remains unknown,
   however.

3. The query client collects all the results for a query (no matter how
   many), sorts them (by distance), and outputs them all on a single
   page.  Pathological queries like "the" (no noise word removal here!)
   will run for a minute and then generate an enormous result page.
   Incremental result generation is very possible, simply unimplemented.
   Even with distance ranking, the geographical index could be used to
   find closer results before farther ones.

4. The address detector is not perfect.  With one test set of addresses
   (extracted by hand from a corpus subset), it was only able to
   recognize 80% of them as valid addresses.  Location accuracy is
   unknown; I have done no more than anecdotal testing to make sure it
   identifies the correct location (but in all the anecdotal tests it
   works).  See the section on address recognition below for more on
   this.

5. The I/O framework is unable to handle files larger than 4GB (and is
   probably unhappy well before that).  It could be adapted to support
   64-bit files, but only on operating systems with large file support.
   This isn't as bad as it seems, as a corpus can be easily partitioned
   into multiple index files; the query results against each index could
   be easily merged.  (The indexes could even be on different hosts,
   which might be desirable anyway.)  While this is simple, the
   infrastructure to scatter queries and gather results against a
   distributed set of indexes hasn't been implemented.

6. The intermediate format used during indexing is a human-readable text
   format which is processed by Unix 'sort'.  This isn't the most
   efficient way to merge indexes, and the resulting binary search index
   isn't the most efficient disk-based data storage system.  A production
   implementation would do things differently.  (That said, indexing time
   isn't all that ridiculous.)

7. The address processing code and the data which feeds it is wholly
   U.S.-centric.  Significant changes must be made to handle other
   countries (for example, most have non-numeric postal codes), and I
   have no idea where to get public domain street address location data
   for other countries.
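The "simple TF/IDF ranking" mentioned in caveat 2 is just the standard
textbook term weighting.  The sketch below is purely illustrative -- it is
not code from this submission, and all of the names in it are invented:

    // Textbook TF/IDF scoring, sketched only to show what "simple
    // ranking" would mean here.  Not part of the submission.
    #include <cmath>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    // score(d, q) = sum over query terms t of tf(t, d) * log(N / df(t))
    static double tfidf_score(const std::map<std::string, int> &doc_term_counts,
                              const std::vector<std::string> &query,
                              const std::map<std::string, int> &doc_freq,
                              int total_docs) {
        double score = 0;
        for (const std::string &t : query) {
            auto tf = doc_term_counts.find(t);   // term count in this document
            auto df = doc_freq.find(t);          // documents containing the term
            if (tf == doc_term_counts.end() || df == doc_freq.end()) continue;
            score += tf->second * std::log((double)total_docs / df->second);
        }
        return score;
    }

    int main() {
        // Invented figures, roughly in the spirit of the 900K-document corpus.
        std::map<std::string, int> doc = {{"robots", 3}, {"bookstore", 1}};
        std::map<std::string, int> df  = {{"robots", 1200}, {"bookstore", 800}};
        std::vector<std::string> query = {"robots", "bookstore"};
        std::printf("score = %.3f\n", tfidf_score(doc, query, df, 900000));
    }

How best to fold such a score together with distance from the target
address is exactly the open question noted in caveat 2.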
IMPLEMENTATION NOTES

This software is built from 1830 lines of C code (the TIGER/Line and FIPS
indexing and address processing system, written in C to facilitate
eventual reuse elsewhere), 818 lines of C++ code (the ripper module,
keyword/geographical index, and query system, not including the Google
source code provided with the contest materials), a 136-line CGI script
written in Python, and 61 lines of Bourne shell glue.  (You can compile
the C code with a C++ compiler if you want.)

Both the TIGER/Line/FIPS index and the actual corpus index are built by
generating an intermediate text file.  Each line of the text file is a key
in the index, and the lines are designed to compare lexicographically in
key order.  The Unix 'sort' utility then performs the heavy lifting of
actually ordering the keys.  Conveniently, segments of the source data can
be processed independently, their output sorted independently, and the
final results combined with "sort -m".  (The Bourne shell scripts automate
this process on a single machine.)  The sorted rows are then converted
directly into a sorted fixed-record-length table which may be queried by
binary search.  The details are more complex; TIGER/Line processing
requires four separate processing stages, but the basic idea is the same.
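To make the "sorted text to binary-searchable table" step concrete, here
is a minimal sketch, assuming an invented "key<TAB>value" line format and
fixed field sizes (the real tables store different fields):

    // Illustrative sketch only: turn sorted "key<TAB>value" lines into a
    // table of fixed-length records that can later be probed by binary
    // search.  Record layout and sizes are invented for this example.
    #include <cstring>
    #include <fstream>
    #include <iostream>
    #include <string>

    struct Record {
        char key[64];      // zero-padded; comparable with memcmp()
        char value[64];
    };

    int main(int argc, char **argv) {
        if (argc != 2) {
            std::cerr << "usage: build-table OUTPUT < sorted-input\n";
            return 1;
        }
        std::ofstream out(argv[1], std::ios::binary);
        std::string line;
        while (std::getline(std::cin, line)) {        // input is already sorted
            std::string::size_type tab = line.find('\t');
            if (tab == std::string::npos) continue;   // skip malformed lines
            Record r = {};
            std::strncpy(r.key, line.substr(0, tab).c_str(), sizeof(r.key) - 1);
            std::strncpy(r.value, line.substr(tab + 1).c_str(), sizeof(r.value) - 1);
            out.write(reinterpret_cast<const char *>(&r), sizeof(r));
        }
        return 0;
    }

Because the records are fixed-length and already in key order, a reader
can find any key with a plain binary search (for example, std::lower_bound
over the record array) and needs no in-memory index at all.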
THE ADDRESS MAP

The address map is a file built from TIGER/Line and FIPS data which is
used to quickly convert addresses in text form into geographic
coordinates.  The address map is used during document indexing and also
during query (to look up the address the user enters as the target
location).

Street addresses have five basic components (some optional): number
(1234), street (Main Street), city (Philadelphia), state (PA), and ZIP
(12345).  In the address map, addresses of even parity are treated
separately from addresses with odd parity (since a range of odd addresses
often appears in a different location from the interleaved range of even
addresses), so we add a sixth component, parity (even or odd).

In the address map, all addresses with the same parity, street, city,
state and ZIP are combined into a "zone".  For efficiency, not all
individual addresses will be represented in a zone; if a range of
addresses lies along a line (usually a city block), only the endpoints (or
"control points") will be stored; the locations of addresses in the middle
are interpolated.  A zone is stored as a series of these control points,
each of which is a latitude/longitude location, an address, and the side
of the street the address is on (left or right, if you are facing the
direction of increasing addresses).  These control points are stored in
increasing order, so that a search for a particular address within a zone
could use binary search (though the current implementation uses a linear
search; zones aren't big).  A delta compression scheme is used to
compactly store series of control points which differ (in location and
address number) only slightly from their neighbors.

Given a zone, then, we can easily find an address.  To find the right
zone, each unique name (such as "Main Street" or "Philadelphia") is
associated with a list of zones.  Note that the "Main Street" name will be
associated with the zones for Main Street in every city.  The names are
tagged with a type:

    E   Street name for even parity addresses on the street
    O   Street name for odd parity addresses on the street
    C   City name
    S   State name

Zones are ordered in the address map by ZIP (and alphabetically by street
within the ZIP), so zones corresponding to the same name are often
adjacent.  For each name, a list of the offsets to each such contiguous
region is stored in order by offset (this is the "zone range" for the
name).  A lookup table holds the zone offsets for each ZIP code (since
zones are ordered by ZIP, the zones within a ZIP are always in a single
contiguous range).

Because zone ranges are internally sorted, multiple zone ranges can easily
be combined to find the intersection -- the zones which are shared by all
the ranges.  When an address is processed, the components of the address
(except the number) are each converted into a zone range, the intersection
is taken, and the resulting zone range is searched for the target number.
Note that addresses may be ambiguous, and the system could return all the
matching possibilities (but currently it is not set up to do this).
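To illustrate the control-point scheme, here is a minimal sketch of
interpolating an address number between two control points.  The type and
field names are invented for the example; the real map stores zones in the
delta-compressed form described above and reaches them through the name
and ZIP tables:

    // Minimal sketch of zone lookup by interpolation between control
    // points.  Types and field names are invented for illustration.
    #include <cstdio>
    #include <vector>

    struct ControlPoint {
        int number;        // street address number at this point
        double lon, lat;   // location of this point
        char side;         // 'L' or 'R', facing increasing addresses
    };

    // One zone: all addresses of one parity on one street/city/state/ZIP,
    // with control points stored in increasing address order.
    struct Zone {
        std::vector<ControlPoint> points;
    };

    // Linear search (zones are small), then interpolate between neighbors.
    bool locate(const Zone &z, int number, double *lon, double *lat) {
        for (size_t i = 0; i + 1 < z.points.size(); ++i) {
            const ControlPoint &a = z.points[i], &b = z.points[i + 1];
            if (number < a.number || number > b.number) continue;
            double t = (b.number == a.number)
                           ? 0.0
                           : double(number - a.number) / (b.number - a.number);
            *lon = a.lon + t * (b.lon - a.lon);
            *lat = a.lat + t * (b.lat - a.lat);
            return true;
        }
        return false;
    }

    int main() {
        // Control points taken from the geo-client example in the quick start.
        Zone bayshore{{{2100, -122.09452, 37.42271, 'L'},
                       {2642, -122.09937, 37.42743, 'L'}}};
        double lon, lat;
        if (locate(bayshore, 2400, &lon, &lat))
            std::printf("2400 => %.5f, %.5f\n", lon, lat);
    }

Fed the two control points from the geo-client example above, the
interpolation reproduces the location reported there for number 2400.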
BUILDING THE ADDRESS MAP

The source TIGER/Line data is a set of what are effectively relational
database tables, with one set for each county.  The first step of the
processing pipeline, "geo-tiger-to-1", reads the TIGER data for each
county, joins these tables against each other in memory (counties aren't
big enough for this to be a problem), and outputs a series of records
which identify the ZIP code, street name, address parity, a control
point's street address (TIGER/Line shares the notion of using control
points and interpolation), and the geographical location of the control
point.

These records, after sorting (with 'sort'), are used directly to construct
the zone data and ZIP code tables by the second step of the pipeline,
"geo-1-to-2".  Whenever this step encounters a new ZIP code, parity, or
street name, it writes a record to output indicating the address of the
beginning and end of the corresponding zone.  These zones are combined
here with FIPS data ("geo-fips-to-2"), which supplies names for cities and
states.  The FIPS data is keyed to ZIP codes.

The names are also run through a canonicalizer ("geo-2-to-2a") which
rewrites every name according to a set of rules.  These rules reduce
variant terms ("Avenue", "Ave", and "Av", or "North East", "Northeast",
"N.E." and "NE", or "Tenth Street", "10th St" and "10 St") to a single
canonical form.  The rules are also stored in the index so the query
processor can access them.

These names -- now tagged with their type -- are then sorted and run
through the penultimate step, "geo-2-to-3".  This step constructs the zone
range lists associated with each name, and writes out records associating
each name with the address of its zone range list.  Finally, the last
step, "geo-3-to-index", writes a table of names (suitable for binary
searching) which associates each name with its corresponding zone range
list.

All of these steps are invoked by the master "geo-index.sh" shell script.

ADDRESS RECOGNITION AND PROCESSING

Addresses come in a variety of forms.  Even after variant terms are
canonicalized using alias rules, addresses still vary.  Some will include
city, state, and ZIP; some will be missing one or more of those.  Some
will include extra text, such as apartment or "suite" numbers or corporate
routing information, between the street and the city.  In any case, it is
often very difficult to find the dividers between the street name, city
name, and state.  Punctuation or typography may be used; if physical
whitespace such as a newline is used, it may be represented by HTML which
we can't really access.

So, the address parser is a recursive descent parser with backtracking.
It is able to accept up to several "garbage words" between the number and
the street as well as between the street and the city; it will try to
"greedily" parse as many words of the input as find a match in the name
table.  If multiple possible matches exist, then they will be tried in
order until one results in a valid parse.

The result is usually quite good; you can use the "geo-client" tool, which
will show you what it believes the "proper" form of the address is (after
aliases are reduced and components separated).  Bear in mind that the
parser completely ignores punctuation.  Note that while "2400 Bayshore
Parkway, Mountain View, CA 94043" is a correct address, so is "2400
Bayshore Pky 94043".

When parsing documents, every number which occurs in the source text is
considered to potentially mark the beginning of an address, and the number
and subsequent text is fed to the address parser.  Most numbers, of
course, are not addresses, and they are generally rejected very quickly
(since they're not followed by a valid street name, much less city and/or
state and/or ZIP).
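The scanning step can be pictured roughly as follows.  parse_address() is
only a stand-in for the real recursive-descent parser (its actual name and
interface differ), so the stub below rejects everything; the real parser
consults the address map and backtracks over garbage words as described
above:

    // Rough sketch of the document-scanning step: every number in the
    // token stream is treated as a potential start of an address and
    // handed to the address parser, which rejects non-addresses quickly.
    #include <cctype>
    #include <cstdio>
    #include <string>
    #include <vector>

    struct Location { double lon = 0, lat = 0; };

    // Stand-in for the real parser: a real implementation would consult
    // the address map's name tables; this stub simply rejects everything.
    static bool parse_address(const std::vector<std::string> &tokens,
                              size_t start, Location *where) {
        (void)tokens; (void)start; (void)where;
        return false;
    }

    static std::vector<Location>
    scan_document(const std::vector<std::string> &tokens) {
        std::vector<Location> found;
        for (size_t i = 0; i < tokens.size(); ++i) {
            if (tokens[i].empty() || !std::isdigit((unsigned char)tokens[i][0]))
                continue;                         // only a number can start an address
            Location where;
            if (parse_address(tokens, i, &where)) // usually fails very fast
                found.push_back(where);
        }
        return found;
    }

    int main() {
        std::vector<std::string> tokens = {"Call", "us", "at", "2400",
                                           "Bayshore", "Pky", "94043"};
        std::printf("%zu address(es) found\n", scan_document(tokens).size());
    }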
Nevertheless, a number of issues remain in address detection.  A test
shows that it only recognizes 80% of a human-culled set of 200 addresses
from one of the raw corpuses.  Commercial Web mapping engines do
substantially better.  In particular, they are able to deal with addresses
that have missing or incorrect "street types" (Street vs. Avenue vs.
Blvd...) and missing or incorrect directional qualifiers (prefix or suffix
East, West, ...).  My recognizer considers those to be part of the street
name, which must match.  They are broken out in the original TIGER/Line
data, so we could be more lenient here.

Street names are also present in a variety of forms; while TIGER does
store alternate names, it doesn't always get them all.  One address failed
because it referred to "Sandpoint Way" rather than "Sand Point Way".
Furthermore, some addresses are given by intersection ("the corner of
Canal and Lexington"), which my recognizer cannot handle at all; the
document scanner won't even pick it up as a potential address.

Sadly, of the four addresses in , only two are successfully recognized.
Luckily, the Mountain View headquarters' address is one of them; the New
York address is also fine.  However, the Chicago address is missing the
"Avenue" suffix (it just says "401 N. Michigan"), and the Atlanta address
is missing a directional suffix (it says "1200 Abernathy Road" instead of
"1200 Abernathy Road NE").

Finally, the TIGER/Line data itself is incomplete and sometimes
inaccurate; probably half of the unrecognized 20% of addresses were due to
omissions in the TIGER/Line database.  Who knows how many addresses are
recognized but plotted incorrectly.  (Anecdotally, they're generally at
least close, which is what matters for this purpose.)  It's a messy world
out there.

DOCUMENT INDEXING

The keyword indexing system is an extremely simple inverted index of which
I will not speak further.  Geographical indexing is more interesting.

Coordinates are converted to the Mercator projection.  The Mercator grid
is covered multiple times by a set of "regions".  Regions come in a
variety of sizes; at each size, the Mercator plane is divided into a
regular grid of that size, and each grid square is a "region".  There are
16 sizes, in powers of two between a hemisphere of the globe and 1/65536
thereof.  Each point, therefore, "belongs" to 16 different regions.  Each
region has a grid coordinate.

The Mercator projection extends to infinity at the poles; if at a given
size the Y grid coordinate is greater than 65536, it is ignored.  (This
has the effect of keeping the minimum resolution roughly constant with
latitude.)  Luckily, there are no streets in TIGER/Line near the poles,
otherwise I would have to deal with some irksome singularities.

When an address is recognized in the corpus, all 16 associated region
codes are associated with the document, as if they were a special kind of
inverted index term.  ("R13000123000456" might be "region size 13, grid
coordinate (123,456)".)  When a geographical query is performed (with a
center point and a radius), a 3x3 set of regions of appropriate size is
picked that will cover the circle completely but not include too much
else.  These regions are then combined into the query process as if they
were terms being OR'd together (and then AND'd with the rest of the user's
search terms).
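To make the region scheme concrete, here is a rough sketch of turning a
point into its per-size region codes.  The Mercator formula, the grid
constants, and the exact "R..." code format are assumptions for
illustration, not the submission's actual encoding:

    // Illustrative sketch: compute region codes for a point at 16 grid
    // sizes.  The projection, constants, and code format are assumptions.
    #include <cmath>
    #include <cstdio>
    #include <string>
    #include <vector>

    // Project latitude/longitude (degrees) onto a unit-width Mercator
    // plane: x in [0,1), y growing without bound toward the poles.
    static void mercator(double lon_deg, double lat_deg, double *x, double *y) {
        const double pi = 3.14159265358979323846;
        double lat = lat_deg * pi / 180.0;
        *x = (lon_deg + 180.0) / 360.0;
        *y = 0.5 - std::log(std::tan(pi / 4.0 + lat / 2.0)) / (2.0 * pi);
    }

    // One region code per size: size 0 is hemisphere-scale, and each
    // following size halves the cell edge.  Cells whose y index would
    // overflow the grid (very near the poles) are simply skipped.
    static std::vector<std::string> region_codes(double lon, double lat) {
        std::vector<std::string> codes;
        double x, y;
        mercator(lon, lat, &x, &y);
        for (int size = 0; size < 16; ++size) {
            long cells = 2L << size;            // grid cells per axis
            long gx = (long)std::floor(x * cells);
            long gy = (long)std::floor(y * cells);
            if (gy < 0 || gy > 65536) continue; // ignore polar overflow
            char buf[64];
            std::snprintf(buf, sizeof buf, "R%02d%06ld%06ld", size, gx, gy);
            codes.push_back(buf);
        }
        return codes;
    }

    int main() {
        for (const std::string &c : region_codes(-122.0972, 37.42532))
            std::printf("%s\n", c);
    }

At indexing time each of these codes becomes an extra "term" for the
document; at query time, the size whose 3x3 block of cells just covers the
query circle is chosen and those nine codes are OR'd together, as
described above.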
The index stores the address locations found for each document, so when
the documents are retrieved the addresses can be checked to see if they
are actually within the specified radius of the specified center.  The
results can also be sorted by distance.  Note that a single document may
of course have many addresses.

As with the TIGER/Line processing, document indexing uses an intermediate
text file format.  However, there is only a single such format, which
associates document IDs with URLs, titles and addresses, and associates
search terms (keywords and region codes) with documents.  This file is
read twice due to certain index layout constraints, but the index itself
is quite simple by comparison to the address map.  It contains a list of
documents, a list of locations for each document, and two binary search
tables (one for keywords, one for region codes) that point to zones which
contain lists of document identifiers.

The query client "idx-client" outputs a machine-readable (but also
human-readable) results form; the CGI script "query.cgi" runs this program
and parses its output.  A Web server running a Python CGI script running a
search program isn't the most efficient way to do things; in production,
the whole thing would be tied in to some big crazy application server.

PORTABILITY

With one exception, the included code is designed to be completely
portable, using only standard C and C++ libraries.  The exception is the
raw I/O layer, which comes in two flavors (both of which conform to the
same header file); one flavor uses UNIX mmap() for performance, the other
uses only C stdio for maximum portability.  The system has been tested
with both, but is configured to use the mmap() layer by default.

The code has, however, only been tested on a Debian Linux 2.4 x86 system,
using gcc/g++ 2.95.  Your mileage may vary on other platforms.  The
Makefile, shell scripts, and Python CGI script would have to be replaced
if porting to a system without these facilities.  These are all very
simple and would be easy to reimplement.
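To illustrate the two-flavor raw I/O layer, here is a minimal sketch.  The
names, the interface, and the USE_MMAP switch are invented for this
example (the real header is richer); it only shows how the same
declarations can be backed either by mmap() or by plain stdio:

    // Minimal sketch of a two-flavor raw I/O layer: one small interface,
    // backed either by UNIX mmap() (fast) or by C stdio (portable).
    // Names and interface shape are invented for this example.
    #include <cstddef>
    #include <cstdio>
    #include <cstring>

    struct RawFile;                               // opaque handle
    RawFile *raw_open(const char *path);          // returns 0 on failure
    bool raw_read(RawFile *f, size_t off, void *buf, size_t len);
    void raw_close(RawFile *f);

    #ifdef USE_MMAP  /* ---- mmap() flavor: whole file mapped read-only ---- */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    struct RawFile { const char *data; size_t size; };

    RawFile *raw_open(const char *path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return 0;
        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return 0; }
        void *p = mmap(0, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);                                // the mapping outlives the fd
        if (p == MAP_FAILED) return 0;
        return new RawFile{(const char *)p, (size_t)st.st_size};
    }

    bool raw_read(RawFile *f, size_t off, void *buf, size_t len) {
        if (off + len > f->size) return false;
        std::memcpy(buf, f->data + off, len);     // copy straight out of the map
        return true;
    }

    void raw_close(RawFile *f) { munmap((void *)f->data, f->size); delete f; }

    #else            /* ---- stdio flavor: maximally portable ---- */

    struct RawFile { std::FILE *fp; };

    RawFile *raw_open(const char *path) {
        std::FILE *fp = std::fopen(path, "rb");
        return fp ? new RawFile{fp} : 0;
    }

    bool raw_read(RawFile *f, size_t off, void *buf, size_t len) {
        // long offsets echo the few-GB ceiling mentioned in the caveats
        if (std::fseek(f->fp, (long)off, SEEK_SET) != 0) return false;
        return std::fread(buf, 1, len, f->fp) == len;
    }

    void raw_close(RawFile *f) { std::fclose(f->fp); delete f; }

    #endif

    int main(int argc, char **argv) {
        if (argc != 2) return 1;
        RawFile *f = raw_open(argv[1]);
        if (!f) { std::perror("raw_open"); return 1; }
        char head[16] = {0};
        raw_read(f, 0, head, sizeof head);
        std::printf("first bytes: %.16s\n", head);
        raw_close(f);
    }

Selecting the flavor at build time keeps the rest of the code oblivious to
which implementation is underneath.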
SCALABILITY

Performance measurements are taken on my computer ("gleep"), a modest home
Linux PC with a 1GHz Athlon CPU, ordinary IDE disk subsystem, and 256MB
RAM.  All tests are run with "cold cache" (input files had not been read
recently).

On this machine, conversion of the entire US TIGER/Line data set and FIPS
names into a map file takes a few hours.  This time is invariant; the data
is only published once every 10 years, and is constant with respect to the
corpus size.  We can therefore ignore this phase, considering only the
actual document indexing.

As described above, the indexing process happens in two phases.  The first
phase independently converts portions of the corpus into intermediate
indexes which are each sorted with UNIX 'sort'.  The second phase merges
these files with 'sort -m' and feeds the output into a program which
generates a binary index.  Finally, another level of segmentation may be
achieved by building many binary indexes and querying them in parallel at
run time:

                                             query                                    (5)
                     --------------------------+--------------------------
                     |                         |                         |
             +--------------+          +--------------+          +--------------+
run time - - | binary index | - - - - -| binary index | - - - - -| binary index | - - - - -
             +--------------+          +--------------+          +--------------+
build time           |                         |                         |
                 idx-index                 idx-index                 idx-index        (4)
                     |                         |                         |
                  sort -m                   sort -m                   sort -m         (3)
               ------+------             ------+------             ------+------
              /      |      \           /      |      \           /      |      \
           sort    sort    sort      sort    sort    sort      sort    sort    sort   (2)
             |       |       |         |       |       |         |       |       |
          ripper  ripper  ripper    ripper  ripper  ripper    ripper  ripper  ripper  (1)
             |       |       |         |       |       |         |       |       |
          pprepos pprepos pprepos   pprepos pprepos pprepos   pprepos pprepos pprepos
            / \     / \     / \       / \     / \     / \       / \     / \     / \
          doc doc doc doc doc doc   doc doc doc doc doc doc   doc doc doc doc doc doc (0)

Note: the infrastructure for query distribution is not included (though it
would be simple).  All other forms of distribution pictured here are
represented in the submission.

0. Parsing

   I know nothing about the conversion from raw documents to pre-parsed
   form, but I can only assume that is a scalable process.

1. Ripper

   The "ripper" step's execution time is linearly proportional to the
   size of its input (it remembers nearly nothing from one document to
   the next); the corpus can be subdivided as desired to run any number
   of "ripper" processes concurrently.  To measure its performance, I
   processed an arbitrarily chosen segment of the 900K-document corpus:

      time bzip2 -dc ~/google/data/pprepos.03.bz2 | ./goo-ripper --geocode all.map 0 - > 03.rip
      Welcome to the Google Programming Contest ripper.
      Please see the file LICENSE for terms of use of the data and code.
      bzip2 -dc ~/google/data/pprepos.03.bz2  34.05s user 0.54s system 13% cpu 4:20.35 total
      ./goo-ripper --geocode all.map 0 - > 03.rip  115.79s user 2.59s system 45% cpu 4:20.36 total

   This segment contains 16,419 documents.  Even with the overhead of
   decompression, the ripper can process around 60 documents per second.
   With 1,000 machines equivalent to my own, a two-billion-document
   corpus could be run through the ripper in under 10 hours.

2. Sort

   The task of sorting each ripper's output can be parallelized as
   appropriate by adjusting the size of the portions individually sorted.
   Using the same example segment gives the following time on gleep:

      time sort -f < 03.rip > 03.rip.sort
      sort -f < 03.rip > 03.rip.sort  22.24s user 8.95s system 38% cpu 1:21.21 total

   With this particular segment size (16,419 documents), sorting is
   faster than ripping; with 1,000 machines like my own, the
   two-billion-document corpus would take under 3 hours.

3. Merge

   This step's running time is harder to estimate.  It depends on how
   many indexes are constructed, and on the complexity, performance, and
   parallelization opportunities inherent in merging large collections of
   large sorted files.  The entire sample 900K corpus merges relatively
   quickly:

      time sort -mf rip/*.rip > all.rip
      sort -mf rip/*.rip > all.rip  316.15s user 162.13s system 26% cpu 30:22.19 total

   If each binary index built were the same size as this corpus, then the
   two-billion-document corpus would have 2,222 binary indexes.  With
   1,000 machines equivalent to my own, the merging step would take an
   hour and a half.

   However, it is almost certainly not optimal to use over 2,000
   individual indexes.  I will leave it to you to estimate the scaling
   factors when fewer, larger indexes are built.  (If nothing else,
   parallelization is harder.)  Nevertheless, it is clearly a feasible
   task.  (The arithmetic behind the estimates in steps 1-3 is collected
   in the sketch below.)
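For reference, the back-of-the-envelope arithmetic behind the estimates in
steps 1-3 can be redone mechanically; the figures below are simply the
measurements quoted above:

    // Back-of-the-envelope check of the scaling estimates quoted above,
    // using the timings measured on "gleep".
    #include <cstdio>

    int main() {
        const double corpus_docs = 2e9;      // target corpus size
        const double machines = 1000;        // machines comparable to gleep

        const double rip_docs = 16419, rip_secs = 260.35;    // 4:20.35
        const double sort_docs = 16419, sort_secs = 81.21;   // 1:21.21
        const double merge_docs = 900e3, merge_secs = 1822;  // 30:22 for 900K docs

        std::printf("ripper: %.0f docs/s -> %.1f hours on %g machines\n",
                    rip_docs / rip_secs,
                    corpus_docs / (rip_docs / rip_secs) / machines / 3600, machines);
        std::printf("sort:   %.0f docs/s -> %.1f hours on %g machines\n",
                    sort_docs / sort_secs,
                    corpus_docs / (sort_docs / sort_secs) / machines / 3600, machines);
        std::printf("merge:  %.0f indexes of 900K docs, ~%.1f hours on %g machines\n",
                    corpus_docs / merge_docs,
                    (corpus_docs / merge_docs) * merge_secs / machines / 3600, machines);
    }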
4. Index

   The index builder must be run twice on the merged input, but each
   run's time is proportional only to the length of the input.  Indexing
   our favorite 16K-document segment is very fast:

      time ./idx-index 03.index < 03.rip.sort
      read: EOF
      notice: indexing stage 1
      ./idx-index 03.index < 03.rip.sort  4.54s user 0.56s system 68% cpu 7.420 total

      time ./idx-index 03.index < 03.rip.sort
      notice: indexing stage 2
      ./idx-index 03.index < 03.rip.sort  5.62s user 0.65s system 98% cpu 6.342 total

   The second part of this test was with a 'warm cache', which does not
   accurately reflect the situation for a much larger index.
   Nevertheless, the time for the second run should be no worse than the
   time for the first.

   That's 1,193 documents per second.  If the two-billion-document corpus
   were built into 100 individual binary indexes on 100 machines
   equivalent to my own, the final indexing step would take under five
   hours.  Using more indexes and more machines will reduce the time, at
   the expense of trouble when running queries.  (It probably makes sense
   to perform this step on a smaller number of machines, each of which
   has a very fast I/O system.)

5. Query

   Query scalability is slightly harder to judge.  Individual queries
   vary, and cache effects are critical (do most queries share a common
   'working set' of 'root pages', or does each query thrash the cache?).
   Nevertheless, anecdotal evidence suggests that queries run in a few
   tens of milliseconds:

      time ./idx-client all.index all.map \
          -5 '2461 E Madison St 98112' books > /dev/null
      ./idx-client all.index all.map \
          -5 '2461 E Madison St 98112' books > /dev/null  0.04s user 0.00s system 134% cpu 0.030 total

   If the two-billion-document corpus is divided into 900,000-document
   indexes, each of which gets its own machine comparable to mine, then
   the response time for queries against the large corpus should be
   similar.  With fewer, larger indexes, the time will be worse.  With
   machines that have more memory, the time will be better; with some
   revisions to improve cache locality (using B*tree indexes rather than
   simple binary search tables), the time will be better still.
   Nevertheless, I believe query times are rapid enough for scaling to be
   quite feasible.

I argue, therefore, that this submission could be used almost as-is to
index and query two billion documents on a farm of 1,000 relatively cheap
machines in well under 24 hours.  Relatively simple improvements (like
using a custom intermediate format rather than highly redundant text
files) would likely improve performance even further.

None of this should be surprising.  The techniques I use are similar to
those used in any full-text indexing system; even the geographical index
is basically an inverted index with some special terms.  These systems are
known to scale quite well with some effort; if Google itself can run with
two billion documents, then my system should also be able to.

AUTHORS

I am the sole author of this submission:

    Daniel Egnor              egnor@ofb.net
    424 West End Ave. #4G     http://ofb.net/~egnor/
    New York, NY 10024        +1 212 579-8562