This isn’t really my first data visualization project – in fact I’ve done this same sort of geo visualization with other data, just not with Google Maps and not with such a large data set. I was a bit surprised that there weren’t many casual write-ups about doing large-scale viz, so here I am.

The other day someone showed me the utvoters.com web site, which lists all of the publicly available registered-voter information for Utah. My first thought was that this is interesting information presented in a very terrible way, so I decided to take it as an opportunity to learn the Google Maps API and brush off some other old skills.

First of all, I’ve already got a LAMP server spun up over on DigitalOcean, with various domains managed through Hurricane Electric’s free DNS service, so I can point one at the server and configure a new host in a flash. I’ve also got various bits of software installed that I find useful: Java, PHP, Sublime Text, MySQL Workbench, Pentaho Kettle, etc.

I’d also like to note that I really slapped this together just to play with Google Maps API – the data structures and code are haphazard. If this prototype is interesting enough I’ll come back through and refactor.

Google Maps API

From the wonderful Google Developers Maps page it’s easy to get started: grab an API key and implement the simplest version of Google Maps in 10 minutes. I went ahead and updated the hello-world Google map to be centered on Utah at an appropriate zoom (found by googling “map Utah” and looking at the link), then placed a marker at the center of the map.

http://votermap.pirho.com/hello-world.htm

Data

I started by downloading the Utah registered-voter data from http://utvoters.com/voters.txt; this is a large file (~500MB).

I’m not fond of spending too much time on database structure when putting together a quick project, so I am leaning on Pentaho’s Kettle (http://community.pentaho.com/projects/data-integration/) to create the table(s) and insert the data.

I’m skipping some details, like installing Kettle & configuring the database connection – but this is fairly simple stuff.

First, I created a new transform and:
  • Added an input step, “CSV file input”: clicked into the details, clicked “Get Fields”, then “Preview”.
  • Added an output step, “Insert / Update”, and clicked “SQL”.

These steps scan the data and figure out a table structure that will work to store it… this makes life simple. Clicking the “SQL” button within the “Insert / Update” step gives the SQL for generating the table, which you can then create by clicking the execute button.

Here is the autogenerated SQL:

CREATE TABLE voters2
(
  `Voter ID` BIGINT
, `Last Name` VARCHAR(23)
, `First Name` VARCHAR(27)
, `Middle Name` VARCHAR(31)
, `Name Suffix` VARCHAR(9)
, Status VARCHAR(12)
, Absentee VARCHAR(6)
, UOCAVA VARCHAR(3)
, `Registration Date` BIGINT
, `Original Registration Date` BIGINT
, Party VARCHAR(34)
, Phone VARCHAR(24)
, `Mailing Address` VARCHAR(107)
, `Mailing city, state  zip` VARCHAR(49)
, `County ID` VARCHAR(10)
, Precinct VARCHAR(9)
, `House Number` VARCHAR(8)
, `House Number Suffix` VARCHAR(4)
, `Direction Prefix` VARCHAR(2)
, Street VARCHAR(28)
, `Direction Suffix` VARCHAR(15)
, `Street Type` VARCHAR(6)
, `Unit Type` VARCHAR(4)
, `Unit Number` VARCHAR(10)
, City VARCHAR(20)
, Zip VARCHAR(15)
, DOB VARCHAR(11)
, Congressional BIGINT
, `State House` BIGINT
, `State Senate` BIGINT
, `State Schoolboard` BIGINT
, `Local Schoolboard` VARCHAR(30)
, `County Council` VARCHAR(25)
, `City Council` VARCHAR(33)
, `11/6/1990` DATETIME
, `11/5/1991` DATETIME
, `11/3/1992` DATETIME
, `11/2/1993` DATETIME
, `11/8/1994` DATETIME
, `5/23/1995` DATETIME
, `9/12/1995` DATETIME
, `10/3/1995` DATETIME
, `11/7/1995` DATETIME
, `6/25/1996` DATETIME
, `8/6/1996` DATETIME
, `11/5/1996` DATETIME
, `2/4/1997` DATETIME
, `5/6/1997` DATETIME
, `8/1/1997` DATETIME
, `10/7/1997` DATETIME
, `11/4/1997` DATETIME
, `6/23/1998` DATETIME
, `11/3/1998` DATETIME
, `5/4/1999` DATETIME
, `8/3/1999` DATETIME
, `10/5/1999` DATETIME
, `11/2/1999` DATETIME
, `5/2/2000` DATETIME
, `6/27/2000` DATETIME
, `11/7/2000` DATETIME
, `2/6/2001` DATETIME
, `10/2/2001` DATETIME
, `11/6/2001` DATETIME
, `6/25/2002` DATETIME
, `11/5/2002` DATETIME
, `2/4/2003` DATETIME
, `8/5/2003` DATETIME
, `10/7/2003` DATETIME
, `11/4/2003` DATETIME
, `5/4/2004` DATETIME
, `6/22/2004` DATETIME
, `8/3/2004` DATETIME
, `11/2/2004` DATETIME
, `10/4/2005` DATETIME
, `11/8/2005` DATETIME
, `6/27/2006` DATETIME
, `11/7/2006` DATETIME
, `6/26/2007` DATETIME
, `9/11/2007` DATETIME
, `11/6/2007` DATETIME
, `2/5/2008` DATETIME
, `6/24/2008` DATETIME
, `11/4/2008` DATETIME
, `9/15/2009` DATETIME
, `11/4/2009` DATETIME
, `6/22/2010` DATETIME
, `11/2/2010` DATETIME
, `9/13/2011` DATETIME
, `11/8/2011` DATETIME
, `6/26/2012` DATETIME
, `11/6/2012` DATETIME
, `8/13/2013` DATETIME
, `11/5/2013` DATETIME
)
;

If I were to run the transform from Kettle at this point it would take about 30 hours – so instead I turn back to the MySQL command line and load the data using LOAD DATA INFILE:

mysql> truncate voters;
Query OK, 0 rows affected (0.00 sec)

mysql> set autocommit = 0;
Query OK, 0 rows affected (0.00 sec)

mysql> load data infile '/var/www/votermap.pirho.com/voters.txt' into table voters
    -> FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    -> LINES TERMINATED BY '\n'
    -> IGNORE 1 LINES;
Query OK, 1481476 rows affected, 65535 warnings (50.04 sec)
Records: 1481476  Deleted: 0  Skipped: 0  Warnings: 90370058

mysql> commit;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(*) from voters;
+----------+
| count(*) |
+----------+
|  1481476 |
+----------+
1 row in set (0.00 sec)

Note the 50 second import time.

There are a ton of warnings from dates in the data that I’m not converting; if I really want to clean this up I’ll come back, change the structure of the table, and do a better transform on the data. For now, I’m just trying to get the data loaded so I can start geocoding the addresses.
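When that cleanup pass happens, the main job will be parsing the file’s M/D/YYYY text dates into real DATE values. A quick sanity check that PHP’s `DateTime::createFromFormat` handles the format (just a sketch, not part of the loader):

```php
<?php
// The voter file stores dates like 11/5/2013 (no leading zeros).
// 'n/j/Y' matches month and day without leading zeros.
$d = DateTime::createFromFormat('n/j/Y', '11/5/2013');
echo $d->format('Y-m-d') . "\n";  // 2013-11-05
```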

Geocoding

After poking around the Google Map API a bit, it appears that I will need to pre-process the lat and long of the addresses in my database using the Google geocoding service.

I need to modify the DB table to include lat, long & the Geocode response.

How accurately should I store latitude and longitude?

ALTER TABLE `votermap`.`voters`
ADD COLUMN `lat` DECIMAL(10,8) NULL DEFAULT NULL AFTER `11/5/2013`,
ADD COLUMN `lng` DECIMAL(11,8) NULL DEFAULT NULL AFTER `lat`,
ADD COLUMN `Geocode Response` VARCHAR(10000) NULL DEFAULT NULL AFTER `lng`;
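DECIMAL(10,8) keeps eight decimal places of a degree, which is far more precision than the geocoder can actually deliver – a quick back-of-envelope check (the meters-per-degree figure is approximate):

```php
<?php
// One degree of latitude is roughly 111.32 km, so the 8th decimal
// place of a stored coordinate resolves to about a millimeter.
$metersPerDegreeLat = 111320;  // approximate; varies slightly with latitude
echo ($metersPerDegreeLat * 1e-8) . " m\n";  // ~0.0011 m
```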

Then I threw together some PHP to roll through the table, geocoding the voter addresses to lat & lng.

Google Geocode PHP Function

Expanding on existing code, this is the function that interfaces with Google’s geocoding service.

function geocode($address){

    // Hit the Google geocode web service with the url-encoded address
    $url = "http://maps.googleapis.com/maps/api/geocode/json?address=".urlencode($address);
    $rawResp = file_get_contents($url);
    $response = json_decode($rawResp, true);

    // Anything other than OK (ZERO_RESULTS, OVER_QUERY_LIMIT, etc.)
    // has no geometry to pull out, so just pass the status back
    if ($response['status'] != 'OK') {
        return array(
            'status' => $response['status'],
            'response' => $rawResp
        );
    }

    //print_r($response);

    $geometry = $response['results'][0]['geometry'];

    $array = array(
        'status' => $response['status'],
        'latitude' => $geometry['location']['lat'],
        'longitude' => $geometry['location']['lng'],
        'location_type' => $geometry['location_type'],
        'response' => $rawResp
    );

    return $array;

}

I’m struggling a little with how to get 1.5 million rows geocoded with a limit of 2,500 requests per day from Google’s free service. This will take some thought after reviewing Google’s geocoding-strategies page. It is interesting to note that you could leverage user requests to do a lot of additional work from their browsers and feed that information back to the server to cache.
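To put the problem in numbers, using the row count from the import above and the 2,500/day free quota – that’s over a year and a half, even before deduplicating addresses:

```php
<?php
// How long would geocoding every row take at the free daily quota?
$rows  = 1481476;  // rows loaded above
$quota = 2500;     // free geocoding requests per day
echo ceil($rows / $quota) . " days\n";  // 593 days
```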

For now, I’m simply going to run against the service until it starts giving me back a status of ‘OVER_QUERY_LIMIT’, so that I can try using some data on a Google map. I’ll have to return to this problem later to get everything geocoded.

PDO (PHP Database Code)

<?php
// geocode() from above is assumed to be defined in (or included by) this file
echo "Start<br />\n";
$db = new PDO('mysql:host=localhost;dbname=votermap;charset=utf8', 'user', 'password');

$overQueryLimit = false;
while(!$overQueryLimit ){
	foreach($db->query('SELECT `House Number`, `House Number Suffix`, `Direction Prefix`, `Street`, `Direction Suffix`, `Street Type`, `Unit Type`, `Unit Number`, `City`, `Zip` FROM voters where lat IS NULL LIMIT 5') as $row) {
	    //echo $row['House Number'].' '.$row['House Number Suffix']; //etc...

	    //Build the address to lookup from the database query
	    $address = "";
	    if( !empty( $row['House Number'] )){
	    	$address .= " ".$row['House Number'];
	    }
	    if( !empty( $row['House Number Suffix'] )){
	    	$address .= " ".$row['House Number Suffix'];
	    }
	    if( !empty( $row['Direction Prefix'] )){
	    	$address .= " ".$row['Direction Prefix'];
	    }
	    if( !empty( $row['Street'] )){
	    	$address .= " ".$row['Street'];
	    }
	    if( !empty( $row['Direction Suffix'] )){
	    	$address .= " ".$row['Direction Suffix'];
	    }
	    if( !empty( $row['Street Type'] )){
	    	$address .= " ".$row['Street Type'];
	    }
	    if( !empty( $row['Unit Type'] )){
	    	$address .= ", ".$row['Unit Type'];
	    }
	    if( !empty( $row['Unit Number'] )){
	    	$address .= " ".$row['Unit Number'];
	    }
	    if( !empty( $row['City'] )){
	    	$address .= ", ".$row['City'];
	    }
	    $address .= ", Utah";
	    if( !empty( $row['Zip'] )){
	    	$address .= " ".$row['Zip'];
	    }
	    echo($address."<br />\n");

	    //Geocode the address
	    $result = geocode($address);
	    echo($result['status']."<br />\n");
	    echo($result['latitude']."<br />\n");
	    echo($result['longitude']."<br />\n");
	    echo($result['response']."<br />\n");

	    if ($result['status'] == 'OK') {
			//$stmt = $db->prepare("UPDATE voters SET name=?, WHERE `Voter ID`=?");
			try{
				$stmt = $db->prepare("UPDATE voters SET lat=?, lng=?, `Geocode Response`=? WHERE `House Number`=? AND `House Number Suffix`=? AND `Direction Prefix`=? AND `Street`=? AND `Direction Suffix`=? AND `Street Type`=? AND `Unit Type`=? AND `Unit Number`=? AND `City`=? AND `Zip`=?");
				$stmt->execute(array(	$result['latitude'],
										$result['longitude'],
										$result['response'],
										$row['House Number'],
		    							$row['House Number Suffix'],
		    							$row['Direction Prefix'],
		    							$row['Street'],
										$row['Direction Suffix'],
										$row['Street Type'],
								    	$row['Unit Type'],
								    	$row['Unit Number'],
								    	$row['City'],
								    	$row['Zip']
									));
				$affected_rows = $stmt->rowCount();
				echo "Affected Rows : ".$affected_rows."<br />\n";
			}catch(PDOException $e) {
	  			echo $e->getMessage();
			}
		}else if($result['status'] == 'OVER_QUERY_LIMIT'){
			$overQueryLimit = true;
		}
	}
}
echo "<br />\nThe End";
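One thing for the refactor pass – the commented-out line above already hints at it: keying the UPDATE on `Voter ID` would be simpler and safer, since a NULL in any address column makes the `col = ?` comparisons never match, and that row would be geocoded but silently never saved. A sketch of the idea (an in-memory SQLite table stands in for the MySQL one here, just so it runs anywhere):

```php
<?php
// Sketch: key the UPDATE on `Voter ID` instead of matching every address
// column. SQLite in-memory stands in for the MySQL voters table.
$db = new PDO('sqlite::memory:');
$db->exec('CREATE TABLE voters (`Voter ID` INTEGER, lat REAL, lng REAL, `Geocode Response` TEXT)');
$db->exec('INSERT INTO voters (`Voter ID`) VALUES (42)');

// Pretend geocode() output for one voter
$result = array('latitude' => 38.5733, 'longitude' => -109.5498, 'response' => '{...}');

$stmt = $db->prepare('UPDATE voters SET lat=?, lng=?, `Geocode Response`=? WHERE `Voter ID`=?');
$stmt->execute(array($result['latitude'], $result['longitude'], $result['response'], 42));
echo "Affected Rows : ".$stmt->rowCount()."\n";  // Affected Rows : 1
```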

I accidentally hit the above PHP file in Chrome instead of running it from the command line, which actually worked for monitoring the output. Evidently it’s been a while since I’ve done this kind of development, because I don’t recall browsers staying attached to a process for hours on end in the past.

Actual Data Viz

Now that I’ve got some coordinates in my database, let’s take the first step and modify the hello-world Google map example (my implementation) to include 1000 data points. The PHP just loops over 1000 rows from the database on the server, writing out JavaScript on the page to add markers to the map.

      function initialize() {
        var mapOptions = {
          center: new google.maps.LatLng(39.499761,-111.547028),
          zoom: 7
        };
        var map = new google.maps.Map(document.getElementById("map-canvas"),
            mapOptions);

        var unaffiliated = 'images/google-maps-icon-unafiliated.png';
        var republican = 'images/google-maps-icon-republican.png';
        var democrat = 'images/google-maps-icon-democrat.png';

	<?php
	foreach($db->query('SELECT `First Name`, `Last Name`, Party, lat, lng FROM voters where lat IS NOT NULL LIMIT 1000') as $row)
	{
	    // Pick which of the icon variables above to emit for this voter
	    // (the exact Party strings in the file are assumed here)
	    $icon = 'unaffiliated';
	    if ($row['Party'] == 'Republican') { $icon = 'republican'; }
	    else if ($row['Party'] == 'Democratic') { $icon = 'democrat'; }
	?>

        var voterMarker = new google.maps.Marker({
            position: new google.maps.LatLng(<?php echo $row['lat'].','.$row['lng']; ?>),
            map: map,
            icon: <?php echo $icon; ?>,
            title: "<?php echo $row['First Name'].' '.$row['Last Name']; ?>"
        });

    <?php } ?>

      }

      google.maps.event.addDomListener(window, 'load', initialize);

http://votermap.pirho.com/hello-world.php

Evidently the first 1000 rows of the file are people in Moab – I’m not quite sure how the file is sorted. However, even this very crude visualization is interesting as you zoom in on Moab.

Future Goals

  • Geocode all addresses
    • cron job
    • other services
    • leverage client geocoding
  • Manage markers smarter for better performance
    • Group at certain zoom levels
    • Fetch markers by view boundaries
    • KML?
  • Visualize by voting cycle
    • Opacity by recent voting
    • Filter
  • Show other political boundaries
    • Voter turnout
    • Party
