Distribution of JSON values with Python

Back in the early 1990's, the company I worked for had a program we used for data exploration of flat files to quickly create a distribution of values which was very useful to get a quick look at the cardinality of values within a field before planning how to perform any ELT steps. While the popular data format has changed over the years, I found myself confronted with a similar situation with JSON data. What I needed was a small program to read through many source data files containing one JSON record per line. For reasons that are fuzzy to me now, the original program was called something like "mval" and I think the original developers name was Mike, so I decided to call this program "gval". I've also needed to do Python programming in 2014 on Linux; therefore this program is in Python.

The program is simple, and makes use of standard Python libraries. This resulted in code that was nearly as long to parse arguments as it was to perform the actual processing. Yes, the argument process in overkill, but I didn't want to have to commit parameters to my long term memory if I could simply type python gval.py -h.

Python has the ability to crawl a directory, and all subdirectories which allows the program to process all the files with a .json extension. Each json file is expected to have one or more json records delimited by a carriage return. In addition to crawling the directory and any sub directory, the rest is very straightforward Python using a standard dictionary to collect the counts of values.

In the case where there are a large number of unique values, the program will run out of memory and die. It should be noted that the point is for a distribution of a reasonable number of values for a distribution of one unique value per record is useless when the desire is to get a reasonable number of values and the count of each.

If this data was in an RDBMS, the analogous SQL to what this Python program does would be a statement like:

select value, count(*) as cnt
from foo
group by value
order by cnt desc

In overall terseness, the SQL statement wins for less lines of code but it should be noted that it would first be necessary to load the data into an RDBMS which defeats the purpose for quick exploration of data to develop a plan to process the data.

Another neat thing that Python provides is the lambda function for sorting. In this case, the results are sorted by the count (to match the order by cnt desc part of the SQL statement). Further processing of the results may be achieved by piping to another program to reformat (use sed) or insert into a datastore.

The parameters to the program are:

python gval.py -h
     -e, --element
          JSON element to distribute
     -i, --input
          Input directory
     -h, --help
          This message

A test set of data, gziped is available here. The source of this data is the data set used for the GunStockMarket.

Using the test data, the output is:

python gval.py -e firearm
Input directory: .
JSON element: firearm
Processing: GunStockMarketSample.json
SKS: 431
Mosin Nagant: 269
Springfield Armory M1A: 208
M1 Garand: 163
Mauser Rifle: 65
Lee Enfield: 62
M1917 Revolver: 50
Luger Pistol: 45
Mannlicher M95: 35
CZ-52: 34
PSL: 32
Nagant M1895 Revolver: 19
Hakim: 17
Swiss K-31: 16
TT-33: 15
CZ-82: 14
SVT-40: 14
VZ-52: 9
Yugo M76: 5
Rasheed: 3

And finally, here's the script:

import os
import sys
import getopt
import json

input_dir = "."
element = ""
options, remainder = getopt.getopt(sys.argv[1:], 'i:e:h' , ['i=', 'input=', 'element=', 'help', ])
for opt, arg in options:
    if opt in ('-i', '--input'):
        input_dir = arg
    elif opt in ('-e', '--element'):
        element = arg
    elif opt in ('-h', '--help'):
        print '     -e, --element'
        print '          JSON element to distribute'
        print '     -i, --input'
        print '          Input directory'
        print '     -h, --help'
        print '          This message'
if element is "":
print 'Input directory: ' + input_dir
print 'JSON element: ' + element
distribution = dict()
for subdir, dirs, files in os.walk(input_dir):
    for file in files:
        if file.endswith(".json"):
            with open(subdir + '/' + file, 'r') as f:
                print "Processing: " + file
                for line in f.readlines():
                    json_data = json.loads(line)
                    value = json_data[element] if element in json_data else ""
                    if value in distribution:
                        distribution[value] = distribution[value] + 1
                        distribution[value] = 0
for k, v in sorted(distribution.iteritems(), key=lambda (k,v): (v * -1,k)):
    print u'{0}: {1}'.format(k, v)