Python Library#

Usage#

import flatterer
output = flatterer.flatten('games.json', 'games_dir')

If the first argument to flatten is a string (as in the above example), it represents the path to the input JSON file. However, it is also possible to supply an iterator of python dicts:

i.e

import flatterer
list_of_objects = [{"a", "a"}, {"a", "b"}]
output = flatterer.flatten(list_of_objects, 'games_dir')

For more complicated cases, where you want a stream of data, it is possible make a generator to feed data into flatterer. This example uses ijson to stream all objects from a JSON array in a file:

import flatterer

def array_item_generator():
    with open('fixtures/basic.json', 'rb') as f:
        for item in ijson.items(f, 'item'):
            // you could manipulate items to modify data as it is going in.
            yield item

output = flatterer.flatten(array_item_generator(), 'games_dir')

This can be useful if you need to modify the data before it is processed, or if you are streaming data from a non-file source such as a database.

It is also possible to supply an iterator of bytes or strings that will be interpreted as JSON.

Also you can add the files=True argument to supply a list of file names.

import flatterer
output = flatterer.flatten(['games.json','games2.json'], 'games_dir', files=True)

All other options are the same as the command line tool described in Option Reference.

Output#

The output from running flatten is a dict which contains metadata about the conversion:

>>> output = flatterer.flatten('games.json', 'games_dir', sqlite=True, xlsx=True)
>>> print(output)
{
    "fields": "DataFrame - table_name, field_name, field_type, fiel... (5 fields)"
    "tables": "DataFrame - table_name, table_title"
    "data": {
        "main": "File Path - games_dir/csv/main.csv"
        "developer": "File Path - games_dir/csv/developer.csv"
        "platforms": "File Path - games_dir/csv/platforms.csv"
    }
    "sqlite": "File Path - games_dir/sqlite.db"
    "xlsx": "File Path - games_dir/output.xlsx"
}

So output[‘fields’] contains a pandas DataFrame with information about the fields generated by flatterer:

>>> print(output['fields'])
   table_name   field_name field_type  field_title  count
0        main        _link       text        _link      2
1        main           id     number           id      2
2        main        title       text        title      2
3        main  releaseDate       date  releaseDate      2
4        main  rating_code       text  rating_code      2
5        main  rating_name       text  rating_name      2
6   developer        _link       text        _link      2
7   developer   _link_main       text   _link_main      2
8   developer         name       text         name      2
9   platforms        _link       text        _link      3
10  platforms   _link_main       text   _link_main      3
11  platforms         name       text         name      3

Similar for output['tables'] showing the tables in the data.

>>> print(output['tables'])
  table_name table_title
0       main        main
1  developer   developer
2  platforms   platforms
  • output['data'] contains the locations of the CSV files if csv=False is not set.

  • output['sqlite'] contains the location of the sqlite database if sqlite=True is set.

  • output['xlsx'] contains the locatin of the xlsx file if xlsx=True is set.

Creating pandas DataFrames#

You can do this by setting dataframe=True

This creates DataFrames for all tables generated. This option is for the python library only.

Warning: This will cause issues with large datasets as the DataFrames will be put in memory

>>> output = flatterer.flatten('games.json', dataframe=True)
>>> print(output)
{
    "fields": "DataFrame - table_name, field_name, field_type, fiel... (5 fields)"
    "tables": "DataFrame - table_name, table_title"
    "data": {
        "main": "DataFrame - _link, id, title, releaseDate, rating_co... (6 fields)"
        "developer": "DataFrame - _link, _link_main, name"
        "platforms": "DataFrame - _link, _link_main, name"
    }
}

As you can see with this option you do not need to supply an output directory and will work in your systems temporary space.

The data key in the output now contains dataframes, e.g

>>> print(output['data']['main'])
   _link  id   title releaseDate rating_code rating_name
0      0   1  A Game  2015-01-01           E    Everyone
1      1   2  B Game  2016-01-01           E    Everyone