Python Library#
Usage#
import flatterer
output = flatterer.flatten('games.json', 'games_dir')
If the first argument to flatten
is a string (as in the above example), it represents the path to the input JSON file. However, it is also possible to supply an iterator of python dicts:
i.e
import flatterer
list_of_objects = [{"a", "a"}, {"a", "b"}]
output = flatterer.flatten(list_of_objects, 'games_dir')
For more complicated cases, where you want a stream of data, it is possible make a generator to feed data into flatterer. This example uses ijson
to stream all objects from a JSON array in a file:
import flatterer
def array_item_generator():
with open('fixtures/basic.json', 'rb') as f:
for item in ijson.items(f, 'item'):
// you could manipulate items to modify data as it is going in.
yield item
output = flatterer.flatten(array_item_generator(), 'games_dir')
This can be useful if you need to modify the data before it is processed, or if you are streaming data from a non-file source such as a database.
It is also possible to supply an iterator of bytes or strings that will be interpreted as JSON.
Also you can add the files=True
argument to supply a list of file names.
import flatterer
output = flatterer.flatten(['games.json','games2.json'], 'games_dir', files=True)
All other options are the same as the command line tool described in Option Reference.
Output#
The output from running flatten
is a dict which contains metadata about the conversion:
>>> output = flatterer.flatten('games.json', 'games_dir', sqlite=True, xlsx=True)
>>> print(output)
{
"fields": "DataFrame - table_name, field_name, field_type, fiel... (5 fields)"
"tables": "DataFrame - table_name, table_title"
"data": {
"main": "File Path - games_dir/csv/main.csv"
"developer": "File Path - games_dir/csv/developer.csv"
"platforms": "File Path - games_dir/csv/platforms.csv"
}
"sqlite": "File Path - games_dir/sqlite.db"
"xlsx": "File Path - games_dir/output.xlsx"
}
So output[‘fields’] contains a pandas DataFrame with information about the fields generated by flatterer:
>>> print(output['fields'])
table_name field_name field_type field_title count
0 main _link text _link 2
1 main id number id 2
2 main title text title 2
3 main releaseDate date releaseDate 2
4 main rating_code text rating_code 2
5 main rating_name text rating_name 2
6 developer _link text _link 2
7 developer _link_main text _link_main 2
8 developer name text name 2
9 platforms _link text _link 3
10 platforms _link_main text _link_main 3
11 platforms name text name 3
Similar for output['tables']
showing the tables in the data.
>>> print(output['tables'])
table_name table_title
0 main main
1 developer developer
2 platforms platforms
output['data']
contains the locations of the CSV files ifcsv=False
is not set.output['sqlite']
contains the location of the sqlite database ifsqlite=True
is set.output['xlsx']
contains the locatin of the xlsx file ifxlsx=True
is set.
Creating pandas DataFrames#
You can do this by setting dataframe=True
This creates DataFrames for all tables generated. This option is for the python library only.
Warning: This will cause issues with large datasets as the DataFrames will be put in memory
>>> output = flatterer.flatten('games.json', dataframe=True)
>>> print(output)
{
"fields": "DataFrame - table_name, field_name, field_type, fiel... (5 fields)"
"tables": "DataFrame - table_name, table_title"
"data": {
"main": "DataFrame - _link, id, title, releaseDate, rating_co... (6 fields)"
"developer": "DataFrame - _link, _link_main, name"
"platforms": "DataFrame - _link, _link_main, name"
}
}
As you can see with this option you do not need to supply an output directory and will work in your systems temporary space.
The data
key in the output now contains dataframes, e.g
>>> print(output['data']['main'])
_link id title releaseDate rating_code rating_name
0 0 1 A Game 2015-01-01 E Everyone
1 1 2 B Game 2016-01-01 E Everyone