Option Reference#

Help (CLI Only) #

flatterer --help

output looks like

Usage: flatterer [OPTIONS] [INPUTS]... OUTPUT_DIRECTORY

Options:
  --web                       Load web based version
  --csv / --nocsv             Output CSV files, default true
  --xlsx / --noxlsx           Output XLSX file, default false
  --sqlite / --nosqlite       Output sqlite.db file, default false
  --parquet / --noparquet     Output directory of parquet files, default false
  --postgres TEXT             Connection string to postgres. If supplied will
                              load data into postgres
  --sqlite-path TEXT          Output sqlite file to this file
  -d, --pushdown TEXT         Object keys and values, with this key name, will
                              be copied down to child tables
  -n, --no-link               Do not create `_link` fields
  -m, --main-table-name TEXT  Name of main table, defaults to name of the file
                              without the extension
  -p, --path TEXT             Key name of where json array starts, default top
                              level array
  -j, --ndjson                Is file a new line delemited JSON file, default
                              false
  --json-stream               File contains stream of json object, default
                              false
  --force                     Delete output directory if it exists, then run
                              command, default False
  -f, --fields TEXT           fields.csv file to use
  -o, --only-fields           Only output fields in fields.csv file
  -b, --tables TEXT           tables.csv file to use
  -l, --only-tables           Only output tables in tables.csv file
  -i, --inline-one-to-one     If array only has single item for all objects
                              treat as one-to-one
  -y, --arrays-new-table      Always treat arrays as a new tables, even when
                              they contain items that are not objects
  -s, --schema TEXT           JSONSchema file or URL to determine field order
  -t, --table-prefix TEXT     Prefix to add to all table names
  -a, --path-separator TEXT   Seperator to denote new path within the input
                              JSON. Defaults to `_`
  -h, --schema-titles TEXT    Use titles from JSONSchema in the given way.
                              Options are `full`, `slug`, `underscore_slug`.
                              Default to not using titles
  -w, --preview INTEGER       Only output this `preview` amount of lines in
                              final results
  --threads INTEGER           Number of threads, default 1, 0 means use number
                              of CPUs
  --json-path TEXT            JSON path within each object to use to filter
                              which objects to select
  --postgres-schema TEXT      When loading to postgres, put all tables into
                              this schema.
  --evolve                    When loading to postgres or sqlite, evolve
                              tables to fit data
  --drop                      When loading to postgres or sqlite, drop table
                              if already exists.
  --id-prefix TEXT            Prefix for all `_link` id fields
  --stats                     Produce stats about the data in the
                              datapackage.json file
  --help                      Show this message and exit.

Output Formats#

CSV: Defaults to outputing CSV in output directory <OUTPUT_DIRECTORY>/csv/.

XLSX: Output xlsx file to <OUTPUT_DIRECTORY>/output.xlsx.

SQLITE: Output sqlite file to <OUTPUT_DIRECTORY>/sqlite.db.

PARQUET: Output parquet files in <OUTPUT_DIRECTORY>/parquet/.

POSTGRES: Output in database.

CLI Usage#

Stop CSV output:

flatterer --nocsv INPUT_FILE OUTPUT_DIRECTORY

xlsx output:

flatterer --xlsx INPUT_FILE OUTPUT_DIRECTORY

sqlite output:

flatterer --sqlite INPUT_FILE OUTPUT_DIRECTORY

parquet output:

flatterer --parquet INPUT_FILE OUTPUT_DIRECTORY

postgres output:

flatterer --postgres='postgres://user:pass@host/dbname' INPUT_FILE OUTPUT_DIRECTORY

The connection string should be in one of these formats. In addition, if you want the connection string from an environment variable then use the string env (for default DATABASE_URL enviroment variable) env=MY_ENV_VAR (for MY_ENV_VAR environment variable).

postgres output from envirment variable:

flatterer --postgres='env=MY_ENV_VAR' INPUT_FILE OUTPUT_DIRECTORY

This will get the connection string from the MY_ENV_VAR environment variable.

Python Usage#

Export sqlite, xlsx, and parquet but not CSV.

import flatterer

flatterer.flatten('inputfile.json', 'ouput_dir', csv=False, sqlite=True, xlsx=True, parquet=True, postgres='postgres://user:pass@host/dbname')

Main Table Name#

Name of the table that represents data at the root of the JSON object.

For CSV will create <OUTPUT_DIRECTORY>/csv/<main_table_name>.csv and for XLSX will be the first tab name.

For CLI defaults to name of input file without the file ending and for python defaults to main.

CLI Usage#

flatterer -m games INPUT_FILE OUTPUT_DIRECTORY

Python Usage#

import flatterer

flatterer.flatten('inputfile.json', 'ouput_dir', main_table_name='games')

Pushdown Fields#

This allows you to copy values from top level object down to the child tables. This is useful if you want to define your own join keys or if it is useful having certain values in all related tables, saving you doing extra joins for common queries.

You need to specify a list of fields names (keys in the JSON) that you want to appear on all one-to-many tables (child tables). The field will prefixed with the table name where the field existed and the value will be copied from that table.

For example if main_table_name is game and this is the input JSON:

[
  {
    "id": 4,
    "platforms": [
      {
        "name":"PC",
        "id": 1,
        "requirements": [
          {"ram": "4GB"}
        ] 
      }
    ]
  }
]

CLI Usage#

If id and name are pushdown fields:

flatterer INPUT_FILE OUTPUT_DIRECTORY -d id -d name

platforms table will contain:

_link

_link_game

name

id

game_id

0.platforms.0

0

PC

1

4

As you can see a new column game_id is created containing the id from the games object.

platforms_requirements table will contain:

_link

_link_platforms

_link_game

platforms_name

platforms_id

game_id

ram

0.platforms.0.requirements.0

0.platforms.0

0

PC

1

4

4GB

This table also contains game_id but also platforms_id as id is pushed down from both parent tables. Also as name is a pushdown field, platforms_name column is also created.

Python Usage#

import flatterer

flatterer.flatten('inputfile.json', 'ouput_dir', pushdown=['id','name'])

Path to JSON Array#

Name of the object key where array of objects exists. Defaults to analysing the top level array.

Will not work with JSON Stream or option.

By default will work with:

[
  {
    "id": 1,
    "title": "A Game",
    "releaseDate": "2015-01-01"
  },
  {
    "id": 2,
    "title": "B Game",
    "releaseDate": "2016-01-01"
  }
]

If set to games will work with JSON like:

{"games": [
    {
      "id": 1,
      "title": "A Game",
      "releaseDate": "2015-01-01"
    },
    {
      "id": 2,
      "title": "B Game",
      "releaseDate": "2016-01-01"
]}

CLI Usage#

flatterer INPUT_FILE OUTPUT_DIRECTORY -p games

Python Usage#

import flatterer

flatterer.flatten('inputfile.json', 'ouput_dir', path='games')

New Line Delemited JSON (NDJSON)#

Input file is new line delimeted JSON. This is the fastest input type as each line can be parsed in its own thread using --threads.

{"id": 1, "title": "A Game",  "releaseDate": "2015-01-01"}
{"id": 2,  "title": "B Game",  "releaseDate": "2016-01-01"}

Will not work with Path to JSON Array option.

CLI Usage#

flatterer INPUT_FILE OUTPUT_DIRECTORY --ndjson

Python Usage#

import flatterer

flatterer.flatten('inputfile.jl', 'ouput_dir', ndjson=True)

JSON Stream#

Input file is stream of json objects, sometimes called Concatonated JSON. Each object does not need to be on its own line.

{
  "id": 1,
  "title": "A Game",
  "releaseDate": "2015-01-01"
}
{
  "id": 2,
  "title": "B Game",
  "releaseDate": "2016-01-01"
}

Will not work with Path to JSON Array option.

CLI Usage#

flatterer INPUT_FILE OUTPUT_DIRECTORY --json-stream

Python Usage#

import flatterer

flatterer.flatten('inputfile.jl', 'ouput_dir', json_stream=True)

JSON Path Filter#

This is used for filtering out objects to be included in the output, not for selecting values within an object.

Use a [JSON path][https://goessner.net/articles/JsonPath/] expression to select if a particular object will be in output.

Flatterer will evaluate the JSON path expression against every object in the input, and if there is any non-null value, it will include that object in the result.

e.g

[{
  "id": 1,
  "title": "A Film",
  "type": "film"
},
{
  "id": 2,
  "title": "A Game",
  "type": "game"
}]

Using the above input JSON the following will only select object with type = game (only select the second object in the example).

CLI Usage#

# careful as the $ needs escaping
flatterer INPUT_FILE OUTPUT_DIRECTORY --json-path "\$[?(@.type == 'game')]]" 

Python Usage#

import flatterer

flatterer.flatten('inputfile.jl', 'ouput_dir', json_path="$[?(@.type == 'game')]")

More complicated expressions can be used including logical conditions so $[?(@.type == 'game' || @.type == 'film')] will select objects with either type = 'game' OR type = 'film'.

Force#

Delete the output folder if it exists before processing.

CLI Usage#

flatterer INPUT_FILE OUTPUT_DIRECTORY --force

Python Usage#

import flatterer

flatterer.flatten('inputfile.jl', 'ouput_dir', force=True)

Postgres Schema#

Put tables into a postgres schema. Will create schema if it does not already exist.

CLI Usage#

flatterer --postgres='postgres://user:pass@host/dbname' INPUT_FILE OUTPUT_DIRECTORY --postgres-schema=myschema

Python Usage#

import flatterer

flatterer.flatten('inputfile.json', 'ouput_dir', postgres='postgres://user:pass@host/dbname', postgres_schema='myschema')

Evolve Tables#

For postgres and sqlite. This will evolve the existing tables if the schema of the new flattened data is different. This is useful if you have new JSON data that comes in over time or you have lots of files you want to process one by one.

Evolving follows the following rules:

  • If the new flattened data contains a table that does not exist in the database it will created.

  • If the table already exists but the new data has extra fields, the table is altered to add the new fields.

  • If table exists but fields that are in the database are not in the new data, they will result in nulls in the database when the new data is inserted.

  • If table exists and contains the same field name as the new data but the data types of the fields conflict:

    • For postgres the field is altered to being a text field so that both new and old data can exist (all types can be coerced to text).

    • For sqlite, as you can not alter existing types, the original type is kept. This will mean the data insertion will still work as sqlite treats any field as if it is text.

  • If no id_prefix is supplied random string will be added to _link fields so that these ids will be unique across loads.

It is recommended to add an id_prefix that is unique for each JSON file load. It could conatain for example the name of the file or a date.

Warning: this could mean you modify existing data.

Warning: Not completely parallel safe if multiple processes are inserting data into same database, it may cause an error if two processes are trying to add the same new field. This will not currupt any data and a retry should work.

CLI Usage#

flatterer --postgres='postgres://user:pass@host/dbname' INPUT_FILE OUTPUT_DIRECTORY --evolve

Python Usage#

import flatterer

flatterer.flatten('inputfile.json', 'ouput_dir', postgres='postgres://user:pass@host/dbname', evolve=True)

Drop Tables#

Warning: this could mean you loose data

For postgres and sqlite. Drop the existing table if it exists.

CLI Usage#

flatterer --postgres='postgres://user:pass@host/dbname' --sqlite-path=sqlite.db INPUT_FILE OUTPUT_DIRECTORY --drop

Python Usage#

import flatterer

flatterer.flatten('inputfile.json', 'ouput_dir', postgres='postgres://user:pass@host/dbname', drop=True)

Fields File#

Path to fields CSV file. The fields file can be used for:

  • Changing the field order in output files by rearranging the rows in the correct order.

  • Giving the fields a new name by using field_title

  • Removing unwanted fields when using the Only Fields option.

The CSV file needs the following headers:

  • table_name

  • field_name

It has the optional heading of field_title which will default to the field_name if missing.

For example:

table_name

field_name

field_type

count

field_title

platforms

_link

text

3

_link

platforms

_link_games

text

3

_link_games

platforms

name

text

3

name

games

_link

text

2

_link

games

id

number

2

id

It can have additional headers in the file but they will not be used. This is true of columns count and field_type in the above example.

Field order in the output will the same as the row order in the file.

table_name and field_name need to match up with the eventual structure of output. The easiest make sure of this is to edit the fields.csv that is in an output directory.
You can generate just the fields.csv file by not outputting the CSV files.

By default if there are fields in the data that are not in the fields.csv they will be added to the output after the defined fields. Use Only Fields to change this behaviour so that field not in the file will be excluded.

CLI Usage#

flatterer INPUT_FILE OUTPUT_DIRECTORY --fields fields.csv

Python Usage#

import flatterer

flatterer.flatten('inputfile.jl', 'ouput_dir', fields='fields.csv')

Only Fields#

Only fields in the fields.csv file will be in the output.

CLI Usage#

flatterer INPUT_FILE OUTPUT_DIRECTORY --fields fields.csv --only-fields

Python Usage#

import flatterer

flatterer.flatten('inputfile.jl', 'ouput_dir', fields='fields.csv', only_fields=True)

Tables File#

Path to tables CSV file. The file can be used for:

  • Changing the sheet order in xlsx output.

  • Giving the tables (and xlsx sheets) a new name by using table_title

  • Removing unwanted tables when using the Only Tables option.

The CSV file needs the following headers:

  • table_name

  • table_title

For example:

table_name

table_title

platforms

_link

games

_link

It can have additional headers in the file but they will not be used.

tables_name has to be the name that would be output by flatterer. To make sure that these names are correct it is best to use the tables.csv that is always in the output directory as an basis for modifying the output.

By default if there are tables in the data that are not in the tabless.csv they will be added to the output after the defined tables. Use Only Tables to change this behaviour so that only tables in this file will be output.

CLI Usage#

flatterer INPUT_FILE OUTPUT_DIRECTORY --tables tables.csv

Python Usage#

import flatterer

flatterer.flatten('inputfile.jl', 'ouput_dir', tables='tables.csv')

Only Tables#

Only tables in the tables.csv file will be in the output.

CLI Usage#

flatterer INPUT_FILE OUTPUT_DIRECTORY --tables tables.csv --only-tables

Python Usage#

import flatterer

flatterer.flatten('inputfile.jl', 'ouput_dir', tables='tables.csv', only_tables=True)

Inline One To One#

When a key has an array of objects as its value, but that array only ever has single items in it, then treat it these single item as if they are a sub-object (not sub array).
Without this set any array of objects will be treated like a one-to-many relationship and therefore have a new table associated with it. With this set and if all arrays under a particular key only have one item in it, the child table will not be created and the values will appear in the parent table.

CLI Usage#

flatterer INPUT_FILE OUTPUT_DIRECTORY --inline-one-to-one

Python Usage#

import flatterer

flatterer.flatten('inputfile.json', 'ouput_dir', inline_one_to_one=True)

Arrays as Table#

Always create a new table for all arrays, even if they do not contain objects. A new table is created with just the _link fields and a column named value which contains the value of the item in the array. If the array item is a string that value will go in the output, for all other types the JSON encoded version of that type is included.

CLI Usage#

flatterer INPUT_FILE OUTPUT_DIRECTORY --arrays-as-table

Python Usage#

import flatterer

flatterer.flatten('inputfile.json', 'ouput_dir', arrays_as_table=True)

Schema#

Supply a JSONSchema file to help determine field ordering of the output. If the schema supplied starts with http will try and download the schema from a remote server, otherwise it is assumed to be a file-system path.

CLI Usage#

For remote url:

flatterer INPUT_FILE OUTPUT_DIRECTORY --schema https://example.com/schema.json

For local file:

flatterer INPUT_FILE OUTPUT_DIRECTORY --schema schema.json

Python Usage#

import flatterer

flatterer.flatten('inputfile.json', 'ouput_dir', schema='https://example.com/schema.json')

Table Prefix#

Prefix to add to all table names. So if the output has a table called mytable and the “table prefix” is specified as myprefix_ then the table (therefore csv file or excel sheet name) will be called myprefix_mytable.
This can be useful if you are trying to namespace the output when inserting into an exiting database.

CLI Usage#

flatterer INPUT_FILE OUTPUT_DIRECTORY --table-prefix myprefix_

Python Usage#

import flatterer

flatterer.flatten('inputfile.jl', 'ouput_dir', table_prefix='myprefix_')

Path Separator#

By default all table and field names have _ as the seperator which denotes a that the field after the _ is a sub-property. For example myObject_myField says that myField exists as a property of myObject.

However some data already has _ in the property names. For example if in the input data theres was an object called my_object and the property called my_field then by default the field name would be my_object_my_field. This is confusing as you might expect object to be a property of my and it might cause some name clashes.

To fix this you can change the path separator to whatever you like. You could choose ___ as a separator, so in the example above, the field would be called my_object__my_field.

CLI Usage#

flatterer INPUT_FILE OUTPUT_DIRECTORY --path-separator __

Python Usage#

import flatterer

flatterer.flatten('inputfile.jl', 'ouput_dir', path_separoator='__')

Schema Titles#

When supplying a JSONSchema then use the title field in the schema for the field name (if it exists). This option takes a string which can be one of:

  • full Use make the title field from JSONSchema without modification.

  • slug use the title field but slugify it removing all charactors that are non alphanumeric characters, lower casing and replaceing spaces with -. For example My *&*Strange   Title will turn into my-strange-title

  • underscore_slug same as slug but uses _ instead of -. The previous example the output would be my_strange_title

If this option is left out or is the empty string then it will not take the heading from JSONSchema.

CLI Usage#

flatterer INPUT_FILE OUTPUT_DIRECTORY --schema-titles underscore_slug

Python Usage#

import flatterer

flatterer.flatten('inputfile.jl', 'ouput_dir', schema_titles='underscore_slug_')

Stats#

Adds additional statistics about the output files in the datapackage.json output.

CLI Usage#

flatterer INPUT_FILE OUTPUT_DIRECTORY --stats

Python Usage#

import flatterer

flatterer.flatten('inputfile.json', 'ouput_dir', stats=True)

Preview#

The number of rows written in final files. All statistics in fields.csv and datapackage.json will show counts related to all the data.

CLI Usage#

Only output first 10 lines of all the tables.

flatterer INPUT_FILE OUTPUT_DIRECTORY --preview 10

Python Usage#

import flatterer

flatterer.flatten('inputfile.json', 'ouput_dir', preview=10)

Threads#

The number of threads used to process the data. Default to 1. If set to 0 will use amount of CPUs.

Works best with new line delimited JSON --ndjson as JSON parsing can then be done by each thread. This can about a x3 times improvement with 6 threads if you have that many CPU cores. Without --ndjson makes only about x1.24 improvement on 2 threads and not worth going over 2 as it will not lead to performance improvement. For very small datasets (less than 100 object) using threads will most likely be slower.

Warning: When using this mode, not checks will be done to ensure an array of objects exists in the data. So in some circumstances, if the wrong options are chosen, no error will be raised.

Warning: May have issues with inline-one-to-one as each thread will determine what should be inlined.

CLI Usage#

flatterer INPUT_FILE OUTPUT_DIRECTORY --threads 0

Python Usage#

import flatterer

flatterer.flatten('inputfile.json', 'ouput_dir', threads=10)

Sql Scripts#

Python only. Export scripts for importing data into the database.

Python Usage#

flatterer.flatten(‘inputfile.json’, ‘ouput_dir’, sql_scripts=True)

Low Memory (api only)#

Reduces memory usage, sacrificing some speed. Use this if JSON contains very large JSON objects.

Python Usage#

import flatterer

flatterer.flatten('inputfile.json', 'ouput_dir', low_memory=True)