# Option Reference ## Help (CLI Only) ```bash flatterer --help ``` output looks like ``` Usage: flatterer [OPTIONS] [INPUTS]... OUTPUT_DIRECTORY Options: --web Load web based version --csv / --nocsv Output CSV files, default true --xlsx / --noxlsx Output XLSX file, default false --sqlite / --nosqlite Output sqlite.db file, default false --parquet / --noparquet Output directory of parquet files, default false --postgres TEXT Connection string to postgres. If supplied will load data into postgres --sqlite-path TEXT Output sqlite file to this file -d, --pushdown TEXT Object keys and values, with this key name, will be copied down to child tables -n, --no-link Do not create `_link` fields -m, --main-table-name TEXT Name of main table, defaults to name of the file without the extension -p, --path TEXT Key name of where json array starts, default top level array -j, --ndjson Is file a new line delemited JSON file, default false --json-stream File contains stream of json object, default false --force Delete output directory if it exists, then run command, default False -f, --fields TEXT fields.csv file to use -o, --only-fields Only output fields in fields.csv file -b, --tables TEXT tables.csv file to use -l, --only-tables Only output tables in tables.csv file -i, --inline-one-to-one If array only has single item for all objects treat as one-to-one -y, --arrays-new-table Always treat arrays as a new tables, even when they contain items that are not objects -s, --schema TEXT JSONSchema file or URL to determine field order -t, --table-prefix TEXT Prefix to add to all table names -a, --path-separator TEXT Seperator to denote new path within the input JSON. Defaults to `_` -h, --schema-titles TEXT Use titles from JSONSchema in the given way. Options are `full`, `slug`, `underscore_slug`. Default to not using titles -w, --preview INTEGER Only output this `preview` amount of lines in final results --threads INTEGER Number of threads, default 1, 0 means use number of CPUs --json-path TEXT JSON path within each object to use to filter which objects to select --postgres-schema TEXT When loading to postgres, put all tables into this schema. --evolve When loading to postgres or sqlite, evolve tables to fit data --drop When loading to postgres or sqlite, drop table if already exists. --id-prefix TEXT Prefix for all `_link` id fields --stats Produce stats about the data in the datapackage.json file --help Show this message and exit. ``` ## Output Formats **CSV:** Defaults to outputing CSV in output directory `/csv/`. **XLSX:** Output xlsx file to `/output.xlsx`. **SQLITE:** Output sqlite file to `/sqlite.db`. **PARQUET:** Output parquet files in `/parquet/`. **POSTGRES:** Output in database. ### CLI Usage **Stop CSV output:** ```bash flatterer --nocsv INPUT_FILE OUTPUT_DIRECTORY ``` **xlsx output:** ```bash flatterer --xlsx INPUT_FILE OUTPUT_DIRECTORY ``` **sqlite output:** ```bash flatterer --sqlite INPUT_FILE OUTPUT_DIRECTORY ``` **parquet output:** ```bash flatterer --parquet INPUT_FILE OUTPUT_DIRECTORY ``` **postgres output:** ```bash flatterer --postgres='postgres://user:pass@host/dbname' INPUT_FILE OUTPUT_DIRECTORY ``` The connection string should be in [one of these formats](https://docs.rs/postgres/latest/postgres/config/struct.Config.html#examples). In addition, if you want the connection string from an environment variable then use the string `env` (for default DATABASE_URL enviroment variable) `env=MY_ENV_VAR` (for MY_ENV_VAR environment variable). **postgres output from envirment variable:** ```bash flatterer --postgres='env=MY_ENV_VAR' INPUT_FILE OUTPUT_DIRECTORY ``` This will get the connection string from the `MY_ENV_VAR` environment variable. ### Python Usage Export sqlite, xlsx, and parquet but not CSV. ```bash import flatterer flatterer.flatten('inputfile.json', 'ouput_dir', csv=False, sqlite=True, xlsx=True, parquet=True, postgres='postgres://user:pass@host/dbname') ``` ## Main Table Name Name of the table that represents data at the root of the JSON object. For CSV will create `/csv/.csv` and for XLSX will be the first tab name. For CLI defaults to name of input file without the file ending and for python defaults to `main`. ### CLI Usage ```bash flatterer -m games INPUT_FILE OUTPUT_DIRECTORY ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.json', 'ouput_dir', main_table_name='games') ``` ## Pushdown Fields This allows you to copy values from top level object down to the child tables. This is useful if you want to define your own join keys or if it is useful having certain values in all related tables, saving you doing extra joins for common queries. You need to specify a list of fields names (keys in the JSON) that you want to appear on all one-to-many tables (child tables). The field will prefixed with the table name where the field existed and the value will be copied from that table. For example if `main_table_name` is `game` and this is the input JSON: ``` [ { "id": 4, "platforms": [ { "name":"PC", "id": 1, "requirements": [ {"ram": "4GB"} ] } ] } ] ``` ### CLI Usage If `id` and `name` are pushdown fields: ```bash flatterer INPUT_FILE OUTPUT_DIRECTORY -d id -d name ``` `platforms` table will contain: |_link|_link_game|name|id|game_id| |-----|----------|----|--|-------| |0.platforms.0|0|PC|1|4| As you can see a new column `game_id` is created containing the `id` from the `games` object. `platforms_requirements` table will contain: |_link|_link_platforms|_link_game|platforms_name|platforms_id|game_id|ram| |-----|---------------|----------|--------------|----|--|-------| |0.platforms.0.requirements.0|0.platforms.0|0|PC|1|4|4GB This table also contains `game_id` but also `platforms_id` as `id` is pushed down from both parent tables. Also as `name` is a pushdown field, `platforms_name` column is also created. ### Python Usage ```python import flatterer flatterer.flatten('inputfile.json', 'ouput_dir', pushdown=['id','name']) ``` ## No Link Fields Do not create any `_link` fields. This could be useful if `pushdown` pushes ids into child tables and you trust that this will be sufficient to link the tables back together. ### CLI Usage ```bash flatterer INPUT_FILE OUTPUT_DIRECTORY -n ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.json', 'ouput_dir', pushdown=['id'], no_link=True) ``` ## Path to JSON Array Name of the object key where array of objects exists. Defaults to analysing the top level array. Will not work with [](#json-stream) or [](#ndjson) option. By default will work with: ```json [ { "id": 1, "title": "A Game", "releaseDate": "2015-01-01" }, { "id": 2, "title": "B Game", "releaseDate": "2016-01-01" } ] ``` If set to `games` will work with JSON like: ```json {"games": [ { "id": 1, "title": "A Game", "releaseDate": "2015-01-01" }, { "id": 2, "title": "B Game", "releaseDate": "2016-01-01" ]} ``` ### CLI Usage ```bash flatterer INPUT_FILE OUTPUT_DIRECTORY -p games ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.json', 'ouput_dir', path='games') ``` ## New Line Delemited JSON (NDJSON) Input file is new line delimeted JSON. This is the fastest input type as each line can be parsed in its own thread using `--threads`. ```json {"id": 1, "title": "A Game", "releaseDate": "2015-01-01"} {"id": 2, "title": "B Game", "releaseDate": "2016-01-01"} ``` Will not work with [](#path-to-json-array) option. ### CLI Usage ```bash flatterer INPUT_FILE OUTPUT_DIRECTORY --ndjson ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.jl', 'ouput_dir', ndjson=True) ``` ## JSON Stream Input file is stream of json objects, sometimes called Concatonated JSON. Each object does not need to be on its own line. ```json { "id": 1, "title": "A Game", "releaseDate": "2015-01-01" } { "id": 2, "title": "B Game", "releaseDate": "2016-01-01" } ``` Will not work with [](#path-to-json-array) option. ### CLI Usage ```bash flatterer INPUT_FILE OUTPUT_DIRECTORY --json-stream ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.jl', 'ouput_dir', json_stream=True) ``` ## JSON Path Filter This is used for filtering out objects to be included in the output, not for selecting values within an object. Use a [JSON path][https://goessner.net/articles/JsonPath/] expression to select if a particular object will be in output. Flatterer will evaluate the JSON path expression against every object in the input, and if there is any non-null value, it will include that object in the result. e.g ```json [{ "id": 1, "title": "A Film", "type": "film" }, { "id": 2, "title": "A Game", "type": "game" }] ``` Using the above input JSON the following will only select object with `type = game` (only select the second object in the example). ### CLI Usage ```bash # careful as the $ needs escaping flatterer INPUT_FILE OUTPUT_DIRECTORY --json-path "\$[?(@.type == 'game')]]" ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.jl', 'ouput_dir', json_path="$[?(@.type == 'game')]") ``` More complicated expressions can be used including logical conditions so `$[?(@.type == 'game' || @.type == 'film')]` will select objects with either `type = 'game' OR type = 'film'`. ## Force Delete the output folder if it exists before processing. ### CLI Usage ```bash flatterer INPUT_FILE OUTPUT_DIRECTORY --force ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.jl', 'ouput_dir', force=True) ``` ## Postgres Schema Put tables into a postgres schema. Will create schema if it does not already exist. ### CLI Usage ```bash flatterer --postgres='postgres://user:pass@host/dbname' INPUT_FILE OUTPUT_DIRECTORY --postgres-schema=myschema ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.json', 'ouput_dir', postgres='postgres://user:pass@host/dbname', postgres_schema='myschema') ``` ## Evolve Tables For postgres and sqlite. This will evolve the existing tables if the schema of the new flattened data is different. This is useful if you have new JSON data that comes in over time or you have lots of files you want to process one by one. Evolving follows the following rules: - If the new flattened data contains a table that does not exist in the database it will created. - If the table already exists but the new data has extra fields, the table is altered to add the new fields. - If table exists but fields that are in the database are not in the new data, they will result in nulls in the database when the new data is inserted. - If table exists and contains the same field name as the new data but the data types of the fields conflict: - For postgres the field is altered to being a `text` field so that both new and old data can exist (all types can be coerced to text). - For sqlite, as you can not alter existing types, the original type is kept. This will mean the data insertion will still work as sqlite treats any field as if it is text. - If no `id_prefix` is supplied random string will be added to `_link` fields so that these ids will be unique across loads. It is recommended to add an `id_prefix` that is unique for each JSON file load. It could conatain for example the name of the file or a date. **Warning: this could mean you modify existing data.** **Warning: Not completely parallel safe if multiple processes are inserting data into same database, it may cause an error if two processes are trying to add the same new field. This will not currupt any data and a retry should work.** ### CLI Usage ```bash flatterer --postgres='postgres://user:pass@host/dbname' INPUT_FILE OUTPUT_DIRECTORY --evolve ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.json', 'ouput_dir', postgres='postgres://user:pass@host/dbname', evolve=True) ``` ## Drop Tables **Warning: this could mean you loose data** For postgres and sqlite. Drop the existing table if it exists. ### CLI Usage ```bash flatterer --postgres='postgres://user:pass@host/dbname' --sqlite-path=sqlite.db INPUT_FILE OUTPUT_DIRECTORY --drop ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.json', 'ouput_dir', postgres='postgres://user:pass@host/dbname', drop=True) ``` ## Fields File Path to fields CSV file. The fields file can be used for: * Changing the field order in output files by rearranging the rows in the correct order. * Giving the fields a new name by using `field_title` * Removing unwanted fields when using the [](#only-fields) option. The CSV file needs the following headers: * table_name * field_name It has the optional heading of `field_title` which will default to the `field_name` if missing. For example: |table_name |field_name|field_type|count|field_title| |-----------|----------|----------|-----|----------| |platforms |_link |text |3 |_link | |platforms |_link_games|text |3 |_link_games| |platforms |name |text |3 |name | |games |_link |text |2 | _link | |games |id |number |2 | id | It can have additional headers in the file but they will not be used. This is true of columns `count` and `field_type` in the above example. Field order in the output will the same as the row order in the file. `table_name` and `field_name` need to match up with the eventual structure of output. The easiest make sure of this is to edit the `fields.csv` that is in an output directory. You can generate just the fields.csv file by [not outputting the CSV files](#csv). By default if there are fields in the data that are not in the `fields.csv` they will be added to the output after the defined fields. Use [](#only-fields) to change this behaviour so that field not in the file will be excluded. ### CLI Usage ```bash flatterer INPUT_FILE OUTPUT_DIRECTORY --fields fields.csv ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.jl', 'ouput_dir', fields='fields.csv') ``` ## Only Fields Only fields in the fields.csv file will be in the output. ### CLI Usage ```bash flatterer INPUT_FILE OUTPUT_DIRECTORY --fields fields.csv --only-fields ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.jl', 'ouput_dir', fields='fields.csv', only_fields=True) ``` ## Tables File Path to tables CSV file. The file can be used for: * Changing the sheet order in xlsx output. * Giving the tables (and xlsx sheets) a new name by using `table_title` * Removing unwanted tables when using the [](#only-tables) option. The CSV file needs the following headers: * table_name * table_title For example: |table_name |table_title| |-----------|----------| |platforms |_link | |games |_link | It can have additional headers in the file but they will not be used. `tables_name` has to be the name that would be output by `flatterer`. To make sure that these names are correct it is best to use the `tables.csv` that is always in the output directory as an basis for modifying the output. By default if there are tables in the data that are not in the `tabless.csv` they will be added to the output after the defined tables. Use [](#only-tables) to change this behaviour so that only tables in this file will be output. ### CLI Usage ```bash flatterer INPUT_FILE OUTPUT_DIRECTORY --tables tables.csv ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.jl', 'ouput_dir', tables='tables.csv') ``` ## Only Tables Only tables in the tables.csv file will be in the output. ### CLI Usage ```bash flatterer INPUT_FILE OUTPUT_DIRECTORY --tables tables.csv --only-tables ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.jl', 'ouput_dir', tables='tables.csv', only_tables=True) ``` ## Inline One To One When a key has an array of objects as its value, but that array only ever has single items in it, then treat it these single item as if they are a sub-object (not sub array). Without this set any array of objects will be treated like a one-to-many relationship and therefore have a new table associated with it. With this set and if all arrays under a particular key only have one item in it, the child table will not be created and the values will appear in the parent table. ### CLI Usage ```bash flatterer INPUT_FILE OUTPUT_DIRECTORY --inline-one-to-one ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.json', 'ouput_dir', inline_one_to_one=True) ``` ## Arrays as Table Always create a new table for all arrays, even if they do not contain objects. A new table is created with just the `_link` fields and a column named `value` which contains the value of the item in the array. If the array item is a string that value will go in the output, for all other types the JSON encoded version of that type is included. ### CLI Usage ```bash flatterer INPUT_FILE OUTPUT_DIRECTORY --arrays-as-table ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.json', 'ouput_dir', arrays_as_table=True) ``` ## Schema Supply a JSONSchema file to help determine field ordering of the output. If the schema supplied starts with `http` will try and download the schema from a remote server, otherwise it is assumed to be a file-system path. ### CLI Usage For remote url: ```bash flatterer INPUT_FILE OUTPUT_DIRECTORY --schema https://example.com/schema.json ``` For local file: ```bash flatterer INPUT_FILE OUTPUT_DIRECTORY --schema schema.json ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.json', 'ouput_dir', schema='https://example.com/schema.json') ``` ## Table Prefix Prefix to add to all table names. So if the output has a table called `mytable` and the "table prefix" is specified as `myprefix_` then the table (therefore csv file or excel sheet name) will be called `myprefix_mytable`. This can be useful if you are trying to namespace the output when inserting into an exiting database. ### CLI Usage ```bash flatterer INPUT_FILE OUTPUT_DIRECTORY --table-prefix myprefix_ ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.jl', 'ouput_dir', table_prefix='myprefix_') ``` ## Path Separator By default all table and field names have `_` as the seperator which denotes a that the field after the `_` is a sub-property. For example `myObject_myField` says that `myField` exists as a property of `myObject`. However some data already has `_` in the property names. For example if in the input data theres was an object called `my_object` and the property called `my_field` then by default the field name would be `my_object_my_field`. This is confusing as you might expect `object` to be a property of `my` and it might cause some name clashes. To fix this you can change the path separator to whatever you like. You could choose `___` as a separator, so in the example above, the field would be called `my_object__my_field`. ### CLI Usage ```bash flatterer INPUT_FILE OUTPUT_DIRECTORY --path-separator __ ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.jl', 'ouput_dir', path_separoator='__') ``` ## Schema Titles When supplying a JSONSchema then use the `title` field in the schema for the field name (if it exists). This option takes a string which can be one of: * `full` Use make the `title` field from JSONSchema without modification. * `slug` use the `title` field but slugify it removing all charactors that are non alphanumeric characters, lower casing and replaceing spaces with `-`. For example `My *&*Strange Title` will turn into `my-strange-title` * `underscore_slug` same as slug but uses `_` instead of `-`. The previous example the output would be `my_strange_title` If this option is left out or is the empty string then it will not take the heading from JSONSchema. ### CLI Usage ```bash flatterer INPUT_FILE OUTPUT_DIRECTORY --schema-titles underscore_slug ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.jl', 'ouput_dir', schema_titles='underscore_slug_') ``` ## Stats Adds additional statistics about the output files in the `datapackage.json` output. ### CLI Usage ```bash flatterer INPUT_FILE OUTPUT_DIRECTORY --stats ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.json', 'ouput_dir', stats=True) ``` ## Preview The number of rows written in final files. All statistics in `fields.csv` and `datapackage.json` will show counts related to *all* the data. ### CLI Usage Only output first 10 lines of all the tables. ```bash flatterer INPUT_FILE OUTPUT_DIRECTORY --preview 10 ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.json', 'ouput_dir', preview=10) ``` ## Threads The number of threads used to process the data. Default to 1. If set to 0 will use amount of CPUs. Works best with new line delimited JSON `--ndjson` as JSON parsing can then be done by each thread. This can about a x3 times improvement with 6 threads if you have that many CPU cores. Without `--ndjson` makes only about x1.24 improvement on 2 threads and not worth going over 2 as it will not lead to performance improvement. For very small datasets (less than 100 object) using threads will most likely be slower. **Warning:** When using this mode, not checks will be done to ensure an array of objects exists in the data. So in some circumstances, if the wrong options are chosen, no error will be raised. **Warning:** May have issues with inline-one-to-one as each thread will determine what should be inlined. ### CLI Usage ```bash flatterer INPUT_FILE OUTPUT_DIRECTORY --threads 0 ``` ### Python Usage ```python import flatterer flatterer.flatten('inputfile.json', 'ouput_dir', threads=10) ``` ## Sql Scripts Python only. Export scripts for importing data into the database. ### Python Usage flatterer.flatten('inputfile.json', 'ouput_dir', sql_scripts=True) ## Low Memory (api only) Reduces memory usage, sacrificing some speed. Use this if JSON contains very large JSON objects. ### Python Usage ```python import flatterer flatterer.flatten('inputfile.json', 'ouput_dir', low_memory=True) ```