


Get your JSON Configuration In Line with GenSON

January 29, 2021

There exists a great JSON tool called JSONSchema. It allows you to define a structure for all of your JSON configuration files to adhere to. This allows you to do things like implement JSON validation into your CI/CD pipeline — rejecting pull requests that try to merge invalid JSON.
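For reference, a minimal schema looks something like this (the property names here are illustrative, not from any real project):

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "Target": { "type": "string" },
    "Retries": { "type": "integer", "minimum": 0 }
  },
  "required": ["Target"]
}
```

Any JSON document missing a `Target` string, or with a negative `Retries`, fails validation against it.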

But what do you do if you have years and years of JSON files piled up that were not made to adhere to a schema? Is there a way to bring all of these configurations in line? Yes, there is. It requires a bit of elbow grease and the use of a nifty little Python tool called GenSON.

$ genson *.json | jsonpp
{
  "$schema": "",
  "type": "object",
  "properties": {
    "Target": { "type": "string" },
    "TargetSchema": { "type": "string" },
    "TargetTable": { "type": "string" },
    "TableDefinition": {
      "type": "object",
      "properties": {
        "Distribution": {
          "type": "object",
          "properties": {
            "Type": { "type": "string" },
            "DistributionColumn": { "type": [ "null", "string" ] }
          },
          "required": [ "DistributionColumn", "Type" ]
        },
        ...

GenSON Example schema

Usually, you use JSONSchema to validate new JSON files as they come in. GenSON reverses this process: you feed in all of your existing JSON files, and GenSON generates a schema from them. It's not perfect; if you have thousands of lines of JSON scattered across dozens of files, a schema that validates all of them is going to incorporate all of the inconsistencies and off-by-one errors that have gone undetected.

What you can do then is fine-tune it manually. You can use GenSON to spot the patterns — the bits of structure that are consistent in the majority of your files — and decide to incorporate those into your schema and drop the rest.

Say for instance you have to process through 5,780 JSON files, as I recently had to do. What you can do is write a schema that is hopeful — what you'd like all of your config files to adhere to (a sort of bare minimum). Then you validate all of your JSON against it, and you look at percentages.

If a deviation from your schema exists in more than some percentage of the files, you can assume it's actually intentional and incorporate it into your schema. If it falls under that line, you can investigate whether something needs to be fixed manually.
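That triage step can be sketched roughly like this, using jsonschema to tally which rules fail and how often. The schema and configs here are stand-ins, not the real ones from the project:

```python
from collections import Counter
from jsonschema.validators import Draft7Validator

# Hypothetical "hopeful" schema: every config must name a Target and TargetSchema.
schema = {
    "type": "object",
    "required": ["Target", "TargetSchema"],
}
validator = Draft7Validator(schema)

# Stand-ins for the parsed contents of your JSON files.
configs = [
    {"Target": "a", "TargetSchema": "dbo"},
    {"Target": "b", "TargetSchema": "dbo"},
    {"Target": "c"},  # missing TargetSchema
]

# Tally which schema rules are violated, and in how many files.
violations = Counter()
for config in configs:
    for error in validator.iter_errors(config):
        rule = ".".join(map(str, error.absolute_schema_path))
        violations[rule] += 1

total = len(configs)
for rule, count in violations.most_common():
    print(f"{rule}: {count}/{total} files ({100 * count / total:.0f}%)")
```

A rule that fails in most files probably means the schema is wrong; a rule that fails in a handful of files probably means those files are.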

It's a little time-consuming, but eventually you end up with a schema you can use for any new changes in the future.

Here's a little Python script that demonstrates JSONSchema validation:

from jsonschema.exceptions import ValidationError, SchemaError
from jsonschema.validators import Draft7Validator
import json
import sys

python, schema_filename, *files = sys.argv

# the validator returns the reference to the rules and data
# as a deque, which doesn't have a useful __str__ or __repr__,
# so we need a helper
def path2str(path):
    """Return a string for the JSON/schema path."""
    pathstr = '.'.join(map(str, path))
    return f"@ {pathstr}" if pathstr else ""

with open(schema_filename) as schema_file:
    schema = json.load(schema_file)
    validator = Draft7Validator(schema)

for filename in files:
    with open(filename) as config_file:
        config_instance = json.load(config_file)

    error_count = 0
    for error_count, error in enumerate(
            sorted(validator.iter_errors(config_instance), key=str), start=1):
        print(
            f"{filename} - #{error_count} - {error.message}"
            f" {path2str(error.path)}\n"
            f"    see {path2str(error.absolute_schema_path)} rule"
            f" in {schema_filename}.\n"
        )

    if not error_count:
        print(f"{filename} validates.")

And here it is in action; I've intentionally misspelled a required JSON property in the file bad.json:

$ python ./ schemas/my_schema.json bad.json good.json
bad.json - #1 - 'TargetSchema' is a required property 
    see @ required rule in my_schema.json.

good.json validates.

There's an old software engineering adage that says "minutes of testing save hours of coding." This applies here.

If you're just starting a project that's going to make heavy use of JSON, you owe it to yourself to create a schema to handle all your JSON moving forward. Even if you already have a large number of JSON files with no schema, taking the hours to reverse engineer a schema for them will save you days of coding, reconfiguring, managing, and troubleshooting in the future.

Using this technique, I was able to generate a 988-line schema file to manage all 5,780 JSON files. The whole process took about an hour, and my client and I sleep easy knowing that errors and inconsistencies will be caught quickly upstream and their configurations will remain consistent and reliable for years to come.
