Users can now use the DataHub to validate their tabular data, for example checking that dates really are dates or that a column of daily revenue is always positive.
Data validation is also integrated as part of the data publishing flow making it easier to debug general errors – the most common issue we’ve found in our own publishing experience is a typo in a data value that then breaks graphing or other data presentation and processing.
Data Validation now happens automatically for you whenever you push data to the DataHub so to validate data in a file or data package just push it to the DataHub:
# Note: in this case DataHub will auto-infer the types of the columns
data push myfile.csv
# a data package
You can find more about pushing data to the DataHub here.
You may need to customize the schema for your file – perhaps DataHub has guessed the types wrong.
We use the Table Schema for your data file along with the powerful goodtables library to validate your data and generate a JSON validation report.
This report is then used in your dataset page to show you a human-readable version of the errors (a JSON version is also available).
Results of the report will be displayed in an easy to understand form. Specifically, if there are errors you will see a detailed report like this:
This table displays errors in a validation report. You can see that in the example above we have 3 badges indicating key details (
Type or Format Error message. It means 985 values have type or format errors when validating against the schema. Details can be expanded by clicking on “Error details” link on the right. You also can find values causing this error on the table - they have a red background colour. By default, we show first 10 rows but you can open more of them by clicking on “Show next 10 rows” link.
In the screenshot below, you can see how a validation report looks like for “Minimum Constraint” error. Revenue value cannot be negative so we’ve set
constraints property with a
minimum attribute as 0 and validation process identified which value is not meeting that requirement:
Below are properties that you can use in a table schema for validation of your data:
constraints - you may use this property to validate field values, e.g., a column with “Daily Revenue” or “Sales” must be a positive number so you’d have a
minimum attribute set to 0.
type is a string indicating the type of this field, e.g.,
format is a string indicating a format for the field type, e.g., you can have
email format for
string type so it validates if values in the field are emails.
missing values can be used to indicate a special value for empty entries, e.g.,
You can find full table schema specs here https://frictionlessdata.io/specs/table-schema.