Ensure data quality with asset checks
Data quality is critical in data pipelines. Checking individual assets catches data quality issues before they spread through the rest of the pipeline.
In Dagster, you define asset checks like you define assets. Asset checks run when an asset is materialized. In this step, you will:
- Define an asset check
- Execute that asset check in the UI
1. Define an asset check
The asset check can go in the `assets.py` file next to the asset we just defined. An asset check can contain any logic we want. In our case, we query the `joined_data` table created by our asset and ensure that `customer_id` is never null:
```python
# These imports are already at the top of assets.py from earlier steps,
# and `joined_data` is the asset defined earlier in this file.
import dagster as dg
from dagster_duckdb import DuckDBResource


@dg.asset_check(
    asset=joined_data,
    description="Check if there are any null customer_ids in the joined data",
)
def missing_dimension_check(duckdb: DuckDBResource) -> dg.AssetCheckResult:
    table_name = "jaffle_platform.main.joined_data"
    with duckdb.get_connection() as conn:
        query_result = conn.execute(
            f"""
            select count(*)
            from {table_name}
            where customer_id is null
            """
        ).fetchone()

    count = query_result[0] if query_result else 0
    return dg.AssetCheckResult(
        passed=count == 0, metadata={"customer_id is null": count}
    )
```
The asset check uses the same `DuckDBResource` resource we defined for the asset. Resources can be shared across all objects in Dagster.
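As a minimal sketch of that sharing, the resource only needs to be registered once; any asset or asset check whose parameter name matches the resource key receives the same instance. The database path below is an illustrative assumption, not the tutorial's configured value.

```python
import dagster as dg
from dagster_duckdb import DuckDBResource

# Sketch: register the DuckDB resource once under the key "duckdb".
# Both joined_data and missing_dimension_check declare a `duckdb`
# parameter, so both receive this same instance at runtime.
# NOTE: the database path is an assumption for illustration.
defs = dg.Definitions(
    resources={"duckdb": DuckDBResource(database="/tmp/jaffle_platform.duckdb")},
)
```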
In the Dagster UI, you can now see that an asset check is associated with the `joined_data` asset.
TODO: Screenshot
Asset checks run automatically when an asset is materialized, but you can also execute them manually in the UI (a sketch for invoking the check in a Python test follows these steps):
- Reload your Definitions.
- Navigate to the Asset Details page for the `joined_data` asset.
- Select the "Checks" tab.
- Click the Execute button for `missing_dimension_check`.
TODO: Screenshot
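Asset checks are also plain Python functions, so a quick way to exercise the check outside the UI is to invoke it directly with the resource it needs. This is a sketch: the import path assumes the module layout shown in the summary below, and the database path is an illustrative assumption.

```python
import dagster as dg
from dagster_duckdb import DuckDBResource

# Assumes the package layout shown in the summary below.
from etl_tutorial.defs.assets import missing_dimension_check


def test_missing_dimension_check():
    # Directly invoke the check, supplying the resource it declares.
    # NOTE: the database path is an assumption; use the path configured
    # in resources.py.
    result = missing_dimension_check(
        duckdb=DuckDBResource(database="/tmp/jaffle_platform.duckdb")
    )
    assert isinstance(result, dg.AssetCheckResult)
    assert result.passed
```

Run it with `pytest` from the project root.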
Summary
The structure of the `etl_tutorial` module has remained the same:
```
src
└── etl_tutorial
    ├── __init__.py
    └── defs
        ├── __init__.py
        ├── ingest_files
        │   ├── defs.yaml
        │   └── replication.yaml
        ├── jdbt
        │   └── defs.yaml
        ├── assets.py
        └── resources.py
```
But there is now an asset check on the `joined_data` asset to help ensure the quality of the data in our pipeline.
Next steps
- Continue this tutorial by creating and materializing partitioned assets