How to configure a ConfiguredAssetDataConnector
This guide demonstrates how to configure a ConfiguredAssetDataConnector, and provides several examples you can use for configuration.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- Set up a working installation of Great Expectations
- Understood the basics of Datasources in 0.13 or later
- Learned how to configure a Data Context using test_yaml_config
Great Expectations provides two DataConnector classes for connecting to Data Assets stored as file-system-like data (this includes files on disk, but also S3 object stores, etc) as well as relational database data:
- A ConfiguredAssetDataConnector allows you to specify that you have multiple Data Assets in a Datasource, but also requires an explicit listing of each Data Asset you want to connect to. This allows more fine-tuning, but also requires more setup.
- An InferredAssetDataConnector infers data_asset_name by using a regex that takes advantage of patterns that exist in the filename or folder structure.
If you're not sure which one to use, please check out How to choose which DataConnector to use.
Steps
1. Instantiate your project's DataContext
Import these necessary packages and modules, then load your Data Context:

    from ruamel import yaml

    import great_expectations as ge
    from great_expectations.core.batch import BatchRequest

    context = ge.get_context()

2. Set up a Datasource
All of the examples below assume you're testing configuration using something like:

YAML:

    datasource_yaml = """
    name: taxi_datasource
    class_name: Datasource
    execution_engine:
      class_name: PandasExecutionEngine
    data_connectors:
      <DATACONNECTOR NAME GOES HERE>:
        <DATACONNECTOR CONFIGURATION GOES HERE>
    """
    context.test_yaml_config(yaml_config=datasource_yaml)

Python:

    datasource_config = {
        "name": "taxi_datasource",
        "class_name": "Datasource",
        "module_name": "great_expectations.datasource",
        "execution_engine": {
            "module_name": "great_expectations.execution_engine",
            "class_name": "PandasExecutionEngine",
        },
        "data_connectors": {
            "<DATA CONNECTOR NAME GOES HERE>": {
                "<DATACONNECTOR CONFIGURATION GOES HERE>"
            },
        },
    }
    context.test_yaml_config(yaml.dump(datasource_config))

If you're not familiar with the test_yaml_config method, please check out: How to configure Data Context components using test_yaml_config
3. Add a ConfiguredAssetDataConnector to a Datasource configuration
ConfiguredAssetDataConnectors like ConfiguredAssetFilesystemDataConnector and ConfiguredAssetS3DataConnector require Data Assets to be explicitly named. A Data Asset is an abstraction that can consist of one or more data_references to CSVs or relational database tables. For instance, you might have a yellow_tripdata Data Asset containing information about taxi rides, which consists of twelve data_references to twelve CSVs, each consisting of one month of data. Each Data Asset can have its own regex pattern and group_names; if configured, these override any pattern or group_names under default_regex.
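The precedence rule described above (asset-level pattern and group_names win over default_regex) can be sketched in plain Python. This is only an illustration of the rule as stated in this guide, not Great Expectations' internal code; the helper name effective_regex and the sample default_regex values are hypothetical.

```python
from typing import Optional


def effective_regex(asset_config: dict, default_regex: Optional[dict]) -> dict:
    """Hypothetical helper: start from default_regex, let asset-level keys win."""
    merged = dict(default_regex or {})
    for key in ("pattern", "group_names"):
        if key in asset_config:
            merged[key] = asset_config[key]
    return merged


# Illustrative connector-level default (an assumption for this sketch).
default_regex = {"pattern": r"(.*)\.csv", "group_names": ["filename"]}

assets = {
    # Defines its own pattern and group_names, so the default is overridden.
    "yellow_tripdata": {
        "pattern": r"yellow_tripdata_(.*)\.csv",
        "group_names": ["month"],
    },
    # Defines nothing, so it falls back to default_regex.
    "green_tripdata": {},
}

resolved = {name: effective_regex(cfg, default_regex) for name, cfg in assets.items()}
for name, regex_config in resolved.items():
    print(name, regex_config)
```

In this sketch yellow_tripdata ends up with its own pattern and group_names, while green_tripdata inherits both from default_regex.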
Imagine you have the following files in <MY DIRECTORY>/:
    <MY DIRECTORY>/yellow_tripdata_2019-01.csv
    <MY DIRECTORY>/yellow_tripdata_2019-02.csv
    <MY DIRECTORY>/yellow_tripdata_2019-03.csv

We could create a Data Asset yellow_tripdata that contains 3 data_references (yellow_tripdata_2019-01.csv, yellow_tripdata_2019-02.csv, and yellow_tripdata_2019-03.csv).
In that case, the configuration would look like the following:
YAML:

    datasource_yaml = r"""
    name: taxi_datasource
    class_name: Datasource
    module_name: great_expectations.datasource
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: PandasExecutionEngine
    data_connectors:
      default_configured_data_connector_name:
        class_name: ConfiguredAssetFilesystemDataConnector
        base_directory: <MY DIRECTORY>/
        assets:
          yellow_tripdata:
            pattern: yellow_tripdata_(.*)\.csv
            group_names:
              - month
    """

Python:

    datasource_config = {
        "name": "taxi_datasource",
        "class_name": "Datasource",
        "module_name": "great_expectations.datasource",
        "execution_engine": {
            "module_name": "great_expectations.execution_engine",
            "class_name": "PandasExecutionEngine",
        },
        "data_connectors": {
            "default_configured_data_connector_name": {
                "class_name": "ConfiguredAssetFilesystemDataConnector",
                "base_directory": "<MY DIRECTORY>/",
                "assets": {
                    "yellow_tripdata": {
                        "pattern": r"yellow_tripdata_(.*)\.csv",
                        "group_names": ["month"],
                    }
                },
            },
        },
    }

Notice that we have specified a pattern that captures the year-month combination after yellow_tripdata_ in the filename and assigns it to the group_name month.
The configuration would also work with a regex capturing the entire filename (e.g. pattern: (.*)\.csv). However, capturing the month on its own allows for batch_identifiers to be used to retrieve a specific Batch of the Data Asset. For more information about capture groups, refer to the Python documentation on regular expressions.
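To see what the two patterns discussed above capture, here is a small sketch using only Python's standard re module. It does not call Great Expectations; the helper name to_batch_identifiers is hypothetical and simply pairs each capture group with the corresponding entry in group_names.

```python
import re
from typing import List, Optional


def to_batch_identifiers(
    filename: str, pattern: str, group_names: List[str]
) -> Optional[dict]:
    """Hypothetical helper: map regex capture groups onto group_names."""
    match = re.match(pattern, filename)
    if match is None:
        return None  # the file would not belong to this Data Asset
    return dict(zip(group_names, match.groups()))


filenames = [
    "yellow_tripdata_2019-01.csv",
    "yellow_tripdata_2019-02.csv",
    "yellow_tripdata_2019-03.csv",
]

# Pattern that captures only the year-month part of the filename.
month_only = {
    name: to_batch_identifiers(name, r"yellow_tripdata_(.*)\.csv", ["month"])
    for name in filenames
}

# Pattern that captures everything before the extension.
whole_name = {
    name: to_batch_identifiers(name, r"(.*)\.csv", ["month"])
    for name in filenames
}

print(month_only["yellow_tripdata_2019-02.csv"])
print(whole_name["yellow_tripdata_2019-02.csv"])
```

With the narrower pattern the captured value is just the year-month string, which is what makes a short identifier such as {"month": "2019-02"} usable for selecting a single Batch.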
Later on we could retrieve the data in yellow_tripdata_2019-02.csv of yellow_tripdata as its own batch using context.get_validator() by specifying {"month": "2019-02"} as the batch_identifier.
    batch_request = BatchRequest(
        datasource_name="taxi_datasource",
        data_connector_name="default_configured_data_connector_name",
        data_asset_name="yellow_tripdata",
    )

    context.create_expectation_suite(
        expectation_suite_name="<MY EXPECTATION SUITE NAME>", overwrite_existing=True
    )

    validator = context.get_validator(
        batch_request=batch_request,
        expectation_suite_name="<MY EXPECTATION SUITE NAME>",
        batch_identifiers={"month": "2019-02"},
    )
    print(validator.head())

This ability to access specific Batches using batch_identifiers is very useful when validating Data Assets that span multiple files.
For more information on batches and batch_identifiers, please refer to the Core Concepts document.
A corresponding configuration for ConfiguredAssetS3DataConnector would look similar but would require bucket and prefix values instead of base_directory.
YAML:

    datasource_yaml = r"""
    name: taxi_datasource
    class_name: Datasource
    module_name: great_expectations.datasource
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: PandasExecutionEngine
    data_connectors:
      default_inferred_data_connector_name:
        class_name: ConfiguredAssetS3DataConnector
        bucket: <MY S3 BUCKET>/
        prefix: <MY S3 BUCKET PREFIX>/
        assets:
          yellow_tripdata:
            pattern: yellow_tripdata_(.*)\.csv
            group_names:
              - month
    """

Python:

    datasource_config = {
        "name": "taxi_datasource",
        "class_name": "Datasource",
        "module_name": "great_expectations.datasource",
        "execution_engine": {
            "module_name": "great_expectations.execution_engine",
            "class_name": "PandasExecutionEngine",
        },
        "data_connectors": {
            "default_inferred_data_connector_name": {
                "class_name": "ConfiguredAssetS3DataConnector",
                "bucket": "<MY S3 BUCKET>/",
                "prefix": "<MY S3 BUCKET PREFIX>/",
                "assets": {
                    "yellow_tripdata": {
                        "group_names": ["month"],
                        "pattern": r"yellow_tripdata_(.*)\.csv",
                    },
                },
            },
        },
    }

The following examples show scenarios that ConfiguredAssetDataConnectors can help you analyze, using ConfiguredAssetFilesystemDataConnector.
Example 1: Basic Configuration for a single Data Asset
Continuing the example above, imagine you have the following files in the directory <MY DIRECTORY>:
    <MY DIRECTORY>/yellow_tripdata_2019-01.csv
    <MY DIRECTORY>/yellow_tripdata_2019-02.csv
    <MY DIRECTORY>/yellow_tripdata_2019-03.csv

Then this configuration:
YAML:

    datasource_yaml = r"""
    name: taxi_datasource
    class_name: Datasource
    module_name: great_expectations.datasource
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: PandasExecutionEngine
    data_connectors:
      default_configured_data_connector_name:
        class_name: ConfiguredAssetFilesystemDataConnector
        base_directory: <MY DIRECTORY>/
        assets:
          yellow_tripdata:
            pattern: (.*)\.csv
            group_names:
              - month
    """

Python:

    datasource_config = {
        "name": "taxi_datasource",
        "class_name": "Datasource",
        "module_name": "great_expectations.datasource",
        "execution_engine": {
            "module_name": "great_expectations.execution_engine",
            "class_name": "PandasExecutionEngine",
        },
        "data_connectors": {
            "default_configured_data_connector_name": {
                "class_name": "ConfiguredAssetFilesystemDataConnector",
                "base_directory": "<MY DIRECTORY>/",
                "assets": {
                    "yellow_tripdata": {
                        "pattern": r"yellow_tripdata_(.*)\.csv",
                        "group_names": ["month"],
                    }
                },
            },
        },
    }

will make available yellow_tripdata as a single Data Asset with the following data_references:
    Available data_asset_names (1 of 1):
        yellow_tripdata (3 of 3): ['yellow_tripdata_2019-01.csv', 'yellow_tripdata_2019-02.csv', 'yellow_tripdata_2019-03.csv']

    Unmatched data_references (0 of 0): []

Once configured, you can get a Validator from the Data Context as follows:
    context.add_datasource(**datasource_config)

    batch_request = BatchRequest(
        datasource_name="taxi_datasource",
        data_connector_name="default_configured_data_connector_name",
        data_asset_name="yellow_tripdata",
    )

    validator = context.get_validator(
        batch_request=batch_request,
        expectation_suite_name="<MY EXPECTATION SUITE NAME>",
    )

But what if the regex does not match any files in the directory?
Then this configuration:
YAML:

    datasource_yaml = r"""
    name: taxi_datasource
    class_name: Datasource
    module_name: great_expectations.datasource
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: PandasExecutionEngine
    data_connectors:
      default_configured_data_connector_name:
        class_name: ConfiguredAssetFilesystemDataConnector
        base_directory: <MY DIRECTORY>/
        assets:
          yellow_tripdata:
            pattern: green_tripdata_(.*)\.csv
            group_names:
              - month
    """

Python:

    datasource_config = {
        "name": "taxi_datasource",
        "class_name": "Datasource",
        "module_name": "great_expectations.datasource",
        "execution_engine": {
            "module_name": "great_expectations.execution_engine",
            "class_name": "PandasExecutionEngine",
        },
        "data_connectors": {
            "default_configured_data_connector_name": {
                "class_name": "ConfiguredAssetFilesystemDataConnector",
                "base_directory": "<MY DIRECTORY>/",
                "assets": {
                    "yellow_tripdata": {
                        "pattern": r"green_tripdata_(.*)\.csv",
                        "group_names": ["month"],
                    }
                },
            },
        },
    }

will give you this output:
    Available data_asset_names (1 of 1):
        yellow_tripdata (0 of 0): []

    Unmatched data_references (3 of 3): ['yellow_tripdata_2019-01.csv', 'yellow_tripdata_2019-02.csv', 'yellow_tripdata_2019-03.csv']

Notice that yellow_tripdata has 0 data_references, and there are 3 Unmatched data_references listed.
This would indicate that some part of the configuration is incorrect and would need to be reviewed.
In our case, changing pattern to yellow_tripdata_(.*)\.csv will fix the problem and give the same output as above.
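Before editing the Datasource configuration, it can help to check a candidate pattern against the filenames directly. The sketch below uses only Python's standard re module and a hypothetical helper, partition_references, to split filenames into matched and unmatched lists. It mirrors the idea behind the report above but is not how Great Expectations produces it.

```python
import re
from typing import List, Tuple


def partition_references(
    filenames: List[str], pattern: str
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: split filenames by whether the pattern matches."""
    regex = re.compile(pattern)
    matched, unmatched = [], []
    for name in filenames:
        if regex.match(name):
            matched.append(name)
        else:
            unmatched.append(name)
    return matched, unmatched


filenames = [
    "yellow_tripdata_2019-01.csv",
    "yellow_tripdata_2019-02.csv",
    "yellow_tripdata_2019-03.csv",
]

# The misconfigured pattern matches nothing, so every file is unmatched.
bad_matched, bad_unmatched = partition_references(filenames, r"green_tripdata_(.*)\.csv")

# The corrected pattern matches all three files.
good_matched, good_unmatched = partition_references(filenames, r"yellow_tripdata_(.*)\.csv")

print(len(bad_matched), len(bad_unmatched))
print(len(good_matched), len(good_unmatched))
```

A quick check like this narrows the problem down to the regex itself, as opposed to base_directory or another part of the configuration.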
Example 2: Basic configuration with more than one Data Asset
Here’s a similar example, but this time two Data Assets are mixed together in one folder.
Note: For an equivalent configuration using InferredAssetFilesystemDataConnector, please see Example 2 in How to configure an InferredAssetDataConnector.
    <MY DIRECTORY>/yellow_tripdata_2019-01.csv
    <MY DIRECTORY>/green_tripdata_2019-01.csv
    <MY DIRECTORY>/yellow_tripdata_2019-02.csv
    <MY DIRECTORY>/green_tripdata_2019-02.csv
    <MY DIRECTORY>/yellow_tripdata_2019-03.csv
    <MY DIRECTORY>/green_tripdata_2019-03.csv

Then this configuration:
YAML:

    datasource_yaml = r"""
    name: taxi_datasource
    class_name: Datasource
    module_name: great_expectations.datasource
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: PandasExecutionEngine
    data_connectors:
      default_configured_data_connector_name:
        class_name: ConfiguredAssetFilesystemDataConnector
        base_directory: <MY DIRECTORY>/
        assets:
          yellow_tripdata:
            pattern: yellow_tripdata_(\d{4})-(\d{2})\.csv
            group_names:
              - year
              - month
          green_tripdata:
            pattern: green_tripdata_(\d{4})-(\d{2})\.csv
            group_names:
              - year
              - month
    """

Python:

    datasource_config = {
        "name": "taxi_datasource",
        "class_name": "Datasource",
        "module_name": "great_expectations.datasource",
        "execution_engine": {
            "module_name": "great_expectations.execution_engine",
            "class_name": "PandasExecutionEngine",
        },
        "data_connectors": {
            "default_configured_data_connector_name": {
                "class_name": "ConfiguredAssetFilesystemDataConnector",
                "base_directory": "<MY DIRECTORY>/",
                "assets": {
                    "yellow_tripdata": {
                        "pattern": r"yellow_tripdata_(\d{4})-(\d{2})\.csv",
                        "group_names": ["year", "month"],
                    },
                    "green_tripdata": {
                        "pattern": r"green_tripdata_(\d{4})-(\d{2})\.csv",
                        "group_names": ["year", "month"],
                    },
                },
            },
        },
    }

will now make yellow_tripdata and green_tripdata both available as Data Assets, with the following data_references:
    Available data_asset_names (2 of 2):
        green_tripdata (3 of 3): ['green_tripdata_2019-01.csv', 'green_tripdata_2019-02.csv', 'green_tripdata_2019-03.csv']
        yellow_tripdata (3 of 3): ['yellow_tripdata_2019-01.csv', 'yellow_tripdata_2019-02.csv', 'yellow_tripdata_2019-03.csv']

    Unmatched data_references (0 of 0): []

Example 3: Example with Nested Folders
In the following example, files are placed in folders that match the data_asset_names we want (yellow_tripdata and green_tripdata), but the filenames follow different formats.

    <MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-01.csv
    <MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-02.csv
    <MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-03.csv
    <MY DIRECTORY>/green_tripdata/2019-01.csv
    <MY DIRECTORY>/green_tripdata/2019-02.csv
    <MY DIRECTORY>/green_tripdata/2019-03.csv

The following configuration:
YAML:

    datasource_yaml = r"""
    name: taxi_datasource
    class_name: Datasource
    module_name: great_expectations.datasource
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: PandasExecutionEngine
    data_connectors:
      default_configured_data_connector_name:
        class_name: ConfiguredAssetFilesystemDataConnector
        base_directory: <MY DIRECTORY>/
        assets:
          yellow_tripdata:
            base_directory: yellow_tripdata/
            pattern: yellow_tripdata_(\d{4})-(\d{2})\.csv
            group_names:
              - year
              - month
          green_tripdata:
            base_directory: green_tripdata/
            pattern: (\d{4})-(\d{2})\.csv
            group_names:
              - year
              - month
    """

Python:

    datasource_config = {
        "name": "taxi_datasource",
        "class_name": "Datasource",
        "module_name": "great_expectations.datasource",
        "execution_engine": {
            "module_name": "great_expectations.execution_engine",
            "class_name": "PandasExecutionEngine",
        },
        "data_connectors": {
            "default_configured_data_connector_name": {
                "class_name": "ConfiguredAssetFilesystemDataConnector",
                "base_directory": "<MY DIRECTORY>/",
                "assets": {
                    "yellow_tripdata": {
                        "base_directory": "yellow_tripdata/",
                        "pattern": r"yellow_tripdata_(\d{4})-(\d{2})\.csv",
                        "group_names": ["year", "month"],
                    },
                    "green_tripdata": {
                        "base_directory": "green_tripdata/",
                        "pattern": r"(\d{4})-(\d{2})\.csv",
                        "group_names": ["year", "month"],
                    },
                },
            },
        },
    }

will now make yellow_tripdata and green_tripdata available as Data Assets, with the following data_references:
    Available data_asset_names (2 of 2):
        green_tripdata (3 of 3): ['2019-01.csv', '2019-02.csv', '2019-03.csv']
        yellow_tripdata (3 of 3): ['yellow_tripdata_2019-01.csv', 'yellow_tripdata_2019-02.csv', 'yellow_tripdata_2019-03.csv']

    Unmatched data_references (0 of 0): []

Example 4: Example with Explicit data_asset_names and more complex nesting
In this example, the assets yellow_tripdata and green_tripdata are being explicitly defined in the configuration, and have a more complex nesting pattern.
    <MY DIRECTORY>/yellow/tripdata/yellow_tripdata_2019-01.txt
    <MY DIRECTORY>/yellow/tripdata/yellow_tripdata_2019-02.txt
    <MY DIRECTORY>/yellow/tripdata/yellow_tripdata_2019-03.txt
    <MY DIRECTORY>/green_tripdata/green_tripdata_2019-01.csv
    <MY DIRECTORY>/green_tripdata/green_tripdata_2019-02.csv
    <MY DIRECTORY>/green_tripdata/green_tripdata_2019-03.csv

The following configuration:
YAML:

    datasource_yaml = r"""
    name: taxi_datasource
    class_name: Datasource
    module_name: great_expectations.datasource
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: PandasExecutionEngine
    data_connectors:
      default_configured_data_connector_name:
        class_name: ConfiguredAssetFilesystemDataConnector
        base_directory: <MY DIRECTORY>/
        default_regex:
          pattern: (.*)_(\d{4})-(\d{2})\.(csv|txt)$
          group_names:
            - data_asset_name
            - year
            - month
        assets:
          yellow_tripdata:
            base_directory: yellow/tripdata/
            glob_directive: "*.txt"
          green_tripdata:
            base_directory: green_tripdata/
            glob_directive: "*.csv"
    """

Python:

    datasource_config = {
        "name": "taxi_datasource",
        "class_name": "Datasource",
        "module_name": "great_expectations.datasource",
        "execution_engine": {
            "module_name": "great_expectations.execution_engine",
            "class_name": "PandasExecutionEngine",
        },
        "data_connectors": {
            "default_configured_data_connector_name": {
                "class_name": "ConfiguredAssetFilesystemDataConnector",
                "base_directory": "<MY DIRECTORY>/",
                "default_regex": {
                    "pattern": r"(.*)_(\d{4})-(\d{2})\.(csv|txt)$",
                    "group_names": ["data_asset_name", "year", "month"],
                },
                "assets": {
                    "yellow_tripdata": {
                        "base_directory": "yellow/tripdata/",
                        "glob_directive": "*.txt",
                    },
                    "green_tripdata": {
                        "base_directory": "green_tripdata/",
                        "glob_directive": "*.csv",
                    },
                },
            },
        },
    }

will make yellow_tripdata and green_tripdata available as Data Assets, with the following data_references:
    Available data_asset_names (2 of 2):
        green_tripdata (3 of 3): ['green_tripdata_2019-01.', 'green_tripdata_2019-02.', 'green_tripdata_2019-03.']
        yellow_tripdata (3 of 3): ['yellow_tripdata_2019-01.', 'yellow_tripdata_2019-02.', 'yellow_tripdata_2019-03.']

    Unmatched data_references (0 of 0): []

Additional Notes
To view the full script used in this page, see it on GitHub: