How to configure a RuntimeDataConnector
This guide demonstrates how to configure a RuntimeDataConnector and only applies to the V3 (Batch Request) API. A RuntimeDataConnector allows you to specify a Batch using a Runtime Batch Request, which is used to create a Validator. A Validator is the key object used to create Expectations and validate datasets.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- An understanding of the basics of Datasources in the V3 (Batch Request) API
- Learned how to configure a Data Context using test_yaml_config
A RuntimeDataConnector is a special kind of Data Connector that enables you to use a RuntimeBatchRequest to provide a Batch's data directly at runtime. The RuntimeBatchRequest can wrap an in-memory DataFrame, a filepath, or a SQL query, and must include batch identifiers that uniquely identify the data (e.g. a run_id from an Airflow DAG run). The batch identifiers that must be passed in at runtime are specified in the RuntimeDataConnector's configuration.
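For example, if a connector is configured with a single batch identifier named airflow_run_id, every RuntimeBatchRequest made against it must supply a value for that identifier. The sketch below is only an illustration; the datasource, connector, asset, and identifier names are hypothetical placeholders, not part of this guide's configuration:

```python
from great_expectations.core.batch import RuntimeBatchRequest

# Hypothetical names: a Datasource "my_datasource" whose RuntimeDataConnector
# "my_runtime_connector" was configured with the batch identifier "airflow_run_id".
batch_request = RuntimeBatchRequest(
    datasource_name="my_datasource",
    data_connector_name="my_runtime_connector",
    data_asset_name="my_data_asset",
    runtime_parameters={"path": "/path/to/my_data.csv"},  # or "batch_data" / "query"
    batch_identifiers={"airflow_run_id": "manual__2021-01-01T00:00:00"},  # supplied at runtime
)
```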
Steps
1. Instantiate your project's DataContext
Import these necessary packages and modules:
- YAML

```python
import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest
from ruamel import yaml
```

- Python

```python
import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest
```

2. Set up a Datasource
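The configuration snippets below call methods on a context object; they assume your project's Data Context was loaded into memory in step 1, for example:

```python
# Load the Data Context from your project's great_expectations/ directory.
context = ge.get_context()
```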
All of the examples below assume you’re testing configuration using something like:
- YAML

```python
datasource_yaml = """
name: taxi_datasource
class_name: Datasource
execution_engine:
  class_name: PandasExecutionEngine
data_connectors:
  <DATACONNECTOR NAME GOES HERE>:
    <DATACONNECTOR CONFIGURATION GOES HERE>
"""
context.test_yaml_config(yaml_config=datasource_yaml)
```

- Python

```python
datasource_config = {
    "name": "taxi_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "<DATACONNECTOR NAME GOES HERE>": {
            "<DATACONNECTOR CONFIGURATION GOES HERE>"
        },
    },
}
context.test_yaml_config(yaml.dump(datasource_config))
```

If you're not familiar with the test_yaml_config method, please check out: How to configure Data Context components using test_yaml_config
3. Add a RuntimeDataConnector to a Datasource configuration
This basic configuration can be used in multiple ways depending on how the RuntimeBatchRequest is configured:
- YAML

```python
datasource_yaml = r"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    batch_identifiers:
      - default_identifier_name
"""
```

- Python

```python
datasource_config = {
    "name": "taxi_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        },
    },
}
```

Once the RuntimeDataConnector is configured, you can add your Datasource using:
```python
context.add_datasource(**datasource_config)
```

Example 1: RuntimeDataConnector for access to file-system data
At runtime, you would get a Validator from the Data Context by first defining a RuntimeBatchRequest with the path to your data defined in runtime_parameters:
```python
batch_request = RuntimeBatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="<YOUR MEANINGFUL NAME>",  # This can be anything that identifies this data_asset for you
    runtime_parameters={"path": "<PATH TO YOUR DATA HERE>"},  # Add your path here.
    batch_identifiers={"default_identifier_name": "<YOUR MEANINGFUL IDENTIFIER>"},
)
```

Next, you would pass that request into context.get_validator:
```python
validator = context.get_validator(
    batch_request=batch_request,
    create_expectation_suite_with_name="<MY EXPECTATION SUITE NAME>",
)
```

Example 2: RuntimeDataConnector that uses an in-memory DataFrame
At runtime, you would get a Validator from the Data Context by first defining a RuntimeBatchRequest with the DataFrame passed into batch_data in runtime_parameters:
```python
import pandas as pd

path = "<PATH TO YOUR DATA HERE>"
df = pd.read_csv(path)
```
```python
batch_request = RuntimeBatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="<YOUR MEANINGFUL NAME>",  # This can be anything that identifies this data_asset for you
    runtime_parameters={"batch_data": df},  # Pass your DataFrame here.
    batch_identifiers={"default_identifier_name": "<YOUR MEANINGFUL IDENTIFIER>"},
)
```

Next, you would pass that request into context.get_validator:
```python
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="<MY EXPECTATION SUITE NAME>",
)
print(validator.head())
```

Additional Notes
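From here you can use the Validator to create Expectations against the Batch and persist them to the Expectation Suite. A minimal sketch (the column name and the specific Expectation are placeholders, not part of this guide's example data):

```python
# Hypothetical example: add an Expectation to the suite and save it.
validator.expect_column_values_to_not_be_null(column="<COLUMN NAME>")
validator.save_expectation_suite(discard_failed_expectations=False)
```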
To view the full script used in this page, see it on GitHub: