This repository has been archived by the owner on Nov 8, 2021. It is now read-only.

ETL tool for performing mostly SQL-based data transformation


villebro/sqltask


SqlTask

SqlTask is an extensible ETL library built on SQLAlchemy that helps you build robust ETL pipelines with a strong emphasis on data quality.

Main features of SqlTask:

  • Create well-documented data models that support iterative development of both schema and data transformation logic.
  • Couple data quality checks tightly with transformation logic, and automatically create visualization-friendly, actionable data quality tables.
  • Make use of SQL where practical, especially for expensive data filtering and aggregation during data extraction.
  • Transform data row by row in Python where SQL falls short, e.g. when calling third-party libraries or storing state from previous rows.
  • Encourage use of modern version control tools and processes, especially Git.
  • Performant data uploading/insertion where supported.
  • Easy integration with modern ETL orchestration tools, especially Apache Airflow.
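
The split between SQL and Python described in the list above can be illustrated without SqlTask itself. The following is a minimal sketch using only the standard library `sqlite3` module; the `payments` table, its columns, and the `running_balance` function are hypothetical names chosen for illustration and are not part of the SqlTask API. SQL handles filtering and aggregation, and Python carries state (a running balance) from one row to the next.

```python
import sqlite3


def running_balance(conn):
    """Aggregate in SQL, then carry state across rows in Python."""
    # SQL does the expensive work: drop NULL amounts and sum per customer and day.
    rows = conn.execute(
        """
        SELECT customer, day, SUM(amount) AS total
        FROM payments
        WHERE amount IS NOT NULL
        GROUP BY customer, day
        ORDER BY customer, day
        """
    ).fetchall()

    output = []
    previous_customer, balance = None, 0
    for customer, day, total in rows:
        # State from previous rows: reset the balance when the customer changes.
        if customer != previous_customer:
            previous_customer, balance = customer, 0
        balance += total
        output.append({"customer": customer, "day": day, "balance": balance})
    return output


def demo():
    # Hypothetical source data in an in-memory database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (customer TEXT, day TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO payments VALUES (?, ?, ?)",
        [
            ("a", "2020-01-01", 10),
            ("a", "2020-01-01", 5),
            ("a", "2020-01-02", 7),
            ("b", "2020-01-01", 3),
            ("b", "2020-01-02", None),
        ],
    )
    return running_balance(conn)
```

A running total like this could also be written with SQL window functions; the Python loop becomes the better fit once the per-row logic needs a third-party library or more involved state.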

Word of caution: SqlTask is currently under heavy development, and the API is expected to change frequently.

Supported databases

SqlTask supports all databases that have a SQLAlchemy dialect, with dedicated support for the following engines:

  • Google BigQuery
  • MS SQL Server (experimental)
  • PostgreSQL
  • SQLite
  • Snowflake

Engines not listed above will fall back to using regular inserts.
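
The fallback behaviour can be pictured as a lookup with a default. The sketch below is not SqlTask's implementation: `BULK_UPLOADERS`, `regular_insert`, and `upload` are hypothetical names, and the standard library `sqlite3` connection merely stands in for a database without a dedicated upload path. It assumes that table and column names come from trusted code, since they are interpolated into the statement.

```python
import sqlite3

# Hypothetical registry mapping a dialect name to an engine-specific bulk
# upload function. It is left empty here, so every dialect falls back.
BULK_UPLOADERS = {}


def regular_insert(conn, table, columns, rows):
    """Fallback: plain parameterized INSERT statements."""
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    conn.executemany(sql, rows)
    return "regular"


def upload(dialect, conn, table, columns, rows):
    """Use a dedicated uploader if one is registered, else regular inserts."""
    uploader = BULK_UPLOADERS.get(dialect, regular_insert)
    return uploader(conn, table, columns, rows)
```

Regular inserts work everywhere but are typically slower for large batches than an engine's native bulk loading mechanism, which is why dedicated support matters for performance.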

Installation instructions

To install SqlTask without any optional database-specific dependencies, run

pip install sqltask

To also pull in the dependencies needed by Snowflake, run

pip install sqltask[snowflake]

Please refer to the documentation on Read The Docs for further information.

Developer instructions

By default, sqltask performs minimal validity checking of column values and types to ensure maximum performance. In developer mode, however, sqltask performs additional type checking and ensures that column values are populated in accordance with the schema specification. This can be very helpful while developing new tasks. To enable these checks, set the environment variable SQLTASK_DEVELOPER_MODE=1.
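
To illustrate the idea of checks gated by an environment variable, here is a minimal sketch. Only the variable name `SQLTASK_DEVELOPER_MODE` comes from the text above; the functions `developer_mode_enabled` and `validate_value` are hypothetical, and how sqltask reads the variable and validates values internally may differ.

```python
import os


def developer_mode_enabled(environ=os.environ):
    """Return True when SQLTASK_DEVELOPER_MODE is set to "1"."""
    return environ.get("SQLTASK_DEVELOPER_MODE") == "1"


def validate_value(name, value, expected_type, nullable, environ=os.environ):
    """Hypothetical column check that only runs in developer mode."""
    if not developer_mode_enabled(environ):
        # Fast path: skip the checks entirely for maximum performance.
        return value
    if value is None:
        if not nullable:
            raise ValueError(f"Column {name} is not nullable")
        return value
    if not isinstance(value, expected_type):
        raise TypeError(
            f"Column {name} expects {expected_type.__name__}, "
            f"got {type(value).__name__}"
        )
    return value
```

In a typical shell the variable can be set for a single run by prefixing the command, for example `SQLTASK_DEVELOPER_MODE=1 python my_task.py`, where `my_task.py` is a placeholder for your own script.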
