Source
```yaml
id: sensitive-data
namespace: company.team

tasks:
  - id: extract
    type: io.kestra.plugin.core.http.Download
    uri: https://huggingface.co/datasets/kestra/datasets/raw/main/csv/orders.csv

  - id: transform
    type: io.kestra.plugin.jdbc.duckdb.Query
    inputFiles:
      data.csv: "{{ outputs.extract.uri }}"
    sql: |
      CREATE TABLE orders_pii AS
      SELECT order_id,
             hash(customer_name) as customer_name_hash,
             md5(customer_email) as customer_email_hash,
             product_id,
             price,
             quantity,
             total
      FROM read_csv_auto('{{ workingDir }}/data.csv');
      COPY (SELECT * FROM orders_pii) TO '{{ outputFiles.csv }}' (HEADER, DELIMITER ',');
    outputFiles:
      - "*.csv"

  - id: load
    type: io.kestra.plugin.gcp.bigquery.Load
    from: "{{ outputs.transform.outputFiles.csv }}"
    serviceAccount: "{{ secret('GCP_CREDS') }}"
    projectId: yourGcpProject
    destinationTable: yourGcpProject.stage.orders_pii
    format: CSV
    autodetect: true
    csvOptions:
      fieldDelimiter: ","
```
About this blueprint
This flow has three tasks: `extract`, `transform`, and `load`.

- The `extract` task here is a simple HTTP Download task, but you can replace it with any task or custom script that extracts data.
- The `transform` task reads the output of the `extract` task and transforms it using DuckDB. Specifically, it hashes sensitive values so that they are masked before being loaded to the final destination (see the sketch after this list for a stronger hashing variant).
- The `load` task loads that extracted and transformed data to BigQuery.
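Note that `md5` of a low-entropy value such as an email address can be reversed by brute force. If you need stronger masking, one option is to prepend a secret salt and use DuckDB's built-in `sha256` function instead. A minimal sketch of the `transform` task under that assumption (the `SALT` secret is hypothetical and must exist in your Kestra secrets backend):

```yaml
- id: transform
  type: io.kestra.plugin.jdbc.duckdb.Query
  inputFiles:
    data.csv: "{{ outputs.extract.uri }}"
  sql: |
    -- Prepending a secret salt makes the hashes much harder to brute-force;
    -- the secret() call is rendered by Kestra before the SQL runs.
    CREATE TABLE orders_pii AS
    SELECT order_id,
           sha256('{{ secret('SALT') }}' || customer_name) AS customer_name_hash,
           sha256('{{ secret('SALT') }}' || customer_email) AS customer_email_hash,
           product_id,
           price,
           quantity,
           total
    FROM read_csv_auto('{{ workingDir }}/data.csv');
    COPY (SELECT * FROM orders_pii) TO '{{ outputFiles.csv }}' (HEADER, DELIMITER ',');
  outputFiles:
    - "*.csv"
```

The same salt produces the same hash for the same input, so the masked columns still support joins and deduplication downstream.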
If you use MotherDuck, use a Kestra Secret to store the MotherDuck service token. Then, add the `url` property to point the task to your MotherDuck database:
```yaml
- id: transform
  type: io.kestra.plugin.jdbc.duckdb.Query
  url: "jdbc:duckdb:md:my_db?motherduck_token={{ secret('MOTHERDUCK_TOKEN') }}"
```