
SQL Transformer - Data Engineering

Transcript (auto-generated)

Hi, everyone. Today, I’m going to show you part of our new Tray data engineering release, the SQL Transformer, focused on helping teams prepare data properly for AI and advanced automation.

Now modern AI workflows depend on clean, well-shaped, trusted data, but preparing that data inside traditional workflows can quickly become complex and fragile.

To solve that, at the foundation of data engineering is the SQL Transformer, which shifts data transformation from messy row-by-row processing to set-based, database-powered execution.

Now to illustrate this, I'm going to show you what a typical business-logic workflow looks like using the old method, that is, a workflow without the SQL Transformer, and compare it to what the same logic looks like with the SQL Transformer.

Now here I have a workflow that performs a typical enterprise data transformation. Its goal is to take raw, inconsistent sales data from multiple sources and turn it into a regional view showing revenue and performance against targets.

Now data is coming from multiple systems.

I have a customer reference file, a product catalog, a performance targets file, and finally the sales export data, which is a large spreadsheet with all the sales data.

Now in all those sources, data can be malformed, duplicated, or inconsistently formatted, and we basically need to clean it, enrich it, combine it, and calculate performance metrics.

So what do we need to do? First, I gather all the files and extract the data from them.

I have a very simple case here because the files are small, so I don't need to do much transformation. In many real-life cases, though, I would have to paginate or perform additional parsing of those files. I do have to paginate the big sales data export file, because there are several hundred records in it. So I have my pagination loop, and for each page of records, I need to remove duplicates and introduce consistent data formatting, because things like product names can have extra spaces or different casing. I am fixing all of that in this step.
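
For reference, here is roughly what that cleanup pass looks like as a single set-based SQL statement. This is a minimal sketch assuming a raw_sales table with product_name, product_id, customer_id, and quantity columns; the names are illustrative, not the demo's actual schema:

```sql
-- Minimal sketch: trim stray whitespace, normalize casing, drop exact duplicates.
-- Table and column names are illustrative assumptions.
SELECT DISTINCT
    TRIM(LOWER(product_name)) AS product_name,
    product_id,
    customer_id,
    quantity
FROM raw_sales;
```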

Now, for each row, I'm also enriching the data from the other sources. So I take each sales data row and pull additional information from the products and customers files for each sold item. Then I aggregate the results by region. And as you do in pagination, I take each page of records and append it to a list that I will retrieve later in order to perform the remaining calculations.
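
In set-based terms, that per-row enrichment and regional roll-up collapses into joins and a GROUP BY. Again a sketch, with assumed tables and columns (products carrying unit_price, customers carrying region):

```sql
-- Minimal sketch: enrich cleaned sales with reference data, then roll up by region.
-- clean_sales, products, customers and their columns are illustrative assumptions.
SELECT
    c.region,
    SUM(s.quantity * p.unit_price) AS revenue
FROM clean_sales AS s
JOIN products  AS p ON p.product_id  = s.product_id
JOIN customers AS c ON c.customer_id = s.customer_id
GROUP BY c.region;
```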

Now once the list is retrieved after multiple steps of pagination, I calculate revenue splits. And finally, I compare the performance to targets and sort regions by performance. So as you can see, to answer a single business question, how is each region performing against target, we need more than twenty steps across multiple loops. And that count depends on how well the input data is formatted and what other preparation steps you need to perform.
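
In SQL, that final comparison is one more join and an ORDER BY. A sketch assuming a targets table keyed by region and a regional_revenue result like the one above:

```sql
-- Minimal sketch: compare regional revenue to targets and sort by attainment.
-- targets and regional_revenue are illustrative assumptions.
SELECT
    r.region,
    r.revenue,
    t.target,
    ROUND(100.0 * r.revenue / t.target, 1) AS pct_of_target
FROM regional_revenue AS r
JOIN targets AS t ON t.region = r.region
ORDER BY pct_of_target DESC;
```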

Now this is still a simple example. As I said earlier, my files are small, and I'm also using a lot of script logic, bundling multiple transformations into single script steps.

So let’s simplify this.

And here’s my workflow using the SQL Transformer.

Basically, all the transformation logic that you have just seen is packed into one step.

The only thing you need to do is get the spreadsheet, and this step combines that spreadsheet with the additional files that live in the file system and performs all the logic previously done in so many steps. So instead of processing records one by one, we treat files as datasets, load them into an isolated SQL execution environment, and apply set-based transformations in one single step.
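
To make that concrete, the three sketches above chain together as common table expressions into one query of roughly this shape. This is a sketch under the same assumed schema, not the exact query from the demo:

```sql
-- Minimal end-to-end sketch: clean, enrich, aggregate, and score in one statement.
-- All table and column names are illustrative assumptions.
WITH clean_sales AS (
    SELECT DISTINCT
        TRIM(LOWER(product_name)) AS product_name,
        product_id, customer_id, quantity
    FROM raw_sales
),
regional_revenue AS (
    SELECT c.region, SUM(s.quantity * p.unit_price) AS revenue
    FROM clean_sales AS s
    JOIN products  AS p ON p.product_id  = s.product_id
    JOIN customers AS c ON c.customer_id = s.customer_id
    GROUP BY c.region
)
SELECT
    r.region, r.revenue, t.target,
    ROUND(100.0 * r.revenue / t.target, 1) AS pct_of_target
FROM regional_revenue AS r
JOIN targets AS t ON t.region = r.region
ORDER BY pct_of_target DESC;
```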

So let me summarize what we have just seen.

First of all, we operate on entire datasets in a single execution context rather than processing data row by row. This results in less complexity and fewer tasks.

And fewer tasks means only one thing: predictable cost. In the traditional approach, processing thousands of rows required loops and multiple executions per record. Now we can do it in a single step, which simplifies things a lot.

An important thing here is that we are using a well-known standard.

We are not introducing a new transformation language; we are using SQL, a widely known, battle-tested standard that most developers and data teams already know. And if you don't, you can always use AI to help you with the syntax.

This is a scalable, flexible, enterprise-ready solution. There's no additional logic, just simplicity.

It’s ready to deal with hundreds of records and multiple data sources. And finally, this is AI-ready: it produces clean, structured output that can immediately be used by AI agents, prompts, or any other downstream systems.

And to summarize, this is what data engineering at Tray is all about. We are moving away from complex, fragile workflows towards fewer steps, fewer tasks, and more simplicity.

So in short: if you can express the outcome in SQL, you can execute it in one step.

Thank you.