The AI Challenge for Data Engineers
The era of the traditional data engineer is over. It really is that simple. As AI continues to change the landscape, data engineers must change with it if they are to keep adding real value to the business. The workload to get data AI-ready is growing exponentially, and 90% of data experts are already looking to relieve the pressure of fragmented pipelines and overwhelming business demands. Something has to change.
The way that we use AI technology, and its potential to increase productivity exponentially, means data engineering is evolving at pace. With that comes the need to upskill. Enter the age of prompt engineering.
What does AI mean for data engineering?
Data engineers are used to change. The role has radically expanded over the past decade - we used to call them ETL engineers, then big data engineers. The skills needed now span ETL designer, data modeller, SQL developer, site reliability engineer and security analyst - and now, with AI, python programmer and data scientist.
Ultimately, the way we use AI technology is in some ways different and in some ways the same. What's the same is how engineers work iteratively to solve a problem. They design the code, configure the component, write a bit of code… run, test, deploy to arrive at an outcome. What's different is the way they can now solve that problem. Rather than writing the code themselves, they can ask a chat prompt to create the code or design the pipeline for them, then iterate on the context and the requests they make of that prompt until they arrive at the outcome. That's different, and that's the territory of prompt engineering.
Secondly, data engineers now need to be able to train - or, more often than not, tune - a large language model. Take a simple example: you ask the model, "What is the capital of France?" That's a basic prompt. More often than not, it doesn't come back with a single-word answer. Usually it will come back with: 'The capital of France is Paris'. If you're working on an analytics problem and you want to build a report, all you actually need is the single word 'Paris' without the extra fluff. To get that, you edit the prompt, telling the model: 'Give me a single word answer'.
With the incremental way you tune the model, you may then say: 'Only using the following numbers, answer the following question' - for example, a date of birth needs to be a positive number; a negative number wouldn't make any sense.
At its most basic level, you do that multiple times over with each different element until you get the response that you need from the model. That’s the core of prompt engineering.
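That iterative loop can be sketched in a few lines of Python. The `ask_model` function below is a hypothetical stand-in for whatever chat-model API you use - here it is stubbed with canned responses so the sketch is self-contained and only illustrates the tightening of the prompt, not a real model call.

```python
# A minimal sketch of the iterative prompt-tuning loop described above.
# `ask_model` is a hypothetical stand-in for any chat-model API;
# it is stubbed here so the example runs without external services.

def ask_model(prompt: str) -> str:
    """Hypothetical model call, stubbed for illustration only."""
    if "single word" in prompt.lower():
        return "Paris"
    return "The capital of France is Paris."

# First attempt: a basic prompt comes back with extra fluff.
verbose = ask_model("What is the capital of France?")

# Iterate: constrain the response format until it fits the pipeline.
concise = ask_model("Give me a single word answer: what is the capital of France?")

print(verbose)   # The capital of France is Paris.
print(concise)   # Paris
```

In practice each constraint - format, allowed values, length - is layered on in the same way, one prompt revision at a time.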
Iterative, incremental, intelligent
There’s a huge amount of work that goes into getting that model to respond in the right way, into getting each prompt right. That work can take hours, if not days or even weeks. But, once you have it, you can use it over and over again. Data engineers do a lot of that iterative work already - they might take a bit of data, they might fix something here, tweak something there, but they’re not currently training models, whereas the training of a model is something a data scientist is very comfortable with.
Working with a model in this way isn’t hard to learn, it’s just a different skill. Once you learn it and you have the model fully tuned, you do all the other work, boosting productivity, being creative, making life easier!
You can take 10,000 questions and responses, feed them into a model, and it comes back with single-word answers, which you can then put back into the data warehouse for your BI reporting. That's what is different for the data engineer: they've not traditionally done that fine tuning as part of the data engineering role. I think teams can learn it with relative ease, but they're going to have to learn what it means to tune a model - prompt engineering - and that is a whole new skillset.
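The batch pattern above might look something like this. The model call, the prompt template, and the tiny lookup standing in for the model's knowledge are all illustrative assumptions; the point is the shape - tuned prompt in, warehouse-ready rows out.

```python
# Sketch of the batch pattern: run many questions through a tuned
# prompt and collect single-word answers ready for the warehouse.
# `ask_model` is a hypothetical, stubbed model call, not a real API.

CAPITALS = {"France": "Paris", "Spain": "Madrid", "Italy": "Rome"}

def ask_model(prompt: str) -> str:
    # Stub: pretend the tuned model returns a clean single-word answer.
    country = prompt.split("of ")[-1].rstrip("?")
    return CAPITALS.get(country, "UNKNOWN")

TUNED_PROMPT = "Single word answer only. What is the capital of {}?"

questions = list(CAPITALS)  # stands in for the 10,000 real questions
rows = [{"question": q, "answer": ask_model(TUNED_PROMPT.format(q))}
        for q in questions]

# `rows` can now be loaded back into the data warehouse for BI reporting.
print(rows[0])
```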
AI for data engineers vs data scientists
So does that mean a blurring of the lines between data engineers and data scientists? I don’t think so.
When you look at how a data scientist does their job compared with a data engineer, they’re like oil and water: they sit side by side but work in quite different ways. The data scientist builds and initially trains the model. The data engineer uses the model - and to use the model you have to tune it. You can’t just use it. If only! So, whilst data engineers need that new skillset of prompt engineering, there’s still a clear distinction between data scientist and data engineer.
Am I hallucinating?
One of the things I am asked about most when it comes to AI is hallucinations. How do we trust the answers that the model is producing, and how do we ensure it isn’t giving wrong answers (hallucinations)?
The answer to this question comes down to the fine tuning and training work done by the data engineers. Whilst the data scientists do incredible work building out these models and delivering the plumbing to make them function, it is the data engineer who makes them usable. Without the fine tuning, answers may be too long, too random, or simply inaccurate. You’re more likely to see the model hallucinate, which has massive potential ramifications.
Alongside this fine-tuning work, running your data pipelines with different models to determine which model performs the task best is relatively easy and further reduces that hallucination risk.
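That comparison step can be sketched simply: run the same tuned prompts through each candidate model and score the answers against known-good results. The model names and the exact-match scoring rule here are illustrative assumptions, not a prescribed benchmark.

```python
# Sketch of comparing candidate models on the same pipeline task.
# The two "models" are stubs: one returns the tuned single-word
# format, the other ignores the format constraint.

EXPECTED = {
    "What is the capital of France?": "Paris",
    "What is the capital of Spain?": "Madrid",
}

def fake_model_a(prompt: str) -> str:
    return EXPECTED.get(prompt, "unknown")

def fake_model_b(prompt: str) -> str:
    return "The answer is " + EXPECTED.get(prompt, "unknown")

def score(model) -> float:
    # Fraction of answers matching the expected format and value.
    hits = sum(model(q) == a for q, a in EXPECTED.items())
    return hits / len(EXPECTED)

scores = {"model_a": score(fake_model_a), "model_b": score(fake_model_b)}
best = max(scores, key=scores.get)
print(best, scores)
```

Model B "knows" the answers but fails the format the pipeline needs, which is exactly the kind of difference this comparison surfaces.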
With the iterative work to tune the model for these specific use cases, with specific prompts, you significantly reduce the risk of hallucinations whilst enabling the model to make your life as a data team materially more productive. This iterative work also makes the model’s results easier to use. For example, if we’re doing reporting, there’s no point getting English-language text back from the model - you can stick it into a report, but you can’t build a chart from it. So you ask the model to respond specifically with a number. All of that skill - getting the results in the right format - is part of tuning.
The final element of this process is the audit trail. Keeping a record to track model performance over time, whilst also correcting mistakes or hallucinations that will inevitably occur, is absolutely essential.
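A minimal audit trail might be a sketch like this: record every prompt, response, and verdict so performance can be tracked over time and hallucinations corrected. The record shape and field names are illustrative assumptions, not a prescribed schema.

```python
# Sketch of an audit trail for model responses: every call is logged
# with a timestamp and a correctness verdict, so accuracy can be
# tracked over time and hallucinations found and corrected.

import datetime

audit_log = []

def record(prompt: str, response: str, correct: bool) -> None:
    audit_log.append({
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "prompt": prompt,
        "response": response,
        "correct": correct,
    })

record("Capital of France? Single word.", "Paris", True)
record("Capital of France? Single word.", "Lyon", False)  # a hallucination to correct

accuracy = sum(r["correct"] for r in audit_log) / len(audit_log)
print(f"accuracy so far: {accuracy:.0%}")
```

In a real pipeline these records would land in the warehouse alongside everything else, so model quality becomes just another metric you report on.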
AI and GenAI are of course changing the space for data engineers - there’s no AI without data. We’re stating the obvious, but that’s nothing new. Over the past decade, data engineering roles have evolved massively already - from ETL developer to big data engineer, SQL developer, security analyst and beyond. The role is continually evolving with the tech.
Now, the role moves into the AI space with prompt engineering, vector databases and large language models. What does that mean on the ground for data engineers?
- Increased data workload to get data AI ready.
- Material increase in data engineering productivity through AI and GenAI technologies.
- Data engineers need to retrain and upskill into prompt engineering to enable this productivity boost and add AI into their everyday work lives.
- Embrace the right tools to enhance the data engineering role.
The role of the data engineer just got a hell of a lot more interesting.
We’re deep diving into this at our Data Unlocked conference, streamed globally on November 15th, with speakers from across the data space including former Google Chief Business Officer Mo Gawdat, Snowflake CEO Frank Slootman, speakers from Databricks, and Matillion CEO Matthew Scullion. We’ll explore which tools make the transition into an AI-led world easier and much more fun.
About the Author
Ciaran Dynes is Chief Product Officer at Matillion, leading the product strategy to provide users with the technology they need to improve data productivity. Ciaran is an accomplished product leader with over 20 years of experience in global product development companies, driving cross-functional teams, managing products from cradle to maturity, and providing the foundation for new product development investments. Before joining Matillion, he held a series of roles at leading integration software vendors including Talend, Progress Software, and IONA Technologies.