What Does A Data Engineer Do?


SOURCE: FORBES.COM
SEP 20, 2024

Adrian Bridgwater

Senior Contributor

I track enterprise software application development & data management.

Data engineers engineer. Obviously they do, the clue is in the name. But what does a data engineer really do as part of their daily routine and how does it impact us as users of enterprise software applications and data services today? Given that not all software developers (aka programmers) would necessarily classify themselves as data engineers, how do the two disciplines differ and yet still dovetail enough to form a positive symbiotic bond?

As opinions on this subject now start to proliferate, we need to consider just how many software engineers would consider themselves to be data engineers and, crucially, we need to know just how far their skills and competencies extend in this regard. What kinds of technologies would data engineers get their dirty hands on if it's not the “command line” code that developers love to squelch their fingers into? Will data engineering become increasingly AI-driven and will this role change as a result?

That’s a lot of data engineering questions, so let’s start with the easy one.

What Is A Data Engineer?

“It’s a moving definition really, because the role of the data engineer itself is changing,” advised Alois Reitbauer, chief technology strategist and head of open source at observability company, Dynatrace. “Once we start moving data science and data engineering into the world of application delivery, we immediately need both business experts and ‘technology practice’ specialists (i.e. security professionals, performance gurus, debugging experts) all working together in a unified inter-disciplinary team.”

Reitbauer is realistic: he says we are not there yet with data engineering and that many developer communities lack the skills, which may well be part of why these new hybrid work teams make sense. From inside Dynatrace (a company traditionally known for its cloud application performance management services), he says customers are operationalizing and consolidating the way they run their data workloads. This means that data engineering teams need to adopt modern operations practices such as a Site Reliability Engineering mindset, while operations people need to learn to manage a new type of data-intensive workload. Kubernetes appears to be emerging as the consolidation platform for this effort.

“Imagine if we needed to build large image models to create AI services that would show people how to change a filter on a vacuum cleaner (for want of a pretty random example) today. All that image and video data would need to be digitized and assembled into order via some fairly complex data engineering techniques. Those skills are not being taught as part of a software curriculum in universities and they rarely exist in the real world. We need to start planning now for the data engineering application use cases of tomorrow before they even happen,” said Reitbauer.

This is an exceptionally fluid topic in information technology circles. For example, some data engineers only feel comfortable with the job title if they get to call themselves data streaming engineers, because they work with real-time streaming technologies like Apache Kafka. We need to tread carefully here.

Production Data Pipelines

“Data engineers are critical to the success of data science and AI teams for two main reasons. Firstly, they build the data pipelines that provide the training and experimentation data that data scientists use to conduct analyses, build ML models and design data, ML and AI products. Secondly, they build the production data pipelines that feed the production models and ML/AI pipelines that data scientists and ML engineers build,” detailed Dr. Kjell Carlsson, head of AI strategy at Domino Data Lab.

Carlsson reminds us that, in practice, all data scientists need good data engineering skills because they will constantly need to access additional data, build on the data pipelines they get access to and create new features from this data. Sometimes they find themselves needing to build the production data pipelines as well, although he agrees that this isn’t a recommended practice since specialized data engineers will usually build more efficient and more robust pipelines.
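
To make the two kinds of pipeline Carlsson describes more concrete, here is a minimal sketch in Python. It is an illustration only, not anything Domino Data Lab has published: the `Order` record, the feature names and both pipeline functions are hypothetical. The point it shows is that the batch path that prepares training data and the path that feeds a production model can share one feature-building step, so a model sees the same features in both places.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Order:
    """Hypothetical raw record; a real source would have many more fields."""
    customer_id: str
    amount: float


def build_features(orders):
    """Turn one customer's raw orders into a small feature dictionary."""
    amounts = [o.amount for o in orders]
    return {
        "order_count": len(amounts),
        "avg_amount": mean(amounts) if amounts else 0.0,
        "max_amount": max(amounts, default=0.0),
    }


def training_pipeline(all_orders):
    """Batch path: group historical orders by customer, then build features."""
    by_customer = {}
    for order in all_orders:
        by_customer.setdefault(order.customer_id, []).append(order)
    return {cid: build_features(orders) for cid, orders in by_customer.items()}


def production_pipeline(customer_orders):
    """Serving path: reuse the same feature logic for a single customer."""
    return build_features(customer_orders)


history = [Order("a", 10.0), Order("a", 30.0), Order("b", 5.0)]
training_table = training_pipeline(history)
live_features = production_pipeline([Order("a", 10.0), Order("a", 30.0)])
```

Keeping `build_features` as the single shared step is one common way to avoid the training and production paths drifting apart; a specialized data engineer would add validation, scheduling and monitoring around it.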

“The best organizations embed data engineers in their data science and AI teams to streamline ongoing collaboration throughout the model development process and get faster time to value, better performance and more robust ML and AI applications by doing so,” said Domino’s Dr Carlsson. “With the advent of generative AI, many observers and business leaders incorrectly presumed that data science and AI expertise was no longer necessary and that data engineers, software engineers would be all that is necessary to build AI applications. Unfortunately, this has been a large contributor to the plethora of proof of concept projects that have failed to be put into production and the growing disillusionment with generative AI as a transformative technology.”

Successful data engineering for enterprise generative AI use cases in customer service, tech and biopharma, combined with expert ML and AI skills, could provide us with some clues as to the data engineering skills needed today. We need data experts capable of understanding a business problem; decomposing it so that it is solvable using ML and AI methods; designing, iteratively developing, testing and validating a solution aligned to the business need; and creating a robust ML/AI pipeline that orchestrates a range of new AI technology components.

“These are not skills that data engineers (or developers) are trained on, nor do they have the opportunity to acquire these skills as part of their normal roles. However, data engineers often have excellent prerequisites for becoming successful data scientists, ML engineers and the new role of ‘AI engineer’, it just requires additional training, opportunity and experience,” surmised Domino’s Carlsson.

Breathing Life Into AI Aspirations

John Roese, global chief technology officer and chief AI officer at Dell Technologies, says that every AI aspiration will need data engineering because, without a modern foundation of data, the AI outcomes we desire will never happen. He reminds us that modern data environments are no longer defined by individual components such as databases, extract-transform-load (ETL) systems and analytic tools. Instead, modern data systems are a connected set of technologies that create pathways between sources of data (sensors, apps, telemetry systems, customer portals) and systems that can distil and extract value from that data, including AI chatbots, AI analytic tools, big data tools and agentic systems.
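
For readers who have not met extract-transform-load before, the sketch below shows the pattern at its smallest, using only the Python standard library. Everything in it is an assumption made for illustration: the sensor data, the `readings` table and the Celsius-to-Fahrenheit step are invented, and a real system of the kind Roese describes would connect many such flows rather than one.

```python
import csv
import io
import sqlite3

# Hypothetical raw source: sensor readings in Celsius, one row incomplete.
RAW_CSV = """sensor_id,reading_c
s1,21.5
s2,
s1,22.5
"""


def extract(text):
    """Extract: parse the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no reading and convert Celsius to Fahrenheit."""
    cleaned = []
    for row in rows:
        if not row["reading_c"]:
            continue
        celsius = float(row["reading_c"])
        cleaned.append((row["sensor_id"], celsius * 9 / 5 + 32))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a table that analytic tools can query."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (sensor_id TEXT, reading_f REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

# A downstream consumer (a dashboard or a model, say) can now query the data.
avg_by_sensor = dict(
    conn.execute("SELECT sensor_id, AVG(reading_f) FROM readings GROUP BY sensor_id")
)
```

The three functions are deliberately separate so that each stage can be tested and swapped on its own, which is the property that lets individual flows be connected into the wider fabric or mesh described here.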

“These modern data systems, though, are not about creating connections between single sets of data and single tools to use that data; instead, the modern data systems create a fabric or mesh that allows these modern tools to access data from a range of sources and create real combinatorial insights; think of AI models that can organize content across many sources, AI's that can understand and optimize the entire selling process and AI's that can understand and engage with the whole customer experience from acquisition to services. Today and in the future, data engineering is the skill that defines, implements and operates this modern data system,” said Roese.

As per one of our central questions posed at the outset here, does Roese feel data engineering will become increasingly AI-driven and will the role change as a result?

“Yes! A modern enterprise today could have petabytes to exabytes to even zettabytes of internal data across all sources (this is many times larger than the data used to train the largest LLMs today). Applying AI to that modern enterprise taps into a data ecosystem so large and complex that it would be impossible for any human being to understand or operate manually. Because of that, the only path for modern data engineering to succeed is the aggressive adoption of AI tools and systems to scale the human efforts needed. However, we do not believe that the early co-pilot tools (tools that augment a human doing a task) will be enough,” said Roese.

He underlines the way things are moving and says that, fortunately, we are now entering the era of AI autonomous agents, so the inevitable path is that the human data engineer will become less of a doer of tasks and more of an orchestrator of autonomous agents that can do the work. Roese asks us to imagine a human data engineer who sets the intentions and guidelines of a data strategy but then delegates work in planning, architecting, implementing and even operating to specialised AI agents that execute those tasks collectively.

“As they work, the human is in the loop to make sure they are not getting stuck, that they are aligned with the business objectives and that, when new efforts are needed, that work can be defined and delegated to the right agents. This idea of agents working for and with the data engineer gives us a path to operate at petabyte, exabyte, or even zettabyte scale data systems. Given this, the data engineering role will evolve to be an expert in modern data architecture and a leader of teams of humans plus autonomous agents that will do the work needed to deliver modern data outcomes in the AI era,” concluded Roese.

Our Survey Said

As this debate plays out, extends and solidifies, it is not uncommon (at the time of writing) for technology vendors to hold live QR-code powered audience polls during annual conference keynote sessions asking attendees how many of the assembled identify as data engineers today.

The number is increasing all the time, but the majority of those who say yes would probably also say that they are essentially software application development professionals in the first instance. As this role cements itself and becomes more accurately codified, we may perhaps look back at the end of the current decade on the essentially quite fluid comments made here and see how far we have progressed.

Image: Personal Robots in Production. A row of Topo personal robots at their assembly facility in California. (Photo by Roger ... Corbis/VCG via Getty Images)

Adrian Bridgwater

I am a technology journalist with three decades of press experience.