Data engineers design, manage, and optimize the flow of data within an organization. And in an age of big data and AI, that’s one of the most important and in-demand jobs. A 2016 report found a total of 6,500 data engineers in the global workforce, but over 6,600 open job listings for the position in the city of San Francisco alone. The need has only grown since then, and data engineers now hold some of the most critical roles across a wide range of industries.
For example, when a medical facility first makes the transition to electronic health records and digital collection, it’s awash with data, and most of that data ends up in isolated silos. But data only produces searchable, actionable insights when used in conjunction with other data. That’s where a data engineer comes in, building an infrastructure of data pipelines, distributed systems, and a single data lake into which all data can be securely deposited and from which it can be queried. Operationalizing an institution’s data resources in this way has a high, quantifiable value, which is part of the reason data engineers are paid so handsomely, with most earning well over $100,000 per year.
While there is frequent collaboration between data scientists and data engineers, they’re different positions that prioritize different skill sets. Data scientists focus on advanced statistics and mathematical analysis of the data that’s generated and stored, all in the interest of identifying trends and solving business needs or industry questions. But they can’t do their job without a team of data engineers who have advanced programming skills (Java, Scala, Python) and an understanding of distributed systems and data pipelines. Some companies and universities still merge the roles of data scientist and data engineer, but that practice is declining, and separating the two roles is increasingly important, according to Forrester, an industry research firm.
Compared to careers in law and medicine, the role of data engineer is still so young that there aren’t many clearly defined steps to becoming one. A multitude of paths exist. The critical badge for any data engineer is not necessarily an advanced degree, but a true demonstration of capability. How one develops and certifies that capability is a customized and personalized journey.
Check out our step-by-step guide below, and start engineering your future.
After graduating from high school, aspiring data engineers need to earn a bachelor’s degree, ideally in computer science. Admissions requirements vary from school to school, but typically include a competitive GPA (3.0 or greater), SAT or ACT scores, and a personal statement or letters of recommendation. Previous STEM experience can be seen as a bonus. Once enrolled in an undergraduate program, students should seek out any opportunities for hands-on experience, as data engineering is much more practice-based than theory-based.
The University of Florida has an online bachelor’s degree in computer science. Required foundational coursework (which may be transferred over from another institution) covers analytic geometry and calculus, computational linear algebra, physics with calculus, and engineering statistics. Core coursework includes classes in programming fundamentals, information and database systems, data structures and algorithms, and digital logic. The program consists of 120 credits and costs $552 per credit.
Regis University also offers an online bachelor’s degree in computer science. In addition to breadth requirements, students take courses in data structures, algorithms, and the principles of programming languages. Upper division courses include topics like data science, database management, distributed systems, and artificial intelligence. The program consists of 120 credits and costs $1,139 per credit-hour.
Data engineering, like many computer science fields, tends toward meritocracy. If you’re the most capable candidate, then you have a good chance of being hired. It’s entirely possible to be hired for an entry-level job out of college, and that’s a perfect opportunity to start building a portfolio of experience and achievement in the field. Work experience is its own education, and a little of it goes a long way in helping you assess your level of competency and determine your next steps.
While it’s not a necessary step, earning a master’s degree in computer science can be useful for those who want to leave their options open for crossover roles between data engineering, data science, and management. In addition to learning advanced skills, students of graduate programs can also build their professional networks and get career mentoring as a result of their enrollment. Admissions requirements vary from program to program, but often include some combination of the following: a competitive GPA (3.0 or greater), GMAT or GRE scores, letters of recommendation, a personal statement, and some level of work experience.
Arizona State University has a master of computer science (MCS) program that can be completed entirely online. Classes cover topics such as the foundations of algorithms, information assurance and security, data processing at scale, and deep learning in visual computing. The program consists of 30 credits, and costs $500 per credit-hour.
The University of Colorado, Boulder offers an on-campus MS in computer science with an emphasis in data science and engineering. Breadth requirements cover areas like artificial intelligence, programming languages, database systems, and human-centered computing. The data science and engineering emphasis includes topics such as datacenter scale computing, big data architecture, computer storage systems, and big data analytics. The program consists of 30 credits and costs an estimated $31,340 for out-of-state students.
Those interested in performing crossover duties between data science and data engineering may choose to pursue an online master of computer science in data science offered by the University of Illinois. Breadth courses cover topics like applied machine learning, database systems, data visualization, and cloud networking. Advanced coursework adds on classes in advanced Bayesian modeling, the foundations of data curation, and the practice of data cleaning. The program consists of 32 credits and costs $600 per credit.
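Because two of these master’s programs are priced per credit and one is quoted as a total, a quick calculation helps put them on the same footing. The sketch below uses only the figures quoted in this guide; it assumes tuition is simply credits multiplied by the per-credit rate, and it ignores fees, aid, and any price changes. The `Program` class and its field names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Program:
    name: str
    credits: int
    cost_per_credit: Optional[int] = None  # dollars, when priced per credit
    flat_estimate: Optional[int] = None    # dollars, when quoted as a total

    def estimated_tuition(self) -> int:
        """Return the quoted total, or credits x per-credit rate (assumption)."""
        if self.flat_estimate is not None:
            return self.flat_estimate
        if self.cost_per_credit is None:
            raise ValueError(f"No pricing information for {self.name}")
        return self.credits * self.cost_per_credit


# Figures as quoted in this guide; fees and later price changes are not included.
PROGRAMS = [
    Program("Arizona State University (online MCS)", 30, cost_per_credit=500),
    Program("University of Colorado, Boulder (on-campus MS, out-of-state)", 30,
            flat_estimate=31_340),
    Program("University of Illinois (online MCS in data science)", 32,
            cost_per_credit=600),
]


def ranked_by_tuition(programs: List[Program]) -> List[Program]:
    """Sort programs from lowest to highest estimated tuition."""
    return sorted(programs, key=lambda p: p.estimated_tuition())


if __name__ == "__main__":
    for program in ranked_by_tuition(PROGRAMS):
        print(f"{program.name}: ${program.estimated_tuition():,}")
```

Under these assumptions, the estimates come to $15,000 for Arizona State, $19,200 for Illinois, and $31,340 for Colorado.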
Those looking for brief, targeted education in data engineering can turn to short-term courses. While not a requirement, they do provide hands-on experience and can culminate in a professional certificate. In a way, they’re a sort of hack: they do away with the bloat and offer advanced training at a fraction of the cost and time a more general advanced degree would.
Coursera hosts a series of short courses that make up a specialization in data engineering on Google Cloud Platform. Designed and taught by Google teams, there are five courses in the specialization: Google Cloud Platform big data and machine learning fundamentals; leveraging unstructured data with Cloud Dataproc on Google Cloud Platform; serverless data analysis with Google BigQuery and Cloud Dataflow; serverless machine learning with Tensorflow on Google Cloud Platform; and building resilient streaming systems on Google Cloud Platform.
This intermediate-level program takes approximately one month to complete, with 15 hours of study per week. The cost is $49 per month. While this specialization doesn’t equate to Google certification (see step five below), it does give students a solid foundational knowledge which, in combination with work experience, can aid one’s pursuit of official certification later on.
Coursera also hosts a series of short courses in data engineering that make up its big data for data engineers specialization. Offered in partnership with Yandex, a multinational tech corporation operating primarily in Russia, Eastern Europe, and Central Asia, this specialization is designed to upskill working data analysts and programmers. The courses cover the following subjects: big data essentials (HDFS, MapReduce, and Spark RDD); big data analysis (Hive, Spark SQL, DataFrames, and GraphFrames); big data applications (machine learning at scale); big data applications (real-time streaming); and big data services (a capstone project). In total, the specialization takes approximately eight months to complete, with nine hours of study per week. It costs $49 per month.
In a young and dynamic discipline like data engineering, professional certification offers perhaps the most concrete way to verify one’s skills and capabilities. Built by and for working data engineers, these certifications measure candidates against standards agreed upon within the data engineering community. And while academic institutions are notoriously slow-moving, today’s tech giants are surprisingly nimble, and certifications from industry players can carry great weight with employers as proof of a prospective employee’s talent.
One such certification is the Google Cloud Certified Professional Data Engineer, which has no prerequisites for eligibility. Earning this certification simply requires passing a two-hour, in-person, multiple-choice exam. The exam is broadly split into seven sections: designing data processing systems; building and maintaining data structures and databases; analyzing data and enabling machine learning; modeling business processes for analysis and optimization; ensuring reliability; visualizing data and advocating policy; and designing for security and compliance. Google offers both instructor-led and on-demand training for the exam. Certification is valid for two years, after which holders must recertify. The registration fee is $200.
Another certification is the Cloudera Certified Professional (CCP) Data Engineer, which assesses one’s competencies with Cloudera’s Distribution including Apache Hadoop (CDH). The four-hour exam covers the following areas: data ingestion; transforming, staging, and storing; data analysis; and workflow. Applicants are given five to ten customer problems, a large and unique data set, and a CDH cluster. Applicants must select their own tools and implement a technical solution that meets all of the customer’s criteria. The registration fee is $400.
Those who wish to pursue an internationally-recognized, company-agnostic certification can look to the Data Science Council of America (DASCA). The DASCA offers certification both as an Associate Big Data Engineer (ABDE) and a Senior Big Data Engineer (SBDE). To apply for the ABDE, one needs only a bachelor’s degree in computer science or a related field. An applicant for the SBDE needs either a bachelor’s degree and two years of work experience or a master’s degree and one year of work experience.
To become certified, applicants for either certification will need to pass an exam based on the DASCA Essential Knowledge Framework. Both exams cover the following areas: introduction to data science and big data; storing and processing data in Hadoop; decoding Sqoop and Flume; YARN, Hive, and Pig; decoding machine learning; big data analytics and R; integrating R and Hadoop; social media, mobile, and big data solution engineering; big data tools for engineers; and essential Python. Study materials are available on the DASCA website. The registration fee is $520 for the ABDE and $575 for the SBDE.
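The DASCA eligibility rules and fees described above can be summarized in a few lines of Python. This is only a sketch of the rules as stated in this guide: it assumes the degree is in computer science or a related field, and that a master’s degree holder also satisfies the bachelor’s requirement for the ABDE. The function name and the `"bachelor"` / `"master"` labels are hypothetical; DASCA’s own requirements are the authority.

```python
from typing import List

# Registration fees in dollars, as quoted in this guide.
FEES = {"ABDE": 520, "SBDE": 575}


def eligible_certifications(highest_degree: str, years_experience: float) -> List[str]:
    """Return the DASCA certifications a candidate may apply for.

    highest_degree: "bachelor" or "master" (assumed to be in computer
    science or a related field); anything else is treated as ineligible.
    """
    degree = highest_degree.lower()
    if degree not in ("bachelor", "master"):
        return []

    # ABDE needs only a bachelor's degree (assumed to be implied by a master's).
    eligible = ["ABDE"]

    # SBDE: bachelor's plus two years of work, or master's plus one year.
    required_years = 2 if degree == "bachelor" else 1
    if years_experience >= required_years:
        eligible.append("SBDE")
    return eligible


if __name__ == "__main__":
    for cert in eligible_certifications("master", 1):
        print(f"{cert}: ${FEES[cert]} registration fee")
```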
Data engineers need to be resourceful sleuths who grab insights and tools from wherever they can. As always, the data is out there, and it just needs to be wrangled. If you want to get an idea of what’s available and what’s being talked about in data engineering today, check out some of the following resources: