Just completed Udacity's Data Engineering Nanodegree!
So, this happened recently:
Having previously completed the Machine Learning Engineer Nanodegree back in 2018, I’ve now completed 2 nanodegrees from Udacity.
Thanks again to my employer Airteam for sponsoring this, and continually investing in my education and professional development.
I want to talk a little bit about my experiences with the course, and what I’ve learned.
First, let’s talk a bit about how the course is structured.
I was part of the first cohort to sign up for this course, which at the time still used a term-based structure: there was a fixed term of about 5 months, and if you didn’t complete all the projects by the end, you didn’t get the certificate (which is probably useless to me anyway).
Udacity has since converted all Nanodegrees to subscription-based courses, where you pay a certain amount of money each month, and can take as long as you’d like.
There are 5 modules, and each one requires you to submit one or two guided projects. Your submissions are reviewed by contractors hired by Udacity.
The projects are mostly done in the browser, in either a Jupyter workspace or a customised workspace running on what I believe is a Docker container.
The 5 modules are:
1. Data Modeling
This module is an introduction to databases, and covers Postgres and Apache Cassandra. Basically one relational database, and one NoSQL database. There was a basic project for each database, so two projects in total.
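Since I can’t assume you have Postgres or Cassandra handy, here’s a rough sketch of the relational-modeling side of that module using Python’s built-in SQLite in place of Postgres, so the snippet is self-contained. The schema and names are my own illustration, not the actual project’s:

```python
import sqlite3

# In-memory database standing in for Postgres (hypothetical schema for illustration).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalised relational modeling: artists and songs live in separate tables,
# linked by a foreign key.
cur.execute("""CREATE TABLE artists (
    artist_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
)""")
cur.execute("""CREATE TABLE songs (
    song_id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    artist_id INTEGER REFERENCES artists (artist_id),
    duration REAL
)""")

cur.execute("INSERT INTO artists VALUES (1, 'The Beatles')")
cur.execute("INSERT INTO songs VALUES (1, 'Let It Be', 1, 243.0)")

# A join answers "which artist recorded which song" -- the kind of ad-hoc
# question relational modeling handles well. Cassandra flips this around:
# you design one table per query, denormalising instead of joining.
cur.execute("""SELECT a.name, s.title
               FROM songs s JOIN artists a ON s.artist_id = a.artist_id""")
print(cur.fetchall())  # [('The Beatles', 'Let It Be')]
```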
2. Cloud Data Warehouses
An introduction to data warehousing, specifically on AWS. It talked about why transactional and analytical databases require different designs. The specific technology used for the data warehouse is Amazon Redshift. In the project, we did ETL from a data source to create a star schema database in Redshift.
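To make the star-schema idea concrete, here’s a minimal sketch, again using SQLite so it runs anywhere; the table names are illustrative and not the project’s exact schema. Descriptive attributes go in dimension tables, events go in a central fact table keyed into them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables hold descriptive attributes (who, what)...
cur.execute("CREATE TABLE dim_users (user_id INTEGER PRIMARY KEY, level TEXT)")
cur.execute("CREATE TABLE dim_songs (song_id INTEGER PRIMARY KEY, title TEXT)")

# ...and the fact table records events, keyed into the dimensions.
cur.execute("""CREATE TABLE fact_songplays (
    songplay_id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES dim_users (user_id),
    song_id INTEGER REFERENCES dim_songs (song_id)
)""")

cur.executemany("INSERT INTO dim_users VALUES (?, ?)",
                [(1, "free"), (2, "paid")])
cur.executemany("INSERT INTO dim_songs VALUES (?, ?)",
                [(10, "Song A"), (11, "Song B")])
cur.executemany("INSERT INTO fact_songplays VALUES (?, ?, ?)",
                [(100, 1, 10), (101, 2, 10), (102, 2, 11)])

# Analytical queries are one join away: plays per subscription level.
cur.execute("""SELECT u.level, COUNT(*) FROM fact_songplays f
               JOIN dim_users u ON f.user_id = u.user_id
               GROUP BY u.level ORDER BY u.level""")
print(cur.fetchall())  # [('free', 1), ('paid', 2)]
```

The point of the shape is that analysts can aggregate the fact table and slice by any dimension without navigating a deeply normalised transactional schema.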
3. Spark and Data Lakes
As the name probably tells you, it’s an introduction to Apache Spark, and also how to build a Data Lake using Spark. The project uses the same schema as the previous module.
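The core idea of a partitioned data lake can be sketched without Spark at all: records get written into a directory tree keyed by partition columns, which is roughly what Spark’s `DataFrameWriter.partitionBy` does when writing out a lake. A toy stdlib version, with made-up records:

```python
import json
import os
import tempfile
from collections import defaultdict

# Toy event log standing in for the course's song-play data (made-up records).
events = [
    {"song": "Song A", "year": 2018, "month": 11},
    {"song": "Song B", "year": 2018, "month": 11},
    {"song": "Song C", "year": 2019, "month": 1},
]

lake_root = tempfile.mkdtemp()

# Group records by their partition columns, as
# df.write.partitionBy("year", "month") would.
partitions = defaultdict(list)
for ev in events:
    partitions[(ev["year"], ev["month"])].append(ev)

# One directory per partition value, Hive-style (year=2018/month=11/...),
# so a query engine can skip partitions it doesn't need.
for (year, month), recs in partitions.items():
    part_dir = os.path.join(lake_root, f"year={year}", f"month={month}")
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "part-0000.json"), "w") as f:
        for rec in recs:
            f.write(json.dumps(rec) + "\n")

print(sorted(os.listdir(lake_root)))  # ['year=2018', 'year=2019']
```

In the actual module you’d do this with Spark DataFrames writing Parquet, but the layout on disk is the same idea.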
4. Data Pipelines with Airflow
Basically, an introduction to Apache Airflow and how you can use it to create data pipelines. You rebuild the cloud data warehouse project using Airflow, with some minor changes to leverage its features.
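Airflow’s core idea, expressing a pipeline as a DAG of tasks and running them in dependency order, can be sketched with the standard library’s `graphlib`. This is not Airflow’s API, just the concept; the task names are a hypothetical pipeline shaped roughly like the course project:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

log = []

def make_task(name):
    # Each "task" is just a function appending its name; Airflow would
    # wrap real work like this in operators (PostgresOperator, etc.).
    def task():
        log.append(name)
    return task

tasks = {name: make_task(name) for name in
         ["stage_events", "stage_songs", "load_fact", "load_dims", "quality_check"]}

# Dependencies: each key runs only after everything in its value set,
# mirroring Airflow's `upstream >> downstream` wiring.
deps = {
    "load_fact": {"stage_events", "stage_songs"},
    "load_dims": {"load_fact"},
    "quality_check": {"load_dims"},
}

# Execute in a valid topological order: staging first, checks last.
for name in TopologicalSorter(deps).static_order():
    tasks[name]()

print(log)
```

Airflow adds scheduling, retries, backfills, and a UI on top of this, but the mental model of the pipeline is exactly this dependency graph.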
5. Capstone Project
In this module, you do an open-ended project using either Udacity-provided data sources, or you could find your own datasets. You complete a writeup, and submit it.
Student and peer experience
For my cohort, we were invited into a shared Slack workspace with other students and our assigned mentors. Each mentor is assigned to a small group of students.
Unfortunately, they’ve since moved newer students to Student Hub, which is essentially a worse version of Slack.
Quality of video lectures
Just for context, I’d previously done the machine learning nanodegree as mentioned, and in that nanodegree I experienced some of the best and clearest instruction I’ve ever received in my life through the video lectures.
Unfortunately, the Data Engineering Nanodegree’s lectures were a bit of a letdown this time, and in my opinion not of Udacity’s usual quality (at least not of the quality of the machine learning nanodegree). The lectures were not very polished, had very little post-editing, and were clearly unrehearsed.
The instructors who appear in the videos seem not to be Udacity employees, but rather external “subject matter experts” contracted to record videos on the course content.
That in itself is not usually a problem, and Udacity has done that successfully in the past.
The problem is that the videos didn’t appear to have gone through much QA after being recorded, and little preparation seemed to go into recording them.
A specific example: there are numerous instances in the videos where an instructor stumbles over a sentence and repeats it, probably expecting the stumble to be edited out later.
The worst parts were the data warehousing and data lakes section, where the instructor appeared to be thinking of what to say on the fly. There were numerous pauses, long “uhhhs” and “ummms” which got REALLY annoying after a while because once you notice it you can’t unhear it. I had to put the videos on 2x speed, and even then it was barely tolerable.
If I might offer a suggestion to improve the video lectures: It would be MUCH better if course instructors were to prepare a script, and to read from that script in the videos. This isn’t theatre or acting class – it’s fine to read from a script, and is far better than just winging it during recording.
It wasn’t uniformly bad though: the Airflow and Spark sections were of the usual polished quality, in my opinion. Judging by the absence of “uhhs”, “umms”, and stumbling, those lectures were likely either rehearsed or read from a prepared script, and any mistakes were edited out.
Quality of course content
The course has a substantial practical bent, rather than focusing on theory and motivation. The course is an excellent choice if you want to learn “how” to do things like ETL and Data Warehousing, on cloud providers like AWS.
I felt it was probably slightly thin on the “why” side of some things. So if you don’t have any experience at all working with databases, it’s possible you might get a little bit confused during the course.
Specifically, I wished there was more treatment on what the consumers of data engineering (like BI analysts or data scientists) expect and how they use the solutions we build.
Personally, that was fine for me, and you will probably get the most value out of this course if you already have a background working with databases.
For a deeper treatment of this, it’s probably better to read from a book anyway. I think one of the instructors recommended one from Kimball and Ross, which I’m planning to get.
Quality of the projects
All of the projects (except the capstone) were based on the same problem domain (a song streaming startup), with the same data, using the same schema.
So it’s just a matter of doing the same thing, using different tools. The difficulty of that is highly dependent on how good you are at learning a new API.
As an experienced developer, once I understood what was expected in the project, I could finish the project in under 2 hours.
But I think most people on the course Slack agreed that the project instructions were quite confusing; I was often confused by some of them myself.
The project reviews were sometimes helpful, but often pedantic, and your submission could get rejected for things like forgetting to delete a code comment in a Jupyter notebook cell.
For context, the provided workspaces contained comments marking where you’re meant to put in code, e.g. # TODO: complete section here. Deleting those comments did absolutely nothing to help me understand data engineering, but your submission gets rejected if you don’t. It felt like a bit of a waste of time.
Some projects can also take a long time to run due to the amount of data, so if you’re doing them in the Udacity-provided workspace, it can go to sleep before the run completes. My suggestion to avoid that: don’t run the whole dataset. Work on a smaller sample, verify the code works, then point the code back at the full dataset BUT DON’T RUN IT. Just submit it as it is.
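In practice that can be as simple as slicing the input file list while developing. A sketch of the idea (the paths and the per-file step here are hypothetical, not the project’s actual code):

```python
import glob

DEV_MODE = True  # flip to False for the final full-dataset version -- but don't run it!

# Hypothetical layout mirroring the projects' nested JSON data directories.
all_files = sorted(glob.glob("data/song_data/**/*.json", recursive=True))

# Iterate on a small sample; switch back to the full list only for submission.
files = all_files[:50] if DEV_MODE else all_files

for path in files:
    pass  # process_file(path) -- stand-in for the project's per-file ETL step
```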
Quality of mentorship
This is one aspect I can’t really comment on, because I didn’t make much use of it. I didn’t actually want any mentoring, but mentoring was pushed onto me. The mentors were hired from a pool of people who had completed other nanodegrees in the past, which means I received an invitation to apply too, though I didn’t have the time for that.
So there’s a chance you get someone who has done a nanodegree, but isn’t really experienced as a data engineer.
The specific mentor I was assigned to lives in a different timezone and the time difference was quite unfortunate.
At one point I fell behind in the course due to illness. After that, every Sunday night between 1am and 3am, I would get pinged by my mentor asking “how’s progress” (despite my local time being on my Slack profile), along with some basic advice on how to use AWS.
I was stupid enough to forget to turn off Slack notifications, which meant a few sleepy Mondays until I finally asked my mentor to stop, explaining that while I appreciated the advice about AWS, I’d probably be fine because I happen to use AWS for a living.
However, many people on Slack reported that they benefited from their mentor, including the mentor I was assigned to, so your mileage may vary.
There were a few problems with the course, and reviews online (especially on Reddit) haven’t been very kind. It’s true that the course felt rushed, and it shows. There were also issues like the content not being complete at launch, with course material added later (which didn’t affect me because I was slow anyway).
But I did learn a few things, and that’s what matters in the end.
Would I recommend the course? When I enrolled, the price of the course was $1300. That’s less than half the current price.
At the current price point of $2669 for 5 months access (and a monthly fee after), I personally wouldn’t recommend it, unless you happen to have a lot of money. It’s extremely expensive for what you get.
You’re definitely not going to get a job in data engineering with only this nanodegree, though it’s marketed as a pathway to a data engineering career. Sorry, but it’s not. It is, however, a good complement if you already have relevant work experience, ideally with cloud platforms like AWS and some database experience.
Although I learned a lot from their courses, I doubt I will do another nanodegree, unless I suddenly start earning a lot more money (in which case I’ll attempt the self driving car one).