The (Amateur) Data Science Body of Knowledge
Published Apr 14 2020 05:14 AM
Microsoft

There is no shortage of Internet posts, magazine articles, or college syllabi describing what a "D... You might think the term "Data Scientist" is still up for debate - but there are "real" Data Scientists who have formal degrees and years of experience with that official title.

 

You *can* learn what you need to know about Data Science without a full degree. That being said, the full degree is important - it is really hard to gain the required depth and experience without it.

 

The term "Data Engineer" now represents what many call "Everything but the Algorithm", meaning all of the work required to bring a Data Science project to completion. Data Engineers work with the Data Scientist to source accurate data, create ingress pipelines, condition the data using Feature Engineering techniques, help create, version, and store the model, and then operationalize that model, all with an eye on retraining it. In many organizations, a Data Engineer pairs up with a Data Scientist on large projects - and in some projects I've worked on, there are multiple Data Engineers for a single Data Scientist.

 

In this article, I'll help you find resources for whichever path you choose for yourself. At the very least, you'll gain valuable insight into the Data Science field, and into how you can use the technologies and knowledge to create a very compelling solution.

 

NOTE: There's an absolutely wonderful visual representation of what a Data Scientist should know that you can find here:  http://nirvacana.com/thoughts/becoming-a-data-scientist/ by Swami Chandrasekaran, and I would encourage you to look over his work. What I show here is independent of that grouping, but similar. Of course, he's using several tools from IBM and I'm using the ones at Microsoft. Pick your stack and learn it well. Want to use Open Source only? Knock yourself out.

 

ALSO NOTE: I have never liked a "tools approach" to learning. Yes, you'll need to learn several tools, and yes, I often use a tool to learn a thing (like using R to learn Statistics), but I focus on the concepts, not just how the work is done. First learn why you do something, and then how. So learning concepts first and then choosing a tool is the route I'll follow here.

 

Or, you can simply follow a complete course, online. There are several really good ones:

Among many, many others. See the comments below as well for even more.

 

Of course, if you want to "assemble your own"....

Foundation and Empire

Before you even get started learning all the new cool toys, you need a foundation (Asimov is my hero). Again, there are a lot of schools of thought on this, and certainly more knowledge is helpful. That being said, here is the barest of minimums for this discipline:

Statistics

Statistics is the life-blood of the Data Scientist. Whether it's something you already know well or something you have to learn, it's required. You can start with simple courses and move up, but Classification and Prediction are the two areas you want to focus on. I use a combination of books, online courses (The Khan Academy is one of my favorites), and even The Manga Guide to Statistics: http://www.amazon.com/Manga-Guide-Statistics-Shin-Takahashi/dp/1593271891. A quick web search will show you many free statistics courses, at almost any level from beginner to advanced.
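To make "Prediction" a little more concrete, here is a minimal sketch of a least-squares line fit using only the Python standard library. The function name fit_line and the hours/scores numbers are made up for illustration; a real project would normally lean on a statistics package rather than hand-rolled code.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope is the co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data: hours studied vs. test score (invented for this sketch).
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

intercept, slope = fit_line(hours, scores)
predicted = intercept + slope * 6  # predicted score for 6 hours of study
print(intercept, slope, predicted)
```

The fitted line is the "reusable formula": once you have the intercept and slope, you can plug in values you have not seen yet.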

Linear Algebra

Along with statistics, you'll need to have at least a high-school understanding of Linear Algebra. Many Machine Learning models assume you have this knowledge. Once again, there are lots of public library books, online courses, and more.
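As a rough illustration of why this matters, many models boil down to multiplying a table of features by a list of weights. The sketch below does that in plain Python; the names dot and mat_vec, and the feature and weight values, are assumptions made up for this example.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))

def mat_vec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]

# Hypothetical example: each row is one observation with two features,
# and weights holds one coefficient per feature.
features = [[1, 2], [3, 4], [5, 6]]
weights = [10, 1]
predictions = mat_vec(features, weights)
print(predictions)  # [12, 34, 56]
```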

Logic

Understanding formal Logic is important for the Data Scientist. Focus on Predicate, Mathematical, and Computational Logic as a minimum: http://www.logicmatters.net/tyl/.

Visualization

You will need to present your findings at some point, and even to explore them you need to understand how to represent data in a graphical fashion. I use The Wall Street Journal Guide to Information Graphics, but there are many other books and sites dedicated to visualizing information.

Programming

Believe it or not, you don't even have to use a computer to learn to program - although normally grabbing a language (Python or R are best for the Data Scientist) is a good way to use a tool to learn the topics. You don't need a full Computer Science (CS) degree, but if you do follow a syllabus for one you'll also get algorithms and other skills that will be helpful: http://spin.atomicobject.com/2015/05/15/obtaining-thorough-cs-background-online/

Data is Plural

Another base skill you need is working with data. This might sound obvious, but most of the time the obvious things aren't. It's important to learn more about at least these topics:

  • Data types
  • Data sources
  • Data Interpretation
  • Data Ingress
  • Transforms and rollups

(You'll get pointers to where you can learn more in the sections that follow)
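To give a feel for a few of the topics above (data types, ingress, transforms and rollups), here is a small sketch using only the Python standard library. The RAW text, the column names, and the rollup_revenue function are all hypothetical; real data would arrive from a file, a database, or a service.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw data; in practice this would come from a file or database.
RAW = """region,units,price
east,3,2.50
west,1,4.00
east,2,2.50
"""

def rollup_revenue(text):
    """Parse CSV text, convert types, and total revenue per region."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(text)):
        units = int(row["units"])      # data type: text -> integer
        price = float(row["price"])    # data type: text -> float
        totals[row["region"]] += units * price  # transform, then roll up
    return dict(totals)

print(rollup_revenue(RAW))
```

Everything read from a CSV starts life as text, which is exactly the kind of "obvious" detail that bites you later if you skip the type conversion.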

Getting down to Business

Along the same lines as the foreknowledge you need for starting with the Data Science tools, you need to know a great deal about various industries that use data analytics. While every business or organization benefits from a good application of Data Science, some use data analytics in a larger, deeper way. It's good to immerse yourself in some of the deeper knowledge about:

 

  • Healthcare
  • Physical sciences
  • Manufacturing
  • Government systems
  • Marketing

That is by no means an exhaustive list - far from it - but learn about how these types of organizations rely on data. In my career I've worked in all of these, and many more.

 

The key is that you can learn where you are, right now. Get involved in how your organization works, and how they do business - not just IT. Find out the hard problems, and join the teams that are solving them. Be in the moment of your current role, and work with any executive that will give you time. Also, couldn't hurt to read the Portable MBA: http://www.powells.com/biblio/9780471119845.

Tool Time

OK, at some point, you get to play with the new toys. While you can start with tools, it's a bad idea. Start with the foundations, then pick the right toolset for solving the problems. You'll get less attached to the tools that way and more attached to success.

Excel

Don't make that face. You need to know Excel. Not just how to create a workbook, but learn to milk it, make it dance, make it walk on all four legs. Start in Excel, get pushed out. The reasons for using this tool in data science are that it has much of what you need already, your users know it, and it will help you explore the problem and visualize it in new ways. I use this gem of a resource: https://www.microsoft.com/learning/en-us/book.aspx?ID=17313

SQL 

Structured data is a major source of intelligence, so you need to learn to work with it. Microsoft's SQL Server is a good platform to learn on because it just keeps adding more things to the box to work with structured data. It's fast, handles large datasets, is well used, and it plays well with everything else. Learn more here: http://www.microsoftvirtualacademy.com/product-training/sql-server and here: https://www.microsoft.com/learning/en-us/sql-training.aspx
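If you want to try the ideas before installing anything, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in (it is not SQL Server, and dialects differ in places). The orders table, its columns, and the top_customers function are invented for this example.

```python
import sqlite3

def top_customers(rows):
    """Load (customer, amount) rows and return total spend per customer."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT customer, SUM(amount) AS total "
            "FROM orders GROUP BY customer ORDER BY total DESC"
        )
        return cursor.fetchall()
    finally:
        conn.close()

# Hypothetical order data, made up for illustration.
orders = [("ann", 20.0), ("bob", 5.0), ("ann", 10.0)]
print(top_customers(orders))
```

The GROUP BY and ORDER BY clauses shown here are standard SQL, so the same query shape should carry over to other relational platforms.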

Business Intelligence

Built into the SQL Server product is the Business Intelligence suite called Analysis Services. This will help you explore historical data and do data mining: https://www.microsoftvirtualacademy.com/en-us/training-courses/designing-bi-solutions-with-microsoft...

Visualization

You'll want to publish your data so that users can consume it, using a data visualization tool both of you understand. In addition to Excel, Power BI fits that bill: https://www.microsoftvirtualacademy.com/en-us/training-courses/faster-insights-to-data-with-power-bi...

R (and/or Python)

You'll need a way to handle statistical programming. The two most common ways of doing that are R and Python. As of this writing, either is useful. While there are thousands of resources on these topics, you can start here:

Hadoop

Hadoop is an ecosystem, not just a processing system. It's used in many data processing systems, including many we use at Microsoft. Microsoft's release is called HDInsight, and you can learn more about it here: http://azure.microsoft.com/en-us/documentation/services/hdinsight/

Machine Learning

Machine learning is the way that you can take sets of data and extrapolate reusable formulas for prediction and classification. I use AzureML for that. You get a free account and learning environment, and you can learn more about it here: http://azure.microsoft.com/en-us/documentation/services/machine-learning/
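To show what a "reusable formula" for classification can look like, here is a toy nearest-centroid classifier in plain Python. This is not how AzureML works internally; it is just one simple technique, and the train and classify names and the training points are assumptions made up for this sketch.

```python
from collections import defaultdict
from math import dist

def train(samples):
    """Learn one centroid (mean point) per label from (point, label) pairs."""
    grouped = defaultdict(list)
    for point, label in samples:
        grouped[label].append(point)
    return {
        label: tuple(sum(axis) / len(points) for axis in zip(*points))
        for label, points in grouped.items()
    }

def classify(model, point):
    """Return the label whose centroid is closest to the point."""
    return min(model, key=lambda label: dist(model[label], point))

# Hypothetical training data: (width, height) measurements with a label.
training = [
    ((1, 1), "small"), ((2, 1), "small"),
    ((8, 9), "large"), ((9, 8), "large"),
]
model = train(training)
print(classify(model, (2, 2)))
print(classify(model, (7, 9)))
```

The model here is nothing more than two averaged points, which is the whole idea: learn something compact from the data, then reuse it on new inputs.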

Streaming Analytics

Sometimes you need to act on the data as it arrives, especially in manufacturing and healthcare. You can use Storm or Azure's Stream Analytics (or both) for that: http://azure.microsoft.com/en-us/documentation/services/stream-analytics/
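The core idea of acting on data as it arrives can be sketched without any streaming platform at all. The example below is plain Python, not Storm or Azure Stream Analytics; the alerts function, the window size, the threshold, and the sensor readings are all hypothetical.

```python
from collections import deque

def alerts(readings, window=3, threshold=100.0):
    """Yield (index, rolling_mean) whenever the rolling mean exceeds threshold."""
    recent = deque(maxlen=window)  # keeps only the latest `window` values
    for index, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window:
            rolling_mean = sum(recent) / window
            if rolling_mean > threshold:
                yield index, rolling_mean

# Hypothetical sensor readings arriving one at a time.
sensor = [90, 95, 100, 110, 120, 80]
for index, rolling_mean in alerts(sensor):
    print(index, round(rolling_mean, 2))
```

Because alerts is a generator, it reacts to each reading as it shows up instead of waiting for the full dataset, which is the property the streaming tools give you at scale.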

Cloud Analytics

Microsoft has multiple ways for you to work with Data Science, from a complete Team Data Science Process, to multiple on-premises and in-cloud tools. You can find more about those here: https://www.microsoft.com/en-us/ai/ai-platform

Last update: Apr 14 2020 05:21 AM