Must Love Data Mondays: Week 1 of #uwdatasci Introduction to Data Science

Week 1 is a simple introduction to the course and to the topic of data science. The week consists of ten videos built around those ideas. The final video walks you through the first assignment, which does some basic Twitter analysis using the Twitter API. I’ll write about the assignment in next week’s post. Two of the videos whet your appetite by introducing you to some of the exciting things being done in the field. The others provide a brief course overview and touch on broad topics such as eScience and Big Data. The course syllabus lists about 15 articles to read during the week, but the majority of them deal with examples from the video lectures. I’ll touch on the ones that weren’t listed as examples.

Drew Conway’s Venn Diagram

The famous diagram breaks down the field of data science into three main categories of knowledge (Hacking Skills, Math & Stats, and Substance). Combining only two of them is not data science, and one pairing, hacking skills plus substantive expertise without the math and statistics, is labeled the “Danger Zone.” You need to acquire all three to participate in this field.

Mike Loukides, What is data science?, O’Reilly Radar, 2010

I think Hal Varian’s quote at the end of the article sums it up best: “The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that’s going to be a hugely important skill in the next decade.”

Mike Driscoll, “The Seven Secrets of Successful Data Scientists”

The Seven Secrets along with my thoughts:

  1. Choose The Right-Sized Tool: Sometimes Excel is the best tool for the job and other times you need Hadoop.
  2. Compress Everything: It saves you time and money, but do watch your file sizes.  It’s a good idea to split up large files.
  3. Split Up Your Data: I like this quote, “The real world is partitioned – whether as zip codes, states, hours, or top-level web domains – and your data should be too.” Follow Hadley Wickham’s design pattern: Split, Apply, Combine.
  4. Sample Your Data: Instead of wasting your time, try sampling your data to figure out your approach and whether you need to make any adjustments before running the full workload.
  5. Smart Borrows, But Genius Uses Open Source: There is no reason to reinvent the wheel. Check the open source community to see if someone has already built what you’re thinking of building.
  6. Keep Your Head in the Cloud: Quote to live by: “If you want to compute locally, pull down a sample. But if your data is in the cloud, that’s where your tools and code should be.”
  7. Don’t Be Clever: “Cleverness doesn’t scale.” Follow this rule: KISS.
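
Secrets 3 and 4 are easy to try out in a few lines of Python. This is only a minimal sketch of the Split, Apply, Combine idea plus sampling, using nothing but the standard library. The order data and the function names (split_apply_combine, sample_rows) are made up for illustration and do not come from the article.

```python
import random
from collections import defaultdict
from statistics import mean

# Hypothetical records, invented for illustration: (state, order_total).
ORDERS = [
    ("WA", 20.0),
    ("WA", 40.0),
    ("OR", 10.0),
    ("CA", 30.0),
    ("CA", 50.0),
    ("CA", 70.0),
]


def split_apply_combine(rows, key, value, apply):
    """Split rows by key, apply a function to each group, combine into a dict."""
    groups = defaultdict(list)
    for row in rows:
        # Split: route each row's value into the bucket for its partition key.
        groups[key(row)].append(value(row))
    # Apply the function to every group and combine the results.
    return {k: apply(v) for k, v in groups.items()}


def sample_rows(rows, fraction, seed=0):
    """Return a reproducible random sample holding roughly `fraction` of rows."""
    k = max(1, round(len(rows) * fraction))
    return random.Random(seed).sample(rows, k)


if __name__ == "__main__":
    by_state = lambda row: row[0]
    total = lambda row: row[1]

    # Try the approach on a sample first, then run the full workload.
    trial = sample_rows(ORDERS, fraction=0.5)
    print("sample:", split_apply_combine(trial, by_state, total, mean))
    print("full:  ", split_apply_combine(ORDERS, by_state, total, mean))
```

On real data you would more likely reach for an existing open source tool (secret 5) than hand-roll this, but the shape of the work is the same: partition, compute per partition, stitch the results back together.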

Origins of “Volume, Velocity, Variety”

The three Vs are a framework for understanding and dealing with Big Data. The definitions from the article:

  1. Volume: Increased depth and breadth of data available about any point of interaction.
  2. Velocity: Increased point-of-interaction (POI) speed, and consequently the pace of data used to support interactions and generated by interactions.
  3. Variety: The biggest barrier to effective data management will be the variety of incompatible data formats, non-aligned data structures, and inconsistent data semantics.

eScience: The Fourth Paradigm (Foreword and Introduction, pages xi – xxxi; Gray’s Laws, pages 5-12)

Reading this today.

Chris Anderson, “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”, Wired magazine, 2008

This article tries to answer this question: Is the scientific approach becoming obsolete? Another important question: What can we learn from Google? They didn’t know anything about advertising but were able to succeed with applied math. They figured more data and better analytics would win.  What other areas can this strategy change?

Responses to Chris Anderson, 2008

A curated group of responses arguing against Chris Anderson’s article listed above.  The responses are all trying to say the same thing: scientific theory is not dead.  Models and statistics will assist science, not remove the human element from the equation.

I also read the first chapter of Doing Data Science – Introduction: What is Data Science? A free sample of the first chapter is available from this link.

The book is based on the data science course that Rachel Schutt, a data scientist at Google at the time, created and taught at Columbia University. Cathy O’Neil blogged about her experience taking the course, and the book draws on those blog posts.  It tries to answer what data science is, who data scientists are, and what kind of work they do.