Must Love Data Mondays: Week 2 of #uwdatasci Introduction to Data Science – Relational Databases, Relational Algebra

Week 2 of the University of Washington’s Introduction to Data Science deals with Relational Databases and Relational Algebra.  If you’ve taken a database course then you’ll be familiar with the topics discussed in the video lectures and readings.  I work with SQL and databases on a daily basis so this will be the easiest section for me. The video lectures give a quick explanation of databases and then focus on Relational Algebra.

What is Relational Algebra?

Relational Algebra is the math behind the relational database.  The key idea from the Relational Algebra Introduction lecture is that programs that manipulate tabular data exhibit an algebraic structure, allowing reasoning and manipulation independently of the physical data representation.

Think of databases as the algebra of tables.

The following group of video lectures covers different topics in Relational Algebra such as Union, Diff, Select, Product, and Joins.
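As a rough sketch of how those operators map onto SQL, here is a toy example using Python’s built-in `sqlite3` module (the tables `r` and `s` and their contents are my own invention, not from the course):

```python
import sqlite3

# Two small relations with the same schema, so set operators apply.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE r(a INTEGER, b TEXT);
    CREATE TABLE s(a INTEGER, b TEXT);
    INSERT INTO r VALUES (1, 'x'), (2, 'y');
    INSERT INTO s VALUES (2, 'y'), (3, 'z');
""")

# Union: every tuple in r or s, duplicates removed.
union = conn.execute("SELECT * FROM r UNION SELECT * FROM s").fetchall()

# Diff: tuples in r that are not in s.
diff = conn.execute("SELECT * FROM r EXCEPT SELECT * FROM s").fetchall()

# Select: keep only rows satisfying a predicate.
select = conn.execute("SELECT * FROM r WHERE a > 1").fetchall()

# Product: every pairing of a row from r with a row from s.
product = conn.execute("SELECT * FROM r CROSS JOIN s").fetchall()

# Join: the product restricted to rows that match on column a.
join = conn.execute("SELECT * FROM r JOIN s ON r.a = s.a").fetchall()

print(union, diff, select, product, join)
```

Each query is just one algebraic operator applied to whole tables, which is exactly the “algebra of tables” idea from the lectures.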

SQL for Data Science

After the Relational Algebra lectures, there are two brief lectures on SQL for Data Science.  They are a SQL crash course, but Bill does a good job of teaching you how to read a complicated query, which I think is the real difficulty of SQL.  SQL is an easy language in which to pick up the basics, but many people I know are thrown off once they start working with complicated queries involving multiple select statements and joins. Bill describes a step-by-step process for breaking down a complicated query into easily digestible chunks.
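In that spirit, here is a minimal sketch of reading a nested query from the inside out (the `orders` table and the queries are my own invented example, not Bill’s):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders(customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('alice', 50.0), ('alice', 150.0), ('bob', 30.0);
""")

# Step 1: read the innermost block on its own -- per-customer totals.
inner = "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
totals = conn.execute(inner).fetchall()

# Step 2: treat the inner result as if it were a table, and read the
# outer query as a simple filter over it.
outer = f"SELECT customer FROM ({inner}) WHERE total > 100"
big_spenders = [row[0] for row in conn.execute(outer)]
print(big_spenders)
```

Running the inner block by itself before wrapping it is the digestible-chunks idea: each layer of nesting is just another small query over the result below it.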

Physical Optimization, Declarative Languages, and Logical Data Independence

The last three lectures deal with Physical Optimization, Declarative Languages, and Logical Data Independence.

  • Physical Optimization deals with executing the same logical expression with different physical algorithms, which impacts processing speed.
  • Declarative Languages only express the conditions that must be met for a result to be an answer, not how to compute the answer.
  • Logical Data Independence deals with the concept of views.  It references Ted Codd’s paper, which states that the purpose of Logical Data Independence is to insulate the application from changes to internal and external data.
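A quick sketch of the views idea using `sqlite3` (the `employees` table, the `staff` view, and the schema change are invented for illustration): the application queries the view, so the underlying table can be reorganized without the application noticing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees(name TEXT, dept TEXT, salary REAL);
    INSERT INTO employees VALUES ('ada', 'eng', 120.0), ('bob', 'ops', 90.0);

    -- The application queries this view, never the base table directly.
    CREATE VIEW staff AS SELECT name, dept FROM employees;
""")

# The application's query, written against the view.
before = conn.execute("SELECT name, dept FROM staff ORDER BY name").fetchall()

# The internal schema changes: the base table is renamed.  Redefining
# the view over the new table keeps the application's query working.
conn.executescript("""
    ALTER TABLE employees RENAME TO employees_v2;
    DROP VIEW staff;
    CREATE VIEW staff AS SELECT name, dept FROM employees_v2;
""")

after = conn.execute("SELECT name, dept FROM staff ORDER BY name").fetchall()
print(before == after)
```

The view is the insulation layer: the storage-level change happened entirely behind it, which is the point of Logical Data Independence.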

Weekly Reading

The two required articles were the following:

How Vertica Was the Star of the Obama Campaign, and Other Revelations

The article talks about how the Obama Data Team, which consisted of 35 software engineers, a 50-person analytics team, and a 100-person digital content team with a budget of around $20mm, used Vertica to target potential voters like never before.  Here are the lessons learned by Mike Conlow, who was the deputy CTO of the Obama 2012 campaign:

  1. Empower product managers.
    Campaigns make too many decisions at the top and never let product managers own their products.  To do product right in business, you have to empower product managers.
  2. Employ a query policeman.
    A safety precaution against inexperienced SQL programmers writing massive queries that could crash the system.  They wanted people to be able to simply search the system and look for answers.
  3. Keep the money machine on at all costs.
    The campaign used two gateways and two different merchants to ensure 100 percent uptime for its financial tools.

E. F. Codd, 1981 Turing Award Lecture, “Relational Database: A Practical Foundation for Productivity” (Think about which arguments from this short piece are still relevant today.)

This article is about the man who created the relational model of data.  Codd’s work inspired a play on a famous quote about God and integers.  The original quote states:

“God made the integers;
all else is the work of man.”
– Leopold Kronecker, 19th Century Mathematician

When discussing database philosophy, it becomes:

“Codd made relations;
all else is the work of man.”
– Raghu Ramakrishnan, DB text book author

The last article was carried over from the Week One readings:

eScience: The Fourth Paradigm (Foreword and Introduction, pages xi – xxxi; Gray’s Laws, pages 5-12)

Gray’s Laws:

  1. Scientific computing is becoming increasingly data intensive.
  2. The solution is in a “scale-out” architecture.
  3. Bring computations to the data, rather than data to the computations.
  4. Start the design with the “20 queries.”
  5. Go from “working to working.”

The main point of Gray’s Laws is that they represent an excellent set of guiding principles for designing the data-intensive systems of the future.

This week I’ll be completing the Database Assignment: Simple In-Database Text Analytics and working my way through the week three video lectures and readings on MapReduce.