SQL Introduction (DuckDB) #

Collecting big data is no art these days. What is challenging, however, is the analysis. The more data you have, the more difficult it becomes to apply traditional approaches. Many methods assume that the data will fit in memory all at once. Even though memory sizes have increased greatly, however, data volumes have grown much faster. The mere purchase of a more powerful PC is therefore no longer sufficient in most cases and the analysis must be distributed over several systems.

For the most part, there are countless ways to solve a specific problem in computer science. Among other things, you can choose different programming languages, operating systems and algorithms. However, Structured Query Language (SQL) is by far the best known language for querying, managing and manipulating data in databases. Conseqeuntly, most relational databases support SQL. Especially for the analytical processing of large amounts of data, specific database management systems have been established in recent years, often reffered to Online Analytical Procressing (OLAP).

Expressions: Values, Operators and Functions #

Expressions are the elementar building-blocks of SQL-Queries; most SQL-Clauses (e.g., SELECT, WHERE) use them as arguments. An expression is either a value (e.g., 3, 2.3, ‘hello world’ or column_name), an operator (e.g., +, -, /), or a function (e.g., my_function()).

Some operations, and most functions, work only on specific data types. For example, there are a number of operators and functions specifically for dates and strings.

Data types #

Although SQL is primarily designed for processing data, the language only a few data types. Depending on the dialect, they are mostly integers (e.g. 42), floating point numbers (e.g. 3.1242), truth values (0 or 1), and strings (e.g. “Hello World”). Our database, DuckDB, also supports types for date and time. The data type specifies how operations or operators are to be interpreted. Thus, the division on integers is defined in a different way than for floating point numbers (3/3 returns a different result than 3.0/3).

Some Database management systems also support a special data type to encode null-values (i.e., missing values).

Semantics #

While we usually do not follow too strict standards in our use of language, the computer usually reacts very sensitively to details. For example, 2 and 2. are two different values. The first is an integer and the second is a floating point number.

Another example would be quotation marks. While we often switch between single, double, French, or German quotation marks depending on our mood, they have very different meanings. Since SQL is based on the English language, single or double (but not French or German) are used to mark up strings. A similar design decision can be seen with the comma characters, where the period is used as a separator rather than the single comma.

The SELECT-Statement #

Let’s start with some examples to show the different types of expressions.

  1. An expression could be a value such as ‘happy weekend’ (i.e., string)

Example 1.1: Return “happy weekend”

SELECT 'happy weekend'
  1. An expression can also be a more “complex” combination of other expressions, or the application of operators such as multiplications:

Example 1.2: Return result of mathematical expression

SELECT 5*64.34
  1. Another possibility is functions that transform the input (arguments) and yield an output (result). In this example, we call a function named log with the input 45. The result is calculated and replaces the expression such that the SELECT clause returns it. Most DBMS provide a large set of pre-defined functions, and some (e.g., BigQuery, Clickhouse) allow one to define own functions (user-defined functions). However, in the beginning, knowing how to use them is essential. A function call always starts with its name, followed by parentheses. The input of the functions is called arguments and is listed within the brackets. The number and type of arguments depend on the function at hand. For example, mathematical functions work on numeric data types, and String functions work on strings.

Example 1.3: Return result of function

SELECT log(45)

Filtering based on a condition: The WHERE-clause #

A thousand roads lead to Rome. So, as so often, there are umpteen ways to get the same result. The following select queries all lead to the same result.

Example 1.4: Different but semantically equivalent statements

SELECT followers_count FROM youtube_tweets WHERE followers_count % 2 = 1 LIMIT 10;
SELECT followers_count FROM youtube_tweets WHERE followers_count % 2 != 0 LIMIT 10;
SELECT followers_count FROM youtube_tweets WHERE followers_count % 2 > 0 LIMIT 10;

Example 1.5: Filter by exact match (where-filter)

FROM youtube_tweets
WHERE youtube_url = ''

Sometimes a perfect match is not ideal - Searching for patterns with the Like-Operator #

You can very quickly compare expressions directly with the equal sign. However, the computer takes the task very seriously here and returns true only if the two expressions are identical. For example, a different decimal point for numbers or a different case for upper and lower case letters does not lead to a match. With numbers you can avoid the problem by rounding to the desired decimal place, but with text you use the like operator to match patterns. Like compares the expressions on both sides. If you don’t want the method to be sensitive to case, you can use the iLike operator instead. If only parts are to be matched, % can be included as placeholders.

Example 1.6: Filter based on pattern

select username from youtube_tweets where username like 'martin' limit 10;
select username from youtube_tweets where username like '%martin%' limit 10;
select username from youtube_tweets where username like 'martin%' limit 10;
select username from youtube_tweets where username ilike 'martin%' limit 10;

Example 1.7: How many distinct usernames contain your first name?

I’ll showcase it with my name (i.e., Christian)

 count(distinct username)
FROM youtube_tweets
WHERE username ilike '%christian%'

Combining conditions - Connecting logical expressions #

What are AND/OR? Both are logical operators

Aggregating by groups: The GROUP BY-clause #

Example 1.8: How to get the top 10 users based on their number of postings (i.e., two columns, first the name, second the number of postings)?

 username, count(*) as postings 
FROM youtube_tweets
GROUP BY username
ORDER BY postings DESC

Example 1.9: How to get two columns; one for Justin Bieber and the other for U2 (column name should be artist) and the respective number of tweets for these two in a separate column named num_tweets?

  if(text ilike '%justin bieber%', 'Justin B.', 'U2') as artist, 
  count(text) as num_tweets 
FROM youtube_tweets
WHERE text ilike '%justin bieber%' OR text like '% U2 %' 
GROUP BY artist 

Example 1.10: How many tweets are written on average on the 24th and on other days?

 if(day = '2022-12-24', '24th', 'other') as Type,
 avg(cnt) as avg_tweets_per_day 
    created_at::DATE as day, 
    count(*) as cnt 
   FROM youtube_tweets 
   GROUP BY day

Example 1.11: What is the average number of verified authors/day (in percent & rounded)?

  round(avg(verified::INT)*100, 2) as avg_verified_authors 
   author_id, dayname(created_at::DATE) as weekday, verified
 FROM youtube_tweets
 GROUP BY author_id, weekday, verified
GROUP BY weekday;

Example 1.12: How to get the volume of Tweets, number of likes and average likes per tweet for each saturday

 created_at::DATE as day,
 count(*) as volume, 
 sum(like_count) as likes, 
 likes::FLOAT/volume as avg_likes_per_tweet 
FROM youtube_tweets WHERE dayname(day) = 'Saturday'

Example 1.13: How many distinct days did the authors tweet? sorted by the count.

 COUNT(distinct created_at::DATE) as cnt 
FROM youtube_tweets
GROUP BY username
LIMIT 100;

Example 1.14: How to order not on a column, but on an expression?

FROM youtube_tweets 
ORDER BY like_count - retweet_count