SQL Introduction (DuckDB) #
Collecting big data is no art these days. What is challenging, however, is the analysis. The more data you have, the more difficult it becomes to apply traditional approaches. Many methods assume that the data will fit in memory all at once. Even though memory sizes have increased greatly, however, data volumes have grown much faster. The mere purchase of a more powerful PC is therefore no longer sufficient in most cases and the analysis must be distributed over several systems.
For the most part, there are countless ways to solve a specific problem in computer science. Among other things, you can choose different programming languages, operating systems and algorithms. However, Structured Query Language (SQL) is by far the best known language for querying, managing and manipulating data in databases. Conseqeuntly, most relational databases support SQL. Especially for the analytical processing of large amounts of data, specific database management systems have been established in recent years, often reffered to Online Analytical Procressing (OLAP).
Expressions: Values, Operators and Functions #
Expressions are the elementar building-blocks of SQL-Queries; most SQL-Clauses (e.g., SELECT, WHERE) use them as arguments. An expression is either a value (e.g., 3, 2.3, ‘hello world’ or column_name), an operator (e.g., +, -, /), or a function (e.g., my_function()
).
Some operations, and most functions, work only on specific data types. For example, there are a number of operators and functions specifically for dates and strings.
Data types #
Although SQL is primarily designed for processing data, the language only a few data types. Depending on the dialect, they are mostly integers (e.g. 42), floating point numbers (e.g. 3.1242), truth values (0 or 1), and strings (e.g. “Hello World”). Our database, DuckDB, also supports types for date and time.
The data type specifies how operations or operators are to be interpreted. Thus, the division on integers is defined in a different way than for floating point numbers (3/3
returns a different result than 3.0/3
).
Some Database management systems also support a special data type to encode null-values (i.e., missing values).
Semantics #
While we usually do not follow too strict standards in our use of language, the computer usually reacts very sensitively to details. For example, 2
and 2.
are two different values. The first is an integer and the second is a floating point number.
Another example would be quotation marks. While we often switch between single, double, French, or German quotation marks depending on our mood, they have very different meanings. Since SQL is based on the English language, single or double (but not French or German) are used to mark up strings. A similar design decision can be seen with the comma characters, where the period is used as a separator rather than the single comma.
The SELECT-Statement #
Let’s start with some examples to show the different types of expressions.
- An expression could be a value such as ‘happy weekend’ (i.e., string)
Example 1.1: Return “happy weekend”
SELECT 'happy weekend'
- An expression can also be a more “complex” combination of other expressions, or the application of operators such as multiplications:
Example 1.2: Return result of mathematical expression
SELECT 5*64.34
- Another possibility is functions that transform the input (arguments) and yield an output (result). In this example, we call a function named
log
with the input 45. The result is calculated and replaces the expression such that the SELECT clause returns it. Most DBMS provide a large set of pre-defined functions, and some (e.g., BigQuery, Clickhouse) allow one to define own functions (user-defined functions). However, in the beginning, knowing how to use them is essential. A function call always starts with its name, followed by parentheses. The input of the functions is called arguments and is listed within the brackets. The number and type of arguments depend on the function at hand. For example, mathematical functions work on numeric data types, and String functions work on strings.
Example 1.3: Return result of function
SELECT log(45)
Filtering based on a condition: The WHERE-clause #
A thousand roads lead to Rome. So, as so often, there are umpteen ways to get the same result. The following select queries all lead to the same result.
Example 1.4: Different but semantically equivalent statements
SELECT followers_count FROM youtube_tweets WHERE followers_count % 2 = 1 LIMIT 10;
SELECT followers_count FROM youtube_tweets WHERE followers_count % 2 != 0 LIMIT 10;
SELECT followers_count FROM youtube_tweets WHERE followers_count % 2 > 0 LIMIT 10;
Example 1.5: Filter by exact match (where-filter)
SELECT
youtube_url
FROM youtube_tweets
WHERE youtube_url = 'https://www.youtube.com/watch?v=sSgD19URsyU&feature=youtu.be'
LIMIT 10;
Sometimes a perfect match is not ideal - Searching for patterns with the Like-Operator #
You can very quickly compare expressions directly with the equal sign. However, the computer takes the task very seriously here and returns true only if the two expressions are identical. For example, a different decimal point for numbers or a different case for upper and lower case letters does not lead to a match. With numbers you can avoid the problem by rounding to the desired decimal place, but with text you use the like operator to match patterns. Like compares the expressions on both sides. If you don’t want the method to be sensitive to case, you can use the iLike operator instead. If only parts are to be matched, % can be included as placeholders.
https://duckdb.org/docs/sql/functions/patternmatching
Example 1.6: Filter based on pattern
select username from youtube_tweets where username like 'martin' limit 10;
select username from youtube_tweets where username like '%martin%' limit 10;
select username from youtube_tweets where username like 'martin%' limit 10;
select username from youtube_tweets where username ilike 'martin%' limit 10;
Example 1.7: How many distinct usernames contain your first name?
I’ll showcase it with my name (i.e., Christian)
SELECT
count(distinct username)
FROM youtube_tweets
WHERE username ilike '%christian%'
LIMIT 10;
Combining conditions - Connecting logical expressions #
What are AND/OR? Both are logical operators
Aggregating by groups: The GROUP BY-clause #
Example 1.8: How to get the top 10 users based on their number of postings (i.e., two columns, first the name, second the number of postings)?
SELECT
username, count(*) as postings
FROM youtube_tweets
GROUP BY username
ORDER BY postings DESC
Example 1.9: How to get two columns; one for Justin Bieber and the other for U2 (column name should be artist) and the respective number of tweets for these two in a separate column named num_tweets?
SELECT
if(text ilike '%justin bieber%', 'Justin B.', 'U2') as artist,
count(text) as num_tweets
FROM youtube_tweets
WHERE text ilike '%justin bieber%' OR text like '% U2 %'
GROUP BY artist
LIMIT 10;
Example 1.10: How many tweets are written on average on the 24th and on other days?
SELECT
if(day = '2022-12-24', '24th', 'other') as Type,
avg(cnt) as avg_tweets_per_day
FROM (
SELECT
created_at::DATE as day,
count(*) as cnt
FROM youtube_tweets
GROUP BY day
)
GROUP BY Type;
Example 1.11: What is the average number of verified authors/day (in percent & rounded)?
SELECT
weekday,
round(avg(verified::INT)*100, 2) as avg_verified_authors
FROM (
SELECT
author_id, dayname(created_at::DATE) as weekday, verified
FROM youtube_tweets
GROUP BY author_id, weekday, verified
)
GROUP BY weekday;
Example 1.12: How to get the volume of Tweets, number of likes and average likes per tweet for each saturday
SELECT
created_at::DATE as day,
count(*) as volume,
sum(like_count) as likes,
likes::FLOAT/volume as avg_likes_per_tweet
FROM youtube_tweets WHERE dayname(day) = 'Saturday'
GROUP BY day
LIMIT 10;
Example 1.13: How many distinct days did the authors tweet? sorted by the count.
SELECT
username,
COUNT(distinct created_at::DATE) as cnt
FROM youtube_tweets
GROUP BY username
ORDER BY cnt DESC
LIMIT 100;
Example 1.14: How to order not on a column, but on an expression?
SELECT *
FROM youtube_tweets
ORDER BY like_count - retweet_count
LIMIT 10;