Interviewing a database using SQL

I’m one of the co-founders of Laboratório Hacker de Campinas (aka LHC), a hackerspace located in Campinas, Brazil. We organize several open events that are published in our public calendar.

After more than two years using this calendar, we have now a small dataset about our events that can be interviewed to answer some questions that may help us to plan the use of our space more efficiently.

I wanted to know when (and how) the space is being used and if I can find some pattern that would help us to schedule our future events in a way that will bring more people to visit us.

Before starting the “interview” of our database, I prepared some question that I expect to answer using SQL.

  1. How many events did we have?
  2. What is the month of the year that the most events occurred?
  3. Is there any day of the week when an event is most likely to be happening in space?

Our database

We use Gancio as a tool to manage our calendar (I wrote about it before). Events data is stored in a SQLite database so I downloaded it and started to understand how it is structured.

The important tables to answer our questions are events and places. The first stores all information about each event, the second provide a list of venues where our events happen (the majority are in our hackerspace, but some are online or in a different venue).

A simplified definition of these tables (enough to follow this post) can be seen next (check Gancio’s documentation to see the complete schema of these tables). You can also download this simplified database if you want to try the comands too.

CREATE TABLE "places" (
    "id"    INTEGER,
    "name"  VARCHAR(255) NOT NULL,
    PRIMARY KEY("id")
);

CREATE TABLE "events" (
    "id"    INTEGER,
    "title" VARCHAR(255),
    "slug"  VARCHAR(255) UNIQUE,
    "start_datetime"    INTEGER,
    "placeId"   INTEGER,
    PRIMARY KEY("id"),
    FOREIGN KEY("placeId") REFERENCES "places"("id")
);

After understanding our data schema, we can start to interview our data. We will use SQL as the language to ask our questions.

So let’s open the database in the terminal!

$ sqlite3 gancio.sqlite
SQLite version 3.44.1 2023-11-02 11:14:43
Enter ".help" for usage hints.
sqlite>

How many events did we have?

We can answer that one with a SELECT statement and COUNT() aggregate function.

-- Returning all events
SELECT * FROM events;
-- Counting all events
SELECT COUNT(*) FROM events;

The last query returns 145, the number of events that happened all the time. Show the results grouped by year seems to be more useful in our analysis. This requires to perform a data cleanup in start_datetime field.

sqlite> SELECT title, start_datetime FROM events LIMIT 5;
1° Churrasco e Vinil|1684270800
Tutorial: Raspando Dados da Internet com Python|1688212800
Reunião mensal do LHC, dia 24 de maio de 2023|1684967400
Encontro GruPy-Campinas|1685793600
Reunião mensal do LHC, dia 19 de Junho de 2023|1687213800

We can notice that the event date is stored as a Unix time, a format that is not easy to identify the year of the event.

We have date and time functions in SQLite that can be used to convert this field to a datetime. Other databases, such as PostgreSQL oy MySQL, have similar functions to handle dates and times (check the docs of each DB to discover them).

In our case we will use datetime() function.

SELECT
    start_datetime,
    datetime(start_datetime, 'unixepoch')
FROM
    events
LIMIT 5;

The first 5 results of this query will look like:

1684270800|2023-05-16 21:00:00
1688212800|2023-07-01 12:00:00
1684967400|2023-05-24 22:30:00
1685793600|2023-06-03 12:00:00
1687213800|2023-06-19 22:30:00

Using strftime() function, we can extract only the year of each event as a column:

SELECT
    start_datetime,
    datetime(start_datetime, 'unixepoch'),
    strftime(
        '%Y',
        datetime(start_datetime, 'unixepoch')
    ) AS event_year
FROM
    events
LIMIT 5;
1684270800|2023-05-16 21:00:00|2023
1688212800|2023-07-01 12:00:00|2023
1684967400|2023-05-24 22:30:00|2023
1685793600|2023-06-03 12:00:00|2023
1687213800|2023-06-19 22:30:00|2023

Now we have a event_year column that we can use to group our results and have the counting of events separated by year. We need to add GROUP BY clause to our initial query to get that.

SELECT
    strftime(
        '%Y',
        datetime(start_datetime, 'unixepoch')
    ) AS event_year,
    COUNT(*) AS num_events
FROM
    events
GROUP BY
    event_year;

And we have the following results:

2023|30
2024|88
2025|27

We had 30 events in 2023, 88 in 2024 and 27 in 2025. As we started using Gancio only in May, 2023, our dataset doesn’t have more than one entire year of data other than 2024. So we can’t affirm that there is a growing trend in the number of events. When 2025 is over we will have more data to compare.

What is the month of the year that the most events occurred?

Maybe we can do a better analysis if we consider the year and the month (so we can compare the months that we have data in all years). We can group our results by multiple fields such as:

SELECT
    strftime('%Y', datetime(start_datetime, 'unixepoch')) AS event_year,
    strftime('%m', datetime(start_datetime, 'unixepoch')) AS event_month,
    count(*) as num_events
FROM events
GROUP BY event_year, event_month;

Ordering the results by the total number of events in descending order:

SELECT
    strftime('%Y', datetime(start_datetime, 'unixepoch')) AS event_year,
    strftime('%m', datetime(start_datetime, 'unixepoch')) AS event_month,
    count(*) as num_events
FROM events
GROUP BY event_year, event_month
ORDER BY num_events DESC;

We discovered that August 2024 was the month with the highest number of events (10)! Compared with August 2023 (only 2 events) we had a huge increase. However, because we don’t have enough data (less than 3 years), we can’t say that August will always be the month with the higher number of events.

Is there any day of the week when an event is most likely to be happening in space?

Maybe we can have more insights if we look at the data grouped by day of week. This can be valuable information for someone that would like to visit us unannounced, but that would like to have a better chance of meeting someone in space.

The option %w returns an integer from 0-7 each one representing one day of week (starting on Sunday).

SELECT
    COUNT(*) AS num_events,
    strftime('%w', datetime(start_datetime, 'unixepoch')) AS day_of_week
FROM
    events
GROUP BY
    day_of_week
ORDER BY
    num_events
DESC;
43|6
29|4
26|1
18|3
14|5
14|2
1|0

We have 6 (Saturday) with 43 events, and then 4 (Thursday) with 29 events. It appears that if someone wants to visit us, if they appears on Saturday or Thursday, there is a better chance to have an event happening there.

Having numbers to reference weekday is not user-friendly, so using CASE expression (that serves a role similar to IF-THEN-ELSE in other programming languages), we can have a nicer result.

SELECT COUNT(*) AS num_events,
    CASE
        strftime('%w', datetime(start_datetime, 'unixepoch'))
        WHEN '0' THEN 'Sunday'
        WHEN '1' THEN 'Monday'
        WHEN '2' THEN 'Tuesday'
        WHEN '3' THEN 'Wednesday'
        WHEN '4' THEN 'Thursday'
        WHEN '5' THEN 'Friday'
        WHEN '6' THEN 'Saturday'
    END AS day_of_week
FROM events
GROUP BY day_of_week
ORDER BY num_events desc;
43|Saturday
29|Thursday
26|Monday
18|Wednesday
14|Tuesday
14|Friday
1|Sunday

But as mentioned before, sometimes we have online events (or events in a different venue). So we should consider in my query only the events that are in-person in the hackerspace.

We need to combine data from events and places. First let’s see how places are returned.

SELECT id, name from places;
1|Laboratório Hacker de Campinas
2|online
3|ARCA
4|Anhembi
6|UNISAL - Campus São José
7|SIRIUS
8|Candreva Proença

In events table we have placeId field that relates both tables. Using a JOIN operator together with a WHERE clause for filtering, it is possible to return only the events that happened in our place.

SELECT COUNT(*) AS num_events,
    CASE
        strftime('%w', datetime(start_datetime, 'unixepoch'))
        WHEN '0' THEN 'Sunday'
        WHEN '1' THEN 'Monday'
        WHEN '2' THEN 'Tuesday'
        WHEN '3' THEN 'Wednesday'
        WHEN '4' THEN 'Thursday'
        WHEN '5' THEN 'Friday'
        WHEN '6' THEN 'Saturday'
    END AS day_of_week
FROM events
JOIN places ON places.id = events.placeId
WHERE places.name = 'Laboratório Hacker de Campinas'
GROUP BY day_of_week
ORDER BY num_events desc;
39|Saturday
27|Thursday
18|Wednesday
13|Tuesday
13|Friday
6|Monday

The results are quite different, but Saturday and Thursday are still the busiest days of the week. Monday, on the contrary, which appeared with 26 events before, had only 6 events at our house!

What now?

What we can do with this information? If I want to visit LHC at random, probably I would try to go there on Saturday or Thursday!

But if I want to improve the use of space, so that we have more activities on other days of the week, maybe organizing more events on Tuesday or Friday is an option.

Understanding why these other days are not the favorites is also important. Sunday not having any events makes sense to me, since it’s a day that people prefer to stay at home with family.

Interviewing this dataset gave me some insights, but opened more questions in my mind. Some probably I will be able “to ask” to my database. Others I will need to talk with our hackerspace members to figure out.