Simple E-Library SQL Database Design Project

Posted on Sunday, Jan 7, 2024

Background

This project is part of my submission for an SQL course that I’m taking as part of the Pacmann Data Science Training program.

Project Objectives

The objective of this project is to design a database system for an e-library application. The application serves several libraries, and each library has a diverse collection of books. Through the application, users can borrow books or place a hold on a book if it is currently unavailable for borrowing.

PostgreSQL is used to create the database, while Python is used to populate it with dummy data (to verify that the database design works as intended).

Designing the Database

Mission Statement

The library database system must be able to do the following:

Satisfying the above requirements will support the day-to-day operation of the application (e.g. book organization and lending management) as well as analysis (e.g. identifying which books have been popular lately).

Table Structure and Business Rules

The tables, their purposes, the fields of each table, and the constraints on each field are given below:

(Note: PK = primary key.)

Additional information:

Table Relationships

The relationships between the tables are as follows:

Implementing the Design

ERD Design

Based on the table specifications described previously, the entity relationship diagram (ERD) is as follows:

ERD

(Diagram created with https://dbdiagram.io/.)

Creating the Tables in PostgreSQL

All tables in the database are created with the following statements (using PostgreSQL syntax):

create table libraries (
    library_id serial primary key,
    library_name varchar(256) not null unique
);

create table categories (
    category_id serial primary key,
    category_name varchar(256) not null unique
);

create table books (
    book_id serial primary key,
    title varchar(256) not null,
    author varchar(256) not null,
    category_id integer not null references categories,
    library_id integer not null references libraries,
    total_quantity integer not null,
    constraint total_quantity_nonnegative check (total_quantity >= 0)
);

create table users (
    user_id serial primary key,
    email varchar(256) not null unique,
    user_name varchar(256) not null
);

create table queues (
    queue_id serial primary key,
    queue_start timestamp not null default now(),
    queue_end timestamp null,
    user_id integer not null references users,
    book_id integer not null references books
);

create table loans (
    loan_id serial primary key,
    book_id integer not null references books,
    user_id integer not null references users,
    queue_id integer null references queues,
    loan_start timestamp not null default now(),
    loan_due timestamp not null generated always as
        (loan_start + interval '14 days') stored,
    loan_end timestamp null,
    constraint due_date_after_loan_date check (loan_due > loan_start),
    constraint return_date_between_two_dates check (loan_end >= loan_start and loan_end <= loan_due)
);
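
As a quick sanity check of the schema (not part of the original project scripts), a few hypothetical inserts can be used to confirm that the defaults and the generated loan_due column behave as intended. The values below are made up and assume a fresh database where the serial IDs start at 1:

-- hypothetical sample rows, assuming a fresh database (serial IDs start at 1)
insert into libraries (library_name) values ('Central Library');
insert into categories (category_name) values ('Literature & Fiction');
insert into books (title, author, category_id, library_id, total_quantity)
    values ('Example Book', 'Example Author', 1, 1, 3);
insert into users (email, user_name) values ('jane@example.com', 'Jane Doe');

-- loan_start defaults to now(); loan_due is computed automatically as loan_start + 14 days
insert into loans (book_id, user_id) values (1, 1);

select loan_start, loan_due from loans;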

Populating the Database

Generating Dummy Data

There are three kinds of data to be generated for this database:

  1. Book details data
  2. User details data
  3. Lending/queueing activities data

For book details, I used the Amazon Kindle Books Dataset 2023 (limited to the 300 most-reviewed books in the dataset). Meanwhile, for user details I used the Faker Python library to generate dummy data.

Generating the lending activities data is not as straightforward, since there are restrictions like “the number of borrowed books cannot exceed the book’s total quantity at any given time,” or “users cannot borrow more than two books at the same time,” and so on. If the data were generated purely at random, those restrictions would certainly be violated. Therefore, the best method I found to generate the lending activities data is to simulate the library application itself.
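
For instance, the first restriction boils down to comparing a book's total_quantity with its number of active (not yet returned) loans. The query below is my own illustration of that check, not the exact logic used in the simulation script (book_id = 1 is just an example value):

-- copies of a given book still available for borrowing
select
    b.total_quantity
        - count(l.loan_id) filter (where l.loan_end is null) as copies_available
from books b
left join loans l on l.book_id = b.book_id
where b.book_id = 1
group by b.book_id, b.total_quantity;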

The simulation is done via a Python script, which simulates the flow of events that would occur if the application were actually deployed. For example, when a user requests to borrow a book, the following events are executed:

Request borrow event
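
In SQL terms, the outcome of that flow is roughly one of the following two statements, depending on whether a copy is available (a simplified sketch with hypothetical IDs, not the literal simulation code):

-- a copy is available: record a new loan (book_id and user_id are hypothetical)
insert into loans (book_id, user_id) values (42, 7);

-- no copy is available: place the user in the hold queue for the book
insert into queues (book_id, user_id) values (42, 7);

When a hold is eventually fulfilled, the resulting loan can point back to the originating queue row through the nullable loans.queue_id column.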

Meanwhile, the following events are executed when a user returns a borrowed book:

Return book event
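
Again as a rough SQL sketch (with hypothetical IDs, not the actual simulation code), returning a book amounts to closing the loan and, if someone is waiting, ending their queue entry and creating a new loan that references it:

-- close the loan being returned (loan_id is hypothetical)
update loans set loan_end = now() where loan_id = 123;

-- if a user is waiting for this book, end their hold and start their loan
update queues set queue_end = now() where queue_id = 55;
insert into loans (book_id, user_id, queue_id) values (42, 8, 55);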

When running the simulation, the following parameters are specified so the results are reproducible:

The script for generating the data is too long to include in this post, but it can be found in the scripts generate_data.py and run_simulation.py at https://github.com/abangfarhan/simple-library-database.

Inputting the Dummy Data

There are two methods for inputting the dummy data: (1) saving the dummy data to CSV files and then importing them from pgAdmin, or (2) saving the data directly from Python. I chose the second option, since it allows the project to be rebuilt more seamlessly (I can recreate all the tables with a single script, instead of going back and forth between Python and pgAdmin). Here is an excerpt from the Python script that saves all of the tables into the database:

from sqlalchemy import create_engine
import pandas as pd

# ...

engine = create_engine(database_url)
dfs = (
    (df_categories, 'category', 'categories'),
    (df_libraries, 'library', 'libraries'),
    (df_books, 'book', 'books'),
    (df_users, 'user', 'users'),
    (df_queues, 'queue', 'queues'),
    (df_loans, 'loan', 'loans'),
)
for df, id_prefix, table_name in dfs:
    df = df.reset_index(drop=True)
    df[f'{id_prefix}_id'] = df.index
    cols = df.columns

    # convert index columns to start with 1
    id_cols = cols[cols.str.endswith('_id')]
    for col in id_cols:
        df[col] = df[col] + 1

    # save table to database
    df.to_sql(table_name, engine, index=False, if_exists='append')
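
After the script finishes, a quick row count per table (for example with the query below) confirms that all of the data were appended as expected:

-- verify that every table was populated
select 'libraries' as table_name, count(*) from libraries
union all select 'categories', count(*) from categories
union all select 'books', count(*) from books
union all select 'users', count(*) from users
union all select 'queues', count(*) from queues
union all select 'loans', count(*) from loans;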

Retrieving the Data

Before we dive into the interesting questions, the following are some basic characteristics of the dummy data:

Next, we will examine some interesting questions using the dummy data.

1. What categories are borrowed the most?

Query:

select category_name, count(*) as loan_count
from loans
left join books using(book_id)
left join categories using(category_id)
group by category_name
order by loan_count desc
limit 5;

Output:

       category_name        | loan_count 
----------------------------+------------
 Literature & Fiction       |       6772
 Teen & Young Adult         |       1387
 Biographies & Memoirs      |        925
 Religion & Spirituality    |        494
 Politics & Social Sciences |        362

From the above, we can see the top five most-borrowed categories, with “Literature & Fiction” at the top, having been borrowed 6,772 times.

2. Which authors are borrowed the most?

Query:

select author, count(*) as loan_count
from loans
left join books using(book_id)
group by author
order by loan_count desc
limit 5;

Output:

     author      | loan_count 
-----------------+------------
 Sarah J. Maas   |        593
 Colleen Hoover  |        390
 J.K. Rowling    |        336
 Diana Gabaldon  |        284
 Suzanne Collins |        200

From the above, we can see the top five authors whose books are borrowed the most, led by Sarah J. Maas, whose books have been borrowed 593 times.

3. What’s the average, minimum, and maximum borrow duration (in days)?

Query:

select
    avg(loan_end::date - loan_start::date) as average_duration,
    min(loan_end::date - loan_start::date) as minimum_duration,
    max(loan_end::date - loan_start::date) as maximum_duration
from loans
where loan_end is not null;

Output:

  average_duration  | minimum_duration | maximum_duration 
--------------------+------------------+------------------
 9.4659863945578231 |                1 |               14

As it turns out, the average borrow duration is about 9.47 days, and the maximum is 14 days, consistent with the 14-day loan period enforced by the loans table.

4. What book is borrowed the most in each month?

Query:

select date, title, count
from (
    select
        to_char(loan_start, 'yyyy-mm') as date,
        book_id,
        count(*) as count,
        rank() over(partition by to_char(loan_start, 'yyyy-mm')
                    order by count(*) desc,
                    book_id desc) as rank
    from loans
    group by date, book_id
)
left join books
    using(book_id)
where rank = 1;

Output:

  date   |                            title                            | count 
---------+-------------------------------------------------------------+-------
 2015-01 | Fair Warning: The Instant Number One Bestselling Thriller   |    10
 2015-02 | A Little Life: A Novel                                      |     9
 2015-03 | Tower of Dawn (Throne of Glass Book 6)                      |    11
 2015-04 | The Hate U Give: A Printz Honor Winner                      |     9
 2015-05 | 1984                                                        |     9
 2015-06 | The 5 Love Languages: The Secret to Love that Lasts         |     8
 2015-07 | Sooley: A Novel                                             |     9
 2015-08 | The Beekeeper of Aleppo: A Novel                            |    10
 2015-09 | The Lincoln Highway: A Novel                                |    10
 2015-10 | Hillbilly Elegy: A Memoir of a Family and Culture in Crisis |     9
 2015-11 | Mad Honey: A Novel                                          |     7

From the above, we can see the most borrowed book in each month, along with how many times it was borrowed. (Note: if two or more books tie for the highest count in a month, the one with the largest book_id is kept, due to the book_id desc tie-breaker in the rank() ordering.) As it turns out, each month has a different most-popular book. We can also see that even the top book in each month is only borrowed about 7–11 times.

5. Which categories are borrowed the most in each month?

Query:

select date, category_name, count
from (
    select
        to_char(loan_start, 'yyyy-mm') as date,
        category_id,
        count(*) as count,
        rank() over(partition by to_char(loan_start, 'yyyy-mm')
                    order by count(*) desc,
                    category_id desc) as rank
    from loans
    left join books
        using(book_id)
    group by date, category_id
)
left join categories
    using(category_id)
where rank = 1;

Output:

  date   |    category_name     | count 
---------+----------------------+-------
 2015-01 | Literature & Fiction |   660
 2015-02 | Literature & Fiction |   599
 2015-03 | Literature & Fiction |   601
 2015-04 | Literature & Fiction |   573
 2015-05 | Literature & Fiction |   674
 2015-06 | Literature & Fiction |   604
 2015-07 | Literature & Fiction |   631
 2015-08 | Literature & Fiction |   644
 2015-09 | Literature & Fiction |   653
 2015-10 | Literature & Fiction |   658
 2015-11 | Literature & Fiction |   475

As it turns out, “Literature & Fiction” was the most borrowed category in every month in the dataset, with around 600 loans per month. Why only one category? The reason becomes clear if we run the following query:

select category_name, count(*) as count
from books
left join categories
    using(category_id)
group by category_name
order by count desc
limit 5;

We can see the following output:

       category_name        | count 
----------------------------+-------
 Literature & Fiction       |   183
 Teen & Young Adult         |    38
 Biographies & Memoirs      |    27
 Religion & Spirituality    |    14
 Politics & Social Sciences |    11

The query above shows the top five categories by number of books, along with the book count in each category. As you can see, the books in the dataset are heavily concentrated in the “Literature & Fiction” category, which explains why that category is the most popular every month.

Conclusion

In this report, I have outlined a database design and implementation for an e-library application. The design should satisfy most of the application’s requirements. I have also demonstrated the implementation of the database, including the ERD, the SQL statements to create the tables, and a Python script to populate the database with dummy data. Lastly, I have shown some SQL queries that answer useful and interesting questions about the data.

References