quick summary

relational databases... SQL.... DDL.... DML....

DDL: drop table... create table.... 
DML: select ...     insert, update, delete.

functions: substr, sqrt, upper, lower, position('blah' in str), sin, cos, 
 split, ... regex match, etc., 

aggregate functions: operate on a "group". if no "group by" is specified, then they operate on the whole result.., e.g.:

select count(*) from blah;  --counts all records in blah.
select glah,count(*) from blah group by glah; --counts records per each glah.

select state,count(*) from customer group by state; --count customers by state.

library: libevents.. events.... book borrowed 'B'... book returned 'R'... 
--- count number of books currently borrowed (for the entire library).

select sum(case when evttype='B' then 1 
                when evttype='R' then -1
                else 0 end) cnt
from libevents;

--- count number of books borrowed at 4PM on 2021-01-01 ?

select sum(case when evttype='B' then 1 
                when evttype='R' then -1
                else 0 end) cnt
from libevents
where evttim <=cast('2021-01-01 16:00:00' as timestamp);

------------------------------------------------------------
CTEs...common table expressions; allow you to never use sub-queries.

with blah as (
  select * from TBL1 where ...
),
glah as (
  select * from TBL1 .... 
)
select *
from blah inner join glah.. 

---inner query alternative (UGLY):

select * 
from ( select * 
from TBL1 where ... ) a inner join ( select * 
from TBL1  ... ) b
... 


--------------------------------------------------------------

joins....

inner join 
left outer join 

    don't use: right outer join
    almost never use: full outer join
    
--------------------------------------------------------------
-- two tables that are related by a key of some kind...  you can 
--- relate them using a join... 

-- imagine a bank

-- customer table keeps info on customer
customer(custid, fname, lname, ssn, street, city, state, zip, phone, email);

-- account table links customer to account identifier.
account(accntid, custid, type_of_accnt)

-- log of all the transactions for each account (with + or - amnt).
journal(tid, accntid, amnt, tim timestamp)

--- what's the balance of John Doe's (phone number: 123-552-3463) account.
select fname, lname, phone, sum(c.amnt) balance
from customer a
  inner join account b
  on a.custid=b.custid
  inner join journal c
  on b.accntid=c.accntid
where fname='John' and lname='Doe' and phone='123-552-3463'

------------------------------------------------------
--- left outer join returns all records form "left" table, and
--- matching records from the right table (if no match for left-record is 
----is found, then a NULL is returned.)


--- find accounts with at least 10 transactions.

select accntid
from journal
group by accntid
having count(*) >= 10

--- find accounts with *less*than* 10 transactions.

select accntid
from journal
group by accntid
having count(*) < 10   --this *almost* works, 
                       --but what about accounts with 0 transactions?

---for this, we need left outer join.

select a.accntid
from account a
  left outer join journal b
  on a.accntid=b.accntid
group by a.accntid
having count(b.accntid) < 10    -- acount number of journal entries.

--- account (left table) left joined to journal (right table) ... 
--- we'll always return an accountid, even when journal has no records.
--- then we're counting number of journal entries per account.

---------------------------
-- where... is applied on each record: cannot use aggregate functions.
-- having ... is applied on each group: can use aggregate functions.

-------------------------------------------------------------------

interview-like question:

given these 2 files, a.txt and b.txt

$ cat a.txt 
1,Bob,bob@yahoo.com
2,John,john@msn.com
4,Bill,bill@google.com
6,Jane,jane@amazon.com

$ cat b.txt 
1,111 Broadway
4,1234 Fulton
5,235 Elm St.
1,4444 45th St.

write a computer program (in any language) to "inner join" them. should return similar results to a 'join' unix command line utility. do not assume that inputs are sorted.


difference between full outer join vs left outer join.

dwprod=>  select * from a full outer join b on a.id=b.id
;
 id | name |      email      | id |    address    
----+------+-----------------+----+---------------
  1 | Bob  | bob@yahoo.com   |  1 | 111 Broadway
  4 | Bill | bill@google.com |  4 | 1234 Fulton
    |      |                 |  5 | 235 Elm St.
  1 | Bob  | bob@yahoo.com   |  1 | 4444 45th St.
  2 | John | john@msn.com    |    | 
  6 | Jane | jane@amazon.com |    | 
(6 rows)

dwprod=>  select * from a left outer join b on a.id=b.id
;
 id | name |      email      | id |    address    
----+------+-----------------+----+---------------
  1 | Bob  | bob@yahoo.com   |  1 | 111 Broadway
  4 | Bill | bill@google.com |  4 | 1234 Fulton
  1 | Bob  | bob@yahoo.com   |  1 | 4444 45th St.
  2 | John | john@msn.com    |    | 
  6 | Jane | jane@amazon.com |    | 
(5 rows)


dwprod=> explain select * from a full outer join b on a.id=b.id
;
                            QUERY PLAN                            
------------------------------------------------------------------
 Hash Full Join  (cost=17.20..69.67 rows=992 width=322)
   Hash Cond: (b.id = a.id)
   ->  Seq Scan on b  (cost=0.00..16.20 rows=620 width=102)
   ->  Hash  (cost=13.20..13.20 rows=320 width=220)
         ->  Seq Scan on a  (cost=0.00..13.20 rows=320 width=220)
(5 rows)

---most common join type is: hash join.

we can force db to do nested loop join:

dwprod=> explain select * from a, b where a.id<b.id;
                            QUERY PLAN
------------------------------------------------------------------
 Nested Loop  (cost=0.00..3006.20 rows=66133 width=322)
   Join Filter: (a.id < b.id)
   ->  Seq Scan on b  (cost=0.00..16.20 rows=620 width=102)
   ->  Materialize  (cost=0.00..14.80 rows=320 width=220)
         ->  Seq Scan on a  (cost=0.00..13.20 rows=320 width=220)
(5 rows)

whenever you see ``Nested Loop'' .... BEWARE.
(especially if you're not expecting it).


----inner loop program:
alex@electron:~/dbsys$ cat innerloopjoinab.pl

use strict;

open my $a,"a.txt";
my @a = map {s/^\s+|\s+$//sgi;[split /,/,$_] } <$a>;
close $a;

open my $b,"b.txt";
my @b = map {s/^\s+|\s+$//sgi;[split /,/,$_] } <$b>;
close $b;

# inner loop join
for $a (@a){
  for $b (@b){
    if($a->[0] == $b->[0]){
      print join(",",@$a,@$b)."\n";
    }
  }
}

alex@electron:~/dbsys$ perl innerloopjoinab.pl 
1,Bob,bob@yahoo.com,1,111 Broadway
1,Bob,bob@yahoo.com,1,4444 45th St.
4,Bill,bill@google.com,4,1234 Fulton

---- merge sort (using "join" command line program)

$ cat a.txt |sort > a1.txt
$ cat b.txt |sort > b1.txt

$ join -t, -j1 a1.txt b1.txt
1,Bob,bob@yahoo.com,111 Broadway
1,Bob,bob@yahoo.com,4444 45th St.
4,Bill,bill@google.com,1234 Fulton


---------------------------------------------------------
-- imagine 2 tables:

xyz_events(...., start_time, end_time )
abc_events(...., tim)


--- find all the abc events that are within some xyz events.

select ...
from abc_events a
  inner join xyz_events b
  on a.tim between b.start_time and b.end_time

-- this kind of query doesn't seem "bad"... except most databases will do
-- inner loop join.


----hash join explanation---

pretend that hash function is just the "left letter", e.g.:

hash("john") = j.
hash("johnson") = j.

only compare stuff where "hash" matches.... 

seledct ...
from ....
where ... ---only compare when hash matches: this "simple" hash would only compare 1/26 of the data...  (not everything to everything).