quick summary relational databases... SQL.... DDL.... DML.... DDL: drop table... create table.... DML: select ... insert, update, delete. functions: substr, sqrt, upper, lower, position('blah' in str), sin, cos, split, ... regex match, etc., aggregate functions: operate on a "group". if no "group by" is specified, then they operate on the whole result.., e.g.: select count(*) from blah; --counts all records in blah. select glah,count(*) from blah group by glah; --counts records per each glah. select state,count(*) from customer group by state; --count customers by state. library: libevents.. events.... book borrowed 'B'... book returned 'R'... --- count number of books currently borrowed (for the entire library). select sum(case when evttype='B' then 1 when evttype='R' then -1 else 0 end) cnt from libevents; --- count number of books borrowed at 4PM on 2021-01-01 ? select sum(case when evttype='B' then 1 when evttype='R' then -1 else 0 end) cnt from libevents where evttim <=cast('2021-01-01 16:00:00' as timestamp); ------------------------------------------------------------ CTEs...common table expressions; allow you to never use sub-queries. with blah as ( select * from TBL1 where ... ), glah as ( select * from TBL1 .... ) select * from blah inner join glah.. ---inner query alternative (UGLY): select * from ( select * from TBL1 where ... ) a inner join ( select * from TBL1 ... ) b ... -------------------------------------------------------------- joins.... inner join left outer join don't use: right outer join almost never use: full outer join -------------------------------------------------------------- -- two tables that are related by a key of some kind... you can --- relate them using a join... -- imagine a bank -- customer table keeps info on customer customer(custid, fname, lname, ssn, street, city, state, zip, phone, email); -- account table links customer to account identifier. account(accntid, custid, type_of_accnt) -- log of all the transactions for each account (with + or - amnt). journal(tid, accntid, amnt, tim timestamp) --- what's the balance of John Doe's (phone number: 123-552-3463) account. select fname, lname, phone, sum(c.amnt) balance from customer a inner join account b on a.custid=b.custid inner join journal c on b.accntid=c.accntid where fname='John' and lname='Doe' and phone='123-552-3463' ------------------------------------------------------ --- left outer join returns all records form "left" table, and --- matching records from the right table (if no match for left-record is ----is found, then a NULL is returned.) --- find accounts with at least 10 transactions. select accntid from journal group by accntid having count(*) >= 10 --- find accounts with *less*than* 10 transactions. select accntid from journal group by accntid having count(*) < 10 --this *almost* works, --but what about accounts with 0 transactions? ---for this, we need left outer join. select a.accntid from account a left outer join journal b on a.accntid=b.accntid group by a.accntid having count(b.accntid) < 10 -- acount number of journal entries. --- account (left table) left joined to journal (right table) ... --- we'll always return an accountid, even when journal has no records. --- then we're counting number of journal entries per account. --------------------------- -- where... is applied on each record: cannot use aggregate functions. -- having ... is applied on each group: can use aggregate functions. ------------------------------------------------------------------- interview-like question: given these 2 files, a.txt and b.txt $ cat a.txt 1,Bob,bob@yahoo.com 2,John,john@msn.com 4,Bill,bill@google.com 6,Jane,jane@amazon.com $ cat b.txt 1,111 Broadway 4,1234 Fulton 5,235 Elm St. 1,4444 45th St. write a computer program (in any language) to "inner join" them. should return similar results to a 'join' unix command line utility. do not assume that inputs are sorted. difference between full outer join vs left outer join. dwprod=> select * from a full outer join b on a.id=b.id ; id | name | email | id | address ----+------+-----------------+----+--------------- 1 | Bob | bob@yahoo.com | 1 | 111 Broadway 4 | Bill | bill@google.com | 4 | 1234 Fulton | | | 5 | 235 Elm St. 1 | Bob | bob@yahoo.com | 1 | 4444 45th St. 2 | John | john@msn.com | | 6 | Jane | jane@amazon.com | | (6 rows) dwprod=> select * from a left outer join b on a.id=b.id ; id | name | email | id | address ----+------+-----------------+----+--------------- 1 | Bob | bob@yahoo.com | 1 | 111 Broadway 4 | Bill | bill@google.com | 4 | 1234 Fulton 1 | Bob | bob@yahoo.com | 1 | 4444 45th St. 2 | John | john@msn.com | | 6 | Jane | jane@amazon.com | | (5 rows) dwprod=> explain select * from a full outer join b on a.id=b.id ; QUERY PLAN ------------------------------------------------------------------ Hash Full Join (cost=17.20..69.67 rows=992 width=322) Hash Cond: (b.id = a.id) -> Seq Scan on b (cost=0.00..16.20 rows=620 width=102) -> Hash (cost=13.20..13.20 rows=320 width=220) -> Seq Scan on a (cost=0.00..13.20 rows=320 width=220) (5 rows) ---most common join type is: hash join. we can force db to do nested loop join: dwprod=> explain select * from a, b where a.id Seq Scan on b (cost=0.00..16.20 rows=620 width=102) -> Materialize (cost=0.00..14.80 rows=320 width=220) -> Seq Scan on a (cost=0.00..13.20 rows=320 width=220) (5 rows) whenever you see ``Nested Loop'' .... BEWARE. (especially if you're not expecting it). ----inner loop program: alex@electron:~/dbsys$ cat innerloopjoinab.pl use strict; open my $a,"a.txt"; my @a = map {s/^\s+|\s+$//sgi;[split /,/,$_] } <$a>; close $a; open my $b,"b.txt"; my @b = map {s/^\s+|\s+$//sgi;[split /,/,$_] } <$b>; close $b; # inner loop join for $a (@a){ for $b (@b){ if($a->[0] == $b->[0]){ print join(",",@$a,@$b)."\n"; } } } alex@electron:~/dbsys$ perl innerloopjoinab.pl 1,Bob,bob@yahoo.com,1,111 Broadway 1,Bob,bob@yahoo.com,1,4444 45th St. 4,Bill,bill@google.com,4,1234 Fulton ---- merge sort (using "join" command line program) $ cat a.txt |sort > a1.txt $ cat b.txt |sort > b1.txt $ join -t, -j1 a1.txt b1.txt 1,Bob,bob@yahoo.com,111 Broadway 1,Bob,bob@yahoo.com,4444 45th St. 4,Bill,bill@google.com,1234 Fulton --------------------------------------------------------- -- imagine 2 tables: xyz_events(...., start_time, end_time ) abc_events(...., tim) --- find all the abc events that are within some xyz events. select ... from abc_events a inner join xyz_events b on a.tim between b.start_time and b.end_time -- this kind of query doesn't seem "bad"... except most databases will do -- inner loop join. ----hash join explanation--- pretend that hash function is just the "left letter", e.g.: hash("john") = j. hash("johnson") = j. only compare stuff where "hash" matches.... seledct ... from .... where ... ---only compare when hash matches: this "simple" hash would only compare 1/26 of the data... (not everything to everything).