MATCH_RECOGNIZE
(short intro)

Alex S, http://theparticle.com/
alex@theparticle.com 

ref: https://trino.io/docs/current/sql/match-recognize.html

The MATCH_RECOGNIZE is similar to windowing-functions, in that it operates on a window of records. It provides for regex matching on sequence of records!

In traditional string regular-expressions, suppose we wanted to detect an "A" followed by one-or-more "B"s, followed by a single "C". The regex would be "AB+C", and we'd write something like this

with sample_strings as (
  select 'AAAAAABBBBBCCCCC' as s union all
  select 'BBBBBCCCCC' union all
  select 'AAAAAABBBBB' union all
  select 'AAAAAACCCCC' union all
  select 'ABCCCCC' union all
  select 'AAAAAABBBBBC' union all
  select 'ABC'  
)
select s, regexp_match(s, 'AB+C') mtched
from sample_strings
where regexp_like(s, 'AB+C');

output:
        s         |  mtched   
------------------+-----------
 AAAAAABBBBBCCCCC | {ABBBBBC}
 ABCCCCC          | {ABC}
 AAAAAABBBBBC     | {ABBBBBC}
 ABC              | {ABC}
(4 rows)


With SQL2016 and MATCH_RECOGNIZE we have something similar with records. Suppose we wanted to detect a sequence of events such as, 
for each customer: 
  one-or-more sell orders, 
  a sequence of one-or-more buy orders at progressively increasing prices, 
  followed by one-or-more sell-trades,
  followed by one-or-more buy-order-cancels.

note: this is often called layering or spoofing: 
https://en.wikipedia.org/wiki/Layering_(finance)
https://en.wikipedia.org/wiki/Spoofing_(finance)

Suppose we have a mktevents table such as: 

drop table mktevents;
create table mktevents (
  type_cd varchar(1),          --event type: O-order, T-trade, C-cancel
  tim bigint,                  --timestamp, nanoseconds since epoch or whatever.
  customer_id varchar(10),     --identifies the customer
  order_id varchar(6),         --identifies the order
  side varchar(1),             --B-buy, S-sell
  price decimal(18,8),         --limit price of the order or trade price.
  qty int                      --quantity to buy/sell, or trade quantity.
);

insert into mktevents values 
 --positive scenario
 ('O',1,'XYZ','ABC123','S',10.50,100),    --sell order priced at 10.50
 ('O',2,'XYZ','ABC124','B',10.40,100),    --buy order priced at 10.40
 ('O',3,'XYZ','ABC125','B',10.41,100),
 ('O',4,'XYZ','ABC126','B',10.42,100),
 ('O',5,'XYZ','ABC127','B',10.45,100),
 ('O',6,'XYZ','ABC128','B',10.48,100),
 ('O',7,'XYZ','ABC129','B',10.49,100),    --buy order priced at 10.49
 ('T',8,'XYZ','ABC123','S',10.50,100),    --SELL trade at 10.50
 ('C',9,'XYZ','ABC124','B',null,null),    --cancels for buy interst.
 ('C',10,'XYZ','ABC125','B',null,null),
 ('C',11,'XYZ','ABC126','B',null,null),
 ('C',12,'XYZ','ABC127','B',null,null),
 ('C',13,'XYZ','ABC128','B',null,null),
 ('C',14,'XYZ','ABC129','B',null,null),
 --negative scenario 
 ('O',15,'XYZ','ABC991','S',10.50,100),    --sell order priced at 10.50
 ('T',16,'XYZ','ABC991','S',10.50,100),    --SELL trade at 10.50
 --negative scenario
 ('O',17,'XYZ','ABC888','S',10.50,100),    --sell order priced at 10.50
 ('O',18,'XYZ','ABC999','B',10.40,100),    --buy order priced at 10.40
 ('T',19,'XYZ','ABC999','B',10.40,100),    --BUY trade at 10.40
 --small positive scenario
 ('O',20,'XYZ','ABC001','S',10.50,100),    --sell order priced at 10.50
 ('O',21,'XYZ','ABC002','B',10.49,100),    --buy order priced at 10.49
 ('T',22,'XYZ','ABC001','S',10.50,100),    --SELL trade at 10.50
 ('C',23,'XYZ','ABC002','B',null,null),    --cancels for buy interst.
 --small negative scenario (buy prices are not increasing)
 ('O',24,'XYZ','ABC001','S',10.50,100),    --sell order priced at 10.50
 ('O',25,'XYZ','ABC002','B',10.49,100),    --buy order priced at 10.49
 ('O',26,'XYZ','ABC003','B',10.48,100),    --buy order priced at 10.48  NOT INCREASING.
 ('O',27,'XYZ','ABC004','B',10.47,100),    --buy order priced at 10.47
 ('T',28,'XYZ','ABC001','S',10.50,100),    --SELL trade at 10.50
 ('C',29,'XYZ','ABC002','B',null,null),    --cancel
 ('C',30,'XYZ','ABC003','B',null,null),
 ('C',31,'XYZ','ABC004','B',null,null)
;

We can use the match_recognize

SELECT *
FROM mktevents match_recognize (
  PARTITION BY customer_id
  ORDER BY tim
  MEASURES 
    match_number() as nmbr
  ALL ROWS PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (A+ B+ C+ D+)
    DEFINE
      A as type_cd='O' and side='S',
      B as type_cd='O' and side='B' and (price >= prev(price) or prev(side)='S'),
      C as type_cd='T' and side='S',
      D as type_cd='C' and side='B'
);

output:
 customer_id | tim | nmbr | type_cd | order_id | side |    price    | qty  
-------------+-----+------+---------+----------+------+-------------+------
 XYZ         |   1 |    1 | O       | ABC123   | S    | 10.50000000 |  100 
 XYZ         |   2 |    1 | O       | ABC124   | B    | 10.40000000 |  100 
 XYZ         |   3 |    1 | O       | ABC125   | B    | 10.41000000 |  100 
 XYZ         |   4 |    1 | O       | ABC126   | B    | 10.42000000 |  100 
 XYZ         |   5 |    1 | O       | ABC127   | B    | 10.45000000 |  100 
 XYZ         |   6 |    1 | O       | ABC128   | B    | 10.48000000 |  100 
 XYZ         |   7 |    1 | O       | ABC129   | B    | 10.49000000 |  100 
 XYZ         |   8 |    1 | T       | ABC123   | S    | 10.50000000 |  100 
 XYZ         |   9 |    1 | C       | ABC124   | B    |        NULL | NULL 
 XYZ         |  10 |    1 | C       | ABC125   | B    |        NULL | NULL 
 XYZ         |  11 |    1 | C       | ABC126   | B    |        NULL | NULL 
 XYZ         |  12 |    1 | C       | ABC127   | B    |        NULL | NULL 
 XYZ         |  13 |    1 | C       | ABC128   | B    |        NULL | NULL 
 XYZ         |  14 |    1 | C       | ABC129   | B    |        NULL | NULL 

 XYZ         |  20 |    2 | O       | ABC001   | S    | 10.50000000 |  100 
 XYZ         |  21 |    2 | O       | ABC002   | B    | 10.49000000 |  100 
 XYZ         |  22 |    2 | T       | ABC001   | S    | 10.50000000 |  100 
 XYZ         |  23 |    2 | C       | ABC002   | B    |        NULL | NULL 
(18 rows)


If we do not want all records, we can just pick out aggregate scenarios:

SELECT *
FROM mktevents match_recognize (
  PARTITION BY customer_id
  ORDER BY tim
  MEASURES 
    match_number() as nmbr,
    first(A.tim) as start_tim,
    last(D.tim) as end_tim,    
    avg(A.price) as avg_sell_px, 
    avg(B.price) as avg_buy_px,
    avg(C.price) as avg_sell_ex_px
  ONE ROW PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (A+ B+ C+ D+)
    DEFINE
      A as type_cd='O' and side='S',
      B as type_cd='O' and side='B' and (price >= prev(price) or prev(side)='S'),
      C as type_cd='T' and side='S',
      D as type_cd='C' and side='B'
);

output:
 customer_id | nmbr | start_tim | end_tim | avg_sell_px | avg_buy_px  | avg_sell_ex_px 
-------------+------+-----------+---------+-------------+-------------+----------------
 XYZ         |    1 |         1 |      14 | 10.50000000 | 10.44166667 |    10.50000000 
 XYZ         |    2 |        20 |      23 | 10.50000000 | 10.49000000 |    10.50000000 
(2 rows)


The MATCH_RECOGNIZE feature is an unbelievably powerful tool! It opens up amazing possibilities for pattern detection and event stream processing, making complex data analysis feel almost effortless.

Good luck!