Bucket Joins
alex@theparticle.com

Front Running Regulation

Front-running is the practice in which a broker or trading firm, knowing of a pending customer order (or an upcoming large trade), delays or withholds execution of that order, trades first for its own (proprietary) account, and then fills the customer's order. The broker aims to profit from the expected price movement caused by the client's order. Under FINRA Rule 5320, a firm that holds a customer's order must not execute a proprietary trade on the same side of the market at a price that would satisfy the customer (unless it then promptly executes the customer's order at the same or better price).

Pretend we have access to order events, such as order arrived, order executed, etc. To spot instances of front running, we need to figure out when the broker has an open customer order. In other words, for each customer order we need to find a "start" and "end" time: start is when the order arrives, and end is when we see an execution or cancelation with zero leaves quantity.

create table tmp_customer_orders as
select tdate,symbol,oid,side,firm,
       min(case when event_type='ORDER' then tim else null end) as start_tim,
       max(case when event_type in ('EXECUTION','CANCEL') and lvs_qty=0 then tim else null end) as end_tim
from order_events
where account_type='CUSTOMER'
group by tdate,symbol,oid,side,firm;

(we will ignore the scenario where an order is never executed or canceled).

Now the front running surveillance is essentially:

select *
from order_events a
inner join tmp_customer_orders b
   on a.tdate=b.tdate and a.symbol=b.symbol and a.side=b.side and a.firm=b.firm
  and a.tim between b.start_tim and b.end_tim
where a.event_type='EXECUTION'
  and a.account_type='PROP'

We check this in our dev environment, and it works great. We run it in production, and it never returns: we run it, and after 8 hours we terminate it. We run EXPLAIN PLAN on it, and see that it uses a hash join on tdate,symbol,side,firm. Looks good. We then look at the stats. A small number of brokers (firms) have millions of customer orders and a few hundred thousand prop trades. The problem is not the hash join, it is the implied inner-loop join within each tdate,symbol,side,firm.

BUCKET JOINS

One solution is to do a bucket-join. We need to create a key to join on. On the prop trade side, we can create a key via:

create table tmp_prop_trade as
select a.*, floor(extract(epoch from tim)/60)::bigint as minute_bucket
from order_events a
where a.event_type='EXECUTION'
  and a.account_type='PROP';

On the tmp_customer_orders side, we need to "generate" a record for every spanned minute...

create table tmp_customer_orders_buckets as
with wbucket as (
  select a.*,
         floor(extract(epoch from start_tim)/60)::bigint as start_bucket,
         floor(extract(epoch from end_tim)/60)::bigint as end_bucket
  from tmp_customer_orders a
)
select a.*, generate_series(start_bucket,end_bucket) as minute_bucket
from wbucket a;

[note: the above is for PostgreSQL; most other databases have a similar feature for synthesizing records]

Once we have that... we can just do an equality join, with a where clause to check the range, e.g.:

select *
from tmp_prop_trade a
inner join tmp_customer_orders_buckets b
   on a.tdate=b.tdate and a.symbol=b.symbol and a.side=b.side and a.firm=b.firm
  and a.minute_bucket=b.minute_bucket
where a.tim between b.start_tim and b.end_tim

Now the hash join (or sort-merge join) will be on tdate,symbol,side,firm,minute_bucket, and the implied inner-loop join will only include orders and trades within each minute.
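To make the bucket expansion concrete, here is a small standalone illustration (the date and times are made up): a customer order open from 09:30:12 to 09:33:45 spans four minute buckets, so generate_series emits four rows for it, and a prop execution stamped anywhere in that window lands in exactly one of them.

-- illustration only: expand one made-up order interval into minute buckets
select generate_series(
         floor(extract(epoch from timestamp '2024-03-01 09:30:12')/60)::bigint,  -- start minute
         floor(extract(epoch from timestamp '2024-03-01 09:33:45')/60)::bigint   -- end minute
       ) as minute_bucket;
-- returns 4 rows: one epoch-minute each for 09:30, 09:31, 09:32, 09:33

Note that the range check in the final where clause is still required: the bucket equality only narrows the candidates to the same minute, while start_tim and end_tim give the exact answer.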
If performance is still bad, we can cut it into 10-second intervals, etc., as sketched below.
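A rough sketch of that refinement, with hypothetical table and column names (tmp_prop_trade_10s, tmp_customer_orders_buckets_10s, sec10_bucket) but otherwise the same shape as the minute-bucket version: divide the epoch seconds by 10 instead of 60 on both sides.

-- hypothetical 10-second-bucket variant of tmp_prop_trade
create table tmp_prop_trade_10s as
select a.*, floor(extract(epoch from tim)/10)::bigint as sec10_bucket
from order_events a
where a.event_type='EXECUTION'
  and a.account_type='PROP';

-- hypothetical 10-second-bucket variant of tmp_customer_orders_buckets
create table tmp_customer_orders_buckets_10s as
with wbucket as (
  select a.*,
         floor(extract(epoch from start_tim)/10)::bigint as start_bucket,
         floor(extract(epoch from end_tim)/10)::bigint as end_bucket
  from tmp_customer_orders a
)
select a.*, generate_series(start_bucket,end_bucket) as sec10_bucket
from wbucket a;

The surveillance join is identical, except it equates sec10_bucket instead of minute_bucket. The trade-off is row count: each open order now generates roughly six times as many bucket rows, in exchange for a much smaller inner loop within each bucket.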