Aggregation of high-frequency data — aggregate

Aggregates high-frequency trading data into daily data, using one of several trade classification algorithms.

Usage

aggregate_trades(data, algorithm = "Tick", timelag = 0, ...,
 verbose = TRUE)

Arguments

data

A dataframe with 4 variables in the following order (timestamp, price, bid, ask).

algorithm

A character string that specifies the algorithm used to determine the trade initiator (a buyer or a seller). It takes one of four values ("Tick", "Quote", "LR", "EMO"). The default value is "Tick". For more information about the different algorithms, see the Details section.

timelag

A number referring to the time lag in milliseconds used to calculate the lagged midquote, bid and ask for the algorithms "Quote", "EMO" and "LR".

...

Additional arguments passed on to the function aggregate_trades(). The recognized arguments are reportdays and is_parallel. Other arguments will be ignored.

reportdays is a binary variable that determines whether the variable day is returned. The default value is FALSE.
is_parallel is a logical variable that specifies whether the computation is performed using parallel or sequential processing. The default value is TRUE. For more details, please refer to the vignette 'Parallel processing' in the package, or online.

verbose

A binary variable that determines whether detailed information about the progress of the trade classification is displayed. No output is produced when verbose is set to FALSE. The default value is TRUE.

Value

Returns a dataframe of two (or three) variables. If reportdays is set to TRUE, the returned dataframe has three variables {day, b, s}. If reportdays is set to FALSE, the returned dataframe has two variables {b, s} and can therefore be used directly for the estimation of the PIN and MPIN models.
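
To make the shape of the returned dataframe concrete, the following is a minimal base-R sketch of the aggregation step only. It is not the package's implementation: the toy dataframe classified, its columns day and side, and the choice to leave unclassified trades (NA) out of both counts are assumptions made for illustration.

```r
# Toy classified trades (hypothetical): 1 = buy, -1 = sell, NA = unclassified.
classified <- data.frame(
  day  = c(1, 1, 1, 2, 2, 3),
  side = c(1, -1, 1, NA, -1, 1)
)

# Count buys (b) and sells (s) per day; NA trades are dropped from both counts.
days  <- sort(unique(classified$day))
daily <- data.frame(
  day = days,
  b   = as.vector(tapply(classified$side == 1,  classified$day, sum, na.rm = TRUE)),
  s   = as.vector(tapply(classified$side == -1, classified$day, sum, na.rm = TRUE))
)

# Three-variable form {day, b, s}, analogous to reportdays = TRUE.
print(daily)

# Two-variable form {b, s}, analogous to reportdays = FALSE.
print(daily[, c("b", "s")])
```

Dropping the day column leaves exactly the two-column layout of buys and sells described above.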

Details

The argument algorithm takes one of four values:

"Tick" refers to the tick algorithm: a trade is classified as a buy (sell) if its price is above (below) the closest different price of a previous trade.
"Quote" refers to the quote algorithm: a trade is classified as a buy (sell) if its price is above (below) the midpoint of the bid-ask spread. Trades executed at the mid-spread are not classified.
"LR" refers to the LR algorithm as in Lee and Ready (1991). It classifies a trade as a buy (sell) if its price is above (below) the mid-spread (quote algorithm), and uses the tick algorithm if the trade price is at the mid-spread.
"EMO" refers to the EMO algorithm as in Ellis et al. (2000). It classifies trades at the bid (ask) as sells (buys) and uses the tick algorithm to classify trades within the then-prevailing bid-ask spread.
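
The four rules above can be sketched in a few lines of base R. This is an illustrative sketch under stated assumptions, not the code used by aggregate_trades(): the helper names tick_rule and quote_rule, the toy prices, and the coding 1 = buy, -1 = sell, NA = unclassified are all hypothetical, and no time lag is applied.

```r
# Toy trades (hypothetical values) with the quotes prevailing at each trade.
price <- c(100, 102, 102, 101, 100, 101)
bid   <- c( 99, 100, 101, 100,  98, 101)
ask   <- c(101, 102, 103, 102, 101, 103)

# Tick rule: compare each price with the closest different previous price.
tick_rule <- function(price) {
  side <- sign(c(NA, diff(price)))   # 1 uptick, -1 downtick, 0 unchanged
  side[which(side == 0)] <- NA       # unchanged prices look further back
  for (i in seq_along(side)[-1]) {
    if (is.na(side[i])) side[i] <- side[i - 1]
  }
  side
}

# Quote rule: compare each price with the midpoint of bid and ask.
quote_rule <- function(price, bid, ask) {
  side <- sign(price - (bid + ask) / 2)
  side[which(side == 0)] <- NA       # trades at the mid-spread stay unclassified
  side
}

tick <- tick_rule(price)
qt   <- quote_rule(price, bid, ask)

# LR: quote rule first, tick rule for trades at the mid-spread.
lr <- ifelse(is.na(qt), tick, qt)

# EMO: trades at the ask are buys, trades at the bid are sells,
# all other trades fall back to the tick rule.
emo <- ifelse(price == ask, 1, ifelse(price == bid, -1, tick))

print(data.frame(price, bid, ask, tick, quote = qt, lr, emo))
```

In this toy data the fifth trade shows where LR and EMO can disagree: the price is above the midpoint, so LR calls it a buy, but it is neither at the bid nor at the ask, so EMO defers to the tick rule, which sees a downtick and calls it a sell.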

Lee and Ready (1991) recommend using the mid-spread from five seconds earlier (the '5-second' rule), which mitigates trade misclassification for many of the 150 NYSE stocks they analyze. More recent studies, such as Piwowar and Wei (2006) and Aktas and Kryzanowski (2014), show that 1-second lagged midquotes yield lower misclassification rates. The default value of timelag is 0 (no time lag). Given the ultra-fast nature of today's financial markets, timelag is expressed in milliseconds, so a 1-second lag is entered as 1000, and lags shorter than one second can be implemented by entering values such as 100 or 500.
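
As an illustration of what a lag in milliseconds means, the sketch below looks up, for each observation, the midquote that prevailed timelag milliseconds earlier. It is a hypothetical helper written for this page, not the package's internal code; it assumes timestamps are POSIXct values sorted in increasing order and returns NA when no observation is old enough.

```r
# Hypothetical helper: midquote prevailing `timelag` milliseconds earlier.
lagged_midquote <- function(timestamp, bid, ask, timelag = 0) {
  t_ms <- as.numeric(timestamp) * 1000          # seconds to milliseconds
  # Index of the last observation at or before (t - timelag); 0 if none.
  idx <- findInterval(t_ms - timelag, t_ms)
  idx[idx == 0] <- NA
  ((bid + ask) / 2)[idx]
}

# Toy quotes (hypothetical values) observed at 0, 0.5, 1, 2 and 2.5 seconds.
ts  <- as.POSIXct("2024-01-02 09:30:00", tz = "UTC") + c(0, 0.5, 1, 2, 2.5)
bid <- c( 99, 100, 101, 100,  98)
ask <- c(101, 102, 103, 102, 101)

mid_lag0    <- lagged_midquote(ts, bid, ask, timelag = 0)     # no lag
mid_lag500  <- lagged_midquote(ts, bid, ask, timelag = 500)   # half a second
mid_lag1000 <- lagged_midquote(ts, bid, ask, timelag = 1000)  # one second

print(data.frame(ts, mid_lag0, mid_lag500, mid_lag1000))
```

With a lag of 0 the lookup returns the contemporaneous midquote, which matches the default behaviour described above.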

References

Aktas OU, Kryzanowski L (2014). “Trade classification accuracy for the BIST.” Journal of International Financial Markets, Institutions and Money, 33, 259-282. ISSN 1042-4431.

Ellis K, Michaely R, O'Hara M (2000). “The Accuracy of Trade Classification Rules: Evidence from Nasdaq.” The Journal of Financial and Quantitative Analysis, 35(4), 529-551.

Lee CMC, Ready MJ (1991). “Inferring Trade Direction from Intraday Data.” The Journal of Finance, 46(2), 733-746. ISSN 00221082, 15406261.

Piwowar MS, Wei L (2006). “The Sensitivity of Effective Spread Estimates to Trade-Quote Matching Algorithms.” Electronic Markets, 16(2), 112-129.

Examples

# There is a preloaded dataset called 'hfdata' contained in the package.
# It is an artificially created high-frequency trading dataset. It
# contains 100 000 trades and five variables: 'timestamp', 'price',
# 'volume', 'bid', and 'ask'. For more information, type ?hfdata.

xdata <- hfdata
xdata$volume <- NULL

# Use the LR algorithm with a timelag of 0 milliseconds

daytrades <- aggregate_trades(xdata, algorithm = "LR", verbose = FALSE)

# Since the argument 'reportdays' is set to FALSE by default, the
# output 'daytrades' can be used directly for the estimation of the PIN
# model, namely using pin_ea().

estimate <- pin_ea(daytrades, verbose = FALSE)

# Show the estimate

show(estimate)
#> ----------------------------------
#> PIN estimation completed successfully
#> ----------------------------------
#> Estimation method 	: Maximum likelihood estimation
#> Initial parameter sets	: Ersan and Alici (2016)
#> Likelihood factorization: Ersan (2016)
#> ----------------------------------
#> 5 initial set(s) are used in the estimation 
#> Type object@initialsets to see the initial parameter sets used
#> 
#>  PIN model  
#> 
#> ==========  ===========
#> Variables   Estimates  
#> ==========  ===========
#> alpha       0.739132   
#> delta       0.274509   
#> mu          490.85     
#> eps.b       531.6      
#> eps.s       554.88     
#> ----                   
#> Likelihood  (760.765)  
#> PIN         0.250332   
#> ==========  ===========
#> 
#> -------
#> Running time: 0.522 seconds