# An R package for estimating the probability of informed trading

Source: `R/PINstimation.R`

`PINstimation-package.Rd`

The package provides utilities for estimating probability of informed
trading measures: the original PIN (`PIN`) as introduced by
Easley and O'Hara (1992) and Easley et al. (1996), the multilayer PIN
(`MPIN`) as introduced by Ersan (2016), the adjusted PIN (`AdjPIN`) as
introduced by Duarte and Young (2009), and the volume-synchronized PIN
(`VPIN`) as introduced by Easley et al. (2011) and Easley et al. (2012).
Estimations of `PIN`, `MPIN`, and `AdjPIN` are subject to floating-point
exception errors and are sensitive to the choice of initial values.
Researchers have therefore developed factorizations of the model likelihood
functions, as well as algorithms for determining initial parameter sets for
the maximum likelihood estimation (MLE henceforth).
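
For orientation, the original PIN model can be written in the form commonly used in the literature following Easley et al. (1996). The notation is an assumption of this sketch and is not taken from this page: $B_t$ and $S_t$ are the buys and sells on day $t$, $\alpha$ is the probability of an information event, $\delta$ the probability that the event is bad news, $\mu$ the arrival rate of informed trades, and $\varepsilon_b$, $\varepsilon_s$ the arrival rates of uninformed buys and sells.

$$
\begin{aligned}
L(B_t, S_t \mid \theta) ={}& (1-\alpha)\, e^{-\varepsilon_b}\frac{\varepsilon_b^{B_t}}{B_t!}\; e^{-\varepsilon_s}\frac{\varepsilon_s^{S_t}}{S_t!} \\
&+ \alpha\delta\, e^{-\varepsilon_b}\frac{\varepsilon_b^{B_t}}{B_t!}\; e^{-(\varepsilon_s+\mu)}\frac{(\varepsilon_s+\mu)^{S_t}}{S_t!} \\
&+ \alpha(1-\delta)\, e^{-(\varepsilon_b+\mu)}\frac{(\varepsilon_b+\mu)^{B_t}}{B_t!}\; e^{-\varepsilon_s}\frac{\varepsilon_s^{S_t}}{S_t!}, \\[4pt]
\mathrm{PIN} ={}& \frac{\alpha\mu}{\alpha\mu + \varepsilon_b + \varepsilon_s}.
\end{aligned}
$$

With daily trade counts in the thousands, the power and factorial terms exceed the range of double-precision numbers when evaluated directly, which is the floating-point problem that the factorizations are meant to avoid.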

As for the factorizations, the package includes three different
factorizations of the `PIN` likelihood function: `fact_pin_eho()` as in
Easley et al. (2010), `fact_pin_lk()` as in Lin and Ke (2011), and
`fact_pin_e()` as in Ersan (2016); one factorization of the `MPIN`
likelihood function: `fact_mpin()` as in Ersan (2016); and one factorization
of the `AdjPIN` likelihood function: `fact_adjpin()` as in
Ersan and Ghachem (2022b).
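
To illustrate why a factorized likelihood matters, the base-R sketch below evaluates the standard three-component PIN mixture in two ways: by multiplying raw Poisson densities, and by working in logs with a generic log-sum-exp step. It is not the `fact_pin_eho()`, `fact_pin_lk()`, or `fact_pin_e()` factorization. The function names, the parameter order `c(alpha, delta, mu, eps_b, eps_s)`, and the example numbers are all assumptions made for illustration.

```r
# Naive evaluation: product of raw Poisson densities (can underflow to 0).
pin_loglik_naive <- function(par, b, s) {
  alpha <- par[[1]]; delta <- par[[2]]; mu <- par[[3]]
  eb <- par[[4]]; es <- par[[5]]
  lik <- (1 - alpha) * dpois(b, eb) * dpois(s, es) +
    alpha * delta * dpois(b, eb) * dpois(s, es + mu) +
    alpha * (1 - delta) * dpois(b, eb + mu) * dpois(s, es)
  sum(log(lik))
}

# Stable evaluation: log of each mixture component, then log-sum-exp per day.
pin_loglik_stable <- function(par, b, s) {
  alpha <- par[[1]]; delta <- par[[2]]; mu <- par[[3]]
  eb <- par[[4]]; es <- par[[5]]
  comp <- cbind(
    log(1 - alpha) +
      dpois(b, eb, log = TRUE) + dpois(s, es, log = TRUE),
    log(alpha * delta) +
      dpois(b, eb, log = TRUE) + dpois(s, es + mu, log = TRUE),
    log(alpha * (1 - delta)) +
      dpois(b, eb + mu, log = TRUE) + dpois(s, es, log = TRUE)
  )
  # Factor out the largest component of each day before exponentiating.
  m <- apply(comp, 1, max)
  sum(m + log(rowSums(exp(comp - m))))
}

# Hypothetical data: three days with roughly 2000 trades per side.
b <- c(2000, 2800, 1990)
s <- c(2010, 1995, 2790)

# A poor initial guess, far from the data, as can occur during a grid search.
par_far <- c(0.3, 0.5, 100, 200, 200)

naive_far  <- pin_loglik_naive(par_far, b, s)   # every density underflows
stable_far <- pin_loglik_stable(par_far, b, s)  # remains a finite number
```

At parameter values far from the data the naive version returns `-Inf`, which gives an optimizer nothing to work with, while the log-domain version still returns a finite value that can be compared across initial parameter sets.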

The package implements three algorithms to generate initial parameter sets
for the MLE of the `PIN` model: `initials_pin_yz()` for the algorithm of
Yan and Zhang (2012), `initials_pin_gwj()` for the algorithm of
Gan et al. (2015), and `initials_pin_ea()` for the algorithm of
Ersan and Alici (2016). For the `MPIN` model, the function `initials_mpin()`
implements a multilayer extension of the algorithm of
Ersan and Alici (2016). Finally, three functions generate initial parameter
sets for the MLE of the `AdjPIN` model: `initials_adjpin()` for the
algorithm of Ersan and Ghachem (2022b), `initials_adjpin_cl()` for the
algorithm of Cheng and Lai (2021), and `initials_adjpin_rnd()` for randomly
generated initial parameter sets.
The initial parameter sets can be chosen either directly, using specific
functions implementing MLE for the `PIN` model, such as `pin_yz()`,
`pin_gwj()`, and `pin_ea()`; or through the argument `initialsets` in the
generic functions implementing MLE for the `MPIN` and `AdjPIN` models,
namely `mpin_ml()` and `adjpin()`.
In addition, the `PIN`, `MPIN`, and `AdjPIN` models can be estimated using
custom initial parameter set(s) provided by the user through the argument
`initialsets` of the functions `pin()`, `mpin_ml()`, and `adjpin()`.
Through the function `get_posteriors()`, the package also lets users
compute, for each day in the sample, the posterior probability that the day
is a no-information day, a good-information day, or a bad-information day.
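
The idea behind such posterior probabilities can be sketched with Bayes' rule applied to the three components of the standard PIN mixture. The base-R function below is an illustration only, not the implementation behind `get_posteriors()`. The name `pin_posteriors()`, the parameter order `c(alpha, delta, mu, eps_b, eps_s)`, the column names, and the example numbers are assumptions.

```r
# Posterior probability of each day type under the standard PIN mixture.
# delta is taken to be the probability that an information event is bad news.
pin_posteriors <- function(par, b, s) {
  alpha <- par[[1]]; delta <- par[[2]]; mu <- par[[3]]
  eb <- par[[4]]; es <- par[[5]]
  # Log of prior weight times Poisson likelihood, one column per day type.
  comp <- cbind(
    no_info = log(1 - alpha) +
      dpois(b, eb, log = TRUE) + dpois(s, es, log = TRUE),
    bad_info = log(alpha * delta) +
      dpois(b, eb, log = TRUE) + dpois(s, es + mu, log = TRUE),
    good_info = log(alpha * (1 - delta)) +
      dpois(b, eb + mu, log = TRUE) + dpois(s, es, log = TRUE)
  )
  # Normalize in a numerically safe way: subtract the row maximum first.
  w <- exp(comp - apply(comp, 1, max))
  w / rowSums(w)
}

# Hypothetical parameters and three days: balanced, buy-heavy, sell-heavy.
par <- c(0.4, 0.5, 50, 20, 20)
post <- pin_posteriors(par, b = c(20, 70, 21), s = c(19, 22, 72))
most_likely <- colnames(post)[max.col(post)]
```

Each row of `post` sums to one. In this example the balanced day is attributed to the no-information component, the buy-heavy day to good information, and the sell-heavy day to bad information.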

As an alternative to the standard maximum likelihood estimation,
estimation via the expectation-conditional maximization (`ECM`) algorithm
is suggested in Ghachem and Ersan (2022a), and is implemented through the
function `mpin_ecm()` for the `MPIN` model, and the function `adjpin()` for
the `AdjPIN` model.

Datasets of daily aggregated numbers of buys and sells, with a
user-determined number of information layers, can be simulated with the
function `generatedata_mpin()` for the `MPIN` (`PIN`) model, and
`generatedata_adjpin()` for the `AdjPIN` model. The output of these
functions contains the theoretical parameters used in the data generation
and the empirical parameters computed from the generated data, alongside
the generated data itself.
The data simulation functions allow for broad customization to produce data
that fit the user's preferences, so simulated data series can be used to
compare estimation methods across different scenarios. Alternatively, the
user can use two example datasets preloaded in the package: `dailytrades`,
representative of quarterly trade data with daily buys and sells; and
`hfdata`, a simulated high-frequency dataset comprising `100 000` trades.
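
As a rough picture of what simulating daily buys and sells involves, the base-R sketch below draws data from the single-layer PIN model only. It is not how `generatedata_mpin()` is implemented and it does not reproduce that function's output object. The function name `simulate_pin_days()`, its arguments, and the column names `type`, `B`, and `S` are assumptions.

```r
# Simulate daily buys (B) and sells (S) under the single-layer PIN model.
simulate_pin_days <- function(days, alpha, delta, mu, eps_b, eps_s) {
  info <- rbinom(days, 1, alpha)  # 1 if an information event occurs
  bad  <- rbinom(days, 1, delta)  # 1 if the event, when it occurs, is bad news
  # Day type: 0 = no information, 1 = good news, -1 = bad news.
  type <- ifelse(info == 0, 0, ifelse(bad == 1, -1, 1))
  data.frame(
    type = type,
    # Informed traders buy on good-news days and sell on bad-news days.
    B = rpois(days, eps_b + mu * (type == 1)),
    S = rpois(days, eps_s + mu * (type == -1))
  )
}

set.seed(1)
sim <- simulate_pin_days(days = 60, alpha = 0.4, delta = 0.5,
                         mu = 300, eps_b = 500, eps_s = 500)

# Theoretical versus empirical share of information days in the sample.
theoretical_alpha <- 0.4
empirical_alpha <- mean(sim$type != 0)
```

The gap between `theoretical_alpha` and `empirical_alpha` shows why it is useful to report both the parameters used to generate the data and those realized in the generated sample.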

Finally, the package provides two functions to deal with high-frequency
data. First, the function `vpin()` estimates, and provides detailed output
on, the order flow toxicity metric known as the volume-synchronized
probability of informed trading, as developed in Easley et al. (2011) and
Easley et al. (2012). Second, the function `aggregate_trades()` aggregates
high-frequency trade data into daily data using one of several trade
classification algorithms, namely the `tick` algorithm, the `quote`
algorithm, the `LR` algorithm (Lee and Ready 1991), and the `EMO` algorithm
(Ellis et al. 2000).
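
The aggregation step can be pictured with the simplest of these rules. The base-R sketch below applies a textbook tick rule: a trade is a buy if its price is above the previous price, a sell if below, and it inherits the previous classification if the price is unchanged. It then counts buys and sells per day. This is a simplified illustration under stated assumptions, not the code behind `aggregate_trades()`. The function names, the `day` and `price` inputs, and the sample prices are hypothetical, and the first trade of a day is compared with the last trade of the previous day.

```r
# Tick rule: +1 = buy, -1 = sell, NA = cannot be classified.
classify_tick <- function(price) {
  side <- c(NA, sign(diff(price)))  # the first trade has no predecessor
  for (i in seq_along(side)) {
    # Zero tick: reuse the classification of the previous trade.
    if (i > 1 && !is.na(side[i]) && side[i] == 0) side[i] <- side[i - 1]
  }
  side
}

# Count classified buys and sells for each day label.
aggregate_daily <- function(day, price) {
  side <- classify_tick(price)
  buys  <- tapply(side == 1, day, sum, na.rm = TRUE)
  sells <- tapply(side == -1, day, sum, na.rm = TRUE)
  data.frame(day = names(buys), B = as.vector(buys), S = as.vector(sells))
}

# Hypothetical trades over two days.
day   <- c("d1", "d1", "d1", "d1", "d2", "d2", "d2")
price <- c(10.0, 10.1, 10.1, 10.0, 10.0, 10.2, 10.1)
daily <- aggregate_daily(day, price)
```

The resulting data frame has one row per day with the number of buys `B` and sells `S`, which is the kind of daily input the `PIN`, `MPIN`, and `AdjPIN` estimation functions work with.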

The package provides fast, compact, and precise utilities to tackle the
sophisticated, error-prone, and time-consuming estimation of informed
trading, using only raw trade-level data.
Ghachem and Ersan (2022b) provides a comprehensive overview of the package:
it details the underlying theoretical background, thoroughly describes the
functions, and then uses them to tackle relevant research questions.

## Functions

adjpin estimates the adjusted probability of informed trading (`AdjPIN`) of the model of Duarte and Young (2009).

aggregate_trades aggregates the trading data per day using different trade classification algorithms.

detectlayers_e detects the number of information layers present in the trade-data using the algorithm in Ersan (2016).

detectlayers_eg detects the number of information layers present in the trade-data using the algorithm in Ersan and Ghachem (2022a).

detectlayers_ecm detects the number of information layers present in the trade-data using the expectation-conditional maximization algorithm in Ghachem and Ersan (2022a).

fact_adjpin returns the `AdjPIN` factorization of the likelihood function by Ersan and Ghachem (2022b) evaluated at the provided data and parameter sets.

fact_pin_e returns the `PIN` factorization of the likelihood function by Ersan (2016) evaluated at the provided data and parameter sets.

fact_pin_eho returns the `PIN` factorization of the likelihood function by Easley et al. (2010) evaluated at the provided data and parameter sets.

fact_pin_lk returns the `PIN` factorization of the likelihood function by Lin and Ke (2011) evaluated at the provided data and parameter sets.

fact_mpin returns the `MPIN` factorization of the likelihood function by Ersan (2016) evaluated at the provided data and parameter sets.

generatedata_adjpin generates a dataset object or a list of dataset objects generated according to the assumptions of the `AdjPIN` model.

generatedata_mpin generates a dataset object or a list of dataset objects generated according to the assumptions of the `MPIN` model.

get_posteriors computes, for each day in the sample, the posterior probabilities that it is a no-information day, good-information day and bad-information day respectively.

initials_adjpin generates the initial parameter sets for the `ML`/`ECM` estimation of the adjusted probability of informed trading using the algorithm of Ersan and Ghachem (2022b).

initials_adjpin_cl generates the initial parameter sets for the `ML`/`ECM` estimation of the adjusted probability of informed trading using an extension of the algorithm of Cheng and Lai (2021).

initials_adjpin_rnd generates random parameter sets for the estimation of the `AdjPIN` model.

initials_mpin generates initial parameter sets for the maximum likelihood estimation of the multilayer probability of informed trading (`MPIN`) using the Ersan (2016) generalization of the algorithm in Ersan and Alici (2016).

initials_pin_ea generates the initial parameter sets for the maximum likelihood estimation of the probability of informed trading (`PIN`) using the algorithm of Ersan and Alici (2016).

initials_pin_gwj generates the initial parameter set for the maximum likelihood estimation of the probability of informed trading (`PIN`) using the algorithm of Gan et al. (2015).

initials_pin_yz generates the initial parameter sets for the maximum likelihood estimation of the probability of informed trading (`PIN`) using the algorithm of Yan and Zhang (2012).

mpin_ecm estimates the multilayer probability of informed trading (`MPIN`) using the expectation-conditional maximization algorithm (`ECM`) as in Ghachem and Ersan (2022a).

mpin_ml estimates the multilayer probability of informed trading (`MPIN`) using layer detection algorithms in Ersan (2016), and Ersan and Ghachem (2022a); and standard maximum likelihood estimation.

pin estimates the probability of informed trading (`PIN`) using custom initial parameter set(s) provided by the user.

pin_bayes estimates the probability of informed trading (`PIN`) using the Bayesian approach in Griffin et al. (2021).

pin_ea estimates the probability of informed trading (`PIN`) using the initial parameter sets from the algorithm of Ersan and Alici (2016).

pin_gwj estimates the probability of informed trading (`PIN`) using the initial parameter set from the algorithm of Gan et al. (2015).

pin_yz estimates the probability of informed trading (`PIN`) using the initial parameter sets from the grid-search algorithm of Yan and Zhang (2012).

vpin estimates the volume-synchronized probability of informed trading (`VPIN`).

## Datasets

dailytrades A dataframe representative of quarterly (60 trading days) data of simulated daily buys and sells.

hfdata A dataframe containing simulated high-frequency trade-data on 100 000 timestamps with the variables `{timestamp, price, volume, bid, ask}`.

## Estimation results

estimate.adjpin-class The class `estimate.adjpin` stores the estimation results of the function `adjpin()`.

estimate.mpin-class The class `estimate.mpin` stores the estimation results of the `MPIN` model as estimated by the function `mpin_ml()`.

estimate.mpin.ecm-class The class `estimate.mpin.ecm` stores the estimation results of the `MPIN` model as estimated by the function `mpin_ecm()`.

estimate.pin-class The class `estimate.pin` stores the estimation results of the following `PIN` functions: `pin()`, `pin_yz()`, `pin_gwj()`, and `pin_ea()`.

estimate.vpin-class The class `estimate.vpin` stores the estimation results of the `VPIN` model using the function `vpin()`.

## Data simulation

dataset-class The class `dataset` stores the result of simulation of the aggregate daily trading data.

data.series-class The class `data.series` stores a list of `dataset` objects.

## References

Cheng T, Lai H (2021).
“Improvements in estimating the probability of informed trading models.”
*Quantitative Finance*, **21**(5), 771--796.

Duarte J, Young L (2009).
“Why is PIN priced?”
*Journal of Financial Economics*, **91**(2), 119--138.
ISSN 0304405X.

Easley D, Lopez de Prado MM, O'Hara M (2011).
“The microstructure of the "flash crash": flow toxicity, liquidity crashes, and the probability of informed trading.”
*The Journal of Portfolio Management*, **37**(2), 118--128.

Easley D, Hvidkjaer S, O'Hara M (2010).
“Factoring information into returns.”
*Journal of Financial and Quantitative Analysis*, **45**(2), 293--309.
ISSN 00221090.

Easley D, Kiefer NM, O'Hara M, Paperman JB (1996).
“Liquidity, information, and infrequently traded stocks.”
*Journal of Finance*, **51**(4), 1405--1436.
ISSN 00221082.

Easley D, Lopez de Prado MM, O'Hara M (2012).
“Flow toxicity and liquidity in a high-frequency world.”
*Review of Financial Studies*, **25**(5), 1457--1493.
ISSN 08939454.

Easley D, O'Hara M (1992).
“Time and the Process of Security Price Adjustment.”
*The Journal of Finance*, **47**(2), 577--605.
ISSN 15406261.

Ellis K, Michaely R, O'Hara M (2000).
“The Accuracy of Trade Classification Rules: Evidence from Nasdaq.”
*The Journal of Financial and Quantitative Analysis*, **35**(4), 529--551.

Ersan O (2016).
“Multilayer Probability of Informed Trading.”
*Available at SSRN 2874420*.

Ersan O, Alici A (2016).
“An unbiased computation methodology for estimating the probability of informed trading (PIN).”
*Journal of International Financial Markets, Institutions and Money*, **43**, 74--94.
ISSN 10424431.

Ersan O, Ghachem M (2022a).
“Identifying information types in probability of informed trading (PIN) models: An improved algorithm.”
*Available at SSRN 4117956*.

Ersan O, Ghachem M (2022b).
“A methodological approach to the computational problems in the estimation of adjusted PIN model.”
*Available at SSRN 4117954*.

Gan Q, Wei WC, Johnstone D (2015).
“A faster estimation method for the probability of informed trading using hierarchical agglomerative clustering.”
*Quantitative Finance*, **15**(11), 1805--1821.

Ghachem M, Ersan O (2022a).
“Estimation of the probability of informed trading models via an expectation-conditional maximization algorithm.”
*Available at SSRN 4117952*.

Ghachem M, Ersan O (2022b).
“PINstimation: An R package for estimating models of probability of informed trading.”
*Available at SSRN 4117946*.

Griffin J, Oberoi J, Oduro SD (2021).
“Estimating the probability of informed trading: A Bayesian approach.”
*Journal of Banking & Finance*, **125**, 106045.

Lee CMC, Ready MJ (1991).
“Inferring Trade Direction from Intraday Data.”
*The Journal of Finance*, **46**(2), 733--746.
ISSN 00221082, 15406261.

Lin H, Ke W (2011).
“A computing bias in estimating the probability of informed trading.”
*Journal of Financial Markets*, **14**(4), 625--640.
ISSN 1386-4181.

Yan Y, Zhang S (2012).
“An improved estimation method and empirical properties of the probability of informed trading.”
*Journal of Banking and Finance*, **36**(2), 454--467.
ISSN 03784266.

## Author

Montasser Ghachem montasser.ghachem@pinstimation.com

Department of Economics at Stockholm University, Stockholm, Sweden.

Oguz Ersan oguz.ersan@pinstimation.com

Department of International Trade and Finance at Kadir Has University,
Istanbul, Turkey.