PARma - PAR-CLIP data microRNA assignment

PAR-CLIP is a high-throughput method for identifying the binding sites of RNA-binding proteins. The protein of interest is immunoprecipitated (IP), and the RNA that was crosslinked to it prior to IP is deep-sequenced. PAR-CLIP has also been used extensively to determine microRNA target sites by isolating the microRNA-binding protein AGO. In an AGO-PAR-CLIP experiment, the identity of the microRNA responsible for a target site is not known a priori and must be inferred by matching microRNA seed sequences to the target site sequence, which is not a trivial task. PAR-CLIP data have several specific characteristics, most notably frequent T-to-C conversions that are indicative of crosslinking sites. We use these and other features to accurately determine the seed site.
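
As a minimal illustration of the seed-matching step described above, the following Python sketch searches a target site for the reverse complement of microRNA positions 2-8. It is not part of PARma: the function names, the choice of a 2-8 seed window, and the example target sequence are assumptions made for illustration, and the two microRNA sequences serve only as example inputs.

```python
# Hypothetical helper names; this is a naive baseline, not the PARma model.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_match_site(mirna, start=2, end=8):
    """Return the target-side sequence that pairs with microRNA
    positions start..end (1-based, inclusive), written 5' to 3'."""
    seed = mirna[start - 1:end]
    return seed.translate(COMPLEMENT)[::-1]


def find_seed_matches(target, mirnas):
    """Map each microRNA name to the 0-based offsets at which its
    seed match occurs in the target site sequence."""
    target = target.upper().replace("T", "U")
    hits = {}
    for name, seq in mirnas.items():
        site = seed_match_site(seq.upper().replace("T", "U"))
        offsets = [
            i for i in range(len(target) - len(site) + 1)
            if target.startswith(site, i)
        ]
        if offsets:
            hits[name] = offsets
    return hits


# Example inputs (assumed for illustration only).
mirnas = {
    "let-7a": "UGAGGUAGUAGGUUGUAUAGUU",
    "miR-21": "UAGCUUAUCAGACUGAUGUUGA",
}
target = "GGACTACCTCAAGATAAGCTTT"  # made-up cluster sequence
hits = find_seed_matches(target, mirnas)
```

In this made-up target, both microRNAs have a seed match, so exact matching alone cannot decide which one is responsible for the site. The sketch also ignores T-to-C conversions and the position of the match within the cluster.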

Our method, PARma, consists of two main components: a generative model that incorporates PAR-CLIP-specific features to compute likely seed sites, and the novel pattern discovery tool kmerExplain, which estimates seed activity probabilities based on the likelihood inferred by the model. KmerExplain can estimate seed activities without a predefined set of microRNAs, which allows active regulators to be detected in an unbiased way, but it can also incorporate prior probabilities for microRNA seeds. Both components, model estimation and kmerExplain, are applied iteratively until convergence.
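
The alternation between model estimation and seed activity estimation can be pictured with a toy EM-style loop. The sketch below is a heavily simplified stand-in under stated assumptions: it reproduces neither PARma's generative model nor kmerExplain, it reduces the "model" to a distribution over seed offsets within a cluster, and all names and parameters (`iterate_assignments`, `k`, `tol`, `max_iter`) are hypothetical.

```python
from collections import defaultdict


def iterate_assignments(clusters, k=7, max_iter=100, tol=1e-6):
    """Alternate between k-mer activities and an offset model until
    the activities change by less than tol (toy example only)."""
    # Every (k-mer, offset) pair in a cluster is a candidate seed site.
    candidates = [
        [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]
        for seq in clusters
    ]
    candidates = [cand for cand in candidates if cand]
    if not candidates:
        raise ValueError("no cluster is at least k nucleotides long")

    kmers = {km for cand in candidates for km, _ in cand}
    offsets = {off for cand in candidates for _, off in cand}
    # Start from uniform activities and a uniform offset model.
    activity = {km: 1.0 / len(kmers) for km in kmers}
    position = {off: 1.0 / len(offsets) for off in offsets}

    n = len(candidates)
    for _ in range(max_iter):
        new_activity = defaultdict(float)
        new_position = defaultdict(float)
        for cand in candidates:
            # Weight each candidate by current activity and offset model.
            weights = [activity[km] * position[off] for km, off in cand]
            total = sum(weights)
            for (km, off), w in zip(cand, weights):
                new_activity[km] += w / total
                new_position[off] += w / total
        # Re-estimate both components from the soft assignments.
        new_activity = {km: new_activity[km] / n for km in kmers}
        new_position = {off: new_position[off] / n for off in offsets}
        change = max(abs(new_activity[km] - activity[km]) for km in kmers)
        activity, position = new_activity, new_position
        if change < tol:
            break
    return activity, position


# Three made-up clusters sharing the 7-mer CTACCTC at offset 2.
clusters = ["AACTACCTCGG", "TTCTACCTCAA", "GGCTACCTCTT"]
activity, position = iterate_assignments(clusters)
```

In this toy input, the k-mer shared by all clusters gains activity over the iterations, which in turn sharpens the offset model around its position. Prior probabilities for known microRNA seeds could be mimicked by a non-uniform initial `activity`, although that too is an assumption of the sketch rather than a description of kmerExplain.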

The final PAR-CLIP model agrees with known binding mechanisms of microRNAs and with structural knowledge of AGO, and many active k-mers correspond to seeds of expressed microRNAs. The final seed assignment has two properties: each active seed sequence explains several clusters with high likelihood, and the seed positions match the model of PAR-CLIP data learned from all target sites. Based on evaluations using differential PAR-CLIP data from a publicly available dataset and from a new dataset, we show that PARma assigns seeds more accurately than existing approaches.