FINRA Customer Response

Auxiliary models sit upstream of all classifiers that have a configurable policy; consequently, they have no associated policy and cannot be configured by any user. These models feed features into the downstream classifiers.

Optical Character Recognizer

Model Description

Proofpoint does not maintain this model in-house; it uses engine 2 of the OCR.space service (http://ocr.space) to transcribe text in images. OCR.space does not disclose the methodology or structure of its algorithm, but the engine very likely employs machine learning.

Performance

Proofpoint maintains an internal validation set of images to assess the accuracy of the vendor’s algorithm. The model is given a series of 102 images containing various profane words in different forms, and Proofpoint measures whether the model transcribes each profane word correctly, so that the downstream text classifier can flag it. The accuracy of engine 2 was 88.7%.
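
The validation procedure described above can be sketched roughly as follows. This is an illustrative outline only, not Proofpoint's actual harness: the `ValidationImage` type, the `transcription_accuracy` function, and the matching rule (a case-insensitive substring check) are all assumptions made for the example, and the OCR call is passed in as a plain function so that no vendor API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class ValidationImage:
    path: str           # hypothetical location of the test image
    expected_word: str  # word the transcription must contain to count as correct


def transcription_accuracy(
    images: Iterable[ValidationImage],
    transcribe: Callable[[str], str],
) -> float:
    """Return the fraction of images whose transcription contains the expected word.

    Assumption: a transcription counts as correct when the expected word
    appears in it, ignoring case. The real scoring rule may differ.
    """
    images = list(images)
    if not images:
        raise ValueError("validation set is empty")
    hits = sum(
        1
        for image in images
        if image.expected_word.lower() in transcribe(image.path).lower()
    )
    return hits / len(images)
```

In practice, `transcribe` would wrap whatever call sends an image to the OCR engine and returns its text; for a dry run it can be any function that maps a path to a string.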

Part-of-Speech Tagger

Model Description

Proofpoint uses spaCy, an open-source NLP library. The part-of-speech tagger is one of the models included in the open-source distribution.
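
As a rough illustration of how part-of-speech tags can be read from a spaCy pipeline and handed on as features, consider the sketch below. The pipeline name `en_core_web_sm` and the helper names `pos_features` and `tag_text` are assumptions for the example; the source does not say which spaCy model or feature format is used.

```python
from typing import Iterable, List, Tuple


def pos_features(tokens: Iterable) -> List[Tuple[str, str]]:
    """Return (text, coarse POS tag) pairs for spaCy-style tokens.

    Works on any objects exposing `.text` and `.pos_`, which is the
    interface spaCy's Token provides.
    """
    return [(token.text, token.pos_) for token in tokens]


def tag_text(text: str, model_name: str = "en_core_web_sm") -> List[Tuple[str, str]]:
    """Tag `text` with a spaCy pipeline. The model name is an assumption."""
    import spacy  # imported lazily so pos_features can be used without spaCy

    nlp = spacy.load(model_name)  # requires the model to be installed locally
    return pos_features(nlp(text))
```

A production pipeline would load the model once and reuse it rather than reloading it on every call, as this sketch does for brevity.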

Performance

Explosion.ai, the developer of spaCy, maintains accuracy metrics for its models. The measured accuracy of the tagger is 97.4% on the OntoNotes 5 development set (https://spacy.io/usage/facts-figures#benchmarks).

Language Identification Detector

Model Description

Proofpoint uses an open-source language identification model built on the fastText framework.
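
For readers unfamiliar with fastText, a minimal sketch of querying a language identification model through its Python bindings might look like the following. The model file name `lid.176.ftz` (one of the publicly distributed fastText language identification models) and the helper names are assumptions; the source does not state which model file is deployed.

```python
from typing import Tuple

LABEL_PREFIX = "__label__"  # fastText prefixes every predicted label with this


def parse_label(label: str) -> str:
    """Strip the fastText label prefix, e.g. '__label__en' becomes 'en'."""
    return label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label


def detect_language(text: str, model_path: str = "lid.176.ftz") -> Tuple[str, float]:
    """Return (language code, probability) for the most likely language.

    Assumption: `model_path` points to a downloaded fastText language
    identification model.
    """
    import fasttext  # imported lazily so parse_label works without fastText

    model = fasttext.load_model(model_path)
    # predict() handles one line at a time, so newlines are replaced first.
    labels, probabilities = model.predict(text.replace("\n", " "), k=1)
    return parse_label(labels[0]), float(probabilities[0])
```

The returned language code is the kind of feature a downstream classifier could consume, for example to select language-specific rules.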

Performance

The maintainers of fastText produced the model and maintain its accuracy metrics. The measured accuracy is 92.7% on the Wikipedia language hold-out set (https://fasttext.cc/blog/2017/10/02/blog-post.html).