Tasks And Techniques

Text Techniques

Text Techniques are a set of methods applied to an entire section of unstructured text. Social Patrol uses these techniques to process social media content; they are explained below in order of increasing complexity.

Text Lexicon

This technique checks if any entries in a lexicon match any part of an unstructured piece of text.
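
As a minimal sketch of the idea (not Social Patrol's actual implementation), a lexicon check can be written as a substring search. The function name, the sample lexicon, and the choice to ignore case are all assumptions made for illustration.

```python
def text_lexicon_match(text, lexicon):
    """Return the lexicon entries that occur anywhere in the text.

    Matching is case-insensitive here; that is an assumption of this sketch.
    """
    lowered = text.lower()
    return [entry for entry in lexicon if entry.lower() in lowered]


# Hypothetical lexicon and message, for illustration only.
lexicon = ["guaranteed returns", "insider tip"]
hits = text_lexicon_match("Psst, I have an INSIDER TIP for you.", lexicon)
print(hits)  # ['insider tip']
```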

Text Regex

This technique applies a regular expression (i.e. regex) to unstructured text. To learn more than you ever wanted to know about regular expressions, please visit https://www.regular-expressions.info/.
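
A short sketch using Python's standard `re` module shows what applying a regex to raw text looks like. The pattern (a US-style Social Security number) and the function name are hypothetical examples, not rules taken from Social Patrol.

```python
import re

# Hypothetical pattern: a US-style Social Security number such as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def text_regex_match(text, pattern=SSN_PATTERN):
    """Return every substring of the text matched by the pattern."""
    return [m.group(0) for m in pattern.finditer(text)]


print(text_regex_match("My SSN is 123-45-6789, keep it safe."))  # ['123-45-6789']
```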

Text Proximity

This technique compares the character distance between two submatches (either by lexicon or by regex) and returns True if the distance is within the given threshold.
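
The sketch below illustrates one plausible reading of "character distance": the number of characters between the end of one submatch and the start of the other. That definition, the function name, and the sample patterns are assumptions; the real technique may measure distance differently.

```python
import re


def text_proximity(text, first_pattern, second_pattern, threshold):
    """Return True if any match of first_pattern and any match of
    second_pattern are separated by at most `threshold` characters."""
    firsts = [m.span() for m in re.finditer(first_pattern, text, re.IGNORECASE)]
    seconds = [m.span() for m in re.finditer(second_pattern, text, re.IGNORECASE)]
    for a_start, a_end in firsts:
        for b_start, b_end in seconds:
            # Gap between the two spans; overlapping spans count as distance 0.
            gap = max(b_start - a_end, a_start - b_end, 0)
            if gap <= threshold:
                return True
    return False


text = "My password for the old account is hunter2."
print(text_proximity(text, r"password", r"account", 15))  # True  (13 characters apart)
print(text_proximity(text, r"password", r"hunter2", 15))  # False (24 characters apart)
```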

Token Techniques

Token Techniques are a set of methods applied to individual tokens for flagging. Social Patrol uses these techniques to process social media content; they are explained below in order of increasing complexity.

Token Text

This technique matches character for character on the text of the token.

Token Rule: {"LOWER": "bank"}
Positive Example: This Bank, I swear, it is the worst!
Tokens Lower: this bank , i swear , it is the worst !
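
To make the example concrete, here is a small sketch of a `LOWER` match over a hand-written token list. The helper name is hypothetical, and the tokens are typed in by hand rather than produced by the Social Patrol tokenizer.

```python
def match_lower(tokens, value):
    """Return the indices of tokens whose lowercased text equals `value`."""
    return [i for i, tok in enumerate(tokens) if tok.lower() == value]


# Tokens as they might come out of a tokenizer (hand-written here).
tokens = ["This", "Bank", ",", "I", "swear", ",", "it", "is", "the", "worst", "!"]
print(match_lower(tokens, "bank"))  # [1]
```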

Token Lexicon

This technique checks if the text of the token is in a particular lexicon.

Token Rule: {"PHRASE": "first_name"}
Positive Example: Jerry caught the flu!
Tokens Phrase: first_name __ __ health_term __
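
The following sketch reproduces the `Tokens Phrase` line above by looking each token up in a set of named lexicons. The lexicon contents and the function name are hypothetical, and real lexicons would be far larger.

```python
# Hypothetical lexicons; real ones would be much larger.
LEXICONS = {
    "first_name": {"jerry", "maria", "sam"},
    "health_term": {"flu", "fever", "cough"},
}


def phrase_labels(tokens, lexicons=LEXICONS):
    """Map each token to the name of the first lexicon containing it, or "__"."""
    labels = []
    for tok in tokens:
        label = "__"
        for name, entries in lexicons.items():
            if tok.lower() in entries:
                label = name
                break
        labels.append(label)
    return labels


tokens = ["Jerry", "caught", "the", "flu", "!"]
labels = phrase_labels(tokens)
print(labels)  # ['first_name', '__', '__', 'health_term', '__']
print("first_name" in labels)  # True, so a {"PHRASE": "first_name"} rule would fire
```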

Token Regex

This technique uses a regular expression to match the text of a token.

Token Rule: {"LOWER": "password"}, {"LOWER": "is"}, {"REGEX": "\w{3,}"}
Positive Example: That password is a joke; my password is CatchMeIf42.
Tokens Orth: That password is a joke ; my password is CatchMeIf42
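
A rough sketch of how a multi-element rule like this could be evaluated: slide a window over the tokens and test each token against the corresponding rule element. The helper names are hypothetical, and requiring the regex to match the whole token (`re.fullmatch`) is an assumption of this sketch.

```python
import re


def token_matches(token, spec):
    """Check one token against one rule element such as {"LOWER": "is"}."""
    if "LOWER" in spec:
        return token.lower() == spec["LOWER"]
    if "REGEX" in spec:
        # fullmatch: the whole token must satisfy the expression (an assumption).
        return re.fullmatch(spec["REGEX"], token) is not None
    return False


def find_rule(tokens, rule):
    """Return the start index of every run of consecutive tokens matching the rule."""
    n = len(rule)
    return [
        start
        for start in range(len(tokens) - n + 1)
        if all(token_matches(tokens[start + k], rule[k]) for k in range(n))
    ]


rule = [{"LOWER": "password"}, {"LOWER": "is"}, {"REGEX": r"\w{3,}"}]
tokens = ["That", "password", "is", "a", "joke", ";",
          "my", "password", "is", "CatchMeIf42"]
print(find_rule(tokens, rule))  # [7]
```

The first "password is" is skipped because the following token, `a`, is shorter than three characters; only the second occurrence satisfies all three elements.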

Token Proximity

This technique returns a match if two token matches are within a certain token distance.

Token Rule (5 or fewer tokens between): {"LOWER": "guarantee"} {?} {?} {?} {?} {?} {"LOWER": "returns"}
Positive Example: I guarantee, no lie, great returns this quarter.
Tokens Orth: I guarantee , no lie , great returns this quarter .
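
The wildcard rule above can be read as "the second token follows the first with at most five tokens in between". The sketch below implements that reading directly; the function name and signature are hypothetical.

```python
def token_proximity(tokens, first, second, max_between):
    """Return True if `second` follows `first` with at most
    `max_between` tokens in between (both compared in lowercase)."""
    lowered = [t.lower() for t in tokens]
    first_positions = [i for i, t in enumerate(lowered) if t == first]
    second_positions = [i for i, t in enumerate(lowered) if t == second]
    return any(
        0 <= j - i - 1 <= max_between
        for i in first_positions
        for j in second_positions
    )


tokens = ["I", "guarantee", ",", "no", "lie", ",", "great", "returns",
          "this", "quarter", "."]
print(token_proximity(tokens, "guarantee", "returns", 5))  # True  (5 tokens between)
print(token_proximity(tokens, "guarantee", "returns", 4))  # False
```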

Machine Learning Techniques

Machine Learning (ML) algorithms use training data, whether labeled, unlabeled, or both, to accomplish tasks. Social Patrol uses a number of ML algorithms to process social media content; they are explained below.

Statistical Models

Statistical Models (SM) are a large category of algorithms that rely on statistical properties to leverage a smaller amount of labeled data for training. These models are well-suited when there is not enough labeled data for more complicated techniques (see below) or when the task is less challenging.

Convolutional Neural Networks

Convolutional Neural Networks (CNN) learn filters that detect meaningful patterns in the input data and combine the information to either flag content (classification) or extract information (named entity recognition in text, object detection in images). Because CNNs have a large number of learnable parameters, they require several thousand labeled data points to be successful.
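
To give a feel for what a "filter that detects a pattern" means, the toy sketch below slides a hand-picked one-dimensional filter over a sequence. It is only an illustration: in a real CNN the filter weights are learned from labeled data rather than chosen by hand, and the names here are hypothetical.

```python
def conv1d(signal, kernel):
    """Slide the kernel over the signal and return the weighted sum at each position."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


signal = [0, 0, 1, 1, 1, 0, 0]
edge_filter = [-1, 1]  # responds to a change between neighboring values
print(conv1d(signal, edge_filter))  # [0, 1, 0, 0, -1, 0]
```

The output is nonzero only where the signal rises or falls, which is the sense in which this filter "detects" edges.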

Natural Language Processing Tasks

Natural Language Processing (NLP) is an ever-growing field of study that focuses on deriving meaning from unstructured text. Social Patrol engages in several common NLP tasks to process social media content; they are explained below in order of increasing difficulty.

Classification

The vast majority of our NLP classifiers participate in this task as binary classifiers, flagging incoming content into one of two buckets: accept or reject.

Tokenization

Tokenization is the process of splitting unstructured text into smaller tokens with semantic meaning. This process enables the intuitive creation of rules as opposed to reliance on long, complicated regular expressions that run over unstructured text.

The most basic form is Whitespace Tokenization, where tokens are separated by whitespace characters. The following example shows the weaknesses in such a basic approach.

Call the bank, (604.555.9432) and give them the card: 4010 0101 1010 0110.

A whitespace tokenizer generates the following tokens:

`Call` `the` `bank,` `(604.555.9432)` `and` `give` `them` `the` `card:` `4010` `0101` `1010` `0110.`

The Social Patrol tokenizer, however, generates a more meaningful set of tokens:

`Call` `the` `bank` `,` `(` `604.555.9432` `)` `and` `give` `them` `the` `card` `:` `4010 0101 1010 0110` `.`

In the second set of tokens, notice how the phone number and the credit card number are single tokens and how two important words in the sentence (`bank` and `card`) are ready to be matched simply, without the obfuscating punctuation. The Social Patrol tokenizer recognizes and preserves entities and punctuation groups to yield better tokenization than basic approaches.
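
The contrast between the two token sets can be reproduced with a small regex-based sketch. This is not the Social Patrol tokenizer: the pattern only knows about the two entity shapes in this example, it emits punctuation one character at a time, and all names are hypothetical.

```python
import re

TEXT = "Call the bank, (604.555.9432) and give them the card: 4010 0101 1010 0110."

# Alternatives are tried in order, so entity shapes come before plain words.
TOKEN_PATTERN = re.compile(
    r"""
    \d{4}(?:\ \d{4}){3}      # card number: four groups of four digits
  | \d{3}\.\d{3}\.\d{4}      # phone number written as 604.555.9432
  | \w+                      # ordinary word
  | [^\w\s]                  # any single punctuation character
    """,
    re.VERBOSE,
)


def whitespace_tokenize(text):
    """Split on runs of whitespace only."""
    return text.split()


def entity_aware_tokenize(text):
    """Keep known entity shapes whole and split punctuation from words."""
    return TOKEN_PATTERN.findall(text)


print(whitespace_tokenize(TEXT))
print(entity_aware_tokenize(TEXT))
```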

Part of Speech Tagging

Part of Speech Tagging is the process of assigning a part of speech to individual tokens. Tagging takes on two granularities, coarse-grained (NOUN, VERB) and fine-grained (PROPER NOUN, TRANSITIVE VERB), and helps suppress false positives that would otherwise arise during token matching.

For example, consider the word `fire`. The Layoffs classifier is concerned with `fire` when it is a verb, while the Public Safety classifier is more interested in `fire` when it is a noun.

All text that is tokenized is also tagged for later use with token techniques.
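
The `fire` example can be sketched as a token match that also checks the tag. The tags below are assigned by hand purely for illustration (no tagger is run), and the helper name and sample sentences are hypothetical.

```python
# (token, coarse part-of-speech tag) pairs; tags are hand-assigned for illustration.
layoff_post = [("They", "PRON"), ("will", "AUX"), ("fire", "VERB"),
               ("the", "DET"), ("whole", "ADJ"), ("team", "NOUN")]
safety_post = [("The", "DET"), ("fire", "NOUN"), ("spread", "VERB"),
               ("quickly", "ADV")]


def has_token(tagged, lower, pos):
    """Return True if any token matches both the lowercased text and the tag."""
    return any(tok.lower() == lower and tag == pos for tok, tag in tagged)


print(has_token(layoff_post, "fire", "VERB"))  # True:  relevant to a layoffs rule
print(has_token(safety_post, "fire", "VERB"))  # False: same word, but used as a noun
print(has_token(safety_post, "fire", "NOUN"))  # True:  relevant to a public safety rule
```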

Entity Detection

Entity Detection, also referred to here as Entity Recognition (ER), is the process of identifying and extracting key information (entities) within text. Social Patrol detects many entities, including phone numbers, credit card numbers, email addresses, and stock tickers; ER is closely coupled with the tokenizer.

All text that is tokenized is also scanned for entities for later use with token techniques.
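
A simple way to picture entity scanning is a table of labeled patterns run over the text. The patterns below are deliberately loose, hypothetical examples (including the `$ACME`-style ticker convention); they are not Social Patrol's actual detectors.

```python
import re

# Hypothetical, deliberately simple patterns for three entity types.
ENTITY_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}[.-]\d{3}[.-]\d{4}\b"),
    "TICKER": re.compile(r"\$[A-Z]{1,5}\b"),
}


def detect_entities(text):
    """Return (label, matched_text) pairs ordered by position in the text."""
    found = []
    for label, pattern in ENTITY_PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), label, m.group(0)))
    return [(label, value) for _, label, value in sorted(found)]


text = "Email me at jo@example.com or call 604.555.9432 about $ACME."
print(detect_entities(text))
# [('EMAIL', 'jo@example.com'), ('PHONE', '604.555.9432'), ('TICKER', '$ACME')]
```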

Named Entity Recognition

Named Entity Recognition (NER) is a subset of ER focused on identifying and extracting entities that are "named" (i.e. are proper nouns).

NER is more difficult than ER because of the range of variability in proper nouns. When considering the "place" entity, `Bank of America Stadium` and `Joe's Burger on Main` are named entities while `stadium` and `restaurant` are entities.

The Physical Locations and the Mergers and Acquisitions classifiers are the only classifiers that engage in NER currently.

Sentiment Analysis

Sentiment Analysis (SA) is the process of determining the sentiment of text, ranging from negative to neutral to positive. Machine learning approaches have been very successful at this task.

The Customer Complaints and the FINRA Response Risks classifiers are the only classifiers that engage in SA currently.

Computer Vision Tasks

Computer Vision (CV) is a field devoted to understanding visual information. In recent years, the field has been dominated by Convolutional Neural Networks that have surfaced many breakthroughs in long-standing, difficult tasks.

Classification

All of our CV classifiers participate in this task as binary classifiers, flagging incoming content into one of two buckets: accept or reject.

Optical Character Recognition

Optical Character Recognition (OCR) is the process of extracting text from images. We use a third-party service (ocr.space) to perform this task on all images and selected frames from videos; the extracted text can then be sent through the text classifier suite.

Object Detection

Object Detection (OD) is the process of detecting the presence of certain objects in an image. The training process for this type of classifier is more involved than a straightforward image classifier.