Getting Started With Rockwell

Tutorial

This tutorial is intended for first-time users of Rockwell.

This tutorial guides the reader step-by-step through various features of Rockwell, and demonstrates how to quickly put them to use. It follows a specific use-case scenario: the required functionality is the automatic alerting about selected major events in the life of a hi-tech company as reported in the financial news (i.e. fundraising, acquisition, merger, and IPO). It uses a selection from the collection of TechCrunch news articles publicly available here.

The code used in this tutorial is available in the Rockwell GitHub repository as a project titled Playground.

Introduction

 

Let’s assume that we wish to receive alerts every time a hi-tech company goes through a major event: fundraising, acquisition, merger or IPO. Our application will monitor financial news streams, analyze the articles, and fire alerts when such an event is identified. To accomplish this we need to develop an NLP application that can carry out this identification. The following tutorial demonstrates how to do that using Rockwell.

 

This tutorial assumes that you have cloned or downloaded the Rockwell repository from GitHub: https://github.com/NahumKorda/Rockwell-NLP.

 

Before we start building this application we need to collect a representative data sample. In this tutorial we shall use 4,102 TechCrunch articles stored in this file: https://github.com/NahumKorda/Rockwell-NLP/blob/master/Playground/src/main/resources/data/techcrunch.

 

Term And Phrase Extraction

The first step is to analyze and better understand the specific terminology used in the data sample. This is accomplished using the VocabularyExtractor class in Rockwell. This class provides three extraction methods:

  • Single word extraction outputs a list of single words sorted descending by their frequency of occurrence. The words are extracted in lower case, and include inflected forms as they occur in the text.

  • Lemma extraction also outputs a list of single words sorted descending by their frequency of occurrence, only inflections are grouped together by their canonical form (lemma).

  • Phrase extraction outputs a list of multi-word phrases sorted descending by their frequency of occurrence. The constituents of the phrases are lemmas.

The VocabularyExtractor class is instantiated by passing an instance of the java.util.Properties class containing a configuration specific to each of the methods above.

 

Single Word Extraction

Let’s start with the single word extraction. The following code instantiates the VocabularyExtractor for the single word extraction:

String filePath = System.getProperty("user.home") + "/code/Rockwell-NLP/Playground/src/main/resources/data/techcrunch";

 

int threshold = 1000;

String exclusions = Exclusions.STOPWORDS.name() + "," + Exclusions.CONTRACTIONS.name() + "," + Exclusions.SYMBOLS.name() + "," + Exclusions.DIGITS.name();

 

Properties properties = new Properties();

properties.put(PropertyFields.TASK.getField(), VocabularyExtractor.Tasks.EXTRACT_SINGLE_WORDS.name());

properties.put(PropertyFields.EXCLUSIONS.getField(), exclusions);

properties.put(PropertyFields.THRESHOLD.getField(), Integer.toString(threshold));

 

this.extractor = new VocabularyExtractor(properties);

ArrayList<String> lines = TextFileReader.read(filePath);

for (String line : lines) {

    this.extractor.process(line);

}

 

this.extractor.print(false);

This code is available here (you can try it if you cloned or copied the Rockwell repository): https://github.com/NahumKorda/Rockwell-NLP/blob/master/Playground/src/main/java/com/itcag/rockwell/playground/SingleWordExtractionTester.java.

 

Threshold is the minimum frequency of occurrence for a single word to be included in the output.

 

Exclusions specify which word classes must be excluded from the output. The options are the following:

  • Stopwords are listed in a text file and include generic word classes (conjunctions, determiners, pronouns, etc.),

  • Contractions are created by contracting two words into one (“can’t”, “don’t”, “it’s”, etc.),

  • Symbols are single character words with specific meaning (“&”, “$”, “%”, etc.),

  • Digits are words consisting only of numbers.
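To make the roles of the threshold and the exclusions more tangible, here is a minimal plain-Java sketch of frequency counting. It does not use Rockwell: the class WordFrequencySketch, the toy stopword list and the way each exclusion is detected are assumptions made only for illustration, and Rockwell's own implementation may well differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; this is not Rockwell's VocabularyExtractor.
class WordFrequencySketch {

    // Toy stopword list (assumption); Rockwell reads its stopwords from a text file.
    static final Set<String> STOPWORDS = Set.of("the", "a", "of", "in", "and", "its");

    // Counts lower-cased words, applies the exclusions, keeps words that reach
    // the threshold, and sorts them descending by frequency.
    static List<Map.Entry<String, Integer>> count(List<String> lines, int threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (word.isEmpty()) continue;
                if (STOPWORDS.contains(word)) continue;                        // stopwords
                if (word.contains("'") || word.contains("\u2019")) continue;   // contractions
                if (word.matches("\\d+")) continue;                            // digits
                if (word.length() == 1
                        && !Character.isLetterOrDigit(word.charAt(0))) continue; // symbols
                counts.merge(word, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= threshold) {
                result.add(Map.entry(e.getKey(), e.getValue()));
            }
        }
        // Descending by count; ties are broken alphabetically.
        result.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "the company raised funding in 2015",
                "the company can't disclose its funding & valuation",
                "a startup and the company");
        for (Map.Entry<String, Integer> e : count(lines, 2)) {
            System.out.println(e.getValue() + " " + e.getKey());
        }
    }
}
```

With a threshold of 2, only “company” (3 occurrences) and “funding” (2 occurrences) survive in this toy sample; all other words are either excluded or too rare.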

 

As a result, the following list is printed (truncated to twenty items):

  • ☆ 9718 company

  • ☆ 6494 million

  • ☆ 6431 app

  • ☆ 5623 users

  • ☆ 4598 says

  • ☆ 4459 mobile

  • ☆ 4256 data

  • ☆ 3871 startup

  • ☆ 3598 companies

  • ☆ 3456 facebook

  • ☆ 3305 service

  • ☆ 3290 funding

  • ☆ 3162 today

  • ☆ 3143 platform

  • ☆ 3048 business

  • ☆ 2958 apps

  • ☆ 2828 use

  • ☆ 2697 social

  • ☆ 2655 google

  • ☆ 2505 market

 

This list is not very informative. Besides, it is unnecessary to distinguish inflections (“company” and “companies”, “app” and “apps”). The latter issue can be resolved by using the lemma extraction method instead.
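The effect of grouping inflections can be sketched in plain Java. The class LemmaGroupingSketch and its two-entry lemma dictionary are assumptions made for illustration; Rockwell uses its own lemmatizer, and its real output depends on how each token is analyzed in context. The sketch merely adds up the counts from the list above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not Rockwell's lemmatizer.
class LemmaGroupingSketch {

    // Toy dictionary mapping inflected forms to their lemmas (assumption).
    static final Map<String, String> LEMMAS = Map.of(
            "companies", "company",
            "apps", "app");

    // Sums the counts of all word forms that share the same lemma.
    static Map<String, Integer> group(Map<String, Integer> wordCounts) {
        Map<String, Integer> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : wordCounts.entrySet()) {
            String lemma = LEMMAS.getOrDefault(e.getKey(), e.getKey());
            grouped.merge(lemma, e.getValue(), Integer::sum);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Counts copied from the single word extraction output above.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("company", 9718);
        counts.put("app", 6431);
        counts.put("companies", 3598);
        counts.put("apps", 2958);
        group(counts).forEach((lemma, n) -> System.out.println(n + " " + lemma));
    }
}
```

In this sketch “company” and “companies” collapse into a single item with 13316 occurrences, and “app” and “apps” into one with 9389.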

 

Lemma Extraction

However, beyond grouping the inflections into a single item we shouldn’t expect more exciting results. To gain better insight into how these words are used in sentences, we can print out example sentences for each lemma. Moreover, with regard to our specific task, most of the extracted words are not of real interest. The lemma extraction method provides a filter that outputs only the specified lemmas. This filter enables us to see how often the lemmas of interest occur in the text, and how they are used in sentences.

 

The following code instantiates the VocabularyExtractor for the lemma extraction using the filter and printing out the sentences:

String filePath = System.getProperty("user.home") + "/code/Rockwell-NLP/Playground/src/main/resources/data/techcrunch";

 

int threshold = 100;

String exclusions = Exclusions.STOPWORDS.name() + "," + Exclusions.CONTRACTIONS.name() + "," + Exclusions.SYMBOLS.name() + "," + Exclusions.DIGITS.name();

 

String positive = "fund,funding,invest,investment";

 

Properties properties = new Properties();

properties.put(PropertyFields.TASK.getField(), VocabularyExtractor.Tasks.EXTRACT_LEMMAS.name());

properties.put(PropertyFields.EXCLUSIONS.getField(), exclusions);

properties.put(PropertyFields.THRESHOLD.getField(), Integer.toString(threshold));

properties.put(PropertyFields.POSITIVE_FILTER.getField(), positive);

this.extractor = new VocabularyExtractor(properties);

 

ArrayList<String> lines = TextFileReader.read(filePath);

for (String line : lines) {

    try {

        this.extractor.process(line);

    } catch (Exception ex) {

        Printer.print(ex.getMessage());

    }

}

 

this.extractor.print(true);

This code is available here (you can try it if you cloned or copied the Rockwell repository): https://github.com/NahumKorda/Rockwell-NLP/blob/master/Playground/src/main/java/com/itcag/rockwell/playground/LemmaExtractionTester.java.

 

Note the try/catch block around the processing of each line. This block is necessary because the text is split, tokenized and lemmatized, and each of these three operations can throw an exception if the text cannot or should not be processed (for example, if the text is not in the Latin alphabet).

 

Also note that we lowered the threshold to 100. The terms of interest don’t necessarily occur very often in the sample data, but nonetheless we want to inspect them.

 

The term “funding” can be both a noun and a verb (as a gerund). The lemma of “funding” as a noun is “funding”, but as a verb it is “fund”. This is why it is necessary to include both forms in the filter. As a result of this ambiguity, the output will include overlapping sentences.

As a result, the following output is printed (sentences truncated to three):

  • ☆ 4261 fund

    • Follow-on funding tends to be a good sign that the incubator itself isn't the only one who sees value in the nascent business.

    • "But with some reality checks sinking in, it will be far more tough for Flipkart and others to raise fresh funding at the valuation they would like.

    • That sounds like a lot of money for one product, but Altitude is also announcing that it has raised $17.5 million in new funding.

  • ☆ 3290 funding

    • Rockets of Awesome is backed by $7 million in seed funding from General Catalyst, Forerunner Ventures and LAUNCH.

    • Civic, a team of 14 in Palo Alto, is backed by $2.75 million in seed funding from investors including Social Leverage, Founder Collective, Ashton Kutcher and Guy Oseary's Sound Ventures, Propel Venture Partners, Pantera Capital, Blockchain Capital, Block Chain Space and Digital Currency Group.

    • To do so, the startup announced today that it is taking on $2.6 million in seed funding from Jeff Clavier at SoftTechVC, 500 Startups, Burrill & Co, among others.

  • ☆ 1373 investment

    • Yes, there's been a big jump in the amount of debt financing and a less-than-expected increase in VC investment over the last year.

    • The second of these will be Alibaba making a bigger overall investment into SingPost.

    • Without help, it will be a big cost for startups in Cuba, where the average GDP per capita is $6,051.22, compared to $53,041 in the U.S. On the other hand, even a $500 fee is less expensive than the investment a company would typically have to make to go through this process before Stripe launched Atlas.

  • ☆ 707 invest

    • Alipay, An Alibaba Group affiliate, makes it easy for consumers to purchase goods online, while Alipay's microfinance site Yu'e Bao ("leftover treasure" in Chinese), which launched in June 2013, lets users to invest tiny amounts of money - as little as one yuan (about 17 cents) - into a money market fund.

    • "Qualcomm's investing a lot in things like machine learning, the ability to run all of the sensors in your smartphone all of the time.

    • In total, there were 116 investors in the round, 70 percent of which invested purely online through a product AngelList announced earlier this year.

  • ☆ 231 invested

    • Chamath Palihapitiya, founder and managing partner, Social+Capital Partnership said in a release of the new investment: "When we first invested in Yammer, it was because the company reminded us so much of the early days at Facebook.

    • Investors in ClassDojo's seed round include Y Combinator co-founder Paul Graham (who invested personally, not through YC), Ron Conway, Jeff Clavier, Kapor Capital, Start Fund, General Catalyst, Morado Ventures, Lerer Ventures, NewSchools Ventures, Learn Capital, along quite a few angel investors (including Flixter CEO Joe Greenstein and OpenFeint Founder Jason Citron).

    • "We have some of the highest engagement in mobile social apps, and they saw what kind of a team we had and how we were able to make products and they invested in that, " Chalasani added, discussing how this round came together.

  • ☆ 198 investing

    • Something important to gun owners investing in safety tech like locks or safes is the ability to quickly access their firearm when they need to.

    • Many of them are investing because our mission resonates with their personal experiences, and all of them care deeply about what we're doing.

    • But, when a friend or family member's primary reason for investing is to see a return, things can become problematic.

  • ☆ 108 funded

    • Startups MUST be funded beyond early stage.

    • What Now Travel said the pair will be launching a jointly branded app later this month - called The London Official City Guide App (and clad in its official red brand colourings) - which will replace the current official city guide app made by the London tourist board's promotional organisation, London & Partners (itself funded by the Mayor of London, plus a network of commercial partners).

    • Funded through a crowd-sale, more than 9,000 transactions were made to acquire Ethereum from buyers consisting mainly of software developers, says Lubin.

 

This output indeed contains more useful information than obtained by the single word extraction, and we can start noticing certain regularities (e.g., “backed by $xxxx million in seed funding”). The frequency of occurrence also indicates that the selected terms are relevant. However, in order to identify the regularities in the way these terms are used, we need to extract phrases containing them.

 

Phrase Extraction

The following code instantiates the VocabularyExtractor for the phrase extraction:

String filePath = System.getProperty("user.home") + "/code/Rockwell-NLP/Playground/src/main/resources/data/techcrunch";

 

int max = 6;

int min = 0;

 

String exclusions = Exclusions.STOPPHRASES.name();

int threshold = 10;

 

String positive = "investment, invest, funding, funded";

String negative = null;

String required = "investment, invest, funding, funded";

 

Properties properties = new Properties();

properties.put(PropertyFields.TASK.getField(), VocabularyExtractor.Tasks.EXTRACT_PHRASES.name());

properties.put(PropertyFields.MIN_PHRASE_LENGTH.getField(), Integer.toString(min));

properties.put(PropertyFields.MAX_PHRASE_LENGTH.getField(), Integer.toString(max));

 

properties.put(PropertyFields.EXCLUSIONS.getField(), exclusions);

properties.put(PropertyFields.THRESHOLD.getField(), Integer.toString(threshold));

 

properties.put(PropertyFields.POSITIVE_FILTER.getField(), positive);

if (negative != null) properties.put(PropertyFields.NEGATIVE_FILTER.getField(), negative); // java.util.Properties does not accept null values

properties.put(PropertyFields.REQUIRED_FILTER.getField(), required);

 

this.extractor = new VocabularyExtractor(properties);

 

ArrayList<String> lines = TextFileReader.read(filePath);

for (String line : lines) {

    try {

        this.extractor.process(line);

    } catch (Exception ex) {

        Printer.print(ex.getMessage());

    }

}

 

this.extractor.print(true);

This code is available here (you can try it if you cloned or copied the Rockwell repository): https://github.com/NahumKorda/Rockwell-NLP/blob/master/Playground/src/main/java/com/itcag/rockwell/playground/PhraseExtractionTester.java

 

Now we are excluding stop phrases, rather than stopwords. Stop phrases are also stored in a text file.

 

Phrase extraction accepts three filters:

  1. Positive filter contains strings that a sentence must contain in order to be considered at all. Sentences that don’t contain any of the positive filter strings are ignored.

  2. Negative filter contains strings that cause sentences to be ignored, even if they contain positive filter strings.

  3. Required filter contains strings that an extracted phrase must contain in order to be included in the output. Positive filter and required filter are often identical.

Note that filters can also contain stems (in contrast to the lemma extraction filter that must contain valid lemmas only). For example, “acquir” in the positive filter would include sentences containing “acquire”, “acquires”, “acquired”, “acquiring”.
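The interplay of the three filters can be sketched as follows. This is plain Java, not Rockwell code: the class PhraseFilterSketch and its methods are hypothetical, and the use of case-insensitive substring matching is an assumption suggested by the fact that the filters accept stems.

```java
import java.util.List;

// Hypothetical illustration only; Rockwell's filter implementation may differ.
class PhraseFilterSketch {

    // True if the text contains at least one of the filter strings.
    // Substring matching is assumed here, so that stems such as "acquir" work.
    static boolean containsAny(String text, List<String> filter) {
        String lower = text.toLowerCase();
        for (String item : filter) {
            if (lower.contains(item.toLowerCase())) return true;
        }
        return false;
    }

    // A sentence is considered only if it matches the positive filter
    // and does not match the negative filter.
    static boolean acceptSentence(String sentence, List<String> positive, List<String> negative) {
        return containsAny(sentence, positive) && !containsAny(sentence, negative);
    }

    // An extracted phrase is output only if it matches the required filter.
    static boolean acceptPhrase(String phrase, List<String> required) {
        return containsAny(phrase, required);
    }

    public static void main(String[] args) {
        List<String> positive = List.of("acquir");
        List<String> negative = List.of("rumor");
        List<String> required = List.of("acquir");

        System.out.println(acceptSentence("Google acquires a startup", positive, negative));
        System.out.println(acceptSentence("Rumors say Google is acquiring a startup", positive, negative));
        System.out.println(acceptSentence("Google launches a phone", positive, negative));
        System.out.println(acceptPhrase("acquire a startup", required));
        System.out.println(acceptPhrase("a startup", required));
    }
}
```

The first two filters operate on whole sentences, while the required filter operates on the phrases extracted from the sentences that were let through.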

Values min and max determine the minimum and maximum length of a phrase in tokens. Note that the Saxon genitive, contractions, currencies and percentages are expanded into separate tokens when the sentence is tokenized. For example, “Jane’s book” consists of three tokens, since “’s” is a separate token.

 

Also note that even if the min value is specified as 0 or 1, the shortest phrases in the output still consist of two tokens. If single tokens need to be extracted, single word extraction or lemma extraction must be used.
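The following plain-Java sketch illustrates how the min and max values bound the candidate phrases, including the two-token lower limit described above. The class PhraseWindowSketch is hypothetical and works on surface tokens, whereas Rockwell builds its phrases from lemmas; how Rockwell enumerates candidates internally is not documented here, so this is only one plausible reading.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Rockwell's phrase extractor.
class PhraseWindowSketch {

    // Enumerates all contiguous token sequences whose length lies between
    // min and max. A min of 0 or 1 is raised to 2, since single tokens
    // are never output as phrases.
    static List<String> phrases(List<String> tokens, int min, int max) {
        int effectiveMin = Math.max(min, 2);
        List<String> result = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int len = effectiveMin; len <= max && start + len <= tokens.size(); len++) {
                result.add(String.join(" ", tokens.subList(start, start + len)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "Jane's book" is tokenized into three tokens: "Jane", "'s", "book".
        List<String> tokens = List.of("Jane", "'s", "book");
        for (String phrase : phrases(tokens, 0, 6)) {
            System.out.println(phrase);
        }
    }
}
```

For the three tokens of “Jane’s book” the sketch produces three candidates: “Jane 's”, “Jane 's book” and “'s book”.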

As a result, the following output is printed (a handful of selected items, with sentences truncated to three):

  • ☆ 709 million in funding

    • Today, the company has raised another $2 million in funding to continue to grow its company, whose app now reaches 2 million smartphone users.

    • Appistry, a St. Louis, MO-based developer of cloud solutions, has raised $12 million in funding in a Series D round led by private equity firm eXome Capital.

    • It has raised a total of $10.8 million in funding, including a $6 million Series B round from returning investor Battery Ventures last December.

  • ☆ 415 million in seed funding

    • it, is announcing that it has raised $4.3 million in seed funding from Metamorphic Ventures, Founder Collective, Tekton Ventures, Western Technology Investment, Stanford-StartX Fund (SpotOn.

    • Cleanly has raised $2.3 million in seed funding.

    • In its case, the company has raised $1.5 million in seed funding for its "hyper-local" chatting app, which controversially grew popular with school kids, leading to several bannings due to cyberbullying, drawing national headlines.

  • ☆ 172 million in new funding

    • -based company said it raised $15 million in new funding from seed investor Khosla Ventures and new investor Otter Capital.

    • Lolly Wolly Doodle has since gone on to raise $20 million in new funding from Steve Case's Revolution Growth fund, FirstMark Capital, Highline Venture Partners, and Novel TMT Ventures - move that the Soldsie team believes validates what they're building in terms of Facebook-based commerce.

    • But it seems to have bounced back, and it's accelerating those efforts with $7 million in new funding.

  • ☆ 101 venture funding

    • The company secured more than $25 million in venture funding, but it still wasn't enough.

    • This week the startup added to its coffers, announcing that it has raised an additional $6.3 million in venture funding.

    • Smart home automation startup Zuli has just raised $1.65 million in venture funding following a successful Kickstarter campaign.

  • ☆ 81 additional funding

    • The company is announcing that it has raised $1 million in additional funding from backers including Chopra and former IBM executive Doug Maine, bringing its Series A to $2.1 million total.

    • We had heard reports they had raised additional funding last year, but we were never able to confirm this.

    • The company has also just raised another $500,000 in additional funding from existing investors including Lars Rasmussen, Google Maps creator and currently Facebook's Director of Engineering, as well as Silicon Valley angel investor Bill Tai.

  • ☆ 78 million in venture funding

    • In less than a year, Freshdesk has already raised $6 million in venture funding from Tiger Global and Accel, and, though it believes that the biggest market opportunity down the road will be in offering its brand of cloud customer support to the enterprise, Freshdesk wants to entice (and give back to) the little guys as well.

    • The company secured more than $25 million in venture funding, but it still wasn't enough.

    • The San Francisco-headquartered company, which has raised $76 million in venture funding to date, previously bought Foundog in 2010, HelloWorld in 2014, and Chronos Mobile Technologies just last year.

 

Several repeating patterns are easy to identify in this output, for example “raise xxx million in funding”. This is sufficient to start developing the Rockwell expressions for classifying news articles into the first of the four predefined categories: fundraising. We shall come back to this phrase extraction output later in this tutorial, and consult it extensively while developing the Rockwell expressions.

 

Classification

Rockwell expressions are stored in a plain text file available here: https://github.com/NahumKorda/Rockwell-NLP/blob/master/Playground/src/main/resources/script/expressions.

 

Let’s start with a simple expression that can identify a phrase such as “company raised $1.2 million in seed funding”. First we need to see what this sentence looks like after lemmatization. For that we shall use the Pipeline class. This class provides several NLP methods, which we shall meet later in this tutorial. Currently, we shall use the tokenize and lemmatize methods. The following is the code that instantiates Pipeline for lemmatization:

String test = "company raised $1.2 million in seed funding";

 

Properties properties = new Properties();

properties.put(PropertyFields.TASK.getField(), Pipeline.Tasks.LEMMATIZE.name());

 

Pipeline pipeline = new Pipeline(properties);

 

ArrayList<Token> tokens = pipeline.lemmatize(pipeline.tokenize(new StringBuilder(test)));

 

TokenPrinter.printTokensWithPOS(tokens);

This code is available here (you can try it if you cloned or copied the Rockwell repository): https://github.com/NahumKorda/Rockwell-NLP/blob/master/Playground/src/main/java/com/itcag/rockwell/playground/LemmatizerTester.java.

 

As a result, we receive the following:

[NN1]company raised [XZ2]$ [CRD]1.2 million [PRP]in seed funding

Note that “$1.2” created two tokens: “[XZ2]$” and “[CRD]1.2”. “XZ2” signals that the token is either a currency name (e.g., “Dollar”) or a currency symbol (e.g., “$”). “CRD” signals that the token is a cardinal number either in words (e.g., “eleven”) or in digits (e.g., “11”).
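The splitting of an amount into a currency token and a cardinal token can be imitated with a regular expression. The sketch below is plain Java and not Rockwell's tokenizer: the class CurrencySplitSketch, the set of recognized currency symbols and the regular expression itself are assumptions, and only the two tags discussed above are produced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not Rockwell's tokenizer.
class CurrencySplitSketch {

    // A currency symbol ($, euro or pound sign) immediately followed by a number.
    private static final Pattern AMOUNT =
            Pattern.compile("([$\u20AC\u00A3])(\\d+(?:\\.\\d+)?)");

    // Splits a word such as "$1.2" into a currency token and a cardinal token;
    // any other word is returned unchanged as a single token.
    static List<String> tag(String word) {
        List<String> tokens = new ArrayList<>();
        Matcher m = AMOUNT.matcher(word);
        if (m.matches()) {
            tokens.add("[XZ2]" + m.group(1));
            tokens.add("[CRD]" + m.group(2));
        } else {
            tokens.add(word);
        }
        return tokens;
    }

    // Applies tag() to every whitespace-separated word of a sentence.
    static List<String> tagSentence(String sentence) {
        List<String> out = new ArrayList<>();
        for (String word : sentence.split(" ")) {
            out.addAll(tag(word));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ",
                tagSentence("company raised $1.2 million in seed funding")));
    }
}
```

Because the amount becomes two tokens, an expression that is meant to match it needs two consecutive elements, one per token.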

 

Accordingly, we can now formulate a simple Rockwell expression:

@lemma :raise ; @pos :XZ2 ; @pos :CRD ; @cain :million ; @cain :in ; @cain :funding | funding

Rockwell expressions consist of the instruction part and the tag, separated by the pipe character (“|”). Tag can be any string, and has a meaning only in the context of the application that uses the expression. In our case, the tag signals that the expression should identify a funding event.

 

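The instruction consists of several elements separated by semicolons (“;”). Each element consists of an aspect, preceded by an “at” character (“@”), and a value, preceded by a colon (“:”). The aspect specifies which aspect of a token is to be matched, and the value specifies what should be matched in that aspect. The following is an explanation of the instruction elements above:

  • “@lemma :raise” matches any token whose lemma is “raise” (e.g., “raise”, “raises”, “raised”, “raising”),

  • “@pos :XZ2” matches any token tagged as “XZ2” (a currency name or symbol),

  • “@pos :CRD” matches any token tagged as “CRD” (a cardinal number),

  • “@cain :million”, “@cain :in” and “@cain :funding” match the tokens “million”, “in” and “funding” regardless of case.

Accordingly, the elements “@pos :XZ2 ; @pos :CRD ; @cain :million” stand for any amount of millions of Dollars (e.g., “$1.2 million”).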
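The syntax described above (a pipe separating the instruction from the tag, semicolons separating the elements, “@” introducing the aspect and “:” introducing the value) can be illustrated with a small parser. This is a plain-Java sketch, not Rockwell's parser: the class ExpressionParserSketch and its Element type are hypothetical, and multi-aspect elements and modalities are deliberately not interpreted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Rockwell's expression parser.
class ExpressionParserSketch {

    // One instruction element: an aspect and the value to match in that aspect.
    static final class Element {
        final String aspect;
        final String value;

        Element(String aspect, String value) {
            this.aspect = aspect;
            this.value = value;
        }

        @Override
        public String toString() {
            return "@" + aspect + " :" + value;
        }
    }

    // Returns the tag, i.e. everything after the last pipe character.
    static String tagOf(String expression) {
        int pipe = expression.lastIndexOf('|');
        if (pipe < 0) throw new IllegalArgumentException("Missing tag: " + expression);
        return expression.substring(pipe + 1).trim();
    }

    // Splits the instruction part into its elements.
    static List<Element> elementsOf(String expression) {
        int pipe = expression.lastIndexOf('|');
        if (pipe < 0) throw new IllegalArgumentException("Missing tag: " + expression);
        List<Element> elements = new ArrayList<>();
        for (String part : expression.substring(0, pipe).split(";")) {
            String trimmed = part.trim();
            int colon = trimmed.indexOf(':');
            if (!trimmed.startsWith("@") || colon < 0) {
                throw new IllegalArgumentException("Malformed element: " + trimmed);
            }
            String aspect = trimmed.substring(1, colon).trim(); // drop the leading '@'
            String value = trimmed.substring(colon + 1).trim();
            elements.add(new Element(aspect, value));
        }
        return elements;
    }

    public static void main(String[] args) {
        String expression = "@lemma :raise ; @pos :XZ2 ; @pos :CRD ; "
                + "@cain :million ; @cain :in ; @cain :funding | funding";
        System.out.println("tag: " + tagOf(expression));
        for (Element element : elementsOf(expression)) {
            System.out.println(element.aspect + " -> " + element.value);
        }
    }
}
```

For the expression above the sketch yields the tag “funding” and six elements, the first of which has the aspect “lemma” and the value “raise”.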

 

Next, let’s try this expression on our data sample. We shall again use the Pipeline class, but now we shall instantiate it for classification:

String filePath = System.getProperty("user.home") + "/code/Rockwell-NLP/Playground/src/main/resources/data/techcrunch";

 

String expressions = System.getProperty("user.home") + "/code/Rockwell-NLP/Playground/src/main/resources/script/expressions";

 

Properties properties = new Properties();

properties.put(PropertyFields.TASK.getField(), Pipeline.Tasks.CLASSIFY.name());

properties.put(PropertyFields.EXPRESSIONS.getField(), expressions);

 

try {

    

    Pipeline pipeline = new Pipeline(properties);

 

    for (String line : TextFileReader.read(filePath)) {

 

        try {

            

            if (TextToolbox.isEmpty(line)) continue;

 

            for (StringBuilder sentence : pipeline.split(line)) {

                

                ArrayList<Token> tokens = pipeline.lemmatize(pipeline.tokenize(sentence));

                ArrayList<Tag> tags = pipeline.classify(tokens);

                tags.forEach((tag) -> {

                    Printer.print(tag.getTag() + "\t" + sentence);

                });

            

            }

 

        } catch (Exception ex) {

            Printer.printException(ex);

        }

 

    }

 

} catch (Exception ex) {

    Printer.printException(ex);

}

This code is available here (you can try it if you cloned or copied the Rockwell repository): https://github.com/NahumKorda/Rockwell-NLP/blob/master/Playground/src/main/java/com/itcag/rockwell/playground/ClassificationSampler.java.

 

As a result, the following output is printed (the first ten sentences):

  • funding AlleyNYC, a co-working space founded 2012, has raised $16 million in funding led by Vandewater Capital Holdings, with participation from Entrepreneur Media.

  • funding Stripe has raised $280 million in funding and is valued at $5 billion, and Atlas is an example of how the company is looking both to widen the funnel for the number of businesses that use its payments platform, as well as diversify the kinds of revenue-generating products that it can offer to them.

  • funding A new startup promises to help mobile developers spend their ad dollars more effectively, and it has raised $3.5 million in funding.

  • funding After graduating from the startup incubator H-Farm, Depop's founder Simon Beckerman also raised €1 million in funding from Balderton Capital and Holtzbrinck Ventures.

  • funding It has raised $66 million in funding and has filed for a $120 million IPO.

  • funding Another milestone for online publishing company SAY Media, just 10 days after it had announced that Time Magazine publisher Kim Kelleher would be coming on board as president in September: today it confirmed that it has raised $27 million in funding, which it will use to make acquisitions and enhance its publishing platform.

  • funding Twilio has to date raised $33.7 million in funding from an A-list of backers including Besssemer Venture Partners, Union Square Ventures and Dave McClure.

  • funding AppDynamics, which specializes in application performance management (APM) solutions for high-maintenance Web apps, has raised $20 million in funding in a Series C round led by Kleiner Perkins Caufield & Byers.

  • funding That might not seem like a big round for a company that has already raised $30 million in funding, but a spokesperson insisted that Appia isn't raising money because it needed more cash.

  • funding Appistry, a St Louis, MO-based developer of cloud solutions, has raised $12 million in funding in a Series D round led by private equity firm eXome Capital.

 

Instead of printing out sentences in which the expression was identified, we can count how many sentences were tagged by the “funding” tag, using this code: https://github.com/NahumKorda/Rockwell-NLP/blob/master/Playground/src/main/java/com/itcag/rockwell/playground/ClassificationTester.java.

The result is: 256.

 

This is not bad for a start, but now we need to go back to the output of the phrase extraction. One thing that immediately stands out is the variety of funding types: seed funding, venture funding, new funding, additional funding, and so on. We could therefore create an expression for each of these types. These expressions would be very similar to the one above, only each would have an additional element specifying the type. For example:

@lemma :raise ; @pos :XZ2 ; @pos :CRD ; @cain :million ; @cain :in ; @cain :seed ; @cain :funding | funding

However, this is not how this is done in Rockwell. Instead we shall use patterns.

 

A pattern is also a Rockwell expression, only it is not executed unless it is called from another expression. The way to call a pattern is by applying multi-aspect specifications. For example:

@cain+prefix :funding+*funding_type* | funding

The first aspect “@cain” requires the value “funding” to be matched case-insensitively. The additional aspect “+prefix” requires whatever precedes the matched token to match a pattern whose tag is “*funding_type*”. The additional aspect is specified as “prefix”, since the pattern is expected to precede the token matching “funding”. This token is referred to as the “anchor”, since a pattern is “anchored” to it. Patterns can also be executed as suffixes (following an anchor) and infixes (preceding an anchor that is preceded by some other token).

 

Let’s define a few such patterns:

@cain :seed | *funding_type*

@cain :total | *funding_type*

@cain :additional | *funding_type*

@cain :new | *funding_type*

@cain :venture | *funding_type*

@cain :vc | *funding_type*

@cain :angel | *funding_type*

We shall store these patterns in a plain text file here: https://github.com/NahumKorda/Rockwell-NLP/blob/master/Playground/src/main/resources/script/patterns.

 

A good practice is to enclose pattern tags between two asterisks (“*”), in order to easily distinguish them from the classification expressions.

 

Accordingly, the expression above would match all of the following: “seed funding”, “total funding”, “additional funding”, etc. Let’s apply this to our classification expression:

@lemma :raise ; @pos :XZ2 ; @pos :CRD ; @cain :million ; @cain :in ; @cain+infix{*} :funding+*funding_type* | funding

Note that the pattern aspect is specified as “infix”. This is because the anchor “funding” is preceded by “in”. Also, note the modality “{*}”. It signals that the pattern is optional. Accordingly, both of the following phrases would be matched:

  • “company has raised $2 million in funding”

  • “company has raised $2 million in seed funding”
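The behavior of the optional infix pattern can be imitated in plain Java. The class OptionalPatternSketch below is hypothetical and covers only the last two elements of the expression (“in” followed by an optionally prefixed “funding”); it is a sketch of the matching idea under that assumption, not of how Rockwell evaluates expressions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not Rockwell's matching engine.
class OptionalPatternSketch {

    // The values of the *funding_type* patterns defined above.
    static final Set<String> FUNDING_TYPES =
            Set.of("seed", "total", "additional", "new", "venture", "vc", "angel");

    // True if the tokens contain "in", then optionally one funding type,
    // then the anchor "funding".
    static boolean matchesInFunding(List<String> tokens) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equalsIgnoreCase("in")) continue;
            int next = i + 1;
            // The pattern is optional: skip over a funding type only if one is present.
            if (next < tokens.size() && FUNDING_TYPES.contains(tokens.get(next).toLowerCase())) {
                next++;
            }
            if (next < tokens.size() && tokens.get(next).equalsIgnoreCase("funding")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] samples = {
            "company has raised $ 2 million in funding",
            "company has raised $ 2 million in seed funding",
            "company has $ 2 million in the bank"
        };
        for (String sample : samples) {
            System.out.println(matchesInFunding(Arrays.asList(sample.split(" "))) + "\t" + sample);
        }
    }
}
```

The first two samples match, with and without a funding type between “in” and “funding”, while the third does not.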

 

Now we need to modify the code by adding the pattern location to the configuration:

String filePath = System.getProperty("user.home") + "/code/Rockwell-NLP/Playground/src/main/resources/data/techcrunch";

 

String patterns = System.getProperty("user.home") + "/code/Rockwell-NLP/Playground/src/main/resources/script/patterns";

String concepts = System.getProperty("user.home") + "/code/Rockwell-NLP/Playground/src/main/resources/script/concepts";

String expressions = System.getProperty("user.home") + "/code/Rockwell-NLP/Playground/src/main/resources/script/expressions";

Properties properties = new Properties();

properties.put(PropertyFields.TASK.getField(), Pipeline.Tasks.CLASSIFY.name());

properties.put(PropertyFields.PATTERNS.getField(), patterns);

properties.put(PropertyFields.EXPRESSIONS.getField(), expressions);

 

try {

    

    Pipeline pipeline = new Pipeline(properties);

 

    for (String line : TextFileReader.read(filePath)) {

 

        try {

            

            if (TextToolbox.isEmpty(line)) continue;

 

            for (StringBuilder sentence : pipeline.split(line)) {

                

                ArrayList<Token> tokens = pipeline.lemmatize(pipeline.tokenize(sentence));

                ArrayList<Tag> tags = pipeline.classify(tokens);

                tags.forEach((tag) -> {

                    Printer.print(tag.getTag() + "\t" + sentence);

                });

            

            }

 

        } catch (Exception ex) {

            Printer.printException(ex);

        }

 

    }

 

} catch (Exception ex) {

    Printer.printException(ex);

}

If we count the matched sentences now we get: 728. This is almost a three-fold improvement. Nonetheless, we can do even better. Let’s go back to the output of the phrase extraction.

 

Another thing that stands out is that there are many ways to state that a company obtained funding. For example, in addition to raising funding, a company can “announce funding”, “disclose funding”, “confirm funding”, “secure funding”, etc. Again, we could create multiple classification expressions, one for each such case, but again this is not how it is done in Rockwell. For this purpose Rockwell features concept identification.

 

Concept Identification

Concept identification is similar to the classification in one aspect: it also applies Rockwell expressions. However, unlike classification, concept identification does not return tags of the matched expressions, but rather replaces matched tokens with a so-called semtoken (semantic token).
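The idea of replacing matched tokens with a single semantic token can be sketched in plain Java. The class SemtokenSketch and the textual form it gives to a semtoken are assumptions made for illustration; Rockwell represents tokens and semtokens with its own classes and identifies concepts with expressions rather than with fixed phrases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not Rockwell's concept identification.
class SemtokenSketch {

    // Replaces every occurrence of the phrase with one combined token
    // that carries the concept as its semantic role.
    static List<String> insertConcept(List<String> tokens, List<String> phrase, String concept) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (matchesAt(tokens, i, phrase)) {
                String covered = String.join(" ", tokens.subList(i, i + phrase.size()));
                result.add("(" + covered + " {" + concept + "})");
                i += phrase.size();
            } else {
                result.add(tokens.get(i));
                i++;
            }
        }
        return result;
    }

    // True if the phrase occurs in the tokens starting at the given index.
    static boolean matchesAt(List<String> tokens, int index, List<String> phrase) {
        if (phrase.isEmpty() || index + phrase.size() > tokens.size()) return false;
        for (int j = 0; j < phrase.size(); j++) {
            if (!tokens.get(index + j).equalsIgnoreCase(phrase.get(j))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("company is backed by $ 2 million in funding".split(" "));
        List<String> replaced = insertConcept(tokens, List.of("backed", "by"), "_funding_action_");
        System.out.println(String.join(" ", replaced));
    }
}
```

After the replacement, the two tokens “backed” and “by” occupy a single position in the token list, so a later expression can refer to the concept instead of the individual words.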

 

Semantic Tokens

Let’s take, for example, the following sentence: “company is backed by $2 million in funding”. We wish to modify our classification expression, so that it can also identify this sentence. We need first to define an expression that identifies both “raised” and “backed by” as having the same meaning in the given context (they both express the fact that a company received funding):

@lemma :raise | _funding_action_

@cain :backed ; @cain :by | _funding_action_

We can further extend these expressions with additional phrases featuring the same meaning in the given context:

@lemma :raise | _funding_action_

@lemma :disclose | _funding_action_

@lemma :announce | _funding_action_

@lemma :confirm | _funding_action_

@lemma :secure | _funding_action_

@lemma :nab | _funding_action_

@lemma :grab | _funding_action_

@lemma :land | _funding_action_

@lemma :get | _funding_action_

@lemma :receive | _funding_action_

@lemma :close | _funding_action_

@lemma :collect | _funding_action_

@lemma :take | _funding_action_

@lemma :take ; @cain :on | _funding_action_

@cain :backed ; @cain :by | _funding_action_

@lemma :pick ; @cain :up | _funding_action_

@lemma :pull ; @cain :in | _funding_action_

We shall store these patterns in a plain text file here: https://github.com/NahumKorda/Rockwell-NLP/blob/master/Playground/src/main/resources/script/concepts.

 

It is a good practice to enclose concepts within two underscore characters (“_”), in order to distinguish them from both patterns and classification expressions.

 

Next, let’s process the sentence above using the Pipeline class instantiated for concept identification:

String test = "company is backed by $2 million in funding";

 

String patterns = "/home/nahum/code/Rockwell-NLP/Playground/src/main/resources/script/patterns";

 

String concepts = "/home/nahum/code/Rockwell-NLP/Playground/src/main/resources/script/concepts";

 

Properties properties = new Properties();

properties.put(PropertyFields.TASK.getField(), Pipeline.Tasks.INSERT_CONCEPTS.name());

properties.put(PropertyFields.PATTERNS.getField(), patterns);

properties.put(PropertyFields.CONCEPTS.getField(), concepts);

 

Pipeline pipeline = new Pipeline(properties);

 

for (ArrayList<Token> tokens : pipeline.insertConcepts(test)) {

    TokenPrinter.printTokensWithPOSAndRole(tokens);

}

This code is available here (you can try it if you cloned or copied the Rockwell repository): https://github.com/NahumKorda/Rockwell-NLP/blob/master/Playground/src/main/java/com/itcag/rockwell/playground/InsertionTester.java.

 

As a result, we get the following:

[NN1]company [VBZ]is (backed [PRP]by {_funding_action_ [2->3]}) [XZ2]$ [CRD]2 million [PRP]in funding

Note that “backed by” is now treated as a single semtoken (it is enclosed in parentheses), and that the semantic role of that semtoken is specified as “_funding_action_”. Next, let’s process the other sentence: “company raised $1.2 million in seed funding”. As a result, we get this:

[NN1]company (raised {_funding_action_ [1->1]}) [XZ2]$ [CRD]1.2 million [PRP]in funding

Clearly, both “backed by” and “raised” feature the same semantic role: “_funding_action_”. This enables us to modify the classification expression like this:

@role :_funding_action_ ; @pos :XZ2 ; @pos :CRD ; @cain :million ; @cain :in ; @cain+infix{*} :funding+*funding_type* | funding

Note that the aspect that matches the semantic role of a semtoken is “role”.
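The idea behind semtokens can also be illustrated outside Rockwell. The following is a minimal sketch in plain Java, not Rockwell’s implementation: the class “SemtokenSketch” and its method “insertConcepts” are hypothetical, matching is done on lowercased surface words rather than on lemmas, and only a handful of the phrases listed above are included.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SemtokenSketch {

    // Hypothetical phrase table: each word sequence maps to a concept label.
    // Multi-word phrases are listed first so that they take precedence over single words.
    private static final Map<List<String>, String> CONCEPTS = new LinkedHashMap<>();
    static {
        CONCEPTS.put(Arrays.asList("backed", "by"), "_funding_action_");
        CONCEPTS.put(Arrays.asList("picked", "up"), "_funding_action_");
        CONCEPTS.put(Arrays.asList("raised"), "_funding_action_");
        CONCEPTS.put(Arrays.asList("secured"), "_funding_action_");
    }

    // Replaces every matched word sequence with its concept label.
    static List<String> insertConcepts(String sentence) {
        List<String> words = Arrays.asList(sentence.toLowerCase().split("\\s+"));
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < words.size()) {
            String concept = null;
            int length = 0;
            for (Map.Entry<List<String>, String> entry : CONCEPTS.entrySet()) {
                List<String> phrase = entry.getKey();
                int end = i + phrase.size();
                if (end <= words.size() && words.subList(i, end).equals(phrase)) {
                    concept = entry.getValue();
                    length = phrase.size();
                    break;
                }
            }
            if (concept != null) {
                result.add(concept);   // the whole phrase collapses into one token
                i += length;
            } else {
                result.add(words.get(i));
                i++;
            }
        }
        return result;
    }
}
```

With this sketch, both “company is backed by $2 million in funding” and “company raised $1.2 million in seed funding” end up with the same “_funding_action_” token in the position of the verb phrase, which is what allows a single expression to cover both.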

 

At this point one may ask why it is necessary to use concept identification here, rather than define another set of patterns. The complete answer to this question is given below, but only after we make yet another modification to our classification expression.

 

If we count the matched sentences now, we get: 1003. Nonetheless, there is still room for improvement. We must go back to the output of the phrase extraction.

 

Semtokens And Patterns Combined

What can we do with sentences such as: “company has raised a total of $20.6 million in funding”, or “company has raised a whopping $50 million in funding”? Clearly, the currency amount can be preceded by an adjective or a noun phrase. However, these phrases are rather specific, and are used only in the monetary context. Therefore, we can define them as additional patterns:

@cain :another | *amount_qualifier*

@cain :a ; @cain :further | *amount_qualifier*

@cain :an ; @cain :additional | *amount_qualifier*

@cain :a ; @cain :substantial | *amount_qualifier*

@cain :a ; @cain :total ; @cain :of | *amount_qualifier*

@cain :a ; @cain :whopping | *amount_qualifier*

@cain :close ; @cain :to | *amount_qualifier*

@cain :just ; @cain :over | *amount_qualifier*

@cain :less ; @cain :than | *amount_qualifier*

@cain :more ; @cain :than | *amount_qualifier*

Now we can improve our classification expression like this:

@role :_funding_action_ ; @pos+infix{*} :XZ2+*amount_qualifier* ; @pos :CRD ; @cain :million ; @cain :in ; @cain+infix{*} :funding+*funding_type* | funding

Note that “*amount_qualifier*” is also specified as optional.

 

The count of the matched sentences is now 1374.
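For readers who think in regular expressions, the effect of the optional “*amount_qualifier*” pattern can be approximated as shown below. This is only an analogy written with java.util.regex: it is not how Rockwell evaluates expressions, it works on raw text rather than on tokens, the class name is hypothetical, and the action verbs and the optional word before “funding” are simplified stand-ins for the “_funding_action_” concept and the “*funding_type*” pattern.

```java
import java.util.regex.Pattern;

class OptionalQualifierSketch {

    // Alternation block standing in for the "_funding_action_" concept (a small subset).
    private static final String ACTION = "(?:raised|secured|closed|backed by)";

    // Alternation block standing in for the "*amount_qualifier*" pattern.
    private static final String QUALIFIER =
            "(?:another|a further|an additional|a substantial|a total of|a whopping"
            + "|close to|just over|less than|more than)";

    // The "?" after the qualifier group makes the qualifier optional,
    // so sentences with and without it are matched by the same expression.
    private static final Pattern FUNDING = Pattern.compile(
            ACTION + " (?:" + QUALIFIER + " )?\\$\\d+(?:\\.\\d+)? million in (?:\\w+ )?funding");

    static boolean isFunding(String sentence) {
        return FUNDING.matcher(sentence.toLowerCase()).find();
    }
}
```

Even in this reduced form the regular expression is already hard to read, and every new verb or qualifier makes it longer.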

 

Now we can go back to the question above: why did we resort to the concept identification instead of using the patterns?

 

Let’s assume that instead of the concept “_funding_action_” we indeed preferred defining a collection of patterns tagged “*funding_action*”. Accordingly, our classification expression would be defined as follows:

@pos+prefix :XZ2+*funding_action* ; @pos :CRD ; @cain :million ; @cain :in ; @cain+infix{*} :funding+*funding_type* | funding

But how could we also use the pattern “*amount_qualifier*” now that we are already prefixing a pattern to the first token? In some cases it would be possible to use the recursive nature of patterns to combine multiple patterns into one, but a recursive solution is impossible in this particular case, and the entire subject of recursion is beyond the scope of this “getting started” tutorial.

 

Nonetheless, this case raises a more general question: in which cases is it appropriate to apply patterns, and in which cases can they not, or at least should they not, be applied? To answer this question, we need to look at yet another use of patterns.

 

The purpose of this tutorial is purely didactic, and we have no intention of providing a comprehensive classification system for financial news here. Nonetheless, we need to take a look at another of the four classification categories: IPO.

 

We leave running the phrase extraction to the reader as an exercise. It will soon become evident that initial public offerings are reported in the news using only a handful of recurring phrases. One of them is: “company filed for [initial] public offering”. Accordingly, this would make a good combination of patterns, concepts and a classification expression:

Note that concepts can also use patterns, if you specify them in the configuration.

 

Also note that the patterns do not use the part of speech specification “DT0” (the generic determiner part of speech) that would cover both “@lemma :a” and “@cain :the”. The problem here is that “DT0” also includes “no”, which would negate the entire statement.

 

When you try the expression above you’ll find that it also matches the following: a company “filed for IPO in 2011”. Such historic information is of no interest to us, since we are developing an application that raises alerts about the most recent events. The kind of statement that interests us is rather something like this: company “filed for an IPO just last night”. Therefore, we can define a few suitable patterns that express the recentness of an event:

@cain+prefix{*} :now+adverb | *recently*

@cain+prefix{*} :today+adverb | *recently*

@cain+prefix{*} :this+adverb ; @cain :afternoon | *recently*

@cain+prefix{*} :this+adverb ; @cain :morning | *recently*

@cain+prefix{*} :yesterday+adverb | *recently*

@cain+prefix{*} :this+adverb ; @cain :week | *recently*

@cain+prefix{*} :last+adverb ; @cain :night | *recently*

Next we can modify the classification expression:

@cain :filed ; @cain :for ; @role+suffix :_ipo_+*recently* | ipo

This expression indeed identifies only recent events. However, it also misses a sentence like this one: company “filed for its highly anticipated initial public offering this afternoon”. The problem with such a sentence is that it features an adjective phrase (“highly anticipated”) preceding the “_ipo_” concept (“initial public offering”).

 

Rockwell provides a collection of built-in patterns that identify adverbial, adjective and noun phrases. We can make use of the built-in adjective phrase patterns here, and reformulate the expression:

@role :_ipo_action_ ; @role+infix{*}+suffix :_ipo_+adjective+*recently* | ipo

@role :_ipo_action_ ; @lemma :a ; @role+infix{*}+suffix :_ipo_+adjective+*recently* | ipo

@role :_ipo_action_ ; @lemma :its ; @role+infix{*}+suffix :_ipo_+adjective+*recently* | ipo

Note the use of two patterns on the same anchor: the IPO concept anchors both an infix (an adjective phrase) and a suffix (expressions of recentness). However, the two patterns cannot be of the same type (e.g., you cannot anchor two infixes to the same token).

 

Also, note that we dropped the determiner pattern. This pattern was an infix anchored to the IPO concept, but now we are anchoring an adjective phrase as an infix to that concept. As a result, instead of one expression we now have three almost identical ones!

 

Considering that a comprehensive classification system would be expected to feature a very large number of expressions, it would be convenient to somehow replace three almost identical expressions with only one, and thus simplify the future maintenance. The simplest way to accomplish this is to extend the determiner pattern with the adjective phrase:

@lemma+suffix{*} :a+adjective | *extended_determiner*

@cain+suffix{*} :the+adjective | *extended_determiner*

@pos+suffix{*} :PNS+adjective | *extended_determiner*

Note that a pattern can anchor another pattern.

 

And now we can replace three expressions with just a single one:

@role :_ipo_action_ ; @role+infix{*}+suffix :_ipo_+*extended_determiner*+*recently* | ipo

 

Semtokens vs. Patterns

Now we can revisit the question that was raised earlier: in which cases is it appropriate to apply patterns, and in which cases can they not, or at least should they not, be applied? This question can be rephrased: how do we decide whether to use semtokens or patterns?

 

First, we must note that semtokens and patterns serve different purposes.

 

Semtokens identify word constructs that carry a specific meaning in the specific context of the application. At one extreme are idioms: word constructs whose meaning cannot be inferred from the individual words (e.g., “to beat around the bush” has nothing to do with either beating or bushes). At the other extreme are ad hoc word patterns that recur only in a specific domain, but are rarely used otherwise (e.g., “file for” referring to an IPO). Semtokens identify such cases of interest, and assign them an operative meaning.

 

Patterns, on the other hand, are merely a simplification mechanism. If a noun can be preceded by “a”, “the”, “this”, etc., it would be awkward to create a separate expression for each case. This would become completely impractical if an expression contained two or more nouns, each of which can be preceded by any of the determiners. Even the solutions offered by modern regular expression implementations (i.e. alternation blocks) become awkward when the set of alternatives is either large or complex (try to define a generic noun phrase using regular expressions). This is where patterns come in handy: a relatively simple mechanism can define a very large number of very complex lexical and syntactic constructs.
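To make the scale of the problem concrete, the sketch below enumerates the separate expressions that would be needed if every determiner had to be spelled out. It is plain Java and unrelated to Rockwell’s internals; the “<det>” slot notation, the template and the class name are all hypothetical and serve only this illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ExpansionSketch {

    private static final String SLOT = "<det>";

    // Expands every "<det>" slot in the template with each determiner in turn.
    // With d determiners and n slots the result contains d^n variants.
    static List<String> expand(String template, List<String> determiners) {
        List<String> results = new ArrayList<>();
        results.add(template);
        // All partial results carry the same number of open slots, so checking the first is enough.
        while (!results.isEmpty() && results.get(0).contains(SLOT)) {
            List<String> next = new ArrayList<>();
            for (String partial : results) {
                int slot = partial.indexOf(SLOT);
                for (String determiner : determiners) {
                    next.add(partial.substring(0, slot) + determiner
                            + partial.substring(slot + SLOT.length()));
                }
            }
            results = next;
        }
        return results;
    }
}
```

Calling “expand” on a template with two slots, such as “<det> company filed for <det> offering”, with the three determiners “a”, “the” and “this” yields nine variants, and a third noun would raise that to twenty-seven.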

 

This distinction between the roles played by semtokens and patterns may seem merely conceptual at first glance, but it is in fact also very technical. Semtokens are executed like expressions: serially. Patterns, however, are recursive: a pattern can call another pattern, which can call another, and so on. This makes patterns the preferred choice for handling both complexity and diversity. Moreover, patterns can be optional, while semtokens cannot.

 

On the other hand, patterns must be anchored. You cannot define a pattern as the primary or only aspect of an instruction element. Patterns are called on demand from a matched anchor. They are an auxiliary mechanism, not a primary element. Semtokens, by contrast, are primary elements, and they are referenced in expressions just like any other simple token.

 

Summing it up, the best practice is to approach the development of Rockwell expressions in the following way:

  1. Identify recurring patterns of interest. For example: “[company] filed for IPO”.

  2. Identify lexical variations. Lexical variations introduce alternative wording with identical meaning in the given context. For example: “IPO”, “public offering” and “initial public offering” are all used with the same meaning in the sample data.

  3. Develop expressions that replace the lexical variations with semtokens, and use them for concept identification.

  4. Identify stylistic variations. Stylistic variations add adverbial and adjective phrases that specify or intensify the meaning of the pattern. For example, “this afternoon” provides a temporal coordinate to the “filing”, while “highly anticipated” intensifies the importance of the “IPO”.

  5. Generalize stylistic variations to the highest possible level. For example, “highly anticipated” could be replaced by any adjective phrase. If you can generalize them as adverbial, adjective or noun phrases, use the predefined patterns. Otherwise, devise your own patterns.

  6. Try hard to devise reusable patterns - patterns whose use is not restricted to a single expression, but can be applied to a wide variety of expressions.

 

Information Extraction

Now that we have an application that can alert us about some major corporate events reported in the news, you may wish to go one step further, and automatically extract some data about these events. For example, you may decide that not every funding event is of interest to you, but only if the funding amount exceeds $10 million. Accordingly, you need to extract the funding amount from the text, convert it to a number, and validate it against a threshold.

 

Let’s start with our classification expression:

@role :_funding_action_ ; @pos+infix{*} :XZ2+*amount_qualifier* ; @pos :CRD ; @cain :million ; @cain :in ; @cain+infix{*} :funding+*funding_type* | funding

The information that we wish to extract is “@pos :CRD ; @cain :million”. We can easily convert “@pos :CRD” into a number, and then compare it to a threshold. This information is somehow “pinned” between “@role :_funding_action_ ; @pos+infix{*} :XZ2+*amount_qualifier*” and “@cain :in ; @cain+infix{*} :funding+*funding_type*”. Clearly, we already know how to identify these two expressions in the text. What we need now is a mechanism that can extract what is between them. This is what the Rockwell frames do.
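The idea of extracting whatever is “pinned” between two known boundaries can be sketched in plain Java. This is a simplification made for illustration only: the boundaries are given as literal token sequences rather than as Rockwell expressions, and the class “FrameSketch” and its method “between” are hypothetical.

```java
import java.util.Collections;
import java.util.List;

class FrameSketch {

    // Returns the tokens found after the "from" sequence and before the "until" sequence,
    // or an empty list if either boundary is missing.
    static List<String> between(List<String> tokens, List<String> from, List<String> until) {
        int fromStart = Collections.indexOfSubList(tokens, from);
        if (fromStart < 0) return List.of();
        int begin = fromStart + from.size();
        // Search for the closing boundary only in the part that follows the opening one.
        int offset = Collections.indexOfSubList(tokens.subList(begin, tokens.size()), until);
        if (offset < 0) return List.of();
        return tokens.subList(begin, begin + offset);
    }
}
```

For the tokens “company raised $ 20 million in funding”, with “raised $” as the opening boundary and “in funding” as the closing one, the sketch returns the two tokens “20” and “million”.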

 

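Rockwell frames consist of only four elements:

  • from: the tag of the expression that precedes the information to be extracted

  • until: the tag of the expression that follows the information to be extracted

  • filter: the tag of the expression that selects the information of interest between them

  • meaning: the label assigned to the extracted information

These elements are separated by commas.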

 

Note that “from:” and “until:” are each optional, but at least one of them must be present. The element “filter:”, on the other hand, can be omitted completely.

 

Next, we need to list the expressions that our frame will use.

@role :_funding_action_ ; @pos+infix{*} :XZ2+*amount_qualifier* | -funding_amount_from-

@cain :in ; @cain+infix{*} :funding+*funding_type* | -funding_amount_until-

@pos :CRD ; @cain :million | -funding_amount_filter-

It is a good practice to enclose frame expressions within two hyphens (“-”), in order to distinguish them from concepts, patterns, and classification expressions.

 

Now we can define our frame:

from: -funding_amount_from-, until: -funding_amount_until-, filter:-funding_amount_filter-, meaning: funding_amount
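The rules for frame elements stated above can be checked with a few lines of code. The following sketch is not Rockwell’s own parser; the class “FrameDefinitionSketch” is hypothetical, and treating “meaning:” as mandatory is an assumption of this sketch, since the tutorial only states the rules for the other three elements.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FrameDefinitionSketch {

    // Splits a frame definition into its "name: value" elements and validates them.
    static Map<String, String> parse(String definition) {
        Map<String, String> elements = new LinkedHashMap<>();
        for (String part : definition.split(",")) {
            String[] pair = part.split(":", 2);
            if (pair.length != 2) {
                throw new IllegalArgumentException("Malformed element: " + part.trim());
            }
            elements.put(pair[0].trim(), pair[1].trim());
        }
        // At least one of the two boundaries must be present.
        if (!elements.containsKey("from") && !elements.containsKey("until")) {
            throw new IllegalArgumentException("Either \"from\" or \"until\" must be present.");
        }
        // Assumption of this sketch: a frame without a meaning is rejected.
        if (!elements.containsKey("meaning")) {
            throw new IllegalArgumentException("\"meaning\" is missing.");
        }
        return elements;
    }
}
```

Parsing the frame defined above yields four entries, while a definition that contains neither “from:” nor “until:” is rejected with an exception.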

Finally, we need to instantiate the Pipeline class for information extraction:

String filePath = "/home/nahum/code/Rockwell-NLP/Playground/src/main/resources/data/techcrunch";

 

String patterns = "/home/nahum/code/Rockwell-NLP/Playground/src/main/resources/script/patterns";

String concepts = "/home/nahum/code/Rockwell-NLP/Playground/src/main/resources/script/concepts";

String expressions = "/home/nahum/code/Rockwell-NLP/Playground/src/main/resources/script/frameExpressions";

String frames = "/home/nahum/code/Rockwell-NLP/Playground/src/main/resources/script/frames";

 

Properties properties = new Properties();

properties.put(PropertyFields.TASK.getField(), Pipeline.Tasks.EXTRACT.name());

properties.put(PropertyFields.PATTERNS.getField(), patterns);

properties.put(PropertyFields.CONCEPTS.getField(), concepts);

properties.put(PropertyFields.FRAME_EXPRESSIONS.getField(), expressions);

properties.put(PropertyFields.FRAMES.getField(), frames);

 

try {

    

    Pipeline pipeline = new Pipeline(properties);

    

    for (String line : TextFileReader.read(filePath)) {

 

        try {

            

            if (TextToolbox.isEmpty(line)) continue;

 

            for (StringBuilder sentence : pipeline.split(line)) {

                

                ArrayList<Token> tokens = pipeline.lemmatize(pipeline.tokenize(sentence));

                ArrayList<Extract> extracts = pipeline.extract(tokens);

                extracts.forEach((extract) -> {

                    Printer.print(extract.toString());

                });

            }

            

        } catch (Exception ex) {

            Printer.printException(ex);

        }

 

    }

 

} catch (Exception ex) {

    Printer.printException(ex);

}

This code is available here (you can try it if you cloned or copied the Rockwell repository): https://github.com/NahumKorda/Rockwell-NLP/blob/master/Playground/src/main/java/com/itcag/rockwell/playground/ExtractorTester.java.

 

Note that we keep frame expressions in a separate file, in order to avoid the confusion with the classification expressions.

 

As a result, we get the following list of extracted data (truncated to the first ten items):

  • 16 million [funding_amount]

  • 42 million [funding_amount]

  • 280 million [funding_amount]

  • 2 million [funding_amount]

  • 3.5 million [funding_amount]

  • 1 million [funding_amount]

  • 66 million [funding_amount]

  • 27 million [funding_amount]

  • 16 million [funding_amount]

  • 1 million [funding_amount]
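What remains is to convert each extract to a number and compare it with the threshold. The sketch below assumes that the printed form of an extract looks exactly like the items listed above (for example “16 million [funding_amount]”); the class “FundingThresholdSketch” and its methods are hypothetical helpers and not part of Rockwell.

```java
import java.math.BigDecimal;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FundingThresholdSketch {

    // Matches strings such as "3.5 million [funding_amount]".
    private static final Pattern AMOUNT =
            Pattern.compile("^\\s*(\\d+(?:\\.\\d+)?) million \\[funding_amount\\]\\s*$");

    private static final BigDecimal MILLION = new BigDecimal("1000000");

    // Converts the printed extract into dollars; throws if the format is unexpected.
    static BigDecimal toDollars(String extract) {
        Matcher matcher = AMOUNT.matcher(extract);
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Unexpected extract: " + extract);
        }
        return new BigDecimal(matcher.group(1)).multiply(MILLION);
    }

    // True only if the amount is strictly greater than the threshold.
    static boolean exceeds(String extract, BigDecimal thresholdInDollars) {
        return toDollars(extract).compareTo(thresholdInDollars) > 0;
    }
}
```

With a threshold of 10,000,000, the extract “16 million [funding_amount]” passes the check, while “3.5 million [funding_amount]” does not. BigDecimal is used instead of double so that amounts such as 3.5 million are converted without rounding errors.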

 

Final Words

The purpose of this tutorial is to enable a programmer to start using Rockwell. Rockwell’s chief functionality is information extraction. Therefore, this tutorial demonstrates a methodology that starts with a corpus of documents in plain text (in our case, a collection of financial news articles), and ends with obtaining valuable information from it. Consequently, this tutorial not only demonstrates how to write the code that operates Rockwell, but also shows the reader how to analyze the underlying textual content, identify the recurring lexical and syntactic constructs, as well as lexical and stylistic variations, and formalize them as Rockwell expressions and Rockwell frames.

 

At the heart of this methodology is the realization that the way we express our thoughts, experiences and emotions favors patterns. The first to realize this was George Kingsley Zipf, an American linguist who discovered an odd mathematical relationship between the frequency of occurrence of individual words and their rank when ordered by that frequency, a statistical oddity known today as Zipf’s law, which was later demonstrated to hold also for n-grams and phrases. This feature of human languages enables natural language processing to identify useful information, and transform unstructured text into structured data that can be further analyzed using machine learning and statistics.

 

Rockwell is a natural language processing platform designed around this notion of patterns, and we hope that it will serve you well in your endeavours.

© 2020 IT Consulting AG. All rights reserved.