Artificial Intelligence. It’s one of those concepts that seems scary, fantastic, and also very far away from a place like Cleveland, OH. It's important to keep in mind that A.I. is a general term for many different topics in computer science that work towards a similar goal: the ability to make decisions without specific human instruction. Such topics include Computer Vision, Natural Language Processing (NLP), Machine Learning, and many others.

We often associate these brainy topics with famous west coast companies such as Google and Microsoft, or with prestigious universities on the east and west coasts. But let’s not forget about the pockets of brilliance that exist right here in our hometown. Building the filmbot is a nod to local skills and talent that often get overlooked despite the growing renaissance of our region. It’s easy to forget that Cleveland does have a bit of tech heft, be it big names such as IBM, startups in the Health Tech Corridor, or yours truly (my firm CodeRed).

But enough of the why — let’s get on to the how. There were 4 parts involved:

  1. Scraping data from clevelandfilm.org.
  2. Using NLP to analyze the plots to compute similarity and sentiment scores.
  3. Building a website to display our results.
  4. Building a twitterbot to serve up real-time chatter about our results.

The entirety of this project was implemented in Python. I’ll get into specifics below.

1. Getting the Film Data

This was actually one of the more challenging tasks. We had to put on our search engine hats and build a crawler to scrape data from the Film Festival’s website. Because the film information on clevelandfilm.org was not explicitly marked up, we had to get creative in how our crawler discovered the data. The first few runs found us pulling in information that was not exactly correct, such as “World Premiere” instead of the full film description. We eventually got it right. But this further reinforced the importance of SEO (search engine optimization). If clevelandfilm.org had implemented SEO techniques such as structured data, it would have given us (and big brother Google) a much clearer outline of the key information.

The crawler was implemented in Python using urllib and BeautifulSoup, and it saved the parsed data into Django models. We grabbed the title, description, image, and also the show times so that our twitterbot would be able to conveniently tweet whenever a film was starting or ending.
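To give a feel for the "get creative" part, here is a minimal sketch of one heuristic: when a page has several unlabeled paragraphs, treat the longest one as the description so that short labels do not win. This is not our actual crawler. It swaps in Python's standard-library html.parser instead of BeautifulSoup so it runs without extra installs, and the sample markup, the FilmPageParser class, and the parse_film helper are all made up for illustration.

```python
from html.parser import HTMLParser


class FilmPageParser(HTMLParser):
    """Collects the <h1> text as the title and every <p> as a candidate description."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._tag = None     # the h1/p tag currently being read, if any
        self._buffer = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buffer).split())
            if tag == "h1":
                self.title = text
            elif text:
                self.paragraphs.append(text)
            self._tag = None

    @property
    def description(self):
        # Heuristic: a short label loses to the real plot text.
        return max(self.paragraphs, key=len, default="")


def parse_film(html):
    parser = FilmPageParser()
    parser.feed(html)
    return {"title": parser.title, "description": parser.description}


# Hypothetical markup, not copied from clevelandfilm.org.
sample = """
<div><h1>Example Film</h1>
<p>World Premiere</p>
<p>A retired detective returns to his hometown to solve one last case.</p></div>
"""
film = parse_film(sample)
print(film)
```

With structured data on the page, none of this guessing would be needed: the crawler could read the labeled fields directly.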

2. Natural Language Processing

Now, to implement the “brains” of the whole thing. Natural Language Processing (NLP) is essentially the computer science topic focused on breaking down and understanding human language. NLP was a key challenge addressed as part of the IBM Watson Jeopardy challenge, if anyone remembers that from a few years ago, so that the Watson computer could understand what in the heck those Jeopardy questions were actually asking. There are three specific NLP techniques we used as part of our process, all of which were implemented using the NLTK.

TF-IDF

TF-IDF, which stands for “Term Frequency, Inverse Document Frequency”, is one of the more well-established ways to score the similarity of two documents (document meaning a body of text, in this case, a film description). Essentially the computer breaks down the document into individual words, throws away the stopwords (common words such as: the, a, an, and, is, etc.), and weights each remaining word by how often it appears in the document (term frequency) and how rare it is across all documents (inverse document frequency). It then looks to see which other documents share the most highly weighted words with this particular document.

This technique is used all over the place, from simple suggestion engines to built-in database search functions. TF-IDF compares two documents purely by the words they have in common, never by what those words mean. Thus, it is effective, but not very smart. For instance, TF-IDF would say that “I went to the bank to deposit money” is similar to “I slid down the bank by the lake”. But in reality we know that “bank” has completely different meanings in the two contexts.
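The “bank” problem is easy to reproduce. The sketch below is a small pure-Python TF-IDF with cosine similarity, written for illustration rather than taken from our code. The stopword list is a tiny stand-in, the smoothed IDF formula is one common variant among several, and the third sentence is invented as a control.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real one is much longer.
STOPWORDS = {"i", "the", "a", "an", "and", "is", "to", "by", "down", "for"}


def tokenize(text):
    """Lowercase, split into words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def tfidf_vectors(documents):
    """Return one {word: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Number of documents each word appears in.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # term frequency * smoothed inverse document frequency
            word: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "I went to the bank to deposit money",
    "I slid down the bank by the lake",
    "The chef cooked dinner for the family",
]
vectors = tfidf_vectors(docs)
bank_vs_bank = cosine(vectors[0], vectors[1])
bank_vs_chef = cosine(vectors[0], vectors[2])
print(bank_vs_bank > bank_vs_chef)
```

The two “bank” sentences get a similarity score above zero only because they share a spelling, while the unrelated sentence scores exactly zero.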

Word Sense Disambiguation

To make our bot smart enough to solve the “bank” problem, we must turn to word sense disambiguation (WSD). WSD means that the meaning of a word is determined based on its context, not its spelling alone. By first breaking each document into sentences, and then analyzing each word of each sentence, we get a much better sense of the meaning of the film description.

Once we have determined the meaning of each word within the context of each sentence, we then look up the lemmas for that meaning. In linguistics, a lemma is the base (dictionary) form of a word; in a lexical database, the lemmas attached to a meaning are the words that can express that meaning. You can think of lemmas as synonyms. So for the first sentence, we extract the meaning of “bank”, and then find lemmas for each word in its current context.

“I went to the bank to deposit money”
BANK: meaning: a financial institution that accepts deposits and channels the money into lending activities.
BANK: lemmas: bank, banking company, financial institution

“I slid down the bank by the lake”
BANK: meaning: sloping land (especially the slope beside a body of water).
BANK: lemmas: slope, curve, side, edge, shore

As you can see, applying WSD gives us a much clearer idea of what the sentence actually means.
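For the curious, here is a toy version of the idea, loosely in the spirit of the classic Lesk approach: pick the sense whose related words overlap most with the rest of the sentence. This is only a sketch and not necessarily the algorithm we ran. The SENSES inventory and its "signature" word sets are hand-written for this example, whereas a real system would pull senses from a lexical database.

```python
import re

# Hypothetical, hand-written sense inventory for illustration only.
SENSES = {
    "bank": [
        {
            "meaning": "a financial institution that accepts deposits and "
                       "channels the money into lending activities",
            "signature": {"money", "deposit", "deposits", "loan", "account"},
        },
        {
            "meaning": "sloping land (especially the slope beside a body of water)",
            "signature": {"water", "lake", "river", "slope", "shore", "stream"},
        },
    ]
}


def disambiguate(word, sentence):
    """Return the sense of `word` whose signature best overlaps the sentence."""
    context = set(re.findall(r"[a-z]+", sentence.lower())) - {word}
    return max(SENSES[word], key=lambda sense: len(sense["signature"] & context))


money_sense = disambiguate("bank", "I went to the bank to deposit money")
lake_sense = disambiguate("bank", "I slid down the bank by the lake")
print(money_sense["meaning"])
print(lake_sense["meaning"])
```

The same spelling now resolves to two different meanings, which is exactly what plain word matching cannot do.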

Once we have compiled a list of every lemma of every word from every sentence in a film plot, we then compare the list of lemmas against the words in every other document. This gives us a very intelligent comparison of how similar two film plots actually are.

Our crawler pulled in 436 films from the clevelandfilm.org website. Every word in every sentence of every film was broken down, analyzed, and then compared to every other film. In computer science lingo, this means n² - n comparisons, which is O(n²); or in human terms: 189,660 different comparisons.

That's right, the filmbot index contains 189,660 film analyses!
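If you want to check the arithmetic, the count is simply every ordered pair of distinct films. The short sketch below confirms it two ways; the variable names are ours for this example only.

```python
from itertools import permutations

film_count = 436

# Each film is compared against every other film: n * (n - 1) = n^2 - n.
ordered_pairs = film_count * (film_count - 1)

# Cross-check by actually enumerating the ordered pairs.
enumerated = sum(1 for _ in permutations(range(film_count), 2))

# If a comparison were symmetric (A vs. B equals B vs. A), half would do.
unordered_pairs = ordered_pairs // 2

print(ordered_pairs, enumerated, unordered_pairs)  # 189660 189660 94830
```

Comparing one film's lemmas against another film's words is directional, so counting ordered pairs is a reasonable reading of the number above.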

Sentiment Analysis

The final NLP principle we applied to the film plots was sentiment analysis. In this case, we used the VADER algorithm that is built into the NLTK to do the heavy lifting. VADER is a lexicon and rule-based sentiment analyzer: human raters scored its vocabulary as "positive" or "negative", and it was tuned for short, informal text such as tweets and checked against human-labeled text, including movie reviews from Rotten Tomatoes. When we ran the film plots through VADER, it had a pretty good understanding of positive and negative, and was able to score CIFF films in a similar manner. Each film was rated on a scale from -1.0 to 1.0 to indicate negativity or positivity.
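To show where a score between -1.0 and 1.0 can come from, here is a toy scorer in the spirit of a lexicon approach. It is not VADER itself: the LEXICON values are invented for this example, the rules for negation, punctuation, and emphasis are left out, and the squashing formula with alpha=15 is our understanding of how VADER normalizes its compound score, so treat it as an assumption.

```python
import math
import re

# Hypothetical mini-lexicon with made-up ratings; a real one is far larger.
LEXICON = {
    "love": 3.0, "joy": 2.5, "hope": 1.9,
    "loss": -1.5, "war": -2.9, "murder": -3.5,
}


def toy_sentiment(text, alpha=15.0):
    """Sum the word ratings, then squash the total into the range (-1, 1)."""
    words = re.findall(r"[a-z]+", text.lower())
    total = sum(LEXICON.get(word, 0.0) for word in words)
    return total / math.sqrt(total * total + alpha)


upbeat = toy_sentiment("A story of love and hope")
grim = toy_sentiment("A murder during the war")
neutral = toy_sentiment("A man walks to work")
print(upbeat, grim, neutral)
```

Text with no rated words lands at exactly 0.0, and no amount of positive or negative vocabulary can push the score past the ends of the scale.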

3. Building the Web App

Now on to the easier part. We built the web app at https://ciff.coderedcorp.com/ using our favorite web framework, Django. Since it's robot-themed, we decided to use the Materialize CSS front-end framework to give it a nice Android-ish material design look and feel. As with all CodeRed projects, the site is fully responsive and fluid on desktops and smartphones alike. There's a lot of data here and we tried to make the best use of the screen space to show the results as clearly as possible.

4. The Twitterbot

Not necessarily the most complicated, but one of the more fun parts of the project, is the twitterbot. While it may have a cute avatar, it's actually just a few lines of code run by a cron job every 5 minutes. The job does two things. First, it checks which films are playing and tweets a link to each one on the site. Second, it checks which films have recently ended, looks through our database of 189,660 plot comparisons to find the most similar film, and tweets that suggestion to you. Twitter connectivity is obtained directly through Twitter's API (via twython), with no human intervention required.
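The timing logic can be sketched in a few lines. This is an illustration under assumptions rather than our production job: the films_to_announce helper, the dictionary layout, and the sample dates are all made up, and the actual tweeting through twython is left out because it needs API credentials.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # matches the cron interval


def films_to_announce(showings, now):
    """Return (just_started, just_ended) titles for the last 5 minutes.

    The window is half-open, (now - 5 min, now], so a film is picked up
    by exactly one run when the job fires every 5 minutes.
    """
    started = [s["title"] for s in showings if now - WINDOW < s["start"] <= now]
    ended = [s["title"] for s in showings if now - WINDOW < s["end"] <= now]
    return started, ended


# Arbitrary sample data for illustration.
now = datetime(2016, 4, 1, 19, 0)
showings = [
    {"title": "Film A", "start": datetime(2016, 4, 1, 18, 58), "end": datetime(2016, 4, 1, 20, 30)},
    {"title": "Film B", "start": datetime(2016, 4, 1, 17, 0), "end": datetime(2016, 4, 1, 18, 57)},
    {"title": "Film C", "start": datetime(2016, 4, 1, 21, 0), "end": datetime(2016, 4, 1, 22, 30)},
]
started, ended = films_to_announce(showings, now)
print(started, ended)
```

For each title in the second list, the bot would then look up the most similar film in the comparison database and tweet it.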


Well, that’s all folks. It’s also worth noting that our development team built this over the course of 2 days. It was sort of a last-minute idea that stretched our brain muscles and helped promote the Cleveland International Film Festival at the same time. Win-win!

If you’re interested in working on these types of projects, get in contact with me at salvino@coderedcorp.com. We'd love to work with you as part of our team. And of course, we’re looking forward to filmbot 2.0 next year at CIFF41!