A computer would deserve to be called intelligent if it could deceive a human into believing that it was human.

Alan Turing

This thought from Alan Turing still resonates today and drives data scientists toward the end goal: building intelligent machines. What could make machines more intelligent than mastering a task that had been the exclusive privilege of humans, conversing effortlessly about arbitrary topics? To do this, they must understand and be able to produce natural language. Most of us never, or rarely, think about this aspect of our lives and how easy it is for us to speak and understand language. Our brain, a most powerful machine, constantly processes huge amounts of unstructured data into meaningful information, and it has become very efficient at this task. We have the ability to communicate with one another through very short, non-explicit messages, relying on common sense, and we do this naturally. For machines, however, understanding human language is very hard: it is an AI-complete problem, which is as hard as it gets.

What makes this problem so difficult is precisely what makes language natural: the rules by which language is built evolve daily and without predetermination. This makes approaching natural language processing with hand-coded rule sets an impossible task. As in similar situations, this is where data science or, more precisely, Natural Language Processing (NLP) steps in, combining the power of artificial intelligence, computer science and computational linguistics to build machines that can understand natural language and interact with humans naturally.

Whether we are aware of it or not, we use NLP applications every day: language translators (Skype Translator, Google Translate), digital assistants (Apple’s Siri, Amazon’s Alexa), search engines (Google Search), automatic summarization, email filters, etc. In the global job market, those with NLP skills are in high demand, and those working in the field produce new ideas, algorithms, techniques and resources on a daily basis. Recent achievements in the fields of deep learning and memory-based recurrent networks, together with access to big data, are bringing NLP to a whole new level. Before digging into the latest NLP achievements and developing the skills to progress with them, it is necessary to understand the basics and the difficulties NLP faces. NLP deals with text, but it goes beyond symbols and words to try to understand the linkages that create meaning. That is why NLP practitioners are concerned with the hierarchical structure of language: symbols make a word, words make a phrase, phrases make a sentence, and all of them together convey ideas and meaning.

So, language is approached on different levels: morphological, lexical, syntactic, semantic and pragmatic. The last one, dealing with meaning in different situations, is the biggest barrier preventing NLP from reaching its ultimate goal. Language ambiguity includes irony and sarcasm, both very complex language forms. However, ambiguity goes beyond the lexical level and even affects morphemes, such as the letter s in the English language. One sentence that displays language ambiguity in all its colors is:

Will, will Will will Will Will’s will?


If you have been struggling to understand the above sentence, imagine the struggle of building machines that would understand it. Even though NLP faces many challenges, ambiguity being the biggest one, each of them is being attacked from different angles and it is just a matter of time before the barriers are overcome. With plenty of open datasets and open-source frameworks such as the Natural Language Toolkit, Apache OpenNLP, Stanford CoreNLP, spaCy, OpeNER and many others, the world of NLP has never been more accessible. However, in order to really understand NLP, we must get our hands at least a little bit dirty.

So, let’s solve one simple NLP example from the field of sentiment analysis. We will construct your own Vibes Guru, a system that motivates you to keep a positive attitude.

Sentiment analysis usually involves three groups: positive, negative and neutral. We will build a classifier for two groups (classes), positive and negative, and cover the middle ground with the probabilities of belonging to one of these classes. The example will be coded in Python.

We will start by constructing a training set from vocabularies of positive and negative words:

positive_vocab = ['good', 'awesome', 'outstanding', 'fantastic', 
                  'nice', 'great', ':)', 'successful', 'happy', 
                  'marvelous', 'like', 'love', 'adorable', 'happiness', 
                  'interesting', 'amazing', 'positive', 'beautiful'] 

negative_vocab = ['bad', 'terrible', 'useless', 'hate', 'hating', 
                  ':(', 'awful', 'boring', 'despicable', 'dislike', 
                  'ugly', 'horrible', 'negative', 'sad', 'sadness']

Then we will assign labels to each of these words: positive words will get label 1 and negative words label 0.

# labels: 1 - positive
#         0 - negative
positive_labels = [1] * len(positive_vocab)
negative_labels = [0] * len(negative_vocab)

We will then merge the vocabularies and labels into the final training set.

x_train = positive_vocab + negative_vocab
y_train = positive_labels + negative_labels

Now we are ready to approach classification. Firstly, it is necessary to point out that, at the moment, no NLP model can understand raw text data directly, so we need to transform the data into a form the model can understand: a numerical one. We will use scikit-learn and CountVectorizer. This module basically takes a string, tokenizes it and transforms it into a numerical form that contains information about the tokens and how often they appear in the text. Its full power shows when processing larger texts, but we will use it here because it is a simple way to convert our vocabularies into numerical form.

# transform train data into numerical form
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
count_vect.fit(x_train)
x_train = count_vect.transform(x_train)
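
If you want to peek at what the vectorizer has learned, you can print the learned vocabulary mapping and the shape of the transformed matrix. This is purely an optional sanity check, not part of the original recipe, and the exact indices will vary:

# optional sanity check: token-to-column mapping and shape of the transformed matrix
print(count_vect.vocabulary_)   # e.g. {'good': 12, 'awesome': 3, ...} (indices will vary)
print(x_train.shape)            # (number of training words, vocabulary size)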

Now our training set is ready to be used in a classifier. We will use the simplest one, the Naive Bayes classifier, more precisely Multinomial Naive Bayes, which is especially suitable for text classification, and we will use the one from scikit-learn: MultinomialNB.

Training the classifier takes just a couple of lines of code:

# train the classifier
from sklearn.naive_bayes import MultinomialNB

naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(x_train, y_train)

We are now ready to use our classifier on input sentences. The first step is preprocessing, a very important part of all data science projects. For classification, it removes discrepancies by transforming both the training data and the input data into the same form, so that the classifier can learn all the necessary forms, down to character level, and understand all the data it receives on input.

Preprocessing is a big topic in itself, so we created very simple, noise-free training data written in lower-case characters. We will assume the input data is noise-free as well, so the only preprocessing we need to perform is lower-casing the input. Then we will transform the input sentence (or sentences) into numerical form using our trained CountVectorizer():

# classify vibes
input_text = 'I had an amazing and happy day'   # example input sentence
input_text = input_text.lower()
input_text_cv = count_vect.transform([input_text])

One way of performing the classification is:

prediction = naive_bayes_classifier.predict(input_text_cv)

which will result in one of the labels: 0 (negative vibes) or 1 (positive vibes).
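
As a quick, optional check, you can verify that the classifier assigns the expected labels to a couple of words taken from our training vocabularies (the sample words below are just illustrative):

# optional check on two training words: 'awesome' should map to 1, 'terrible' to 0
sample = count_vect.transform(['awesome', 'terrible'])
print(naive_bayes_classifier.predict(sample))   # expected output: [1 0]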

But, in order to build our Vibes Guru, we will use:

prediction = naive_bayes_classifier.predict_proba(input_text_cv)

which will return the probabilities of belonging to the negative and positive classes, allowing us to recognize the level of positivity or negativity. Since the columns of predict_proba follow the order of the classifier’s classes (here 0, then 1), the second column holds the probability of positive vibes. We will then use these values to produce results in the form of messages:

# messages for positive, borderline and negative vibes
result_dict = ['Keep up with positive vibes. :)',
               'Brighten up a bit.',
               'No negative vibes allowed! Cheer up.']

def create_mood_message(probability):
    # probability is the predicted probability of the positive class
    if probability > 0.6:
        return result_dict[0]
    elif probability >= 0.5:
        return result_dict[1]
    else:
        return result_dict[2]

Then, with the following piece of code:

prediction = naive_bayes_classifier.predict_proba(input_text_cv)
message = create_mood_message(prediction[0][1])

you just made your Vibes Guru understand your vibes and talk back.

By wrapping this simple code into a script, you can have a short conversation with your Vibes Guru, along the lines of the sketch below.
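
A minimal way to wrap the snippets above into such a script is sketched here. It reuses the count_vect, naive_bayes_classifier and create_mood_message defined earlier and simply loops over console input; the loop structure and the 'quit' keyword are illustrative choices, not part of the original example:

# illustrative interactive wrapper around the trained classifier
def run_vibes_guru():
    print("Tell me about your day (type 'quit' to stop).")
    while True:
        input_text = input('> ')
        if input_text.lower() == 'quit':
            break
        input_text_cv = count_vect.transform([input_text.lower()])
        prediction = naive_bayes_classifier.predict_proba(input_text_cv)
        print(create_mood_message(prediction[0][1]))

run_vibes_guru()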

That’s it!

Now, with the help of the Vibes Guru you just built, try to cope with the answer to our title question: can machines think and talk? I believe not, not yet. However, as our simple example has shown, they can pretend that they do, and with the power of NLP they pretend very well. It is only a matter of time before NLP goes from pretending to real conversation. This blog barely scratches the surface of Natural Language Processing but, hopefully, it has provided you with your first steps towards making a machine that would be truly intelligent and pass the Turing test in every way.

Also, the example given is more naive than Naive Bayes, so, until next time, try to improve it: do more preprocessing, train on whole texts or documents or, at the very least, add two new categories (blog_fans and blog_haters) or just one category (blog) and have a similar conversation with your Vibes Guru. A possible starting point is sketched below.
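
As a hedged starting point for that exercise, the sketch below adds a hypothetical third class (label 2) built from an illustrative blog_vocab list (the words and the label value are assumptions, not part of the original example) and retrains the same pipeline:

# hypothetical extension: add a third class for blog-related input
blog_vocab = ['blog', 'nlp', 'post', 'article', 'reading']
blog_labels = [2] * len(blog_vocab)

x_train = positive_vocab + negative_vocab + blog_vocab
y_train = positive_labels + negative_labels + blog_labels

count_vect = CountVectorizer()
count_vect.fit(x_train)
x_train = count_vect.transform(x_train)

naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(x_train, y_train)

# predict_proba now returns three columns, one per class, in the order of classes_
print(naive_bayes_classifier.classes_)   # expected: [0 1 2]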
