Create ML - Language Processing

1. NLP and Finding the dominant language

This is one of the simplest but most foundational NLP tasks. The goal here is to answer the question “What language is this text written in?”

There are few ways to do this.

Character-level patterns

Languages have unique patterns - letter frequencies, digraphs and trigraphs.

Digraphs		Trigraphs
Grapheme	Example	Grapheme	Example
ch	Church	tch	Watch
sh	Ship	dge	Bridge
th	Think / That	igh	Night
ph	Phone	ore	Sore
kn	Knee	air	Stair
wh	Wheel	ear	Learn
ck	Back	ure	Pure
ng	Sing	sch	School
qu	Queen	iou	Precious

For example:

English loves “th”, “ing”, “the”
German loves “sch”

Statistical Models

Older Systems used Naive Bayes or Markov models trained on large corpora. They compare your text o known distributions and pick the closest match.

Modern ML/Deep Learning

Transformers or CNNs trained on multilingual datasets can classify language with extremely high accuracy, even for short text.

Hybrid approaches

Many systems combine heuristics and ML

code-mixed test - “Namaste dude, como estas”
For short texts - “oui”, “ha”, “ja”
Popularly borrowed words - “Samosa”, “Pizza”

2. Why finding the dominant language is important

Finding the right dominant language is a gateway for several things.

Choosing the right tokenizer
Routing the text to the right translation model/system
Enabling multilingual search
Improving speech recognition
Localization of UI elements

3. Apple’s `NaturalLanguage` package

Here are the list of ISO 639 language codes

Find the dominant Language in a sentence

In Xcode Playground, run this code.

import NaturalLanguage
 
let recognizer = NLLanguageRecognizer()
recognizer.processString("Namaste, como estas?)
if let language = recognizer.dominantLanguage {
    print(language.rawValue)
}

The output would be: es even though Namaste, como estas has Sanskrit or Hindi with Espanol. But, there are more Espanol words in the sentence.

Find the top 3 possible languages in a sentence

import NaturalLanguage
 
let recognizer = NLLanguageRecognizer()
recognizer.processString("Bonjour, como estas.")
let languageProbabilities = recognizer.languageHypotheses(withMaximum: 3)
 
for (language, probability) in languageProbabilities {
    print("Detected \(language.rawValue). Probability \(probability)")
}

The output would be:

Detected pt. Probability 0.34345224499702454
Detected es. Probability 0.6560376286506653
Detected fr. Probability 0.0003371217171661556

`NLTagger` for tokenizing a sentence

import NaturalLanguage
 
// English, Sanskrit and Japanese
let array = ["Doesn't it?", "त्वं कुत्र गच्छसि?", "私は学生です"]
 
let tagger = NLTagger(tagSchemes: [.tokenType])
 
for text in array {
    print("---")
    tagger.string = text
    
    tagger.enumerateTags(
        in: text.startIndex..<text.endIndex,
        unit: .word,
        scheme: .tokenType,
        options: [.omitPunctuation, .omitWhitespace]) { (tag, range) -> Bool in
            print(text[range])
            return true
        }
}

The output will be:

---
Does
n't
it
---
त्वं
कुत्र
गच्छसि
---
私
は
学生
です

`NLTagger` for finding Parts of Speech

import NaturalLanguage
 
// English, Sanskrit and Japanese
let array = ["Doesn't it?", "त्वं कुत्र गच्छसि?", "私は学生です"]
 
let tagger = NLTagger(tagSchemes: [.lexicalClass, .language, .script])
 
for text in array {
    print("---")
    tagger.string = text
    
    tagger.enumerateTags(
        in: text.startIndex..<text.endIndex,
        unit: .word,
        scheme: .lexicalClass,
        options: [.omitPunctuation, .omitWhitespace]) { (tag, range) -> Bool in
            print(text[range] + ":" + (tag?.rawValue ?? "Unknown"))
            return true
        }
}

Output:

---
Does:Verb
n't:Adverb
it:Pronoun
---
त्वं:OtherWord
कुत्र:OtherWord
गच्छसि:OtherWord
---
私:OtherWord
は:OtherWord
学生:OtherWord
です:OtherWord

4. Sentiment Analysis of a sentence

import NaturalLanguage
 
func sentimentScore(for text: String) -> Double {
    let tagger = NLTagger(tagSchemes: [.sentimentScore])
    tagger.string = text
 
    let (sentiment, _) = tagger.tag(at: text.startIndex,
                                    unit: .paragraph,
                                    scheme: .sentimentScore)
 
    // Convert the tag (String) into a Double
    return Double(sentiment?.rawValue ?? "0") ?? 0
}
 
let array = [
    // Very Positive
    "This product is absolutely perfect and exceeded every single one of my expectations!",
    
    // Mildly Positive
    "The service was just fine and experience was good.",
    
    // Neutral
    "The delivery arrived at 3 PM and was placed on the front porch.",
    
    // Mildly Negative
    "The food was okay, but the wait time was a bit longer",
    
    // Very Negative
    "This was a catastrophic failure and the most horrible experience I have ever had."
]
 
for sentence in array {
    let score = sentimentScore(for: sentence)
    print("\(score):\t \(sentence)")
}

The output:

1.0:	 This product is absolutely perfect and exceeded every single one of my expectations!
0.2:	 The service was just fine and experience was good.
-0.8:	 The delivery arrived at 3 PM and was placed on the front porch.
-1.0:	 The food was okay, but the wait time was a bit longer
-1.0:	 This was a catastrophic failure and the most horrible experience I have ever had.

Chandra Polepeddi

Explorer

Create ML - Language Processing

1. NLP and Finding the dominant language

Character-level patterns

Statistical Models

Modern ML/Deep Learning

Hybrid approaches

2. Why finding the dominant language is important

3. Apple’s `NaturalLanguage` package

Find the dominant Language in a sentence

Find the top 3 possible languages in a sentence

`NLTagger` for tokenizing a sentence

`NLTagger` for finding Parts of Speech

4. Sentiment Analysis of a sentence

Table of Contents

Chandra Polepeddi

Explorer

Create ML - Language Processing

1. NLP and Finding the dominant language

Character-level patterns

Statistical Models

Modern ML/Deep Learning

Hybrid approaches

2. Why finding the dominant language is important

3. Apple’s NaturalLanguage package

Find the dominant Language in a sentence

Find the top 3 possible languages in a sentence

NLTagger for tokenizing a sentence

NLTagger for finding Parts of Speech

4. Sentiment Analysis of a sentence

Table of Contents

3. Apple’s `NaturalLanguage` package

`NLTagger` for tokenizing a sentence

`NLTagger` for finding Parts of Speech