### Activity 1 is available on Canvas! - Under 'Discussions' --- # Machine Learning Basics ### Will Styler - LIGN 6 --- ### This class is about Natural Language Processing - A big part of that is teaching computers about the many patterns of language - Facts about grammar, about words, about speech, about sound - We need the computer to be able to make decisions and choices based on language patterns --- ### These decisions can exist at many levels - Is 'recrod' a valid word in the language being used or a typo? - Is the user using 'record' as a noun or a verb? - Did the user really want to write 'duck these COVID restrictions'? - What device does the user want me to activate? - Is this a negative or positive product review? --- ### We don't want to hard-code every decision! - "The list of valid English words is..." - "Unless the word 'pond' also occurs, don't say 'duck'..." - "If 'record' comes after a noun or pronoun, it's probably a verb..." - "The thing after 'Turn on the...' is the thing you should turn on..." --- ### ... but some decisions are very difficult to hard-code - What features would define a 'bad' review? - Of a vacuum? - How does this vary across product categories? - "Goes through a lot of gas" for cars vs. air filters - How does this vary across languages and cultures? --- ### Language is complex, variable, and changing - We don't want to write code that's based on only our knowledge of the world - We don't want to write code that's based on only one speaker or situation - We don't want to write code for each grammatical feature - We want our analysis to be able to change and evolve over time and adapt to different situations - We want our processing to handle a complex set of situations --- ### There are always more possible sentences than lines of code to analyze them - "This dinner left my beloved as if she'd been given three daggers to the midsection" --- ### We don't want to hard-code how computers should interpret (most) language! 
- Let's let the computer see the data and find the patterns itself! --- ### Natural language processing is usually a multi-step process - 1: Collect Data - 2: Label Data with the classes of interest - 3: Find Features in the data which might be useful - 4: Train the algorithm on some of the data - 5: Test the algorithm and check the predictions - 6: Repeat with modifications until you're done --- ### Today's Plan - What kinds of language data are used? - What is Machine Learning? - How do two common algorithms work? - What's the workflow for Machine Learning? --- # What is Language Data? --- ### First, a few terms --- ## Corpus (pl. 'Corpora') A body of data used for study, analysis, or training algorithms --- ## Lexicon A listing of words in a given language, with or without additional information --- ### Language Corpora - Speech data (e.g. phone conversations, Siri interactions, recited sentences) - Speech data are *acoustic*, not written words - Can include secondary data streams (e.g. stereo, voice information) - Text data (e.g. tweets, newswire, literature, books, medical records, banks of emails) - Conversation data (e.g. scripts, transcripts, etc.) - Language resources (e.g. lexicons/dictionaries, ontologies, lexical resources) --- ### Corpora are often annotated - We give the computer human-generated 'answers' which we want it to be able to reproduce in new data - "Here's the test and the key, learn from it, so you can take the next test" - These annotations are *task specific*, and very expensive to make - An annotated corpus has a bunch of raw data, and a bunch of metadata which the computer can use to learn from the raw data - "Line 7782, character 12-23 == 'verb'" --- ### Sample Data annotations - **Transcriptions**: "What was said here, and when?" - **Structural annotations**: "How does this sentence work, grammatically?" - **Type or class annotations**: "What *kind* of language is this?" or 'Is this a noun or verb or adjective or...?' 
- **Semantic annotations**: "What does this mean?" or "Is this a positive or negative sentiment?" or "When did this happen?" - *We'll spend a lot of time talking about how this process works* --- ### Language data is the key - You cannot build a language system without language data to work from - Your data needs to match the domain and problems you're wanting to solve - Your system is only as good as your data - "Defenestrate" - "Herbie" - "The chair needs fixed" --- ### We'll talk about many kinds of language data - Each has strengths, weaknesses, and uses - ... and each presents unique challenges --- ### ... but it's all turned into working systems through the same basic set of techniques --- ## Machine Learning --- ### Disclaimer: I'm not going to teach you *how* to do machine learning - That's a Ph.D., not a single quarter - It is an art as much as a science - It's constantly evolving --- ### So, we're just going to do an overview - Give you simplified basics, and you'll learn the details when you dive deeper for your tasks - I just need you to know enough to get the concept - *This is going to happen a lot this quarter. That's the point!* --- ### ANY Machine Learning has three basic steps - 1: Use statistics to find useful patterns in some kind of data using a machine learning algorithm ('Training') - 2: Apply those patterns to make decisions and predictions about new data ('Testing') - 3: Evaluate the results, make changes to the model or algorithm, and repeat! --- ### What *doesn't* machine learning do? - Machine learning doesn't deal with 'meaning' or 'understanding' - It is not about 'Artificial Intelligence' (in any real sense) - **ML is math done on data, to draw meaningful lines, which *seems* like learning** - It is not **actual human-like learning** --- ### Machine Learning is focused on prediction and classification - Predictions *anticipate* upcoming data - "What's the next data point likely to look like?" 
- "What's the next word in this sentence likely to be?" - Classifications *describe* the current data - "We have groups here. What group does this new point belong to?" - "Is this review positive or negative?" - *We're going to focus on classification today!* --- ### ML can be *transparent* or *opaque* - **Transparent** machine learning algorithms give us details about *why* they decided what they did - What information or 'features' mattered most? - How confident was the model in this decision? - Where are the boundaries? - **Opaque** algorithms give us little information about their decision-making process - You get input, a black box, and output - Sometimes you get confidence, but not always - Like humans, "Uh, I just know" --- ### Research vs. Engineering Tasks - **Research** machine learning often focuses on transparency and learning as much as you can about the data and task - The model isn't meant to "do work", it's meant to explain what's going on in the dataset - Favors 'transparent', interpretable methods - **Engineering** machine learning focuses on raw predictive power, efficiency, and accuracy - "I need to answer this for 20,000 queries each minute, so..." - Favors accurate and fast methods; transparency isn't as needed --- ### Supervision in Machine Learning - *Supervised Machine Learning* - We train on all the data, *annotated with the patterns we're looking for*, then test and tweak. - *Semi-Supervised Machine Learning* - We train on *a very small amount of annotated data*, then test and tweak - *Unsupervised Machine Learning* - We give the machine *unannotated* data, then test and tweak - These have strengths and weaknesses - We're going to focus on *supervised* machine learning today --- # Machine Learning Algorithms ---
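### Sketch: The Train/Test/Evaluate Loop

Before the algorithms themselves, the three basic steps can be sketched in a few lines of Python. This is a toy sketch: the data points, the single feature, and the one-threshold 'model' are all invented for illustration.

```python
# Hypothetical labeled data: (feature value, class label)
data = [(0.9, 'duck'), (1.1, 'duck'), (1.0, 'duck'),
        (3.2, 'swan'), (2.8, 'swan'), (3.0, 'swan'),
        (1.2, 'duck'), (2.9, 'swan')]

train, test = data[:6], data[6:]   # hold some data back for testing

def fit(examples):
    """'Training': find a threshold halfway between the two classes' means."""
    ducks = [x for x, y in examples if y == 'duck']
    swans = [x for x, y in examples if y == 'swan']
    return (sum(ducks) / len(ducks) + sum(swans) / len(swans)) / 2

def predict(threshold, x):
    """'Testing': apply the learned pattern to a new point."""
    return 'duck' if x < threshold else 'swan'

threshold = fit(train)
correct = sum(predict(threshold, x) == y for x, y in test)
accuracy = correct / len(test)     # evaluate, then tweak and repeat
```

Real systems swap in better models and better features, but the loop stays the same.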
--- > "I'm looking at a bird. What kind of bird is it?" ---
--- ## The World's Dumbest Algorithm --- ### "It's a duck!" Algorithm - No matter the data, just says "Eh, it's a duck" - Surprisingly accurate - If 78% of waterfowl on the lake are ducks, it's accurate 78% of the time. - If 10% of the waterfowl are ducks, it's accurate 10% of the time. --- ## The World's Second Dumbest Algorithm --- ### "It's probably a duck!" Algorithm - Count the proportion of ducks in the training data, and then guess 'Duck' at random X percent of the time - Much more accurate when the classes are imbalanced - If only 10% are ducks, this will be *much* more accurate than "It's a duck!" - ... but you'll start making mistakes on ducks, too! --- ### Two Easy-to-Explain Algorithms * Decision Trees and Random Forests * Because they're transparent * Support Vector Machines * Because they're accurate *and* interpretable --- ## Decision Trees --- Let's pretend to be classifiers! > "I'm looking at a bird. What kind of bird is it?" --- One Approach: * **Ask questions, then make decisions based on the answer!** ---
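### Sketch: Asking Questions in Code

The question-asking approach is just nested if/else checks. The features (quacking, neck length, body length) and the cutoff values below are invented for illustration.

```python
# A hand-built decision tree for the waterfowl question: each branch
# asks one question about the bird, and each leaf is a classification.
def classify_bird(quacks: bool, neck_length_cm: float,
                  body_length_cm: float) -> str:
    if quacks:                    # Question 1: did it quack?
        return 'duck'
    if neck_length_cm > 40:       # Question 2: is the neck long?
        return 'swan'
    if body_length_cm > 60:       # Question 3: big bird, short neck?
        return 'goose'
    return 'duck'                 # default leaf
```

Each path from the top to a leaf is one possible sequence of questions and answers.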
--- ### Decision trees classify by asking the right sequence of questions * Ask a question, then ask a different question based on the first one, then ask another... - If the tree is well made, we should find the answer * Often, we randomize the trees and find an *ensemble* of trees which gives the best results (a "Random Forest") --- ### Let's talk about another approach! --- ## Support Vector Machines! --- Back to the waterfowl! ---
--- ### Your Kayaking Relative has taken a hands-on approach to classification * You are now receiving texts with bill length and body length measurements for birds * The question is "Swan, or Duck?" ---
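### Sketch: Birds as Points in Feature Space

Each text message gives us a bird as a point in a two-dimensional feature space, labeled with the answer. Before the approach on the next slides, here's a minimal brute-force pass (a nearest-neighbor sketch, not the deck's method): classify a new bird by whichever labeled sighting it sits closest to. All the measurements are made up.

```python
import math

# Hypothetical texts from the kayak: (bill cm, body cm, label)
sightings = [(4.5, 40.0, 'duck'), (5.0, 45.0, 'duck'),
             (10.0, 140.0, 'swan'), (11.0, 150.0, 'swan')]

def nearest_neighbor(bill: float, body: float) -> str:
    """Classify a new bird by the closest labeled sighting."""
    def distance(s):
        return math.hypot(bill - s[0], body - s[1])
    return min(sightings, key=distance)[2]

print(nearest_neighbor(4.8, 42.0))   # prints 'duck'
```

This works, but it keeps every data point around forever; the next slides show a method that learns a single boundary instead.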
--- ### Support Vector Machines * Look at all the data in an n-dimensional space * n is the number of features * Try to find a hyperplane with the best separation - A hyperplane is just a line in many dimensions * This hyperplane is delineated by the *support vectors* * Classification is just seeing where the new data is relative to that line ---
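### Sketch: Which Side of the Line?

That last step, checking where a new point falls relative to the hyperplane, is just arithmetic. Below, the weights and offset are assumed to have already been found by training; the numbers are invented for illustration.

```python
# With two features (bill length, body length), the hyperplane really is
# a line. Classifying a new bird is one dot product: which side of the
# line does the point land on?

# Hypothetical learned parameters: one weight per feature, plus an offset.
w = (0.5, 0.1)
b = -8.0

def classify(bill: float, body: float) -> str:
    score = w[0] * bill + w[1] * body + b
    return 'swan' if score > 0 else 'duck'
```

Training is the hard part (finding the w and b with the widest margin); once they're found, every classification is this cheap.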
--- ### Support Vector Machines * SVMs have historically been *very* powerful * ... and anything that beats them is usually really complex or opaque * They're a great choice for transparent machine learning --- ### Neural Networks and "Deep Learning" - Better results - Less worrying about features - Massive complexity - **Zero transparency** - That's soon! --- ### ... So, how do we use any of this? --- # The Supervised Machine Learning Workflow --- ### The Process - 1: Collect Data - 2: Label Data with the classes of interest - 3: Find Features in the data which might be useful - 4: Train the algorithm on some of the data - 5: Test the algorithm and check the predictions - 6: Repeat with modifications until you're done --- ### 1) Collect Data - More data is generally better - Representative data is good - Diverse data is ideal - Balance of classes is helpful --- ### 2) Label Data with the classes of interest - Finding the classes can be half the battle - Make boundaries between the classes clear --- ### 3) Find Features which might be useful - You want to present information that the algorithm would find useful about the data - Beak size and bird size, but for language - Garbage in, garbage out - Finding features can often be a huge part of the battle. - **Deep neural nets can find their own features** - This is a *huge advantage*, and a huge source of opacity --- ### 4) Training the algorithm - First, hold back a chunk of your data for testing - Present the labeled training data for the algorithm to learn the patterns - This allows the algorithm to do the statistical analysis and find the patterns in the features which mean that this data point is class X. - You'll then have a 'trained model' which can be fed new data to get predictions --- ### 5) Test and check the predictions - Feed in the 'test' data and get predictions, then look at the accuracy - What's your motivation here? - Engineering: "Which algorithm works best to do the task given my constraints?" 
- Research: "Which features best represent the classes and provide the best information?" - Failure mode analysis tells you what your model is failing to capture, and suggests features or weaknesses in the data --- ### 6) Repeat with modifications until you're done - Go back and change the data, labels, features, and algorithm in a way that you think will help. - If it improves results, great! - If not, whoops! Try again. - Algorithms often have parameters that you can tune for better results - You'll eventually hit a point of diminishing returns/funding and stop - There are many values of 'done' --- ## There are a *lot* of choices here --- ### Some of the many choices - What kind of data? - What kind of features? - What kind of algorithm? - What kind of testing? - What to tweak? --- ### Machine Learning is often considered a Dark Art
--- ### ... but it's a *really* useful dark art --- # Machine Learning in the world --- ### Machine Learning is *everywhere* at the moment - One of the fastest growing fields in computing and programming - This is a place where jobs are, if you like vector calculus - Every time somebody says "AI", they just mean ML - Everybody wants their technology to be smarter - (Or at least appear smarter) --- ### Some well-known ML tasks - "Is this email spam, or not?" - "Is my car about to crash?" - "Did it just crash?" - "Should we lend this person money?" - "Is this handwritten symbol "1" or "2" or "3" or...?" - "Is this word a noun, or a verb, or an adjective, or...?" --- ### As well as every part of the NLP pipeline - Answering our 'How long to get home' query involves a series of machine learning models - Models feeding models feeding models - There are few parts of natural language processing that *aren't* machine learning. --- ### Soon, we'll hear about Deep Neural Networks - This is the 'Deep Learning' everybody keeps talking about - Deep neural networks are still statistical machine learning models - They don't require *any* feature extraction, and they're wildly powerful - They're also very computationally expensive to run - ... so your phone, computer, and many other devices now have dedicated processors for training and using them - ... and they're completely opaque --- ### Wrapping up - Machine learning is everywhere - There are many algorithms, but they all have similar basic workflows - It's absolutely fascinating - You'll need good language data to be able to do any of it! --- ## For Next Time - Read the Fromkin_Phonetics.pdf chapter on Canvas ---
Thank You!