---

### This talk is available at: [http://savethevowels.org/talks](http://savethevowels.org/talks)

---

## The Joy of Natural Language Data

### Will Styler - Business Intelligence

---

## Today's Talk:

### Part 1: Why you *badly* want to use natural language data.

### Part 2: Why natural language data is really difficult to use.

---

# Part 1

## Why you *badly* want to use natural language data.

---

## There's a *lot* of natural language data out there.

* 644 million active websites
* Mayo Clinic enters 298 million patient records per year
* 58 million Tweets per day
* 294 billion emails sent daily
([Source](http://email.about.com/od/emailtrivia/f/emails_per_day.htm))
* Recorded phone calls, blog posts, Facebook updates...
* ... and that's just the digital stuff

---

## The Problem: Humans are inefficient and expensive.

---
---

### So let's enjoy a fantasy...

---

You have a computer which understands human language as well as it does computer language

---
---

### (minus the evil)

---

## What could analysis of natural language data do for you?

---

## Speech Recognition

"Ask people why they're calling, and connect them to the right department based on their answer."

"Flag all tech support conversations where the customer mentions a competitor"

"Be Siri, but good."

---

## Analysis of secondary speech characteristics

"Redirect all angry-sounding customers to higher-tier support workers" (Speech emotion detection)

"Are the two people in this Skype call flirting, arguing, expressing love, or sadness? Target post-session ads accordingly."

"I want to talk to... billing?" (Uncertainty analysis)

"Yeah, I really like going to Applebee's." (Spot-the-sarcasm)

---

## Data Aggregation

“Watch Twitter and give me the locations of wildfires, floods, etc, and provide information about damage, shelters and resources in an easy-to-read format” (EPIC)

“Read every news article about the Ukrainian Revolution and present the information on a cohesive timeline, with sources labeled.” (RED)

“Collect all case-law involving reverse mortgages in the state of Florida in which the plaintiff's children filed suit against the mortgage company”

---

## Authorship attribution and stylistic analysis

“Examine these two written passages/books and tell me whether they were both written by the same person” (Authorship Attribution Analysis)

"Examine these negative reviews and tell me what demographic the authors likely represent based on the language used."

"Are these critical forum posts written by the same person?"

---

## Predictive analysis of text

“Look for any information in the newswire which will predict a change in this company's stock price, then buy or sell stock automatically.”

“Based on this person's Facebook post history, how likely is she to click an ad for weight-loss pills?”

"Based on all the political posts and tweets in Westminster compared to those in Pueblo, how likely is this senator to lose in a recall election?"

---

## Sentiment Analysis

---

### A Case Study

Advertisers want their ads to be relevant

They want to show ads related to topics and products people enjoy.

Using these principles, Facebook is trolling Will.

---

## The easy approach

Keywords == Mentions, Mentions == Interest

"Scan each Facebook post for certain keywords. If they appear, show ads for related products and topics."

---

### How Facebook reads Will's Facebook Posts

"blah blah blah blah blah ***marijuana***, blah blah blahbity blah blah blah"

"blahbity blahblah, ***pot***, blah blah blah blah blah ***Amendment 64*** blahbity blah blah blah"

---

### Facebook's choice in Ads for Will

---
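The easy approach can be sketched in a few lines. This is a minimal illustration only: the `AD_KEYWORDS` table and the `pick_ads` helper are hypothetical names made up for this slide, not Facebook's actual system. Note that the code only checks whether a keyword occurs, so every mention counts as interest.

```python
# Hypothetical keyword-to-ad table (illustrative, not a real ad inventory).
AD_KEYWORDS = {
    "marijuana": "dispensary ads",
    "pot": "dispensary ads",
    "amendment 64": "dispensary ads",
}


def pick_ads(post):
    """Return the set of ads triggered by any keyword found in the post.

    Deliberately naive: a plain substring check, with no notion of
    whether the author likes or hates the topic.
    """
    text = post.lower()
    return {ad for keyword, ad in AD_KEYWORDS.items() if keyword in text}


# The two posts as the keyword matcher "reads" them.
posts = [
    "blah blah blah blah blah marijuana, blah blah blahbity blah blah blah",
    "blahbity blahblah, pot, blah blah blah blah blah Amendment 64 blahbity blah blah blah",
]

for post in posts:
    print(pick_ads(post))
```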
---

One tiny problem...

### Marijuana smoke makes Will's throat swell shut.

---

### Will's actual Facebook Posts

"Ugh, my neighbor was smoking ***marijuana*** on his deck again, I had to use my inhaler and take some Benadryl to keep breathing."

"I hate the amount of public ***pot*** smoking in Colorado. Ever since ***Amendment 64***, I can't go to a concert without risking my life"

---

### Facebook's advertisers are getting ripped off

Presenting topical ads to people who hate those topics is a waste of money

This is easily preventable.

---

### How could we solve this problem?

---

## Sentiment Analysis

“How often, in this corpus of blogs, do people say nice or awful things about product X?”

"We've just leaked a picture of our next supercar. How do people on Twitter like the design?"

"What are people saying about our leaked $199.99 pricepoint?"

"How do people on these forums feel about 9/11?"

---

## Temporal Analysis and Event Discovery

---

### A Case Study

Many hospitals around the country are switching to Electronic Medical Records (EMRs).

These records are easily available within the institution, and contain lots of valuable data.

Creating timelines is incredibly time-consuming for humans, as is comparison.

What if machines could do this for us?

[The THYME Project](https://clear.colorado.edu/TemporalWiki/index.php/Main_Page)

---

### “The patient developed a mild post-surgical rash, which was treated with hydrocortisone at the follow-up”

Sequence of events:

1) Surgery
2) Mild rash
3) Hydrocortisone, Follow-up (overlapping)
4) No more rash

---

### If a computer can be taught to interpret time in medical records, we can ask...

"I have 30 seconds to learn this patient's history. Go."

“How often do patients have heart attacks within 2 years of starting Vioxx?”

“How many people who have a facelift develop persistent facial numbness?”

“How long do patients usually live following diagnosis of Glioblastoma?”

“Is there a correlation between the administration of vaccines and the development of autism?”

**[(No, damnit.)](http://www.webmd.com/brain/autism/news/20110105/bmj-wakefield-autism-faq?print=true)**

---

### Temporal reasoning is important

Humans interpret time naturally, and make reference to it often.

Temporality interacts with causality in interesting ways.

Event detection and reasoning is useful in a variety of domains.

"What happened" is a very fundamental question that everybody wants answered.

---

## Analysis as a service

“Given this large sample of a child’s speech, is the child likely to be autistic?” (Current research at the LENA foundation in Boulder)

"Scan online white-supremacist forums for anything which looks like a threat against the President" (The US Secret Service)

“Watch these websites being used by Islamist groups and look for specific language usage patterns that predict violent and radical behavior.” (All sorts of defense department grants)

“Read every email, looking for threats or discussion of terrorist attacks on American soil.”

---
---

### What could natural language data do for you?

---

## “Examine any question given to you and provide an answer to it, based on your training data”
---

### [http://www.youtube.com/watch?v=7kOEmupSHB8&feature=relmfu](http://www.youtube.com/watch?v=7kOEmupSHB8&feature=relmfu)
---

## Jeopardy “Watson” challenge

Final Score:

Ken Jennings: $24,000

Brad Rutter: $21,600

**Watson: $77,147**

---
---

### ... not so fast

---

# Part 2

## Why natural language data is really difficult to use.

---
---

# No.

### Not yours.

---

## NLP - Natural Language Processing

Teaching computers to “understand” human language

---

## NLP in ∞ easy steps:

1) Get a corpus and annotate it to tell the computer what’s going on

---

# Common Annotation Types

---

### Syntactic Annotation (e.g. Penn Treebank)

---
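Penn Treebank-style syntactic annotation is commonly written as nested, labeled brackets. A minimal sketch of how such an annotation can be read into a tree, assuming a hand-written, purely illustrative bracketing (not taken from the Treebank itself) and hypothetical helpers `parse_brackets` and `words`:

```python
import re

# Hand-written, illustrative bracketing in Penn Treebank style (not real corpus data).
ANNOTATION = "(S (NP (NNP John)) (VP (VBD hit) (NP (DT the) (NN tree))))"


def parse_brackets(text):
    """Parse a labeled bracketing into nested lists of the form [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1                      # consume "("
        node = [tokens[pos]]          # constituent or part-of-speech label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse_node())   # nested constituent
            else:
                node.append(tokens[pos])    # a word
                pos += 1
        pos += 1                      # consume ")"
        return node

    return parse_node()


def words(node):
    """Collect the words at the leaves of the tree, left to right."""
    leaves = []
    for child in node[1:]:
        if isinstance(child, list):
            leaves.extend(words(child))
        else:
            leaves.append(child)
    return leaves


tree = parse_brackets(ANNOTATION)
print(tree[0])        # top-level label
print(words(tree))    # the sentence, word by word
```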
---

### Semantic roles (e.g. PropBank)

"John hit the tree with a frozen trout"

**To Hit** - Hitter, Thing Hit, Instrument, Manner

Hitter: John

Thing hit: Tree

Instrument: Frozen Trout

Manner: Not given

---

### Named-entity (e.g. UMLS)

“The patient developed a mild post-surgical rash, which was treated with hydrocortisone at the follow-up”

rash == "Disorder"

surgery == "Procedure"

hydrocortisone == "Chemical/Drug"

patient == "Person"

---

### Coreference/Anaphora

"John adopted a panda from his local shelter. He was dismayed when it ate his cat, but they refused to take it back from him. The ferocious beast spent the rest of his life with the now-catless man."

---

### Sentiment Annotation

"Ugh, I *HATE* Comic Sans, and Papyrus is OK, but man, Helvetica is awesome!"

UGH = STRONG NEG, Comic Sans

HATE = STRONG NEG, Comic Sans

OK = NEUTRAL, Papyrus

AWESOME = STRONG POS, Helvetica

---

### Project-specific Annotations

Clinical Element Models

Naval Engagement Reports

---

### Annotation Pitfalls

If humans can't extract meaning reliably, there's no way a computer can.

Too much inference means un-extractable data

---

## NLP in ∞ easy steps:

1) Get a corpus and annotate it to tell the computer what’s going on
2) “Train” the computer by letting it analyze that corpus
3) Give it a different corpus, have it try to guess what’s going on and answer questions
4) Refine the programming, refine the annotation, then get another corpus and annotate it to tell the computer what’s going on... (Repeat)

---

### Why is this so hard?

---

# Natural Language is difficult at every level

---

## Speech

Speech is a convenient cover for widespread telepathy. ([CITATION NEEDED])

---

### You understand me right now

---
---

### No two people sound alike, even saying the same things

---

### The right answer depends on the context.

---

"Bring me the bat, man"

---

"Bring me the Batman"

---

### Speech recognition is spectacularly good, but nowhere near good enough.

---

## Modality

Did something happen? Is it real?

---

### Modality is difficult

"The compound might be bombed"

"If they attack, we'll bomb the compound."

“The general stated that bombing the compound overnight ‘was still an option’”

“We may conduct a bombing at 0300”

“We will conduct a bombing at 0300”

“We conducted a bombing at 0300”

---

## Coreference/Anaphora

Linking subsequent mentions of items and concepts to one another.

---

### Coreference is difficult

“The Bay Harbor Butcher is off the streets, as Dexter Morgan, the alleged killer, was arrested by police over the weekend”

“Bill Clinton was the President of the United States in 1999. Now Barack Obama is POTUS.”

---

## Metonymy

Using a word to refer to a practically or metaphorically related concept

---

### Metonymy is difficult

"The terrorist built a **pipe bomb**"

"The **pipe bomb** interrupted the festival"

"**200mg of Loperamide** stopped her diarrhea"

"**Moscow** condemned the latest round of sanctions"

---

## Causality

Did one event trigger or cause the next event?

---

### Causality is difficult

“The dam burst when the rockslide hit it.”

“The over-full dam burst when the rockslide hit it.”

“She pulled the trigger, firing the gun and killing the man.”

“The bombing destroyed the buildings”

“The earthquake destroyed the buildings”

---

## Temporal Expression Normalization

The process of linking relative dates to absolute, calendar dates

---

### Temporal Expressions can be difficult

“The bombing occurred 2/13/12 at 0214”

“Next Tuesday, she’ll come in for a follow-up”

“She’s been having trouble sleeping lately.”

“She should expect soreness postoperatively.”

“TSA regulations have grown increasingly restrictive Post-9/11”

---

## Temporal Relations

Linking and arranging different events as part of a greater timeline

---

### “The patient developed a mild post-surgical rash, which was treated with hydrocortisone at the follow-up.”

---

### “The patient developed a mild post-surgical rash, which was treated with hydrocortisone at the follow-up, many years after Napoleon's exile to Elba.”
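---

Adding one unrelated event adds a possible relation to every event already on the timeline. A rough sketch of how the number of candidate event pairs grows, assuming each pair of events is one relation to decide; the event list is read off the example sentence, and the `unordered_pairs` helper is a hypothetical name used only for illustration:

```python
from itertools import combinations

# Events from the example sentence, plus the unrelated historical one.
events = [
    "surgery",
    "rash",
    "hydrocortisone",
    "follow-up",
    "Napoleon's exile to Elba",
]


def unordered_pairs(n):
    """Number of distinct pairs among n events (n choose 2)."""
    return n * (n - 1) // 2


pairs = list(combinations(events, 2))
print(len(pairs))            # pairs among the five events above
print(unordered_pairs(60))   # distinct pairs among 60 events
print(60 * 60)               # ordered pairs among 60 events, counting each event with itself
```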
---

### Every event in the history of the universe is temporally related to every other event in the history of the universe.

NumRelations = (NumEvents)²

60 Events == 3600 valid Temporal Relations

---

# There's often too little information

---
---

## Doctors hate us.

“We biopsied the colon, the results were negative”

“Noted postoperative scarring.”

“She does not want a colonoscopy, which she had in the 70’s and did not enjoy.”

“History of Pneumonia, Asthma, h/x diverticulitis, MS”

“s/p lap appy conv. open, Lungs c/ausc, A&Ox3”

“Resected Invasive Grade 3 of 4 Adenocarcinoma (AJCC 7th PT4N1bMX).”

---

# Everybody else does, too!

"Gold covered the miner's hands"/"Gold paid for the miner's education"

“The Queen of England’s hat was purple”

“We gave the monkeys the bananas because they were ripe”

“We gave the monkeys the bananas because they were hungry”

“Time flies like an arrow, fruit flies like a banana”

“The old man returned to his house was happy”

---
---

# Not yet.

---

## Hooray!

---

# The joy of natural language data

---

## Business wants this data

* "What are people saying about my products?"
* "What do people want next?"
* "How can I provide better products?"
* "How can I provide better service?"

"How can we save time by letting machines do the work?"

---

## ... but natural language doesn't want to give it up

* Annotation and Training are complex and time-consuming
* Speech is crazy-complex
* Meaning is person-specific, and very strange
* The meaning and linguistic phenomena in speech are tricky
* You never have enough information

---

## You and language data can have a happy, loving relationship

All you need to do is ask lots of questions, pay attention to what your data tells you, and always think carefully about what you’re asking it.

... but remember, you're gonna need to work for it if you want to get things done.

---
---

### This presentation is available at: [http://savethevowels.org/talks](http://savethevowels.org/talks)

---

### Thank you!

---