Update: Study List & Anki Integration Added

If you've completed at least 2 rounds of the quiz, you can now add words you don't know to your "study list," which saves the word, its context sentence, and any explanations that you mark as helpful. You can then download this data to import into Anki or any other tool that supports the CSV format. This is an experimental feature, and we will continue making improvements as we gather feedback.
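
As a rough illustration of what such an export could contain, here is a minimal Python sketch that writes study-list entries to CSV. The field names (`word`, `context`, `explanations`), the `study_list_to_csv` helper, and the sample entry are all assumptions made for illustration; the layout of the file you actually download may differ.

```python
import csv
import io


def study_list_to_csv(entries):
    """Serialize study-list entries to CSV text, one row per word.

    Hypothetical helper: the three-column layout is an assumption,
    not the site's actual export format.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for entry in entries:
        writer.writerow([
            entry["word"],                      # the unknown word
            entry["context"],                   # sentence it appeared in
            " / ".join(entry["explanations"]),  # explanations marked helpful
        ])
    return buffer.getvalue()


# Made-up sample entry, for illustration only.
sample = [
    {
        "word": "도서관",
        "context": "저는 주말마다 도서관에 갑니다.",
        "explanations": ["책을 읽거나 빌리는 곳"],
    },
]

csv_text = study_list_to_csv(sample)
print(csv_text)
```

With a layout like this, each row would become one note when imported into Anki, with the columns mapped to note fields during import.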

Update: Example Sentences Added

One of the most requested features, example sentences, has been added to the explanations. Just click on the "예문" link next to a definition, and you'll get up to 10 example sentences.

300 Korean Learners in 1 Week!

We've had almost 300 people participate in the quiz in just under a week, and we've gotten a lot of data and feedback, most of it very positive.

Based on what we've gotten so far, we've made some changes, the most important ones being:

  • Faster Elaborations - Most of the time, the explanations should now pop up in under a second.
  • Easier Essays - The data suggests that the leveling algorithm was giving essays that were a bit too hard for most readers, even with the elaborations enabled. So, we've made the essays a bit easier, but we may still need to calibrate the algorithm a bit more to find the sweet spot.

We've also put together a list of changes we plan to develop in the near future:

  • Example sentences for explained words - We have the data for this already, so it's mostly a matter of updating the UI. This should be quick to develop.
  • "Cheat" button to display English translations - For those times when the reader can't figure out the word based on the explanations. This would be helpful to both the reader and us, since it would give us concrete feedback about when the Korean explanations are insufficient.

Finally, these items are high-priority, but will take longer to develop:

  • Better images - I have a plan in mind to curate a much more consistently useful database of images to explain Korean words, but it won't be ready for a few months, at least.
  • Hanja - Some have suggested listing the hanja that a word is derived from, to give learners a clearer idea of how words are related to each other and to provide readers with clues for unfamiliar words made from the same hanja.

If you haven't tried the new quiz format yet, give it a go. The more data we collect, the more we can improve the system's elaborations.

Open Beta Launching

Today we are launching an open beta of our reading tool. We hope to collect data that will help us improve our core algorithms and make the elaborations we generate more useful to learners. You can read more about it here or try it out yourself!

LREC Paper Published

For those interested in the results of the unknown word survey last year, you can read our published paper here.

LREC Paper Accepted

Our paper that was submitted to LREC has been accepted. I will post a link here when the final draft is available.

October Update

We've had a productive couple of months.

In August, we visited several language schools in Korea to get their help recruiting students for our survey, especially those speaking Chinese, Vietnamese, and Japanese. We got about 100 new submissions from speakers of these languages as a result (about double what we had before that).

In September we wrote our first paper about the research, describing our methodology for collecting data for our system and evaluating the models we were able to produce as a result. The paper has been submitted to LREC 2018. We should hear back in a few weeks or so.

As part of that paper, we are publishing the first round of data that we've collected from the survey. The data is intended to help other researchers build similar kinds of applications. It was stripped of usernames and any other potentially identifying details before publication. You can download the data here.

Currently, I am working on a prototype of the elaboration engine itself. I'm hoping to have a functioning prototype within a couple of weeks.

Studying French

I came across a paper about a study similar to my research that was done for French. For the academically inclined, you can read the paper here:

Anais Tack, Thomas Francois, Anne-Laure Ligozat, and Cedrick Fairon. Evaluating Lexical Simplification and Vocabulary Knowledge for Learners of French: Possibilities of Using the FLELex Resource. LREC 2016.

http://www.lrec-conf.org/proceedings/lrec2016/pdf/544_Paper.pdf

Phase 2 Survey & Site Updates

Phase 2 Survey

We've reached almost 2,000 submissions! Not to sound like a broken record, but that's pretty amazing! Thank you all for all your help.

I'm continuing to improve the prediction models I discussed in my last post, and the results so far have been promising.

My experiments suggest, though, that I've gotten about as much data as will be useful from this first survey: collecting more data with this exact survey will contribute only marginal improvements to the model. So, I'm working on my phase-2 survey. The first survey had a very limited selection of texts (about 240 essays). The new survey will draw from a larger set of texts, which will allow me to study a greater variety of words and grammatical features. It will also allow me to collect data that will be used to generate elaborations, in addition to predicting unknown words. The new survey will be launched in August.

Website Updates

Over the last several weeks, I've added a number of new features:

  • New Languages - Japanese and French translations of the user interface have been added to the website. This is especially important because we want more contributors with a Japanese language background.
  • Study Tools - Dictionary lookups have been added to the My Submissions page. You can right-click on an unknown word to look it up on Naver, Daum, or Wiktionary. I hope this will make it easier to use this site as a study tool, so that contributors can get some extra value from it. In a future update, I'd like to add Anki integration as well.

Dictionary Lookup Screenshot

Predicting Unknown Words

Note: This update is a little bit on the technical side. If that doesn’t scare you, read on…

A New Model For Predicting Unknown Words

One of the main goals of the first phase of this project is to build a model that can predict the likelihood that someone at a given level will know a given Korean word. This is important not just for predicting which words need additional explanation, but also for writing those explanations in words the reader will understand.

In my last post, I showed you an early, primitive model I built using word frequency. I also said that I was working on a much smarter model. It’s time to present some of my progress.

I will not go into the technical details, but using probability theory and the ~35,000 words marked from the survey responses, I was able to produce the new model illustrated below. As before, the red dots represent unknown words, and blue dots represent known ones.

New Model Illustration

As we can see, the old model was a bit of a confused mess. In the new model, by comparison, the red and blue dots are much better separated, meaning the model is much better at predicting what words readers do and don’t know.

With this new model, we can predict which words readers won't know with much more accuracy. In particular, if this model knows someone's Korean level, then it can predict roughly 80% of the words in an essay that they won't know before they even see that essay. If we can generate elaborations for all those words, this could potentially increase their understanding of the text from knowing 90% of the words to knowing 98% of the words. Research tells us that this is the difference between being able to understand almost nothing and being able to understand almost all of a text (Hirsh and Nation 1992).
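
The arithmetic behind the 90% to 98% figure can be checked with a short Python sketch. The function name and parameters are mine, and it makes one simplifying assumption: every predicted unknown word gets an elaboration that fully explains it.

```python
def coverage_after_elaboration(known_fraction, recall):
    """Fraction of words understood once predicted unknown words are explained.

    known_fraction: share of the words the reader already knows (0 to 1).
    recall: share of the reader's unknown words that the model predicts,
            and that therefore receive an elaboration (0 to 1).
    Assumes each elaboration fully explains its word.
    """
    unknown_fraction = 1.0 - known_fraction
    return known_fraction + recall * unknown_fraction


# A reader who knows 90% of the words, with 80% of the rest predicted.
coverage = coverage_after_elaboration(known_fraction=0.90, recall=0.80)
print(f"{coverage:.0%}")  # prints "98%"
```

In other words, the reader's 10% of unknown words shrinks to 2%, because 80% of that 10% (8 percentage points) is covered by elaborations.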

The main drawback of this model is that it only works for words that appear at least once among the essays in the survey, which together contain about 3,500 families of related words. There are thousands of words that aren't included in the survey yet. So, my next step is to build a more general model that will work for any Korean word. I've already made good progress towards that goal. I'll post an update on that in another week or so.
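
To make that limitation concrete, here is a minimal Python sketch of a purely count-based estimator. It is not the model described in this post; it is a simple Laplace-smoothed frequency estimate that I'm using as a stand-in, and the class name, the integer `level` values, and the sample words are all hypothetical. Like any estimator built only on survey counts, it has nothing to say about a word that never appeared in the survey.

```python
from collections import defaultdict


class CountBasedWordModel:
    """Estimates P(unknown | word, level) from marked survey responses.

    Illustrative stand-in only: a Laplace-smoothed frequency estimate,
    not the model described in the post.
    """

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        # (word, level) -> [times marked unknown, times seen]
        self.counts = defaultdict(lambda: [0, 0])
        self.seen_words = set()

    def observe(self, word, level, unknown):
        """Record one reader at `level` seeing `word` in an essay."""
        record = self.counts[(word, level)]
        record[0] += int(unknown)
        record[1] += 1
        self.seen_words.add(word)

    def probability_unknown(self, word, level):
        """Return the smoothed estimate, or None for words never surveyed."""
        if word not in self.seen_words:
            return None  # no data at all: the drawback described above
        unknown, total = self.counts.get((word, level), (0, 0))
        return (unknown + self.smoothing) / (total + 2 * self.smoothing)


model = CountBasedWordModel()
model.observe("도서관", level=2, unknown=True)
model.observe("도서관", level=2, unknown=True)
model.observe("도서관", level=2, unknown=False)

seen_estimate = model.probability_unknown("도서관", level=2)    # (2 + 1) / (3 + 2)
unseen_estimate = model.probability_unknown("경제", level=2)    # never surveyed
print(seen_estimate, unseen_estimate)
```

A more general model would need features that exist for any word, such as its frequency in a large corpus, rather than relying only on counts from the survey itself.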

Also, new features are being added to the website, which I will post about within the next few days.

References

  • David Hirsh and Paul Nation. What vocabulary size is needed to read unsimplified texts for pleasure? Reading in a Foreign Language, 8(2):689–696, 1992.
