Post by account_disabled on Feb 27, 2024 4:46:09 GMT -5
The first goal was to identify good tags: tags that deserved to exist as an indexed page in search results. This also entailed identifying a master tag to represent groups of similar terms.

Identify bad tags
We wanted to isolate tags that should not appear in our database due to misspellings, duplicates, poor formatting, high ambiguity, or a likelihood of producing a low-quality page.

Relate bad tags to good tags
We assumed many of our initial bad tags would be variants of good tags, i.e. plural/singular forms, technical/slang terms, hyphenated/non-hyphenated spellings, conjugations, and other stems. There could also be two phrases that refer to the same thing, like "Yorktown ship" vs. "USS Yorktown".
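The variant relationships above (plural/singular, hyphenation, and so on) can be sketched with a crude normalization pass. This is an illustrative stand-in, not the project's actual code: the function names and the trailing-'s' rule are assumptions, and a real pipeline would use proper stemming or lemmatization.

```python
import re
from collections import defaultdict

def normalize(tag):
    """Crude canonical form: lowercase, join hyphenated words,
    drop punctuation, collapse whitespace, and strip a trailing
    plural 's'. A production pipeline would use real stemming."""
    t = tag.lower().replace("-", "")
    t = re.sub(r"[^a-z0-9 ]", "", t)
    t = re.sub(r"\s+", " ", t).strip()
    if t.endswith("s") and len(t) > 3:
        t = t[:-1]
    return t

def group_variants(tags):
    """Map each canonical form to the raw tags that collapse into it,
    so one member can be promoted to the master tag."""
    groups = defaultdict(list)
    for tag in tags:
        groups[normalize(tag)].append(tag)
    return dict(groups)
```

For example, `group_variants(["Photographs", "photograph", "photo-graphs"])` collapses all three spellings under a single canonical key, from which a master tag can be chosen.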
We needed to identify these relationships for every bad tag. For the project that inspired this post, our sample tag database comprised so many unique tags that accomplishing this manually was nearly impossible. While in theory we could have leveraged Mechanical Turk or a similar platform for manual review, early tests of this approach proved unsuccessful. We needed automated methods, in fact, that we could later reproduce when adding new tags.

The methods
Keeping in mind the goal of identifying good tags, labeling bad tags, and relating bad tags to good tags, we employed more than a dozen methods, including spell correction, bid value, tag search volume, unique visitors, tag count, Porter stemming, lemmatization, Jaccard index, Jaro-Winkler distance, Keyword Planner grouping, Wikipedia disambiguation, and K-Means clustering with word vectors.
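Two of the string-similarity measures named above can be sketched with the standard library alone. The Jaccard index here is computed over character bigrams; since Jaro-Winkler needs a third-party package (e.g. jellyfish), `difflib.SequenceMatcher` is used below as a stdlib stand-in, and the cutoff values are illustrative, not the project's tuned thresholds.

```python
from difflib import SequenceMatcher

def char_ngrams(s, n=2):
    """Set of character n-grams (bigrams by default)."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b, n=2):
    """Jaccard index: |A ∩ B| / |A ∪ B| over character n-grams."""
    A, B = char_ngrams(a, n), char_ngrams(b, n)
    return len(A & B) / len(A | B) if A | B else 1.0

def likely_duplicates(a, b, jac_cut=0.5, seq_cut=0.8):
    """Flag a tag pair as probable duplicates if either measure
    clears its (illustrative) cutoff."""
    seq = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return jaccard(a, b) >= jac_cut or seq >= seq_cut
```

A near-duplicate pair such as "photograph" / "photographs" scores high on both measures, while unrelated tags share few or no bigrams.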
Each method either helped us determine whether a tag was valuable or, if it was not, helped us identify a valuable alternate tag.

Spell correction
Method: One of the obvious issues with user-generated content is the occurrence of misspellings. We would regularly find misspellings where a semicolon is typed in place of the letter L (adjacent keys), or where words have unintended characters at the beginning or end. Luckily, Linux has an excellent spell checker called Aspell, which we were able to use to fix a large volume of issues.

Benefits: This offered a quick early win.
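The post's tool of choice is Aspell (e.g. `aspell list < tags.txt` prints the words Aspell does not recognize). As a self-contained stand-in that doesn't require the external binary, `difflib.get_close_matches` can snap a misspelled tag to the nearest entry in a known-good vocabulary; the vocabulary below is hypothetical, and the cutoff is illustrative.

```python
from difflib import get_close_matches

# Hypothetical vocabulary of known-good tags; the real project
# corrected misspellings against its tag database using Aspell.
KNOWN_TAGS = ["photography", "landscape", "portrait", "wedding"]

def correct_tag(tag, vocab=KNOWN_TAGS, cutoff=0.8):
    """Return the closest known tag, or None if nothing is close
    enough (get_close_matches uses difflib's ratio internally)."""
    matches = get_close_matches(tag.lower().strip(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

So a typo like "phtography" is mapped back to "photography", while gibberish with no close match returns None and can be flagged for review.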