[MBS-8417] Wikipedia extract language fallback should be smarter - MetaBrainz Tickets

Type: Improvement
Resolution: Fixed
Priority: Normal
Fix Version/s: 2018-04-23
Affects Version/s: None
Component/s: Data display
Labels:
None

When there's no English page, the Wikipedia extract should be smarter about which language it picks as a fallback.

Ideas:

1. Use the Accept-Language header sent by the user's browser. If a user's browser sends "de", that suggests the user would understand a German extract.

2. Fall back to the most common/likely languages, on the assumption that more widely spoken and more frequently linked languages will be understood by more people. The following query groups languages by the order of magnitude of the number of links we had (the first column is floor(log10(number of links))):

musicbrainz_db_static=> select c, array_agg(d) from (select regexp_replace(url, E'https?://([^/]+).wikipedia.org/.*', E'\\1') as d, floor(log(count(*))) as c from url where url ~ 'wikipedia.org' group by d  order by c desc, d) as q group by c order by c desc;
 5 | {en}
 4 | {de,ja}
 3 | {da,es,et,fi,fr,it,ko,nl,pl,pt,ru,sv}
 2 | {ca,cs,cy,el,he,hu,id,lt,lv,no,ro,sk,sl,tr,uk,vi,zh}
 1 | {af,ar,ast,az,be-x-old,bg,br,bs,eo,eu,fa,fiu-vro,fo,fy,ga,gl,hi,hr,hy,is,ka,lb,ms,nap,nds,nds-nl,nn,oc,sq,sr,th,zh-yue}
 0 | {als,am,arz,bar,be,bi,bn,bo,ceb,ckb,co,ext,fur,gd,gn,gv,ht,kk,kn,ksh,ku,kw,ky,la,lad,li,ln,ltg,mg,mi,mk,ml,mn,mr,mt,myv,pa,pfl,pnb,qu,rmy,roa-tara,rw,sc,scn,sco,se,sh,simple,sw,ta,te,tg,tk,tl,to,tt,udm,ur,uz,vls,wa,yi,zh-classical,zh-min-nan,zu}
(6 rows)

English alone accounted for 78% of the links. Languages with more than 1000 links together account for 98%. The exact order of those by number of links was: en ja de fr fi it sv es ru pl nl pt et da ko. Including those in the list of fallbacks should drastically reduce the number of cases where we're picking entirely randomly.

3. Use country information to guess which languages would be most useful, e.g. prefer Japanese for an artist with the area set to Japan, or German for an artist with the area set to Germany. The reasoning is that if we haven't found a language the user says they can read, a language associated with the area the entity is from is more likely to have a good article with more information.
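
A rough sketch of how the three ideas above could be chained, in Python for illustration only (this is not the actual server code). The function names, the `area_languages` parameter, and the order in which the ideas are tried (browser languages, then the common-language list, then area languages) are all assumptions; the ticket does not fix an order. Browser tags are reduced to their primary subtag ("de-DE" becomes "de"), which is a simplification, and the last resort is the alphabetically first available language rather than a random one.

```python
from typing import Iterable, List, Optional

# Languages with more than 1000 links, in the order given in this ticket.
COMMON_LANGUAGES = ["en", "ja", "de", "fr", "fi", "it", "sv", "es",
                    "ru", "pl", "nl", "pt", "et", "da", "ko"]


def parse_accept_language(header: str) -> List[str]:
    """Return primary language subtags from an Accept-Language value, best first."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag or tag == "*":
            continue
        quality = 1.0  # a missing q parameter means q=1
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed weight as "not acceptable"
        if quality > 0:
            # Sort by descending quality, then by position in the header.
            weighted.append((-quality, position, tag.split("-")[0]))
    ordered: List[str] = []
    for _, _, lang in sorted(weighted):
        if lang not in ordered:
            ordered.append(lang)
    return ordered


def pick_extract_language(available: Iterable[str],
                          accept_language: str = "",
                          area_languages: Iterable[str] = ()) -> Optional[str]:
    """Pick the Wikipedia language to show an extract from (hypothetical helper).

    available       -- language codes that have a Wikipedia page for the entity
    accept_language -- raw Accept-Language header value from the browser
    area_languages  -- languages associated with the entity's area, supplied by
                       the caller (how to derive them is left open here)
    """
    pool = set(available)
    if not pool:
        return None
    candidates = (parse_accept_language(accept_language)
                  + COMMON_LANGUAGES
                  + list(area_languages))
    for lang in candidates:
        if lang in pool:
            return lang
    # Nothing matched: be deterministic instead of picking randomly.
    return min(pool)
```

For example, with pages available in "de" and "hu" and a browser sending "de-DE,de;q=0.9,en;q=0.8", this sketch picks "de"; with pages in "hu" and "ja" and no header, it falls through to the common-language list and picks "ja".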

is duplicated by

MBS-8420 Replacement of Wikipedia relationships with Wikidata relationships results in unexpected default Wikipedia abstract

Closed

Assignee: yvanzo
Reporter: nikki
Votes: 9
Watchers: 6
Created: 2015-06-02 12:37
Updated: 2018-04-23 19:07
Resolved: 2018-04-23 19:07

Version	Package
2018-04-23
