MusicBrainz Server / MBS-8417

Wikipedia extract language fallback should be smarter


    • Type: Improvement
    • Resolution: Fixed
    • Priority: Normal
    • 2018-04-23
    • None
    • Data display
    • None

      When there's no English page, the Wikipedia extract should be smarter about which language it picks as a fallback.

      Ideas:

      1. Use the Accept-Language header sent by the user's browser. If the browser sends "de", that suggests the user would understand a German extract.

      2. Fall back to the most common/likely languages, on the basis that more widely spoken and more commonly linked languages are going to be understood by more people. The following query groups the languages by the order of magnitude (floor of the base-10 logarithm) of the number of Wikipedia links we had:

      musicbrainz_db_static=> select c, array_agg(d)
                              from (select regexp_replace(url, E'https?://([^/]+).wikipedia.org/.*', E'\\1') as d,
                                           floor(log(count(*))) as c
                                    from url
                                    where url ~ 'wikipedia.org'
                                    group by d
                                    order by c desc, d) as q
                              group by c
                              order by c desc;
       5 | {en}
       4 | {de,ja}
       3 | {da,es,et,fi,fr,it,ko,nl,pl,pt,ru,sv}
       2 | {ca,cs,cy,el,he,hu,id,lt,lv,no,ro,sk,sl,tr,uk,vi,zh}
       1 | {af,ar,ast,az,be-x-old,bg,br,bs,eo,eu,fa,fiu-vro,fo,fy,ga,gl,hi,hr,hy,is,ka,lb,ms,nap,nds,nds-nl,nn,oc,sq,sr,th,zh-yue}
       0 | {als,am,arz,bar,be,bi,bn,bo,ceb,ckb,co,ext,fur,gd,gn,gv,ht,kk,kn,ksh,ku,kw,ky,la,lad,li,ln,ltg,mg,mi,mk,ml,mn,mr,mt,myv,pa,pfl,pnb,qu,rmy,roa-tara,rw,sc,scn,sco,se,sh,simple,sw,ta,te,tg,tk,tl,to,tt,udm,ur,uz,vls,wa,yi,zh-classical,zh-min-nan,zu}
      (6 rows)
      
      

      English alone accounted for 78% of the links. All languages with more than 1000 links together accounted for 98%. Ordered by number of links, those were: en ja de fr fi it sv es ru pl nl pt et da ko. Including those in the list of fallbacks should drastically reduce the number of cases where we pick a language entirely at random.

      3. Use country information to guess which languages would be most useful, e.g. prefer Japanese for an artist with the area set to Japan, or German for an artist with the area set to Germany. The reasoning is that if we haven't found a language the user says they can read, a language associated with the area the entity is from is more likely to have a good article with more information.
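
      The three ideas could be chained into a single fallback order. The following is only a rough Python sketch of that, not the server's actual code: the function names, the AREA_LANGUAGES mapping, the order in which the ideas are tried, and the final alphabetical tie-break are all assumptions made for illustration. Only the COMMON_LANGUAGES order is taken from the link counts above. Language tags are reduced to their primary subtag, so "de-CH" matches "de", which is a simplification (it would not match wikis such as zh-yue).

```python
# Hypothetical sketch of a fallback chain for choosing the extract language.
# Names and ordering of the steps are assumptions, not the server's code.

# From the ticket: languages with more than 1000 links, most linked first.
COMMON_LANGUAGES = ["en", "ja", "de", "fr", "fi", "it", "sv", "es",
                    "ru", "pl", "nl", "pt", "et", "da", "ko"]

# Hypothetical area -> languages mapping (only the ticket's two examples).
AREA_LANGUAGES = {"JP": ["ja"], "DE": ["de"]}


def parse_accept_language(header):
    """Return primary language subtags from an Accept-Language header,
    highest quality value first, without duplicates."""
    weighted = []
    for position, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        tag = fields[0].lower()
        if not tag or tag == "*":
            continue
        quality = 1.0  # default weight when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed weight as "not wanted"
        if quality > 0:
            # Sort by descending quality, then by position in the header.
            weighted.append((-quality, position, tag.split("-")[0]))
    ordered = []
    for _, _, lang in sorted(weighted):
        if lang not in ordered:
            ordered.append(lang)
    return ordered


def pick_extract_language(available, accept_language="", area_code=None):
    """Pick one language from `available` (languages that have a page)."""
    available = set(available)
    candidates = (parse_accept_language(accept_language)   # idea 1
                  + AREA_LANGUAGES.get(area_code, [])      # idea 3
                  + COMMON_LANGUAGES)                      # idea 2
    for lang in candidates:
        if lang in available:
            return lang
    # Nothing matched: pick deterministically instead of randomly.
    return min(available, default=None)
```

      For example, pick_extract_language({"ja", "fi"}, "en-US,en;q=0.9", "JP") would return "ja" under this ordering, because the browser's languages have no page and the area mapping is consulted before the common-language list.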

            yvanzo
            nikki
            Votes: 9
            Watchers: 6


                Version Package
                2018-04-23