Teknik Informatika/Tech News

Tech News: 2023-27

Blocked External Domains

Special BlockedExternalDomains admin view

A new feature for blocking specific external links on Wikipedia has been released. Details are available on Phabricator and on the documentation page at MediaWiki.org.

Special:LinkSearch

Special:LinkSearch can now search on the full URL. Previously, because of a bug, it matched only the first 60 characters of the URL entered as the search term.

Global AbuseFilter

Global AbuseFilter has been enabled on all wikis except English Wikipedia and Japanese Wikipedia. The feature is intended to counter long-term abusers (LTAs) who routinely engage in cross-wiki vandalism.

ChatGPT Plugin

The Wikimedia team is developing a Wikipedia ChatGPT plugin. The plugin is now entering beta testing.

“	To be able to test out the plugin without a ChatGPT Plus subscription, please send an email to futureaudiences wikimedia.org that includes the email address associated with your OpenAI account. I'll send back some further instructions on how you can enable the plugin and where to leave testing feedback. Thank you!!!	”
— MPinchuk (WMF)^[1]

Tech News: 2023-27

Audio links that play on click

“	For referencing audio files inline, such as pronounciation demonstrations, wikis have relied on linking to the raw file using `[[Media:..]]`. But not all browsers support playing the linked file, causing them to download the file instead of playing it. And even the browsers supports it, this is not user-friendly as it suddenly sends them to a different page with nothing but a player on it.	”
—Nardog (January 21, 2022) Audio links that play on click, Community Wishlist Survey 2022

“	As part of the rolling out of the audio links that play on click wishlist proposal, small wikis will now be able to use the inline audio player that is implemented by the Phonos extension.	”
— Tech News 2023-27

New feature: a tag that displays an audio player for pronunciation examples. For now, the feature is available only on wikis in the "small wiki" group (see the list of included wikis here).

One Indonesian-language wiki in the small-wiki group is Wikiquote. Let's try it there.

Example markup:

<phonos ipa="nʲihóɴ" file="Ja-nihon(日本).ogg" />

See the result here

MediaWiki 1.41/wmf.16

MediaWiki 1.41/wmf.16 will be deployed to all wikis on 6 July 2023.

Tech News: 2023-26

MediaWiki Link Database

“	MediaWiki's link database tables are among the largest tables of any WMF production database. It's one of the biggest tables for Commons, at 200GB, and will cause more issues in the future.	”
— Ladsgroup (July 9, 2022) Remove duplication in externallinks table phabricator.wikimedia.org

Every external link on Wikipedia is stored in a central database. As a result, the database keeps growing, to the point where it could burden Wikipedia's servers as a whole.

The proposed solution is to split the link database in two: a domain database and a path database.

For example, a database that originally looks like this:

DB_LINK_EKSTERNAL : 
1 : a.com/b
2 : a.com/d
3 : a.com/e
4 : b.com/f
5 : b.com/g

would be split like this:

DB_DOMAIN_EKSTERNAL : 
1 : a.com/
2 : b.com/

DB_PATH_EKSTERNAL : 
1 : 1 : b
2 : 1 : d
3 : 1 : e
4 : 2 : f
5 : 2 : g

This split saves a considerable amount of disk space, because the same domain string no longer has to be stored over and over in the database.

A side effect of the change is that every bare-domain URL on Wikipedia must have a "/" appended, so that it can be joined easily with its path. For example, if someone adds the URL abc.com, the Wikipedia server has to change it to abc.com/.
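The split described above can be sketched in a few lines of Python. This is only a toy model of the example tables in this article, not the actual MediaWiki schema: the function names `split_url` and `normalize` are made up for illustration, and the sketch ignores URL schemes and query strings.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> tuple[str, str]:
    """Split a URL into (domain with trailing slash, path without leading slash)."""
    # urlsplit only recognises the host when "//" is present,
    # so prefix bare inputs such as "a.com/b".
    parts = urlsplit(url if "//" in url else "//" + url)
    domain = parts.netloc + "/"      # a bare "abc.com" becomes "abc.com/"
    path = parts.path.lstrip("/")
    return domain, path


def normalize(links: list[str]):
    """Build a domain table and a path table from a flat list of links."""
    domain_ids = {}   # domain string -> domain id
    paths = []        # rows of (path id, domain id, path)
    for url in links:
        domain, path = split_url(url)
        # Reuse the existing id if this domain was seen before.
        domain_id = domain_ids.setdefault(domain, len(domain_ids) + 1)
        paths.append((len(paths) + 1, domain_id, path))
    return domain_ids, paths


links = ["a.com/b", "a.com/d", "a.com/e", "b.com/f", "b.com/g"]
domains, paths = normalize(links)
print(domains)               # {'a.com/': 1, 'b.com/': 2}
print(paths[3])              # (4, 2, 'f')
print(split_url("abc.com"))  # ('abc.com/', '')
```

Each domain string is stored once and every path row refers to it by id, which is where the space saving comes from. The trailing slash on the domain means a full URL can be rebuilt by plain concatenation of domain and path.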

The trailing-slash change is what the first item in Tech News 2023-26 describes:

“	The Action API modules and Special:LinkSearch will now add a trailing forward slash to all prop:extlinks responses for bare domains. This is part of the work to remove duplication in the externallinks database table.	”
—Tech News

“	API query prop:extlinks adds a trailing forward slash to returned results.	”
— Fastily (June 2, 2023) phabricator.wikimedia.org

Search was broken on Commons and Wikidata for 23 hours

“	Optimize the elasticsearch analysis settings for wikibase The analysis settings for wikibase may create a set of analyzers prefixed per language. Currently, it generates 1200+ analyzers and most of them are identical. It might perhaps make sense to quickly evaluate the perf gain of reducing the number of analyzers created on wikibase.	”
— dcausse (April 6, 2023) Optimize the elasticsearch analysis settings for wikibase phabricator.wikimedia.org

Elasticsearch is software for full-text search. Wikipedia (and the other Wikimedia sister projects) use Elasticsearch to provide their search feature.

To make text search more effective, Elasticsearch provides an "analyzer" module for each language. For example, there is a dedicated analyzer for English and another for Indonesian.

“	A set of analyzers aimed at analyzing specific language text. The following types are supported : arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.	”
— Elastic (2023) Language analyzers

Wikibase (the software platform behind Wikidata and Wikimedia Commons) has an unusual characteristic. Unlike a typical MediaWiki site, which supports only one language per site, Wikibase is multilingual. As a result, a single Wikibase installation can need a very large number of Elasticsearch analyzers, so many that they burden the whole server.
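To illustrate why many per-language analyzers can be collapsed, here is a minimal Python sketch of de-duplicating analyzer definitions by their configuration. It is not the code used by the Search Team. The function `dedupe_analyzers`, the analyzer names, and the filter names are all hypothetical.

```python
import json


def dedupe_analyzers(analyzers: dict[str, dict]) -> tuple[dict[str, dict], dict[str, str]]:
    """Keep one copy of each distinct analyzer configuration.

    Returns (kept analyzers, alias map from every original name to the kept name).
    """
    kept = {}
    first_name_for = {}   # serialized config -> first analyzer name with that config
    aliases = {}
    for name, config in analyzers.items():
        # Serialize with sorted keys so equal configs produce equal strings.
        key = json.dumps(config, sort_keys=True)
        canonical = first_name_for.setdefault(key, name)
        if canonical == name:
            kept[name] = config
        aliases[name] = canonical
    return kept, aliases


# Hypothetical per-language analyzers; two of them are identical.
analyzers = {
    "aa_text": {"tokenizer": "standard", "filter": ["lowercase"]},
    "ab_text": {"tokenizer": "standard", "filter": ["lowercase"]},
    "en_text": {"tokenizer": "standard", "filter": ["lowercase", "english_stemmer"]},
}
kept, aliases = dedupe_analyzers(analyzers)
print(sorted(kept))        # ['aa_text', 'en_text']
print(aliases["ab_text"])  # aa_text
```

After de-duplication, any code that still looks up a removed name such as "ab_text" directly, instead of going through the alias map, will no longer find it.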

“

Analyzers are language specific text processing components that improve the matching between user queries and content.

Each elasticsearch index has some amount of configuration that defines it. On a typical wiki, for example eswiktionary, this configuration is ~8kb. But for a wikidata index, which contains the text processing configuration of all possible languages, this configuration is 450kb and probably outside the normal operating expectations of elasticsearch.

We talked recently with someone at our office hours who has running hundreds of wikibase instances into a single elasticsearch cluster. Unfortunately, their elasticsearch cluster became unresponsive, failed master elections, and generally became unusable. After some light review of stack taces and logs, this is due to it taking 10s of minutes for the master to load the cluster state, which includes the configuration of those hundreds of wikibase indices.

One theory to investigate in this ticket is if we could improve the time it takes to load the wikibase search index configuration by clearing out duplications between languages, and by proxy reduce the size of elasticsearch cluster state created by each wikibase instance.

”

— EBernhardson (May 5, 2023) Optimize the elasticsearch analysis settings for wikibase phabricator.wikimedia.org

The Wikimedia developers eventually decided to remove the duplicate analyzers to reduce the load on the Wikidata and Commons servers. Unfortunately, removing them broke the search feature on Wikidata and Commons.

“

A reindex of the elasticsearch indices for wikibase enabled wikis (wikidata and commons) was scheduled.

Reindexing is a routine task the search teams uses to enable new settings at the index level, generally to tune of language-specific search configurations are processed. For this task, the reason of reindexing was to optimize the number of analyzers created on these wikis by de-duplicating them (about 300+ languages).

De-duplicating analyzers means any code referring to a particular analyzer might now possibily reference one that was de-duplicated (and thus non-existent). The search team analyzed such cases and found nothing problematic, after scanning the code-base.

However, this was untrue. After the wikidata reindex was done, and right after the new index was promoted to production, queries started to fail.

The reason is that the "token_count_router" query was still referencing the "text_search" analyzer directly, which was now nonexistent because of the de-duplication. The "token_count_router" is a feature that counts the number of token in a query to prevent the running of costly phrase queres that contains too many tokens.

There are several alternative mitigations that were evaluated.

First, disabling the "token_count_router" could have fixed the immediate problem, but could have put the whole cluster under the risk of being overloaded by such pathological queries.

Second, reverting the initial feature was not possible since it requires a full re-index of the wiki. It's a long procedure that could take 10+ hours.

Third, adding the "text_search" analyzer manually on wikidata and common indices could have fixed the issue. But it required the closing of the index, which is a heavy maintenance task.

Fourth, fix the "token_count_router" to not reference the "text_search" analyzer directly as an one liner fix. This approach was preferred.

”

— David Causse, Antoine Musso (June 19, 2023) Incidents 2023-06-18 : search broken on wikidata and commons wikitech.wikimedia.org

Some code still depended on one of those analyzers in order to work. Because the analyzer had already been removed, that code caused every search query to fail:

All shards failed for phase: [query]
[Unknown analyzer [text_search]]; nested: IllegalArgumentException[Unknown analyzer [text_search]];
Caused by: java.lang.IllegalArgumentException: Unknown analyzer [text_search]
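One way a leftover reference like this could be caught before deployment is a simple consistency check between query definitions and the analyzers that still exist. The Python sketch below is an illustration under assumed data structures, not the Search Team's tooling. The function `find_dangling_analyzers` and the sample definitions are hypothetical.

```python
def find_dangling_analyzers(queries: dict[str, dict], available: set[str]) -> dict[str, str]:
    """Return {query name: analyzer name} for queries whose analyzer is not defined."""
    dangling = {}
    for name, definition in queries.items():
        analyzer = definition.get("analyzer")
        if analyzer is not None and analyzer not in available:
            dangling[name] = analyzer
    return dangling


# Hypothetical set of analyzers left after de-duplication.
available = {"text", "plain"}

# Hypothetical query definitions; only the first references a removed analyzer.
queries = {
    "token_count_router": {"analyzer": "text_search"},
    "phrase_match": {"analyzer": "text"},
    "field_based": {"field": "text"},
}

print(find_dangling_analyzers(queries, available))
# {'token_count_router': 'text_search'}
```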

To resolve the problem, they removed the code's reference to the deleted analyzer.^[2]

From this (EntityFullTextQueryBuilder.php):

$tokCount = new TokenCountRouter($query_text,new MatchNone(),null,'text_search');

To this:

$tokCount = new TokenCountRouter($query_text,new MatchNone(),"text");

And from this (phraseRescore.expected):

"token_count_router" : { "analyzer" : "text_search" }

To this:

"token_count_router" : { "field" : "text" }

As the diff shows, the code no longer references the "text_search" analyzer.
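The idea behind a token-count router, as described in the incident report, is to count the tokens in a query and avoid running an expensive phrase query when there are too many. The Python sketch below shows only that idea. It is not the Elasticsearch or CirrusSearch implementation. The function name, the parameters, and the whitespace tokenizer are assumptions made for illustration.

```python
from typing import Callable


def route_by_token_count(
    query_text: str,
    max_tokens: int,
    phrase_query: str,
    fallback_query: str,
    tokenize: Callable[[str], list[str]] = str.split,
) -> str:
    """Pick the phrase query only when the query text has few enough tokens."""
    # The real feature tokenizes with an analyzer or a field's settings;
    # whitespace splitting stands in for that here.
    tokens = tokenize(query_text)
    if len(tokens) > max_tokens:
        return fallback_query
    return phrase_query


print(route_by_token_count("new york city", 5, "phrase", "match_none"))
# phrase
print(route_by_token_count("a b c d e f g", 5, "phrase", "match_none"))
# match_none
```

The fix shown above changes how the tokenizer is selected (through the "text" field rather than a named analyzer), not the routing logic itself.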

Timeline of events

Friday, 16 June:

21:40 Re-indexing begins

Saturday, 17 June:

11:30 Search on Wikidata and Wikimedia Commons breaks
22:07 Snowmanonahoe reports the breakage to the developers via Phabricator

Sunday, 18 June:

05:39 Legoktm asks in the IRC channel #mediawiki_security whether search on Wikidata and Commons is broken
06:37 Hashar happens to see the message on IRC and starts investigating
07:00 Hashar contacts the Europe-based members of the Search Team (the Wikimedia team responsible for the search feature): Gehel and dcausse
08:00 Dcausse concludes that reverting the re-index is not feasible, because running a full re-index again would take a very long time; another fix is needed
08:15 An alternative fix is identified: remove the reference to the deleted analyzer
09:20 Hashar and Dcausse start a video call to work on the fix together
09:29 The fix is tested on the mwdebug1001 server
10:02 Search is restored

Parsoid

“	Parsoid started in 2012 as a project to support Visual Editing.	”
— Subbu Sastry (February 27, 2019) The long and winding road to making Parsoid the default MediaWiki parser

“	Mission since 2016 Advance wikitext as a language. Easier to write, faster to parse, less error prone. Make wikitext content easier to analyze. Expose wikitext semantics in well-specified output.	”
— Subbu Sastry (February 27, 2019) The long and winding road to making Parsoid the default MediaWiki parser

“

Parsoid is a library that allows for converting back and forth between MediaWiki's wikitext syntax and an equivalent HTML/RDFa document model. Parsoid is intended to provide flawless back-and-forth conversions, to avoid information loss and also prevent "dirty diffs".

The original application was written in Node.js and started running on the Wikimedia cluster in December 2012. In 2019, Parsoid was ported to PHP, and this PHP version replaced the Node.js version on the Wikimedia cluster in December 2019. Parsoid is being integrated into core MediaWiki, with the goal of eventually replacing MediaWiki's current native parser.

Currently, we have two separate wikitext parsers that are used in MediaWiki on the Wikimedia cluster. One is the original core parser (legacy parser) and the other is Parsoid.

At present, the core parser is used for all desktop and mobile web read views. Meanwhile, Parsoid is currently used to serve all editing clients (VisualEditor, Structured Discussions, Content Translation), linting tools (Extension:Linter), some gagdets, mobile apps, Kiwix offline reader, Wikimedia Enterprise and the Google knowledge graph project.

”

—Content Transform Team (2011) Parsoid

Parsoid is the (newer) software Wikimedia uses to convert wikitext into an HTML document that browsers can display. The original native MediaWiki parser (the legacy parser) has done this conversion until now, and it still serves read views.

The plan is to replace the (old) native parser with the (new) Parsoid.

As a side effect, many pieces of site CSS, user scripts and gadgets that rely on the old native parser's behaviour could break when this component is replaced.

The Wikimedia Content Transform team therefore advises updating site CSS, user scripts and gadgets to follow Parsoid's new rules.

MediaWiki 1.41/wmf.15

Since 29 June 2023, all Wikimedia wikis have been upgraded to MediaWiki 1.41/wmf.15.

References

[1] https://meta.wikimedia.org/w/index.php?title=Talk:Wikimedia_Foundation_Annual_Plan/2023-2024/Draft/Future_Audiences&diff=prev&oldid=25224045

[2] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseCirrusSearch/+/930930/