developer tip

태그 시스템 구현 방법

optionbox 2020. 9. 19. 10:55
반응형

태그 시스템 구현 방법


SO에서 사용되는 것과 같은 태그 시스템을 구현하는 가장 좋은 방법이 무엇인지 궁금합니다. 나는 이것을 생각하고 있었지만 좋은 확장 가능한 솔루션을 찾을 수 없습니다.

갖는 I은 기본 3 테이블 솔루션을 가지고 생각하고 있었는데 tags테이블, articles테이블과 tag_to_articles테이블을.

이것이이 문제에 대한 최선의 해결책입니까, 아니면 대안이 있습니까? 이 방법을 사용하면 테이블이 시간이 지남에 따라 매우 커지고 검색하는 것이 너무 효율적이지 않다고 생각합니다. 반면에 쿼리가 빠르게 실행되는 것은 그다지 중요하지 않습니다.


이 블로그 게시물이 흥미로울 것이라고 생각합니다. 태그 : 데이터베이스 스키마

문제 : 원하는만큼 많은 태그를 사용하여 책갈피 (또는 블로그 게시물 등)에 태그를 지정할 수있는 데이터베이스 스키마가 필요합니다. 나중에 쿼리를 실행하여 책갈피를 태그의 결합 또는 교차로 제한하려고합니다. 검색 결과에서 일부 태그를 제외 (예 : 빼기) 할 수도 있습니다.

"MySQLicious"솔루션

이 솔루션에서 스키마에는 테이블이 하나만 있으며 비정규 화됩니다. MySQLicious는 del.icio.us 데이터를이 구조의 테이블로 가져 오기 때문에이 유형을 "MySQLicious 솔루션"이라고합니다.

여기에 이미지 설명 입력여기에 이미지 설명 입력

“search + webservice + semweb”에 대한 교차 (AND) 쿼리 :

SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
AND tags LIKE "%webservice%"
AND tags LIKE "%semweb%"

"search | webservice | semweb"에 대한 통합 (OR) 쿼리 :

SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
OR tags LIKE "%webservice%"
OR tags LIKE "%semweb%"

“search + webservice-semweb”에 대한 마이너스 쿼리

SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
AND tags LIKE "%webservice%"
AND tags NOT LIKE "%semweb%"

"Scuttle"솔루션

Scuttle 은 데이터를 두 개의 테이블로 구성합니다. "scCategories"테이블은 "태그"테이블이며 "책갈피"테이블에 대한 외래 키가 있습니다.

여기에 이미지 설명 입력

Intersection (AND) Query for “bookmark+webservice+semweb”:

SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId
HAVING COUNT( b.bId )=3

First, all bookmark-tag combinations are searched, where the tag is “bookmark”, “webservice” or “semweb” (c.category IN ('bookmark', 'webservice', 'semweb')), then just the bookmarks that have got all three tags searched for are taken into account (HAVING COUNT(b.bId)=3).

Union (OR) Query for “bookmark|webservice|semweb”: Just leave out the HAVING clause and you have union:

SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId

Minus (Exclusion) Query for “bookmark+webservice-semweb”, that is: bookmark AND webservice AND NOT semweb.

SELECT b. *
FROM scBookmarks b, scCategories c
WHERE b.bId = c.bId
AND (c.category IN ('bookmark', 'webservice'))
AND b.bId NOT
IN (SELECT b.bId FROM scBookmarks b, scCategories c WHERE b.bId = c.bId AND c.category = 'semweb')
GROUP BY b.bId
HAVING COUNT( b.bId ) =2

Leaving out the HAVING COUNT leads to the Query for “bookmark|webservice-semweb”.


“Toxi” solution

Toxi came up with a three-table structure. Via the table “tagmap” the bookmarks and the tags are n-to-m related. Each tag can be used together with different bookmarks and vice versa. This DB-schema is also used by wordpress. The queries are quite the same as in the “scuttle” solution.

여기에 이미지 설명 입력

Intersection (AND) Query for “bookmark+webservice+semweb”

SELECT b.*
FROM tagmap bt, bookmark b, tag t
WHERE bt.tag_id = t.tag_id
AND (t.name IN ('bookmark', 'webservice', 'semweb'))
AND b.id = bt.bookmark_id
GROUP BY b.id
HAVING COUNT( b.id )=3

Union (OR) Query for “bookmark|webservice|semweb”

SELECT b.*
FROM tagmap bt, bookmark b, tag t
WHERE bt.tag_id = t.tag_id
AND (t.name IN ('bookmark', 'webservice', 'semweb'))
AND b.id = bt.bookmark_id
GROUP BY b.id

Minus (Exclusion) Query for “bookmark+webservice-semweb”, that is: bookmark AND webservice AND NOT semweb.

SELECT b. *
FROM bookmark b, tagmap bt, tag t
WHERE b.id = bt.bookmark_id
AND bt.tag_id = t.tag_id
AND (t.name IN ('Programming', 'Algorithms'))
AND b.id NOT IN (SELECT b.id FROM bookmark b, tagmap bt, tag t WHERE b.id = bt.bookmark_id AND bt.tag_id = t.tag_id AND t.name = 'Python')
GROUP BY b.id
HAVING COUNT( b.id ) =2

Leaving out the HAVING COUNT leads to the Query for “bookmark|webservice-semweb”.


Nothing wrong with your three-table solution.

Another option is to limit the number of tags that can be applied to an article (like 5 in SO) and add those directly to your article table.

Normalizing the DB has its benefits and drawbacks, just like hard-wiring things into one table has benefits and drawbacks.

Nothing says you can't do both. It goes against relational DB paradigms to repeat information, but if the goal is performance you may have to break the paradigms.


Your proposed three table implementation will work for tagging.

Stack overflow uses, however, different implementation. They store tags to varchar column in posts table in plain text and use full text indexing to fetch posts that match the tags. For example posts.tags = "algorithm system tagging best-practices". I am sure that Jeff has mentioned this somewhere but I forget where.


The proposed solution is the best -if not the only practicable- way I can think of to address the many-to-many relationship between tags and articles. So my vote is for 'yes, it's still the best.' I'd be interested in any alternatives though.


If your database supports indexable arrays (like PostgreSQL, for example), I would recommend an entirely denormalized solution - store tags as an array of strings on the same table. If not, a secondary table mapping objects to tags is the best solution. If you need to store extra information against tags, you can use a separate tags table, but there's no point in introducing a second join for every tag lookup.


I would like to suggest optimised MySQLicious for better performance. Before that the drawbacks of Toxi (3 table) solution is

If you have millions of questions, and it has 5 tags in each, then there will be 5 million entries in tagmap table. So first we have to filter out 10 thousand tagmap entries based on tag search then again filter out matching questions of those 10 thousand. So while filtering out if the artical id is simple numeric then it is ok, but if it is kind of UUID (32 varchar) then filtering out needs larger comparison though it is indexed.

My solution:

Whenever new tag is created, have counter++ (base 10), and convert that counter into base64. Now each tag name will have base64 id. and pass this id to UI along with name. This way you will be having maximum of two char id till we have 4095 tags created in our system. Now concatenate these multiple tags into each question table tag column. Add delimiter as well and make it sorted.

So table looks like this

여기에 이미지 설명 입력

While querying, query on id instead of real tag name. Since it is SORTED, and condition on tag will be more efficient (LIKE '%|a|%|c|%|f|%).

Note that single space delimiter is not enough and we need double delimiter to differentiate tags like sql and mysql because LIKE "%sql%" will return mysql results as well. Should be LIKE "%|sql|%"

I know the search is non indexed but still you might have indexed on other columns related to article like author/dateTime else will lead to full table scan.

Finally with this solution, no inner join required where million records have to be compared with 5 millions records on join condition.


CREATE TABLE Tags (
    tag VARHAR(...) NOT NULL,
    bid INT ... NOT NULL,
    PRIMARY KEY(tag, bid),
    INDEX(bid, tag)
)

Notes:

  • This is better than TOXI in that it does not go through an extra many:many table which makes optimization difficult.
  • Sure, my approach may be slightly more bulky (than TOXI) due to the redundant tags, but that is a small percentage of the whole database, and the performance improvements may be significant.
  • It is highly scalable.
  • It does not have (because it does not need) a surrogate AUTO_INCREMENT PK. Hence, it is better than Scuttle.
  • MySQLicious sucks because it cannot use an index (LIKE with leading wild card; false hits on substrings)
  • For MySQL, be sure to use ENGINE=InnoDB in order to get 'clustering' effects.

Related discussions (for MySQL):
many:many mapping table optimization
ordered lists

참고URL : https://stackoverflow.com/questions/1810356/how-to-implement-tag-system

반응형