Relations in Elasticsearch

DEV 2023. 12. 2. 20:50

A story about how to search recurring schedules

How can we build RDB-like relations in Elasticsearch and run join-style searches?

Broadcast-time data for radio_program

Data

{
    "program_name": "TBS 기상정보",
    "channel_name": "TBS FM",
    ....
    "routine": [
        {
          "days": [
            "MON",
            "TUE"
          ],
          "start_time": 658,
          "end_time": 700
        },
        {
          "days": [
            "MON",
            "TUE"
          ],
          "start_time": 958,
          "end_time": 1000
        }
    ]
}

User Query

Radio on Monday this week
Radio at 10 p.m. on Monday
Radio this morning
Radio this weekend

How to search on ES

relation

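Unlike an RDB, ES does not perform join operations. → The reasons are response time and memory usage.
A data type called join was, however, introduced as a replacement for the parent/child relation.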
Join field type | Elasticsearch Guide [8.5] | Elastic

Options to define relationships

Object type
Nested document
Parent-Child relation
Denormalizing
These are the four ways to build relations between documents. Let's look at them one by one.

Object type

GET routine_object/_search?q=routine.days:FRI
{
  "name": "TBS 기상정보",
  "routine": [
      {
        "days": [
          "MON",
          "TUE"
        ],
        "start_time": 658,
        "end_time": 700
      },
      {
        "days": [
          "THU",
          "FRI"
        ],
        "start_time": 958,
        "end_time": 1000
      }
    ]
}

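Internally, Elasticsearch (or rather, Lucene) knows nothing about the structure of each inner object; it only knows fields and their values.
→ It does not know where one object ends and the next begins. → It indexes the document in the form shown below.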

{
    "name": "TBS 기상정보",
    "routine.days": ["MON","TUE","THU","FRI"],
    "routine.start_time":[658, 958],
    "routine.end_time": [700, 1000]
}

Because of this, a search for Tuesday between 9:00 and 10:00, as below, should not return the document above, yet it does.

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "routine.days": "TUE"
          }
        },
        {
          "range": {
            "routine.start_time": {
              "gte": 900,
              "lte": 1000
            }
          }
        }
      ]
    }
  }
}
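The false match can be reproduced without a cluster. The following is a minimal Python sketch, not Elasticsearch code: `flatten_objects` and `matches_flat` are hypothetical helpers that only imitate the flattened form described above, and the document is the one from the Object type example.

```python
def flatten_objects(doc, field):
    """Merge an array of inner objects into one multi-valued field per key,
    imitating how the object type loses the boundary between objects."""
    flat = {k: v for k, v in doc.items() if k != field}
    for obj in doc[field]:
        for key, value in obj.items():
            values = value if isinstance(value, list) else [value]
            flat.setdefault(f"{field}.{key}", []).extend(values)
    return flat


def matches_flat(flat, day, start_gte, start_lte):
    """Evaluate the two must clauses independently against the flat fields."""
    day_ok = day in flat["routine.days"]
    time_ok = any(start_gte <= t <= start_lte for t in flat["routine.start_time"])
    return day_ok and time_ok


doc = {
    "name": "TBS 기상정보",
    "routine": [
        {"days": ["MON", "TUE"], "start_time": 658, "end_time": 700},
        {"days": ["THU", "FRI"], "start_time": 958, "end_time": 1000},
    ],
}

flat = flatten_objects(doc, "routine")
print(flat["routine.days"])        # ['MON', 'TUE', 'THU', 'FRI']
print(flat["routine.start_time"])  # [658, 958]
# TUE comes from the first object and 958 from the second, yet both clauses pass.
print(matches_flat(flat, "TUE", 900, 1000))  # True
```

Each clause is satisfied by a different inner object, which is exactly why the document is returned.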

I wrote "Elasticsearch (or rather, Lucene)" above because, as the accompanying figure shows, every index created by the Elasticsearch indexing system is actually built with Lucene.
One ES shard can be regarded as one Lucene index.
ES scales out by exposing independent Lucene indices in the form of ES shards.

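The object type is easy to use, but it keeps no boundary between objects. It works best for 1-to-1 relations.
As the issue above shows, it cannot tell the inner objects apart, so it does not work correctly for anything other than a 1-to-1 relation.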

Nested document

nested index mapping

PUT routine_nested
{
  "mappings" : {
    "properties" : {
      "name" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        }
      },
      "routine" : {
        "type": "nested", 
        "properties" : {
          "days" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "end_time" : {
            "type" : "long"
          },
          "start_time" : {
            "type" : "long"
          }
        }
      }
    }
  }
}

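add nested document

With the two commands above, the index and the document for the nested document shown in the figure below have been created.

query on monday from 9 to 10
- Searching for radio broadcasts on 'Monday, 9:00 to 10:00' finds a result.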

GET routine_nested/_search
{
  "query": {
    "nested": {
      "path": "routine",
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "routine.days": "MON"
              }
            },
            {
              "range": {
                "routine.start_time": {
                  "gte": 900,
                  "lte": 1000
                }
              }
            }
          ]
        }
      }
    }
  }
}
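The difference from the object type is that both conditions must hold inside the same routine object. A minimal Python sketch of that per-object evaluation follows; `matches_nested` is a hypothetical helper, and `sample_doc` is assumed to be the document from the Object type section, since the document actually indexed into routine_nested is not shown here.

```python
def matches_nested(doc, day, start_gte, start_lte):
    """Return True only if one single routine object satisfies both conditions,
    imitating how a nested query evaluates each nested object on its own."""
    return any(
        day in routine["days"] and start_gte <= routine["start_time"] <= start_lte
        for routine in doc["routine"]
    )


# Assumption: the same sample document as in the Object type section.
sample_doc = {
    "name": "TBS 기상정보",
    "routine": [
        {"days": ["MON", "TUE"], "start_time": 658, "end_time": 700},
        {"days": ["THU", "FRI"], "start_time": 958, "end_time": 1000},
    ],
}

# Tuesday 9:00 to 10:00: no single object has both TUE and a start_time in range.
print(matches_nested(sample_doc, "TUE", 900, 1000))  # False
# Friday 9:00 to 10:00: the second object satisfies both conditions.
print(matches_nested(sample_doc, "FRI", 900, 1000))  # True
```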

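Unlike the object type discussed first, a nested document keeps a boundary around each object.
A search for 'Monday 6:59 radio' also finds a result. For 'Monday 9:59 radio' the correct answer is 'no results': the object type returned a wrong match, whereas the nested document can return no results.
However, each routine object is stored as a separate Lucene document → the data lives in different documents,
-> so a query such as 'radio on both Monday and Friday', shown below, returns no results.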

GET routine_nested/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "routine.days": "FRI"
          }
        },
        {
          "match": {
            "routine.days": "MON"
          }
        }
      ]
    }
  }
}

INCLUDE_IN_ROOT
- For this case the index mapping offers an option called include_in_root, which can be set as follows.

PUT routine_nested
{
  "mappings" : {
    "properties" : {
      "name" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        }
      },
      "routine" : {
        "type": "nested", 
        "include_in_root": true,
        "properties" : {
          "days" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "end_time" : {
            "type" : "long"
          },
          "start_time" : {
            "type" : "long"
          }
        }
      }
    }
  }
}

With include_in_root, the fields of the routine objects are also indexed into the root document.

Besides include_in_root and include_in_parent, there are several other options.
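As a rough illustration of what include_in_root changes, the Python sketch below models an index as a list of nested documents plus one root document. `index_nested` and `root_has_days` are hypothetical helpers, and the sample document is assumed to be the one from the Object type section; this is a simplified model, not how Lucene actually stores the data.

```python
def index_nested(doc, field, include_in_root=False):
    """Split the nested objects into their own documents; optionally copy
    their fields onto the root document as flat, multi-valued fields."""
    nested_docs = [dict(obj) for obj in doc[field]]
    root = {k: v for k, v in doc.items() if k != field}
    if include_in_root:
        for obj in doc[field]:
            for key, value in obj.items():
                values = value if isinstance(value, list) else [value]
                root.setdefault(f"{field}.{key}", []).extend(values)
    return nested_docs, root


def root_has_days(root, days):
    """Imitate a root-level bool/must of match clauses on routine.days."""
    indexed = root.get("routine.days", [])
    return all(day in indexed for day in days)


sample_doc = {
    "name": "TBS 기상정보",
    "routine": [
        {"days": ["MON", "TUE"], "start_time": 658, "end_time": 700},
        {"days": ["THU", "FRI"], "start_time": 958, "end_time": 1000},
    ],
}

_, plain_root = index_nested(sample_doc, "routine")
_, copied_root = index_nested(sample_doc, "routine", include_in_root=True)

print(root_has_days(plain_root, ["MON", "FRI"]))   # False: root has no routine.days
print(root_has_days(copied_root, ["MON", "FRI"]))  # True: fields were copied to root
```

The copied root fields behave like the object type again, so the cross-object false matches described earlier apply to root-level queries, while nested queries keep their per-object boundary.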

Parent-Child relation

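Documents are indexed under different types, and the relation is defined by mapping those types to each other.
Since ES 6, an index is forced to contain only one type,
→ so the old parent-child relation can no longer be used, but a relation can still be defined with the join data type.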
- Join field type | Elasticsearch Guide [8.5] | Elastic

Relations are possible, but the problem is performance

The Elasticsearch documentation offers hints such as the following.
`Nested` documents and queries are typically `expensive`, so using the `flattened` data type for this use case is a better option.
The `join` field shouldn’t be used like joins in a relation database. In Elasticsearch the key to good performance is to `de-normalize` your data into documents.
We don’t recommend using multiple levels of relations to replicate a relational model. Each level of relation adds an `overhead at query time in terms of memory and computation`. For better search performance, `denormalize your data instead.`

Denormalizing

In relational database design, the process of structuring data so that redundancy is minimized is called **normalization**.
- Wikipedia

Denormalizing: duplicating data in order to avoid expensive joins

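So how do we solve it?

Flattened data, Denormalize

Remove the nested documents and denormalize

As in the figure above, Monday through Sunday are mapped to 1 through 7, and that digit is placed in front of the time to form a single number.
- Monday, 08:00 → 10800
- Tuesday, 17:15 → 21715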
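The encoding above amounts to day digit times 10000 plus the HHMM time. A minimal Python sketch, where `DAY_INDEX` and `encode_day_time` are hypothetical names introduced only for illustration:

```python
# Monday through Sunday mapped to 1 through 7, as described above.
DAY_INDEX = {"MON": 1, "TUE": 2, "WED": 3, "THU": 4, "FRI": 5, "SAT": 6, "SUN": 7}


def encode_day_time(day, hhmm):
    """Prefix an HHMM integer (e.g. 1715 for 17:15) with the day digit."""
    if not 0 <= hhmm <= 2400:
        raise ValueError(f"time out of range: {hhmm}")
    return DAY_INDEX[day] * 10000 + hhmm


print(encode_day_time("MON", 800))   # 10800
print(encode_day_time("TUE", 1715))  # 21715
```

Because the day digit is the most significant one, numeric order follows weekday order first and time of day second, which is what makes a single range query sufficient.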

Indexing & Searching on denormalized flat data

Instead of indexing the document in its existing form, as below,

{
  "routine": [
    {
      "days": ["MON"],
      "start_time": 1500,
      "end_time": 1600
    }
  ]
}

the data is denormalized into flattened data, as below.

{
  "day_times": [11500, 11510, 11520, 11530, 11540, 11550, 11600]
}
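The expansion from a routine to day_times can be sketched in Python as follows. The 10-minute step is inferred from the example values above and is an assumption, as are the hypothetical names `DAY_INDEX` and `expand_routine`; the sketch also assumes a routine does not cross midnight.

```python
DAY_INDEX = {"MON": 1, "TUE": 2, "WED": 3, "THU": 4, "FRI": 5, "SAT": 6, "SUN": 7}


def expand_routine(routine, step_minutes=10):
    """Expand one routine into day-prefixed HHMM values, one per time step."""
    # Convert HHMM integers to minutes since midnight so that 1550 + 10 -> 1600.
    start = (routine["start_time"] // 100) * 60 + routine["start_time"] % 100
    end = (routine["end_time"] // 100) * 60 + routine["end_time"] % 100
    values = []
    for day in routine["days"]:
        for minute in range(start, end + 1, step_minutes):
            hhmm = (minute // 60) * 100 + minute % 60
            values.append(DAY_INDEX[day] * 10000 + hhmm)
    return values


routine = {"days": ["MON"], "start_time": 1500, "end_time": 1600}
print(expand_routine(routine))
# [11500, 11510, 11520, 11530, 11540, 11550, 11600]
```

The step size is a trade-off: a smaller step allows more precise point-in-time lookups but stores more values per document, and an exact-time query only matches values that fall on the chosen grid.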

query = Monday radio

{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "day_times": {
              "gte": 10000,
              "lte": 12400
            }
          }
        }
      ]
    }
  }
}

query = Monday 10 o'clock radio

{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "day_times": {
              "gte": 11000,
              "lte": 11000
            }
          }
        }
      ]
    }
  }
}

query = weekend radio

{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "day_times": {
              "gte": 60000,
              "lte": 72400
            }
          }
        }
      ]
    }
  }
}
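All three queries follow one pattern, so the request body can be generated from a day span and an optional time. The Python sketch below is one possible way to do that; `DAY_INDEX`, `day_times_query`, and the default field name are illustrative assumptions, and the bounds are written with gte/lte as in the earlier range queries.

```python
import json

DAY_INDEX = {"MON": 1, "TUE": 2, "WED": 3, "THU": 4, "FRI": 5, "SAT": 6, "SUN": 7}


def day_times_query(first_day, last_day=None, hhmm=None, field="day_times"):
    """Build a range query body over the day-prefixed time values.

    With hhmm=None the whole span from first_day to last_day is covered;
    with hhmm set, only that exact time on first_day is matched.
    """
    if hhmm is not None:
        lower = upper = DAY_INDEX[first_day] * 10000 + hhmm
    else:
        lower = DAY_INDEX[first_day] * 10000
        upper = DAY_INDEX[last_day or first_day] * 10000 + 2400
    return {
        "query": {
            "bool": {
                "must": [{"range": {field: {"gte": lower, "lte": upper}}}]
            }
        }
    }


monday = day_times_query("MON")                 # bounds 10000 .. 12400
monday_10 = day_times_query("MON", hhmm=1000)   # bounds 11000 .. 11000
weekend = day_times_query("SAT", "SUN")         # bounds 60000 .. 72400
print(json.dumps(weekend, indent=2))
```

Turning a phrase such as 'this weekend' into the SAT to SUN span is a separate parsing step that this sketch does not cover.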

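Is duplication always bad?

This may look like a simple solution, but every program's document now carries duplicated data.
That is an issue that would not have arisen had the data been normalized.
Duplicated code is sometimes treated as an absolute evil, but...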

Kent Beck's rules of simple software design

Passes tests

Reveals intention

No duplication

Fewest elements (remove any code that the three rules above do not require)

Reveals intention
No duplication

These two rules sometimes conflict: revealing intention can introduce duplication,

and removing duplication can hide the intention.

Of course, that does not mean this approach is the right answer, either...

Ref.

- https://www.gojek.io/blog/elasticsearch-the-trouble-with-nested-documents

- Elasticsearch in action
