Together with Dmitry Brazhenko (Microsoft) we built a RAG pipeline from scratch and then improved it with a set of mechanics and heuristics.
Today's LLM is a model from OpenAI, but you can use others as well.
Let's try the [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    'Python is a cool programming language',
    'Python is an amazing programming language. There are a lot of apps that are made using Python',
    'London is a big city',
    'London has 9,787,426 inhabitants at the 2011 census',
    'London is known for its financial district',
    'I am cooking python for breakfast right now',
    'I am cooking a lunch right now',
    'I am NOT cooking any meal right now'
]
embeddings = model.encode(sentences)
print(embeddings[0][:20])
print(len(embeddings[0]))
[-0.06981423 -0.0010024 0.0076027 0.00425244 -0.04276219 -0.16027543
0.01579451 0.04823529 -0.01330455 0.01159351 -0.00749934 0.02610185
0.08388049 0.0311769 0.03688725 -0.02012014 -0.06381261 0.0093547
-0.00549993 -0.1540702 ]
384
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity

sns.heatmap(cosine_similarity(embeddings), annot=True, cmap='coolwarm', xticklabels=False, yticklabels=False)
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
query_embedding = model.encode("What is the US capital?")
passage_embedding = model.encode([
    "London has 9787426 inhabitants at the 2011 census",
    "London is known for its financial district",
    "Washington, DC is the U.S. capital"
])
print("Similarity:", util.dot_score(query_embedding, passage_embedding))
Similarity: tensor([[0.0115, 0.1315, 0.6870]])
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
query_embedding = model.encode("What is the population of the capital of the UK?")
passage_embedding = model.encode([
    "London has 9787426 inhabitants at the 2011 census",
    "London is known for its financial district",
    "Washington, DC is the U.S. capital"
])
print("Similarity:", util.dot_score(query_embedding, passage_embedding))
Similarity: tensor([[0.6134, 0.4554, 0.2468]])
Embedding models references:
* https://www.sbert.net/docs/pretrained_models.html
* https://platform.openai.com/docs/guides/embeddings
So, if you pick the right embedding model, you get a powerful tool for finding semantic similarity.
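Under the hood, semantic search boils down to one operation: cosine similarity between the query vector and each passage vector. A minimal sketch with toy hand-made 3-d vectors standing in for `model.encode(...)` output:

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy vectors standing in for real model.encode(...) output
query = [0.9, 0.1, 0.0]
passages = {
    "Python is a cool programming language": [0.8, 0.2, 0.1],
    "London is a big city": [0.1, 0.9, 0.3],
}

# Pick the passage whose vector is closest to the query vector
best = max(passages, key=lambda s: cosine(query, passages[s]))
```

With real embeddings the vectors have hundreds of dimensions (384 for `all-MiniLM-L6-v2`), but the ranking logic is exactly this.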
import lancedb

data = []
for sentence in sentences:
    data.append({"vector": model.encode(sentence),
                 "sentence": sentence})

uri = "data/sample-lancedb"
db = lancedb.connect(uri)
table = db.create_table("my_table", data=data)
result = table.search(model.encode("Cooking")).metric("cosine").limit(3).to_pandas()
result
result = table.search(model.encode("Python programming")).metric("cosine").limit(3).to_pandas()
result
from sentence_transformers import CrossEncoder
model = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-2-v2', max_length=512)
scores = model.predict([
    ("Python programming", result['sentence'].iloc[0]),
    ("Python programming", result['sentence'].iloc[1]),
    ("Python programming", result['sentence'].iloc[2])
])
scores
array([10.1110935, 9.059612 , -5.421424 ], dtype=float32)
The two approaches are easy to combine: use embeddings to retrieve a preliminary candidate set, then use the Cross Encoder to pick the best candidates by score.
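The retrieve-then-rerank pattern can be sketched end to end with cheap stand-ins: a token-overlap score in place of the bi-encoder, and a slightly more expensive scorer in place of the `CrossEncoder` (both scoring functions below are illustrative, not the real models):

```python
def bi_encoder_score(query, passage):
    # Cheap stand-in for an embedding dot-product: Jaccard token overlap
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q | p)

def cross_encoder_score(query, passage):
    # Stand-in for CrossEncoder.predict on the (query, passage) pair:
    # pretend the joint model rewards exact phrase matches
    bonus = 1.0 if query.lower() in passage.lower() else 0.0
    return bi_encoder_score(query, passage) + bonus

def retrieve_then_rerank(query, corpus, k=3):
    # Stage 1: fast retrieval of the top-k candidates
    candidates = sorted(corpus, key=lambda p: bi_encoder_score(query, p),
                        reverse=True)[:k]
    # Stage 2: slower, more precise reranking of the short list only
    return sorted(candidates, key=lambda p: cross_encoder_score(query, p),
                  reverse=True)
```

The point of the two stages is cost: the first pass scores the whole corpus cheaply, the second pass spends the expensive pairwise model only on `k` candidates.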
fake_facts = [
"Penguins can fly if they eat enough fish.",
"Tomatoes are classified as both a fruit and a vegetable due to a genetic anomaly.",
"The Sahara Desert was once a thriving rainforest before climate change.",
"Sharks are afraid of the color yellow and avoid it at all costs.",
"Cats can see in complete darkness because they have infrared vision.",
"The Internet is powered by thousands of hamsters running on wheels.",
"The pyramids of Egypt were built by a civilization of intelligent ants.",
"Mars was once home to an advanced alien civilization that built canals.",
"Rainbows are actually circular, but we only see half of them from the ground.",
"A person can survive for a month by only drinking coffee and eating chocolate.",
"Alligators have been known to climb trees to hunt for prey.",
"The Earth's core is made entirely of cheese, which is why we have so many dairy products.",
"Owls can turn their heads in a full 360-degree circle.",
"Dolphins communicate with each other using a complex language of clicks and whistles that humans can learn.",
"Jellyfish are immortal and can live forever unless they are eaten.",
"Bananas grow upside down in Australia due to gravitational differences.",
"The Eiffel Tower can shrink by up to six feet during extremely cold weather.",
"Elephants can jump higher than kangaroos when motivated by food.",
"The Great Wall of China was originally built to keep out giant mutant pandas.",
"Lightning never strikes the same place twice because the earth’s rotation prevents it.",
"Cows produce chocolate milk if they are fed chocolate.",
"Mount Everest is actually growing at a rate of five feet per year.",
"Bees can understand human language but choose not to respond.",
"The moon is slowly drifting towards Earth and will collide in 500 million years.",
"Humans have a natural instinct to spin in circles when they see a rainbow."
]
PROMPT = f"""
USING PROVIDED DATA BELOW YOU **MUST** ANSWER USER'S QUESTION.
**PROVIDED INFORMATION**:
{formatted}
**USER'S QUESTION**:
{question}
**ANSWER MUST BE CLEAR AND CONCISE AND **MUST** BE BASED ON PROVIDED INFORMATION ONLY **
"""
response = openai_client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": PROMPT
                }
            ]
        }
    ],
    temperature=0,
    max_tokens=256,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
)
print(response.choices[0].message.content)
Yes, penguins can fly if they eat enough fish.
If the model had given the common-sense answer, it would mean it was relying on data from other sources, which is not what we asked for.
Now you can build an LLM step that runs this whole sequence and judges on its own how correct the answer is. Then we compute a score and conclude whether the model works.
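That self-evaluation step can be sketched without any SDK: build a judge prompt, send it to the LLM, and parse the verdict into a score. The prompt wording and the `GROUNDED`/`NOT_GROUNDED` labels below are illustrative assumptions, not from the tutorial:

```python
def build_judge_prompt(question, context, answer):
    # Prompt for a second "judge" LLM call (wording is illustrative)
    return (
        "Decide whether the ANSWER is supported ONLY by the CONTEXT.\n"
        f"CONTEXT:\n{context}\n"
        f"QUESTION: {question}\n"
        f"ANSWER: {answer}\n"
        "Reply with exactly one word: GROUNDED or NOT_GROUNDED."
    )

def parse_verdict(reply):
    # Map the judge's one-word reply to a boolean score
    return reply.strip().upper().split()[0] == "GROUNDED"
```

Averaging `parse_verdict(...)` over a test set of questions gives the score mentioned above.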
## Let's index a file
```Ingest text -> split it into groups/chunks -> embed the groups/chunks -> use the embeddings```
source: https://github.com/mrdbourke/simple-local-rag/blob/main/00-simple-local-rag.ipynb
1. Easiest way: split by chunks
2. More advanced approaches: paragraphs/pages/...
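The helpers `convert_pdf_to_text` and `split_pages_into_chunks` used below are defined in the notebook, not in a library. As a rough sketch, a word-based splitter could look like this (the real one counts tokens with a `tiktoken` encoding rather than words, but the sliding logic is the same):

```python
def split_pages_into_chunks(pages, chunk_size):
    # Word-based stand-in for the notebook's token-based splitter:
    # cut each page's text into fixed-size windows
    chunks = []
    for page in pages:
        words = page.split()
        for i in range(0, len(words), chunk_size):
            chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks
```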
import tiktoken
from tqdm import tqdm

pages = convert_pdf_to_text("Human-Nutrition-2020.pdf")
chunks = split_pages_into_chunks(pages, 256, tiktoken.encoding_for_model('gpt-3.5-turbo'))

data = []
for sentence in tqdm(chunks):
    data.append({"vector": model.encode(sentence),
                 "sentence": sentence})

uri = "data/sample-lanced3"
db = lancedb.connect(uri)
table_nutricion = db.create_table("my_table", data=data)
100%|██████████| 1871/1871 [04:52<00:00, 6.40it/s]
question = "is alcohol bad for health"
result = table_nutricion.search(model.encode(question)).metric("cosine").limit(3).to_pandas()
formatted = "".join([f"*{line}\n" for line in result['sentence']])
print(formatted)
*annutrition2/?p=283 Alcohol Metabolism | 441 Health Consequences of Alcohol Abuse.
UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM.
Alcoholic drinks in excess contribute to weight gain by substantially increasing caloric intake. However, alcohol displays its two-faced character again in its effects on body weight, making many scientific studies contradictory. Multiple studies show high intakes of hard liquor are linked to weight gain, although this may be the result of the regular consumption of hard liquor with sugary soft drinks, juices, and other mixers. On the other hand drinking beer and, even more so, red wine, is not consistently linked to weight gain and in some studies actually
* been excluded from this version of the text. You can view it online here: http://pressbooks.oer.hawaii.edu/humannutrition2/?p=274
Proteins, Diet, and Personal Choices | 427 PART VII CHAPTER 7. ALCOHOL. Chapter 7. Alcohol | 429 Image by Allison Calabrese / CC BY 4.0. Introduction.
UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM.
Ka wai hoʻomalule kino. The liquid that causes limpness to the body.
Learning Objectives. By the end of this chapter, you will be able to: Introduction | 431
• Describe the process of alcohol metabolism
• Describe the health benefits and health risks associated with alcohol consumption
Alcohol is both a beverage pr
* heart disease and Type 2 diabetes in some people. The United States Department of Health and Human Services (HHS) defines drinking in moderation as no more than one drink a day for women and two drinks a day for men5. Illicit and prescription drug abuse are associated with decreased health and is a prominent problem in the United States. The health
3. Centers for Disease Control and Prevention (2020, April 28). Tobacco and Mortality. http://www.cdc.gov/tobacco/data_statistics/fact_sheets/health_effects/tobacco_related_mortality/Index.htm.
4. Centers for Disease Control and Prevention (2020, January 3). Underage Drinking. http://www.cdc.gov/healthyyouth/alcoholdrug/
5. U.S. Department of Health and Human Services and U.
search = DuckDuckGoSearchRun()
search_tool = Tool.from_function(
    func=search.run,
    name="InternetSearch",
    description="useful for when you need to search the internet for information"
)

fake_facts_tool = Tool.from_function(
    func=extract_relevant_fake_facts,
    name="FactTool",
    description="Used to extract facts from the local knowledge base. Should be used when factual questions are asked. Knowledge from this tool **MUST** be treated by the agent as the truth."
)
tools = [fake_facts_tool]
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)
agent.run("""Can penguins fly?""")
agent.run("""Can penguins fly? YOU MUST USE ONLY INFORMATION FROM FACTOOL AND NOT YOUR OWN KNOWLEDGE""")
def extract_nutrictious_facts(query):
    result = table_nutricion.search(model.encode(query)).metric("cosine").limit(2).to_pandas()
    formatted = "".join([f"*{line}\n" for line in result['sentence']])
    return formatted
search = DuckDuckGoSearchRun()
search_tool = Tool.from_function(
    func=search.run,
    name="InternetSearch",
    description="useful for when you need to search the internet for information"
)

nutrictious_facts_tool = Tool.from_function(
    func=extract_nutrictious_facts,
    name="NutricitousTool",
    description="Used to extract data regarding nutrition facts"
)
tools = [nutrictious_facts_tool, search_tool]
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)
agent.run("""Is alcohol good for health?""")
By the way, you can reproduce the tutorial's code yourself; you can download it here 😉