{"id":13781,"date":"2023-04-18T13:13:32","date_gmt":"2023-04-18T13:13:32","guid":{"rendered":"https:\/\/prizmlaw.com\/site\/?p=13781"},"modified":"2025-04-18T17:50:44","modified_gmt":"2025-04-18T17:50:44","slug":"using-lang-models-1","status":"publish","type":"post","link":"https:\/\/prizmlaw.com\/site\/2023\/04\/18\/using-lang-models-1\/","title":{"rendered":"Creating a Python-Based Document Q&#038;A App Using OpenAI Language Models"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"13781\" class=\"elementor elementor-13781\">\n\t\t\t\t<div class=\"elementor-element elementor-element-2b234e8 e-flex e-con-boxed e-con e-parent\" data-id=\"2b234e8\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-4aa3d66 elementor-widget elementor-widget-pix-img\" data-id=\"4aa3d66\" data-element_type=\"widget\" data-widget_type=\"pix-img.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div class=\"pix-img-element d-inline-block \" ><div class=\"pix-img-el    text-left d-inline-block  w-100 rounded-lg\"  ><img fetchpriority=\"high\" decoding=\"async\" class=\"card-img2 pix-img-elem rounded-lg  h-1002\" style=\"height:auto;\" width=\"1170\" height=\"1170\" srcset=\"https:\/\/prizmlaw.com\/site\/wp-content\/uploads\/2025\/04\/with_file_facing_camera-1170x1170-1.jpg 1170w, https:\/\/prizmlaw.com\/site\/wp-content\/uploads\/2025\/04\/with_file_facing_camera-1170x1170-1-300x300.jpg 300w, https:\/\/prizmlaw.com\/site\/wp-content\/uploads\/2025\/04\/with_file_facing_camera-1170x1170-1-1024x1024.jpg 1024w, https:\/\/prizmlaw.com\/site\/wp-content\/uploads\/2025\/04\/with_file_facing_camera-1170x1170-1-150x150.jpg 150w, https:\/\/prizmlaw.com\/site\/wp-content\/uploads\/2025\/04\/with_file_facing_camera-1170x1170-1-768x768.jpg 768w, https:\/\/prizmlaw.com\/site\/wp-content\/uploads\/2025\/04\/with_file_facing_camera-1170x1170-1-400x400.jpg 400w, 
https:\/\/prizmlaw.com\/site\/wp-content\/uploads\/2025\/04\/with_file_facing_camera-1170x1170-1-75x75.jpg 75w, https:\/\/prizmlaw.com\/site\/wp-content\/uploads\/2025\/04\/with_file_facing_camera-1170x1170-1-460x460.jpg 460w\" sizes=\"(max-width: 1170px) 100vw, 1170px\" src=\"https:\/\/prizmlaw.com\/site\/wp-content\/uploads\/2025\/04\/with_file_facing_camera-1170x1170-1.jpg\" alt=\"Image link\" \/><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a74850b elementor-widget elementor-widget-heading\" data-id=\"a74850b\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Creating a Python-Based Document Q&amp;A App Using OpenAI Language Models<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ce637bc elementor-widget elementor-widget-heading\" data-id=\"ce637bc\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h5 class=\"elementor-heading-title elementor-size-default\">Part 1<\/h5>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b7b4498 elementor-widget elementor-widget-text-editor\" data-id=\"b7b4498\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"wp-block-media-text alignwide is-stacked-on-mobile is-vertically-aligned-top\">With the <a href=\"https:\/\/prizmlaw.com\/site\/2023\/02\/18\/dev-of-lang-models\/\" data-type=\"post\" data-id=\"2039\">previous article<\/a> about natural language processing (NLP) and large language models (LLMs) as background, I\u2019d\u00a0\u00a0like to walk you through an example of using an LLM to create a basic application that enables a non-specialist to understand the content of long, complex 
legal documents by asking natural language questions. (I have to give credit to <a href=\"https:\/\/www.youtube.com\/@DavidShapiroAutomator\" target=\"_blank\" rel=\"noreferrer noopener\" data-type=\"URL\" data-id=\"https:\/\/www.youtube.com\/@DavidShapiroAutomator\">David Shapiro<\/a>, whose great YouTube channel provided my initial education in this area.) \u00a0<\/div><div>\u00a0<\/div><div>As an example document, I will use the guide \u201c<a href=\"https:\/\/www.medicare.gov\/publications\/10050-LE-medicare-and-you.pdf\" target=\"_blank\" rel=\"noreferrer noopener\" data-type=\"URL\" data-id=\"https:\/\/www.medicare.gov\/publications\/10050-LE-medicare-and-you.pdf\">Medicare &amp; You 2023<\/a>\u201d. This is the official government handbook regarding Medicare. It describes the program and the various services it does and does not pay for. Because there are frequent changes to Medicare, it is important to get the most up-to-date information. As of this writing, ChatGPT was trained on information collected from the internet through 2021, so any information it has about Medicare is two years old at best. I imagine there are millions of people who might have a simple question about Medicare but do not want to comb through this 128-page document. 
So building a custom AI system users could query makes a good demo project.<\/div>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-c2fd0fb e-flex e-con-boxed e-con e-parent\" data-id=\"c2fd0fb\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-6347f92 elementor-widget elementor-widget-heading\" data-id=\"6347f92\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Building the Document Index<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8000f99 elementor-widget elementor-widget-text-editor\" data-id=\"8000f99\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>The first step in building this system is to convert our source text to some sort of <a href=\"https:\/\/www.tensorflow.org\/text\/guide\/word_embeddings\" target=\"_blank\" rel=\"noreferrer noopener\" data-type=\"URL\" data-id=\"https:\/\/www.tensorflow.org\/text\/guide\/word_embeddings\">vector representation<\/a> of the document. As discussed earlier, converting the text to vectors allows us to represent some of the text\u2019s semantic meaning in a numeric format the computer can search through efficiently. To be a bit more accurate, the computer will be able to place the meaning of words into a vector space for comparison. (<a href=\"https:\/\/www.youtube.com\/@jamesbriggs\/featured\" target=\"_blank\" rel=\"noreferrer noopener\" data-type=\"URL\" data-id=\"https:\/\/www.youtube.com\/@jamesbriggs\/featured\">James Briggs<\/a> has some great videos on this.)<\/p><p>To begin the process, I need to convert the large .pdf version of the Medicare guide to plain text. 
In Python, the <a href=\"https:\/\/realpython.com\/pdf-python\/\" target=\"_blank\" rel=\"noreferrer noopener\" data-type=\"URL\" data-id=\"https:\/\/realpython.com\/pdf-python\/\">PyPDF<\/a> library makes this super-easy. Once the .pdf is converted to a plain-text .txt document, I need to break that one large block of text into smaller, more workable segments, or chunks. Each chunk of text will then be converted into a vector; these are called vector embeddings. Why break the file into chunks? There are two reasons:<\/p><p>First, if we convert the entire document into a single embedding, we\u2019d have a single embedding representing the meaning of the <strong><em>entire<\/em><\/strong> document. This would be perfect if we wanted to search for this document amongst embeddings of other documents with other meanings. But we want to search\u00a0<strong><em>within<\/em><\/strong>\u00a0this document for smaller pieces of information it contains. Therefore, we need vector embeddings of all the smaller pieces of information we might want to search for.<\/p><p>Second, we\u2019ll eventually want to send some of the information we find back to our LLM via an API for some language processing. In our case, <a href=\"https:\/\/platform.openai.com\/docs\/models\/overview\" target=\"_blank\" rel=\"noreferrer noopener\" data-type=\"URL\" data-id=\"https:\/\/platform.openai.com\/docs\/models\/overview\">the API for GPT-3 has a limit<\/a> on the amount of text we can send in a single API call, so it is useful to work only with chunks of text small enough to be fed to the API.\u00a0<\/p><p>In our case, we will break the overall text into chunks of 4,000 characters. 
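The chunking step described above can be sketched in a few lines of Python. This is a minimal illustration, not the article's actual code: only the fixed-width split is shown, and the `text` argument stands in for the plain-text output of the PDF extraction step.

```python
def chunk_text(text, size=4000):
    """Split one large block of text into fixed-size chunks.

    In the full pipeline, `text` would be the plain-text result of
    extracting the PDF (e.g. with the PyPDF library); only the
    chunking logic is shown here.
    """
    return [text[i:i + size] for i in range(0, len(text), size)]

# A 9,500-character document yields chunks of 4,000, 4,000, and 1,500 characters.
chunks = chunk_text("x" * 9500)
```

A production version might split on paragraph or sentence boundaries instead, so that a chunk never cuts a sentence in half.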
Each of these chunks is fed to OpenAI\u2019s \u201c<a href=\"https:\/\/openai.com\/blog\/introducing-text-and-code-embeddings\/\" target=\"_blank\" rel=\"noreferrer noopener\" data-type=\"URL\" data-id=\"https:\/\/openai.com\/blog\/introducing-text-and-code-embeddings\/\">text-similarity-ada-001<\/a>\u201d model. What we get back for each text chunk is a vector of 1536 dimensions; we\u2019ll call each of these a Content Vector. From this we compose a JSON object that contains the original text chunk and its Content Vector. We will have an array of these JSON objects, where each object looks something like this:<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8546777 elementor-widget elementor-widget-pix-img\" data-id=\"8546777\" data-element_type=\"widget\" data-widget_type=\"pix-img.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<div class=\"pix-img-element d-inline-block \" ><div class=\"pix-img-el    center d-inline-block  w-100 rounded-lg\"  ><img decoding=\"async\" class=\"card-img2 pix-img-elem rounded-lg  h-1002\" style=\"height:auto;\" width=\"633\" height=\"97\" srcset=\"https:\/\/prizmlaw.com\/site\/wp-content\/uploads\/2025\/04\/index_healthInsurance_json_\u2014_JulianDoc.jpg 633w, https:\/\/prizmlaw.com\/site\/wp-content\/uploads\/2025\/04\/index_healthInsurance_json_\u2014_JulianDoc-300x46.jpg 300w\" sizes=\"(max-width: 633px) 100vw, 633px\" src=\"https:\/\/prizmlaw.com\/site\/wp-content\/uploads\/2025\/04\/index_healthInsurance_json_\u2014_JulianDoc.jpg\" alt=\"Image link\" \/><\/div><\/div>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c617348 elementor-widget elementor-widget-text-editor\" data-id=\"c617348\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>For now, I\u2019m saving this array of JSON objects as one large JSON file that will act as 
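The index-building step might be sketched like this. The embedding call is injected as a plain function so the logic runs without an API key, and the field names `content` and `vector` are illustrative, not necessarily the keys used in the article's actual JSON file.

```python
import json

def build_index(chunks, embed):
    # `embed` maps a text chunk to its vector -- in the real pipeline,
    # a call to OpenAI's embeddings endpoint returning 1536 numbers.
    return [{"content": chunk, "vector": embed(chunk)} for chunk in chunks]

# Toy two-dimensional "embedding" standing in for the real model.
toy_embed = lambda text: [float(len(text)), float(text.count(" "))]

index = build_index(["Medicare Part A covers...", "Part B covers..."], toy_embed)

# Serialized as one large JSON file, this array acts as the semantic index.
index_json = json.dumps(index)
```

Injecting the embedding function also makes it easy to swap models later without touching the indexing code.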
a semantic index of the information contained in the Medicare guide.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-7083523 e-flex e-con-boxed e-con e-parent\" data-id=\"7083523\" data-element_type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-e636783 elementor-widget elementor-widget-heading\" data-id=\"e636783\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Semantic Similarity Search<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0cba4c3 elementor-widget elementor-widget-text-editor\" data-id=\"0cba4c3\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>The next step is to repeat the process with our question. Assuming our questions will be single sentences, we won\u2019t have to break them into smaller chunks; we can skip straight to converting the question into a vector embedding. We\u2019ll call this our Query Vector.<\/p><p>Now, to find the chunks of text in the Medicare guide most relevant to our question, we can do a <a href=\"https:\/\/towardsdatascience.com\/cutting-edge-semantic-search-and-sentence-similarity-53380328c655\" target=\"_blank\" rel=\"noreferrer noopener\" data-type=\"URL\" data-id=\"https:\/\/towardsdatascience.com\/cutting-edge-semantic-search-and-sentence-similarity-53380328c655\">semantic similarity search<\/a>. 
By comparing our Query Vector to each of our Content Vectors, the computer can quickly find the Content Vectors that are most similar to our Query Vector by computing the <a href=\"https:\/\/www.pinecone.io\/learn\/vector-similarity\/\" target=\"_blank\" rel=\"noreferrer noopener\" data-type=\"URL\" data-id=\"https:\/\/www.pinecone.io\/learn\/vector-similarity\/\">dot product<\/a> of each pair. We\u2019ll rank those dot product results from most similar to least similar and use the top three JSON objects with the highest dot products. In theory, these are the sections of the Medicare guide whose content is most similar to the content of the question.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3d06847 elementor-widget elementor-widget-heading\" data-id=\"3d06847\" data-element_type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Answering the Question<\/h2>\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-483321e elementor-widget elementor-widget-text-editor\" data-id=\"483321e\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>So far we\u2019ve only used our language model to create the vector embeddings, but now we\u2019ll use the model in an entirely different way. 
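The ranking step can be sketched in plain Python over a toy index (the entry structure and function names here are illustrative). Note that ranking by raw dot product matches cosine similarity only when the vectors are normalized to unit length, which OpenAI's embedding vectors are.

```python
def dot(a, b):
    # Dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vector, index, k=3):
    # Rank every Content Vector against the Query Vector and keep the
    # k chunks with the highest dot products.
    ranked = sorted(index, key=lambda item: dot(query_vector, item["vector"]),
                    reverse=True)
    return ranked[:k]

# Toy index: each entry pairs a text chunk with its (tiny) vector.
index = [
    {"content": "Part A covers hospital stays.", "vector": [1.0, 0.0]},
    {"content": "Part B covers doctor visits.", "vector": [0.0, 1.0]},
    {"content": "Part D covers prescriptions.", "vector": [0.7, 0.7]},
]
best = top_k([1.0, 0.1], index, k=2)
```

For an index of this size a linear scan is fine; at millions of vectors you would reach for a vector database or an approximate-nearest-neighbor library instead.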
For each of our three content chunks, we\u2019ll create the following prompt for the GPT-3 model:<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ab5fc25 elementor-widget elementor-widget-text-editor\" data-id=\"ab5fc25\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p><em>Use the following passage to give a detailed answer to the question:<\/em><\/p><p><em>Question: [QUESTION TEXT]<\/em><\/p><p><em>Passage: [RAW CONTENT CHUNK]<\/em><\/p><p><em>Detailed Answer:<\/em><\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-294b3ab elementor-widget elementor-widget-text-editor\" data-id=\"294b3ab\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<p>We\u2019ll dynamically replace the question text and the content text and submit the prompt via the GPT-3 API. GPT-3 will do its best to complete the prompt by drafting a concise answer to our question from the text provided. Now it\u2019s possible that a given content chunk does not provide sufficient information to answer the question. This is why we took the top three content chunks: it increases the chances that we\u2019re providing the language model enough content to answer the question.<\/p><p>Finally, we get to the last step of this exercise. At this point we should have three answers to our question. Hopefully, these answers are similar, and perhaps they answer different aspects of the question or provide different nuance. To provide a final answer to the user, we\u2019ll combine all three answers into one block of text and ask our LLM (GPT-3) to write a detailed summary of that text. That should take care of re-drafting the three separate pieces of information\u00a0into one well-written answer. 
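The prompt-assembly steps above might look like the following sketch. The template wording follows the article; the function names and the exact phrasing of the final summary prompt are illustrative, and the actual API calls are omitted.

```python
PROMPT_TEMPLATE = (
    "Use the following passage to give a detailed answer to the question:\n\n"
    "Question: {question}\n\n"
    "Passage: {passage}\n\n"
    "Detailed Answer:"
)

def build_prompts(question, passages):
    # One prompt per retrieved chunk; each would be submitted in its
    # own call to the GPT-3 completions API.
    return [PROMPT_TEMPLATE.format(question=question, passage=p)
            for p in passages]

def build_summary_prompt(answers):
    # The three intermediate answers are combined into one block of
    # text and summarized in a final API call.
    combined = "\n\n".join(answers)
    return "Write a detailed summary of the following text:\n\n" + combined

prompts = build_prompts("Does Medicare cover dental?",
                        ["chunk one", "chunk two", "chunk three"])
```

Keeping the template as a single constant makes it easy to iterate on the prompt wording without touching the surrounding plumbing.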
Now that you understand the architecture of this demo, <a href=\"https:\/\/prizmlaw.com\/site\/2025\/04\/17\/graph-db\/\" data-type=\"post\" data-id=\"2059\">in the next article<\/a> I\u2019ll walk you through the results.<\/p>\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Creating a Python-Based Document Q&amp;A App Using OpenAI Language Models Part 1 With the previous article about natural language processing (NLP) and large language models (LLMs) as background, I\u2019d\u00a0\u00a0like to walk you through an example of using an LLM to&#8230;<\/p>\n","protected":false},"author":1,"featured_media":13784,"comment_status":"closed","ping_status":"open","sticky":false,"template":"elementor_header_footer","format":"standard","meta":{"_siteseo_robots_primary_cat":"4","pagelayer_contact_templates":[],"_pagelayer_content":"","footnotes":""},"categories":[27],"tags":[],"class_list":["post-13781","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-legal-tech"],"_links":{"self":[{"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/posts\/13781","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/comments?post=13781"}],"version-history":[{"count":13,"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/posts\/13781\/revisions"}],"predecessor-version":[{"id":13804,"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/posts\/13781\/revisions\/13804"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/media\/13784"}],"wp:attachment":[{"href":"https:\/\/prizmlaw.com\/site\/wp-json\/
wp\/v2\/media?parent=13781"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/categories?post=13781"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/prizmlaw.com\/site\/wp-json\/wp\/v2\/tags?post=13781"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}