I have several goals in posting this material:
- Documenting ongoing work in using machine-learning methods to model, and make predictions with, the text of patent claims, patent-related metadata, and patent-litigation data.
- Examining the internals of machine-learning models, as preparation for expected changes in software examination for IP litigation, and for the socially more important fields of “interpretable AI” and “explainable AI” (XAI).
- As an extended example of using AI to learn and do AI: ChatGPT has been generating working Python code for me (I’ve been much more of a C/C++ and awk person, but standard deep-learning libraries are in Python), helping me debug and fix problems (due to both ChatGPT’s own oversights and my own), and helping me extend the code.
- e.g., “Hey ChatGPT, I’ve got a C program that inputs a text file with (x,y) pairs of numbers (possibly many). The program uses a standard formula for linear regression to determine the closest function f, where y=f(x). I want you to do the same thing, also in C/C++, but using a neural network that will learn the function f, rather than explicitly employing regression to derive the formula. Use as few external libraries as possible, so I can understand what I’m looking at in a single file. I’ll want you to walk me through improving its accuracy (especially its ability to extrapolate, not merely interpolate, reasonable y values for new x values), and I’ll later want to see a Python equivalent using machine-learning libraries. But first give me something short and simple in C.” (My hope, though not a realistic expectation, was to somehow “see” the slope and intercept of the formula f among the weights/thresholds of the trained neural net; a minimal sketch of the degenerate case appears after this list.)
- As an opportunity to marvel at the capabilities of a system like ChatGPT: The basic operation of systems like ChatGPT, Claude, and (to a slightly lesser extent) Google NotebookLM and Gemini is often shockingly good. In the sessions summarized and posted here, ChatGPT (including, but not limited to, “reasoning” models) gives the appearance of understanding my sometimes very informal input, and produces reasonable (sometimes excellent) explanations, mini-tutorials, and “brainstorming,” with what appears to be substantially less BS than when I first started using these systems.
- By “BS”, I mean not simply “hallucinations,” but what used to be an almost daily occurrence of AI “idiot savant” moments in which ChatGPT or Claude would produce remarkably good code or explanations, buried in which would be some bone-headed mistake showing that it literally had no idea what it was talking about — even as the other 95% of what it was saying was brilliant, dealing with various nuances I hadn’t even thought about much less asked it to address. In the last month or so, I’ve been seeing much less of this “idiot savant” behavior.
- Though twice recently, ChatGPT has suddenly reverted to dumb-ass mode on minor issues, like whether its CSV row numbers are 0- or 1-based. I’ve called it out on these, it has then (seemingly in response to my criticism) promptly crashed, and it later turned out that ChatGPT itself had gone down.
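[To illustrate the kind of transparency I was hoping for in that linear-regression experiment: in the degenerate case where the “network” is a single linear neuron with no hidden layer, the learned weight and bias literally are the regression slope and intercept. A minimal sketch in Python/NumPy, illustrative only and not the C code from the session:]

```python
# A single linear "neuron" y = w*x + b, trained by gradient descent on MSE.
# With no hidden layer, the trained weight w and bias b ARE the slope and
# intercept -- the interpretability that vanishes once hidden layers appear.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, 200)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, 200)   # ground truth: slope 3, intercept 2

w, b, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    err = (w * x + b) - y          # prediction error
    w -= lr * (err * x).mean()     # gradient of MSE w.r.t. w
    b -= lr * err.mean()           # gradient of MSE w.r.t. b

print(f"learned slope={w:.3f}, intercept={b:.3f}")   # ~3.0 and ~2.0
```

[Once there is even one hidden layer, the slope and intercept get smeared across many weights, which is precisely why the interpretability techniques discussed below matter.]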
Given a lengthy document containing summaries generated by ChatGPT o1 of individual chat sessions, Gemini in Google Drive provided the following useful overview of this material:
This document contains notes and summaries from multiple ChatGPT sessions focused on analyzing patent data and litigation, with a strong emphasis on natural language processing (NLP) techniques. Key themes include:
- Data extraction and processing: Sessions involved extracting patent information from various sources, including PDFs, CSVs, and APIs (PatentsView, USPTO). Code snippets for parsing and cleaning data using Python libraries like ‘csv’, ‘re’, and ‘xml.etree.ElementTree’ are included.
- Embeddings and Clustering: The notes discuss using pre-trained language models (like PatentSBERTa and Sentence Transformer) to generate embeddings for patent claims. These embeddings were then used for clustering (KMeans, HDBSCAN) and dimensionality reduction (PCA, UMAP, t-SNE) to identify similar patents and potential gaps in the technology landscape.
- Classification: The sessions explored using embeddings and neural networks to classify patents into USPC or CPC classes. Challenges like class imbalance and low accuracy were addressed, and techniques like oversampling and hierarchical classification were considered.
- Litigation Analysis: The goal was to identify plaintiffs and defendants in patent litigation data and count their appearances. This involved handling duplicate entries and associating litigation data with CPC codes.
- Code Debugging and Tooling: The document includes troubleshooting sessions for Python debuggers (trepan3k, pudb) and discussions on handling library dependencies in WSL.
- Brainstorming and Conceptual Discussions: The notes capture brainstorming sessions on generating new patent claims, using masked language models to assess claim novelty, and exploring the potential of sentence-level vector arithmetic for claim analysis.
Overall, the document provides a comprehensive overview of the challenges and solutions explored in applying NLP and machine learning techniques to patent data analysis and litigation research.
[Apart from the bit about “comprehensive,” and over-emphasis on some boring aspects of data extraction/processing, this seems like a reasonable summary of the material.]
[Clicking on a Gemini link for “Main points” yielded a slightly different overview of the ChatGPT sessions, including:]
- Embeddings and Models: Computing and interpreting embeddings from patent claims using models like PatentSBERTa, and exploring dimensionality reduction techniques (PCA, UMAP) and clustering algorithms (KMeans, HDBSCAN) to analyze and visualize the embeddings.
- CPC Codes and Classification: Linking patents to Cooperative Patent Classification (CPC) codes [see also CPC Y tags re: new tech for climate change mitigation] and investigating the relationship between CPC codes and litigation outcomes.
- Litigation Data Integration: Incorporating external litigation data to analyze which patents are litigated and how this relates to CPC classes.
- Patent Claim Analysis: Extracting and analyzing patent claims, including comparing claims to technical documents and identifying similarities and differences.
- Neural Networks: Building and training neural networks for text classification and regression tasks, and exploring the challenges of implementing neural networks in different programming languages (e.g., AWK). [Yeah, that didn’t work too well.]
- Code and Debugging Practices: Discussing code optimization, debugging techniques, and the use of libraries like `csv`, `pandas`, `requests`, `BeautifulSoup`, and `rapidfuzz`. [rapidfuzz and fuzzywuzzy for normalizing plaintiff and defendant names in PTO database of patent litigation; BeautifulSoup for retrieving and reorganizing text descriptions of CPC classes]
- Challenges and Brainstorming: Identifying and addressing challenges such as low accuracy, class imbalance, memory constraints, and API limitations, and brainstorming potential solutions and future research directions.
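[For concreteness, the core embeddings-plus-clustering pipeline that recurs throughout these sessions looks roughly like the following. This is an illustrative sketch, not code from the sessions; it assumes PatentSBERTa’s Hugging Face model ID is `AI-Growth-Lab/PatentSBERTa`, and the toy claim list stands in for the real datasets:]

```python
# Sketch: embed patent claims, cluster the embeddings, and project to 2-D.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import umap  # pip install umap-learn

claims = [
    "1. A method for transmitting data over a wireless network, comprising...",
    "1. A pharmaceutical composition comprising a therapeutically effective...",
    # ...many more claim texts in practice
]

model = SentenceTransformer("AI-Growth-Lab/PatentSBERTa")
emb = model.encode(claims, batch_size=32, show_progress_bar=True)  # (n_claims, dim)

labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(emb)
xy = umap.UMAP(n_components=2, random_state=0).fit_transform(emb)  # 2-D for plotting
```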
I’ll periodically post links to these ChatGPT sessions, plus Python code and output. The sessions below are in roughly reverse chronological order:
Reverse engineering AI models
[Why care about this? Software-examination methods (including source-code examination and reverse engineering) will need revision to examine AI models, in light of cases such as NY Times v. OpenAI. See article “Why The New York Times’ lawyers are inspecting OpenAI’s code in a secretive room”, suggesting that the source-code examination under protective order in this case is something almost unprecedented (whereas it happens all the time in software-related IP litigation) — BUT note that what source-code examiners are looking for is likely to change significantly when the accused system in IP litigation is AI-based, and note “Interpretable AI”.]
- PCA (Principal Component Analysis)
- t-SNE (t-Distributed Stochastic Neighbor Embedding)
- UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction)
- Clustering
- Activations
- Saliency maps
- Feature-attribution
- Captum (model interpretability for PyTorch)
- FeatureAblation
- FeaturePermutation
- IntegratedGradients
- Why can’t we use backpropagation weights to interpret model?
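[A sketch of the Captum workflow listed above, for feature attribution over a claim-embedding classifier. Illustrative only: the two-layer network and random input are stand-ins, not anything from the sessions:]

```python
# Integrated Gradients with Captum: per-dimension attributions, i.e. how much
# each embedding dimension pushed the model toward a given class -- which is
# NOT something you can read directly off the trained (backpropagated) weights.
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

net = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 2))
net.eval()

claim_emb = torch.randn(1, 768, requires_grad=True)  # stand-in for one claim embedding

ig = IntegratedGradients(net)
attr, delta = ig.attribute(claim_emb, target=1, return_convergence_delta=True)
print(attr.shape, float(delta))  # attr[0, i]: contribution of dimension i to class 1
```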
Embeddings, feature importance; analyzing models
- Dimension reduction
- PCA
- Embeddings != model weights (doh!)
- Feature importance
- “Censoring”, survival analysis, and lag/grace period when dealing with label data that doesn’t fully align with the time periods covered by training/embedding sets.
- Using PatentSBERTa pickle instead of recomputing embeddings
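[A sketch of the probing idea that runs through this session: load precomputed PatentSBERTa embeddings from a pickle (rather than recomputing them) and fit a simple classifier against an external label such as litigated/not-litigated. File and variable names here are hypothetical:]

```python
# Probing classifier: logistic regression from claim embeddings to a label.
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

with open("patent_embeddings.pkl", "rb") as f:   # hypothetical {patent_id: vector} pickle
    emb_by_id = pickle.load(f)

litigated_ids = set()  # fill with patent numbers drawn from the litigation dataset

ids = sorted(emb_by_id)
X = np.array([emb_by_id[i] for i in ids])
y = np.array([i in litigated_ids for i in ids], dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # how linearly "readable" is the label?
```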
I asked ChatGPT o1 for a summary of this lengthy session:
Andrew Schulman
Can you give me a summary with bullet points of major topics covered in this session? Including both stated goals and technology referenced (e.g. PCA, clustering, probing, dimensionality reduction, pickle files, embeddings vs. weights)
ChatGPT
Thought for a few seconds
Below is a summary of the major topics discussed and the technologies or methods referenced throughout this session:
- Data Format and Management:
- Loading and combining data from multiple CSV and PKL (pickle) files.
- Handling naming conflicts (e.g., files named `pickle.py` causing Python import issues).
- Filtering patents based on numeric-only IDs and ensuring consistent patent number ranges.
- Embeddings and Models:
- Working with embeddings generated by the PatentSBERTa model.
- Understanding that embeddings store semantic information in a distributed manner, making individual dimensions not directly interpretable.
- Using precomputed embeddings from `.pkl` files to skip expensive recomputation from original text.
- Feature Interpretation Techniques:
- Applying dimensionality reduction (e.g., PCA, UMAP) to visualize and explore high-dimensional embeddings.
- Discussing methods like probing classifiers (e.g., logistic regression) to find relationships between embeddings and external labels.
- Suggesting cluster analysis to reveal semantic groupings and correlate them with CPC codes or litigation outcomes.
- CPC Codes and Classification:
- Linking patents to CPC codes, both as 4-character standardized codes and more detailed subcodes.
- Handling binary CPC indicators from the PatentSBERTa dataset (0/1 columns).
- Exploring truncation of detailed CPC codes to their first four characters to align with the PatentSBERTa classification scheme.
- Litigation Data Integration:
- Incorporating external patent litigation data to study which patents are litigated and associating that information with CPC classes. [Currently using older (2003-2020) patent litigation dataset from USPTO]
- Addressing temporal alignment issues (patents may not appear litigated simply because they’re too new).
- Considering survival analysis or positive-unlabeled learning for more advanced temporal handling, but initially proceeding with simpler filters.
- Statistical Summaries and Probability Estimates:
- Computing the likelihood (e.g., fraction) of litigation within patents classified under a given CPC code.
- Combining training and test sets to get a larger data pool for analysis.
- Code and Debugging Practices:
- Moving from slow `.apply()`-based code to more vectorized solutions for efficiency.
- Printing subsets of results (e.g., up to 20 litigated patents per CPC class) for verification and debugging.
- Implementing `DEBUG` flags and thresholds to control output verbosity.
- Additional Tools and Enhancements:
- Using `pprint`, `json.dumps`, or Pandas operations for more structured output and better data inspection.
- Integrating CPC descriptions from a local dictionary file to understand the semantic meaning of each CPC code.
Overall, the session covered a broad range of topics: from low-level data wrangling and file handling to high-level interpretive techniques for understanding embeddings and CPC codes, as well as preliminary steps toward modeling litigation likelihood based on embeddings and classifications.
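[The per-CPC litigation fraction mentioned above reduces to a one-line groupby once a litigation flag has been joined in. Illustrative sketch; the column names are assumptions, not the session’s actual schema:]

```python
# Vectorized per-CPC litigation statistics (no slow row-by-row .apply()).
import pandas as pd

df = pd.DataFrame({
    "patent_id": [1, 2, 3, 4, 5],
    "cpc4":      ["G06F", "G06F", "H04L", "H04L", "A61K"],
    "litigated": [1, 0, 1, 1, 0],   # 0/1 flag joined in from the litigation data
})

stats = (df.groupby("cpc4")["litigated"]
           .agg(n_patents="size", n_litigated="sum", frac_litigated="mean")
           .sort_values("frac_litigated", ascending=False))
print(stats)
```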
Complete ChatGPT session transcript shared here
Measuring patent claim scope/broadness with TF-IDF
[Right now working with older data; will switch to current patents and patent-litigation data after proof of concept]
Andrew Schulman
This session is getting very long. A print to PDF takes 71 pages. It would be handy for me to start a new session. But I would lose all context from here if I did that, correct?
ChatGPT o1
Thought for a second
Yes, if you start a new session, this conversational history and the context contained within it won’t automatically carry over. You’d be starting fresh, and I wouldn’t have direct access to all the details we’ve discussed unless you explicitly summarize or restate them in the new session.
If you’d like to preserve the context, you could:
- Provide a summary: When you start a new session, begin by pasting in a concise summary of the key points, code snippets, and decisions made so far. This helps me quickly “catch up” without you having to restate everything at length.
- Linking external resources: If you have the ability to store and share notes or references externally (e.g., a GitHub Gist, a document, or just a short summary), you can reference those in the new session. While I can’t follow external links, you could copy the essential info from those resources.
- Incremental Approach: Start the new session by providing the most critical context and steps accomplished so far, and what you want to do next. I can then proceed with that as the new baseline.
This way, while not retaining the entire raw conversation, you can maintain continuity by giving me enough relevant details from this session to proceed effectively in the new one.
Me
Conversely, is there some limited window size in which you’re going to start dropping context from the beginning of this session?
ChatGPT
Thought for 4 seconds
Yes. There is a practical limit to how much of the conversation history I can retain at once. I operate with a fixed context window, and when that limit is reached, the earliest parts of the conversation are truncated. As this session grows longer, I rely on progressively fewer of the earliest details you provided, eventually losing them from immediate reference.
If you need to preserve important details from early in the conversation, it’s best to restate or summarize them before the context window fills up. This ensures that the most crucial points remain accessible even as the session continues.
Me
Can you right now give me a summary of what you see as key points? I can easily take the relevant python code and put it on GitHub Gist, but can you now dump out in a bullet list the key points? Plus any “state” that it seems likely would be helpful for ChatGPT starting afresh with the task of taking the existing python script, and incorporating citations, assignee, and CPC info from the NBER .asc and Stata file (converted to .csv); and moving from TF-IDF to GloVe. [This attempt to share ChatGPT session “state” ended up not working. The new ChatGPT session, even in o1 pro mode, did terribly trying to continue the previous chat. However, the attempt did get me thinking about having ChatGPT produce summaries of each of our long sessions.]
ChatGPT
Thought for a second
Here’s a concise summary of the key points and current state of the project, as captured so far:
Project Overview:
- You have a large corpus of patent claims (31 GB from USPTO) and have been computing a rough measure of claim scope/broadness using TF-IDF [term frequency-inverse document frequency].
- Initially worked with the PatentSBERTa train/test CSV files (including CPC codes) and discovered issues with truncated claims and missing data. [Not meant as a knock against PatentSBERTa! The reference to missing data includes e.g. structures, sequences, formulas etc. whose contents in machine-readable claims are generally represented by e.g. ##STR, ##SPC, SEQ ID No., ACCT Deposit No., etc.; the specific means (implementation) in “means for” claim elements are further examples of a patent claim’s full contents not always being present within the four corners of a machine-readable claim text.]
- Moved on to the large PTO `patent_claims_fulltext.csv` dataset, filtering to only well-formed claims (ending with a period or `#`) and focusing on claim number 1 or those marked as independent.
- Computed TF-IDF scores for each claim by reading the dataset twice (first to compute DF and then to compute TF-IDF per claim).
- Identified and printed the least-common token (globally, from a DF perspective) for each claim.
- Adjusted logic to ensure that the “lowest TF-IDF” claims are correctly identified by changing the heap replacement condition from `<` to `>`.
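[A skeleton of the two-pass TF-IDF scheme summarized above; illustrative only — the real script streams the 31 GB file, filters to well-formed independent claims first, and used a different heap-replacement fix than the negation trick below. Filenames and column names are hypothetical:]

```python
# Pass 1 computes document frequencies; pass 2 scores each claim's average
# TF-IDF and keeps the K lowest-scoring ("broadest") claims via a heap.
import csv, heapq, math, re
from collections import Counter

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

doc_freq, n_docs = Counter(), 0
with open("claims.csv", newline="") as f:            # hypothetical filename
    for row in csv.DictReader(f):
        doc_freq.update(set(tokens(row["claim_text"])))
        n_docs += 1

K, heap = 20, []   # heap of (-score, patent_id): heap root = highest kept score
with open("claims.csv", newline="") as f:
    for row in csv.DictReader(f):
        tf = Counter(tokens(row["claim_text"]))
        if not tf:
            continue
        total = sum(tf.values())
        score = sum((c / total) * math.log(n_docs / doc_freq[t])
                    for t, c in tf.items()) / len(tf)
        heapq.heappush(heap, (-score, row["patent_id"]))
        if len(heap) > K:
            heapq.heappop(heap)      # evict the current highest score

for neg_score, pid in sorted(heap, reverse=True):    # lowest TF-IDF first
    print(pid, -neg_score)
```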
Data Integration Plans:
- You have access to NBER datasets (pat76_06_assg.asc and related files) providing:
- Citation counts (`allcites`).
- Assignee information (`pdpass`, linking to another file with assignee names).
- Classification data (older classification systems, possibly something akin to CPC or IPC codes).
- Originality and generality measures from a Stata `.dta` file, which can be read into Python via `pandas.read_stata()` and then converted to CSV. [The originality/generality measures
- The idea is to merge this NBER-based metadata with the TF-IDF results on a per-patent basis. For example:
- Use `patent` numbers as keys to join with the TF-IDF computed claims.
- Integrate `allcites` or other citation-based metrics to see if broadness (low TF-IDF) correlates with citation behavior.
- Incorporate assignee data to test if certain assignees produce systematically broader or narrower claims.
- [This refers to the National Bureau of Economic Research (NBER)’s older patent statistics, related to the book Patents, Citations, and Innovations: A Window on the Knowledge Economy by Adam Jaffe and Manuel Trajtenberg]
Technical Points:
- You have Python scripts that:
- Compute TF-IDF on the large PTO claims file.
- Filter claims based on ending punctuation and independence.
- Print out results, including average TF-IDF and least-common tokens.
- You’ve learned how to handle large `.asc` files and `.dta` files:
  - `.asc` files can be read with `pandas` (likely `delim_whitespace=True` or a known delimiter).
  - `.dta` files can be read using `pd.read_stata()` and then converted to CSV.
- The code currently cleans punctuation for tokenization and uses a heap to track the lowest TF-IDF claims.
Next Steps:
- Integrate NBER data:
- Read the `pat76_06_assg.asc` file to get patent-level metadata (assignees, citation counts).
- If available, read the CPC or similar classification codes from the NBER data (or from other sources) to correlate TF-IDF broadness with technology fields.
- Convert the `.dta` file with originality/generality measures to CSV, and join with patent data on the `patent` field.
- Consider how mixing datasets affects TF-IDF:
- Combining multiple datasets changes global token frequencies, impacting TF-IDF scores.
- Might need to do separate TF-IDF computations or apply normalization techniques if mixing data sources.
Shifting From TF-IDF to GloVe:
- TF-IDF is purely frequency-based and doesn’t capture semantic relationships.
- You want to move toward GloVe [Global Vectors] or sentence embeddings (e.g., PatentSBERTa with SentenceTransformers) for more semantic insight.
- GloVe:
- Requires building a word co-occurrence matrix across the corpus and then training embeddings.
- Produces fixed word embeddings that can be averaged or combined to get a claim vector.
- Pros & Cons:
- GloVe captures semantic relatedness but loses direct interpretability of “broadness” that TF-IDF provided.
- You may still want to retain TF-IDF as a baseline or combine TF-IDF weighting with GloVe embeddings.
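[One simple way to get the “combine TF-IDF weighting with GloVe” hybrid suggested above: a TF-IDF-weighted average of word vectors as the claim vector. Illustrative sketch using gensim’s generic pre-trained GloVe vectors rather than patent-trained ones:]

```python
# Claim vector = IDF-weighted average of the GloVe vectors of its tokens.
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")   # generic 100-dim GloVe vectors

def claim_vector(claim_tokens, idf):
    """idf: dict mapping token -> IDF weight (from the TF-IDF pipeline)."""
    vecs, weights = [], []
    for t in claim_tokens:
        if t in glove:                 # skip out-of-vocabulary tokens
            vecs.append(glove[t])
            weights.append(idf.get(t, 1.0))
    if not vecs:
        return np.zeros(glove.vector_size)
    return np.average(vecs, axis=0, weights=weights)
```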
Recommended Future Steps:
- Summarize the existing Python code to handle large input files, filtering claims, computing TF-IDF, and printing results.
- Prepare scripts to:
- Load and merge citation data (from NBER .asc files).
- Load and merge originality/generality data (from `.dta` → `.csv` conversion).
- Integrate assignee data and possibly CPC-like classification fields.
- Experiment with building or using pre-trained GloVe embeddings from patent data (or from a large technical corpus) to represent claims, then compare or combine with TF-IDF approaches.
- Consider separate pipelines: one for TF-IDF “broadness” scoring and another for semantic similarity using embeddings (GloVe or SentenceTransformers).
Key Takeaway for a Fresh Start:
- Current code: large-scale TF-IDF computation and filtering on patent claims.
- Next: Enhance metadata integration (citations, assignees, classifications, originality/generality) and shift from TF-IDF-only to embedding-based methods (GloVe or PatentSBERTa) to improve semantic understanding and similarity measurement of claims.
Complete ChatGPT session transcript shared here (including more after the above summary, when it became clear that the attempt to share “state” from one ChatGPT session to another wasn’t going to work; I’ll soon ask for a summary of the new material since the previous summary)
Comparing patent claims to product data; arithmetic with patent-claim embeddings
- Comparing patent-embeddings to embeddings generated for software products (from older “CodeClaim” project)
- GloVe (Global Vectors) for word representation, supporting “king – man + woman = queen” vector arithmetic
- FAISS (Facebook AI Similarity Search) for approximate nearest-neighbor search
- “Holes” in embedding spaces; generating new claims in holes; see also “latent spaces”
- Variational encoders (VAEs)
- “Landscaping” patent virtual spaces
- Masked language modeling (MLM)
- Possibly using MLM to assess patent claim novelty/non-obviousness
- Linear arithmetic on embeddings; difference vectors
- Academic research on geometry of meaning, based on Gärdenfors “Conceptual Spaces” (see also Widdows, Geometry & Meaning)
- Academic research on applying “king – man + woman = queen” arithmetic to sentence embeddings? ChatGPT commentary on related work:
- Arora, Sanjeev, Yingyu Liang, and Tengyu Ma. “A Simple but Tough-to-Beat Baseline for Sentence Embeddings.”
International Conference on Learning Representations (ICLR), 2017.
This paper provides a theoretical framework and a baseline method for sentence embeddings. While it does not discuss analogies explicitly, it gives insight into how sentence embeddings capture meaning and why linear analogies may not hold as they do at the word level.
- Conneau, Alexis, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. “Supervised Learning of Universal Sentence Representations from Natural Language Inference Data.”
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017.
This paper introduces SentEval, a framework for evaluating sentence embeddings. Although analogy tests are common in word embedding evaluation, SentEval focuses on downstream tasks (e.g., sentiment analysis, classification, paraphrase detection), reflecting the field’s emphasis on different properties for sentence embeddings. [Note: this paper actually introduces the InferSent sentence encoder; the SentEval evaluation toolkit came in a separate Conneau & Kiela paper (2018).]
- Hill, Felix, Kyunghyun Cho, and Anna Korhonen. “Learning Distributed Representations of Sentences from Unlabelled Data.”
North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2016.
This work looks into unsupervised sentence representation learning. While it shows how sentence embeddings can capture semantics, it does not report linear analogies like at the word level.
- Ethayarajh, Kawin. “How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings.”
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019.
Although this paper focuses on contextual word embeddings rather than full sentences, it does discuss the geometry of embedding spaces. Understanding that these spaces often don’t maintain simple linear relations at higher granularity levels can shed light on why sentence-level analogies are rarely documented.
- Word-level Analogies:
- Sentence-level Embeddings and Evaluations:
- Arora et al. (2017): “A Simple but Tough-to-Beat Baseline for Sentence Embeddings” [see above]
- Conneau et al. (2017): “Supervised Learning of Universal Sentence Representations…” [see above]
- Cer et al. (2018): “Universal Sentence Encoder”
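[For reference, the word-level arithmetic itself is a one-liner with gensim’s pre-trained GloVe vectors; it is exactly this operation that has no well-documented analogue at the sentence/claim level:]

```python
# king - man + woman ~= queen, at the word-vector level.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' typically tops the list; nothing comparably clean is documented
# for whole-sentence embeddings such as patent claims.
```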
Andrew Schulman
Can you give me a summary with bullet points of major topics covered in this session? Include stated goals, challenges, technology referenced, major pieces of code written (including function names if particularly important or tricky), and brainstorming topics.
ChatGPT o1
Thought for a couple of seconds
Below is a bullet-point summary of the major topics and content covered throughout this session:
Stated Goals & High-Level Objectives:
- Transform large CSV data of patent claims and CPC codes into a more usable format.
- Compute embeddings (with models like PatentSBERTa) for patent claims and analyze their relationships.
- Measure similarity/distances between new input strings and existing patent claim embeddings for tasks like retrieval.
- Explore the idea of leveraging CPC codes and embeddings to find conceptual “holes” or gaps in the technology landscape.
- Brainstorm ways to generate new potential patent claims based on embeddings and identified gaps.
- Investigate the feasibility of sentence-level vector arithmetic (e.g., extending the “king – man + woman = queen” analogy to entire patent claims).
- Discuss masked-word prediction in claims to assess novelty, obviousness, and claim minimalism.
Challenges & Considerations:
- Handling extremely large datasets (over 1.4 million patent claims, 178 million product strings, and ~12+ million unique strings).
- Memory and resource constraints in computing and storing embeddings.
- The complexity of applying word-level analogy approaches to sentence-level embeddings.
- The non-linear and multi-dimensional nature of patent claim embeddings makes direct analogy arithmetic less straightforward.
- Assessing novelty, obviousness, and minimalism of claims using masked language modeling.
Technology & Methods Referenced:
- CSV and Python: Reading and transforming large CSV files, deduplicating strings across products.
- Python Libraries: `csv`, `numpy`, `faiss`, `sentence-transformers`, and occasionally `h5py` or `sqlite` for memory/disk management.
- Embedding Models: PatentSBERTa or other SentenceTransformer-based models for patent claim embeddings.
- FAISS: For similarity search (building indices, searching nearest neighbors).
- Masked Language Modeling (MLM): Using a BERT-like model to predict masked words and gauge predictability.
- Set Embedding/DeepSets: Concept mentioned as a potential advanced method to aggregate embeddings of multiple product strings.
- tf-idf weighting: Suggesting a weighting strategy to emphasize or deemphasize certain strings in product embedding aggregation.
Major Pieces of Code & Functions:
- Data Processing (CSV reading): Code snippets to read CSV files, parse patent IDs, dates, claims, and CPC codes into Python lists/arrays.
- Embedding Computation: Batch processing loops using `model.encode()` from SentenceTransformers to compute embeddings for claims in small batches.
- De-duplication & Mapping: Steps to create `string_to_id` and use memory-mapped arrays (`np.memmap`) for large-scale embedding storage.
- FAISS Indexing and Searching: Code to build FAISS indices (`IndexFlatIP`, `IndexIVFFlat`), with `index.add()` and `index.search()` to retrieve top-k similar embeddings.
- Masked Word Prediction: Sketching code to mask selected words in a claim and use a masked language model to compute prediction probabilities.
- While no specific function names were singled out as tricky, the described operations included functions for masking and prediction (e.g., `model.encode()` for embeddings, hypothetical `mlm.predict_mask()` calls if using a masked LM).
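[The FAISS pattern above, as an illustrative sketch (random vectors stand in for the real claim embeddings):]

```python
# Exact cosine-similarity search over claim embeddings with FAISS.
import faiss
import numpy as np

emb = np.random.rand(10000, 768).astype("float32")  # stand-in for claim embeddings
faiss.normalize_L2(emb)                  # after L2-normalization, inner product = cosine

index = faiss.IndexFlatIP(emb.shape[1])  # exact inner-product index
index.add(emb)

query = emb[:1].copy()                   # pretend this is a new input string's embedding
scores, ids = index.search(query, 5)     # top-5 nearest neighbors
print(ids[0], scores[0])
```

[At the 12+ million unique strings mentioned above, an exact `IndexFlatIP` scan gets slow, which is where the approximate `IndexIVFFlat` variant comes in.]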
Brainstorming Topics & Conceptual Discussions:
- Identifying Holes/Gaps in Embedding Space: Using density estimation, interpolations between claims, and CPC-based filtering to find conceptual white spaces in patent landscapes.
- Generating New Claims: Proposing methods like retrieval-augmented generation, using nearest neighbors in embedding space as anchors, and prompting a generative LLM to fill conceptual gaps.
- Patent Claim Arithmetic: Discussed the idea of “new_claim – closest_claim + something = ?” and acknowledged the complexity and non-linearity at the sentence level.
- Masked Word and Novelty/Obviousness: Using predictability of masked terms to gauge how surprising or novel certain claim elements might be, or to prune superfluous wording.
- Minimal Drafting: Considering how easily predicted tokens might be unnecessary and how to streamline claim language.
In essence, the session covered a wide range of topics from large-scale data processing and embedding computation to conceptual and experimental ideas about applying embedding arithmetic, detecting novelty via MLM, and generating new patent claims.
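[A sketch of the masked-word-predictability idea, using a generic BERT model via the transformers fill-mask pipeline; a patent-tuned MLM would be the realistic choice:]

```python
# If the model can confidently predict a masked word of a claim, that word is
# unsurprising (and perhaps superfluous); low predictability may hint at novelty.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
claim = "A method comprising transmitting a [MASK] over a wireless network."
for pred in fill(claim, top_k=3):
    print(f"{pred['token_str']:>12}  p={pred['score']:.3f}")
```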
Complete ChatGPT session shared here
Writing a cgrep (multiline context grep) program in awk and Python; ChatGPT explains itself
[This is an earlier session from June 2024. In an email to a friend who is also a software litigation consultant, I noted: “I only minimally tested the awk, not the Python. And I haven’t tested the modifications I asked for at the end, but they look reasonable. Note how informally one can specify code modifications. I’d be shocked if there wasn’t something bone-headed in here somewhere, but generally it did a nice job. Note that it ‘knew’ to special-case the last block even if it doesn’t end with a delimiter. Also reminded me to quote regexes that contain spaces. Clearly it’s got a lot of experience with this sort of thing, but I think my request was phrased unusually enough (especially with modifications asked for later) that it’s not like it could just pull out a pre-canned answer. There’s a little proof at the end [a silly poem I asked ChatGPT to write] that it hasn’t just pulled prewritten stuff out of its butt…. Tomorrow I’ll put the final version in front of it again (the code, not the poem), and say ‘Hey, what the heck happens if a block is longer than awk’s or python’s max strlen?!’ and see how it reacts. Hard to believe that, for better or worse, developers aren’t going to be specifying code in this way fairly soon (and probably not sanity checking the results, or even asking it to insert self-tests). Will that impact our source-code reviews? Are LLMs going to be generating patent-infringing code in ways that human developers are even less aware of, than they are today?”]
Me
Can you give me a summary with bullet points of major topics covered in this session? Include stated goals, challenges, technology referenced, major pieces of code written (including function names if particularly important or tricky), and brainstorming topics.
Summary of Topics Covered in This Session
Stated Goals
- Write a program to search for a regex in multi-line text blocks, separated by a specified delimiter, in both `awk` and Python.
- Modify the code to include the delimiter line at the beginning of each matching block.
- Ensure the code can handle scenarios where the regex appears only after processing many lines in a block.
- Explore limitations, such as handling extremely large blocks that exceed maximum string length.
- Discuss self-tests to verify code correctness when working with LLM-generated outputs.
- Understand the process and rationale behind how ChatGPT generates logical code.
Challenges Addressed
- Handling large multi-line text blocks efficiently in both `awk` and Python.
- Avoiding exceeding maximum string lengths by using arrays (in `awk`) or lists (in Python).
- Ensuring accurate matching of delimiters, specifically when delimiters must appear at the start of a line and contain at most three tokens.
- Testing and verifying LLM-generated code for correctness to avoid hidden errors.
- Understanding the relation between ChatGPT’s training and its ability to generate code and logical explanations.
Technology Referenced
- `awk`:
  - Used to accumulate lines of a block, process blocks based on a delimiter, and search for regex matches.
- Python:
  - Used to achieve the same functionality as the `awk` implementation, with lists for block storage.
  - Python libraries referenced: `re` (regular expressions), `sys` (command-line arguments and I/O redirection).
- Testing Approaches:
  - Shell scripts for self-tests in `awk`.
  - Python `unittest` framework for automated tests of the Python implementation.
Major Pieces of Code Written
- Original `awk` Script:
  - Accumulated lines of text into a single string for each block.
  - Used a `processBlock()` function to match the regex in the block and print matching blocks.
- Modified `awk` Script:
  - Added logic to include the delimiter line at the start of the block.
  - Refactored to use an array (`block_lines`) for handling large blocks.
- Python Implementation:
  - Used lists to accumulate block lines for efficient handling of large text.
  - Key function: `process_block(block_lines, delimiter_line, regex)`, which constructs and evaluates the block for regex matches.
- Testing Scripts:
  - Shell script for testing the `awk` implementation.
  - Python `unittest` tests to validate block processing and regex matching.
Brainstorming Topics
- Handling of blocks exceeding maximum string lengths in `awk` and Python:
  - Using arrays in `awk` and lists in Python for incremental processing.
- The role of self-tests when working with LLM-generated outputs:
  - Methods to create self-tests for `awk` and Python scripts.
  - Benefits of test-driven development (TDD) with LLMs.
- Discussing ChatGPT’s ability to generate code:
- The relationship between training on next-token prediction and emergent logical abilities.
- Whether ChatGPT’s reasoning process is specific to a task or general to all LLMs.
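[For comparison with the session’s code, here is a compact Python sketch of the cgrep idea, written after the fact (illustrative; not the session’s implementation): blocks are separated by lines whose start matches a delimiter regex; a matching block is printed together with its delimiter line; and the last block is handled even without a trailing delimiter:]

```python
#!/usr/bin/env python3
# Usage: cgrep.py DELIM_REGEX MATCH_REGEX < file
import re
import sys

def cgrep(stream, delim_regex, match_regex):
    delim, pat = re.compile(delim_regex), re.compile(match_regex)
    header, block_lines = None, []          # a list, not one big string (max-strlen worry)

    def flush():
        if block_lines and any(pat.search(ln) for ln in block_lines):
            if header is not None:
                sys.stdout.write(header)    # include the delimiter line with its block
            sys.stdout.writelines(block_lines)

    for line in stream:
        if delim.match(line):               # delimiter must appear at start of line
            flush()
            header, block_lines = line, []
        else:
            block_lines.append(line)
    flush()                                 # special-case the final block

if __name__ == "__main__":
    cgrep(sys.stdin, sys.argv[1], sys.argv[2])
```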
Complete ChatGPT session shared here (including ChatGPT answering several questions about its thought process)
More coming soon.