Example of “brainstorming” with ChatGPT

One example conversation, held while a Python script that ChatGPT had written for me (to identify broad patent claims, in part using least common words [LCW] per patent class [CPC] in patent claim texts) was running; see the shared session.

Andrew Schulman
Okay, that’s running. Some things I’d like you to brainstorm with me on, give me pros/cons or alternatives:

1. This is becoming helpful in starting to find broad claims, and that’s good. But if for example there were CPC classifications to go with each patent number (as we had with the PatentSBERTa csvs), then we could add some testing to see if the claim scope/weight as determined with TF-IDF on the one hand aligns with CPC on the other. I may be able to get CPC codes for these; well, of course I can, but it may be time-consuming and/or expensive right now. So let’s assume we don’t have them. What else could each patent/claim be compared against? What other predictions might be made from the TF-IDF score, predictions that could then be checked?

2. One other potential external attribute is whether the patent was litigated. In other sessions, we’ve worked with the PTO’s litigation dataset. For better or worse, there are only about 50,000 or 60,000 different patents in the particular litigation dataset I have. And the time period it covers doesn’t necessarily overlap cleanly with the time period here; without that overlap, it would be difficult to tell for any given patent that it hasn’t been litigated, only that it has. ChatGPT and I toiled over this in another session. You came up with some reasonable solutions, but basically it’s a small dataset.

3. In addition to the 31 GB patent_claims_fulltext.csv file we’re using here, there’s a related 28 GB pgpub_claims_fulltext.csv file. Perhaps use that too, not merely as a source of yet more patent claims, but to exploit the fact that these pgpub ones were not necessarily granted. Might there be a way to use these together with the list of granted patents, e.g. use what we know about granted patents to “predict” whether a given patent application would be likely to be granted? Well, sure, but that is probably a job for something more than these TF-IDF scores. More likely a job for SentenceTransformers and a neural network trained to predict likelihood of being granted. But leaving that aside, is there something meaningful to be done with the pgpub claims, besides just adding a ton of additional claims?

4. We’re using TF-IDF. Is it time to move to e.g. GloVe? Dumb question in a way, since it all depends on what I’m trying to do with this data. Basically, it’s to see whether there is some simple, predictable, useful way to determine the scope of a given claim against the universe of patent claims. This might be used for example in patent-portfolio ranking. Given this as one goal, give me pros/cons on TF-IDF vs. GloVe. One difference we’ve discussed in other sessions is interpretability or explainability, with TF-IDF winning out there.

5. Another big goal is, given a patent claim, find similar claims. That is mostly a job for sentence embeddings and probably SentenceTransformer, but I can already see in the partial output that sorting by TF-IDF score tends to group together patent claims that might be considered as having been “separated at birth”. I might want a loop that lets the user input a claim (either by text or by patent/claim number), and shows the 5 or 10 claims with the closest scores. But somehow TF-IDF doesn’t sound suitable for that. GloVe? SentenceTransformer?

6. Finally, is there something that could be done with a [MASK] of selected tokens, to see if they can be predicted? We discussed this a bit in another session. For example, if the least-common token in a claim is [MASK]ed, can we predict it? If so, that might be indicia of it not being particularly novel/non-obvious (using those terms informally, not as the legal terms of art). Okay, that’s a lot for you to chew on…

ChatGPT
Thought for a second
Below are some thoughts, pros/cons, and ideas for each of your points. This is a brainstorming exercise, so not all suggestions will be equally feasible or directly implementable, but they’re meant to help you consider your next steps.

1. Without CPC codes, what else can we compare claim scope/weight against?

  • Comparable Classifications:
    • If CPC codes are unavailable, you might consider other classification systems (e.g., USPC classes, IPC codes) if you have them. They may not be as fine-grained as CPC, but they offer a way to anchor claims in a technology domain.
  • Citation Network:
    • Patents cite prior patents. If you have citation data, you can measure how “broad” claims relate to the number or diversity of citations. Broader claims might cite fewer and more fundamental patents, or conversely, might be connected to a wide and diverse citation network (see the sketch after this list).
  • Temporal Data:
    • Look at the era a patent was filed. Broader claims might have been more common in older patents (before examination standards tightened) or vice versa. Comparing claim breadth over time could validate whether your TF-IDF measure correlates with known historical shifts in patent prosecution.
  • Prosecution Data:
    • If available, consider characteristics like the number of office actions, rejections, or amendments. A broad claim might have undergone many rounds of narrowing. If you can get that data, you might see a correlation.
  • Family Size or Geographical Coverage:
    • The size of the patent family or how widely it’s filed internationally might correlate with claim breadth. Companies tend to seek broad protection internationally. If you have family and jurisdictional data, you could look for correlations.
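
As a rough sketch of the citation-network comparison above: given a per-patent broadness score from the TF-IDF run and a table of forward-citation counts, a rank correlation is a quick first test. The file names and columns below (claim_scores.csv, citations.csv, tfidf_score, forward_cites) are placeholders, not the actual NBER layout.

```python
# Sketch only: correlate TF-IDF-based claim "broadness" with forward-citation counts.
# claim_scores.csv and citations.csv, and their column names, are hypothetical.
import pandas as pd
from scipy.stats import spearmanr

scores = pd.read_csv("claim_scores.csv")   # columns: patent_id, tfidf_score
cites = pd.read_csv("citations.csv")       # columns: patent_id, forward_cites

# Collapse to one row per patent, e.g. the mean score across its claims.
per_patent = scores.groupby("patent_id")["tfidf_score"].mean().reset_index()

merged = per_patent.merge(cites, on="patent_id", how="inner")
rho, p = spearmanr(merged["tfidf_score"], merged["forward_cites"])
print(f"Spearman rho = {rho:.3f} (p = {p:.3g}) over {len(merged)} patents")
```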

2. Using Litigation Data

  • Partial Overlap / Sparse Data:
    • Even if the litigation dataset is small and covers a limited set of patents, you can still look for patterns. For example, do litigated patents have any correlation with TF-IDF-based broadness measures? Even a small correlation or directionality might be informative.
  • Enrichment via Negative Sampling:
    • If litigated patents are relatively rare, select a random sample of non-litigated patents from the same era and see if the TF-IDF scores differ in aggregate (see the sketch after this list). While you won’t be sure every non-litigated patent truly had zero litigation risk, a rough comparison might still yield useful insights.
  • Focus on Certain Tech Areas:
    • If you know which tech fields are well-represented in your litigation set, you can focus on that subset to get a better controlled comparison.
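
A minimal sketch of that negative-sampling comparison, assuming hypothetical claim_scores.csv and litigation.csv files keyed by patent number:

```python
# Sketch only: compare TF-IDF broadness of litigated patents against an
# equal-sized random sample of non-litigated patents. File/column names assumed.
import pandas as pd
from scipy.stats import mannwhitneyu

scores = pd.read_csv("claim_scores.csv")                    # patent_id, tfidf_score
litig_ids = set(pd.read_csv("litigation.csv")["patent_id"])

lit = scores[scores["patent_id"].isin(litig_ids)]
nonlit = scores[~scores["patent_id"].isin(litig_ids)].sample(n=len(lit), random_state=42)

stat, p = mannwhitneyu(lit["tfidf_score"], nonlit["tfidf_score"], alternative="two-sided")
print("litigated median:", lit["tfidf_score"].median())
print("non-litigated median:", nonlit["tfidf_score"].median(), " p =", p)
```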

3. Using pgpub_claims_fulltext.csv Data

  • Applications vs. Granted Patents:
    • One angle: Compare the distribution of TF-IDF scores (or any broadness measure) in granted patents vs. published applications. Are granted patents generally “broader” or “narrower” as measured by TF-IDF?
    • If you had a dataset of applications that eventually granted vs. those that didn’t, you might see if breadth correlates with likelihood of grant. Without that info, you could still compare distributions of claim scopes between granted and published-only sets.
  • Temporal or Procedural Insights:
    • PGPub claims are often amended or narrowed before grant. Comparing an application claim’s initial TF-IDF (broadness) to its final granted form (if you can match them) might show how claims evolve through prosecution.
  • Additional Baseline Data:
    • At the very least, pgpub claims provide a broader corpus for building a more stable global vocabulary and IDF statistics. More data could mean more stable and representative IDF values.
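
On that last point, a minimal sketch of using the pgpub claims only to stabilize the IDF statistics while scoring just the granted claims. The claim_text column name and the nrows cap are assumptions; the real files are roughly 30 GB each and would need chunked reading in practice.

```python
# Sketch only: learn IDF over grants + pgpubs, then score the granted claims alone.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

grants = pd.read_csv("patent_claims_fulltext.csv", usecols=["claim_text"], nrows=100_000)
pgpubs = pd.read_csv("pgpub_claims_fulltext.csv", usecols=["claim_text"], nrows=100_000)

vectorizer = TfidfVectorizer(lowercase=True)
vectorizer.fit(pd.concat([grants, pgpubs])["claim_text"].dropna())   # IDF over the union
grant_tfidf = vectorizer.transform(grants["claim_text"].dropna())    # score grants only
print(grant_tfidf.shape)
```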

4. TF-IDF vs. GloVe for Determining Claim Scope

  • TF-IDF Pros:
    • Very interpretable: you know which terms contribute to breadth vs. narrowness.
    • Easy to implement and scales well with large text.
  • TF-IDF Cons:
    • Ignores context. “Platform” could mean something different in a software vs. mechanical context, but TF-IDF treats it uniformly.
    • Frequency-based: rare but semantically unimportant words can still make a claim look “narrow.”
  • GloVe (or other static word embeddings) Pros:
    • Captures semantic relationships between words: words used in similar contexts have similar vectors.
    • Potentially reduces emphasis on mere rarity and focuses more on semantic content.
  • GloVe Cons:
    • Less interpretable: harder to say why a claim is broad or narrow.
    • Still context-agnostic at sentence level (words have only one embedding).
    • Requires a large corpus and training time. Pre-trained GloVe might not reflect patent-specific jargon well.

For scope determination, TF-IDF’s clarity can be very beneficial. GloVe could help cluster semantically similar claims, but it might not directly correspond to the legal concept of breadth.
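
The interpretability difference shows up immediately in code. A minimal sketch, where the two toy claims are made up and the pre-trained glove-wiki-gigaword-100 model (via gensim’s downloader) stands in for a patent-specific embedding:

```python
# Sketch only: TF-IDF exposes the terms driving a score; an averaged GloVe vector
# is just a list of opaque floats.
import numpy as np
import gensim.downloader as api
from sklearn.feature_extraction.text import TfidfVectorizer

claims = ["a method comprising transmitting a packet over a wireless network",
          "an apparatus comprising a polymer-coated stent"]

vec = TfidfVectorizer()
X = vec.fit_transform(claims)
terms = np.array(vec.get_feature_names_out())
weights = X[0].toarray().ravel()
print("top TF-IDF terms, claim 0:", terms[weights.argsort()[::-1][:5]])  # human-readable

glove = api.load("glove-wiki-gigaword-100")    # downloads the model on first use
tokens = [t for t in claims[0].split() if t in glove]
claim_vec = np.mean([glove[t] for t in tokens], axis=0)
print("averaged GloVe vector:", claim_vec.shape)  # 100 floats, not human-readable
```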

5. Finding Similar Claims

  • TF-IDF for Similarity:
    • You could represent claims as TF-IDF vectors and compute cosine similarity. This would find claims with similar token distributions. It’s a baseline approach and might group similar technologies reasonably well, but will fail to understand synonyms or paraphrases.
  • GloVe for Similarity:
    • A bit better than TF-IDF alone because embeddings capture some semantic similarity. Still, averaging word vectors loses sentence structure and nuance.
  • SentenceTransformer (Contextual Embeddings) for Similarity:
    • Likely the best approach for semantic similarity. Models trained on patent texts (like PatentSBERTa) can handle patent-specific language.
    • Produces embeddings at the sentence/claim level, capturing synonyms and context.
    • You can easily find top-5 or top-10 similar claims by cosine similarity in embedding space.
    • Less interpretable, but much better semantic matching.

If similarity search is a key goal, a SentenceTransformer-based approach seems the most promising.
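
A minimal sketch of that top-k lookup with sentence-transformers. The model ID AI-Growth-Lab/PatentSBERTa is my assumption for the published PatentSBERTa checkpoint, and the corpus claims are toy examples; in practice the corpus embeddings would be precomputed over the PTO claims.

```python
# Sketch only: find the claims most similar to a query claim by cosine similarity
# in SentenceTransformer embedding space.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("AI-Growth-Lab/PatentSBERTa")   # assumed checkpoint name

corpus = ["a method comprising transmitting a packet over a wireless network",
          "an apparatus comprising a polymer-coated stent",
          "a system for routing data packets between wireless nodes"]
corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

query = "a method for sending data frames across a radio link"
query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for h in hits:
    print(f"{h['score']:.3f}  {corpus[h['corpus_id']]}")
```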

6. Using a [MASK] for Prediction

  • Idea:
    • If you mask the least-common token and ask a language model (like a BERT-based model) to predict it, the model’s confidence could indicate how predictable (and thus less unique) the token is.
  • Potential Insight:
    • If even a rare token is easy for the model to guess from context, maybe it’s not truly “novel” in a meaningful sense. A token that the model finds hard to predict might be genuinely unique or out-of-distribution, suggesting more uniqueness or narrowness.
  • Caveats:
    • Language models trained on general corpora may not understand patent-specific rare terms well.
    • The logic that “predictable = not novel” is heuristic. Many tokens are predictable due to syntactic patterns, not conceptual commonness. Conversely, some genuinely novel terms might be guessable by a model that has seen similar morphology or contexts.

Still, this could be an interesting experiment: measure how often a masked token can be correctly inferred. It might provide a “semantic predictability” measure alongside TF-IDF-based rarity.
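
A minimal sketch of the masking experiment, using generic bert-base-uncased through the transformers fill-mask pipeline; a patent-pretrained masked LM would be a better fit, and the claim text and its “least-common token” below are made up for illustration.

```python
# Sketch only: mask a claim's least-common token and check whether a masked-language
# model recovers it among its top guesses.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

claim = "a catheter comprising a hydrophilic coating disposed on an elongate shaft"
rare_token = "hydrophilic"                     # e.g., the claim's least-common token
masked = claim.replace(rare_token, fill.tokenizer.mask_token, 1)

preds = fill(masked, top_k=5)
recovered = any(p["token_str"].strip() == rare_token for p in preds)
print("top-5 guesses:", [p["token_str"].strip() for p in preds])
print("least-common token recovered in top 5:", recovered)
```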

Summary of Brainstorm:

  • Without CPC, look at other metadata (citations, time, family) or classification systems, or the textual evolution of claims.
  • Litigation data is sparse, but partial insights might still emerge; consider careful sampling.
  • pgpub claims could be used as a reference set or to compare granted vs. non-granted claim breadth.
  • TF-IDF is straightforward and interpretable, while GloVe could add some semantic nuance but at the cost of interpretability. SentenceTransformers are best for similarity tasks.
  • Masking tokens and testing predictability by a language model could offer another dimension of “uniqueness” or “broadness,” though it’s a heuristic and requires careful interpretation.

All these considerations depend heavily on your end goals—whether you prioritize interpretability, semantic understanding, or practical application like predictive modeling.

Andrew Schulman

This is exactly the sort of commentary I was hoping for! A few follow-ups:

1. You’re absolutely right to refer to citations. I have NBER (Nat’l Bureau of Econ Research) data listing from-citations and to-citations for patents, roughly in the same time period. This is data that was used in the 2002 book “Patents, Citations, and Innovations: A Window on the Knowledge Economy” by Jaffe and Trajtenberg. I think there’s an update to the data files, and even if not, at least some of it will overlap. There is the usual problem that citations-to are a lagging and noisy indicator. But you’re absolutely right, this would be a very interesting comparison. In both directions: are citations-to good predictors of claim scope? Is claim scope related to whether a patent gets cited (it’s the entire specification that is the target of the cite)?

2. The litigation data from the PTO is something ChatGPT and I have enhanced with CPC data, the claims text, and the name of the assignee (which ChatGPT has given me code to normalize). The original data has both the plaintiff and defendant names. At the very least, given we have a subset of patent claims with associated CPC codes etc., we could interleave these with the claims the current Python code is computing TF-IDF scores for. Mark the ones from the litigation dataset with “LITIG:” as a prefix to the patent number, perhaps.

2b. Come to think of it, something similar could be done with the smaller set of patent claims/CPC codes that we have from the PatentSBERTa train/test csvs, after filtering out the incomplete claim text. In some cases they’ll overlap with what we have in the big PTO csv, but we could prefix the numbers with “PatBERT:” or something, get the TF-IDF scores, and they would at least appear interleaved.

3. pgpub: I’ll come back to this; thinking more about your comments.

4. TF-IDF: yes, no semantics is a problem.

5. Yes, SentenceTransformers with PatentSBERTa is the way to go for this. You’ve already given me (in a different session) working code based on a smaller set of claims (the very train/test data used to build PatentSBERTa). I would simply want to tweak that code to use this big PTO data in addition.

6. Very insightful comments on [MASK], predictability, etc. For now, let me look into what I have for #1 citations. On #2 and #2b, tell me if you see any problem adding in more data using LITIG: and PATBERT: prefixes to the patent numbers.

ChatGPT
Thought for a second
No inherent problem, but a few considerations:

Technical Feasibility:

  • From a purely technical standpoint, prefixing patent identifiers to mark their source (e.g., LITIG: for litigated patents and PatBERT: for the PatentSBERTa subset) is easy. You just need to modify the code where you read each row. Whenever you recognize a patent as belonging to a special set (litigation set or PatentSBERTa set), you prepend the identifier before processing.
  • The TF-IDF pipeline doesn’t care about the patent number itself, just the tokens in the claims. The prefix only affects how you track and later interpret the data. TF-IDF and subsequent analysis steps remain unchanged since they’re focusing on claim text. The prefix just helps you identify which claim came from which subset after the computations.

Data Integration and Interpretation:

  • By mixing patents from the big PTO dataset, the litigation dataset, and the PatentSBERTa subset into one large run, you’ll have a unified TF-IDF distribution. This might help you compare broadness across these different “populations” of patents in a single reference frame.
  • The prefixes will allow you to easily filter and group results post-hoc. For example:
    • View only litigation patents and see where they rank in terms of TF-IDF-based claim broadness.
    • Compare PatBERT-subset patents against the larger PTO set to see if their claims differ in average TF-IDF terms.
  • Because TF-IDF scores depend on the entire corpus, mixing these sets means each set’s TF-IDF values now reflect rarity and frequency relative to all included claims, not just its original source. This can be a feature or a bug:
    • Feature: You get a uniform metric, allowing direct comparisons.
    • Bug: If one dataset is much larger and influences global token frequencies heavily, smaller sets might have skewed scores. Rare words in a small set might not be rare globally if the larger corpus has them frequently. Conversely, unique tokens in a small set might become very rare globally and inflate their TF-IDF standings.

Ensuring Clear Boundaries and Context:

  • The prefix is a simple label, but be sure to maintain a clear mapping and documentation. It’s easy to forget what LITIG: or PatBERT: means a few weeks down the road. A separate metadata dictionary or a small schema documenting these tags would be helpful.
  • If you do integrate all datasets at once, consider also storing a separate column (e.g., source) that records which dataset a claim came from. That way, if you decide to process or analyze results downstream, you can filter or pivot more easily than relying solely on a prefix in the patent number.
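
A minimal sketch of both suggestions together, tagging each row with a prefixed display ID and a separate source column as it is read. All file, column, and set names here are placeholders for whatever the real pipeline uses.

```python
# Sketch only: mark each claim's source both as a prefix on the patent number
# and as a separate "source" column. All names are hypothetical.
import pandas as pd

litig_ids = set(pd.read_csv("litigation.csv", dtype=str)["patent_id"])
patbert_ids = set(pd.read_csv("patentsberta_claims.csv", dtype=str)["patent_id"])

def tag_source(patent_id):
    """Return (display_id, source) for one patent number."""
    if patent_id in litig_ids:
        return "LITIG:" + patent_id, "litigation"
    if patent_id in patbert_ids:
        return "PatBERT:" + patent_id, "patentsberta"
    return patent_id, "pto_bulk"

claims = pd.read_csv("patent_claims_fulltext.csv", nrows=100_000, dtype={"patent_id": str})
claims["display_id"], claims["source"] = zip(*claims["patent_id"].map(tag_source))
```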

In Summary:

  • There’s no technical obstacle to adding these prefixes. It’s a straightforward and transparent way to track the source dataset for each claim.
  • The main thing to keep in mind is how mixing multiple sets into one TF-IDF calculation affects the interpretability of results. As long as you understand that the TF-IDF scores now reflect a comparison across all included datasets, and you’re comfortable with that, it’s a perfectly fine approach.

Andrew Schulman
Darn, good point about the TF-IDF scores depending on the entire corpus. Especially a problem if there is overlap in patents covered by 2 or even all 3 of the datasets. Let me think more on that.

ChatGPT
Thought for a second
Take your time to consider how mixing datasets affects the TF-IDF scores. A few additional thoughts that might help:
  • Separate Runs:
    You could run the TF-IDF process separately on each dataset (or subset) and then compare results (see the sketch after this list). While this means you won’t have a single unified metric directly, you can at least see how each dataset’s intrinsic token distributions differ. If needed, you can attempt to align or normalize the results afterward, though that adds complexity.
  • Shared Vocabulary and Alignment:
    If you must combine the datasets, you might consider computing TF-IDF on the union of datasets and then tagging which tokens originate primarily from which dataset. Another approach is to compute TF-IDF separately for each dataset and then find a way to align or normalize them, for instance by comparing relative ranks of claims rather than absolute TF-IDF values.
  • Sampling Strategies:
    To reduce the skew that a large dataset might introduce, you could sample an equal number of patents from each dataset before computing TF-IDF. This ensures that no single source dominates the global frequency counts. Of course, sampling means you lose some data, but it can lead to a more balanced comparison.
  • Token-Level Normalization:
    If some tokens are extremely common in one dataset and rare in another, consider a scheme that weights tokens by their frequency within each dataset before merging. This is more complicated, but it might give you a balanced picture.
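
One way to make the separate-runs and rank-alignment ideas above concrete, assuming each run writes its own scores file (names hypothetical): compare claims by percentile rank within their own corpus rather than by raw TF-IDF values.

```python
# Sketch only: separate TF-IDF runs per dataset, compared via within-corpus
# percentile ranks instead of raw scores.
import pandas as pd

def add_percentile_rank(df, score_col="tfidf_score"):
    out = df.copy()
    out["pct_rank"] = out[score_col].rank(pct=True)   # 0..1 within this dataset only
    return out

pto = add_percentile_rank(pd.read_csv("scores_pto.csv"))
litig = add_percentile_rank(pd.read_csv("scores_litigation.csv"))

combined = pd.concat([pto.assign(source="pto"), litig.assign(source="litigation")])
# Claims from different runs are now comparable on pct_rank without a shared corpus.
print(combined.sort_values("pct_rank", ascending=False).head())
```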

Ultimately, the best approach depends on your goals:

  • If you want a universal metric of broadness that’s stable across sources, you may need careful normalization or separate analyses.
  • If you’re comfortable with a single combined TF-IDF measure, just remember that it’s influenced by the entire mixture of claims and can’t directly tell you how a claim’s broadness compares to what it would have been measured against only its original dataset.

It’s definitely worth thinking through the implications before committing to a combined corpus approach.

….