I am very new to Python, and I am trying to break some legal documents into sections for export into SQL. I need to do two things:
- Define the section numbers by the table of contents, and
- Break up the document given the defined section numbers
The table of contents lists section numbers: 1.1, 1.2, 1.3, etc.
Then the document itself is broken up by those section numbers:
1.1 "...Text...",
1.2 "...Text...",
1.3 "...Text...", etc.
This is similar to the chapters of a book, except that the sections are delimited by ascending decimal numbers.
I have the document parsed using Tika, and I've been able to create a list of sections with some basic regex:
import re
from tika import parser

# Extract the raw text of the PDF with Tika
parsed = parser.from_file('test.pdf')
content = parsed["content"]

# Collect every section number such as 1.1, 1.2, 12.10
headers = re.findall(r"\d+\.\d+", content)
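A pattern that matches any decimal number will also pick up numbers inside the body text (a percentage, for example). One possible refinement, as a rough sketch: only accept a section number at the start of a line. This assumes each section number begins its own line in the Tika output, which may not hold for every PDF; the sample text below is made up.

```python
import re

# Hypothetical sample standing in for parsed["content"]
sample = "1.1 Definitions\nThe fee is 2.5 percent.\n1.2 Term\nThis lasts 10.5 years.\n"

# Match a section number only at the start of a line, followed by whitespace
header_re = re.compile(r"^(\d+\.\d+)\s", re.MULTILINE)

headers = header_re.findall(sample)
print(headers)
```

With this sample, the decimals in the body text (2.5 and 10.5) are skipped because they do not start a line.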
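Now I need to do something like this (the first line is pseudocode for the step I can't figure out):
splitsections = content.split() by headers  # pseudocode: split content at each section number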
var_string = ', '.join('?' * len(splitsections))
query_string = 'INSERT INTO table VALUES (%s);' % var_string
cursor.execute(query_string, splitsections)
Sorry if all this is unclear. Still very new to this.
Any help you can provide would be most appreciated.
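For the split-and-insert step described above, one possible approach is `re.split` with a capturing group, which keeps the section numbers in the result so they can be paired with their text. This is only a sketch under several assumptions: section numbers start a line, the database is SQLite via the standard `sqlite3` module, and the table name `sections` and its columns `number` and `body` are invented for illustration. It also stores one row per section rather than one row with a column per section.

```python
import re
import sqlite3

# Hypothetical sample standing in for parsed["content"]
content = (
    "Contract\n"
    "1.1 Definitions apply throughout.\n"
    "1.2 The term is five years.\n"
    "1.3 Either party may terminate.\n"
)

# The capturing group makes re.split keep the section numbers in the result:
# [preamble, '1.1', text, '1.2', text, ...]
parts = re.split(r"^(\d+\.\d+)\s+", content, flags=re.MULTILINE)

# Skip the preamble, then pair each number with the text that follows it
numbers = parts[1::2]
bodies = [body.strip() for body in parts[2::2]]
rows = list(zip(numbers, bodies))

# In-memory SQLite database; table and column names are made up
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sections (number TEXT, body TEXT)")
conn.executemany("INSERT INTO sections (number, body) VALUES (?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT number, body FROM sections ORDER BY number").fetchall()
print(stored)
```

Other database drivers may use a different placeholder style than `?`, so the `INSERT` statement would need adjusting for them.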